SerialReads

Operational & Query-Planning Considerations (System Design Deep Dive)

Jun 08, 2025

Replication & Quorum Math

In distributed databases, data is often replicated to multiple nodes for durability and availability. Leader-follower (primary/secondary) replication designates one node as the leader for writes and propagates changes to follower replicas. Consensus algorithms like Raft or Paxos coordinate this process by electing a leader and requiring a majority of nodes to acknowledge each log entry, achieving strong consistency at the expense of some availability. Other systems use leaderless replication with tunable quorums: a write must be confirmed by W nodes and a read will query R nodes such that W + R > N (for N total replicas). This ensures every read overlaps with at least one node that has the latest write, guaranteeing consistency if quorums are chosen appropriately (often W and R are majorities). For example, in Amazon Dynamo-style systems, you might choose W=2 and R=2 out of N=3 so that reads and writes intersect on at least one node.
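To make the quorum arithmetic concrete, here is a minimal sketch; the function name and the specific N/W/R values are illustrative:

```python
# A read quorum and a write quorum are guaranteed to intersect whenever W + R > N.
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum shares at least one replica with every write quorum."""
    return w + r > n

# Dynamo-style example from above: N=3 replicas, W=2 write acks, R=2 read acks.
assert quorums_overlap(n=3, w=2, r=2)      # 2 + 2 > 3: every read sees the latest write
assert not quorums_overlap(n=3, w=1, r=1)  # 1 + 1 <= 3: a read may miss the write entirely
assert quorums_overlap(n=5, w=3, r=3)      # majority quorums always overlap
```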

Replication strategies yield different consistency levels. Strong consistency means after a successful write, any subsequent read (to any replica) will return that value – this typically requires synchronous replication or a read to contact a quorum/leader. Timeline consistency is a slightly weaker level often seen with async primary-secondary setups: replicas apply updates in the same order as the primary, so each replica is consistent with some prefix of the timeline of updates. You won’t see out-of-order data, but a replica might be a few seconds behind (e.g. not reflecting the most recent commit). If the primary fails, some latest writes may not have made it to followers (hence a new leader might come up slightly behind, losing those transactions). Eventual consistency is weaker still – replicas update at their own pace with no ordering guarantees, so a read might temporarily see an older state even if a newer write was done elsewhere, or might see writes out of order. In an eventually-consistent store, if no new updates occur, all replicas will converge to the same state, but during update propagation you can see anomalies (e.g. reading a newer message but missing an older one). Many NoSQL databases allow clients to tune consistency per-request (trading off latency vs. freshness), choosing strong (read from leader or majority) vs. eventual (read from any replica) or options in between.
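DynamoDB is one example of per-request tuning: each read can be eventually consistent (the default) or strongly consistent. A small boto3 sketch, assuming a hypothetical users table with partition key user_id:

```python
import boto3

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

# Default read: eventually consistent -- cheaper and faster, but may briefly return stale data.
maybe_stale = table.get_item(Key={"user_id": "42"})

# Strongly consistent read: reflects all writes acknowledged before the read,
# at the cost of higher latency and roughly double the read capacity consumed.
fresh = table.get_item(Key={"user_id": "42"}, ConsistentRead=True)
```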

Secondary-Index Lag & Read-After-Write Gaps

Secondary indexes (alternative query paths on data) often have consistency gaps relative to the primary data. If index updates are applied asynchronously, there’s a window where a recently written item is visible via the primary key but not via the index. For example, in DynamoDB, a Local Secondary Index (LSI) is updated atomically with the base table and thus supports consistent reads, whereas a Global Secondary Index (GSI) is updated asynchronously. As a result, GSI queries are always eventually consistent – immediately after a write, a query on the GSI may not return the item until the index update completes. During this lag, a read-after-write to the index can fail to find the item that was just written to the main table. In fact, GSI reads can even be non-monotonic: you might read an item’s new value on one query, then see its old value on a subsequent query shortly after, due to propagation delays.

There are several strategies to mitigate such inconsistencies: retry the indexed read after a short delay, fall back to a strongly consistent read on the base table (or an LSI), or make the index update part of the write path itself.

In DynamoDB, one best practice for strong read-after-write is to design your data model to avoid GSIs on attributes that require freshness (or use an LSI or the base table for those queries). In general, truly eliminating read-after-write gaps on a secondary index requires making the index update part of the write transaction. Otherwise, architects must handle the temporary inconsistency in application logic (for example, by retrying as above) or tolerate the brief inconsistency window.
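When a query must go through a GSI right after a write, one pragmatic mitigation is a bounded retry with backoff before falling back to the base table. A rough sketch, assuming a hypothetical orders table with an email-index GSI:

```python
import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

def query_gsi_with_retry(email: str, attempts: int = 5, base_delay: float = 0.1) -> list:
    """Query the (hypothetical) email-index GSI, retrying briefly to ride out index lag."""
    for attempt in range(attempts):
        resp = table.query(
            IndexName="email-index",
            KeyConditionExpression=Key("email").eq(email),
        )
        if resp["Items"]:
            return resp["Items"]
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff while the index catches up
    return []  # still missing: fall back to the base table or treat the item as not-yet-visible
```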

Query Planner Basics

Modern databases include a query optimizer that decides how to execute SQL queries efficiently. Most use a cost-based optimizer that consults a statistics catalog to estimate the cost of different plans. The optimizer has many choices: which index to use (if any), in what order to join multiple tables, which join algorithm to use (nested loop, hash join, etc.), how to employ parallel execution, and so on. Using table statistics (number of rows, value distributions, etc.), the optimizer assigns an estimated cost to each plan and picks the lowest-cost plan. Older databases (and some simpler query engines) used rule-based optimizers, which rely on fixed heuristics rather than numeric cost. For example, a rule-based planner might always choose an index scan over a full table scan or always join tables in the order written. These fixed rules can misbehave if, say, the index is on a very low-selectivity column. Cost-based planners are much more adaptive, though they rely on up-to-date stats to be effective.

One crucial decision is join order. Joining tables in the wrong order can lead to an explosion in intermediate rows. The optimizer’s job is to find an order (and method) that minimizes work – e.g. filter early, avoid Cartesian products, use indexes to avoid scanning huge tables unnecessarily. The planner also considers whether to use an index or do a sequential scan; if a filter is not very selective, a full scan might actually be cheaper than many index lookups.

To understand and debug these decisions, developers use EXPLAIN. The EXPLAIN command (in SQL databases) shows the chosen execution plan and the estimated cost for each step. EXPLAIN ANALYZE goes further by actually running the query and reporting the actual time and rows for each step. This is incredibly useful to see if the optimizer’s estimates were accurate and which step is the bottleneck. For instance, you might discover the optimizer expected a filter to match 100 rows but it actually matched 100k (due to outdated stats), leading to a suboptimal plan.
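For example, on PostgreSQL you can run EXPLAIN ANALYZE through any client and compare estimated versus actual row counts; a sketch using psycopg2 with a hypothetical events table and connection string:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn.cursor() as cur:
    # EXPLAIN alone shows the chosen plan with cost/row estimates;
    # EXPLAIN ANALYZE also executes the query and reports actual rows and timings.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = %s", (42,))
    for (line,) in cur.fetchall():
        print(line)

# On each plan node, compare "rows=<estimate>" with "actual ... rows=<real>":
# a large mismatch usually points to stale statistics (run ANALYZE) or a predicate
# the planner cannot estimate well.
```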

Several anti-patterns can prevent the optimizer from doing its job well. A classic one is an implicit type cast in a query. If you have an indexed column user_id stored as a string and your query accidentally compares it to an integer, the DB will have to cast every user_id to integer (or vice-versa) at runtime, which typically means it can’t use the index and does a full scan. Always ensure the data types match, or perform explicit casts in the query so the planner knows about them in advance. Another anti-pattern is wrapping an indexed column in a function in the WHERE clause (without a matching functional index). For example, WHERE YEAR(timestamp) = 2023 will not use an index on timestamp because the planner sees the function and gives up on using the raw index. It will resort to scanning all rows and computing YEAR() for each. The solution is either to create a functional index on YEAR(timestamp) or to rewrite the query as a range filter, e.g. WHERE timestamp >= '2023-01-01' AND timestamp < '2024-01-01', which can use a normal index on the timestamp column. In summary, query planning works best when the SQL is sargable (Search ARGument-able): simple predicates on bare columns (not buried inside functions or casts), and when the database’s statistics are current.

Online Schema Change

Altering a large table’s schema in production (adding columns, changing column types, adding an index, etc.) can be very tricky to do without downtime. A naive ALTER TABLE may lock the table for minutes or hours, during which reads/writes queue up. It can also wreak havoc with replication – the schema change has to replay on replicas, often causing replication lag or even stopping replication if not handled carefully. Modern databases and tools have developed techniques for online schema changes to mitigate these issues.

One approach is to use non-blocking, in-place DDL when the database supports it. For example, recent MySQL and PostgreSQL versions can add a column or create an index without a full table lock in many cases. These use tricks like building an index in the background. However, not all operations are supported online, and even online operations can put extra load on the server (copying data, rebuilding indexes), which may slow down queries and cause replicas to lag behind.
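As a rough illustration of what such online DDL looks like (table, column, and index names are hypothetical, and support depends on the engine version and the specific operation):

```python
# MySQL/InnoDB: request an in-place, non-locking ALTER and fail fast if it isn't possible.
mysql_online_add_column = (
    "ALTER TABLE orders ADD COLUMN note VARCHAR(255), "
    "ALGORITHM=INPLACE, LOCK=NONE"
)

# PostgreSQL: build an index without blocking writes (slower, needs extra table passes).
postgres_online_index = (
    "CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at)"
)
```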

A widely used strategy is the ghost table method employed by tools like Percona’s pt-online-schema-change and GitHub’s gh-ost. In this method, you create a shadow copy of the table with the new schema (for instance, the same data but with an extra column or new index). Then you copy data from the original table to the ghost table in small chunks – not locking the whole table at once. During this copy, any new writes to the original table are captured: pt-online-schema-change sets up triggers on the original table to record any INSERT/UPDATE/DELETE and apply the same change to the ghost table. gh-ost, alternatively, reads the database’s binary log to capture those changes instead of using triggers (which imposes less load on the DB). Once all existing data is backfilled into the ghost table (copying in batches to control load and keep replication happy), you do a quick swap: briefly lock the original table, apply any final queued changes from the log, and rename tables – the new table takes the place of the old. This final swap is very fast (a few seconds or less), so from the application’s perspective, the schema change occurs with minimal downtime.
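The backfill loop is easier to see in a stripped-down sketch. This is not what pt-online-schema-change or gh-ost literally run; it is a minimal illustration of chunked copying with lag-based throttling, assuming an integer primary key id, a hypothetical orders table with a ghost table _orders_new, and a psycopg2-style connection (the trigger/binlog change capture and the final rename are omitted):

```python
import time

CHUNK = 1000          # rows copied per batch: small batches keep row locks short
MAX_LAG_SECONDS = 5   # pause copying when replicas fall too far behind

def replica_lag_seconds(conn) -> float:
    return 0.0        # stub: in practice, read lag from monitoring or from the replica itself

def backfill(conn) -> None:
    last_id = 0
    while True:
        with conn.cursor() as cur:
            # Copy the next id-ordered chunk of existing rows into the ghost table.
            cur.execute(
                "INSERT INTO _orders_new (id, customer_id, total) "
                "SELECT id, customer_id, total FROM orders "
                "WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, CHUNK),
            )
            if cur.rowcount == 0:
                break                      # all existing rows backfilled: ready for cut-over
            cur.execute("SELECT max(id) FROM _orders_new")
            last_id = cur.fetchone()[0]
        conn.commit()
        while replica_lag_seconds(conn) > MAX_LAG_SECONDS:
            time.sleep(1)                  # throttle until replicas catch up
```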

The ghost table method incurs overhead: it uses additional disk space (temporarily two copies of the data) and extra writes (from the triggers or log playback). But it avoids long locks and keeps replication lag low by throttling the copy rate. In production, you monitor replica lag, and the tool pauses copying if lag grows too large, then resumes once the replica catches up. The gh-ost tool even allows pausing and resuming the migration without downtime.

For example, if you need to add an index to a 100-million-row table, pt-online-schema-change or gh-ost lets you do it while the database continues to serve traffic. The index is built on the ghost table in the background. Only at the end does a brief lock occur to ensure no updates are lost. Companies like Facebook and GitHub have used these tools to add columns or indexes online on huge MySQL tables.

Lastly, zero-downtime schema change requires careful testing. It’s advisable to test the procedure on a staging environment, ensure the triggers or binlog playback correctly mirror all changes, and have a rollback plan (e.g. be ready to switch back to the old table if something goes wrong). Also, during the long copy, the new schema changes aren’t live yet – so sometimes application changes need to be deployed that can work with both the old and new schema until cutover. Despite the complexity, these techniques have become standard to update production databases without stopping the application.

Observability Signals

Running a database in production, one must closely watch certain metrics and logs to identify performance issues or bottlenecks. Key signals include query latency percentiles (p50/p99), the slow-query log, buffer/cache hit ratios, lock waits and contention, wait events, and resource utilization (CPU, I/O).

All these signals tie together. An increase in full scans or a drop in cache hit ratio may cause higher p99 latency. Or high lock contention might show up as certain queries in the slow log waiting on locks. Modern monitoring tools and cloud platforms help correlate these. For example, AWS RDS Performance Insights provides a dashboard of database load and you can break it down by wait events or by SQL query to see what is consuming time. It will show if the load is due to CPU, I/O waits, locks, etc., and the top queries contributing to that load. Using such tools, an operator might discover that “Our p99 latency is high because of buffer pool misses leading to disk reads – we need more memory” or “Lock waits on table X are causing slow queries – we might consider sharding that table or tuning the workload.”
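On PostgreSQL, for instance, much of this is available in the statistics views; a sketch (the connection string is hypothetical, the pg_stat_statements extension must be enabled, and its timing column is total_exec_time on version 13+, total_time on older versions):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn.cursor() as cur:
    # Buffer cache hit ratio: a sustained drop often precedes a p99 latency spike.
    cur.execute(
        "SELECT sum(heap_blks_hit)::float / "
        "nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) "
        "FROM pg_statio_user_tables"
    )
    print("cache hit ratio:", cur.fetchone()[0])

    # Top queries by cumulative execution time (requires pg_stat_statements).
    cur.execute(
        "SELECT calls, round(total_exec_time) AS total_ms, left(query, 60) "
        "FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5"
    )
    for row in cur.fetchall():
        print(row)
```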

In summary, robust observability of a database involves capturing metrics at multiple levels (query latency, resource utilization, wait events) and using logs (like slow queries) to zoom in on offending queries. This allows DBAs to be proactive in optimization and quickly pinpoint the cause of performance issues.

Adaptive Capacity & Auto-Scaling

As workloads evolve, databases should adapt to avoid performance bottlenecks. Techniques for adaptive capacity and auto-scaling range from target-tracking throughput auto-scaling and adaptive capacity for hot partitions to automatic partition splitting and adding nodes as data grows.

In practice, adaptive capacity and auto-scaling significantly reduce the need for operators to manually reconfigure the system in response to workload changes. It improves resilience – the database can handle an unexpected spike on one key by adding capacity or splitting, rather than failing or slowing down. A well-known example is Amazon DynamoDB’s auto-scaling and adaptive capacity which make it “serverless” in the sense that it automatically adjusts to your usage. Another is Google Spanner, which can add servers to a spanserver pool when data grows. The key for designers is to ensure these adjustments are safe (no data loss, minimal performance impact during re-partitioning) and fast enough to meet the workload demands. Monitoring is still important – e.g. watch for any throttling events or increasing latency, which might indicate the workload is hitting some limit of the adaptive system (like hitting an account-level throughput limit or a maximum partition size where manual intervention is needed).
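As one concrete instance of throughput auto-scaling, a provisioned-mode DynamoDB table can be put under target tracking through the Application Auto Scaling API; a boto3 sketch with a hypothetical orders table and illustrative capacity limits:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target (table name is hypothetical).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

# Target-tracking policy: keep consumed read capacity near 70% of what is provisioned.
autoscaling.put_scaling_policy(
    PolicyName="orders-read-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)
```

Write capacity gets its own scalable target and policy; on-demand (pay-per-request) tables skip this configuration entirely and scale automatically.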

Backup, PITR & Disaster Drills

Effective backup and recovery planning can be the difference between a minor incident and a major data loss. Several concepts are important here: full and incremental backups, point-in-time recovery (PITR) via continuous log archiving, offsite (and offline) copies, and regular restore drills that measure how long recovery actually takes.

In summary, backup strategy is about data safety (multiple copies, including offline copies to protect against ransomware or cloud account issues) and restore strategy is about time to recovery. Both need equal attention. Regularly verify that backups can actually be used to spin up a working database, and that your team knows the steps under pressure. As the saying goes, “Your recovery plan is only as good as your last test.”
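A disaster drill can be largely scripted. For example, with Amazon RDS you can restore a point-in-time copy next to production, time the restore, and verify the data before tearing the copy down; a boto3 sketch with hypothetical instance identifiers and restore time:

```python
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

# Restore a point-in-time copy alongside production. The wall-clock time this takes is the
# measured recovery time for this scenario; the chosen RestoreTime reflects how much data
# you can afford to lose.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="app-primary",        # hypothetical source instance
    TargetDBInstanceIdentifier="app-pitr-drill",     # throwaway drill instance
    RestoreTime=datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc),
    # UseLatestRestorableTime=True,                  # alternative: newest recoverable point
)
```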

Failover Choreography

When a primary database instance fails (or is presumed dead) and a standby needs to take over, the sequence of steps must be carefully managed to avoid data inconsistency. The key considerations in failover choreography are detection (heartbeats and timeouts), election/promotion of a new primary, and preventing split-brain (fencing the old primary so it cannot keep accepting writes).

Once a failover completes (new primary is live), clients need to be redirected. Solutions include using a floating IP or DNS name that points to the primary – and updating it – or using a proxy that routes to the current primary. There’s often a slight outage while this happens (connections break and need to reconnect to the new node).
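Put together, the choreography looks roughly like the sketch below; every function is a stand-in for whatever your cluster manager (Patroni, Pacemaker, a cloud control plane, etc.) actually does, and the thresholds are illustrative:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before the primary is suspected dead
QUORUM = 2                # of 3 observers must agree, to avoid acting on one node's view

# The functions below are stubs standing in for the cluster manager's real integrations.
def seconds_since_last_heartbeat() -> float:
    return 0.0            # read from the health-check / consensus layer

def observers_reporting_primary_down() -> int:
    return 0              # how many independent observers also see the primary as down

def fence_old_primary() -> None:
    pass                  # STONITH: power off, revoke storage access, or block its writes

def promote_standby() -> None:
    pass                  # replay any remaining log, then open the standby for writes

def repoint_clients() -> None:
    pass                  # move the floating IP / DNS record / proxy target

def failover_loop() -> None:
    while True:
        if (seconds_since_last_heartbeat() > HEARTBEAT_TIMEOUT
                and observers_reporting_primary_down() >= QUORUM):
            fence_old_primary()   # safety first: the old primary must not take writes (split-brain)
            promote_standby()
            repoint_clients()
            return
        time.sleep(1)
```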

Failover drills are as important as backup drills. You don’t want the first time you try a failover to be in production during an incident. Regularly test failover procedures in staging or even prod (some setups do planned failovers for maintenance – which is a good opportunity to see if your apps seamlessly handle it). In cloud-managed databases (Aurora, RDS, etc.), failover is usually automated – but you should still test how the application responds (does it time out for a few seconds and then recover? Does your connection pooling layer properly replay transactions or at least throw retriable errors?). For self-managed clusters, ensure the failover script or service (Pacemaker, etc.) is tuned – e.g. not too sensitive, and that fencing truly works in your environment.

In summary, failover choreography is about speed (minimizing downtime) but also safety (avoiding split-brain and data divergence). Through heartbeats, consensus/quorums, and fencing, a new primary can take over reliably. And always remember to bring the old primary (when it comes back) into sync before it becomes a replica – usually by discarding any un-applied transactions it had and catching it up with the new primary’s state (which cluster frameworks handle by letting the old node rejoin as a follower).

Capacity Planning

Capacity planning for a database ensures that the system has enough headroom to handle current and anticipated loads with acceptable performance. Key factors include working-set and data sizing, traffic patterns, IOPS/throughput provisioning, connection pooling and caching, and forward-looking scaling plans.
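A rough worked example of this sizing arithmetic, where every input is an assumption to be replaced with your own measurements:

```python
# Back-of-the-envelope sizing for a read-heavy workload; all inputs are assumptions.
row_bytes       = 500                 # average row size, including index overhead
rows            = 200_000_000         # current row count
hot_fraction    = 0.20                # share of data touched frequently (the working set)
reads_per_sec   = 5_000
writes_per_sec  = 500
cache_hit_ratio = 0.95                # expected buffer/cache hit rate for hot data
annual_growth   = 1.5                 # 50% data growth per year

data_gb        = rows * row_bytes / 1e9
working_set_gb = data_gb * hot_fraction                  # RAM you would like to fit
read_iops      = reads_per_sec * (1 - cache_hit_ratio)   # reads that actually hit disk
write_iops     = writes_per_sec * 2                      # rough x2 for log/redo plus data pages

print(f"data size        ~{data_gb:,.0f} GB (x{annual_growth} in a year)")
print(f"working set      ~{working_set_gb:,.0f} GB of RAM")
print(f"disk IOPS needed ~{read_iops + write_iops:,.0f} (plus headroom, e.g. 2x)")
```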

By addressing all these factors – working set sizing, IOPS/throughput provisioning, pooling, caching, and forward-looking scaling plans – you can ensure your database continues to meet SLAs as the system grows. Capacity planning is as much an art as a science: you use measurements and estimations, but also keep some safety margin and be prepared to adjust when reality diverges from predictions. The aim is to never be caught flat-footed by a sudden increase in load or data volume.

system-design