Operational & Query-Planning Considerations (System Design Deep Dive)
Jun 08, 2025
Replication & Quorum Math
In distributed databases, data is often replicated to multiple nodes for durability and availability. Leader-follower (primary/secondary) replication designates one node as the leader for writes and propagates changes to follower replicas. Consensus algorithms like Raft or Paxos coordinate this process by electing a leader and requiring a majority of nodes to acknowledge each log entry, achieving strong consistency at the expense of some availability. Other systems use leaderless replication with tunable quorums: a write must be confirmed by W nodes and a read will query R nodes such that W + R > N (for N total replicas). This ensures every read overlaps with at least one node that has the latest write, guaranteeing consistency if quorums are chosen appropriately (often W and R are majorities). For example, in Amazon Dynamo-style systems, you might choose W=2 and R=2 out of N=3 so that reads and writes intersect on at least one node.
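The overlap condition is easy to sanity-check in code. Below is a minimal sketch in plain Python (the N/W/R values are the hypothetical ones from the example above):

```python
def read_write_quorums_overlap(n: int, w: int, r: int) -> bool:
    """Every read quorum intersects every write quorum when W + R > N."""
    return w + r > n

def write_quorums_overlap(n: int, w: int) -> bool:
    """Two concurrent writes always share at least one node when 2W > N."""
    return 2 * w > n

# Dynamo-style example from the text: N=3 replicas, W=2 acks per write, R=2 reads.
assert read_write_quorums_overlap(3, 2, 2)       # reads see the latest acknowledged write
assert write_quorums_overlap(3, 2)               # concurrent writes overlap on a node
assert not read_write_quorums_overlap(3, 1, 1)   # W=1, R=1 can return stale data
```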
Replication strategies yield different consistency levels. Strong consistency means after a successful write, any subsequent read (to any replica) will return that value – this typically requires synchronous replication or a read to contact a quorum/leader. Timeline consistency is a slightly weaker level often seen with async primary-secondary setups: replicas apply updates in the same order as the primary, so each replica is consistent with some prefix of the timeline of updates. You won’t see out-of-order data, but a replica might be a few seconds behind (e.g. not reflecting the most recent commit). If the primary fails, some latest writes may not have made it to followers (hence a new leader might come up slightly behind, losing those transactions). Eventual consistency is weaker still – replicas update at their own pace with no ordering guarantees, so a read might temporarily see an older state even if a newer write was done elsewhere, or might see writes out of order. In an eventually-consistent store, if no new updates occur, all replicas will converge to the same state, but during update propagation you can see anomalies (e.g. reading a newer message but missing an older one). Many NoSQL databases allow clients to tune consistency per-request (trading off latency vs. freshness), choosing strong (read from leader or majority) vs. eventual (read from any replica) or options in between.
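For a concrete (if simplified) look at per-request tuning, here is a sketch using the Python Cassandra driver; the contact point, keyspace, and table are assumptions, not a prescribed schema:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical cluster and schema: keyspace "app", table accounts(id, balance).
session = Cluster(["10.0.0.1"]).connect("app")

# Stronger read: a majority of replicas must answer (pairs with QUORUM writes).
strong = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
# Fast, eventually consistent read: any single replica may answer, possibly stale.
fast = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

print(session.execute(strong, [42]).one())
print(session.execute(fast, [42]).one())
```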
Secondary-Index Lag & Read-After-Write Gaps
Secondary indexes (alternative query paths on data) often have consistency gaps relative to the primary data. If index updates are applied asynchronously, there’s a window where a recently written item is visible via the primary key but not via the index. For example, in DynamoDB, a Local Secondary Index (LSI) is updated atomically with the base table and thus supports consistent reads, whereas a Global Secondary Index (GSI) is updated asynchronously. As a result, GSI queries are always eventually consistent – immediately after a write, a query on the GSI may not return the item until the index update completes. During this lag, a read-after-write to the index can fail to find the item that was just written to the main table. In fact, GSI reads can even be non-monotonic: you might read an item’s new value on one query, then see its old value on a subsequent query shortly after, due to propagation delays.
There are several strategies to mitigate such inconsistencies:
- Write-through indexing: update secondary indexes in the same transaction as the primary write. This is how many relational databases maintain indexes (ensuring no lag). The trade-off is slower writes. Some NoSQL systems avoid this to keep writes fast, accepting eventual consistency on indexes.
- Double-read (or read repair): on a critical read, if data isn't found via the index, fall back to reading the primary store directly (or read from another index). This ensures read-your-writes at the cost of an extra read. For instance, an application might first query a cache or index, and on a miss, query the source of truth – thereby not missing freshly written data (see the sketch at the end of this section).
- Change streams: use a reliable event stream of changes (e.g. database binlog or Kafka pipeline) to update indexes and caches as quickly as possible. This ensures updates apply in order. It reduces the lag window, but consumers of the stream still see data eventually. Some systems provide outbox or change data capture mechanisms to let you monitor when asynchronous index updates have been applied.
In DynamoDB, one best practice for strong read-after-write is to design your data model to avoid GSIs on attributes that require freshness (or use an LSI or the base table for those queries). In general, truly eliminating read-after-write gaps on a secondary index requires making the index update part of the write transaction. Otherwise, architects must handle the temporary inconsistency in application logic (as above) or tolerate the brief inconsistency window.
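To illustrate the double-read fallback in a DynamoDB context, the sketch below queries a GSI first and falls back to a strongly consistent read on the base table. The table name, index name, and key attributes are assumptions for illustration only:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table "orders" keyed on order_id, with a GSI "customer-index" on customer_id.
table = boto3.resource("dynamodb").Table("orders")

def get_order_for_customer(customer_id: str, order_id: str):
    # 1) Cheap path: query the GSI (always eventually consistent).
    resp = table.query(
        IndexName="customer-index",
        KeyConditionExpression=Key("customer_id").eq(customer_id),
    )
    for item in resp["Items"]:
        if item.get("order_id") == order_id:
            return item
    # 2) Fallback: strongly consistent read against the base table, so a
    #    just-written item is still found during GSI propagation lag.
    resp = table.get_item(Key={"order_id": order_id}, ConsistentRead=True)
    return resp.get("Item")
```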
Query Planner Basics
Modern databases include a query optimizer that decides how to execute SQL queries efficiently. Most use a cost-based optimizer that consults a statistics catalog to estimate the cost of different plans. The optimizer has many choices: which index to use (if any), in what order to join multiple tables, which join algorithm to use (nested loop, hash join, etc.), how to employ parallel execution, and so on. Using table statistics (number of rows, value distributions, etc.), the optimizer assigns an estimated cost to each plan and picks the lowest-cost plan. Older databases (and some simpler query engines) used rule-based optimizers, which rely on fixed heuristics rather than numeric cost. For example, a rule-based planner might always choose an index scan over a full table scan or always join tables in the order written. These fixed rules can misbehave if, say, the index is on a very low-selectivity column. Cost-based planners are much more adaptive, though they rely on up-to-date stats to be effective.
One crucial decision is join order. Joining tables in the wrong order can lead to an explosion in intermediate rows. The optimizer’s job is to find an order (and method) that minimizes work – e.g. filter early, avoid Cartesian products, use indexes to avoid scanning huge tables unnecessarily. The planner also considers whether to use an index or do a sequential scan; if a filter is not very selective, a full scan might actually be cheaper than many index lookups.
To understand and debug these decisions, developers use EXPLAIN. The EXPLAIN command (in SQL databases) shows the chosen execution plan and the estimated cost for each step. EXPLAIN ANALYZE goes further by actually running the query and reporting the actual time and rows for each step. This is incredibly useful to see whether the optimizer's estimates were accurate and which step is the bottleneck. For instance, you might discover the optimizer expected a filter to match 100 rows but it actually matched 100k (due to outdated stats), leading to a suboptimal plan.
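A small sketch of comparing estimates to reality from Python with psycopg2; the connection string and query are placeholders, and note that EXPLAIN ANALYZE really executes the statement:

```python
import json
import psycopg2

# Placeholder DSN and query.
conn = psycopg2.connect("dbname=app user=app")
with conn.cursor() as cur:
    cur.execute(
        "EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE status = %s",
        ("shipped",),
    )
    plan = cur.fetchone()[0]
    if isinstance(plan, str):        # some driver/format combinations return JSON as text
        plan = json.loads(plan)
    root = plan[0]["Plan"]
    # Large gaps between estimated and actual rows usually point at stale statistics.
    print(root["Node Type"], "estimated:", root["Plan Rows"], "actual:", root["Actual Rows"])
conn.close()
```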
Several anti-patterns can prevent the optimizer from doing its job well. A classic one is an implicit type cast in a query. If you have an indexed column user_id stored as a string and your query accidentally compares it to an integer, the DB will have to cast every user_id to integer (or vice versa) at runtime, which typically means it can't use the index and does a full scan. Always ensure the data types match, or perform explicit casts in the query so the planner knows about them in advance. Another anti-pattern is wrapping an indexed column in a function in the WHERE clause (without a matching functional index). For example, WHERE YEAR(timestamp) = 2023 will not use an index on timestamp, because the planner sees the function and gives up on the raw index; it resorts to scanning all rows and computing YEAR() for each. The solution is to either create a functional index on YEAR(timestamp) or rewrite the query as a range filter, e.g. WHERE timestamp >= '2023-01-01' AND timestamp < '2024-01-01', which can use a normal index on timestamp. In summary, query planning works best when the SQL is sargable (Search ARGument-able): simple predicates on columns (not wrapped in functions or casts), and when the database's statistics are current.
Online Schema Change
Altering a large table's schema in production (adding columns, changing column types, adding an index, etc.) can be very tricky to do without downtime. A naive ALTER TABLE may lock the table for minutes or hours, during which reads and writes queue up. It can also wreak havoc with replication – the schema change has to replay on replicas, often causing replication lag or even stopping replication if not handled carefully. Modern databases and tools have developed techniques for online schema changes to mitigate these issues.
One approach is to use non-blocking, in-place DDL when the database supports it. For example, recent MySQL and PostgreSQL versions can add a column or create an index without a full table lock in many cases. These use tricks like building an index in the background. However, not all operations are supported online, and even online operations can put extra load on the server (copying data, rebuilding indexes), which may slow down queries and cause replicas to lag behind.
A widely used strategy is the ghost table method employed by tools like Percona’s pt-online-schema-change and GitHub’s gh-ost. In this method, you create a shadow copy of the table with the new schema (for instance, the same data but with an extra column or new index). Then you copy data from the original table to the ghost table in small chunks – not locking the whole table at once. During this copy, any new writes to the original table are captured: pt-online-schema-change sets up triggers on the original table to record any INSERT/UPDATE/DELETE and apply the same change to the ghost table. gh-ost, alternatively, reads the database’s binary log to capture those changes instead of using triggers (which imposes less load on the DB). Once all existing data is backfilled into the ghost table (copying in batches to control load and keep replication happy), you do a quick swap: briefly lock the original table, apply any final queued changes from the log, and rename tables – the new table takes the place of the old. This final swap is very fast (a few seconds or less), so from the application’s perspective, the schema change occurs with minimal downtime.
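To make the choreography concrete, here is a heavily simplified sketch of the ghost-table idea in Python against MySQL; the table and column names are hypothetical. It deliberately omits the part the real tools handle (capturing concurrent writes via triggers or the binlog and replaying them before the swap), so treat it as an outline of the copy-and-swap steps rather than a usable migration:

```python
import mysql.connector  # assumes mysql-connector-python; any DB-API driver would do

# Hypothetical table orders(id, customer_id, total); we add a coupon_code column.
conn = mysql.connector.connect(host="db", user="admin", password="...", database="shop")
cur = conn.cursor()

cur.execute("CREATE TABLE _orders_gho LIKE orders")
cur.execute("ALTER TABLE _orders_gho ADD COLUMN coupon_code VARCHAR(32) NULL")

# Backfill in small primary-key ranges so no single statement locks the whole table.
cur.execute("SELECT MIN(id), MAX(id) FROM orders")
lo, hi = cur.fetchone()  # assumes a non-empty table with an integer primary key
chunk = 10_000
for start in range(lo, hi + 1, chunk):
    cur.execute(
        "INSERT IGNORE INTO _orders_gho (id, customer_id, total) "
        "SELECT id, customer_id, total FROM orders WHERE id BETWEEN %s AND %s",
        (start, start + chunk - 1),
    )
    conn.commit()  # commit per chunk; real tools also throttle here on replica lag

# Atomic cut-over; production tools first drain changes captured by triggers/binlog.
cur.execute("RENAME TABLE orders TO _orders_old, _orders_gho TO orders")
conn.commit()
```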
The ghost table method incurs overhead: it uses additional disk space (temporarily two copies of the data) and extra writes (from the triggers or log playback). But it avoids long locks and keeps replication lag low by throttling the copy rate. In production, these tools monitor replica lag and pause copying if it grows too much, then resume once the replica catches up. gh-ost additionally allows pausing and resuming the migration itself without downtime.
For example, if you need to add an index on a 100 million row table, using pt-online-schema-change or gh-ost, you can do it while the database continues to serve traffic. The index is built on the ghost table in the background. Only at the end does a brief lock occur to ensure no lost updates. Companies like Facebook and GitHub have used these tools to add columns or indexes online on huge MySQL tables.
Lastly, zero-downtime schema change requires careful testing. It’s advisable to test the procedure on a staging environment, ensure the triggers or binlog playback correctly mirror all changes, and have a rollback plan (e.g. be ready to switch back to the old table if something goes wrong). Also, during the long copy, the new schema changes aren’t live yet – so sometimes application changes need to be deployed that can work with both the old and new schema until cutover. Despite the complexity, these techniques have become standard to update production databases without stopping the application.
Observability Signals
Running a database in production, one must closely watch certain metrics and logs to identify performance issues or bottlenecks. Key signals include:
- Scan vs. point-lookup ratio – This compares how often the database is doing full table scans or large index scans versus efficient point lookups (using an index or primary key). A high scan-to-point ratio often indicates missing indexes or unoptimized queries. Ideally, most queries should be using indexes (point lookups), especially in an OLTP system. If you see lots of full scans, you might enable the slow query log and find queries to optimize. (As a rule, adding appropriate indexes can eliminate expensive sequential scans on large tables.)
- p99 query latency – The 99th percentile latency (or 95th, depending on goals) is a critical measure of performance for user-facing applications. It captures the worst-case delays among queries. Even if average latency is 5 ms, if the 99th percentile is 200 ms, some users will experience slowness. Tracking p99 helps ensure the tail end of latency is under control. Spikes in p99 latency often correlate with issues like locking, I/O stalls, or GC pauses. It's important to investigate them: e.g. was there a large query or a batch job causing outlier latency?
- Lock contention – This measures how often transactions are waiting on locks. In a relational DB with ACID transactions, if many clients try to update the same rows, some will wait. High lock wait times or frequent deadlocks are signs of contention, which happens when multiple processes try to access the same data concurrently and at least one has to wait. You can monitor metrics like "lock wait % of time" or the number of waiting queries. If contention is high, solutions include lowering the isolation level (if acceptable), adding indexes (to reduce the set of locked rows), or re-architecting to reduce hot spots.
- Buffer-cache hit ratio – Databases keep recently accessed data in memory (buffer cache). The hit ratio is the fraction of reads that are served from memory vs. disk. A high hit ratio (e.g. 99%) means most reads avoid disk I/O, which is good. If the hit ratio starts dropping, it means the working set is larger than memory and the DB is paging data from disk more often, which hurts performance. DBAs watch this metric and may add more RAM or optimize indexes if the cache hit rate is too low. In cloud caches like DynamoDB DAX (a caching layer in front of DynamoDB), the cache hit ratio is similarly a key metric – you want as many reads as possible served from the in-memory cache. You can monitor DAX's performance metrics (e.g. ItemCacheHitRate) to ensure it's effectively caching results.
- Replication delay – For systems with read replicas or a primary-secondary setup, replication lag indicates how far behind the replicas are. This is crucial for read-after-write consistency on replicas and for failover. For example, MySQL's Seconds_Behind_Master or the replay_lag column in Postgres's pg_stat_replication view tells you the lag. If lag is increasing over time, the replica might be struggling (possibly due to applying a heavy write burst or a slow network). Too much lag means stale reads on the replica and longer failover time. Ideally, lag is just a few seconds or less. Cloud services like AWS RDS expose a ReplicaLag metric for read replicas – you'd alarm if this exceeds some threshold (e.g. 30s); a monitoring sketch follows this list.
- Slow query log – Most databases have the ability to log queries that exceed a certain execution time (for instance, log any query taking > 1 second). Reviewing the slow query log is one of the first steps in performance tuning. It tells you which queries or operations are the most time-consuming. Often a single bad query (missing an index or doing a huge sort) can impact the whole system. By monitoring this log, you can catch regressions when a new query suddenly starts appearing, and then optimize that query or add indexing. It's an indispensable observability tool for operations.
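As referenced in the replication-delay item above, here is a minimal monitoring sketch that pulls the ReplicaLag metric from CloudWatch with boto3; the instance identifier and the 30-second threshold are assumptions:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-replica-1"}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=10),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)
worst = max((p["Maximum"] for p in resp["Datapoints"]), default=0.0)
if worst > 30:
    print(f"ALERT: replica lag peaked at {worst:.0f}s in the last 10 minutes")
```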
All these signals tie together. An increase in full scans or a drop in cache hit ratio may cause higher p99 latency. Or high lock contention might show up as certain queries in the slow log waiting on locks. Modern monitoring tools and cloud platforms help correlate these. For example, AWS RDS Performance Insights provides a dashboard of database load and you can break it down by wait events or by SQL query to see what is consuming time. It will show if the load is due to CPU, I/O waits, locks, etc., and the top queries contributing to that load. Using such tools, an operator might discover that “Our p99 latency is high because of buffer pool misses leading to disk reads – we need more memory” or “Lock waits on table X are causing slow queries – we might consider sharding that table or tuning the workload.”
In summary, robust observability of a database involves capturing metrics at multiple levels (query latency, resource utilization, wait events) and using logs (like slow queries) to zoom in on offending queries. This allows DBAs to be proactive in optimization and quickly pinpoint the cause of performance issues.
Adaptive Capacity & Auto-Scaling
As workloads evolve, databases should adapt to avoid performance bottlenecks. A number of techniques exist for adaptive capacity and auto-scaling:
- Hotspot detection and splitting: In a sharded or partitioned database, an uneven load distribution can cause one shard to become a hotspot (handling a disproportionate amount of traffic). This could happen if, say, one specific customer or one key (like a popular hashtag) gets extremely popular and all requests go to the shard holding that key. The system should detect this (via metrics like per-shard QPS or CPU; a minimal detection sketch follows this list). Adaptive systems will then split the hot shard into two (or more) pieces, spreading the keys across shards. For instance, Bigtable/HBase will split a tablet when it grows past a certain size or load. Hash partitioning in Cassandra and MongoDB avoids some hot spots by randomizing distribution, but they can still face them if a single partition key is very hot. Automatic splitting ensures that no single node becomes the bottleneck.
- Adaptive request distribution: Similarly, some databases can temporarily route more resources to a hot partition. In DynamoDB, for example, each partition has a throughput budget (based on provisioning). If one partition (say partition #4) suddenly gets a surge of traffic beyond its limit, adaptive capacity will increase that partition's capacity so it can serve the load without throttling. Essentially, DynamoDB will borrow unused capacity from other partitions in the table to accommodate the hot partition. This happens transparently and instantaneously, allowing uneven workloads to be handled up to certain limits. Additionally, DynamoDB's adaptive behavior can rebalance items – if a particular record or small set of records is extremely hot, DynamoDB might move them to their own partition (so that they no longer share capacity with other items). This isolation means that one very hot key can get the full 3,000 RCUs / 1,000 WCUs (the max for a single partition) without affecting others.
- Auto-scaling compute instances: Many cloud databases offer automated vertical scaling. Aurora Serverless v2 is a prime example – instead of fixed instance sizes, it can scale CPU and memory up or down in fine-grained increments of 0.5 ACUs (Aurora Capacity Units) based on load. If your workload spikes, Aurora adds more resources within seconds, and it does so without causing downtime or connection interruption. It monitors utilization every second; when it sees that the current resources are insufficient (e.g. high CPU or queued IO), it can rapidly scale up, and scale-ups are often close to instantaneous. Conversely, when activity drops, it scales down to save cost – but carefully: Aurora Serverless steps down capacity gradually and waits to be sure load is truly down, mainly to avoid evicting too much cached data and causing performance jitter. This kind of auto-scaling is ideal for variable workloads (e.g. a dev/test environment or a spiky traffic pattern). Other databases may not scale the instance size automatically, but you can achieve a similar effect with orchestration – e.g. running on Kubernetes and using the Horizontal Pod Autoscaler to add DB pods, or scheduling manual scale-ups during known peak periods.
- Shard rebalancing and merging: When you add capacity (nodes) to a cluster, you need to rebalance partitions (move some data to the new node). Systems like Cassandra or Yugabyte use consistent hashing so that adding a node automatically causes some data to redistribute. In other systems, an operator might invoke a rebalance to spread out partitions. The opposite – removing nodes or consolidating – might involve merging shards. Some systems do this automatically when nodes leave, but often it's manual or requires careful planning to avoid overloading remaining nodes.
- Predictive scaling: Advanced auto-scaling might attempt to predict load spikes (say based on time-of-day or known events) and scale out in advance, so that capacity is ready by the time it's needed. AWS Auto Scaling for EC2, for instance, allows scheduled scaling or predictive scaling based on patterns. In databases, one must be cautious with predictive moves (like splitting shards proactively), as a wrong guess could cause unnecessary churn. But it's an area of interest – e.g., algorithms that see a trend of increasing load and trigger a split a bit earlier than the absolute threshold, to avoid any period of throttling.
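The hotspot-detection item above referenced a sketch; here is a minimal, self-contained sliding-window detector in Python. A real system would sample at the request router or storage node and act per shard, but the core idea of flagging keys that exceed a per-key budget is the same:

```python
import time
from collections import Counter, deque
from typing import Optional

class HotKeyDetector:
    """Flags partition keys whose request rate exceeds a per-key budget
    within a sliding time window (minimal sketch)."""

    def __init__(self, window_seconds: float = 1.0, max_qps_per_key: int = 1000):
        self.window = window_seconds
        self.budget = max_qps_per_key
        self.events = deque()      # (timestamp, key) pairs inside the window
        self.counts = Counter()    # live per-key counts for the window

    def record(self, key: str, now: Optional[float] = None) -> bool:
        """Record one request for `key`; return True if the key is now hot."""
        now = time.monotonic() if now is None else now
        self.events.append((now, key))
        self.counts[key] += 1
        # Expire events that fell out of the window.
        while self.events and now - self.events[0][0] > self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
        return self.counts[key] > self.budget

detector = HotKeyDetector(max_qps_per_key=1000)
if detector.record("user#42"):
    print("hot key: consider splitting the shard or isolating this key")
```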
In practice, adaptive capacity and auto-scaling significantly reduce the need for operators to manually reconfigure the system in response to workload changes. It improves resilience – the database can handle an unexpected spike on one key by adding capacity or splitting, rather than failing or slowing down. A well-known example is Amazon DynamoDB’s auto-scaling and adaptive capacity which make it “serverless” in the sense that it automatically adjusts to your usage. Another is Google Spanner, which can add servers to a spanserver pool when data grows. The key for designers is to ensure these adjustments are safe (no data loss, minimal performance impact during re-partitioning) and fast enough to meet the workload demands. Monitoring is still important – e.g. watch for any throttling events or increasing latency, which might indicate the workload is hitting some limit of the adaptive system (like hitting an account-level throughput limit or a maximum partition size where manual intervention is needed).
Backup, PITR & Disaster Drills
Effective backup and recovery planning can be the difference between a minor incident and a major data loss. Several concepts are important here:
- Logical vs. Physical backups: A physical backup is essentially a byte-for-byte copy of the database's data files, tablespaces, and possibly transaction logs. For example, taking an EBS volume snapshot of an RDS instance captures a physical backup. This type is usually fastest to restore on the same database engine – you essentially copy the files back and start the server. A logical backup, on the other hand, is an export of the data in a logical format (SQL INSERT statements, CSV dumps, etc.). Tools like pg_dump or mysqldump produce logical backups. Logical backups are often larger and slower to create/restore, but they have the advantage of being usable across versions or even different database systems (since it's just data and schema in SQL). Often, teams use both: e.g. nightly physical snapshots for quick full restores, and weekly logical dumps in case you need to restore just a subset or guard against logical corruption. Physical backups are great for fast point-in-time recovery since they contain everything in one image, whereas logical backups are great for granular recovery (you can pick and choose what to restore, e.g. one table).
- Point-In-Time Recovery (PITR): PITR is the ability to restore a database to an exact state at a specific time (down to some granularity, often seconds). This typically involves continuous backup of the transaction log (WAL in Postgres, binlog in MySQL) in addition to periodic base snapshots. For instance, AWS RDS PITR works by taking a snapshot and then continuously shipping transaction logs to S3. To recover to time T, RDS will grab the latest snapshot before T, then replay logs up through time T. PITR gives you a finer recovery point objective (RPO) – you might only lose at most a few seconds of data (depending on log backup frequency) as opposed to losing all data since the last full backup. Administrators must ensure the log backups are reliable and not corrupted. Additionally, practice how to execute a PITR: it can be non-trivial (for example, restoring a specific point in time onto a new instance, applying logs, etc.). PITR is extremely useful for recovering from logical errors – e.g. if someone accidentally dropped a table at 3:00 PM, you can restore the state as of 2:59 PM just before it happened (a restore sketch follows this list).
- Restore latency and drills: Having backups is one thing, being able to restore them quickly is another. A good backup strategy meets the business's Recovery Time Objective (RTO) – the maximum acceptable downtime. If your physical backups take 4 hours to download and restore, that might be fine for a small internal tool but unacceptable for a critical system. Techniques like storing backups on faster media, doing incremental backups, or keeping a warm standby can improve RTO. Disaster recovery drills are the only way to verify this. Teams should periodically simulate a scenario: "our primary database is gone – what do we do?" This could involve actually restoring the backup to a new instance and pointing the app at it. Drills often uncover issues (e.g. backup scripts that weren't including all necessary files, or documentation gaps in the restore procedure). Practicing also makes the real thing go much faster. It's common to keep a checklist or runbook for disaster recovery – and drills ensure it stays up to date. Moreover, monitoring restore time is important: some backup solutions allow instant restore (like database clones or storage-level snapshots that come up quickly), while others require lengthy data copy. If your RTO is tight (say 15 minutes), you likely need a hot standby or a cluster solution rather than relying purely on backups.
- Cross-AZ/Region replication: Backups protect against data loss, but what if an entire data center (or cloud region) goes down? For high availability, databases often use multi-AZ replication (for example, an RDS primary in one AZ and a replica in another AZ, ready to take over on instance failure). For disaster recovery across regions, one can replicate data to a completely different region. Aurora Global Database is an AWS feature that keeps a read-only replica cluster in another region with typically under 1 second of lag. If the primary region has an outage, you can promote the global replica to read-write in seconds and have your application failover to it. This provides an extremely low RPO (Aurora Global's SLA is often cited as 1 second RPO) and low RTO (maybe <1 minute to flip DNS and reconnect). Even without such fancy tech, you can achieve cross-region DR by shipping backups to another region or using database-level streaming replication to a server in another region. The important part is to ensure your backups themselves are stored off-site – e.g. if you only kept backups in the same region and that region is inaccessible, your backups are too. Cross-region replication and backups defend against regional disasters (rare but high impact). As with everything, you should periodically test your cross-region failover or backup restore. (Netflix famously does Chaos Engineering where they simulate region outages to ensure systems can handle it.)
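For the PITR item above, a minimal boto3 sketch of restoring an RDS instance to a point in time; the instance names and timestamp are placeholders, and the RestoreTime must fall within the retained backup window:

```python
from datetime import datetime, timezone
import boto3

rds = boto3.client("rds")

# Restore to 14:59 UTC, just before a hypothetical 15:00 accidental DROP TABLE.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="orders-prod",
    TargetDBInstanceIdentifier="orders-prod-pitr-1459",
    RestoreTime=datetime(2025, 6, 8, 14, 59, tzinfo=timezone.utc),
    # UseLatestRestorableTime=True,  # alternative: newest recoverable point instead
)

# Block until the restored copy is available, then repoint the app or extract the lost data.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="orders-prod-pitr-1459")
```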
In summary, backup strategy is about data safety (multiple copies, including offline copies to protect against ransomware or cloud account issues) and restore strategy is about time to recovery. Both need equal attention. Regularly verify that backups can actually be used to spin up a working database, and that your team knows the steps under pressure. As the saying goes, “Your recovery plan is only as good as your last test.”
Failover Choreography
When a primary database instance fails (or is presumed dead) and a standby needs to take over, the sequence of steps must be carefully managed to avoid data inconsistency. Key considerations in failover choreography include detection, election/promotion, and preventing split-brain:
- Failure detection (health checks): There needs to be a mechanism that notices a primary is down. This could be a dedicated monitor (e.g. a cluster manager or watchdog) or the replicas themselves via heartbeat messages. For example, in MySQL with MHA or Orchestrator, an observer pings the primary; in Postgres, a tool like Patroni or Pacemaker might be used. If heartbeats fail for, say, >10 seconds, the primary is assumed down. Fast detection is good for availability, but it must be balanced against the risk of triggering on transient network blips. Often multiple signals (lack of heartbeat and inability to reach the DB) are used to confirm.
- Promotion vs. election: Once a failure is detected, either an automated election process occurs (as in distributed consensus systems) or a pre-configured promotion takes place. In a simple replicated database, you might have one designated failover target: the system will promote the most up-to-date replica to primary. In more decentralized setups, nodes hold a vote (e.g. etcd, Spanner, and other CP systems use quorum voting to choose a leader). Elections require a majority of nodes to be communicating; this ties into the quorum concept mentioned earlier. The chosen new primary will then start accepting writes. It's important that other would-be primaries don't also promote themselves – which is prevented by the consensus algorithm or external coordinator.
- Preventing split-brain (fencing): A major risk in failover is split-brain, where the old primary might still be running and thinks it's primary, while a new primary is also active. This can lead to divergent writes that are very hard to reconcile. To prevent this, we need to fence off the old primary. "Fencing" means ensuring the old node cannot accidentally do harm – typically by cutting off its access. In practice, this could be done by the cluster manager sending a command to power off the old server, or revoking its storage access token, etc. For instance, in cluster software, STONITH ("Shoot The Other Node In The Head") is used to forcibly reset a node that is thought to be down but isn't responding. Without fencing, a "failed" primary is really in an unknown state – it might be alive but isolated. As one expert puts it: a proper failover solution requires a way to fence failed nodes, because otherwise you have a high chance of split-brain. In cloud environments, fencing might be done by a fencing agent that calls the cloud API to disable the old instance. In consensus systems (Raft/Paxos), split-brain is avoided by quorum – the minority partition won't commit new writes – but even there, if the minority can't see the majority, it must shut itself down or risk serving stale reads.
- Quorum and consensus: Using a quorum of nodes for decision-making ensures that only one "view" of the cluster is active at a time. In a 3-node setup, if one node goes down, the remaining two form a majority and can elect a new leader. The down node, when it comes back, will see there's already an active primary and won't accept writes. In a 2-node cluster, this gets tricky because there is no majority if they split – that's why 2-node clusters often use an external tie-breaker or simply aren't recommended due to split-brain potential. It's generally stated that you need at least 3 nodes (or 2 + a tiebreaker witness) to safely automate failover. Consensus algorithms handle this internally – e.g. in Raft, a candidate can only become leader if it wins a majority vote, and the old leader won't commit any new log entries without majority ack. (A minimal sketch of this decision logic follows this list.)
- Fail-fast: Another aspect is designing the database to stop serving as soon as it detects it might be partitioned. For example, a primary could monitor its ability to write to a shared storage or get heartbeats from others; if those fail, it can voluntarily demote or shut itself down ("fail fast"). This complements external fencing by having the node itself give up quickly when things go wrong, rather than stubbornly continuing in isolation.
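As referenced in the quorum item above, here is a minimal sketch of the decision logic: fail over only when a strict majority of observers agree the primary is down, then pick the least-lagged replica. Fencing the old primary is assumed to happen out of band; the observer counts and lag figures are illustrative:

```python
def should_fail_over(votes_down: int, total_observers: int) -> bool:
    """Only a strict majority may declare the primary dead; a minority view
    could just mean those observers are on the wrong side of a partition."""
    return votes_down > total_observers // 2

def choose_new_primary(replica_lag_bytes: dict) -> str:
    """Promote the most up-to-date replica (smallest replication lag).
    The old primary must be fenced (powered off / storage revoked) first."""
    return min(replica_lag_bytes, key=replica_lag_bytes.get)

# Example: 3 observers, 2 report the primary unreachable -> majority, proceed.
if should_fail_over(votes_down=2, total_observers=3):
    new_primary = choose_new_primary({"replica-a": 128, "replica-b": 4096})
    print("fence the old primary, then promote", new_primary)
```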
Once a failover completes (new primary is live), clients need to be redirected. Solutions include using a floating IP or DNS name that points to the primary – and updating it – or using a proxy that routes to the current primary. There’s often a slight outage while this happens (connections break and need to reconnect to the new node).
Failover drills are as important as backup drills. You don’t want the first time you try a failover to be in production during an incident. Regularly test failover procedures in staging or even prod (some setups do planned failovers for maintenance – which is a good opportunity to see if your apps seamlessly handle it). In cloud-managed databases (Aurora, RDS, etc.), failover is usually automated – but you should still test how the application responds (does it time out for a few seconds and then recover? Does your connection pooling layer properly replay transactions or at least throw retriable errors?). For self-managed clusters, ensure the failover script or service (Pacemaker, etc.) is tuned – e.g. not too sensitive, and that fencing truly works in your environment.
In summary, failover choreography is about speed (minimizing downtime) but also safety (avoiding split-brain and data divergence). Through heartbeats, consensus/quorums, and fencing, a new primary can take over reliably. And always remember to bring the old primary (when it comes back) into sync before it becomes a replica – usually by discarding any un-applied transactions it had and catching it up with the new primary’s state (which cluster frameworks handle by letting the old node rejoin as a follower).
Capacity Planning
Capacity planning for a database ensures that the system has enough headroom to handle current and anticipated loads with acceptable performance. Key factors include data sizing, traffic patterns, and resource provisioning:
- Working set estimation: Determine how much of the data is "hot" (frequently accessed). If the database's working set (active rows and indexes) can fit in memory, performance will be much better (no disk I/O needed for most operations). As a rule of thumb, if you can afford it, have RAM >= working set. Real-world tests have shown huge benefits – e.g. keeping the working set in memory yielded up to an 8x throughput improvement in one Postgres benchmark for read-heavy workloads. If the working set is larger than memory, consider strategies like adding an in-memory cache (Redis/Memcached) for the most frequently accessed keys, or upgrading to machines with more RAM, or sharding the data (to get more aggregate memory across nodes). Monitor metrics like cache hit rate or pages read from disk to infer if the working set is spilling to disk.
- Read/write mix: Understand the ratio of reads to writes in your workload. This affects how you scale. For instance, if you have 100x more reads than writes, you can add read replicas or a CDN/cache to scale reads horizontally, while writes go to a single primary. But if you have a heavy write workload, adding read replicas won't help much for writes – you may need to scale up the primary or partition the data so writes are distributed. Also, write-heavy loads may stress disk throughput and log flush latency (you might need faster disks or tune fsync settings). Knowing your read/write mix also informs connection pool sizing and thread models (writes might benefit from more background worker threads for flushing, etc.). In capacity planning, one often calculates something like: "Peak expected write TPS = 500, each write = ~5 random IOs, so need 2500 IOPS for writes; Peak read TPS = 5000, mostly cached so maybe 10% hit disk = 500 reads hitting disk, etc." – this ties into IOPS and throughput considerations (a worked estimate follows this list).
- IOPS vs. throughput: IOPS (I/Os per second) is crucial for random access patterns (small reads/writes all over the dataset), while throughput (MB/s) matters for large scans or bulk loads. Different storage systems have different limits: e.g., a SATA SSD might handle ~100k IOPS but also deliver 500 MB/s. A spinning disk might only do 200 IOPS but still give 100 MB/s sequential throughput. If your database workload is lots of random reads, you will max out IOPS long before throughput. If it's large analytics queries, you might saturate MB/s. Capacity planning should ensure the storage system can handle both dimensions. For example, on AWS EBS you might provision an io2 volume with 10k IOPS for random IO and ensure the instance has enough EBS bandwidth. It's also important to factor in replication overhead on IOPS: a write might cause multiple IOPS (writing to log, data file, plus replication to followers). Monitoring can show if you are bottlenecked on IOPS (high disk queue, high avg latency per IO) or on throughput (bandwidth maxed out). Matching the storage to the workload (NVMe for high IOPS, or maybe striped HDDs for a scan-heavy data warehouse) is part of capacity planning.
- Connection pooling: It's generally detrimental to have an unbounded number of client connections directly hitting the database. Each connection consumes memory and can contend with others. Many databases can handle a few hundred connections efficiently but will struggle with thousands. Connection pooling is a must for web applications – rather than each web request opening a new DB connection, a pool maintains a fixed number of connections that are reused. This reduces the overhead of connection establishment and teardown (which can be significant CPU and network overhead) and also limits concurrency to a level the DB can handle. Opening and closing a DB connection for every operation is very expensive and adds latency; doing so at scale will bottleneck the DB with connection overhead. By reusing connections, you amortize that cost and keep the database busy doing real work (queries) rather than authenticating and allocating resources repeatedly. Connection pooling middleware (like pgbouncer for Postgres, HikariCP for Java, etc.) can improve throughput and reduce latency markedly (a pooled-engine sketch follows this list). From a capacity perspective, you plan for connection count: e.g. "We will have 10 application servers, each with a pool of 20 connections = 200 total, which our DB can handle." If more concurrency is needed, scaling up the DB or adding replicas might be required, but you rarely want thousands of direct connections.
- Statement caching and preparation: Similar to connection pooling, caching query plans can save resources. Many databases will cache the execution plan of a prepared statement so that subsequent executions avoid re-planning. For example, in PostgreSQL, preparing a statement will cache its plan, and executing it 1000 times will reuse that plan each time. This is great for frequently run queries – it saves CPU on parsing and planning, and can even improve consistency of performance. When doing capacity planning, consider the overhead of query parsing/planning: if your workload consists of many repetitive queries (like "SELECT * FROM table WHERE id = ?"), using prepared statements or a plan cache can reduce CPU usage on the DB, allowing more throughput. Some databases (SQL Server, etc.) have an automatic plan cache for queries even if not explicitly prepared, but you then have to watch out for cache size and eviction policies. In any case, part of scaling is not just raw hardware – it's also eliminating waste. Reusing connections and execution plans are two ways to cut waste.
- Scaling strategy: Finally, plan how to scale when you hit limits. Capacity planning isn't one-and-done; it's a continuous process. Have metrics dashboards to know when you're nearing capacity on any dimension (CPU, memory, IOPS, storage space, connection count). Set up alerts (e.g. if CPU > 80% for 15 min or replication lag > X or cache hit ratio drops below Y). And have a playbook: e.g. if writes are saturating disk IOPS, maybe you move to a bigger instance or add an SSD; if the dataset is approaching the machine's RAM limit, maybe scale out with sharding or add a caching layer. On AWS, this could involve enabling Auto Scaling for the DB instance class or using Aurora's autoscaling as discussed. In a self-managed environment, it might mean having standby hardware to which you can failover on a larger box. It's also wise to do capacity testing – run a load test that simulates 2x or 3x current traffic to see what breaks first (CPU, IO, locks, etc.), so you know the headroom. This helps avoid nasty surprises in production spikes.
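Putting the arithmetic from the read/write-mix item into a worked sketch (the traffic numbers are the hypothetical ones used above, not a recommendation):

```python
def required_write_iops(write_tps: float, ios_per_write: float) -> float:
    """Random I/Os per second for the write path (log flush, data page, index pages)."""
    return write_tps * ios_per_write

def required_read_iops(read_tps: float, cache_hit_ratio: float) -> float:
    """Only cache misses reach the disk."""
    return read_tps * (1.0 - cache_hit_ratio)

# Hypothetical numbers: 500 write TPS at ~5 random IOs each,
# 5,000 read TPS with a 90% cache hit ratio.
writes = required_write_iops(write_tps=500, ios_per_write=5)       # 2,500 IOPS
reads = required_read_iops(read_tps=5_000, cache_hit_ratio=0.90)   # 500 IOPS
print(f"provision at least {writes + reads:,.0f} IOPS, plus headroom and replication overhead")
```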
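And, as referenced in the connection-pooling item, a minimal pooled-engine sketch with SQLAlchemy; the DSN and pool sizes are illustrative:

```python
from sqlalchemy import create_engine, text

# Hypothetical DSN; 10 app servers x pool_size=20 = 200 connections at the database.
engine = create_engine(
    "postgresql+psycopg2://app:secret@db-primary/app",
    pool_size=20,        # steady-state connections kept open and reused
    max_overflow=5,      # short bursts may open a few extra, then close them
    pool_pre_ping=True,  # detect connections broken by a failover before using them
    pool_recycle=1800,   # recycle connections every 30 minutes to avoid stale state
)

with engine.connect() as conn:   # borrows a connection from the pool, returns it on exit
    value = conn.execute(text("SELECT 1")).scalar_one()
```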
By addressing all these factors – working set sizing, IOPS/throughput provisioning, pooling, caching, and forward-looking scaling plans – you can ensure your database continues to meet SLAs as the system grows. Capacity planning is as much an art as a science: you use measurements and estimations, but also keep some safety margin and be prepared to adjust when reality diverges from predictions. The aim is to never be caught flat-footed by a sudden increase in load or data volume.