SerialReads

Resilience, Observability & Modern Trends in Load Balancers

May 29, 2025


TL;DR: Highly available load balancers achieve resilience through redundant topologies (active-active clusters, active-passive failovers, anycast global VIPs) that eliminate single points of failure. Modern designs anticipate failures (avoiding split-brain via quorum, using graceful connection draining) and rely on observability of “golden signals” (QPS, tail latency P99, error rates, backlog/saturation) to detect issues early. New trends include capacity pre-warming for traffic spikes, leveraging LBs as security choke-points (WAF, rate-limiting, enforcing mTLS in a zero-trust model), deploying sidecar proxies in service meshes (Envoy/Istio), ultra-fast kernel-level data planes with eBPF/XDP (Katran, Cilium), and strict separation of control-plane vs. data-plane for scalability.

High Availability Topologies: Active-Active vs. Active-Passive vs. Anycast

In an active-active HA cluster, two or more nodes are all live, concurrently sharing the traffic load. For example, two load balancer instances might both serve clients at the same time under a shared virtual IP. This boosts throughput and fault tolerance, since all nodes contribute capacity and no single node is a bottleneck. Users benefit from lower latency and minimal interruption because traffic is spread across multiple active nodes. However, active-active setups demand careful coordination (e.g. state synchronization or consistent hashing) to avoid conflicts and ensure each connection is handled by one node. By contrast, an active-passive cluster keeps one node fully active while one or more passive nodes stand by for failover. Only the primary node handles traffic during normal operation, and if it fails, a standby takes over the virtual IP or service role. This configuration is simpler (no concurrent state sharing), but the passive capacity sits idle and overall throughput is limited to a single active node’s capability. As a result, active-passive can’t match the performance of active-active clusters, since the lone active node must handle all clients. Still, it provides reliability: the backup can quickly assume duty, usually via heartbeat monitoring and IP takeover, to minimize downtime.
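
To make the heartbeat-and-takeover idea concrete, here is a minimal Python sketch of a standby node that probes the primary and claims the virtual IP only after several consecutive missed heartbeats. The peer address, timing values, and the claim_vip()/release_vip() helpers are hypothetical placeholders (real deployments typically use something like keepalived/VRRP), not any vendor's API.

```python
import socket
import time

PEER = ("10.0.0.1", 9000)   # hypothetical primary's heartbeat endpoint
HEARTBEAT_INTERVAL = 1.0    # seconds between probes
MISSED_BEATS_LIMIT = 3      # consecutive misses before taking over

def peer_alive(addr, timeout=0.5):
    """Return True if the primary answers a TCP heartbeat probe."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def claim_vip():
    # Placeholder: a real implementation would add the VIP to an interface
    # and send a gratuitous ARP (roughly what keepalived/VRRP does).
    print("claiming VIP - becoming ACTIVE")

def release_vip():
    print("releasing VIP - returning to STANDBY")

def standby_loop():
    missed, active = 0, False
    while True:
        if peer_alive(PEER):
            missed = 0
            if active:              # primary came back; yield the VIP
                release_vip()
                active = False
        else:
            missed += 1
            if missed >= MISSED_BEATS_LIMIT and not active:
                claim_vip()         # failover: standby becomes active
                active = True
        time.sleep(HEARTBEAT_INTERVAL)

if __name__ == "__main__":
    standby_loop()
```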

Multi-region high availability extends these concepts across geographic regions. Two main approaches are anycast and DNS-based load balancing. Anycast uses the same IP address (VIP) announced in multiple locations via BGP. Clients connect to the nearest instance of that IP (based on Internet routing). This effectively creates an active-active deployment across regions – each region’s LB serves local users under a common anycast VIP. If an entire region goes down, its BGP announcements are withdrawn and traffic is automatically routed to the remaining region(s) within seconds. The big advantage is failover speed: routing convergence shifts traffic without waiting on clients to retry or DNS to update. By contrast, DNS-based global load balancing (e.g. using DNS round-robin, latency-based routing, or geo-DNS) relies on the DNS resolver to direct clients to different region endpoints. DNS can also implement active-passive failover (health-checking a primary and failing over to backup IPs). While simpler (no BGP required), DNS failover is slower – it depends on DNS TTLs expiring and clients honoring updates, which might take minutes. Additionally, DNS-based distribution can be suboptimal if records are cached or if resolvers and clients don’t respect TTLs. Anycast avoids those issues at the network level, though it requires anycast-capable infrastructure and careful design (e.g. to handle uneven traffic and ensure users stay “sticky” to one region for session consistency). Often, global services combine techniques (anycast for infrastructure-level routing and DNS for initial discovery or as a backup mechanism).

Example: Below is an ASCII diagram of an active-active load balancer pair using an anycast VIP across two regions:

                 Internet Clients
                       |
               Anycast VIP (global IP)
              /                     \
       [Load Balancer A]        [Load Balancer B]
         (Region A)                (Region B)
             |                        |
         <Backends A>             <Backends B>

In this illustration, both LB A and LB B advertise the same virtual IP via anycast. Users hitting the shared VIP are routed by the network to the nearest load balancer (Region A or B). Each LB then forwards requests to local backend servers. If Region A goes offline, BGP withdraws Region A’s route, and all clients automatically flow to Region B with minimal disruption.
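
A hedged sketch of the control logic that typically accompanies an anycast LB: announce the VIP only while enough local backends are healthy, withdraw it otherwise so routing shifts clients to the surviving region. The health-check URLs and the announce_route()/withdraw_route() helpers are assumptions standing in for whatever BGP speaker (e.g. BIRD or ExaBGP) is actually being driven.

```python
import time
import urllib.request

ANYCAST_VIP = "203.0.113.10/32"          # example prefix (TEST-NET-3)
BACKEND_HEALTH_URLS = [                   # hypothetical local health endpoints
    "http://10.1.0.11:8080/healthz",
    "http://10.1.0.12:8080/healthz",
]

def backend_healthy(url, timeout=1.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def announce_route(prefix):
    # Placeholder for telling the local BGP speaker to announce the VIP.
    print(f"announce {prefix}")

def withdraw_route(prefix):
    # Placeholder for withdrawing the VIP so traffic shifts to other regions.
    print(f"withdraw {prefix}")

def control_loop(min_healthy=1, interval=2.0):
    announced = False
    while True:
        healthy = sum(backend_healthy(u) for u in BACKEND_HEALTH_URLS)
        if healthy >= min_healthy and not announced:
            announce_route(ANYCAST_VIP)
            announced = True
        elif healthy < min_healthy and announced:
            withdraw_route(ANYCAST_VIP)
            announced = False
        time.sleep(interval)
```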

Comparison of HA Approaches: The table below summarizes active-passive vs. active-active vs. anycast multi-region designs:

| Approach | Traffic Distribution | Failover Characteristics | Complexity & Considerations |
| --- | --- | --- | --- |
| Active-Passive | One primary node handles all traffic; standby is idle until needed. | Standby takes over on primary failure (IP or service failover). Some brief interruption during switchover is possible. | Simpler to configure; no load-sharing. Must avoid “split-brain” (ensure only one active at a time). Regular heartbeats and a quorum or fencing mechanism protect against both nodes becoming active. |
| Active-Active | All nodes actively serve traffic simultaneously, sharing the load. | If one node fails, traffic is automatically redistributed to surviving nodes. No single point of failure; users may notice little to no impact. | Higher throughput and better latency, but more complex. Requires synchronization of state (or stateless design) to prevent conflicts. Split-brain prevention is critical so two nodes don’t double-serve the same IP or clients. |
| Anycast Multi-Region | Multiple regions serve clients using the same anycast IP advertised globally. Each user is routed to the nearest region. | Fastest failover – routing updates steer traffic away from a failed region within seconds. Minimizes latency by serving users from the closest region. DNS-based alternatives are slower to fail over. | Complex network setup (BGP announcements, anycast routing). Need to ensure consistency (e.g. user sessions or data replication across regions). Testing and monitoring are needed to avoid imbalance or flapping between regions. |

Failure Modes & Safeguards

Even with robust topologies, engineers must anticipate failure modes in load balancer clusters and apply safeguards. A notorious pitfall in HA clusters is split-brain – a scenario where a clustering fault (often a network partition) causes multiple nodes to each believe they are the active leader. In an active-passive pair, split-brain could lead to both nodes claiming the VIP and serving traffic simultaneously, which can corrupt sessions or send duplicate packets. Preventing split-brain requires quorum and fencing mechanisms. A common strategy is to use a quorum (majority) vote or a witness node so that only one side of a partition continues as active. In practice, this means if the link between two LBs breaks, a third arbiter or some tie-break rule determines which LB keeps the VIP, while the other stands down. Heartbeat messages and quorum witnesses (e.g. a small third node or cloud-based tie-breaker) ensure that a minority partition will realize it’s isolated and avoid taking action, thus maintaining a single active load balancer at all times. Another safeguard is state synchronization in active-active clusters – sharing connection tracking or persistence data between nodes – to avoid inconsistencies if one node fails.
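
A minimal sketch of the quorum rule, assuming a two-LB pair plus a small witness node: a node keeps (or takes) the active role only if it can see a strict majority of the cluster, counting itself, so a node isolated in a minority partition stands down instead of also claiming the VIP. The membership list and reachability inputs are illustrative.

```python
CLUSTER = ["lb-a", "lb-b", "witness"]   # two LBs plus a lightweight tie-breaker node

def have_quorum(self_name, reachable_peers, members=CLUSTER):
    """True if this node plus the peers it can currently reach form a strict majority."""
    visible = 1 + sum(1 for m in members if m != self_name and m in reachable_peers)
    return visible > len(members) // 2   # e.g. needs 2 of 3

def decide_role(self_name, reachable_peers, currently_active):
    if have_quorum(self_name, reachable_peers):
        return "ACTIVE" if currently_active else "ELIGIBLE"
    # Minority partition: stand down so the cluster never double-serves the VIP.
    return "STANDBY"

# Example: the link between lb-a and lb-b breaks, but lb-a still reaches the witness.
print(decide_role("lb-a", {"witness"}, currently_active=True))   # ACTIVE (sees 2 of 3)
print(decide_role("lb-b", set(), currently_active=True))         # STANDBY (isolated)
```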

When a load balancer node does need to be removed (for maintenance or scaling down) or fails health checks, graceful connection draining is essential. Rather than abruptly dropping open connections, the load balancer can stop accepting new traffic on that node but allow existing sessions to continue for a grace period. This prevents users from seeing half-loaded pages or sudden errors mid-transaction. For example, AWS load balancers support a “connection draining” (or deregistration delay) setting – when enabled, taking a backend instance out of rotation waits for ongoing requests to finish (up to a timeout) before closing connections. This ensures a rolling deployment or scale-in event doesn’t terminate users’ sessions prematurely. Similarly, LBs often perform health checks and remove unhealthy backends only after draining, to promote seamless failover at the node level. In summary, high-availability LBs must be built to fail gracefully: using quorum to avoid split-brain activation, and draining or handoff techniques so that neither planned maintenance nor sudden crashes disrupt active user connections more than necessary.
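
The draining behavior can be sketched as: take the backend out of the new-connection path, wait for in-flight requests to finish up to a deadline, then close whatever remains. The Backend class and its counters are illustrative stand-ins; managed LBs (e.g. an ALB's deregistration delay) implement this internally.

```python
import time

class Backend:
    """Toy stand-in for the LB's view of one target instance."""
    def __init__(self):
        self.accepting_new = True
        self._active = 0            # number of in-flight requests
    def in_flight(self):
        return self._active
    def close_all_connections(self):
        self._active = 0

def drain(backend, timeout_s=300, poll_s=5):
    """Gracefully remove a backend: stop new traffic, let existing requests finish."""
    backend.accepting_new = False                     # LB stops routing new requests here
    deadline = time.monotonic() + timeout_s
    while backend.in_flight() > 0 and time.monotonic() < deadline:
        time.sleep(poll_s)                            # wait for active requests to complete
    backend.close_all_connections()                   # force-close anything left at the deadline

b = Backend()
drain(b, timeout_s=30, poll_s=1)   # returns immediately here since nothing is in flight
```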

Golden Signals & Observability

Operating load balancers at scale demands strong observability. Google’s SRE doctrine highlights four “Golden Signals” to monitor: Latency, Traffic, Errors, and Saturation. For load balancers, these translate into key metrics: traffic (request rate, e.g. QPS/RPS, plus active connection counts), latency (response-time distributions, especially tail percentiles like P99), errors (the rate of 4xx/5xx responses, failed health checks, or TLS handshake failures), and saturation (queue/backlog depth, connections versus limits, and CPU or memory pressure on LB nodes and backends).

To achieve robust observability, modern LBs export detailed metrics. For example, an Nginx/Envoy proxy can emit per-endpoint latencies and response codes; AWS ALB provides CloudWatch metrics like RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count, etc. Dashboards typically plot QPS vs. error rate, and P99 latency over time. An anomaly in any golden signal might warrant shifting traffic or adding capacity. Moreover, tracing and logging complement metrics: logs of requests (with timestamps, chosen backend, etc.) and distributed tracing (propagating trace IDs through the LB) can pinpoint where delays occur in a request’s path. In summary, by watching the golden signals – high percentile latency spikes, increasing error percentages, traffic surges, and queue lengths – engineers can catch issues in the load balancing layer before they impact users significantly.
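
As an illustration of how the request-driven signals are derived, the sketch below rolls a window of per-request records (duration, status code) into QPS, error rate, and P99 latency. The record shape is an assumption, not any particular LB's log schema, and saturation would come from gauges (queue depth, CPU) rather than per-request logs.

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int

def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(0, round(0.99 * len(ordered)) - 1)
    return ordered[rank]

def golden_signals(requests, window_s):
    if not requests:
        return {"qps": 0.0, "error_rate": 0.0, "p99_ms": 0.0}
    qps = len(requests) / window_s                                        # Traffic
    error_rate = sum(r.status >= 500 for r in requests) / len(requests)   # Errors
    p99_ms = p99([r.duration_ms for r in requests])                       # Latency (tail)
    return {"qps": qps, "error_rate": error_rate, "p99_ms": p99_ms}

# Example: four requests observed over a 1-second window.
window = [Request(12.0, 200), Request(15.5, 200), Request(480.0, 200), Request(9.0, 503)]
print(golden_signals(window, window_s=1.0))
# {'qps': 4.0, 'error_rate': 0.25, 'p99_ms': 480.0}
```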

Capacity Dynamics & Burst Handling

A resilient system not only balances load, but also plans for load growth and spikes. Cloud load balancers can scale to huge throughput, but they don’t do so instantaneously. Pre-warming is the practice of preparing a load balancer (or any service) for a large surge by gradually ramping up traffic or explicitly reserving capacity ahead of an event. Without pre-warm, a sudden traffic spike (say a flash sale or viral event) can overwhelm the LB’s current capacity before it auto-scales, leading to dropped requests. For example, earlier-generation AWS ELBs sometimes needed manual pre-warming – if you expected a 10× jump in traffic at a specific time, you’d contact AWS to pre-allocate more LB capacity. Even today, extreme spikes (many thousands of RPS arriving within seconds) can cause a brief overload. As one AWS engineer noted, a rapid surge to gigabits of traffic can “cause LB nodes to fall over while the LB scales to accommodate that traffic” if not pre-warmed. In practice, modern LBs like AWS ALB or Google’s Cloud LB handle growth automatically, but they still have ramp-up curves. It’s wise to ramp traffic gradually (canary deployments, phased releases) when possible, so the LB and its targets scale smoothly.

Autoscaling of backends further complicates this dynamic. Auto-scaling groups will launch new application instances when load increases, but these take time (e.g. tens of seconds to minutes to boot and initialize the application). During a sudden spike, the LB might experience high latency or errors because not enough backends are ready to accept the load in that instant. Autoscaling is reactive – it responds after metrics indicate load, so there’s an inherent lag. If a traffic burst grows faster than the scale-out rate, the system can saturate in the interim. For example, if server CPU hits 100% under a surge, that may trigger adding instances, but users might see errors or timeouts during those crucial first minutes. To mitigate this, engineers use aggressive scaling policies (scaling on shorter intervals and on leading indicators like queue length) and often maintain a buffer of excess capacity. Another technique is over-provisioning slightly during expected peak periods so that the first hit of traffic doesn’t arrive at a cold, minimal cluster. In some scenarios, a queue or waiting room at the edge (e.g. using a token bucket or virtual waiting room) is used to throttle incoming requests until capacity catches up. Cloud providers have also introduced features to reserve capacity units for load balancers in advance, essentially pre-warming by configuration.
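
For the token-bucket throttle mentioned above, a minimal sketch: admit a request only if a token is available, refilling tokens at the sustained rate while the bucket size bounds the burst. This is a generic building block, not tied to any particular LB product.

```python
import time

class TokenBucket:
    """Admit up to `rate` requests/second sustained, with bursts up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller can queue, shed, or send the client to a waiting room

# Example: 100 req/s sustained, with bursts of up to 200 absorbed without rejection.
edge_throttle = TokenBucket(rate=100, burst=200)
if not edge_throttle.allow():
    print("503 / redirect to waiting room")
```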

Ramp-up limits can exist at multiple layers: the LB might only be able to open new connections at a certain rate until it scales out internally; the instances might have cold caches or JIT warm-up; external dependencies (like databases) might not accept sudden load without performance loss. Therefore, capacity planning remains important despite auto-scaling – know your baseline load, your spike tolerance, and how quickly your system can grow. Watching metrics like SpilloverCount (requests rejected by the LB for lack of capacity) or RejectedConnectionCount can reveal whether you’ve hit an LB limit unexpectedly. If so, it’s a sign that pre-warming or a higher LB bandwidth allocation is needed. In summary, handle capacity dynamics by designing for gradual growth when possible, using auto-scaling with fast reaction settings, and monitoring for any throttling signals. It’s far better to absorb a spike with some headroom than to play catch-up after users have already seen failures.

Security Choke-Points in Load Balancing

A load balancer often sits at a strategic point in the network – making it an ideal choke-point for security enforcement. Modern L7 load balancers and API gateways commonly integrate a Web Application Firewall (WAF) to inspect incoming requests and filter out malicious traffic. A WAF applies a set of rules to HTTP requests, blocking patterns that match known attacks (SQL injection, XSS, etc.) or abnormal behavior. For instance, Cloudflare’s WAF can deploy managed rule sets that instantly protect against the OWASP Top 10 attacks and emerging zero-day exploits. By placing the WAF at the load balancer, you shield all backend servers from bad traffic centrally – the LB will drop or challenge a threat before it even reaches an application server. This central rule enforcement is easier than securing each of dozens of services individually. It’s vendor-neutral too: whether using an AWS ALB with AWS WAF, Nginx with ModSecurity, or an F5 appliance with a built-in WAF, the concept is to leverage the LB’s vantage point to guard the apps behind it.
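
As a toy illustration of rule-based filtering at the LB, the sketch below checks a request against a few regex rules and reports the first match; real WAF engines use large managed rule sets, request parsing, and anomaly scoring rather than a handful of patterns.

```python
import re

# Vastly simplified rule set; real WAFs ship thousands of managed rules.
RULES = [
    ("sql injection probe", re.compile(r"(?i)\bunion\b.+\bselect\b")),
    ("path traversal",      re.compile(r"\.\./")),
    ("basic xss probe",     re.compile(r"(?i)<script\b")),
]

def inspect(request_line: str, body: str = ""):
    """Return the name of the first rule a request trips, or None if it looks clean."""
    payload = request_line + "\n" + body
    for name, pattern in RULES:
        if pattern.search(payload):
            return name   # the LB would block or challenge before reaching a backend
    return None

print(inspect("GET /search?q=1 UNION SELECT password FROM users HTTP/1.1"))  # sql injection probe
print(inspect("GET /index.html HTTP/1.1"))                                    # None
```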

Beyond filtering attacks, load balancers can enforce rate limiting to prevent abuse. For example, an API endpoint might be abused by a single IP flooding requests; the LB can detect this and reject or throttle after a threshold (e.g. 100 requests per second per client). This not only mitigates certain DoS attacks but also protects backend workloads from overload by controlling the rate of traffic intake. Many LBs allow defining rate-limit policies (by IP, user token, etc.), serving as the first line of defense against spikes from individual sources. Coupled with that, LBs may integrate with authentication and authorization systems – some act as an OAuth2/OIDC gateway or check API keys/JWTs before routing traffic, thus offloading auth from services.
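
A hedged sketch of per-client rate limiting using a fixed window keyed by client IP; the limit, window, and bookkeeping here are illustrative choices, and production LBs typically use token buckets or sliding windows with distributed counters.

```python
import time
from collections import defaultdict

class PerClientRateLimiter:
    """Fixed-window limit: at most `limit` requests per client per `window_s` seconds."""
    def __init__(self, limit=100, window_s=1.0):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)   # (client, window index) -> request count

    def allow(self, client_ip: str) -> bool:
        window = int(time.monotonic() // self.window_s)
        key = (client_ip, window)
        self.counts[key] += 1
        # Note: a real implementation would also prune counters from old windows.
        return self.counts[key] <= self.limit

limiter = PerClientRateLimiter(limit=100, window_s=1.0)
if not limiter.allow("198.51.100.7"):
    print("429 Too Many Requests")   # or drop/deprioritize the connection
```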

A crucial aspect of modern security is mTLS (mutual TLS). In a zero-trust architecture, the client must prove its identity with a certificate, not just the server. Load balancers can operate in TLS pass-through mode or TLS termination mode with mTLS. In pass-through, the LB simply forwards encrypted data to backend servers, which handle the TLS (including client certificate validation). In termination mode, the LB itself terminates the TLS connection, authenticates the client certificate against a trusted CA bundle, and then usually establishes a separate TLS connection to the backend. Some cloud LBs now directly support mTLS verification at the edge – for example, AWS Application Load Balancer can require clients to present trusted certs and will block requests from unauthorized clients, offloading that CPU-intensive handshake from the app servers. Offloading mTLS to the LB means your services can operate in an already-authenticated context (the LB can pass along the client’s certificate info in headers), simplifying service logic. Either way, mutual TLS ensures both parties are authenticated, fitting the zero-trust mantra that every connection, even internal ones, must be verified.
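
A minimal sketch of mTLS termination using Python's standard ssl module: the listener presents its own certificate and requires clients to present one signed by a trusted CA, which is essentially what an LB in termination mode does before re-encrypting to the backend. The certificate file paths and port are placeholders.

```python
import socket
import ssl

# Server-side TLS context that *requires* a client certificate (mutual TLS).
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="lb-cert.pem", keyfile="lb-key.pem")  # the LB's own identity
ctx.load_verify_locations(cafile="client-ca.pem")                  # CA that signs client certs
ctx.verify_mode = ssl.CERT_REQUIRED                                 # reject clients without a valid cert

with socket.create_server(("0.0.0.0", 8443)) as listener:
    conn, addr = listener.accept()
    with ctx.wrap_socket(conn, server_side=True) as tls_conn:
        peer_cert = tls_conn.getpeercert()   # details of the verified client certificate
        # A real LB would now proxy the request to a backend, typically passing the
        # client identity (e.g. the cert subject) along in a header.
        print("authenticated client:", peer_cert.get("subject"))
```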

The zero-trust security posture assumes no request or user is implicitly trusted just because it’s inside your network boundary. It entails strict identity verification and least-privilege access for every single connection. Load balancers contribute to zero-trust by enforcing these checks uniformly. They can require encryption (TLS) everywhere, validate client identities (via mTLS or tokens), and never allow “anonymous” traffic through by default. This is a shift from older “perimeter security” models – instead of a hard shell (firewall) and soft interior, zero-trust treats every hop with suspicion. For instance, if microservice A calls microservice B through the mesh, the sidecar LBs (Envoy) will ensure A’s identity is presented and authorized to call B. Traditional load balancers are now often augmented with identity-aware proxies to achieve this. It’s worth noting that a load balancer is not a replacement for a network firewall – you still want a firewall blocking unwanted ports or protocols at the edge. But once traffic hits the LB, the LB acts as a gatekeeper for application-level security. In practice, deploying a zero-trust model might involve LBs doing TLS client cert checks, JWT validation, and passing identity claims to backends, as well as implementing segmentation (only route requests to allowed services). The load balancer thus becomes a unified enforcement point for security policies, drastically limiting exposure if any one service is compromised.

Modern Trends: Service Meshes, eBPF Data Planes & Edge Integration

The concept of load balancing has expanded beyond the classic hardware appliance or simple reverse proxy. Service mesh architectures have brought load balancing into each application instance via sidecar proxies. Instead of all traffic funneling through a central LB tier, a service mesh (e.g. Istio, Linkerd) deploys a lightweight proxy (like Envoy) alongside each service pod. These sidecars handle traffic routing and load balancing on the client side – when Service A calls Service B, Envoy in A’s pod will consult its discovery info and choose an instance of B to send the request to. This effectively distributes the load balancing logic throughout the network: every service instance does intelligent client-side load balancing with global visibility provided by the mesh’s control-plane. The benefits are significant: you get fine-grained control over traffic (each call can be routed based on real-time health, or follow specific routing rules for canary releases, etc.) and fewer hops (no need to always bounce through a central LB for inter-service calls). Envoy proxies can implement advanced policies – e.g. automatic retries, circuit breaking, and latency-aware load balancing – that improve the reliability of calls between microservices. Modern service meshes also unify observability (the sidecars report metrics and traces for every service-to-service call) and security (they can encrypt and authenticate every call with mTLS by default). In short, sidecar load balancers in a mesh provide application-layer load balancing that is application-aware and highly dynamic. For example, Envoy is a popular open-source proxy used in Istio and others to handle service mesh traffic management and load balancing for cloud-native apps. It runs as an out-of-process proxy next to the app, and a central control-plane (like Istio’s Pilot or Kubernetes’s service API) distributes endpoint information and policies to each Envoy. This is a powerful extension of the load balancing concept, moving it deeper into the internal fabric of distributed systems.
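
The client-side decision a sidecar makes can be sketched as: filter the discovered endpoints for the target service down to healthy ones, then pick one (here the lowest observed latency with a small random spread, as a simplified stand-in for Envoy's richer policies). The endpoint data, health flags, and latency field are illustrative assumptions.

```python
from dataclasses import dataclass
import random

@dataclass
class Endpoint:
    address: str
    healthy: bool
    ewma_latency_ms: float   # smoothed latency the sidecar has observed for this endpoint

def pick_endpoint(endpoints):
    """Client-side choice: healthy endpoints only, prefer the lowest observed latency."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints for service-b")
    best = min(healthy, key=lambda e: e.ewma_latency_ms)
    # Spread load at random among endpoints whose latency is within ~10% of the best.
    near_best = [e for e in healthy if e.ewma_latency_ms <= best.ewma_latency_ms * 1.1]
    return random.choice(near_best)

service_b = [
    Endpoint("10.2.0.4:8080", healthy=True,  ewma_latency_ms=12.0),
    Endpoint("10.2.0.5:8080", healthy=True,  ewma_latency_ms=11.5),
    Endpoint("10.2.0.6:8080", healthy=False, ewma_latency_ms=3.0),   # failing health checks
]
print(pick_endpoint(service_b).address)   # one of the two healthy endpoints
```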

At the other end of the spectrum, performance at scale has driven innovations like kernel bypass and eBPF-based load balancers. Traditional software LBs (HAProxy, NGINX) run in user space and handle packets through the kernel’s networking stack, which adds overhead. Companies like Facebook (Meta) have pioneered using eBPF/XDP to speed up L4 load balancing. XDP (eXpress Data Path) is a Linux kernel feature that lets you attach a custom program to the network driver, processing packets at the earliest point in the kernel (even before most of the stack). Facebook’s Katran is an XDP/eBPF-based L4 load balancer that runs on commodity Linux servers to handle massive traffic with very low CPU usage. Essentially, the LB logic (like picking a backend server for a new connection) is done in a BPF program at the kernel level, which then forwards the packet to the chosen backend with minimal overhead. This can achieve performance on par with specialized hardware, but with the flexibility of software. Similarly, Cilium (an eBPF-based networking project) can replace Kubernetes’s kube-proxy with an in-kernel load balancer for service IPs, achieving high packet rates and even using Maglev hashing for consistent client-to-server mapping. These eBPF LBs also make it easier to co-locate load balancing on general nodes (because they’re efficient and can run alongside applications). The trend here is using modern Linux capabilities to create fast-path data planes that avoid context switches and leverage high-speed packet handling. Cloudflare’s network load balancer and others have also adopted XDP for DDoS mitigation and load balancing tasks. In practice, this means your traffic can be distributed by code running directly in the NIC driver context – a far cry from the days of requiring proprietary hardware for high-throughput L4 load balancing.
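
To show what Maglev-style consistent hashing looks like, here is a simplified sketch of building the lookup table and mapping a flow to a backend. It follows the published Maglev construction (per-backend permutations filled greedily), but it is only illustrative: real implementations (Katran, Cilium) run this in C/eBPF, use large prime table sizes, and handle weights and backend churn.

```python
import hashlib

def _h(s: str, seed: str) -> int:
    return int(hashlib.md5((seed + s).encode()).hexdigest(), 16)

def maglev_table(backends, table_size=65537):   # table_size should be prime
    """Build a Maglev-style lookup table mapping table slots -> backend index."""
    n = len(backends)
    offsets = [_h(b, "offset") % table_size for b in backends]
    skips = [_h(b, "skip") % (table_size - 1) + 1 for b in backends]
    table = [-1] * table_size
    next_idx = [0] * n
    filled = 0
    while filled < table_size:
        for i in range(n):
            # Walk backend i's permutation until a free slot is found.
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % table_size
                next_idx[i] += 1
                if table[slot] == -1:
                    table[slot] = i
                    filled += 1
                    break
            if filled == table_size:
                break
    return table

def pick_backend(table, backends, flow_key: str):
    """Hash the connection's 5-tuple onto the table to get a stable backend."""
    return backends[table[_h(flow_key, "flow") % len(table)]]

backends = ["10.3.0.1", "10.3.0.2", "10.3.0.3"]
table = maglev_table(backends, table_size=31)   # tiny prime table just for the demo
print(pick_backend(table, backends, "198.51.100.7:51512->203.0.113.10:443/tcp"))
```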

Another modern trend is the integration of load balancing with edge computing and CDN networks. Content Delivery Networks (CDNs) already function as globally distributed load balancers for caching content. Now, they are extending to handle dynamic content and compute at the edge. For example, a service like Cloudflare or Fastly will not only direct users to the nearest edge location via anycast (global load balancing), but can also run serverless functions at the edge to process requests, thereby offloading work from the origin servers. This effectively brings load balancing and application logic closer to users. If you deploy an app across multiple regions or edge locations, modern global load balancers (like AWS Global Accelerator, Google Cloud HTTP LB, or Cloudflare) route each user to the optimal location based on latency and load. They often tie in with DNS and anycast under the hood, abstracting it for the developer. The edge integration means that the distinction between “load balancer” and “reverse proxy cache” and “edge compute node” is blurring. We have multi-layer load balancing: global anycast to the nearest data center, then maybe a layer-7 load balancer within that data center to the correct service or cluster, and finally a service mesh for local pod load balancing. The cutting edge is using CDNs as the first-layer load balancer for everything – terminating traffic in dozens of cities globally, executing some logic (auth, caching, small compute tasks), and then forwarding to a regional cluster if needed. This improves resiliency (as even a whole region failure can be masked by the CDN routing to another region) and performance (lower latency, since static responses or even cached API responses might be served from the edge).

Control-Plane vs. Data-Plane Separation

In large-scale systems (including load balancers), a clear separation between the control-plane and data-plane is fundamental. The data-plane is the component that actually handles live traffic – in networking terms, it forwards packets or in L7 terms, it proxies requests. The control-plane is the brains that makes decisions about where traffic should go and configures the data-plane. Traditionally in hardware routers, for example, the control-plane computes routing tables (via protocols like BGP/OSPF) and the data-plane forwards packets accordingly. In load balancers, a control-plane might be responsible for tasks such as: watching service registry/endpoint health, computing load balancing decisions or weights, and distributing configuration (like new routes or scaling events) to the load balancer instances. The data-plane is then the worker that applies those rules to each incoming connection efficiently.

Why separate them? Because it improves scalability, performance, and reliability. Data-planes are ideally kept simple and fast, focused on per-packet or per-request handling at high volume. Control-planes can be more complex and can even be allowed to run slightly slower, since they are not in the direct path of every user request. By separating concerns, you keep heavy computations or state updates from slowing down traffic processing. For instance, consider Envoy in a service mesh: Envoy (data-plane) is optimized in C++ for high-performance proxying, while Istio’s Pilot (control-plane) aggregates service discovery info and pushes config to Envoy. Envoy doesn’t need its own complex discovery logic; Pilot doesn’t touch every packet. This division also aids reliability – if the control-plane has a glitch or is upgrading, the data-plane can continue operating with the last known good config, so user traffic isn’t impacted. AWS’s architecture guidance similarly notes that keeping data-planes simple (with fewer moving parts) makes failures less likely in the data path. The control systems (APIs, configuration stores, etc.) can be isolated so that any issue there doesn’t immediately take down data handling. At massive scale, it’s also about manageability: a centralized or logically centralized control-plane can manage hundreds of distributed data-plane instances. For example, Google’s Cloud Load Balancing uses a control-plane that monitors backend pool health and dynamically programs Maglev (Google’s distributed software load balancers) at the edges to steer traffic. The user only interacts with control-plane APIs (to add a backend, change a rule, etc.), and those propagate to all the data-planes behind the scenes.

In practice, when designing a system for scale, you might run a fleet of stateless load balancer processes (data-plane) that are updated by a separate controller service (control-plane). The control-plane handles leader election, global configuration, scaling decisions, and health monitoring, then instructs data-plane nodes “take server X out of rotation” or “here’s a new VIP for service Y”. This way, the packet-forwarding logic on each node remains lean and fast – often in-kernel or event-driven user-space – while the higher-level logic can afford to use richer computation, databases, etc. without risking data-plane throughput. Modern proxies like Envoy explicitly follow this design (the “xDS” discovery APIs served by an external control-plane). Even F5’s BIG-IP appliances internally separate the traffic-handling data plane (TMM) from the control and management daemons. The result is easier horizontal scaling (just add more data-plane instances for capacity, and the control-plane will configure them) and improved fault tolerance (if one data-plane instance crashes, control can redistribute its load; if the control-plane briefly disappears, the data-plane keeps forwarding using its last config). In summary, control vs. data plane separation is about divide and conquer: the control-plane makes the decisions and the data-plane executes them. This yields a more robust, scalable load balancing system – one where high-rate traffic handling is not slowed by configuration logic, and where you can update or change policies centrally without touching every packet flow in real time.
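
A minimal sketch, assuming a hypothetical HTTP config endpoint, of the “last known good config” behavior: the data-plane periodically pulls routing config from the control-plane, but if the pull fails it simply keeps forwarding with the config it already has. This is loosely modeled on the xDS pattern, not a real API.

```python
import json
import time
import urllib.request

CONTROL_PLANE_URL = "http://controller.internal:8000/v1/config"   # hypothetical endpoint

class DataPlane:
    def __init__(self):
        self.config = {"backends": []}   # last known good config

    def refresh_config(self):
        """Pull new config; on any failure, keep serving with the old one."""
        try:
            with urllib.request.urlopen(CONTROL_PLANE_URL, timeout=2) as resp:
                self.config = json.load(resp)   # e.g. {"backends": ["10.4.0.1:80", ...]}
        except (OSError, ValueError):
            pass   # control-plane unreachable or returned garbage: keep last good config

    def serve_forever(self, poll_s=10):
        while True:
            self.refresh_config()
            # ...the hot path keeps proxying requests to self.config["backends"]
            # regardless of whether this refresh succeeded...
            time.sleep(poll_s)
```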
