SerialReads

Core Architecture & Building Blocks of Load Balancers

May 27, 2025


TL;DR: Load balancers distribute client traffic across multiple servers, improving reliability and scalability. They actively check which servers are healthy (and avoid the unhealthy), keep individual user sessions tied to the same server when needed (sticky sessions), and handle secure connections via TLS either by decrypting at the load balancer or passing through to backends. Under the hood, a robust load balancer design includes health checks, connection hashing for flow consistency, session persistence, TLS termination decisions, dynamic backend registration/deregistration, high-availability via shared virtual IPs, and a clear separation between the control plane (brains) and data plane (traffic path).

Imagine an e-commerce site on Black Friday. Thousands of shoppers click “Buy,” and dozens of backend servers are standing by. The load balancer sits in front, deciding which server should handle each request. It must ensure no one is sent to a crashed server, each user sticks to one server for their whole session (to keep their shopping cart), and that if a load balancer node fails, another seamlessly takes over. Let’s explore how load balancers achieve all this through their core architecture and building blocks.

Active vs Passive Health Checks (Fail-In and Fail-Out)

One fundamental job of a load balancer (LB) is to detect unhealthy servers and stop sending them traffic. This is done through health checks. Active health checks are proactive: the LB periodically pings each server (for example, sending an HTTP request or a TCP “heartbeat”) to see if it responds correctly. If a server fails to respond or returns an error, the LB marks it as unhealthy. Typically, the LB uses thresholds: e.g. if N checks in a row fail (fail-out threshold), the server is taken out of rotation. This avoids yanking a server after one fluke error – it needs consistent failures. Conversely, a fail-in (or rise) threshold defines how many successful checks are needed to bring a previously failed server back online. These thresholds prevent flapping, so a flaky server isn’t rapidly added/removed on intermittent responses.
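
To make the threshold logic concrete, here is a minimal sketch in Python (not tied to any particular load balancer) of an active HTTP health checker with a fail-out ("fall") and fail-in ("rise") counter per backend. The /health path, the 2-second timeout, the backend IPs, and the threshold values are illustrative assumptions.

```python
import time
import urllib.request

FALL = 3       # consecutive failures before a server is taken out of rotation
RISE = 2       # consecutive successes before a failed server is brought back
INTERVAL = 5   # seconds between probes

backends = {
    "10.0.0.11": {"healthy": True, "fails": 0, "passes": 0},
    "10.0.0.12": {"healthy": True, "fails": 0, "passes": 0},
}

def probe(ip):
    """Active check: expect HTTP 200 from /health within 2 seconds."""
    try:
        with urllib.request.urlopen(f"http://{ip}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_checks():
    while True:
        for ip, state in backends.items():
            if probe(ip):
                state["passes"] += 1
                state["fails"] = 0
                if not state["healthy"] and state["passes"] >= RISE:
                    state["healthy"] = True      # fail-in: re-enter rotation
            else:
                state["fails"] += 1
                state["passes"] = 0
                if state["healthy"] and state["fails"] >= FALL:
                    state["healthy"] = False     # fail-out: leave rotation
        time.sleep(INTERVAL)
```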

Passive health checks, on the other hand, rely on observing real user traffic. The LB watches the responses from servers as it proxies requests. If a server starts timing out or returning failures to user requests repeatedly, the LB can infer it’s unhealthy and stop sending new traffic there. Passive checks have the advantage of detecting issues that occur under load or specific query patterns (which an active check might not catch if it only probes a simple endpoint). However, passive checks only trigger if there is user traffic – if the system is quiet, a server could be down and passive monitoring wouldn’t notice until a client request fails. In practice, many LBs use both: active pings for baseline monitoring and passive for catching sudden failures between pings.
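
A passive check can be sketched as a small hook that the proxying code calls with the outcome of every real request it forwarded. The consecutive-error threshold and the 30-second ejection window below are arbitrary illustrative values.

```python
import time

PASSIVE_FALL = 5      # consecutive bad responses before ejecting a backend
EJECT_SECONDS = 30    # how long to keep it out before trying it again

passive_state = {}    # backend ip -> {"errors": int, "ejected_until": float}

def observe(ip, ok):
    """Called by the proxy with the outcome of each real client request."""
    s = passive_state.setdefault(ip, {"errors": 0, "ejected_until": 0.0})
    if ok:
        s["errors"] = 0                     # any success resets the streak
    else:
        s["errors"] += 1
        if s["errors"] >= PASSIVE_FALL:     # too many real failures in a row
            s["ejected_until"] = time.time() + EJECT_SECONDS
            s["errors"] = 0

def usable(ip):
    """True if the backend is not currently ejected by passive checks."""
    s = passive_state.get(ip)
    return s is None or time.time() >= s["ejected_until"]
```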

Real-world example: Suppose one backend server’s database connection breaks. An active health check hitting /health might still return “OK” (since the webserver is up), missing the deeper issue. But as soon as real user requests to that server start failing, passive health check logic can mark it unhealthy and the LB will route users elsewhere. Using a combination, the LB minimizes impact: active checks remove outright dead servers, and passive checks catch partial failures quickly.

Connection Stickiness and 5-Tuple Flow Hashing

Load balancers not only decide which server to use, but must ensure that once a decision is made, all packets for that connection go to the same chosen server. At the network layer (Layer 4), this is handled by flow stickiness using a 5-tuple hash or connection table. The “5-tuple” refers to the five elements that define a TCP/UDP connection: source IP, source port, destination IP, destination port, and protocol. By hashing or tracking this 5-tuple, the LB ensures that each unique connection (or flow) is consistently routed to one backend. For example, if a client opens a TCP connection, the initial packet is routed to Server X; subsequent packets with the same 5-tuple will follow to Server X as well. This prevents breaking TCP streams mid-flight. Many load balancers (e.g. Azure’s) default to a 5-tuple hash algorithm, meaning the selection of server is based on these five fields.

Crucially, 5-tuple hashing provides stickiness only for the life of that transport session. Once the client closes the connection (and perhaps opens a new one), the source port will likely differ, producing a new 5-tuple and potentially a different backend choice. In other words, L4 stickiness makes sure each connection stays intact on one server, but it doesn’t guarantee the next connection from the same user goes to the same server.

To manage flows, some LBs keep an explicit connection table (mapping each active flow to its backend). Others use stateless hashing: e.g. computing a hash of the 5-tuple mod the number of servers. Stateless hashing is simple and fast, but if the server pool size changes, it can break consistency (some providers mitigate this with consistent hashing techniques). A connection table (stateful LB) can handle dynamic membership by updating entries on the fly, but uses more memory. Either way, flow stickiness is vital for protocols like TCP – imagine a user’s WebSocket or file download connection jumping between servers mid-connection if the LB didn’t pin the flow!
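
Here is a minimal sketch of the stateless variant: hashing the 5-tuple and taking it modulo the pool size. The backend addresses are placeholders, and the closing comment notes why naive modulo hashing degrades when the pool changes.

```python
import hashlib

backends = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto):
    """Stateless 5-tuple hash: same flow -> same index -> same backend."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]

# Every packet of one TCP connection carries the same 5-tuple,
# so it always hashes to the same backend:
print(pick_backend("198.51.100.7", 52344, "203.0.113.1", 443, "tcp"))

# Note: with naive modulo, adding or removing a backend changes len(backends)
# and remaps most existing flows -- this is why stateful connection tables or
# consistent hashing are used when the pool changes often.
```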

In summary, connection-level stickiness ensures low-level consistency. It’s the foundation before we even consider higher-level session persistence. Every packet belongs to some flow, and the LB’s data-plane uses hashing or lookup to send it to the correct server. With that in place, we can layer on more application-aware persistence when needed.

Session Persistence Mechanisms (Sticky Sessions)

Above and beyond per-connection stickiness, many applications need session persistence (also known as affinity or sticky sessions). This means that the same client is sent to the same backend server for the duration of their session or activity. This is crucial, for example, if servers store user-specific data in memory (shopping cart, login state, etc.). Without sticky sessions, a user might log in on Server A, then a subsequent request goes to Server B which doesn’t know about the existing session – causing a logout or empty cart. Persistence avoids that by binding the user to Server A for all future requests.

Load balancers implement session persistence through a few common mechanisms:

- Client IP affinity: the LB hashes or remembers the client’s source IP so that requests from that address keep landing on the same server.
- Cookie-based persistence: the LB sets (or reads) an HTTP cookie identifying which server handled the first request, and routes later requests to that same server.
- Custom header or token: the application or LB agrees on an identifier carried in a header or request parameter and routes on it consistently – useful for APIs and non-browser clients.

Other persistence techniques exist (URL path affinity, SSL session IDs for HTTPS, etc.), but the IP, cookie, and header-based methods are the most common and illustrative. It’s worth noting that any form of session persistence somewhat undermines pure load-balancing – you’re constraining some users to specific servers rather than always picking the least loaded server. Therefore, sticky sessions can lead to uneven load if, say, a “chatty” user gets tied to one server and heavily uses it. Architects sometimes try to avoid session affinity by externalizing session state (so any server can serve any user). But in practice, sticky sessions remain a pragmatic solution in many systems.

Below is a quick reference comparing a few session persistence techniques:

| Persistence Technique | How It Works | Pros | Cons/Trade-offs |
| --- | --- | --- | --- |
| Client IP Affinity | Hash or remember the client’s IP to pick a server. | No client-side config; works for any protocol. | Inaccurate if clients share an IP (NAT/proxies); can imbalance load; IP can change. |
| Cookie-Based | LB sets/reads an HTTP cookie identifying the server. | Precise per-user stickiness; works with browsers; no special app code needed (if the LB inserts the cookie). | HTTP only; requires cookie support; slight overhead; multiple LBs must share the cookie method (in clusters). |
| Custom Header/Token | App or LB uses a custom identifier (header, param) to consistently route. | Flexible – can be used in custom ways or by non-browser clients; works if cookies are disabled. | Requires app cooperation or config; not universally supported; potential complexity in setup. |

Table: Common session persistence methods and their trade-offs.
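
As a rough illustration of the cookie-insert approach, the sketch below shows the decision an LB makes per request: honor an existing affinity cookie if the server it names is still in the pool, otherwise pick a server and emit a Set-Cookie header. The cookie name lb_server and the random server choice are assumptions for illustration, not any vendor’s defaults.

```python
import random

backends = ["app-1", "app-2", "app-3"]
COOKIE_NAME = "lb_server"   # illustrative cookie name, not any vendor's default

def choose_backend(request_cookies):
    """Cookie-insert persistence: reuse an existing affinity cookie if the
    server it names is still in the pool, otherwise pick one and remember it."""
    pinned = request_cookies.get(COOKIE_NAME)
    if pinned in backends:
        return pinned, None                      # reuse: no Set-Cookie needed
    chosen = random.choice(backends)             # stand-in for the LB's algorithm
    set_cookie = f"{COOKIE_NAME}={chosen}; Path=/; HttpOnly"
    return chosen, set_cookie

# First request (no cookie): LB picks a server and tells the browser to remember it.
server, cookie = choose_backend({})
# Later requests from the same browser carry the cookie and land on the same server.
server_again, _ = choose_backend({COOKIE_NAME: server})
assert server_again == server
```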

In an interview setting, it’s good to mention that sticky sessions should be used deliberately. They solve correctness issues (user state), but at the cost of some load distribution efficiency. A strong candidate might note techniques like “back-end session replication” as an alternative, but if the question is on load balancers, explaining the above methods and when to use them (e.g. “use cookie persistence for web apps requiring login sessions”) shows solid understanding.

TLS Termination vs Passthrough (Where HTTPS is Decrypted)

Modern services almost always run over HTTPS for security. A load balancer, therefore, must handle TLS (SSL) connections from clients. There are two main approaches: TLS termination at the load balancer, or TLS passthrough to the backends.

With TLS termination, the load balancer holds the certificate and private key, performs the TLS handshake with the client, and decrypts the traffic. Because it sees the plaintext request, it can apply Layer 7 features – path-based routing, cookie persistence, header inspection, WAF rules – and it offloads CPU-intensive crypto from the application servers. It then forwards plain HTTP to the backends, so the internal leg is unencrypted unless you add protection (see bridging below).

With TLS passthrough, the load balancer never decrypts anything: it simply forwards the encrypted byte stream at Layer 4 to a chosen backend, which holds the certificate and terminates TLS itself. Encryption remains end-to-end between client and server, which matters for strict security or compliance requirements, but the LB can’t see HTTP content, so cookie-based persistence, URL routing, and similar features aren’t available.

There is a middle ground known as TLS bridging (or end-to-end TLS with re-encryption). That’s when the load balancer does terminate the incoming TLS (so it can inspect or modify the traffic), but then initiates a new TLS connection to the backend. In essence, the LB is a TLS client to the server. This way, data is encrypted over both legs (client-LB and LB-server), yet the LB can still do content filtering or routing because it sees the plaintext briefly. The downside here is increased latency (two TLS handshakes in the path) and more complex certificate management (the LB needs its own cert for clients and also needs to trust the backend’s cert or have certs for talking to backends).
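
To illustrate where decryption happens in the termination case, here is a toy Python sketch of an LB listener that terminates TLS and proxies plaintext to a single backend over plain TCP. The certificate paths, port 8443, and the backend address are placeholder assumptions; a passthrough LB would instead forward the encrypted bytes untouched (no wrap_socket) and the certificate would live on the backend.

```python
import socket
import ssl
import threading

LB_CERT, LB_KEY = "lb-cert.pem", "lb-key.pem"   # cert lives on the LB (termination)
BACKEND = ("10.0.0.11", 8080)                   # plaintext leg to the backend

def pipe(src, dst):
    # Copy bytes in one direction until the source closes.
    try:
        while (data := src.recv(4096)):
            dst.sendall(data)
    finally:
        dst.close()

def handle(client_sock):
    # TLS is already terminated here, so client_sock yields plaintext HTTP.
    upstream = socket.create_connection(BACKEND)
    threading.Thread(target=pipe, args=(client_sock, upstream), daemon=True).start()
    pipe(upstream, client_sock)

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(LB_CERT, LB_KEY)

with socket.create_server(("0.0.0.0", 8443)) as listener:
    with ctx.wrap_socket(listener, server_side=True) as tls_listener:
        while True:
            conn, _addr = tls_listener.accept()   # TLS handshake with the client
            threading.Thread(target=handle, args=(conn,), daemon=True).start()
```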

Certificate location: In summary, with termination or bridging, the certificates live on the load balancer (for the client-facing side). With passthrough, the certificates live only on the backend servers. In some architectures (e.g. Kubernetes Ingress controllers or cloud LBs), you might upload your TLS certificate to the load balancer service, which is effectively TLS termination at the LB.

From an interview perspective, key points to highlight are: performance vs security trade-off (termination is efficient and feature-rich, passthrough preserves end-to-end encryption), and knowing that if asked “where do you put the cert?”, the answer depends on this choice (LB for termination, servers for passthrough, or both for bridging). Also mention that many enterprises terminate at the LB for performance and then use a secure network or even re-encrypt to backend if needed – it often depends on security requirements.

Backend Registration, Autoscaling, and Graceful Deregistration

In dynamic environments (like cloud auto-scaling or container orchestrators), the set of backend servers can change frequently. Load balancer architecture must accommodate backend registration and deregistration seamlessly.

Consider an auto-scaling group that adds two new application servers to handle increased traffic. How does the load balancer know about them? Typically, there’s a control-plane mechanism: either the load balancer polls a service registry or is notified via an API call/hook. For example, in cloud platforms, when a new instance launches, it can be automatically registered with the load balancer’s target pool (often through an API call from the scaling system). In container systems like Kubernetes, the equivalent is updating Endpoints/Ingress to include the new pod IPs. In traditional setups, it might be more manual – e.g. an operator runs a script to update the HAProxy config and reload it. No matter the method, a modern LB needs to handle dynamic membership of backends.
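
As a sketch of the control-plane side, the loop below polls a hypothetical registry endpoint (REGISTRY_URL is an assumption, not a real API), diffs the desired set against the current pool, and hands removals to the draining logic described next.

```python
import json
import time
import urllib.request

REGISTRY_URL = "http://registry.internal/v1/backends"   # hypothetical endpoint
current_pool = set()

def register(addr):
    print("registering", addr)        # placeholder: add to the LB's target pool

def start_draining(addr):
    print("draining", addr)           # placeholder: see graceful deregistration below

def sync_backends():
    """Control-plane loop: poll the registry and reconcile the LB's pool with it."""
    global current_pool
    while True:
        with urllib.request.urlopen(REGISTRY_URL, timeout=2) as resp:
            desired = set(json.load(resp))       # e.g. {"10.0.0.11:8080", ...}
        for addr in desired - current_pool:
            register(addr)
        for addr in current_pool - desired:
            start_draining(addr)
        current_pool = desired
        time.sleep(10)
```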

Equally important is graceful deregistration (also known as connection draining). When a server is going down or being removed, we don’t want to drop users in the middle of their session. A well-behaved load balancer will, upon deregistering a server, stop sending it new requests but allow existing connections to finish. It might wait for active sessions to complete or until a timeout passes, before fully removing the node. For example, AWS load balancers support a “draining” mode so that during deploys or scale-in, clients aren’t abruptly cut off. This graceful deregistration prevents disruption, allowing active connections to complete before the instance is terminated. In practice, the LB will check each new connection – if the target is in “drain” state, it won’t pick it, but if an existing connection ID maps there (from the connection table or a cookie), it will still forward until done.
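
A minimal sketch of connection draining, assuming the LB tracks an active-connection count per backend (the addresses, counts, and the 5-minute timeout are illustrative):

```python
import time

# Per-backend state kept by the LB: how many connections are in flight and
# whether the backend is draining (no new connections allowed).
pool = {
    "10.0.0.11": {"active_conns": 3, "draining": False},
    "10.0.0.12": {"active_conns": 0, "draining": False},
}
DRAIN_TIMEOUT = 300   # give existing connections up to 5 minutes to finish

def eligible_for_new_connections():
    return [ip for ip, s in pool.items() if not s["draining"]]

def drain(ip):
    """Graceful deregistration: stop new traffic, wait for existing flows to end."""
    pool[ip]["draining"] = True
    deadline = time.time() + DRAIN_TIMEOUT
    while pool[ip]["active_conns"] > 0 and time.time() < deadline:
        time.sleep(1)                  # existing connections keep being served
    del pool[ip]                       # now safe to remove / terminate the instance
```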

Autoscaling hooks often tie into this process. When an autoscaling event wants to remove a server, it typically informs the LB to drain it first. Only once the LB reports the server has zero or minimal connections will the autoscaler actually terminate it. Conversely, when adding a server, some LBs do a “warm-up” or slow start – e.g. gradually increasing the proportion of traffic to a new node so it isn’t hit with full load immediately (this helps with cold caches or just-started JVM warm-up). This isn’t always automatic, but many load balancers (like NGINX Plus, HAProxy) have a slow-start option for new servers.
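
Slow start can be approximated by ramping a new server’s effective weight over a warm-up window; the 120-second window below is an arbitrary illustrative value, not any product’s default.

```python
import time

SLOW_START_SECONDS = 120   # illustrative warm-up window for a newly added server

def effective_weight(base_weight, added_at, now=None):
    """Ramp a new backend's weight linearly from ~0 to its full value so it
    isn't hit with full load the moment it joins the pool."""
    now = now or time.time()
    age = now - added_at
    if age >= SLOW_START_SECONDS:
        return base_weight
    return max(1, int(base_weight * age / SLOW_START_SECONDS))

# A server added 30s ago with weight 100 receives roughly a quarter of its share:
print(effective_weight(100, added_at=time.time() - 30))   # -> 25
```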

From a control-plane perspective, backend registration can be manual (editing a config, etc.) or automated (service discovery, cloud APIs). A strong design decouples this from the data-plane: you can add/remove servers via control-plane actions without interrupting the packet flow. In an interview, you might mention tools/protocols like DNS-based service discovery or APIs (AWS, etc.) or even health-check based auto-registration (where the LB scans a range and finds new healthy instances). The key concept is that load balancers are not static; they work hand-in-hand with scaling processes.

One more aspect: what happens when a previously failed server recovers? The LB’s health checks will eventually mark it healthy (after it passes the fail-in threshold of successes). At that point, the control-plane should reintroduce it to the rotation. Some load balancers immediately start sending it traffic; others might wait a bit or send a lower rate initially (again, to avoid an unstable server flapping). Graceful handling on recovery is as important as on removal.

In short, a robust LB architecture supports dynamic backends and ensures the transition of servers in or out is smooth. This guarantees that deployments, scaling events, or instance failures are transparent to users. You can deploy a new version of your app server one instance at a time: deregister, update, re-register – and the LB makes sure users barely notice.

High-Availability Topologies (Redundant LBs via VRRP, BGP, or Cloud Control Plane)

So far we’ve discussed how LBs deal with backend servers, but what about the load balancer itself? It becomes the new critical component: if the LB dies, the service is unavailable. Hence, load balancers are almost always deployed in a redundant, high-availability topology.

The classic approach in traditional environments is to have a pair of load balancer machines (or appliances) in an active/standby configuration, sharing one Virtual IP (VIP) address. The active node owns the VIP and handles traffic; if it fails, the standby node takes over the VIP and resumes service. A common protocol to coordinate this is VRRP (Virtual Router Redundancy Protocol). VRRP lets a set of machines on the same network elect a master that holds the virtual IP, while others monitor. If the master doesn’t send a heartbeat, a backup takes over the IP. This failover typically happens within a second or two, so there’s minimal downtime. Many software LBs (like Keepalived+HAProxy setups) use VRRP for HA. The key advantage is it’s simple and at Layer 2: to clients, the VIP is a single IP; which box answers is managed by VRRP, and the switchover is transparent beyond a brief hiccup.
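
Real deployments use VRRP (e.g. via keepalived) rather than hand-rolled code, but the core failover idea can be sketched as: the standby watches for the master’s periodic advertisements and claims the VIP if it misses several in a row. Everything below (the intervals, the claim_vip placeholder, the example VIP) is a simplified illustration, not the VRRP protocol itself.

```python
import time

HEARTBEAT_INTERVAL = 1.0     # master advertises roughly every second
DEAD_AFTER = 3.0             # standby takes over after ~3 missed heartbeats

last_heartbeat_seen = time.time()
i_own_the_vip = False

def on_heartbeat_from_master():
    """Called whenever an advertisement from the current master arrives."""
    global last_heartbeat_seen
    last_heartbeat_seen = time.time()

def claim_vip():
    # Placeholder: in a real deployment VRRP/keepalived assigns the VIP to the
    # local interface and announces the move with a gratuitous ARP.
    print("taking over VIP 203.0.113.1")

def standby_loop():
    """Standby node: if the master goes quiet, claim the virtual IP."""
    global i_own_the_vip
    while True:
        if not i_own_the_vip and time.time() - last_heartbeat_seen > DEAD_AFTER:
            claim_vip()
            i_own_the_vip = True
        time.sleep(HEARTBEAT_INTERVAL)
```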

Another topology uses dynamic routing with BGP. In a BGP-based HA setup, each load balancer node advertises the service IP (VIP) into the network via BGP (Border Gateway Protocol) to upstream routers. If one node goes down, it withdraws its route, and the other node’s route is used, effectively shifting the traffic. BGP can even allow active-active load balancers: e.g. two nodes both advertise the IP with the same metrics, and traffic is split between them using equal-cost multipath (ECMP) routing. BGP isn’t limited by the nodes being on the same subnet or broadcast domain the way VRRP is. Big online services often use BGP anycast – advertising the same IP from multiple data centers and routing clients to the nearest one. In the context of a single cluster, BGP-based HA can be more complex to set up, but it’s powerful for scaling out LBs or avoiding a single point of failure at the network level.

In cloud environments, you usually don’t see VRRP explicitly. Cloud load balancers are typically provided as a service where high-availability is handled behind the scenes by the cloud provider’s control plane. For example, AWS Network Load Balancer or Application Load Balancer isn’t one machine – it’s a cluster of machines in different Availability Zones, but Amazon gives you one stable IP or DNS name. If one goes down, the cloud control plane automatically shifts traffic to others (often using mechanisms like anycast or AWS’s internal networking). Similarly, Google Cloud’s load balancers use a global anycast IP that is served from multiple Google frontend locations simultaneously – if one instance fails, others continue taking traffic, and clients aren’t aware because the IP doesn’t change. Essentially, cloud LBs have distributed architecture: you get the benefit of VRRP/BGP-style failover without having to configure it yourself. The trade-off is you rely on the provider’s implementation.

It’s worth mentioning that on-premises, some high-end LBs support clustering beyond just active/passive. For instance, certain F5 or Citrix NetScaler deployments can run an active/active cluster behind a virtual IP, dividing traffic between nodes (though typically one node remains primary for any given connection, to avoid confusion). But the most common pattern is still two nodes with a failover.

Here’s a simplified diagram of a typical active/passive HA pair of load balancers sharing a VIP and monitoring backend health:

                    Clients ↣ (VIP: 203.0.113.1) ↢ Clients
                           |                 (shared virtual IP)
                 ...............................
                 :             VRRP/BGP         :   (Heartbeat sync)
      [Load Balancer A]  ◀─────heartbeat────▶  [Load Balancer B]
      (Active - owns VIP)                       (Standby)
            │ \                                      / │
            │  \                                    /  │
      (Health checks)                        (Health checks)
            │    \                                /    │
            v     v                              v     v
         [App Server 1]  🗸                   [App Server 1]  🗸
         [App Server 2]  🗸    ... pool ...   [App Server 2]  🗸
         [App Server 3]  🗷                   [App Server 3]  🗷
         (🗸=healthy, 🗷=unhealthy)         (Both LBs check health and share state)

Diagram: Two load balancer nodes (A and B) share a virtual IP. A is active serving clients; B is on standby. They exchange heartbeats (via VRRP or similar). Both perform health checks on the backend server pool. If A fails, B will take over the VIP and continue service.

In this diagram, note that both LB nodes independently monitor the backends. The standby isn’t sitting completely idle – it knows which servers are healthy too, and may even be mirroring some state. This way, if a failover happens, the new active LB already has up-to-date health info and (if state is shared) even existing connection mappings if possible. Some HA setups also share persistence tables or connection state to make failover seamless (this is more challenging, and often a brief interruption is accepted instead).

Summary: High availability for load balancers can be achieved via network-layer redundancy (VRRP for L2 failover, BGP for L3 routing failover), or by the platform’s magic in cloud. It ensures that the “front door” of your service isn’t a single point of failure. In an interview, emphasizing that you’d deploy at least two LBs with a failover scheme demonstrates you understand real-world reliability.

Control Plane vs Data Plane Separation

A key design principle in load balancers (and networking devices in general) is the separation of the control plane and data plane. In simple terms, the data plane is the part of the LB that actually handles the traffic – inspecting packets/requests and forwarding them to the chosen backend. It needs to be fast, streamlined, and reliable, because it’s in the hot path of every user’s connection. The control plane is the “brain” that runs in the background: it’s responsible for configuration, decision-making, and management tasks – but not every packet goes through it.

In a load balancer, the control plane encompasses things like: monitoring health checks and deciding when to add/remove a server from rotation, processing admin commands or API calls (such as an operator adding a new backend or changing a rule), and computing the load-balancing decisions or persistence mappings (which then get programmed into the data plane as rules). The data plane, by contrast, is the packet-forwarding engine: it takes each incoming connection or request, looks up the appropriate backend (based on the rules/tables it got from the control plane), and dispatches the traffic.

Why separate them? Because they have different performance and reliability requirements. The data plane should be as simple and bullet-proof as possible – often implemented in lower-level code, kernel space, or even hardware (in case of physical ADCs), to push gigabits of traffic with minimal latency. The control plane can be more flexible and complex (maybe written in higher-level code, with richer logic), and it can run a bit slower since it’s not handling every packet, just occasional decisions. By separating them, a heavy operation on the control plane (say recomputing a routing table or doing a slow health check) does not directly slow down user traffic – the data plane keeps running with the last known good state. Likewise, if the control plane process crashes or is upgrading, the data plane can often continue forwarding existing connections unaffected (at least for a while). This separation improves both performance and availability of the load balancer.
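
One common way to realize this separation in software is to have the control plane publish immutable snapshots that the data plane reads without locks. The sketch below, with a placeholder health-check function, illustrates the pattern – not any specific product’s internals.

```python
import threading
import time
import random

# Data plane reads a snapshot that the control plane swaps atomically.
# In CPython, rebinding a name is atomic, so the hot path never takes a lock.
routing_table = {"backends": ["10.0.0.11", "10.0.0.12"]}   # last known good state

def data_plane_pick():
    """Hot path: called per connection, only reads the current snapshot."""
    table = routing_table                 # grab the current snapshot
    return random.choice(table["backends"])

def run_health_checks_somehow():
    # Placeholder: a real control plane would probe targets and apply thresholds.
    return ["10.0.0.11", "10.0.0.12"]

def control_plane_loop():
    """Slow path: recompute membership (health checks, config changes) and
    publish a whole new table; in-flight lookups keep using the old one."""
    global routing_table
    while True:
        healthy = run_health_checks_somehow()
        routing_table = {"backends": healthy}      # atomic swap of the snapshot
        time.sleep(5)

threading.Thread(target=control_plane_loop, daemon=True).start()
print(data_plane_pick())
```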

For example, consider an AWS Application Load Balancer: the data plane consists of many distributed “load balancer nodes” that actually handle traffic in/out. The control plane is a centralized service (in AWS’s region control infrastructure) that monitors targets, scaling, health, and updates the routing tables on those nodes. If a target goes unhealthy, the control plane updates the data plane’s configuration to stop sending to it. If you, as a user, reconfigure the LB (add a new listener or target), you’re interacting with the control plane – it then orchestrates the necessary data plane changes.

In software like HAProxy, this separation is less obvious because it’s all in one process, but conceptually HAProxy’s runtime has a data plane (the packet processing threads) and a control plane (the management layer that can take new configurations or do health checks, etc.). Modern designs like HAProxy Data Plane API and others explicitly name it this way. Another example is Kubernetes: the Ingress controller might have a control component deciding how to map URLs to services, and the data plane is the proxy (Envoy/NGINX) actually moving the bytes.

Understanding this concept is important because it explains design choices: e.g., why you might run separate threads or even separate instances for health checking versus actual proxying. It also comes up in SDN (Software Defined Networking) discussions (with controllers vs switches). In summary, control plane = the decision-maker, data plane = the work-horse. A great system keeps the data plane lean and mean, and the control plane smart and resilient. Interviewers might not ask explicitly about this, but if you bring up “control-plane vs data-plane” when talking about load balancer internals, it shows a deeper insight into system design. For instance, “In designing a highly available load balancer, I’d separate control and data planes – so health check logic or config changes don’t interfere with high-speed packet forwarding. This also means I could run multiple control-plane instances for redundancy without duplicating the data path, etc.” Such points paint you as someone who understands the inner workings, not just the surface.


In conclusion, the core architecture of load balancers involves a combination of clever techniques and principles: health checks (active/passive) to only send users to healthy servers, flow hashing to keep connections intact, session persistence when required for application state, judicious handling of TLS for security and performance, dynamic registration and draining of backends for elasticity, redundant topologies to avoid single points of failure, and a clean control/data plane separation for reliability and scalability. These building blocks work in concert to deliver the seamless experience users expect – even under heavy load or component failures, a good load balancer makes the distributed system behave like a single robust service. Understanding these will not only help in SDE interviews but also in designing and debugging real-world distributed systems.
