SerialReads

Traffic Distribution & Data-Path Mechanics in Load Balancers

May 28, 2025

TL;DR: Modern load balancers distribute client requests across servers using various algorithms (round-robin, least-load, hashing) to maximize performance and reliability. They can forward traffic via network address translation (NAT), direct server return (DSR), or tunneling, each affecting data paths differently. Design choices like stateless versus stateful load balancing, connection reuse (keep-alives, HTTP/2 multiplexing, QUIC), session stickiness vs. stateless tokens, and performance tuning (buffer sizes, backlog, slow-start) all trade off simplicity, efficiency, and complexity in a system’s architecture.

Core Traffic Distribution Algorithms

Choosing how to assign incoming requests to backend servers is a fundamental role of load balancers. Static algorithms like Round Robin (RR) simply cycle through servers sequentially, which is trivial to implement and automatically spreads load evenly if all requests and servers are equal. The downside is that static algorithms ignore differences in server capacity or momentary load. For example, naive round-robin “is completely oblivious” to one server being slower or overloaded. Weighted Round Robin (WRR) refines this by giving more powerful servers a higher weight (appearing more often in the rotation). This handles known capacity differences, but weights are typically fixed and WRR remains non-adaptive at runtime – it won’t notice a server that’s bogged down or failing unless you adjust weights or remove it.
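
To make the rotation concrete, here is a minimal weighted round-robin sketch in Python; the backend names and weights are illustrative, not from any particular product, and the expansion is the naive form (production WRR interleaves more smoothly).

import itertools

# Each server appears in the rotation in proportion to its weight.
servers = {"app-1": 3, "app-2": 1}     # hypothetical backends and weights
rotation = [name for name, weight in servers.items() for _ in range(weight)]
cycle = itertools.cycle(rotation)

def pick_backend():
    """Return the next backend in the weighted rotation."""
    return next(cycle)

# app-1 serves roughly 3 of every 4 requests, app-2 the remainder.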

Dynamic algorithms measure load in real time. Least Connections (or Least Load) directs each new session to the server with the fewest active connections (or lowest current load). Similarly, a Least Response Time policy might send traffic to the server responding fastest on average. These approaches adapt to fluctuations: if one server is under heavy load, it stops getting new requests until its load drops. This can dramatically improve efficiency when request workloads are uneven (e.g. some queries are expensive while others are trivial). The trade-off is complexity – the balancer must continuously poll servers or track metrics to know their load. Polling too infrequently undermines the algorithm’s effectiveness, while polling too often adds overhead.
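
As an illustration, a least-connections pick reduces to taking the minimum over tracked connection counts; the counts below are hypothetical and would come from the balancer’s own bookkeeping.

# Hypothetical live connection counts the balancer maintains per backend.
active_connections = {"app-1": 12, "app-2": 7, "app-3": 31}

def pick_least_connections():
    """Choose the backend currently holding the fewest active connections."""
    return min(active_connections, key=active_connections.get)

backend = pick_least_connections()   # "app-2" with the counts above
active_connections[backend] += 1     # account for the new session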

Another clever strategy is the Power of Two Choices (P2C). Instead of querying every server’s status, the balancer picks two servers at random and compares their load, assigning the request to the less busy of the pair. This has been shown to be nearly as good as a full O(N) scan of all servers, but with much less overhead. P2C balancing avoids overloading any one server while preventing herd behavior (i.e. not everyone picks the same “least loaded” server). Many modern L4/L7 balancers implement P2C by default because it combines the benefits of randomness and load-awareness efficiently.
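
A minimal power-of-two-choices sketch, again with hypothetical load numbers standing in for whatever metric the balancer actually tracks:

import random

# Hypothetical load metric per backend (active requests, queue depth, CPU, ...).
load = {"app-1": 0.62, "app-2": 0.35, "app-3": 0.48, "app-4": 0.91}

def pick_p2c():
    """Sample two distinct backends at random and keep the less loaded one."""
    a, b = random.sample(list(load), 2)
    return a if load[a] <= load[b] else b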

For certain use cases, hash-based algorithms ensure the same client or session is routed consistently to the same backend. A common approach is consistent hashing (e.g. ring hash or Ketama hashing): each server is assigned positions on a hash ring, and each request (or session key) is hashed to a point on that ring, served by the next server clockwise. The appeal is stability: when servers are added/removed, the hashing only remaps a small fraction of clients to new targets – e.g. “adding or removing one host from N affects only ~1/N of requests”. Google’s Maglev algorithm refines this with a precomputed lookup table for consistent hashing, achieving both even distribution and minimal disruption on changes. Hash-based schemes are great for session affinity (like ensuring a user’s session or a cache key stays on one server) and for multi-LB redundancy, but a pure hash can lead to imbalance if traffic isn’t uniform. (Maglev addresses this by carefully populating its hash table to evenly spread load.) Generally, consistent hashing trades a bit of load balance optimality for session stickiness and fault tolerance.
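
The ring itself is simple to sketch. This version hashes each (hypothetical) server name onto a ring of integers and walks clockwise to find the owner of a key; real implementations such as Ketama likewise place many virtual nodes per server to smooth out the balance.

import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Place vnodes points per server on the ring for a more even spread.
        self._points = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def lookup(self, key: str) -> str:
        """Return the server owning this key (first point clockwise)."""
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._points)
        return self._points[idx][1]

ring = HashRing(["app-1", "app-2", "app-3"])
server = ring.lookup("session:42")   # stable as long as the server set is stable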

Below is a comparison of popular load balancing algorithms and their pros/cons:

Algorithm | Pros | Cons
Round Robin | Very simple; evenly cycles through servers. | Ignores server load or capacity differences.
Weighted Round Robin | Accounts for static capacity differences via weights. | Still static (weights fixed); can’t adapt to runtime load.
Least Connections | Dynamic load-responsive distribution; avoids sending traffic to busy servers. | Requires tracking server load (polling/metrics); assumes connection count correlates to work.
Least Response Time | Sends traffic to the fastest-responding server; adapts to actual performance. | Needs continuous latency measurements; can be skewed by outliers or caching effects.
Power of Two Choices | Near-optimal load spread with minimal overhead (checks only 2 servers). | Random selection can occasionally choose two slow servers; not as precise as a full load scan.
Consistent Hashing | Keeps a client/session on the same server; minimal disruption if servers change. | Possibly uneven load if some hashes get more traffic; primarily useful when affinity > perfect balance.

Forwarding Modes: NAT vs DSR vs Tunneling

How a load balancer forwards traffic to backends impacts network behavior and performance. Common modes include NAT, Direct Server Return (DSR), and IP tunneling. In NAT mode the balancer rewrites destination (and often source) addresses and sits in both directions of the flow. In DSR mode it forwards the packet largely untouched (typically rewriting only the layer-2 destination) to a server that also holds the VIP on a loopback interface, so the server answers the client directly. IP tunneling is similar in spirit, but the balancer encapsulates each packet (e.g. IP-in-IP) so it can reach servers on other subnets, which decapsulate and likewise reply directly:

Diagram: NAT vs DSR Packet Flow

NAT Mode:
Client --> [LB: VIP] --> Server   (LB rewrites dest IP to server’s IP)
Client <-- [LB: VIP] <-- Server   (LB rewrites source IP from server’s IP back to VIP)

DSR Mode:
Client --> [LB: VIP] --> Server   (dest IP remains VIP; server has VIP on loopback)
Client <-- Server (direct response to Client, source IP = VIP)

In NAT mode, the load balancer is in the path for the full round-trip, performing address translations on both request and response. In DSR or tunnel mode, the LB is only in the request path, handing off packets and letting servers reply directly. This reduces latency and LB load significantly, at the expense of a more constrained network setup.
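
The address handling can be sketched as plain data manipulation. The Packet type and field values below are purely illustrative, not a real packet-processing API:

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:                        # illustrative packet model
    src_ip: str
    dst_ip: str
    dst_mac: str

def nat_forward(pkt: Packet, server_ip: str, server_mac: str) -> Packet:
    # NAT: rewrite the destination IP to the chosen server; the reply must pass
    # back through the LB so the source IP can be rewritten to the VIP again.
    return replace(pkt, dst_ip=server_ip, dst_mac=server_mac)

def dsr_forward(pkt: Packet, server_mac: str) -> Packet:
    # DSR: only the layer-2 destination changes; the server holds the VIP on a
    # loopback interface and answers the client directly with source IP = VIP.
    return replace(pkt, dst_mac=server_mac)

pkt = Packet(src_ip="198.51.100.7", dst_ip="203.0.113.10", dst_mac="lb-mac")
nat_forward(pkt, "10.0.0.21", "server-mac")
dsr_forward(pkt, "server-mac")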

Stateless ECMP vs Stateful Designs

Load balancers come in stateless and stateful flavors. A stateless design means the LB doesn’t store per-connection information – it makes a routing decision for each packet or new flow using a deterministic hash or algorithm, and relies on that consistency to keep packets in the same flow going to the same server. For example, routers in front of a cluster of L4 load balancers might use Equal-Cost Multi-Path (ECMP) routing, hashing each packet’s 5-tuple to pick a downstream LB or server. This is lightning-fast and scales horizontally: you can add more LBs and ECMP will spread flows to them evenly. The big advantage is redundancy and scale – if one stateless LB instance goes down, new packets will hash to others with no single coordinator needed. Cloudflare, for instance, moved to stateless Maglev-based L4 load balancers so that “it doesn't matter which load balancer the router selects…they'll end up reaching the same backend server” for a given client flow. In other words, with a consistent hashing scheme, any LB can handle the packet and still route it to the correct backend, eliminating the need for session affinity to a specific LB node.
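
A stateless decision of this kind is just a deterministic hash of the flow’s 5-tuple. The sketch below uses a stable hash so that every balancer instance running the same code maps a given flow to the same backend; the backend names are hypothetical, and real deployments pair this with consistent hashing so pool changes disturb few flows.

import hashlib

backends = ["app-1", "app-2", "app-3", "app-4"]     # hypothetical pool

def pick_by_flow(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Hash the 5-tuple so every packet of a flow lands on the same backend."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return backends[digest % len(backends)]

# Any LB instance computing this hash picks the same backend for this flow.
pick_by_flow("198.51.100.7", 51234, "203.0.113.10", 443)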

However, stateless load balancing makes it harder to do complex features. A stateful or full-proxy load balancer tracks sessions and often terminates connections, giving it the ability to buffer and inspect traffic, do advanced routing (e.g. content-based switching), and handle features like SSL offloading. The trade-off is that if a client’s next packet goes to a different LB (due to route changes or failover) the new LB won’t have the required state and the connection could break. Techniques like Google’s Maglev combine consistent hashing with per-machine connection tracking to mitigate this: even if routing “flaps” and a packet arrives at a new LB, the hash still points it at the same backend, so the flow survives without shared state. Generally, stateful designs have higher overhead (memory and CPU to maintain sessions) and often require pairing or clustering for high availability (so that a failover LB can assume the session state). Stateless designs are simpler and extremely scalable (often just built on fast hashing in kernel or network hardware), but they can’t provide application-layer awareness or modifications – they usually operate at L3/L4 with basic algorithms. Many modern systems blend these approaches: e.g. using stateless L4 load balancing (ECMP or hashing) to distribute traffic to a fleet of stateful L7 proxies. This combination yields both scalability and rich features.

Connection Reuse and Protocol Multiplexing

Another major factor in load balancer efficiency is how it handles connections to clients and servers. Reusing and multiplexing connections can significantly improve performance. For example, establishing a fresh TCP connection (and especially a TLS handshake) for every single HTTP request is expensive. With HTTP keep-alive (persistent connections), a client can send multiple requests over one TCP connection, and similarly, a full-proxy LB can maintain a pool of open connections to each backend server to recycle them for many requests. This amortizes handshake overhead and can dramatically increase throughput – one case saw a 50% increase in max requests handled by enabling keep-alive reuse, along with lower CPU and latency. The load balancer, acting as an L7 proxy, might accept thousands of client connections but funnel requests through a handful of long-lived connections to each server, reducing the churn of setup/teardown.
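
A sketch of the idea with Python’s standard http.client: one persistent connection carries several requests, which is the same pattern an L7 proxy repeats with a pool of upstream connections (the hostname and paths are placeholders).

import http.client

# One keep-alive TCP connection reused for several requests.
conn = http.client.HTTPConnection("backend.internal", 8080, timeout=5)  # placeholder host
try:
    for path in ("/health", "/api/items", "/api/items/1"):
        conn.request("GET", path, headers={"Connection": "keep-alive"})
        resp = conn.getresponse()
        body = resp.read()    # drain the response fully before reusing the socket
        print(path, resp.status, len(body))
finally:
    conn.close()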

Modern protocols take this further. HTTP/2 (and HTTP/3/QUIC) introduce multiplexing: multiple independent streams of requests/responses share a single underlying connection. An L7 load balancer using HTTP/2 can concurrently route many client requests over one TCP (or QUIC) connection to each server. This not only cuts down on connection overhead but also avoids the head-of-line blocking that plagued HTTP/1.1. The LB must decide whether to speak HTTP/2 to the backends as well; many do, multiplexing requests onto a few long-lived upstream connections when the servers support it. With HTTP/2, it’s common to see huge reductions in backend connections; e.g. instead of 100 parallel TCP connections, the LB might use 4 persistent ones to carry those streams. Fewer connections = lower syscall and context-switch overhead, and better utilization of each connection’s congestion window.
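
From the client side, the same multiplexing is visible with the third-party httpx library (assumed here; HTTP/2 support requires its optional extra, pip install "httpx[http2]") against a placeholder origin:

import httpx   # third-party; HTTP/2 needs the optional "http2" extra

# With http2=True, requests to the same origin share one TCP+TLS connection;
# an async client would carry concurrent requests as parallel streams on it.
with httpx.Client(http2=True, base_url="https://backend.internal") as client:  # placeholder origin
    responses = [client.get(f"/api/items/{i}") for i in range(10)]
    print({r.http_version for r in responses})   # typically {"HTTP/2"}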

QUIC (HTTP/3) poses new considerations: QUIC runs over UDP and encrypts even the transport handshake, so load balancers can’t rely on 5-tuple hashing alone when clients roam or NAT rebinding changes the source address and port. QUIC instead carries a Connection ID that a savvy load balancer can use to route all packets of a flow to the same server even when the client’s IP/port changes. Some load balancers operate in a QUIC-aware mode, parsing the initial packet to extract that ID for consistent hashing. QUIC also inherently multiplexes streams within one connection like HTTP/2, and has built-in keep-alive and faster handshakes. A full-proxy load balancer might terminate QUIC from clients (for example, to do content switching or because servers don’t speak QUIC) – in which case it should maintain long-lived QUIC connections to clients and reuse HTTP/2 or TCP connections to servers as needed. The key theme is maximizing reuse: whether via keep-alive, pipelining, multiplexing, or connection pooling, fewer connections means less overhead and often higher throughput.

One must be careful: enabling very long-lived connections can tie up resources if not managed (e.g. file descriptors, or if a server has limits). Load balancers typically have settings for connection idle timeouts, max reuse counts, etc. Tuning those (and ensuring servers have keep-alive enabled) is important so that reuse helps rather than hurts. But when done right, persistent connections and multiplexing are a win-win: more requests per connection means less CPU and faster response for users.

Session State Management and Stickiness

In an ideal world, any request can go to any server without issue – the application tier would be completely stateless such that it doesn’t matter which backend handles each request. In practice, many applications maintain session state (user login info, shopping carts, personalized data) that traditionally was stored in server memory or cache. Load balancers can provide session stickiness (affinity) to ensure all requests from a given user/session ID go to the same backend server, to avoid having to share that state. This can be done by cookies or IP affinity: e.g. the LB issues a cookie on first response identifying the chosen server, so subsequent requests from that user bypass the normal algorithm and go to that same server.
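
A cookie-based affinity flow can be sketched in a few lines; the cookie name and backend IDs are made up for illustration, and a real LB would also sign or encrypt the cookie value.

import random

backends = ["app-1", "app-2", "app-3"]         # hypothetical pool
STICKY_COOKIE = "lb_affinity"                  # illustrative cookie name

def route(request_cookies: dict) -> tuple[str, dict]:
    """Return (backend, cookies_to_set); reuse the pinned backend if its cookie is valid."""
    pinned = request_cookies.get(STICKY_COOKIE)
    if pinned in backends:                     # still in the pool / still healthy
        return pinned, {}
    chosen = random.choice(backends)           # or round-robin / least-connections
    return chosen, {STICKY_COOKIE: chosen}     # LB sets the cookie on the response

backend, set_cookies = route({})               # first request: cookie gets issued
backend, _ = route({STICKY_COOKIE: backend})   # later requests stick to that backend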

Sticky sessions make life easy for developers in the short term – you can use in-memory session objects and RAM caches without complex distributed storage. It also saves servers from having to constantly exchange session info or hit a database on each request. And it can improve cache hit rates (e.g. a user’s second request finds data warmed in the memory of the same server). However, stickiness comes at a cost. It “makes it more difficult to keep servers in balance” – some servers might get overloaded if many active users happen to stick there, while others are underutilized. If a sticky server goes down mid-session, the user’s session might be lost or need reauthentication, since their state wasn’t on the other machines. In the worst case, a surge of traffic could all stick to one server (if the LB’s first few assignments all happened to hit the same instance) and defeat the purpose of load balancing.

Because of these downsides, modern scalable architectures strive to externalize or avoid session state. Stateless tokens like JWT (JSON Web Tokens) let the client carry the session data (or a key to it) so that any server can serve any request – “being non-sticky is preferred under load” in this model. Alternatively, an application can store session data in an external store (database or distributed cache) that all servers can query. This removes the need for the LB to pin a user to one server. In fact, many cloud API services and microservices avoid sticky sessions entirely; they treat each request independently or use client-side state. The Stack Overflow community notes that sticky sessions are “best avoided these days” for API servers, precisely because they hinder scaling. There are cases, though, where enabling LB affinity makes sense – for example, a short-lived session during a multi-step transaction where sharing state externally would be too slow or complex, or for legacy applications that are expensive to rewrite for external state. In those cases, you might enable stickiness but limit its duration (many LBs let you expire the affinity after N minutes of inactivity or always reset on new login, etc.). It’s important to document those choices, because sticky sessions couple your load balancing layer with application state, which can complicate failover scenarios.
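
As one way to externalize session identity, a signed token lets any backend validate the user without shared memory. This sketch assumes the third-party PyJWT package and an illustrative shared secret:

import time
import jwt          # PyJWT (pip install PyJWT); assumed here for illustration

SECRET = "rotate-me"                           # illustrative shared signing key

def issue_token(user_id: str) -> str:
    claims = {"sub": user_id, "exp": int(time.time()) + 3600}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def authenticate(token: str) -> str:
    # Any backend behind the LB can verify the token; no sticky session needed.
    return jwt.decode(token, SECRET, algorithms=["HS256"])["sub"]

token = issue_token("user-42")
assert authenticate(token) == "user-42"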

A happy middle-ground in many systems is to use application-layer cookies or tokens (like a signed JWT or session cookie) that any server can read to get the user context, combined with an external cache for heavier state. That way, the LB can remain round-robin or least-load (no affinity needed), improving resilience and distribution. In short, use stickiness only when necessary – the trend is toward statelessness at web scale. When you do need persistence, LBs offer flexible mechanisms (cookie-based, IP-based, etc.), but plan for the impact on load distribution and fault tolerance.

Performance Tuning Knobs

To get the best performance from a load balancer (and the entire system behind it), engineers have several “knobs” to tweak. These become especially important at high scale:

  - Socket buffer sizes – send/receive buffers on the LB and backends must be large enough to keep high-bandwidth or high-latency connections full, without wasting memory on idle ones.
  - Listen backlog and accept queues – the queue of not-yet-accepted connections has to absorb connection bursts, or clients see resets and retries during spikes.
  - Connection limits and timeouts – idle timeouts, maximum requests per connection, and per-backend connection caps keep long-lived reuse from exhausting file descriptors or starving other clients.
  - Slow-start for new or recovering backends – ramping a server’s share of traffic up gradually prevents a cold instance (empty caches, warm-up) from being overwhelmed the moment it joins the pool.
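
For the buffer and backlog knobs specifically, the same parameters appear directly in the sockets API; the numbers below are placeholders to tune per workload, not recommendations.

import socket

BACKLOG = 4096        # placeholder: accept-queue depth for connection bursts
RCVBUF = 1 << 20      # placeholder: 1 MiB receive buffer
SNDBUF = 1 << 20      # placeholder: 1 MiB send buffer

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, RCVBUF)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SNDBUF)
srv.bind(("0.0.0.0", 8080))
srv.listen(BACKLOG)   # on Linux the kernel may cap this at net.core.somaxconn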

In summary, load balancers are the “traffic cops” of scalable systems, and like any good traffic system, they require smart algorithms and well-tuned mechanics. By understanding core distribution algorithms, choosing the right forwarding mode, and carefully managing state, connections, and performance parameters, you can design a load balancing solution that is efficient, robust, and suited to your application’s needs.

Further Reading

  1. Load Balancing Algorithms Explained Visually – Quastor
  2. High Availability Load Balancers with Maglev – Cloudflare
  3. Maglev: A Fast and Reliable Software Network Load Balancer – Google Research
  4. Sticky Sessions (Session Persistence) – Imperva
  5. Stop Wasting Connections, Use HTTP Keep-Alive – Lob Engineering

system-design