Traffic Distribution & Data-Path Mechanics in Load Balancers
May 28, 2025
TL;DR: Modern load balancers distribute client requests across servers using various algorithms (round-robin, least-load, hashing) to maximize performance and reliability. They can forward traffic via network address translation (NAT), direct server return (DSR), or tunneling, each affecting data paths differently. Design choices like stateless versus stateful load balancing, connection reuse (keep-alives, HTTP/2 multiplexing, QUIC), session stickiness vs. stateless tokens, and performance tuning (buffer sizes, backlog, slow-start) all trade off simplicity, efficiency, and complexity in a system’s architecture.
Core Traffic Distribution Algorithms
Choosing how to assign incoming requests to backend servers is a fundamental role of load balancers. Static algorithms like Round Robin (RR) simply cycle through servers sequentially, which is trivial to implement and automatically spreads load evenly if all requests and servers are equal. The downside is that static algorithms ignore differences in server capacity or momentary load. For example, naive round-robin “is completely oblivious” to one server being slower or overloaded. Weighted Round Robin (WRR) refines this by giving more powerful servers a higher weight (appearing more often in the rotation). This handles known capacity differences, but weights are typically fixed and WRR remains non-adaptive at runtime – it won’t notice a server that’s bogged down or failing unless you adjust weights or remove it.
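To make the weighting mechanics concrete, here is a minimal sketch of the naive “expanded rotation” form of weighted round-robin; the pool names and weights are purely illustrative, and production balancers typically use a smoother interleaved variant:

```python
import itertools

def weighted_round_robin(servers):
    """Yield backends in proportion to their static weights.

    `servers` maps backend name -> integer weight; a weight of 2 means the
    backend appears twice per rotation. This is the naive expansion form,
    not the smooth/interleaved variant some proxies use.
    """
    expanded = [name for name, weight in servers.items() for _ in range(weight)]
    return itertools.cycle(expanded)

# Hypothetical pool: "big" has twice the capacity of the others.
pool = weighted_round_robin({"big": 2, "small-1": 1, "small-2": 1})
for _ in range(8):
    print(next(pool))  # big, big, small-1, small-2, big, big, small-1, small-2
```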
Dynamic algorithms measure load in real time. Least Connections (or Least Load) directs each new session to the server with the fewest active connections (or lowest current load). Similarly, a Least Response Time policy might send traffic to the server responding fastest on average. These approaches adapt to fluctuations: if one server is under heavy load, it stops getting new requests until its load drops. This can dramatically improve efficiency when request workloads are uneven (e.g. some queries are expensive while others are trivial). The trade-off is complexity – the balancer must continuously poll servers or track metrics to know their load. Polling too infrequently undermines the algorithm’s effectiveness, while polling too often adds overhead.
Another clever strategy is the Power of Two Choices (P2C). Instead of querying every server’s status, the balancer picks two servers at random and compares their load, assigning the request to the less busy of the pair. This has been shown to be nearly as effective as an O(N) scan of all servers, but with much less overhead. P2C balancing avoids overloading servers while preventing herd behavior (i.e. not everyone picks the same “least loaded” server). Many modern L4/L7 balancers implement P2C by default because it combines the benefits of randomness and load-awareness efficiently.
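The mechanism fits in a few lines. The sketch below assumes each backend exposes an active-connection count as its load signal (a simplification – real balancers may use outstanding requests, EWMA latency, or similar):

```python
import random
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    active_connections: int = 0  # hypothetical load metric

def pick_power_of_two(backends):
    """Power-of-two-choices: sample two backends at random and take the
    less loaded one, avoiding both a full O(N) scan and herd behavior."""
    a, b = random.sample(backends, 2)
    return a if a.active_connections <= b.active_connections else b

pool = [Backend("s1", 12), Backend("s2", 3), Backend("s3", 7)]
choice = pick_power_of_two(pool)
choice.active_connections += 1  # the new request now counts toward its load
```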
For certain use cases, hash-based algorithms ensure the same client or session is routed consistently to the same backend. A common approach is consistent hashing (e.g. ring hash or Ketama hashing): each server is assigned positions on a hash ring, and each request (or session key) is hashed to a point on that ring, served by the next server clockwise. The appeal is stability: when servers are added/removed, the hashing only remaps a small fraction of clients to new targets – e.g. “adding or removing one host from N affects only ~1/N of requests”. Google’s Maglev algorithm refines this with a precomputed lookup table for consistent hashing, achieving both even distribution and minimal disruption on changes. Hash-based schemes are great for session affinity (like ensuring a user’s session or a cache key stays on one server) and for multi-LB redundancy, but a pure hash can lead to imbalance if traffic isn’t uniform. (Maglev addresses this by carefully populating its hash table to evenly spread load.) Generally, consistent hashing trades a bit of load balance optimality for session stickiness and fault tolerance.
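For illustration, a bare-bones ring hash (with virtual nodes, but without Maglev’s lookup-table construction) could look like the following sketch; the hash function and virtual-node count are arbitrary choices, not a reference implementation:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes for smoother spread."""

    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` positions on the ring.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, session_key):
        """Walk clockwise from the key's hash to the next server position."""
        idx = bisect.bisect(self._keys, self._hash(session_key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["s1", "s2", "s3"])
print(ring.lookup("user-42"))  # the same user maps to the same backend every time
```

Because each server owns many points on the ring, removing one server only reassigns the keys that fell on its points – the rest keep their existing backends.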
Below is a comparison of popular load balancing algorithms and their pros/cons:
| Algorithm | Pros | Cons |
|---|---|---|
| Round Robin | Very simple; evenly cycles through servers. | Ignores server load or capacity differences. |
| Weighted Round Robin | Accounts for static capacity differences via weights. | Still static (weights fixed); can’t adapt to runtime load. |
| Least Connections | Dynamic load-responsive distribution; avoids sending traffic to busy servers. | Requires tracking server load (polling/metrics); assumes connection count correlates to work. |
| Least Response Time | Sends traffic to the fastest-responding server; adapts to actual performance. | Needs continuous latency measurements; can be skewed by outliers or caching effects. |
| Power of Two Choices | Near-optimal load spread with minimal overhead (checks only 2 servers). | Random selection can occasionally choose two slow servers; not as precise as a full load scan. |
| Consistent Hashing | Keeps a client/session on the same server; minimal disruption if servers change. | Possibly uneven load if some hashes get more traffic; primarily useful when affinity > perfect balance. |
Forwarding Modes: NAT vs DSR vs Tunneling
How a load balancer forwards traffic to backends impacts network behavior and performance. Common modes include NAT, Direct Server Return (DSR), and IP tunneling:
- NAT44 / Full-Proxy: In NAT mode, the LB acts as a middleman for both request and response. It modifies network addresses – typically the LB will rewrite the destination IP of incoming packets from the virtual IP (VIP) to the real server’s IP (and sometimes translate the source IP to itself) before forwarding. The servers return traffic to the LB’s address, and the LB then NATs the source back to the VIP or client’s address. This effectively proxies the connection. The upside is simplicity (works across any network topology, and the LB can inspect/modify traffic if needed). The downside is the LB becomes a bottleneck for return traffic and adds an extra hop, which can introduce latency and throughput limits.
- Direct Server Return (DSR): DSR is a technique where the LB only handles the inbound packets, and servers send responses directly back to the client without passing again through the LB. The LB typically does no IP rewriting – it forwards the packet with the VIP intact (often by updating only the MAC address to reach the server locally). The server must accept the VIP (e.g. configured on a loopback) so it responds as if it were the VIP. The result is much lower latency and bandwidth load on the LB, since responses bypass it entirely. The trade-off: DSR requires the LB and servers be on the same network (so the client’s return path can go direct), and it can be complex to set up (servers need the VIP configured and must not ARP for it). It’s popular in high-performance L4 load balancers in data centers where network topology allows direct returns. If encryption or complex request handling is needed, DSR alone isn’t sufficient because the LB isn’t seeing the responses.
- IP-in-IP or GRE Tunneling: Tunneling is similar to DSR’s one-way LB → server path but works across broader network setups. The LB encapsulates the client’s packet inside a new IP header (or GRE packet) addressed to the target server. The server decapsulates it to get the original request, processes it, and sends the response directly to the client (with the VIP as source). Tunnel mode looks like DSR except that traffic between the LB and server can be routed, even across layer-3 networks. The benefit is that backends need not be on the same subnet as the LB – you can load balance to servers in different networks or data centers, and the LB’s egress bandwidth isn’t a bottleneck. Drawbacks include a bit of added packet overhead (for the tunnel headers) and complexity (backends must support IPIP/GRE decapsulation). Tunneling and DSR both assume the client can reach the server directly, so they’re best for environments with stable, high-performance networks between clients and servers.
Diagram: NAT vs DSR Packet Flow
NAT Mode:
Client --> [LB: VIP] --> Server (LB rewrites dest IP to server’s IP)
Client <-- [LB: VIP] <-- Server (LB rewrites source IP back to VIP or client)
DSR Mode:
Client --> [LB: VIP] --> Server (dest IP remains VIP; server has VIP on loopback)
Client <-- Server (direct response to Client, source IP = VIP)
In NAT mode, the load balancer is in the path for the full round-trip, performing address translations on both request and response. In DSR or tunnel mode, the LB is only in the request path, handing off packets and letting servers reply directly. This reduces latency and LB load significantly, at the expense of a more constrained network setup.
Stateless ECMP vs Stateful Designs
Load balancers come in stateless and stateful flavors. A stateless design means the LB doesn’t store per-connection information – it makes a routing decision for each packet or new flow using a deterministic hash or algorithm, and relies on that consistency to keep packets in the same flow going to the same server. For example, routers in front of a cluster of L4 load balancers might use Equal-Cost Multi-Path (ECMP) routing, hashing each packet’s 5-tuple to pick a downstream LB or server. This is lightning-fast and scales horizontally: you can add more LBs and ECMP will spread flows to them evenly. The big advantage is redundancy and scale – if one stateless LB instance goes down, new packets will hash to others with no single coordinator needed. Cloudflare, for instance, moved to stateless Maglev-based L4 load balancers so that “it doesn't matter which load balancer the router selects…they'll end up reaching the same backend server” for a given client flow. In other words, with a consistent hashing scheme, any LB can handle the packet and still route it to the correct backend, eliminating the need for session affinity to a specific LB node.
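The stateless decision is ultimately just a deterministic hash of the flow’s 5-tuple, as in the toy sketch below (real routers do this in hardware or in the kernel, not in Python; the addresses are placeholders):

```python
import hashlib

def pick_next_hop(src_ip, src_port, dst_ip, dst_port, proto, next_hops):
    """Hash the 5-tuple so every packet of a flow picks the same next hop,
    with no per-connection state kept on the device doing the hashing."""
    flow = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    digest = int(hashlib.sha256(flow).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

lbs = ["lb-1", "lb-2", "lb-3", "lb-4"]
print(pick_next_hop("198.51.100.7", 51324, "203.0.113.10", 443, "tcp", lbs))
```

Note that a plain modulo over the next-hop list reshuffles many flows when the list changes; that is exactly the gap consistent-hashing schemes like Maglev or ring hash close at the load-balancer layer.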
However, stateless load balancing makes it harder to do complex features. A stateful or full-proxy load balancer tracks sessions and often terminates connections, giving it the ability to buffer and inspect traffic, do advanced routing (e.g. content-based switching), and handle features like SSL offloading. The trade-off is that if a client’s next packet goes to a different LB (due to route changes or failover) the new LB won’t have the required state and the connection could break. Techniques like Google’s Maglev use consistent hashing plus distributed connection tracking to mitigate this, ensuring even if routing “flaps” and a packet arrives at a new LB, it can hash to the same backend or quickly rebuild context. Generally, stateful designs have higher overhead (memory and CPU to maintain sessions) and often require pairing or clustering for high availability (so that a failover LB can assume the session state). Stateless designs are simpler and extremely scalable (often just built on fast hashing in kernel or network hardware), but they can’t provide application-layer awareness or modifications — they usually operate at L3/L4 with basic algorithms. Many modern systems blend these approaches: e.g. using stateless L4 load balancing (ECMP or hashing) to distribute traffic to a fleet of L7 proxies which are stateful. This combination yields both scalability and rich features.
Connection Reuse and Protocol Multiplexing
Another major factor in load balancer efficiency is how it handles connections to clients and servers. Reusing and multiplexing connections can significantly improve performance. For example, establishing a fresh TCP connection (and especially a TLS handshake) for every single HTTP request is expensive. With HTTP keep-alive (persistent connections), a client can send multiple requests over one TCP connection, and similarly, a full-proxy LB can maintain a pool of open connections to each backend server to recycle them for many requests. This amortizes handshake overhead and can dramatically increase throughput – one case saw a 50% increase in max requests handled by enabling keep-alive reuse, along with lower CPU and latency. The load balancer, acting as an L7 proxy, might accept thousands of client connections but funnel requests through a handful of long-lived connections to each server, reducing the churn of setup/teardown.
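The same principle shows up on the client or proxy side of any HTTP hop. For instance, with Python’s requests library, a Session pools and reuses connections per host instead of handshaking for every call (the URL below is a placeholder):

```python
import requests

# A Session pools connections per host, so repeated requests to the same
# backend reuse one TCP/TLS connection instead of opening a new one each time.
with requests.Session() as session:
    for i in range(100):
        resp = session.get("https://backend.example.com/api/item", params={"id": i})
        resp.raise_for_status()
```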
Modern protocols take this further. HTTP/2 (and HTTP/3/QUIC) introduce multiplexing: multiple independent streams of requests/responses share a single underlying connection. An L7 load balancer using HTTP/2 can concurrently route many client requests over one TCP (or QUIC) connection to each server. This not only cuts down on connection overhead but also avoids head-of-line blocking that plagued HTTP/1.1. The LB must decide whether to speak HTTP/2 to the backend – many will, or at least use multiplexing to servers if they support it. With HTTP/2, it’s common to see huge reductions in backend connections; e.g. instead of 100 parallel TCP connections, the LB might use 4 persistent ones to carry those streams. Fewer connections = lower syscall and context-switch overhead, and better utilization of each connection’s congestion window.
QUIC (HTTP/3) poses new considerations: QUIC runs over UDP and encrypts even the transport handshakes, so load balancers can’t rely on 5-tuple hashing alone if clients roam or NAT rebinding occurs. QUIC uses a Connection ID that a savvy load balancer can use to route all packets of a flow to the same server (to handle the case where the client’s IP or port changes). Some load balancers operate in a QUIC-aware mode to parse the initial packet and extract that ID for consistent hashing. Also, QUIC inherently multiplexes streams within one connection like HTTP/2, and has built-in keep-alive and faster handshakes. A full-proxy load balancer might terminate QUIC from clients (for example, to do content switching or because servers don’t speak QUIC) – in which case it should maintain long-lived QUIC connections to clients and reuse HTTP/2 or TCP connections to servers as needed. The key theme is maximizing reuse: whether via keep-alive, pipelining, multiplexing, or connection pooling, fewer connections means less overhead and often higher throughput.
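A QUIC-aware balancer keys its routing on the connection ID rather than the (changeable) client address. The sketch below is purely conceptual: it assumes the connection ID bytes have already been parsed out of the packet header, and it ignores the ID-encoding schemes real deployments use to embed routing hints:

```python
import hashlib

def route_quic_packet(connection_id: bytes, backends: list[str]) -> str:
    """Pick a backend from the QUIC connection ID so the flow survives
    client address changes (NAT rebinding, Wi-Fi to cellular migration)."""
    digest = int(hashlib.sha256(connection_id).hexdigest(), 16)
    return backends[digest % len(backends)]

# Hypothetical 8-byte connection ID pulled from an incoming packet.
cid = bytes.fromhex("a3f1c2d4e5b60718")
print(route_quic_packet(cid, ["s1", "s2", "s3"]))
```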
One must be careful: enabling very long-lived connections can tie up resources if not managed (e.g. file descriptors, or if a server has limits). Load balancers typically have settings for connection idle timeouts, max reuse counts, etc. Tuning those (and ensuring servers have keep-alive enabled) is important so that reuse helps rather than hurts. But when done right, persistent connections and multiplexing are a win-win: more requests per connection means less CPU and faster response for users.
Session State Management and Stickiness
In an ideal world, any request can go to any server without issue – the application tier would be completely stateless such that it doesn’t matter which backend handles each request. In practice, many applications maintain session state (user login info, shopping carts, personalized data) that traditionally was stored in server memory or cache. Load balancers can provide session stickiness (affinity) to ensure all requests from a given user/session ID go to the same backend server, to avoid having to share that state. This can be done by cookies or IP affinity: e.g. the LB issues a cookie on first response identifying the chosen server, so subsequent requests from that user bypass the normal algorithm and go to that same server.
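Conceptually, cookie-based affinity is just a lookup that short-circuits the normal algorithm when a prior assignment exists. The sketch below is a simplified model – the cookie name is made up, and random choice stands in for whatever algorithm the balancer normally runs:

```python
import random

SERVERS = ["s1", "s2", "s3"]
AFFINITY_COOKIE = "lb_server"  # hypothetical cookie name

def choose_backend(request_cookies: dict) -> tuple[str, dict]:
    """Honor an existing affinity cookie; otherwise pick normally and set one."""
    pinned = request_cookies.get(AFFINITY_COOKIE)
    if pinned in SERVERS:
        return pinned, {}                      # stick to the previous choice
    chosen = random.choice(SERVERS)            # stand-in for the real algorithm
    return chosen, {AFFINITY_COOKIE: chosen}   # cookie to set on the response

backend, cookies_to_set = choose_backend({})
backend_again, _ = choose_backend({AFFINITY_COOKIE: backend})
assert backend == backend_again  # later requests stick to the same server
```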
Sticky sessions make life easy for developers in the short term – you can use in-memory session objects and RAM caches without complex distributed storage. It also saves servers from having to constantly exchange session info or hit a database on each request. And it can improve cache hit rates (e.g. a user’s second request finds data warmed in the memory of the same server). However, stickiness comes at a cost. It “makes it more difficult to keep servers in balance” – some servers might get overloaded if many active users happen to stick there, while others are underutilized. If a sticky server goes down mid-session, the user’s session might be lost or need reauthentication, since their state wasn’t on the other machines. In the worst case, a surge of traffic could all stick to one server (if the LB’s first few assignments all happened to hit the same instance) and defeat the purpose of load balancing.
Because of these downsides, modern scalable architectures strive to externalize or avoid session state. Stateless tokens like JWT (JSON Web Tokens) let the client carry the session data (or a key to it) so that any server can serve any request – “being non-sticky is preferred under load” in this model. Alternatively, an application can store session data in an external store (database or distributed cache) that all servers can query. This removes the need for the LB to pin a user to one server. In fact, many cloud API services and microservices avoid sticky sessions entirely; they treat each request independently or use client-side state. The Stack Overflow community notes that sticky sessions are “best avoided these days” for API servers, precisely because they hinder scaling. There are cases, though, where enabling LB affinity makes sense – for example, a short-lived session during a multi-step transaction where sharing state externally would be too slow or complex, or for legacy applications that are expensive to rewrite for external state. In those cases, you might enable stickiness but limit its duration (many LBs let you expire the affinity after N minutes of inactivity or always reset on new login, etc.). It’s important to document those choices, because sticky sessions couple your load balancing layer with application state, which can complicate failover scenarios.
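As a sketch of the stateless-token approach, the snippet below uses the PyJWT library to carry session data in a signed token that any backend holding the key can validate; the secret and claims are placeholders:

```python
import time

import jwt  # PyJWT

SECRET = "replace-with-a-real-key"  # placeholder signing key

def issue_token(user_id: str) -> str:
    """Encode the session state into the token itself; no server-side store."""
    claims = {"sub": user_id, "cart": ["sku-1", "sku-2"], "exp": int(time.time()) + 3600}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def read_token(token: str) -> dict:
    """Any backend holding the key can validate and read the session."""
    return jwt.decode(token, SECRET, algorithms=["HS256"])

token = issue_token("user-42")
print(read_token(token)["cart"])  # session data is available on every backend
```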
A happy middle-ground in many systems is to use application-layer cookies or tokens (like a signed JWT or session cookie) that any server can read to get the user context, combined with an external cache for heavier state. That way, the LB can remain round-robin or least-load (no affinity needed), improving resilience and distribution. In short, use stickiness only when necessary – the trend is toward statelessness for web scale. When you do need persistence, LBs offer flexible mechanisms (cookie-based, IP-based, etc.), but plan for the impact on load distribution and fault tolerance.
Performance Tuning Knobs
To get the best performance from a load balancer (and the entire system behind it), engineers have several “knobs” to tweak. These become especially important at high scale:
- Connection Backlog: Just like any server, a LB has a limited queue for new incoming connections waiting to be accepted. If your service sees bursts of new connections, you want a large enough backlog (and corresponding OS settings like somaxconn) to buffer those spikes. Otherwise, clients may get connection refused errors if the queue overflows. Tuning the backlog high allows the LB (or the app server behind it) to smooth out spikes in load – the LB will accept as fast as it can, and the OS will queue the rest briefly. The drawback of an overly large backlog is using more memory and possibly queueing more work than the server can handle, but in practice it’s safer to have headroom. Similarly, if the LB is proxying and creating its own outbound connections to servers, the server’s listen backlog should be sized appropriately. Many load balancers and kernels default to backlog sizes (e.g. 128) that might be insufficient for thousands of connection attempts per second.
- Buffer Sizes: Network I/O buffers determine how much data can be in-flight or queued in sockets. LBs often need to handle mismatched speeds – maybe a fast client and slow server or vice versa. Generous socket buffers (both at OS level and in the LB application) help absorb short bursts and maintain throughput. For example, if a server produces data quickly but the client reads slowly (or is on a high-latency link), a larger buffer lets the LB store the data until it can deliver it, rather than forcing a stall. However, buffers that are too large can consume a lot of RAM if you have many connections, and can add latency if data sits in a queue too long. The key is to size buffers based on your traffic patterns (e.g. if you serve large responses like video files, you’ll need larger buffers than if you serve small JSON APIs). Also consider TLS record sizes, etc., when tuning – often making buffers at least a few multiples of the MSS (maximum segment size) or typical response size is wise.
- Slow-Start (Ramp-up): When a new server instance is added to a pool (or a sick one recovers), immediately sending it a full share of traffic can overwhelm it. Perhaps it has cold caches or just came online. Slow-start is a feature that initially throttles the proportion of traffic to a new backend and gradually increases it over a short period. For example, AWS Application Load Balancer lets you enable slow start so that over, say, 30 seconds, the target goes from 0% to 100% of its allocated traffic weight instead of 100% immediately. Envoy’s slow-start implementation scales the endpoint’s weight from a low value up to normal over a time window to allow warm-up (a minimal weight-ramp sketch appears after this list). This prevents a thundering herd of requests from hitting a cold server that might still be JIT-compiling code or loading data into memory. The “ramp-up” period is usually on the order of tens of seconds to a couple of minutes – long enough to avoid timeouts, short enough to get the server productive quickly. Tuning the window and aggression (some systems allow exponential or linear ramps) can ensure a smooth transition. In low-traffic scenarios, slow-start isn’t very effective (if only one new server exists, it will still get a chunk of traffic), but in high-traffic environments it’s a lifesaver for reliability.
- Other Knobs: There are many other performance settings one might adjust. The frequency of out-of-band health checks can be tuned so that failed servers are detected and removed quickly, but not so fast that false positives cause flapping. Threading or event-loop settings in software LBs (like Nginx, HAProxy) can be tweaked to utilize CPU cores efficiently. TLS offload settings (number of handshake threads, etc.) will affect throughput for HTTPS-heavy loads. Even the choice of L4 vs L7 balancing is a performance consideration: an L4 NAT balancer in kernel space can push a lot more packets per second than an L7 proxy doing heavy HTTP parsing – though the latter provides more features. In virtualized or cloud environments, you might consider enhanced networking features (like AWS Elastic Network Adapters) to give your LB more packets per second. And don’t forget logging: verbose logging on a busy LB can become a bottleneck; tune the log level appropriately or use asynchronous logging if possible.
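To make the slow-start idea concrete, here is a minimal linear weight ramp (referenced from the Slow-Start bullet above). The 30-second window echoes the ALB example, and the linear shape and 10% floor are illustrative choices, not any particular vendor’s behavior:

```python
import time

def effective_weight(full_weight: float, added_at: float, window_s: float = 30.0,
                     floor: float = 0.1, now: float | None = None) -> float:
    """Linearly ramp a new backend's weight from a small floor up to its
    configured weight over `window_s` seconds after it joins the pool."""
    now = time.time() if now is None else now
    progress = min(1.0, max(0.0, (now - added_at) / window_s))
    return full_weight * max(floor, progress)

t0 = time.time()
for offset in (0, 10, 20, 30):
    print(f"t+{offset:>2}s -> weight {effective_weight(1.0, t0, now=t0 + offset):.2f}")
# t+ 0s -> 0.10, t+10s -> 0.33, t+20s -> 0.67, t+30s -> 1.00
```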
In summary, load balancers are the “traffic cops” of scalable systems, and like any good traffic system, they require smart algorithms and well-tuned mechanics. By understanding core distribution algorithms, choosing the right forwarding mode, and carefully managing state, connections, and performance parameters, you can design a load balancing solution that is efficient, robust, and suited to your application’s needs.
Further Reading
- Load Balancing Algorithms Explained Visually – Quastor
- High Availability Load Balancers with Maglev – Cloudflare
- Maglev: A Fast and Reliable Software Network Load Balancer – Google Research
- Sticky Sessions (Session Persistence) – Imperva
- Stop Wasting Connections, Use HTTP Keep-Alive – Lob Engineering