SerialReads

CDN Architecture & Data-Path Mechanics (System Design Deep Dive)

Jun 08, 2025

Content Delivery Networks (CDNs) form a complex, distributed system designed to speed up content delivery and reduce load on origin servers. This deep dive covers key aspects of CDN architecture and the data path, from caching hierarchies to transport protocols and failover mechanisms. The discussion is vendor-agnostic and conceptual (with AWS CloudFront as an illustrative example), targeting intermediate-to-advanced engineers preparing for system design interviews.

Tiered Caching Hierarchy (Edge ↔ Mid-tier ↔ Origin) and Prefetching

Modern CDNs use a tiered cache hierarchy to minimize latency and origin load. Edge caches are deployed in numerous Points of Presence (PoPs) near end-users, serving most requests. Cache misses at the edge are forwarded to mid-tier or regional caches, which aggregate traffic from multiple edge locations, and finally to the origin server if needed. For example, Google’s Media CDN describes a three-layer topology: deep edge caches close to users (often inside ISP networks), a peering-edge tier that serves as a regional parent cache, and large long-tail caches deeper in the network that act as an origin shield. Amazon CloudFront has similarly used Regional Edge Caches as a middle layer since 2016, automatically protecting origins by collapsing requests at the regional level.

In this hierarchy, higher-tier “shield” or parent caches absorb cache misses from many edges. Only the shield cache will retrieve content from the origin on a miss, then share it with child nodes. This dramatically reduces duplicate origin fetches. CloudFront’s Origin Shield (introduced in 2020) is essentially a designated regional cache that all other regions funnel through as a final layer, ensuring that simultaneous misses across regions result in just one origin request.
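
To make the flow concrete, here is a minimal, vendor-neutral sketch of a tiered lookup in Python. The CacheTier class, the tier names, and the wiring below are illustrative assumptions, not any provider’s actual API.

```python
# Illustrative sketch of a tiered cache lookup: edge -> regional -> shield -> origin.
# All names here are hypothetical; real CDNs implement this inside their proxy layer.

class CacheTier:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # next tier up (regional cache, shield, or None)
        self.store = {}               # key -> cached bytes

    def get(self, key, fetch_origin):
        if key in self.store:         # cache hit at this tier
            return self.store[key]
        if self.parent is not None:   # miss: ask the parent tier
            value = self.parent.get(key, fetch_origin)
        else:                         # topmost tier (origin shield) goes to the origin
            value = fetch_origin(key)
        self.store[key] = value       # populate this tier on the way back down
        return value

# Example wiring: many edges share one regional cache, which shields the origin.
origin_shield = CacheTier("shield")
regional = CacheTier("regional-eu", parent=origin_shield)
edge = CacheTier("edge-fra1", parent=regional)

def fetch_origin(key):
    print(f"origin fetch for {key}")  # happens once; later misses stop at a parent tier
    return b"...object bytes..."

edge.get("/assets/logo.png", fetch_origin)   # triggers one origin fetch
edge.get("/assets/logo.png", fetch_origin)   # served from the edge cache
```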

CDNs also employ request collapsing at each cache layer: if multiple users ask for the same uncached object at nearly the same time, the cache will forward only one request upstream (to a parent or origin) and queue the rest. Once the data returns, it is cached and served to all the waiting requests. This collapsed forwarding technique prevents a stampede of identical requests hitting the origin in parallel. It’s a default behavior in many CDN platforms, significantly reducing origin load during traffic spikes.

Another strategy to improve efficiency is prefetching or prefill. Here, the CDN anticipates content that a user will request next and fetches it to the edge in advance. For example, Akamai’s media streaming CDN can use origin hints: when the origin responds to a segment request, it may include a header pointing to the next segment. The CDN then asynchronously retrieves that next segment to the edge cache before the viewer actually requests it. Prefetching (also called cache warming when done proactively) helps ensure that sequential content (like video segments or pages in a series) is already at the edge by the time the user needs it, reducing latency for those subsequent requests.
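
As a rough illustration, a hint-driven prefetcher might look like the Python sketch below. The X-Next-Segment header name and the plain in-memory cache dict are assumptions made for brevity, not a real CDN’s interface.

```python
import threading
import urllib.request

# Sketch of hint-driven prefetch: when the origin's response names the next
# segment, fetch it in the background so it is cached before it is requested.

cache = {}

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
        next_hint = resp.headers.get("X-Next-Segment")  # hypothetical origin hint
    cache[url] = body
    if next_hint and next_hint not in cache:
        # Warm the next segment asynchronously, ahead of the viewer's request.
        threading.Thread(target=fetch, args=(next_hint,), daemon=True).start()
    return body
```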

Transport Stack Optimizations (TLS at the Edge, TCP Tuning, HTTP/2+3)

Once a user’s request reaches an edge server, the CDN’s transport stack takes over to optimize delivery. A key practice is terminating TLS (HTTPS) at the edge. Rather than tunneling encrypted traffic to the origin, the edge server handles the SSL/TLS handshake and decryption, serving content to the user directly. This reduces latency and offloads cryptographic work from the origin. In fact, edge servers are specifically designed with this in mind (often including hardware acceleration for TLS). By decrypting and re-encrypting traffic at the edge, CDNs gain visibility into requests (for caching and routing decisions) and cut down on backhaul overhead.

Under the hood, CDNs fine-tune the TCP/IP layer on their servers to squeeze out performance. One important knob is the initial congestion window (initcwnd). This parameter controls how many TCP segments can be sent immediately in a new connection before waiting for an ACK. Historically, TCP started with a very small window (1–4 segments), but research showed that larger initial windows significantly reduce latency for short web transfers. Today, the de facto standard initial window is about 10 segments, and Linux has used IW10 by default since 2011. Many CDN servers adopt at least this value, and some experiment with even larger initial windows or other TCP tweaks to improve page load times on modern networks. The goal is to send more data during the “slow-start” phase so that small objects can be delivered in fewer round trips.

CDNs also deploy cutting-edge congestion control algorithms in their network stack. Traditional TCP uses loss-based algorithms like Reno or CUBIC, but newer approaches like BBR (Bottleneck Bandwidth and RTT) model the network to avoid building queues. In practice, some CDN providers have enabled BBR on their edge servers to improve throughput and latency. For instance, Cloudflare uses Google’s BBR algorithm for its TCP connections. BBR can detect the network’s capacity via latency measurements rather than waiting for packet loss, allowing the CDN to fully utilize bandwidth while minimizing packet loss and buffering. Tuning such low-level parameters (including TCP fast open, selective ACKs, etc.) is part of how CDNs accelerate content delivery beyond what a default server could do.
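
The sketch below shows, in Python, the kind of per-socket knobs an edge proxy might set on Linux. It assumes a kernel with BBR available and the listed socket options exposed; the buffer sizes and queue lengths are arbitrary, and none of this is any vendor’s actual configuration.

```python
import socket

# Linux-only sketch of per-socket TCP tuning an edge proxy might apply.
# Whether each option works depends on the kernel (BBR must be available).
# Note: the initial congestion window is a per-route setting (e.g. set via
# `ip route ... initcwnd 10`), not a socket option, so it lives outside app code.

def tuned_listen_socket(port: int) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

    # Use BBR congestion control for connections accepted on this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")

    # Keep the amount of unsent data buffered in the kernel small so responses
    # can be reprioritized instead of sitting behind a large backlog (Linux-specific).
    notsent_lowat = getattr(socket, "TCP_NOTSENT_LOWAT", 25)  # 25 = Linux option number
    sock.setsockopt(socket.IPPROTO_TCP, notsent_lowat, 16 * 1024)

    # Enable TCP Fast Open so repeat clients can carry data in the SYN.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, 1024)

    sock.bind(("0.0.0.0", port))
    sock.listen(1024)
    return sock
```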

At the HTTP layer, modern CDNs leverage protocol advancements like HTTP/2 and HTTP/3 to improve performance. HTTP/2 (H2) is almost always used between clients and edge servers now. Unlike HTTP/1.1, which pushed browsers to open six or more parallel connections per host to fetch resources, H2 uses a single connection with multiplexed streams – multiple requests and responses can be in flight concurrently on one TCP connection. This eliminates the overhead of many TCP handshakes and lets the server interleave responses (potentially prioritizing more important resources). CDNs optimize their H2 implementations to ensure that high-priority responses aren’t blocked by lower-priority data, sometimes tuning internal buffers and using features like TCP_NOTSENT_LOWAT to reduce unnecessary buffering.

HTTP/3 (H3) is the latest evolution, building on QUIC, a UDP-based transport. Many CDNs have started offering HTTP/3 support for viewer connections to further cut latency. QUIC combines the TCP+TLS handshakes into a single step and eliminates head-of-line blocking at the transport layer. This means faster connection setup and more resilient transfers if packets are lost. In HTTP/2 over TCP, a lost packet blocks all streams until recovery; with QUIC in HTTP/3, each stream is independent. As an illustration, QUIC establishes a secure session with just one round-trip handshake, whereas TCP+TLS needed two or more RTTs. Early adopters report improved page load times, especially in high-latency conditions, thanks to these changes. CDNs terminate QUIC at the edge and typically still use TCP to talk to origins (for compatibility), but the benefit is realized on the client-to-edge leg where it matters most. Notably, Amazon CloudFront enabled HTTP/3 support across all its edge locations in 2022 – it’s an opt-in feature that provides faster connection times, stream multiplexing, and built-in encryption with no extra cost.

Intelligent Routing: Anycast vs Geo-DNS and Outage Handling

A crucial part of CDN architecture is how user requests are routed to the “best” edge location. Two primary techniques are used: Anycast routing and Geo-DNS (geographic load balancing).

Anycast uses the Internet’s routing protocol (BGP) to advertise the same IP address from servers in many different locations. When a user’s device looks up the CDN’s IP and sends traffic, Internet routing automatically delivers it to the nearest (in network terms) announcing location. In other words, every PoP advertises the same Anycast IP (or set of IPs), and the Internet at large delivers packets to the closest PoP without the client needing to know which PoP it is reaching. This has the advantage of simplicity and fast failover – if a PoP is unreachable, it withdraws the BGP route, and traffic shifts to the next nearest PoP typically within seconds. BGP is designed to reconverge quickly on failures, meaning Anycast CDNs natively handle a PoP outage by rerouting users, often without manual intervention. Anycast also has a side-benefit in DDoS mitigation: a flood of malicious traffic gets distributed across many PoPs rather than overwhelming one data center, since the attack is also routed to the “closest” PoP from each source.

The downside of pure Anycast is that routing by BGP isn’t the same as routing by real latency. BGP tends to choose paths with the shortest AS path or fewest hops, which might not reflect actual speed, so an Anycast IP can pull users toward a PoP that isn’t actually the fastest choice. Also, a single global Anycast address can perform poorly for distant regions with limited interconnection; many CDNs therefore deploy regional Anycast clusters – e.g. one Anycast IP per continent or group of PoPs – to avoid pathological routing across continents. There’s also the issue that Anycast requires the application to be stateless or to handle client rehoming, since successive packets (or flows) from the same user could conceivably go to different PoPs if the network changes. In practice, stable routing and connection-oriented protocols mitigate this, and CDNs ensure anycasted services like HTTPS are designed to tolerate such switches.

DNS-based routing (Geo-DNS), on the other hand, relies on the CDN’s authoritative DNS service to direct users. Each PoP or region has its own IP address, and when a user’s DNS resolver asks for the CDN hostname, the DNS server returns the IP that corresponds to a nearby PoP (based on the resolver’s IP location or other mapping databases). Essentially, the CDN DNS “load balances” requests by geography: a user in Germany might get an IP for the Frankfurt PoP, while one in California gets an IP for a Los Angeles PoP, etc. This method gives the CDN fine-grained control over mapping users to PoPs and can incorporate real-time load or availability data – for example, if a datacenter is near capacity or down, DNS can stop handing out its IP. The trade-off is that DNS responses are cached by resolvers for a time (TTL), so changes due to failures can be slow to propagate. If a PoP goes down, any users who still have its IP cached may keep trying to use it until DNS TTLs expire and a new lookup occurs. As a result, failover via DNS is inherently slower than Anycast BGP failover unless TTLs are kept very low (which then increases DNS query overhead). Furthermore, Geo-DNS sometimes makes mistakes due to DNS resolver location: the DNS query might come from a resolver that isn’t actually close to the end-user, causing the CDN to mis-geo-locate the request. Anycast avoids that particular issue by using the client’s routing directly.
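
A toy version of the Geo-DNS decision logic might look like the Python sketch below. The region map, IP addresses, health set, TTL value, and the geolocate helper are all hypothetical placeholders; real CDNs use detailed latency and load maps rather than a static table.

```python
# Toy Geo-DNS steering: return PoP IPs for the resolver's region, skipping
# unhealthy PoPs, with a short TTL so failover propagates quickly.

POP_BY_REGION = {
    "EU": ["203.0.113.10", "203.0.113.11"],   # e.g. Frankfurt, Amsterdam (made up)
    "US-WEST": ["198.51.100.20"],             # e.g. Los Angeles (made up)
}
UNHEALTHY = set()          # PoP IPs currently failing health checks
DNS_TTL_SECONDS = 30       # kept low so clients re-resolve soon after a failure

def geolocate(resolver_ip: str) -> str:
    return "EU"            # placeholder; a real implementation consults a GeoIP database

def resolve(resolver_ip: str) -> tuple[list[str], int]:
    region = geolocate(resolver_ip)
    candidates = [ip for ip in POP_BY_REGION.get(region, []) if ip not in UNHEALTHY]
    if not candidates:     # whole region down or unknown: fall back to any healthy PoP
        candidates = [ip for ips in POP_BY_REGION.values()
                      for ip in ips if ip not in UNHEALTHY]
    return candidates, DNS_TTL_SECONDS
```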

Many large CDNs use a hybrid of these approaches. For instance, a CDN might use DNS-based distribution at a high level but still rely on Anycast for certain networks or as a fail-safe. Some primarily Anycast CDNs partition their network into a few Anycast groups to improve performance (so not a single global cluster). There’s also the concept of connection routing vs content routing – e.g., a CDN might direct initial DNS to a “routing service” which then uses Anycast to reach the nearest edge. The key takeaway is that routing strategies impact how quickly and reliably users are connected to an optimal edge. Fail-out and fail-back behavior under PoP outages is an essential design consideration: CDNs must detect failures (via heartbeat systems or loss of BGP announcements) and reroute traffic with minimal interruption. In anycast scenarios, this is largely automatic with BGP. In DNS scenarios, CDNs often use very short TTLs (like 30 seconds) for DNS responses so they can rapidly redirect new requests if a PoP fails. They may also route traffic to backup locations if one goes down (for example, users of a failed Frankfurt PoP might be pointed to Amsterdam in the interim). Robust systems additionally monitor performance – if a particular routing decision leads to high latency or errors, the mapping can be adjusted on the fly.

Large Object Delivery: Ranges, Chunks, and Adaptive Segments

Delivering large objects (such as video files, big downloads, or game updates) efficiently through a CDN requires special handling to avoid bottlenecks. One technique is using HTTP Range Requests and chunked transfers to break the delivery into smaller pieces. Instead of pulling a 10 GB file in one go, an edge server might request it from the origin in chunks – say 10 MB at a time – caching each chunk as it arrives. This has multiple benefits: the edge can start serving the first chunk to the user immediately (reducing time-to-first-byte), it doesn’t have to hold huge objects entirely in memory, and if a user aborts the download halfway, the CDN doesn’t waste effort retrieving the rest of the file. For instance, Azure Front Door (a CDN/load balancer service) will automatically do object chunking for large files: as the edge gets a request, it retrieves smaller pieces from the origin and streams them to the client, rather than a single monolithic transfer. Most CDNs have an internal threshold (often in the low tens of MB) beyond which they switch to segmented caching – storing an object in portions.

Clients themselves can request byte ranges (the Range: header), and CDNs will honor those, fetching from origin as needed and caching those ranges. If a download is interrupted, a client can request “bytes 5000000-” to resume, and the CDN will serve from cache if possible or go to origin for that range. Caching partial content is tricky but supported: for example, Amazon CloudFront caches range responses so that subsequent range requests for the same file can be served from the cached portions without re-fetching. Some CDNs (like G-Core) explicitly offer a large file optimization where the CDN origin fetch is done in fixed-sized chunks (e.g. 10 MB) which are cached individually. A similar concept in Alibaba Cloud’s CDN is “Range origin fetch”, where edge nodes ask the origin for, say, 512KB chunks and progressively cache a file. Chunking ensures that very popular large files (like a new game patch) don’t thrash the cache or network – the CDN can serve different segments to different users in parallel once the first segments are cached, rather than one user’s slow download occupying the entire file pipeline.
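
The Python sketch below shows one way an edge could fetch and cache fixed-size chunks via Range requests, then stitch them together to satisfy arbitrary client ranges. The 10 MB chunk size and the in-memory dictionary are simplifying assumptions, not any CDN’s internals.

```python
import urllib.request

CHUNK = 10 * 1024 * 1024          # 10 MB chunks, a typical order of magnitude
chunk_cache = {}                  # (url, chunk_index) -> bytes

def get_chunk(url: str, index: int) -> bytes:
    """Fetch one fixed-size chunk from the origin via a Range request, caching it."""
    key = (url, index)
    if key in chunk_cache:
        return chunk_cache[key]
    start = index * CHUNK
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + CHUNK - 1}"})
    with urllib.request.urlopen(req) as resp:   # origin replies 206 Partial Content
        data = resp.read()
    chunk_cache[key] = data
    return data

def serve_range(url: str, first_byte: int, last_byte: int):
    """Satisfy an arbitrary client Range by stitching together cached chunks."""
    for index in range(first_byte // CHUNK, last_byte // CHUNK + 1):
        data = get_chunk(url, index)
        lo = max(first_byte - index * CHUNK, 0)
        hi = min(last_byte - index * CHUNK + 1, len(data))
        yield data[lo:hi]
```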

Choosing the optimal chunk size or strategy is an interesting problem. Too small, and you pay extra round trips and per-request overhead; too large, and each chunk starts to behave like the monolithic transfer you were trying to avoid. Many CDN providers tune this based on file type and observed bandwidth. As one strategy guide notes, segmenting large files and determining chunk size should account for average user throughput and the CDN’s caching capabilities. The CDN can also leverage parallel downloads – for example, if a client can download multiple chunks simultaneously (supported in some download managers or by using multiple TCP flows), the CDN can fulfill those from cache in parallel to boost the overall speed.

For video streaming and other media, large content delivery is often coupled with adaptive segmentation. Videos are cut into small segments (e.g. 2-10 seconds of video each) and encoded at multiple bitrates. This way, the CDN is actually delivering a series of moderate-sized objects rather than one giant file. Users fetch segment by segment, and the player may switch to a different quality on the fly. CDNs assist by caching these segments and sometimes by prefetching the next few segments. Adaptive bitrate streaming (ABR) ensures that even if a user’s network conditions change, they get the best possible quality without interruption. From the CDN’s perspective, each segment is a cacheable object – which keeps the working set of “hot” content smaller (only a few segments ahead of the playhead need to be hot at once). CDNs may also dynamically adjust how far ahead to prefetch segments (prefetch two segments ahead on a fast connection vs. one ahead on a slow one) as an adaptive segment sizing strategy. The principle across these techniques is to divide and conquer large content: break it into pieces that can be managed efficiently, cached intelligently, and delivered adaptively.

Caching Policies and Storage: Multi-layer Caches, Disk Tiering, Eviction Algorithms

Effective CDN performance hinges on smart caching – not just at the network level (which file is cached where), but also within each server. Edge servers maintain caches in layers: typically a portion of data in fast memory, a larger portion on SSD, and possibly even more on slower disks. Managing these tiers requires decisions about which content to keep, which to evict, and how to move objects between tiers.

A common approach is a multi-layer LRU: the cache treats RAM as an L1 cache for the most recently used “hot” objects, and SSD as an L2 for slightly less hot objects. By keeping the hottest items in RAM, the CDN serves those with minimal latency (no disk IO). The SSD (often NVMe drives) holds the bulk of cached content for quick retrieval. For example, a typical edge server might have 256 GB or more of RAM and tens of TB of NVMe SSD storage. Less frequently accessed or large assets might live on the SSD tier. Some CDNs even incorporate HDDs or network storage for an L3 “cold” cache if needed, trading speed for capacity.

Disk tiering and promotion heuristics determine how an object moves between these layers. A naive approach is to write everything to disk as it arrives and also keep a copy in RAM until evicted (essentially a read-through cache with LRU eviction from RAM). However, this can waste disk bandwidth on content that never gets reused (so-called one-hit wonders). An interesting optimization used by Cloudflare was to not immediately commit every object to SSD. They introduced a transient in-memory cache: an item is cached in RAM first and only promoted to SSD if it gets requested multiple times in a short window. This way, items that were requested only once might be served (and maybe buffered in RAM briefly) but never written to disk, since they are unlikely to be needed again. Their experiment showed that avoiding disk writes for one-hit-wonders halved the disk write rate and even reduced tail latency for disk hits by 5%, because the SSD wasn’t busy with useless writes. Essentially, they trade a bit of extra origin fetch (if that one-hit content is requested again later, which was rare) for a significant gain in cache efficiency and hardware longevity.
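
Here is a simplified Python sketch of that idea: a small RAM LRU in front of a larger “disk” LRU, where an object is only written to the disk tier once it has been hit a second time. The capacities and the dict-backed “disk” are stand-ins for illustration, not Cloudflare’s implementation.

```python
from collections import OrderedDict

class TwoTierCache:
    """Two-tier LRU cache: RAM for hot objects, 'disk' only after a repeat hit."""

    def __init__(self, ram_items=1_000, disk_items=100_000):
        self.ram = OrderedDict()       # hot objects, LRU order (front = coldest)
        self.disk = OrderedDict()      # bulk of the cache ("SSD"), LRU order
        self.hits = {}                 # key -> hit count while RAM-resident
        self.ram_items, self.disk_items = ram_items, disk_items

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            self.hits[key] = self.hits.get(key, 1) + 1
            if self.hits[key] == 2 and key not in self.disk:
                self._disk_put(key, self.ram[key])   # second hit: now worth a disk write
            return self.ram[key]
        if key in self.disk:
            self.disk.move_to_end(key)
            self._ram_put(key, self.disk[key])       # pull back into RAM as it re-heats
            return self.disk[key]
        return None                                   # miss: caller fetches upstream

    def put(self, key, value):
        self.hits[key] = 1
        self._ram_put(key, value)                     # new objects land in RAM only

    def _ram_put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        while len(self.ram) > self.ram_items:
            old, _ = self.ram.popitem(last=False)     # evict LRU from RAM
            self.hits.pop(old, None)

    def _disk_put(self, key, value):
        self.disk[key] = value
        while len(self.disk) > self.disk_items:
            self.disk.popitem(last=False)             # evict LRU from disk
```

One-hit wonders live and die in the RAM tier in this sketch, so the disk tier never pays a write for them, which is exactly the effect described above.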

More generally, CDNs employ advanced cache eviction algorithms beyond plain LRU. LRU (Least Recently Used) evicts the oldest unused item when space is needed, which is simple and works well if recent past use predicts future use. LFU (Least Frequently Used) prioritizes keeping items that are accessed often, which can be better for content with steady popularity. Many CDNs use a hybrid or adaptive scheme – for example, an LRU with a popularity boost, or the ARC (Adaptive Replacement Cache) which balances recent vs. frequent items. A hybrid might keep two lists or use a tiny LFU structure to decide promotion. In fact, combining LRU and LFU can yield better hit ratios, and there’s research into machine-learning-based algorithms approximating the optimal Belady’s algorithm. In practice, a CDN may segment cache space: certain content types might use an LFU bias (to avoid evicting a consistently popular long-term asset), while others use strict recency. One example in documentation suggests using LRU for static web assets but LFU for streaming segments, and combining policies to suit the workload.
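
As a rough sketch of frequency-aware admission (in the spirit of TinyLFU, not any specific CDN’s code), a new object might only displace the LRU victim if it has been requested more often; the unbounded Counter below is a simplification of the compact sketches real designs use.

```python
from collections import Counter, OrderedDict

class AdmissionLRU:
    """LRU cache with an LFU-biased admission filter on inserts."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.freq = Counter()          # request frequency, counted even on misses

    def get(self, key):
        self.freq[key] += 1
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        return None

    def put(self, key, value):
        if key in self.cache or len(self.cache) < self.capacity:
            self.cache[key] = value
            self.cache.move_to_end(key)
            return
        victim = next(iter(self.cache))               # current LRU eviction candidate
        if self.freq[key] > self.freq[victim]:        # admit only if demonstrably hotter
            self.cache.popitem(last=False)
            self.cache[key] = value
```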

The promotion/demotion between memory and disk (or SSD vs HDD) also follows heuristics. As described, one method is to only promote after multiple hits (to filter out ephemeral objects). Conversely, demotion might occur if higher tiers get full: an item in RAM that hasn’t been used in a while might be demoted to SSD-only to free up RAM for hotter content. There’s also the concept of write-back vs write-through caching. Some CDNs might stream data directly from origin to client and only cache on disk once fully received (to avoid partial files on cache), whereas others cache on the fly. If using chunked caching, each chunk can be individually cached and evicted. For disk tiering, the CDN software may implement something akin to a two-queue system: one queue for items recently added (to catch bursty accesses), and another for items that have proven popular over time.

The hardware is selected to support these algorithms: CDN edge servers often use high-end NVMe SSDs specifically because they offer fast random read/write and high endurance for constant cache churn. Large RAM helps in serving the tail of popular content directly from memory. As a result, a cache hit from RAM might be on the order of microseconds, from SSD a few milliseconds, and an origin fetch hundreds of milliseconds – so every layer of caching drastically changes the performance outcome.

In summary, CDNs carefully tune their caching policies to maximize the cache hit ratio while minimizing latency. They remove content that likely won’t be reused and ensure that precious RAM/Disk space is used for the most valuable objects. By using multi-layer LRU/LFU hybrids and clever promotion rules, they can accommodate both flash crowd scenarios and long-tail content in a cost-effective way.

Thundering Herds: Request Coalescing and Collapsed Forwarding

When new or infrequently accessed content suddenly becomes popular (e.g. a viral video or a software update release), CDNs face the thundering herd problem on cache miss: thousands of users may concurrently trigger requests that miss the cache, all needing to be fetched from the origin. Without mitigation, this could storm the origin with duplicate fetches for the same item, potentially overwhelming it. To solve this, CDNs implement request coalescing, also known as collapsed forwarding or request collapsing.

Here’s how it works: on a cache miss, the first request for an object causes the edge server to send an origin fetch (or a fetch to the next tier). If additional requests for that same object arrive while the fetch is in progress, the CDN does not forward those to the origin. Instead, it “coalesces” them – essentially queuing or attaching them to the outstanding fetch. The origin will only see one request, and the edge will wait for that response. When the data returns and is stored in cache, the edge immediately serves that data to all the waiting user requests. In effect, dozens or hundreds of simultaneous user requests get collapsed into one upstream request. This drastically reduces the load spike on the origin and also avoids wasting bandwidth on redundant transfers.
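
A minimal asyncio sketch of request coalescing is shown below. The fetch_from_origin callable is a placeholder for the real upstream request, and error handling is reduced to the essentials.

```python
import asyncio

# Concurrent misses for the same key share one upstream fetch: the first caller
# performs the fetch, later callers await the same in-flight future.

cache: dict[str, bytes] = {}
inflight: dict[str, asyncio.Future] = {}

async def get(key: str, fetch_from_origin) -> bytes:
    if key in cache:
        return cache[key]
    if key in inflight:                       # someone is already fetching this key:
        return await inflight[key]            # wait for their result instead of refetching
    future = asyncio.get_running_loop().create_future()
    inflight[key] = future
    try:
        value = await fetch_from_origin(key)  # the only upstream request for this burst
        cache[key] = value
        future.set_result(value)
        return value
    except Exception as exc:
        future.set_exception(exc)             # waiters see the same failure
        raise
    finally:
        del inflight[key]
```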

Request collapsing typically occurs at each layer of the CDN. For example, if a regional cache (mid-tier) is querying the origin and multiple edge servers in that region all had a miss, the regional cache will consolidate those into one origin request (this is often called collapsed forwarding at the mid-tier). The concept can even extend to cross-PoP scenarios: some CDN architectures allow one edge to ask a sibling edge or an upper layer for content if it’s known to have it (called “cross-fetching”), thereby avoiding origin calls. Google’s Media CDN notes that all layers of caching support request collapsing by default. Even truly simultaneous cache misses across the network thus collapse into a single origin fetch at the top (shield) layer, no matter how many end-users initiated them.

An allied mechanism to coalescing is negative caching. If an origin is slow or returns an error for new content, some CDNs briefly remember that state so they don’t hammer the origin with retries. For example, if 100 users request a new object that is not yet available (say the origin is still generating it), the CDN might collapse those requests and either serve a single error or queue them until one attempt succeeds.

Overall, request coalescing is essential for mitigating “hot origin” scenarios, such as when a celebrity tweet drops a link and a million people click it at once. The first few edge servers to get the request will initiate one fetch each, and everyone else waits a few extra milliseconds but then is served from cache. The origin sees only a tiny fraction of the traffic it would have otherwise. This technique, combined with the tiered caching discussed earlier (especially using an Origin Shield layer), is how CDNs shield origins from sudden bursts of demand.

Origin Selection and Failover Strategies

CDNs act as a smart proxy in front of one or more origin servers. For high availability, it’s common to configure multiple origin servers or endpoints. CDNs provide features for origin selection, health checking, and failover to ensure content is always served even if one origin goes down.

A typical setup might have a primary origin and a secondary origin (or more), often in different data centers or cloud regions. The CDN will normally fetch from the primary. If that fails – due to a network error or an HTTP error response – the CDN can automatically retry the request on the secondary origin. This is usually called origin failover. For instance, Amazon CloudFront allows defining an “origin group” with two origins: if the primary returns certain failure status codes (or doesn’t respond at all), CloudFront will switch to the secondary for that request. The set of status codes that trigger failover is configurable (commonly 500-class errors, timeouts, or specific application-level errors). Google’s Media CDN similarly lets you configure per-origin failover policies – it can be set to fail over on “hard” errors like TCP connect failures or even on specific HTTP codes like 404/429 if desired.
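
A stripped-down version of that retry logic might look like the following Python sketch. The origin URLs and the set of status codes that trigger failover are illustrative choices, not CloudFront’s or any vendor’s defaults.

```python
import urllib.error
import urllib.request

# Sketch of origin-group failover: try the primary, then retry on the secondary
# for configured failure conditions (timeouts, connect errors, selected 5xx codes).

ORIGINS = ["https://primary.example.com", "https://backup.example.com"]
FAILOVER_STATUS = {500, 502, 503, 504}

def fetch_with_failover(path: str, timeout: float = 3.0) -> bytes:
    last_error = None
    for origin in ORIGINS:
        try:
            with urllib.request.urlopen(origin + path, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code not in FAILOVER_STATUS:
                raise                          # e.g. a 404: don't mask it by failing over
            last_error = e
        except OSError as e:                   # connect failure, DNS error, timeout
            last_error = e
    raise last_error                           # every origin failed (fail-closed here)
```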

Health-check cadence comes into play to proactively detect origin issues. Some CDNs (or customers) set up continuous health checks: periodic pings or HTTP requests to each origin. If a primary origin is detected as down, the CDN might preemptively switch all traffic to a backup until the primary recovers. Not all CDN platforms do active health checks by default – many simply attempt the primary on each request and failover if it fails. Others integrate with load balancers or have heartbeat systems. For example, Fastly allows configuring regular health probes and will stop sending traffic to an origin that’s marked unhealthy, distributing load among the healthy ones in a pool. Health check intervals might be on the order of a few seconds to a minute, depending on sensitivity.

When failover occurs, an important question is whether to automatically fail-back to the primary once it’s healthy again. Some systems will stick to the secondary until the primary is confirmed healthy for some period (to avoid flapping back and forth). Others might direct a small percentage of traffic back to the primary as a test before full switch.

The terms fail-open vs fail-closed describe how the CDN behaves if no origin is available (both primary and backups have failed) or if other errors occur. In a fail-closed approach, the CDN will simply propagate an error to the user (e.g., HTTP 503 Service Unavailable) if it cannot get fresh content from any origin. This is the safer choice when serving stale content is not acceptable (e.g. highly dynamic or personalized data). However, for mostly-static content, a fail-open strategy can dramatically improve resiliency: the CDN can serve stale content from its cache if the origin is down or unreachable. Essentially, if an object was in cache but has expired, and the CDN attempts to revalidate or refresh it but the origin doesn’t respond, the CDN can choose to “fail open” by serving the client the last cached version (potentially with a warning header) rather than an error page. Many CDNs support this via cache directives like stale-if-error. Fastly, for example, allows enabling serving stale on origin failure: if enabled, when origin is sick and a stale copy exists, Fastly will deliver the stale content instead of an error. This ensures continuity – users might get slightly outdated content, but the site remains up. Once the origin comes back or the error condition clears, normal fresh content serving resumes. Fail-open is often used for static assets or even whole pages if some staleness is acceptable during outages. The CDN essentially becomes a buffer that keeps sites available. If fail-open is not configured, then the CDN is strict (fail-closed) and will return an error if it can’t get a valid response from any origin or a fresh cache copy.
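
Combining this with the failover sketch above, a fail-open path could look roughly like the following; the cache entry layout, the 300-second freshness TTL, and the stale-if-error window are assumptions for illustration.

```python
import time

STALE_IF_ERROR = 24 * 3600        # how long past expiry stale content may still be served
cache = {}                        # path -> {"body": bytes, "expires": epoch seconds}

def get(path: str, fetch_with_failover) -> bytes:
    entry = cache.get(path)
    now = time.time()
    if entry and now < entry["expires"]:
        return entry["body"]                       # fresh cache hit
    try:
        body = fetch_with_failover(path)           # refresh from the origin(s)
        cache[path] = {"body": body, "expires": now + 300}
        return body
    except Exception:
        # Fail open: the origin is down, but a stale copy beats an error page,
        # as long as it is not too far past its expiry.
        if entry and now < entry["expires"] + STALE_IF_ERROR:
            return entry["body"]
        raise                                      # fail closed: nothing usable cached
```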

In configuring origin failover, one must also consider consistency and state. If the content is truly identical on the primary and secondary origins (e.g., mirrored S3 buckets), failover is seamless. But if the secondary is a simplified backup (like a static “sorry, we’re down for maintenance” page or a subset of content), the CDN might just serve that limited content. Advanced setups can even involve geographically distributed origins (multi-master) where the CDN chooses the closest origin to fetch from, and if that fails, tries the next closest, etc., combining proximity and failover.

Origin selection isn’t only about failure – CDNs can also load-balance across origins (for example, if you have active-active origins in two regions, a CDN may always use the one closest to the user as first choice, but switch if it’s unavailable). Policies can be static (configured priority) or dynamic (based on health/performance).

In summary, CDNs provide robust origin redundancy: they will try the primary, and if it fails (either detected via error or via health check), they seamlessly route requests to a backup. They minimize user impact by quickly detecting issues (often within one or two failed requests) and by optionally serving cached content in the interim. This greatly increases the overall availability of the service seen by end-users, as even a complete origin outage might go unnoticed if the CDN can continue serving cached data (and perhaps show a banner or limited functionality). Designing for failure at the origin level is a crucial part of CDN-backed architecture in any large-scale system design.

Core Capacity Management: Bandwidth, Surge Queues, Warming, and Connections

Operating a CDN at scale requires managing capacity and ensuring reliable performance during both steady state and traffic surges. Capacity levers refer to the various ways CDNs handle high load and optimize resource usage at each PoP.

Firstly, each PoP (or edge data center) is provisioned with a certain bandwidth capacity – both in terms of the servers’ network interface (often 10–100 Gbps NICs per server) and the upstream transit/peering links that connect that PoP to the internet. CDN operators monitor link utilization and will add capacity or reroute traffic if a link gets too full. At the server level, software can rate-limit or prioritize traffic to ensure critical content isn’t starved by less important transfers. In an interview context, it's worth noting that CDNs carefully plan their PoP placement and connectivity so that no single PoP becomes a bottleneck for an entire region. If one location nears its capacity, traffic might be load-balanced to a nearby PoP via DNS steering or other means.

During sudden surges in requests, such as a flash crowd event or even a regional outage that shifts traffic elsewhere, edge servers might see more concurrent requests than they can instantly handle. Instead of dropping them, CDNs may employ a surge queue – a short queue that holds excess requests briefly until the servers can catch up. This is analogous to how a web server might queue connections if it’s at max throughput. For example, AWS Classic Load Balancers keep a surge queue of pending requests when all backends are busy, and if that queue fills up, new requests are rejected; their SpilloverCount metric tracks how often the surge queue overflows, which ideally stays at zero – a nonzero value means the system was overloaded at times. Managing surge capacity often involves having a cushion of extra headroom and being able to autoscale or bring more servers online (though CDN PoPs are relatively static during an event; autoscaling happens over minutes, not seconds, so other mechanisms handle instantaneous spikes).
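
Conceptually, a surge queue is just a small bounded buffer plus a spillover counter, as in this sketch (the queue size is arbitrary, and the names are illustrative):

```python
import queue

surge_queue: queue.Queue = queue.Queue(maxsize=1024)   # brief holding area for excess requests
spillover_count = 0                                     # the metric you want to stay at zero

def accept(request) -> bool:
    """Called when all workers are busy; returns False if the request must be shed."""
    global spillover_count
    try:
        surge_queue.put_nowait(request)    # parked briefly until a worker frees up
        return True
    except queue.Full:
        spillover_count += 1               # queue overflowed: record the spillover
        return False                       # shed load: return 503 or steer elsewhere
```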

Another lever is pre-warming or cache warming. This was touched on earlier in the context of prefetching, but more broadly it means getting the CDN ready for anticipated load. If you know a major event is coming (e.g., a big software update at 10am PST to be delivered via CDN), you might want to warm the caches by pushing the content to edges or at least to regional caches beforehand. CDNs sometimes expose APIs or tricks to do this (for example, there is no direct “push to edge” in many CDNs, but one can simulate it by issuing HEAD requests or using a purge/prefetch tool). Cache warming is a proactive approach: load frequently accessed data into cache before users request it. This avoids the stampede of cache misses at the start of the event. In the absence of explicit warming, CDNs with Origin Shield will at least warm the shield cache on first request so that only one shield node hits origin for the initial wave – but truly warming every edge can further improve the first-user experience across different regions. AWS, for example, suggests pre-warming CloudFront caches for mega launches (they sometimes help by gradually ramping up traffic). If not done carefully, pre-warming can itself look like a surge of traffic (albeit controlled) to the origin, so it should be paced.

Connection pooling and reuse is another critical capacity and performance technique. Each edge server maintains persistent keep-alive connections to origin servers (or to the next tier cache). Instead of opening a new TCP connection for every cache miss, which would be very expensive under high request rates, the CDN reuses connections so that multiple requests share the same TCP/TLS session. This improves efficiency in several ways: it amortizes handshake overhead, it allows the TCP congestion window to grow to optimal size over a series of requests, and it reduces the number of sockets the origin has to deal with. Consider that an edge may handle thousands of client requests per second for a given origin – pooling these into, say, a few dozen long-lived connections to origin is far more scalable. CloudFront, for instance, has a configurable keep-alive timeout for origin connections and encourages using persistent connections to improve performance. The benefit is evident: once an edge-origin connection has been established and perhaps already ramped up its throughput (during previous transfers), subsequent requests can immediately utilize the available bandwidth. A well-known design is for a local edge to maintain a warm connection with the regional cache or origin so that it doesn’t have to perform slow start from scratch each time. If that connection stays open, its congestion window might already be near the maximum, allowing new content to flow at high rate from the first packet. Many CDNs also implement HTTP/2 to origins (if supported) to multiplex multiple origin fetches on one connection, further reducing the load. However, some origins (like cloud storage endpoints) might only support HTTP/1.1, in which case connection reuse is even more important (to avoid redoing TLS handshakes constantly).
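
As a client-side illustration using the third-party requests library, an edge process might funnel all origin fetches through one pooled, keep-alive session; the host name and pool sizes below are assumptions, not tuned values.

```python
import requests
from requests.adapters import HTTPAdapter

# One shared Session keeps a small pool of persistent connections to the origin,
# instead of opening a fresh TCP+TLS connection for every cache miss.

session = requests.Session()
session.mount("https://", HTTPAdapter(pool_connections=4, pool_maxsize=32))

def fetch_from_origin(path: str) -> bytes:
    # Reuses an idle pooled keep-alive connection when available; its congestion
    # window can still be warm from earlier transfers on the same connection.
    resp = session.get("https://origin.example.com" + path, timeout=5)
    resp.raise_for_status()
    return resp.content
```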

Other capacity levers include concurrency limits (throttling how many simultaneous origin fetches can happen for a given customer to protect the origin), rate limiting (to prevent abusive traffic), and dynamic prioritization (ensuring important content like HTML or API responses aren’t queued behind large asset deliveries). Some CDNs also have features like surge protection or waiting rooms – e.g., in extreme cases, queueing user requests at the edge and possibly serving a splash page if the origin is overwhelmed (common in ticket sale scenarios).

Finally, CDNs ensure there’s enough headroom per PoP by scaling out geographically. The core capacity of a CDN is not just per server, but aggregate: they might have, say, 5 Tbps of serving capacity in North America, spread across many PoPs. If one PoP reaches its bandwidth limit due to an event, they can route new users to a slightly farther PoP that has capacity. This is an advantage of having many distributed sites – capacity can be load-balanced at a coarse level by steering traffic. On a per-PoP basis, engineering typically includes “surge pools” – a portion of capacity reserved for unexpected bursts (often in the form of extra servers or bandwidth that runs below 100% utilization most of the time).

In summary, capacity management in CDNs involves: robust networking hardware (high bandwidth NICs and routers), intelligent request queueing to handle momentary overloads, pre-warming caches to avoid cold start issues, pooling connections to minimize overhead, and monitoring everything (with automation to add capacity or shift traffic when thresholds approach). These mechanisms ensure that even at peak loads, the CDN can serve content smoothly without collapsing under the pressure.

AWS CloudFront Example: Tie-In of Concepts

To ground these concepts, let’s look briefly at how AWS CloudFront (a major CDN service) implements some of them:

CloudFront pulls several of these threads together. It uses Regional Edge Caches as a mid-tier between edge locations and origins, with Origin Shield available as a final caching layer so that simultaneous misses across regions collapse into a single origin request. It terminates TLS at the edge, speaks HTTP/2 to clients, and offers HTTP/3 (QUIC) as an opt-in across all edge locations. For resiliency, origin groups define a primary and secondary origin with configurable failure status codes that trigger failover, and a configurable keep-alive timeout governs how long persistent origin connections are reused. Each of these maps directly onto the tiered caching, transport, failover, and connection-pooling mechanics discussed above.

Conclusion: A CDN’s architecture is a layered, dynamic system that brings content closer to users, accelerates delivery with protocol optimizations, and safeguards origin infrastructure through clever caching and routing strategies. Understanding concepts like cache hierarchies, request collapsing, anycast vs DNS routing, large-file chunking, and failover behavior is crucial for system design discussions. With these in hand, you can reason about how a CDN achieves performance and reliability, whether in an interview or a real-world design. By abstracting these mechanisms (as we did conceptually), one can apply them to any CDN or even design a simplified CDN-like system: use caches at multiple levels, keep what’s hot in fast storage, collapse duplicate work, tune the network stack, and always plan for failures. The result is a resilient content delivery pipeline that transparently improves user experience at global scale.

system-design