Caching in System Design
Apr 30, 2025
Fundamental Concepts
What is Caching?
Caching is a technique of storing frequently accessed data in a fast, temporary storage layer (the cache) so future requests for that data can be served more quickly (What Is Caching? How Caching Works and Its Limitations.). By keeping data in memory (or other fast storage) rather than always fetching from a slower backend (like a database or disk), applications reduce latency and load on the primary data source (System design: Caching Strategies - DEV Community). In essence, caching trades off freshness of data for speed: it relies on the principle of locality (recently or frequently used data is likely to be used again) to deliver quick responses, at the risk of serving slightly stale data (A Hitchhiker's Guide to Caching Patterns | Hazelcast). This trade-off is usually worthwhile in systems where performance and scalability are critical.
Cache Hits vs. Cache Misses:
A cache hit occurs when the requested data is found in the cache, allowing the system to return the data quickly without touching the backend source. A cache miss means the data was not in cache, so the system must retrieve it from the original source (incurring higher latency) and—often—store it in the cache for next time. High cache hit rates are the goal, as each hit saves expensive work, improving throughput and response times (System design: Caching Strategies - DEV Community). Too many misses can negate the benefits of caching; in fact, if an application rarely reuses data, checking the cache can add unnecessary overhead. In such cases (almost always misses), performance might worsen because the app spends time looking in the cache and still ends up going to the database (What Is Caching? How Caching Works and Its Limitations.). Therefore, understanding data access patterns (to ensure a cache will be reused) is critical before introducing a cache.
Common Cache Eviction Policies:
Caches have limited space, so when they’re full, some data must be evicted (removed) to make room for new entries. Cache eviction policies decide which data to remove. Choosing the right policy can greatly impact cache effectiveness. Some common strategies include:
- Least Recently Used (LRU): Evict the item that has gone unused for the longest time. The assumption is that if data hasn’t been accessed in a while, it’s less likely to be needed soon (What Is Caching? How Caching Works and Its Limitations.). LRU leverages temporal locality, and it’s a sensible default for many scenarios. Use case: General-purpose caches where recent access is a good predictor of near-future access (e.g. caching web pages or UI views, where users tend to revisit recent items). Example: A product catalog might use LRU to cache recently viewed items, on the theory that users will likely view those again soon (System design: Caching Strategies - DEV Community).
- Least Frequently Used (LFU): Evict the item that has been accessed the fewest times. This policy tracks access counts, favoring items that are frequently used. It’s useful when certain items are “hotter” than others and should remain in cache. Use case: Caching content where some entries (like top news articles or popular products) are accessed much more often than others (System design: Caching Strategies - DEV Community). LFU keeps the most popular items in cache. The downside is LFU can cling to items that were heavily used in the past but no longer popular, unless the counts are aged or reset over time.
- First-In, First-Out (FIFO): Evict the oldest item added to the cache (the one that entered first). This treats the cache like a queue, without regard to how often or recently items are accessed. Use case: FIFO is simple and sometimes used when items naturally expire after a certain time or in specific scenarios like streaming data where old data is less relevant (System design: Caching Strategies - DEV Community). However, it doesn’t adapt to actual usage patterns, so it may evict still-useful data that just happened to be loaded early.
- Random Replacement: Evict a random item. This policy doesn’t use any access pattern information. It’s generally a baseline or used in specialized cases where tracking usage is too costly. Use case: In distributed caches or hardware caches where maintaining LRU/LFU lists is impractical, a random eviction might be chosen for simplicity (System design: Caching Strategies - DEV Community). Random can sometimes approximate the performance of more complex policies, but it’s suboptimal in most software scenarios.
- Most Recently Used (MRU): Evict the item that was most recently accessed (the opposite of LRU). This sounds counterintuitive, but it’s useful in workloads where the most recently used items are least likely to be reused. Use case: Consider a sequential scan or one-time processing of data: once an item is used, it won’t be needed again soon. In such cases (e.g., certain buffering or streaming scenarios), MRU evicts the newest items, making room for older items that may soon cycle back into use (ByteByteGo | Cache Eviction Policies). MRU is a niche strategy but can be ideal for specific patterns like cyclic buffers or when you know that recent items are transient.
Eviction Policy Summary: Different policies excel under different access patterns. For example, LRU works well when recently used data tends to be reused (temporal locality), LFU works when a subset of items are accessed much more frequently than others, and MRU can help in cases like stack or stream access patterns. Many caching systems use LRU (or variants like LRU-K or Windowed LFU) by default as it aligns with typical workloads, but tuning the eviction policy to your application’s access characteristics can improve hit rates.
(See table below for a quick comparison of eviction policies and their ideal use cases.)
| Eviction Policy | Principle | Ideal When… |
|---|---|---|
| LRU (Least Recently Used) | Evicts the item unused for the longest time (What Is Caching? How Caching Works and Its Limitations.). Assumes long-unused data is least likely needed soon. | Recent access is a good predictor of near-future use (temporal locality) – e.g. user viewing history, recently used documents. |
| LFU (Least Frequently Used) | Evicts the item with the fewest accesses (What Is Caching? How Caching Works and Its Limitations.). Focuses on keeping the most popular items. | Access frequency is highly skewed (some items are hot, others rarely used) – e.g. popular products, top search queries. |
| FIFO (First-In, First-Out) | Evicts the oldest entry in the cache, with no regard to usage (System design: Caching Strategies - DEV Community). | Items naturally expire with time or one-time use is common – e.g. streaming data, queue processing where old data becomes irrelevant. |
| Random | Evicts a random item (System design: Caching Strategies - DEV Community). Does not track usage patterns. | Simplicity is paramount or caching at hardware level – e.g. simple caches where implementing LRU/LFU is impractical. (Usually a fallback strategy.) |
| MRU (Most Recently Used) | Evicts the most recently accessed item first ([ByteByteGo \| Cache Eviction Policies](https://bytebytego.com/guides/most-popular-cache-eviction/#:~:text=Most%20Recently%20Used%20)). Frees items likely already used up in one-time or sequential workloads. | Sequential or cyclic access patterns where the newest item won’t be reused soon – e.g. one-pass scans, streaming buffers. |
Note that many real caches use a combination of strategies (e.g. TLRU – Time-aware LRU with TTL expiration, or segmented policies like Two-Queue (2Q), which combines a FIFO queue for newly seen items with an LRU queue for items accessed more than once). The eviction policy should be chosen based on data access patterns to maximize the cache’s hit ratio.
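To make the eviction discussion concrete, here is a minimal, illustrative LRU cache in Python (a sketch built on the standard library, not any particular caching product):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU sketch: evicts the entry that has gone unused the longest."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()        # insertion order doubles as recency order

    def get(self, key):
        if key not in self._data:
            return None                   # cache miss
        self._data.move_to_end(key)       # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry
```

Swapping the eviction line (e.g. tracking hit counts for LFU, or popping the newest entry for MRU) is all it takes to experiment with the other policies above.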
Types of Caches
Caching can occur at various layers of a system. Broadly, we can categorize caches into client-side vs. server-side (including intermediary layers like CDNs and databases). Each type serves different purposes in a distributed architecture:
Client-Side Caching (Frontend)
Definition & Mechanism: Client-side caching stores data on the end-user’s device or in the user’s application process. This could be a web browser cache, mobile app cache, or any cache in the client application. When a client (browser/app) requests a resource, it can keep a copy locally so that subsequent requests for the same resource are served from local storage instead of over the network (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). For example, web browsers cache static assets (HTML, CSS, JavaScript, images) and even API responses. The next time you visit the same page, the browser can load files from its cache instantly rather than fetching from the server, drastically reducing page load time.
Common Client-Side Caches:
- Browser Cache: Modern browsers automatically cache resources based on HTTP headers. Images, scripts, and stylesheets are stored so revisiting a site or navigating within it is faster. Browsers also implement specific caches like DNS cache (to remember domain lookups) and even service worker caches for offline-capable web apps.
- Mobile App Cache: Mobile and desktop apps might cache API responses, images, or database query results on device storage. For example, a news app might cache the latest articles so that if the user comes back later (even offline), it can show something without a fresh network call.
- Local Storage / Session Storage: Web apps can explicitly store data (user preferences, last viewed items, etc.) in local storage on the client for quick retrieval without a server round-trip (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
Use Cases & Examples: A classic example of client caching is YouTube’s video caching in its app/browser. YouTube will save recently watched video data, thumbnails, and metadata in the browser or app cache. This way, if you replay a video or navigate back to it, it loads smoothly without re-downloading everything (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). Similarly, single-page applications (SPAs) cache JSON API responses; for instance, a weather app might cache the last fetched weather data for your city so it can show something immediately while fetching the latest in the background. Client-side caching improves perceived performance and enables offline or flaky-network functionality.
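Browser caches are steered by HTTP response headers. As a hedged sketch (Flask is an assumption here; the report doesn’t prescribe a framework, and the asset bytes are a placeholder), an origin server might mark a static asset as cacheable like this:

```python
from flask import Flask, Response

app = Flask(__name__)
FAKE_LOGO = b"\x89PNG..."   # placeholder bytes standing in for a real image file

@app.route("/static/logo.png")
def logo():
    resp = Response(FAKE_LOGO, mimetype="image/png")
    # The browser may reuse this response for a day without contacting the server.
    resp.headers["Cache-Control"] = "public, max-age=86400"
    # An ETag lets the browser revalidate cheaply afterwards (304 Not Modified).
    resp.set_etag("logo-v1")
    return resp
```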
Pros & Cons:
✔ Fast local access: Data served from the device’s own storage or memory is extremely fast, with zero network latency (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✔ Reduced server load: Fewer requests go to the server, since repeated access is satisfied locally (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✔ Offline capability: The app can function (at least partially) without network access using cached data (great for mobile apps, PWAs, etc.) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Data staleness: The cache might serve out-of-date data if the client is offline or hasn’t fetched updates (e.g., you might see yesterday’s data until a new fetch happens) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Storage limits: Browsers and devices limit how much can be stored. Large caches might be evicted by the browser or OS, and different devices have varying capacities (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Per-user only: A client-side cache is local to one user; it doesn’t help with data that could be shared across users. For shared caching, we rely on server/CDN caches.
Server-Side Caching (Backend)
Definition & Mechanism: Server-side caching happens on the server or in the infrastructure between the client and the primary data store. It refers to caches that the server (or a cluster of servers) uses to store responses or expensive computations so that future requests can be served faster. The first time a request is made, the server generates the response (perhaps by querying a database or performing calculations) and then stores that result in a cache (often an in-memory store like Redis or Memcached). Subsequent requests check the cache first and, if a cached result is available, return it directly without hitting the database (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
Server-side caching can take many forms:
- Application-level in-memory caches: e.g., an API server might keep an in-memory dictionary of recently used results, or use a caching library (see the sketch after this list).
- Distributed caches: dedicated cache servers (or cloud services) like Redis or Memcached that store data for use by multiple application servers.
- Disk caches: sometimes servers also cache on disk (slower than RAM but larger), for instance, to store large results or files.
- CDNs and reverse proxies (though these are often considered a separate category, covered below).
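For the application-level in-memory case above, Python’s built-in functools.lru_cache is often enough. A minimal sketch (the product lookup is a hypothetical stand-in for a slow database call):

```python
from functools import lru_cache
import time

def query_database(product_id: int) -> dict:
    time.sleep(0.1)                      # stand-in for a slow database round trip
    return {"id": product_id, "name": f"product-{product_id}"}

@lru_cache(maxsize=1024)                 # keep up to 1024 recent results in process memory
def get_product(product_id: int) -> dict:
    return query_database(product_id)

get_product(42)   # miss: pays the "database" cost
get_product(42)   # hit: served instantly from the in-process cache
```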
Use Cases & Examples: One common use case is caching database query results. Suppose an e-commerce site’s backend frequently needs to fetch product details or the result of a complex search query. Instead of querying the database for each user’s identical request, the server can cache the result in memory. For example, if user A searches for “Nike shoes” and the server fetches that from the DB, it can cache the list of Nike shoes. When user B (or A again) performs the same search, the server returns the cached results immediately (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). This dramatically cuts down database load and response time. Many websites cache user session data, user profiles, frequently viewed items, or computed aggregates on the server side.
A real-world example is Twitter: On the backend, Twitter caches things like trending topics, user timelines, or tweet objects in a fast in-memory store (and also uses CDNs for media). This prevents millions of expensive database or API calls, allowing Twitter to serve content to users quickly and handle huge volumes of reads (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). Another example is Amazon: Amazon’s product pages and search results employ heavy server-side caching. When millions of users search for a popular item (say “iPhone”), Amazon can serve the results from cache instead of querying the database each time, ensuring fast and reliable responses under massive load (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium).
Pros & Cons:
✔ Shared benefit: A server-side cache is typically shared by all users of the system. If one user’s request populates the cache, many others benefit from that cached data (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). This is great for scaling reads in multi-user systems.
✔ Reduced database/API load: By absorbing repeated requests, the cache offloads work from the primary datastore or external APIs, which can improve overall throughput and allow the system to handle higher traffic (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✔ Centralized control: The caching behavior (invalidation, TTLs, etc.) can be managed in one place on the server, and it’s easier to update if underlying data changes (compared to invalidating many client caches).
✖ Consistency management: When data is updated in the database, the server-side cache may have stale data that must be invalidated or updated. Ensuring the cache doesn’t serve obsolete information can be complex (the classic cache invalidation problem) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Propagation delay: In some cases, especially with distributed caches, an update might take time to propagate to all cache nodes or may not immediately reflect to users, so recent changes might not appear instantly (trading consistency for performance) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Overhead & cost: Running a large cache cluster (Redis nodes, etc.) is an additional component of your architecture. While it eases DB load, it’s another system to monitor, scale, and pay for. Also, very dynamic data might not benefit if it changes before cache hits occur (wasted caching).
Content Delivery Networks (CDNs) and Edge Caching
A CDN is a specialized type of cache that lives at the network edge (geographically closer to users). CDNs cache content on proxy servers distributed around the world (or a region) so that users can retrieve data from a nearby location rather than from the origin server every time (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). CDNs are often provided by third parties (like Cloudflare, Akamai, Amazon CloudFront, Google Cloud CDN), and they are heavily used for static assets and media, though they can also cache certain dynamic content.
Use Cases & Examples:
- Static Content Caching: This is the classic use: images, CSS/JS files, videos, and downloadable files are served via CDN. For instance, when you stream a Netflix or YouTube video, you are usually getting it from a CDN server near your region, not from Netflix’s or Google’s central servers. Netflix places copies of popular shows and movies on CDN edge servers (including their own Open Connect appliances) worldwide. So when a user hits “Play”, the video segments stream from a nearby cache, resulting in minimal buffering and load on the core servers (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). Similarly, YouTube uses Google’s extensive edge cache network to serve videos and thumbnails from edge locations, which is why videos often start immediately and run smoothly even for millions of concurrent viewers (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community).
- Edge caching dynamic content: With advancements, CDNs now also support caching dynamic or semi-dynamic content. For example, Google Search results themselves are cached at edge locations for popular queries. If many users in an area search for “weather tomorrow”, the CDN might serve the cached result page (if it’s up-to-date) rather than hitting Google’s core servers every time (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). CDNs use mechanisms like short TTLs or cache validation to ensure content is fresh enough. Another technique is Edge Side Includes (ESI), where fragments of a page are cached at edge. Facebook, for example, might cache the general layout and static parts of a page at a CDN, but not personal data.
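This behavior is largely driven by HTTP response headers from the origin. The directive values below are illustrative assumptions, not taken from any of the cited systems:

```python
# Headers an origin might attach so that CDN edge nodes (shared caches) keep a
# semi-dynamic page briefly while browsers keep it even more briefly.
edge_cache_headers = {
    # Browsers: reuse for 60 s. CDN edges (s-maxage): reuse for 5 minutes, and
    # serve a stale copy for up to 30 s while refreshing it in the background.
    "Cache-Control": "public, max-age=60, s-maxage=300, stale-while-revalidate=30",
}
```

Content that must change sooner is handled with shorter TTLs or explicit purge requests to the CDN.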
Benefits: The primary benefits of CDNs are reduced latency (since the data travels a shorter distance to the user) and offloading traffic from the origin servers. By “bringing the most requested data closer to users, Google reduces latency, improves speed, and saves backend resources.” (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community) In practice, a CDN can massively reduce load on your origin – many requests are served entirely by edge caches without ever reaching your servers. This also provides resilience: even if the core site is under high load, the CDN may continue serving cached content to users.
CDNs are essential for global scale systems. Companies like Amazon use CloudFront (a CDN) to distribute images, scripts, and even whole HTML pages for their e-commerce site, ensuring fast page loads globally. Facebook and Google have their own edge cache networks; for example, Google’s edge caching (part of Google’s CDN infrastructure) stores YouTube videos and even Google Search results in edge servers at ISPs and points-of-presence around the world (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). This edge layer (often called Google Global Cache in Google’s case) is why a search query or a YouTube video feels instant in many locations – much of the content was served from a local cache rather than halfway across the world.
One more advantage: CDNs and edge caches absorb traffic spikes and can shield origin servers from surges or certain attacks (CDNs often act as a layer of defense, continuing to serve cached content for a short time even if the origin fails).
Challenges: Not all content can be cached on CDNs easily—highly personalized or real-time data usually must go to the origin. CDNs also have to be configured with appropriate cache invalidation strategies (e.g., purging or using short TTL for content that updates frequently). There’s also eventual consistency to consider: a user update might invalidate some cached content, and there can be a brief window where one edge node has old data while others have new.
Database Caching (Query Caches, Redis, Memcached)
This refers to caching at or near the database layer. Many databases have internal caches (like MySQL’s query cache, or disk page caches) – those are important but typically automatic. In system design, when we say database caching we often mean using an external cache store to relieve the database. Redis and Memcached are two extremely popular technologies for this purpose. They are in-memory key-value stores often deployed as a layer in front of a database or used by the application to store frequently accessed data.
Memcached is a high-performance, distributed memory cache. It’s simple: it stores key→value pairs entirely in RAM, offering very fast lookups. Companies have used Memcached to cache database results, computed pages, or any object that is expensive to regenerate. A famous example is Facebook – Facebook leveraged Memcached to build a massive distributed cache for their social graph data. This cache tier handles billions of reads per second, dramatically reducing the load on Facebook’s MySQL databases (Scaling memcache at Facebook - Engineering at Meta). In fact, Facebook’s ability to scale to billions of users was aided greatly by their Memcached layer, which stores user sessions, profile info, and the results of complex queries so that the web servers rarely need to directly hit the databases for common reads (Scaling memcache at Facebook - Engineering at Meta).
Redis is another popular in-memory cache (which offers more features than Memcached, such as persistence, richer data types, and atomic operations). Redis is often used to cache database query results or as a fast lookup table for things like user authentication tokens, session data, or computed scores/rankings. For example, a web app might store the 100 most recent posts in a Redis list for a feed, or cache the result of a heavy analytic query. Redis is also commonly used as a session store or message cache. Many web frameworks allow plugging Redis in so that each page load can quickly fetch session info from Redis instead of a slower database (System design: Caching Strategies - DEV Community). Redis can persist to disk periodically, which gives it a bit more safety for data longevity than Memcached (which is purely in-memory).
Use Cases: Database caching is useful when you have read-heavy workloads. For instance, an online store might cache product details or inventory counts in Redis to avoid hitting the SQL database for each page view. If the inventory changes, the cache can be updated or invalidated. Another use case: results of expensive queries (say a complex JOIN or an aggregate) can be cached for a few minutes. Reporting dashboards often do this—compute a heavy report once, cache it, and serve the same result for subsequent requests until it expires.
Example – Amazon: Amazon’s architecture includes a caching layer (Amazon ElastiCache service supports Redis/Memcached) to offload reads from databases. Amazon has noted that caching the results of popular product queries (like a search or a product detail page) in a fast cache has been key to delivering quick results during high-traffic events (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). Rather than every user’s query hitting the underlying product database, many can be served from cache.
Distributed Cache considerations: Both Redis and Memcached can be clustered (distributed across multiple servers) to increase capacity and throughput. A distributed cache means that data is partitioned (and often replicated) across multiple nodes, and clients can connect to any node and retrieve data as needed. This allows the cache to scale horizontally and not be a single point of failure. In large systems, it’s common to have a distributed caching tier with several nodes: this tier can be treated as a shared fast lookup service by all application servers (What Is Caching? How Caching Works and Its Limitations.). The cache cluster might store a huge number of key-value entries representing many users’ data or many query results. This greatly improves scalability, because as demand grows you can add more cache nodes. (Of course, partitioning means a given key is on one node, so that node must handle queries for that subset of data.)
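A toy sketch of how a client library might partition keys across cache nodes (simple modulo hashing for clarity; real deployments typically use consistent hashing so that adding or removing a node reshuffles fewer keys, and the node addresses here are hypothetical):

```python
import hashlib

CACHE_NODES = ["cache-1:6379", "cache-2:6379", "cache-3:6379"]   # hypothetical node addresses

def node_for(key: str) -> str:
    """Every client maps a given key to the same node, so each node owns one subset of the data."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return CACHE_NODES[digest % len(CACHE_NODES)]

node_for("user:42")   # e.g. 'cache-2:6379' — all app servers agree on which node holds this key
```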
When not to use DB caching: If your data changes constantly and clients rarely read the same data twice, caching might not help. Also, if strong consistency is required (always seeing the latest data), caching must be designed carefully (potentially with very short TTLs or write-through schemes). We’ll discuss cache consistency soon.
In summary, database caching using tools like Redis/Memcached is a fundamental pattern to achieve high performance. It’s employed by virtually all big web platforms: Twitter caches tweets and timelines in memory, YouTube/Google caches video metadata and search results, Amazon caches product info, Facebook caches social data in Memcached, and so on. Each of these drastically reduces direct database load and allows these platforms to scale to millions or billions of operations per second.
Cache Strategies & Patterns
Designing caching in a system isn’t just about whether to cache, but also how the cache interacts with reads and writes. Several common caching strategies/patterns dictate this interaction. The main ones are Cache-Aside (Lazy Loading), Read-Through, Write-Through, Write-Around, and Write-Back (also called Write-Behind). Each pattern balances consistency, write complexity, and read performance differently. Let’s explain each with examples:
Cache-Aside (Lazy Loading) Pattern
In the cache-aside pattern, the application code is responsible for interacting with the cache. The cache sits alongside the data store, and the app pulls data into the cache as needed. On a cache miss, the application goes to the database (or source of truth), fetches the data, returns it to the user, and at the same time inserts it into the cache for next time (How Facebook served billions of requests per second Using Memcached). On a cache hit, of course, the data is returned directly from the cache. This is often called lazy loading because data is loaded into the cache only when first used, not in advance.
For writes or updates, in pure cache-aside, the application usually writes to the database and then invalidates the cache entry (or updates it). That is, the cache is updated after the data store is updated (or the cached entry is simply cleared, so the next read will fetch fresh data from the DB).
Example scenario: Imagine a social media app using cache-aside. When a user requests their profile, the service first checks a Redis cache for the profile data. If it’s not there, the service queries the primary database, gets the profile, and then stores that result in Redis with an appropriate key. Subsequent requests for that profile (by the same or even other services) hit Redis and avoid the database. If the user updates their profile, the service will write the changes to the database and invalidate the cache entry for that profile (so that next read will fetch the updated data from DB and then cache it anew).
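A minimal cache-aside sketch of that profile flow, assuming a local Redis instance reachable through the redis-py client and using an in-memory dict as a stand-in for the primary database:

```python
import json
import redis   # assumes redis-py and a Redis server on localhost

cache = redis.Redis()
FAKE_DB = {"42": {"name": "Ada", "bio": "mathematician"}}   # stand-in for the real database
TTL_SECONDS = 300

def get_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:                               # cache hit: no database work
        return json.loads(cached)
    profile = FAKE_DB[user_id]                           # cache miss: read the source of truth
    cache.setex(key, TTL_SECONDS, json.dumps(profile))   # lazily populate for next time
    return profile

def update_profile(user_id: str, fields: dict) -> None:
    FAKE_DB[user_id].update(fields)                      # write the database first...
    cache.delete(f"profile:{user_id}")                   # ...then invalidate the stale entry
```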
Use cases: Cache-aside is the most widely used caching pattern (A Hitchhiker's Guide to Caching Patterns | Hazelcast) because of its simplicity and explicit control. It’s used whenever the application wants fine-grained control over cache contents. For example, Facebook’s Memcache usage is essentially cache-aside – the app servers check memcache first, on miss query MySQL and populate memcache (How Facebook served billions of requests per second Using Memcached). This pattern is also common in ORM data caching or in any scenario where you manually code caching.
Pros:
- Simple to implement in application logic (get, check, set logic).
- Cache only contains data that was actually used (no need to populate upfront).
- Failure of cache is not catastrophic: if the cache goes down, the database is still authoritative (though heavier load will hit DB). The app can continue operating (albeit slower) by going to the DB.
- The cache can easily be invalidated by simply deleting keys when underlying data changes.
Cons:
- Every cache miss results in higher latency (the first user to request something pays the cost of the database fetch).
- Cache consistency is managed by the app: if the app forgets to invalidate or update the cache on a write, stale data will persist.
- There’s a small chance of a race condition: two requests for the same key on a cold cache could both query the database if they happen simultaneously, because neither finds it in cache (this can be mitigated by locking or other techniques, see Thundering Herd problem later).
Read-Through Cache Pattern
Read-through is similar to cache-aside in outcome (data gets loaded on demand), but the difference is the caching system itself automatically loads missing data from the backing store. In a read-through pattern, the application doesn’t directly fetch from the database on a miss; instead, it always interacts with the cache. If the data is not in cache, the cache service (assuming it’s a sophisticated cache or integrated middleware) will fetch from the database, populate itself, and return to the application (A Hitchhiker's Guide to Caching Patterns | Hazelcast).
In other words, the cache is given a “loader” capability. For example, some caching libraries allow you to register a callback or use an integrated data store connection so that whenever your code asks the cache for key “X” and X is not present, the cache layer itself knows how to get X from the database and update the cache.
Example scenario: Suppose you use a caching library in your application that supports read-through. Your code might call cache.get(userId). If the user data is cached, it returns. If not, the cache internally queries the users table in the DB (perhaps via a provided DAO or callback), stores the result in the cache, and then returns it. The application code is simpler (just get from cache), and it doesn’t have to explicitly manage cache population logic.
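A hedged sketch of the idea (the loader and in-memory store below are illustrative stand-ins, not any specific library’s API):

```python
class ReadThroughCache:
    """The cache owns a loader, so callers only ever talk to the cache."""

    def __init__(self, loader):
        self._loader = loader            # knows how to fetch a missing key from the backing store
        self._store = {}

    def get(self, key):
        if key not in self._store:       # miss: the cache loads the value itself
            self._store[key] = self._loader(key)
        return self._store[key]

def load_user_from_db(user_id):          # hypothetical loader callback
    return {"id": user_id, "name": f"user-{user_id}"}

users = ReadThroughCache(loader=load_user_from_db)
users.get("42")   # application code never queries the database directly
```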
Use cases: Read-through caches are common in frameworks or cloud services. For instance, AWS DynamoDB Accelerator (DAX) is an example of a read-through cache in front of DynamoDB: the app queries DAX as if it were the database; DAX checks its cache and automatically fetches from DynamoDB on a miss (Caching patterns - Database Caching Strategies Using Redis). Similarly, an ORM’s second-level cache might be read-through – your code asks the cache for an object, and the cache layer goes to DB if needed. This pattern is good when you want to decouple cache logic from business logic, delegating the loading to a cache service or library.
Pros:
- Simpler application code (just read from cache as a single step). The cache provider handles misses, so the app doesn’t include fallback logic (A Hitchhiker's Guide to Caching Patterns | Hazelcast).
- Can batch or optimize loads internally. A sophisticated cache could e.g. aggregate multiple simultaneous requests for the same key into one DB query (preventing stampede).
- The application always sees consistent behavior (it doesn’t need to know if data came from cache or DB).
Cons:
- The cache system needs to know how to load data (so it requires either tight integration with the data source or a user-provided loader function). This couples the cache with the data source logic.
- Slightly less transparent: developers must trust the cache to fetch data, which might make error handling more complex (e.g., if the database call fails inside the cache, how does that bubble up?).
- Typically read-through goes hand-in-hand with write-through (explained next) to maintain consistency, which means you are often using a specific caching solution or service that supports both.
Write-Through Cache Pattern
In a write-through strategy, every write operation (create/update/delete) goes through the cache first, and the cache is responsible for writing it to the underlying database synchronously. Essentially, the application never directly writes to the database; it writes to the cache, which immediately writes to the database (Understanding write-through, write-around and write-back caching (with Python)). This ensures the cache is always up-to-date with the database — any data in the cache has been saved to the DB, and any new DB state is reflected in the cache.
How it works: When the app needs to update data (say change a user’s profile), using write-through it would update the cached value and the cache system would in the same operation write the new value to the database. Only after the database write succeeds would the operation be considered complete (Understanding write-through, write-around and write-back caching (with Python)). Because of this, data in cache is never stale relative to the database — they are in sync for writes.
Example scenario: Consider a financial system that caches account balances. Using write-through, when a transaction updates an account balance, the application writes the new balance to the cache. The cache then writes that through to the SQL database. This way, the cache always has the latest balance. Any read can be served from cache confidently, since no write is considered successful until it’s in the DB and cache. This is important for systems that cannot tolerate serving old data. For instance, many banking systems or payment systems would use a write-through or at least an immediate cache update approach to ensure consistency.
Another example: A user profile service might use write-through for critical fields. When a user updates their email, the system writes to the cache (which writes to DB). Subsequent reads get the new email from the cache.
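A hedged write-through sketch (a plain dict stands in for the database; the point is that the cache write and the database write happen together, synchronously, before the call returns):

```python
DATABASE = {}   # stand-in for the persistent store

class WriteThroughCache:
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        DATABASE[key] = value           # write the backing store synchronously...
        self._store[key] = value        # ...and the cache in the same operation
        return True                     # only now is the write considered complete

    def get(self, key):
        return self._store.get(key)    # reads of keys written this way are never stale

balances = WriteThroughCache()
balances.put("account:7", 125.50)
balances.get("account:7")              # 125.5, guaranteed to match DATABASE
```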
Pros:
- Strong consistency for reads: Since every update went through the cache, a read from the cache will get the latest data (no stale values, assuming writes always use the cache path). This makes the cache truly a single source of truth alongside the DB (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium).
- Simplified read logic: the data is fresh so you don’t need to worry as much about invalidation on reads.
- If using a distributed cache, write-through also propagates the data to the cache immediately, avoiding cases where one app server updates DB but another app server serves an old cached value.
Cons:
- Write latency overhead: Every write incurs the cost of updating both cache and database, which can be slower than writing just to the DB (Understanding write-through, write-around and write-back caching (with Python)). The user’s write operation isn’t complete until the database write is done (this is synchronous), so writes can be slightly slower than in a non-cached scenario (and definitely slower than write-back).
- Potentially unnecessary writes to cache if data isn’t read soon. If you write a value that won’t be read before it changes again, updating the cache was overhead (this is where write-around might be better).
- More complex to implement if not supported natively. Many cache systems support write-through via configurations or built-in store integration. Without that, implementing in application code means double-writing on each operation and handling partial failures (e.g., DB write succeeds but cache update fails or vice versa – you need error handling to keep them in sync).
Use cases: Write-through is ideal when you have a high read-to-write ratio and you cannot afford stale reads. For example, caching session data in an e-commerce site: you write-through so that any change (like items added to a cart) goes to both the DB and the cache, and all servers see the updated cart immediately. It’s also used when writes are not too frequent or when slightly slower writes are acceptable in exchange for fast reads of up-to-date data (financial data, inventory counts in a limited-stock sale, etc.). Systems that use read-through caches often pair them with write-through to keep the cache hot.
Write-Around (Write-Around Cache)
Write-around is a strategy where writes skip the cache and go directly to the data store, and the cache is updated only on the next read. In practice, this means on an update, you evict the item from cache (or don’t cache it at all initially), and only when someone tries to read that data later do you fetch the new value from the DB into the cache. It’s called "write-around" because write operations go around the cache.
How it works: On data modification, the application writes to the database as normal. The cache, if it has the old value, may be invalidated/removed, but the new value is not written to the cache at that time (Understanding write-through, write-around and write-back caching (with Python)). Reads still check the cache: if the updated data wasn’t read yet (cache was empty or invalidated), the first read will miss, fetch the latest from DB, then populate the cache.
Example scenario: Suppose we have a news site and an article’s content is updated. With write-around, when an editor updates the article, the system stores the new content in the database and removes the cached copy of that article (if any). It does not put the new content into the cache immediately. Now, when a reader requests that article, the cache is empty, so the app fetches from DB (getting the updated content) and then caches it for subsequent readers. Essentially, the cache only sees the new data when it’s actually needed. If that article is never read again (or not read for a long time), we avoided populating the cache with data that no one used.
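A hedged sketch of that article flow with plain dicts standing in for the cache and the database — the difference from write-through is simply that the write path skips the cache:

```python
def update_article(cache: dict, db: dict, article_id: str, new_content: str) -> None:
    db[article_id] = new_content           # the write goes around the cache, straight to the DB
    cache.pop(article_id, None)            # drop any stale cached copy; don't cache the new value yet

def read_article(cache: dict, db: dict, article_id: str) -> str:
    if article_id not in cache:            # first read after an update misses...
        cache[article_id] = db[article_id] # ...and repopulates the cache on demand
    return cache[article_id]
```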
Pros:
- Avoids cache pollution for infrequently read data: If you have a workload where a lot of data is written but not read often (write-heavy, read-sparse data), write-through would be wasteful (it’d put everything in cache, but many entries might never be read before eviction). Write-around addresses this by only caching data that is actively read (Understanding write-through, write-around and write-back caching (with Python)).
- Writes have slightly lower latency than write-through (only one write, to DB, is needed on update), so for heavy write loads it can improve throughput (Understanding write-through, write-around and write-back caching (with Python)).
Cons:
- Stale reads possible right after a write: If a user updates data and then immediately reads it, the first read under write-around will miss and go to the DB (which is fine); but if an old cached value somehow existed and wasn’t evicted, the read would return stale data. Typically you’d invalidate the cache entry on write to avoid that. The bigger issue is the next point.
- First read penalty for updated data: After a write, the next read will always be a cache miss (since the cache doesn’t have the new data yet), incurring a database hit. In a system with frequent reads after writes (temporal locality on write-read), this adds latency.
- If cache eviction (from capacity pressure) had already removed the item, the read misses anyway – the same first-read penalty applies.
Use cases: Write-around is useful when write traffic is high and you don’t want to bloat the cache with data that might not be re-read soon. For example, logging or audit data might be written constantly but rarely read; you wouldn’t cache every log entry on write. Another scenario: bulk loads or backfills of data into a system – you might disable caching or use write-around during that process to avoid overwhelming the cache with one-time writes. It’s an acceptable approach if the pattern is “write once, read maybe later” as opposed to “write and then read immediately”. A specific use-case: a stream processing pipeline where results are continuously written to a database, but consumers mostly query recent data (so the cache naturally fills with recent reads, and older writes that aren’t read don’t occupy cache). Write-around ensures the cache is populated on demand, not by every single write.
Write-Back (Write-Behind) Cache Pattern
Write-back caching takes a different approach: on an update, data is written only to the cache (marking it as dirty) and the write to the database is deferred to a later time (asynchronous) (Understanding write-through, write-around and write-back caching (with Python)). The cache thus absorbs write operations and “buffers” them, writing to the backing store in the background (or on certain triggers). This is analogous to a disk write-back cache or CPU cache write-back: it’s very fast for writes, but introduces an inconsistency window.
How it works: When a value is updated, the cache is updated immediately in memory and the user’s operation returns success as soon as the cache is updated. The cache then queues the write to persistent storage. Sometime later (could be milliseconds or seconds), the cache flushes the queued writes to the database (this could be batched or one-by-one). If multiple writes happen to the same data in a short time, the database might only see the final state (the cache could coalesce writes).
Example scenario: Consider a high-throughput analytics system collecting sensor readings. Using write-back, each incoming reading is written to a cache (in-memory store) for quick acknowledgment. Periodically or asynchronously, these readings are written in bulk to a database or file system. If the system crashes before the write-back, some data may be lost, but this might be acceptable for the application (or mitigated by replication of the cache). Another example: a user likes/unlikes a post rapidly. A write-back cache might handle those toggles in memory and only occasionally persist the final count to the database, reducing write load.
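A hedged write-back sketch: updates land only in memory, and a background thread flushes dirty entries to a stand-in database. A crash before the flush loses those entries, which is exactly the trade-off described above.

```python
import threading
import time

DATABASE = {}   # stand-in for the persistent store

class WriteBackCache:
    def __init__(self, flush_interval: float = 1.0):
        self._store, self._dirty = {}, set()
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, args=(flush_interval,), daemon=True).start()

    def put(self, key, value):
        with self._lock:
            self._store[key] = value     # memory only: the caller returns immediately
            self._dirty.add(key)         # remember that the database hasn't seen this yet

    def _flush_loop(self, interval: float):
        while True:
            time.sleep(interval)
            with self._lock:
                for key in self._dirty:                  # writes are coalesced: only the
                    DATABASE[key] = self._store[key]     # latest value per key is persisted
                self._dirty.clear()
```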
Pros:
- Very fast writes and high throughput: Since writes hit only memory (cache) and return, the latency is low and the system can absorb bursts of writes quickly (Understanding write-through, write-around and write-back caching (with Python)). This is ideal for write-heavy workloads. In fact, write-back can flatten a storm of writes into fewer ops to the DB (e.g., 100 rapid updates in cache might result in 1 database update if they were all to the same key).
- Reduced write load on DB: Many writes can be aggregated or collapsed. For example, a counter that’s incremented 1000 times per second in cache could be written to the DB as a single +1000 increment each second, massively reducing DB operations.
Cons:
- Risk of data loss: If the cache node crashes or is evicted before writing to the database, any in-flight updates are lost (Understanding write-through, write-around and write-back caching (with Python)). In critical systems, this is dangerous — you could lose transactions. This risk can be mitigated by using a replicated cache (so the data is at least in memory on multiple nodes until persisted) or by using a cache with persistence (e.g., Redis has an append-only file it can use to log writes).
- Complex consistency: During the period between the cache write and the DB write, the data in cache is newer than the database. If some other process or service queries the database directly, it will get stale data. Essentially, the source of truth is temporarily the cache, not the DB. This can confuse systems unless carefully managed (often systems that use write-back treat the cache + queued updates as the source of truth, and the DB is just eventual storage).
- Complex failure recovery: After a crash, one must reconcile the cache and database. Some systems on restart will scan the cache’s journal and apply any missing updates to the DB, etc. This complexity must be worth the performance gains.
Use cases: Write-back is suitable when performance is paramount and occasional data loss can be tolerated (or there’s another mitigation for it). For example, caching analytics events or click logs: you might accept losing a few events on a crash in exchange for handling an extremely high event rate. Another example is in recommendation systems or ML, where you accumulate user interactions in a cache and periodically flush to a datastore; losing a tiny fraction might not be critical. Write-back is also used in combination with durable caches: e.g., Redis has a feature called “append-only file” which can be thought of as making write-behind safer (it logs every write to disk asynchronously). Systems like Cassandra or Dynomite (in-memory front for DB) could be configured in a write-back style for speed, but usually with replication for safety.
One can combine patterns too. For instance, refresh-ahead is a pattern where the cache proactively refreshes certain items before they expire (to avoid misses). Also, write-behind can be combined with read-through. In practice, many caching systems allow configuration of these behaviors. As a summary of these patterns (A Hitchhiker's Guide to Caching Patterns | Hazelcast):
- Cache-Aside: Application manages cache on miss. Simple but requires manual consistency handling. Good when you can tolerate eventual consistency and want fine control.
- Read-Through: Cache itself fetches from DB on misses. Simplifies app, but needs a capable cache store. Often used with…
- Write-Through: Cache updates go to DB immediately. Ensures cache and DB consistency at cost of write latency.
- Write-Around: Writes go to DB (cache possibly invalidated), new data comes to cache later when read. Good to avoid cache churn on write-heavy, read-light data.
- Write-Back (Write-Behind): Writes go to cache (fast), DB update deferred. Great for high write throughput, but riskier (cache is temporarily source of truth). Use in systems where slight inconsistency or data loss is acceptable or mitigated by replication.
To illustrate, a typical usage might be: a cache-aside or read-through + write-through strategy for an e-commerce product cache (so reads are fast and always up-to-date after a purchase updates stock), whereas a logging system might use write-back caching to buffer writes for efficiency.
Common Challenges and Trade-offs
While caching improves performance, it introduces additional complexity. Two fundamental challenges are maintaining data consistency versus system availability, and determining when to invalidate or refresh cached data. Additionally, poorly managed caches can cause problems like the Thundering Herd phenomenon. Here we discuss these challenges and how to handle them:
Consistency vs. Availability (CAP Theorem Context)
In a distributed system, the CAP theorem tells us that when a network partition occurs, we must choose between consistency and availability – we can’t have all three of Consistency, Availability, and Partition tolerance at once. Caching often pushes a system toward favoring availability and performance at the expense of strict consistency. By design, caches store copies of data. If the source data changes, the cache could become stale (inconsistent with the source). Many caching systems opt to serve slightly stale data rather than make users wait, effectively choosing availability (always serve from cache if possible) over strong consistency (Navigating Consistency in Distributed Systems: Choosing the Right Trade-Offs | Hazelcast).
For example, consider a social network: if a user updates their profile picture, you could either (a) immediately invalidate all caches and ensure everyone sees the new picture (consistency) but perhaps stall responses while caches miss, or (b) allow some users to see the old picture for a minute until caches naturally expire (availability/performance). Many systems choose (b) unless it’s critical data. Web caching in general “ensures high-speed data access with tolerable temporary inconsistencies, such as serving stale… cached web pages.” (Navigating Consistency in Distributed Systems: Choosing the Right Trade-Offs | Hazelcast) In other words, a bit of staleness is accepted in exchange for speed and uptime.
Cache Consistency Models:
- Strong consistency: Every read gets the latest write. Achieving this with caches usually means on any update, invalidate or update all relevant cache entries immediately. This can be done but requires robust invalidation logic (or using write-through everywhere). It may reduce cache hit rates (because caches might be cleared frequently). In distributed caches, strong consistency might require a coherence protocol (like distributed locking or versioning) so that no node serves stale data after an update. This is complex and often not worth the performance hit except for specific data (e.g., caching in a banking system might require this).
- Eventual consistency: Most caches provide eventual consistency — if no new updates occur, eventually everyone will converge to see the last update. In practice, caches often have TTLs that ensure that even if you forget to invalidate something, it won’t live forever. Systems that are read-heavy often live with eventual consistency from caches. For example, Facebook’s cache sacrifices some consistency for massive throughput; for a brief time after a user posts something, not every server’s cache may have it, but within seconds it becomes consistent (either via cache expiry or background update). Facebook chooses availability (the site should always load quickly) over absolute consistency (everyone sees the update instantly), given the scale (Scaling memcache at Facebook - Engineering at Meta).
- Read-your-writes consistency: One compromise is ensuring that after a user makes a change, they see their change (even if others might still see an old value for a brief time). This can be handled by cache design: e.g., a user’s session or requests could bypass cache or update the cache immediately for that user’s own data. This is often implemented to improve UX (the author of a post should see it immediately, others might see it a second later when their cache updates).
Distributed Cache and Partitioning: In distributed caches, if a network partition happens (some cache nodes can’t talk to others or to the database), we face a choice: do we continue serving data from the cache nodes we can access (which might be outdated if that partition can’t see new writes), or do we stop serving (which hurts availability)? Many cache clusters choose availability: a cache node will serve whatever it has, even if it’s potentially stale, rather than fail the request. This is another way caching systems lean toward the “AP” side (Available/Partition-tolerant) in CAP. For instance, a CDN node cut off from origin will serve slightly older content rather than nothing.
CAP/PACELC and Caching: Under PACELC (which extends CAP), even when no partitions (P) exist, there’s often a trade between latency (L) and consistency (C). Caches clearly favor low latency. They give fast responses (L) but not always the most up-to-date data (not C). The system designer can add measures to improve consistency (like shorter TTLs, or cache-busting on writes), but often the default is that caches may serve data that’s a few seconds or minutes old. This is usually acceptable for use cases like web pages, social media feeds, product catalogs, etc., but perhaps not for, say, stock trading prices or bank account balances where strict consistency is needed.
In summary, caching can violate strict consistency, so the trade-off must be managed. Techniques to mitigate inconsistency include careful cache invalidation (discussed next) and selecting which data to cache (some data might not be cached at all if it’s extremely critical to be up-to-date). Systems like Redis can also be configured for replication and cluster consistency modes (Redis Cluster offers certain consistency guarantees, and there are CRDTs for eventual consistency across geo-distributed caches). The key point is that introducing a cache means you now have multiple copies of data, so you have to decide how to keep them in sync or how much staleness you tolerate.
Cache Invalidation Strategies (Keeping Data Fresh)
“There are only two hard things in Computer Science: cache invalidation and naming things.” – a famous quote (by Phil Karlton) that underscores how challenging cache invalidation can be. Cache invalidation is the process of discarding or updating cache entries when they are no longer valid (i.e., underlying data changed). If you don’t invalidate properly, your cache will serve stale data indefinitely. If you invalidate too often or too soon, you might lose the performance benefits. Achieving the right balance is tricky (System design: Caching Strategies - DEV Community).
When to invalidate? There are a few common strategies:
-
Time-based expiration (TTL): This is the simplest approach: every cache entry is given a time-to-live (TTL) when stored. Once the TTL expires, the entry is automatically removed or marked expired in the cache (System design: Caching Strategies - DEV Community). For example, you cache a user’s profile for 5 minutes; after 5 minutes, it’s evicted, so the next read will fetch fresh data. TTL ensures that even if you never explicitly invalidate, no data stays around forever. It’s great for data that changes periodically or eventually needs refresh. For instance, an exchange rate might be cached with a 1-hour TTL – within that hour everyone gets the same rate from cache; after an hour, it fetches a new rate from the source. TTL is easy to implement and understand. The downside is choosing the right TTL: too short and you forfeit performance (cache entries expire before they are reused much), too long and users might see outdated info. Many systems use TTL combined with other methods. For example, CDNs often cache an item with a TTL (say 1 day) but if they receive an invalidation signal from origin, they’ll purge it sooner.
-
Event-driven invalidation (Explicit cache invalidation): In this strategy, whenever the underlying data changes, the system proactively invalidates or updates the corresponding cache entry. This requires knowing which cache keys are affected by a data change. For instance, if a user updates their profile picture, you’d have logic to remove the cache entry for that user’s profile in the cache cluster immediately. This can be done via messages or hooks in the code. Many databases or applications support publish/subscribe for invalidation: e.g., an update to a record can publish an “invalidate cache for key X” event that all cache nodes listen to and act on. The advantage is caches don’t serve stale data (ideally, you invalidate at the exact right moment). The challenge is complexity: mapping data changes to cache keys, ensuring all caches get the memo (especially in distributed systems). Also, if an update happens, you might need to invalidate multiple related cache entries. Example: Updating a user’s name might require invalidating a cache of that user’s profile, a cache of a group chat list that shows user names, etc. Tracking these relationships is hard, so many systems opt for simpler per-key invalidation and use TTLs to catch anything missed.
-
Write-through / Write-back coherence: As discussed, with write-through the cache is updated at the same time as the database, so technically you never invalidate – you update in place. This keeps the cache fresh by design (at least for the data that goes through the cache). With write-back, the cache is the primary store until it flushes to the DB; in that window you wouldn't invalidate (the DB copy is the stale one, not the cache). Instead, you must ensure other caches and clients don't read the stale DB directly.
-
Cache Coherence Protocols: In more complex setups (such as multiple caches across different services), there may be a coherence layer. For example, two caching layers might use a common pub/sub channel to notify each other of changes. Systems like Hazelcast or Oracle Coherence provide distributed cache coherence out of the box, using internal messaging to keep all nodes in sync about invalidations. Similarly, CDNs let you send cache purge requests by content path or tag, which typically invalidates content globally within seconds.
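To make the first two strategies concrete, here is a minimal sketch (in Python, using the redis-py client) that pairs a safety-net TTL with explicit, event-driven invalidation. The key format, the 5-minute TTL, and the cache-invalidation pub/sub channel are illustrative assumptions rather than a prescribed design.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

PROFILE_TTL_SECONDS = 300  # safety-net TTL: worst-case staleness is 5 minutes

def cache_profile(user_id, profile):
    # Store with a TTL so the entry expires even if explicit invalidation is missed.
    r.setex(f"profile:{user_id}", PROFILE_TTL_SECONDS, json.dumps(profile))

def get_profile(user_id):
    cached = r.get(f"profile:{user_id}")
    return json.loads(cached) if cached else None  # None means cache miss

def on_profile_updated(user_id):
    # Event-driven invalidation: drop the entry the moment the source data changes,
    # and notify any other caches listening on a shared channel.
    r.delete(f"profile:{user_id}")
    r.publish("cache-invalidation", f"profile:{user_id}")
```

The TTL acts as the safety net described above: even if the delete or the pub/sub message is lost, the entry can be stale for at most five minutes.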
Stale vs. Fresh Trade-off: Invalidation ultimately is about deciding how long you allow data to be possibly stale. If your strategy is TTL = 1 minute, you’re saying “I’m OK if data is up to 1 minute out-of-date in the worst case.” If you use explicit invalidation on changes, you aim for zero staleness, but you might miss something if an invalidation fails. A hybrid is common: use moderate TTLs as a safety net, and also push invalidation messages on critical changes. That way, in the worst case (message lost), data is only stale until TTL kicks in.
Example: Suppose an online shop caches product inventory counts for 10 minutes (TTL). If stock changes (say someone buys the last item), you might also send an invalidation to immediately remove the cached count for that product so the next buyer sees “out of stock” right away. But if that invalidation fails, the TTL ensures within 10 minutes the cache expires and the correct stock is fetched. Ten minutes of potentially showing stock when it’s gone might be an acceptable risk in that design (or maybe TTL is set shorter for such data).
Cache Invalidation is Hard: The difficulty lies in distributed environments and complex data relationships. It’s easy to forget to invalidate something, or to do it too broadly and reduce cache efficiency. Testing and monitoring are important – e.g., cache hit ratio monitoring can reveal if an invalidation strategy is accidentally causing frequent cache drops (thrashing).
The Thundering Herd Problem (Cache Stampede)
The Thundering Herd problem refers to a scenario where a cache miss (or cache invalidation) triggers many requests to the backend all at once, potentially overwhelming it. It typically occurs in high-traffic systems when popular data expires or is evicted from the cache: suddenly, dozens or thousands of clients all miss the cache and all those requests fall through to the database or upstream service simultaneously (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). This “herd” of requests can crash the database or significantly slow it down, defeating the purpose of the cache.
How it happens: Imagine a cached item that is very frequently accessed (say the homepage HTML of a news site). It has a TTL that just expired at time T. At time T+0, the first request comes in and sees cache miss, so it goes to regenerate the page from the database. But in a heavy load scenario, at time T+1ms, T+2ms, etc., many more requests arrive (there could be hundreds of users requesting that page around the same time). The cache is still empty (the first request is still fetching from DB), so all these requests also hit the database concurrently. The database now faces a sudden spike of queries (perhaps the exact same query) that can flood it (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). This is a cache stampede – the cache was protecting the DB when it was valid, but once invalidated, the protection temporarily vanished and the DB got hit with all the pent-up demand at once.
Another scenario: after a system restart or cache cluster failure, the cache may be cold. If the system suddenly comes online with full traffic, the initial wave can overload the database because nothing is in cache yet.
Why it's problematic: The whole point of caching is to reduce load on the backend. A thundering herd is the worst case: the absence of a single cache entry causes a burst of backend load that can bring down the backend – and then every request fails, even though the data could have been served from cache if just one request had gone through to refill it.
Mitigation Techniques: There are several strategies to mitigate thundering herd issues (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium):
-
Cache Locking (Mutual Exclusion): Also known as a "dogpile lock". The idea is to ensure that only one request (or a small number) goes to the backend to populate the cache while the others wait. For example, the first thread that finds a cache miss places a lock or flag (perhaps an entry like "lock:key" in the cache) indicating that the data is being loaded. Other threads that arrive and see the lock wait briefly (or check again after a delay) instead of hitting the DB (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). Once the first thread loads the data and populates the cache, it releases the lock and all waiting threads get the data from cache. This greatly reduces the load spike – effectively serializing the cache refill. Many caching libraries and custom implementations use this approach. It's important to handle the case where the first thread fails or takes too long (e.g., give the lock a timeout so others don't wait forever). A minimal locking sketch appears after this list.
-
Request Coalescing / Deduplication: In proxy or service mesh layers, identical requests can be combined. For example, a CDN or reverse proxy might see 100 requests for the same URL arrive simultaneously. Instead of sending 100 to origin, it can internally merge them into one request, send that upstream, and once it gets the response, fan it out to all 100 clients (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). This is similar to cache locking but implemented at the request level. Nginx's proxy_cache_lock is an example mechanism: it makes one request to populate the cache while others wait. Some modern API gateways and load balancers have this feature as well.
-
Staggered (Randomized) Expiration: To avoid many popular keys expiring at the same exact time (e.g., at the top of the hour), one trick is to add a small random jitter to TTLs. Instead of everything expiring exactly on the 60-second mark, each key's TTL might be 60 seconds plus or minus a few random seconds (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). This spreads cache expirations more evenly over time, so you don't get a stampede of many keys all expiring simultaneously. It doesn't help if one single key is extremely hot, but it helps when many keys expire around the same time. (A small jitter helper is sketched after this list.)
-
Preemptive Renewal (Refresh Ahead): This pattern (also called refresh-ahead or auto-refresh) refreshes popular cache entries before they expire. For instance, if an item's TTL is 1 hour, a background job could refresh it at the 59-minute mark (by fetching the latest value) so that it never actually goes "stale" for user requests (A Hitchhiker's Guide to Caching Patterns | Hazelcast). If done carefully, users never see a miss – the cache entry is always updated just in time. This can prevent a herd by ensuring the item doesn't expire under high concurrent access. However, it may do unnecessary work if the item isn't accessed as often as expected.
-
Graceful Degradation & Queueing: In some systems, if a cache miss happens for a hot item, instead of unleashing all queries on the DB, the system serves stale data and fetches fresh data in the background. For example, some CDNs support stale-while-revalidate: if content is expired but still in cache, the CDN temporarily serves the expired version to new requests while it fetches the fresh version from origin. Users get something (maybe slightly old) rather than an error or a slow response, and only one fetch happens. Similarly, an application cache might decide "this entry expired 5 seconds ago, but I'll serve it anyway to these 100 requests and let one of them update it."
-
Rate Limiting or Backpressure: As a last resort, systems can detect when a stampede might occur and throttle requests. For instance, if 1000 queries for the same key appear, you might reject some or slow them down intentionally. This is more about protecting the backend by sacrificing some requests.
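For the cache-locking technique above, the following is a hedged sketch of a dogpile lock built on Redis's SET with NX (set-if-not-exists) and an expiry; fetch_from_db, the lock key format, and the timeouts are assumptions for illustration, not a production-ready implementation.

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def get_with_lock(key, fetch_from_db, ttl=300, lock_ttl=10, wait=0.05, retries=40):
    value = r.get(key)
    if value is not None:
        return value  # cache hit: no backend work needed

    # Try to become the single caller that repopulates the cache.
    # NX = only set if the lock key doesn't exist; EX = lock auto-expires.
    if r.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):
        try:
            value = fetch_from_db(key)   # only this caller reaches the backend
            r.setex(key, ttl, value)
            return value
        finally:
            r.delete(f"lock:{key}")

    # Everyone else polls the cache briefly instead of stampeding the backend.
    for _ in range(retries):
        time.sleep(wait)
        value = r.get(key)
        if value is not None:
            return value

    # Fallback: the lock holder failed or took too long, so fetch directly.
    return fetch_from_db(key)
```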
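And for staggered expiration, a tiny helper along these lines adds random jitter to a base TTL; the 60-second base and ±5-second spread are arbitrary example values.

```python
import random

def jittered_ttl(base_seconds=60, jitter_seconds=5):
    # e.g. 60s plus or minus up to 5s, spreading expirations over a 10-second window
    return base_seconds + random.randint(-jitter_seconds, jitter_seconds)

# Usage (hypothetical): r.setex(key, jittered_ttl(), value)
```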
Example resolution: Using the weather API example from earlier (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium): The site caches weather data for 5 minutes. At expiry, 100 users trigger calls to the API, overloading it. To fix this, the site implements a locking mechanism: when cache is expired, the first user’s request sets a lock and fetches from the API; the other 99 users see the lock and instead of calling the API, they wait a moment. When the first request returns with fresh data (say after 1 second), it updates the cache and releases the lock. Now the 99 waiting requests all get the fresh weather data from the cache and proceed, without each hitting the API. Alternatively, the site could have given each cache entry a random TTL between 4.5 and 5.5 minutes, so not all locations expire exactly at once (if it was a global expiry). Or it could have a background job refresh popular cities’ weather just before expiry.
Thundering herd issues often only become noticeable at scale. Many smaller systems won't experience them until a high-load event or an outage (when caches restart). It's good practice in system design discussions to mention strategies like locking or staggered TTLs to show awareness of this problem.
Other Challenges
(Aside from those mentioned, a few other common caching challenges are worth noting briefly: cache coherence – keeping caches in sync in a distributed environment, touched on under consistency; memory management – ensuring the cache doesn't grow unbounded or trigger evictions at bad times; cold start – handling the load when a cache starts out empty, and how to warm it up; and over-caching – caching too aggressively, e.g., caching sensitive data on the client that shouldn't be cached, or spending memory on rarely used data. But the major ones for an overview are consistency, invalidation, and the thundering herd, as discussed above.)
Real-World Examples and Use Cases
To illustrate how caching is employed in practice, let’s look at how a few major tech companies use caching to achieve the performance and scale they need:
-
Amazon: Amazon's e-commerce platform makes extensive use of caching at multiple levels. It caches product information, search query results, session data, and more. For instance, when millions of users search for a popular product (like "iPhone"), Amazon can serve the results from a cache rather than hitting the database each time (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). Product pages are assembled from many components (images, descriptions, recommendations, stock availability), and many of those components are delivered via caches or CDNs (images via the CloudFront CDN, data via in-memory caches). Amazon also offloads sessions and user-specific data to caching systems so that its databases aren't overloaded with repetitive reads. Moreover, Amazon Web Services offers services like ElastiCache (managed Redis/Memcached) and DynamoDB Accelerator (DAX) – these reflect Amazon's internal patterns: DynamoDB (a NoSQL database) can be fronted by DAX, an in-memory cache that serves millions of requests per second with microsecond latency. During high-profile shopping events (Prime Day, Black Friday), caching is critical for Amazon to handle traffic spikes. Without caching, Amazon's databases would not sustain the read load of all users repeatedly viewing the same popular items.
-
Netflix: Netflix must stream video content to a massive global audience, which is incredibly bandwidth-intensive. Caching is literally at the core of Netflix’s content delivery strategy. Netflix deploys a network of CDN edge servers (Netflix Open Connect) that cache popular movies and shows closer to users (often at ISP data centers). When you hit “play” on Netflix, the video is likely coming from a server just a few hops away from you rather than Netflix’s central servers. This caching ensures minimal startup time and prevents their origin servers from handling every stream (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). Netflix also caches other things: for example, the Netflix API that delivers your personalized movie list uses in-memory caches (they have a caching system called EVCache, built on Memcached, in their microservices architecture). They cache user preferences, recommendations results, etc., in regional caches so that the microservices don’t recompute or re-fetch the same data repeatedly. As a result, Netflix can handle billions of requests to their API and stream content smoothly, with caches absorbing much of the repetitive workload.
-
Google: Google employs caching everywhere, from hardware to software to global networks. Two notable areas are Google Search and YouTube. For Search, Google uses edge caches and regional caches so that popular search queries can be served quickly from nearby servers (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). For instance, if there is a trending query (like "Olympics 2025 medal count"), Google might cache the results page in data centers around the world. When users search for the same thing, the results (which don't change second to second) can be returned from cache rather than computed anew. This is one reason Google Search is extremely fast – many results are effectively precomputed or cached. Google also caches web content heavily in its indexing and crawling pipeline (for years it even exposed a "Cached" copy of pages in search results – an example of its internal caches). For YouTube, Google built a content distribution network called Google Global Cache (GGC): caching servers installed at ISPs that store popular YouTube videos, Google Play downloads, and so on. When a video goes viral, instead of every user fetching it from Google's main data centers, thousands of users in a region get it from their local GGC node. Google uses local edge servers that store frequently accessed data (like YouTube videos, search results, and thumbnails) to reduce latency and save bandwidth (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). Even within your browser, Google products use caching (Google Docs caches some of your document data for offline access, Google Maps caches the map tiles you've viewed, etc.), and the Chrome browser itself aggressively caches web resources to speed up repeat visits.
-
Facebook: Facebook serves billions of users on its social network, which involves a huge volume of database reads (profiles, posts, comments, likes). Facebook famously attributes much of its scalability to a massive memcached tier deployed in front of its databases. Facebook's engineering team wrote a paper, "Scaling Memcache at Facebook", describing how they handle billions of cache lookups per second (Scaling memcache at Facebook - Engineering at Meta). They cache user data (profiles, friend lists), content (posts, the results of complex queries like "News Feed stories for user X"), and even intermediate results to speed up page renders. When you load Facebook, most of what you see was likely retrieved from their cache cluster, which reduces direct database queries significantly. The cache tier is distributed across many servers and uses protocols to ensure that if one cache node fails, others can take over (with strategies to avoid a thundering herd when cache nodes restart). The result is that Facebook's infrastructure can serve billions of requests per second from cache, providing a fast user experience (Scaling memcache at Facebook - Engineering at Meta) while keeping the databases from melting down. Beyond memcached, Facebook also uses edge caching: static content like images and videos is served from CDNs and its own infrastructure (much as YouTube does, serving videos from edge caches). This CDN and caching layer ensures that, for example, your profile pictures or Instagram photos are delivered quickly from a nearby cache server.
These examples highlight a pattern: caching is a cornerstone of scalability for large systems. Each company combines multiple caches: in-memory caches for database acceleration, CDN caches for static content, browser/client caches for front-end assets, etc. Without caching, the response times would be slower and the cost to serve each user would be much higher (due to repeated computations and database reads). Caching enables these companies to serve vast user populations with acceptable performance and reasonable infrastructure cost.
Emerging Trends in Caching
The fundamentals of caching have been stable for years, but new trends and technologies continue to evolve how caching is used in modern system architectures:
Distributed & Decentralized Caching:
As systems scale out, caches are no longer a single server or a simple key-value store; they are distributed across clusters and even geographies. Distributed caching (caches spread over multiple nodes) is now commonplace, but emerging practice focuses on making these distributed caches more resilient and easier to use. For example, modern distributed caches (like Apache Ignite, Azure Cosmos DB's integrated cache, or Redis Cluster) support partitioning, replication, and even transactional consistency across the cache cluster. There is also interest in decentralized caching over P2P networks, or using technologies like blockchain to cache and distribute content in a decentralized manner. Most practically, within microservice architectures, teams are adopting multi-level caches: an in-service (local) cache for ultra-fast access, backed by a larger distributed cache for sharing data across instances (a minimal sketch follows below). This hierarchy combines the speed of local memory with the size of a distributed cluster, and is facilitated by frameworks that synchronize local caches with central ones.
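As a rough illustration of such a multi-level cache, the sketch below checks a small in-process LRU first and falls back to a shared Redis tier, then the source of truth. The tier sizes, the TTL, and the load_from_source helper are assumptions; real frameworks also handle invalidation of the local tier, which this sketch omits.

```python
from collections import OrderedDict
import redis

r = redis.Redis(decode_responses=True)

LOCAL_MAX_ENTRIES = 1024
local = OrderedDict()  # tiny per-instance LRU: the fastest tier

def get(key, load_from_source, ttl=60):
    # Tier 1: local memory (per instance, no network hop)
    if key in local:
        local.move_to_end(key)  # mark as recently used
        return local[key]

    # Tier 2: shared distributed cache (per cluster, one network hop)
    value = r.get(key)
    if value is None:
        # Tier 3: source of truth (slowest) - repopulate the shared tier on the way back
        value = load_from_source(key)
        r.setex(key, ttl, value)

    # Populate the local tier and evict its least recently used entry if full
    local[key] = value
    local.move_to_end(key)
    if len(local) > LOCAL_MAX_ENTRIES:
        local.popitem(last=False)
    return value
```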
Another trend is using AI/ML to optimize caching: for instance, ML models that predict which data should be cached (beyond simple LRU) or to dynamically adjust TTLs based on access patterns. These are still cutting-edge but potentially powerful for adaptive caching.
Additionally, there is a focus on global caching – spanning caches across data centers. Systems like Cloudflare Workers KV or Akamai EdgeKV provide a key-value store that replicates data globally, so you can cache state close to users worldwide (not just static files but even API responses or user data). This blurs the line between database and cache: some apps store data primarily in these globally distributed stores with eventual consistency.
Caching in Serverless Architectures:
Serverless computing (e.g., AWS Lambda, Google Cloud Functions) introduces new challenges for caching. In traditional servers, you often rely on in-memory caches on each server process that persist as long as the process is running. But with serverless, functions are short-lived and may run on demand, and you don’t have a long-running process to keep a cache warm (a cold start has an empty cache each time). Moreover, serverless scales out by spinning up more independent instances, each with its own ephemeral memory, so you can’t rely on a single in-memory cache shared across requests.
To address this, the caching layer in serverless is often moved out to managed services: e.g., using a hosted Redis as an external cache that all function instances talk to, or leaning aggressively on CDNs and edge caches. Another strategy is to exploit the limited lifetime of a serverless container – an AWS Lambda execution environment may be reused for many requests before it is recycled, so an in-memory variable at module scope can act as a mini-cache within one Lambda instance (a small sketch follows). But you don't control how long that instance lives or how many instances exist.
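A minimal sketch of that container-reuse trick might look like the following; the handler shape, the 60-second freshness window, and fetch_config are hypothetical, and nothing here survives a cold start or is shared between concurrent instances.

```python
import time

_CACHE = {}          # module scope: lives as long as this warm container does
_TTL_SECONDS = 60    # don't trust warm data forever

def fetch_config():
    # Placeholder for a slow call (database, SSM Parameter Store, etc.)
    return "expensive-to-fetch configuration"

def handler(event, context):
    entry = _CACHE.get("config")
    if entry is None or time.time() - entry["at"] > _TTL_SECONDS:
        # Cold start or stale entry: pay the cost once per container per TTL window
        entry = {"value": fetch_config(), "at": time.time()}
        _CACHE["config"] = entry
    return {"statusCode": 200, "body": entry["value"]}
```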
Impact: Serverless has actually increased the need for fast, scalable cache services, because each stateless function invocation might otherwise hit the database. Many serverless apps incorporate an API Gateway cache (AWS API Gateway has a built-in caching feature) to avoid invoking the function at all for common requests. Others push work to the edge with CDNs and edge compute (like Cloudflare Workers). The consequence is the rise of fully managed caching services that fit the serverless paradigm: for example, Momento (a "serverless cache" startup) provides a cache service where you don't manage nodes at all – you just use an API and it scales behind the scenes (Serverless Cache: The missing piece to go fully serverless - Momento). The idea is to avoid provisioning a Redis cluster (a serverful concept) for your serverless app; instead, you use a pay-per-use cache service that auto-scales, much like serverless databases do. Traditional cache servers have limits (max connections, fixed memory) that force overprovisioning in serverless environments; newer services attempt to remove those limits by dynamically scaling cache capacity and throughput (Serverless Cache: The missing piece to go fully serverless - Momento).
Serverless environments also push more caching to the edge. If your backend is fully serverless, you might lean more heavily on CDNs and edge caches to absorb as much traffic as possible (because scaling up a Lambda on a cache miss may be slower than serving from an edge cache). For instance, a serverless website might use Cloudflare Workers to cache HTML pages at the edge and only call a Lambda for infrequent cache misses.
Example: Consider a serverless e-commerce site where product pages are rendered by Lambdas. Without caching, hitting the front page 1,000 times would trigger 1,000 Lambda executions (cost and latency every time). Instead, the site might put CloudFront (a CDN) in front with a TTL of a few seconds on pages, or use API Gateway caching. That way, perhaps only 1 in 1,000 requests actually triggers a Lambda; the rest are cache hits. This significantly reduces cost (since serverless means pay per execution) and improves speed (a sketch of setting cache headers for this is shown below). As Yan Cui notes, "caching improves response time and also translates to cost savings in serverless (pay-per-use) since fewer function invocations are needed." (All you need to know about caching for serverless applications | theburningmonk.com).
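One hedged way to get that behavior is to have the function emit cache headers that the CDN in front of it honors; the handler shape, render_product_page, and the specific header values below are illustrative assumptions (and support for stale-while-revalidate varies by CDN and browser).

```python
def render_product_page(event):
    # Placeholder for the expensive render work that caching is meant to avoid
    return "<html><body>Product page</body></html>"

def handler(event, context):
    return {
        "statusCode": 200,
        "headers": {
            # Let the CDN (and browsers) cache the page for 5 seconds; where
            # supported, serve a stale copy for up to 30 more seconds while
            # a single background request revalidates it.
            "Cache-Control": "public, max-age=5, stale-while-revalidate=30",
            "Content-Type": "text/html",
        },
        "body": render_product_page(event),
    }
```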
Edge Computing and Caching Convergence: Another emerging trend is running application logic on edge nodes (serverless at edge, like Cloudflare Workers, AWS Lambda@Edge) which can themselves use caching more dynamically. E.g., an edge function might personalize a cached response with slight tweaks instead of hitting origin.
Hardware and New Cache Technology: At the hardware level, new non-volatile memory (like Intel Optane, etc.) offers an intermediate speed tier – this might lead to new caching tiers (between RAM and disk). Also, content-addressable memory or specialized cache hardware might someday allow more efficient caching for specific workloads (though these are more research/enterprise areas).
Security and Isolation: Emerging thinking also includes caching with security isolation – e.g., multi-tenant caches that isolate data per user for privacy, or encrypted cache entries so that even if a cache is compromised, data isn’t exposed. This is a response to concerns of putting sensitive data in caches.
Conclusion of Trends: In summary, the trend is towards caches that are more distributed, more automated, and closer to the end user. Distributed caches enable massive scale and high availability, and new "serverless" caching services abstract away the clustering so developers can cache without managing servers. Serverless computing has reinforced the importance of external caches (since in-process, ephemeral caches are unreliable in that model) (Serverless Cache: The missing piece to go fully serverless - Momento). And as applications demand ever lower latency globally, caching at the edge (CDNs, edge key-value stores, etc.) is expanding from just static files to dynamic data and even computed results. Caching continues to be a vital part of system design, evolving alongside new architecture paradigms.
In conclusion, caching remains a fundamental technique for achieving high performance and scalability in software systems. By understanding and combining the right types of caches (client vs. server, CDN vs. database cache) and applying appropriate strategies (cache-aside, read/write policies) with awareness of consistency trade-offs, engineers can dramatically improve throughput and latency. The examples from the tech giants show that clever use of caching is often the secret behind handling millions of users and data at scale. As infrastructure shifts towards distributed and serverless models, caching is adapting in kind – becoming more distributed, integrated, and intelligent – but its core goal is unchanged: store expensive-to-fetch data in a faster medium and serve it when needed, as reliably and as fresh as possible. The challenge is mastering when and how to do this, which is why caching is both an art and a science in system design.