SerialReads

Caching in System Design

Apr 30, 2025


Fundamental Concepts

What is Caching?
Caching is a technique of storing frequently accessed data in a fast, temporary storage layer (the cache) so future requests for that data can be served more quickly (What Is Caching? How Caching Works and Its Limitations.). By keeping data in memory (or other fast storage) rather than always fetching from a slower backend (like a database or disk), applications reduce latency and load on the primary data source (System design: Caching Strategies - DEV Community). In essence, caching trades off freshness of data for speed: it relies on the principle of locality (recently or frequently used data is likely to be used again) to deliver quick responses, at the risk of serving slightly stale data (A Hitchhiker's Guide to Caching Patterns | Hazelcast). This trade-off is usually worthwhile in systems where performance and scalability are critical.

Cache Hits vs. Cache Misses:
A cache hit occurs when the requested data is found in the cache, allowing the system to return the data quickly without touching the backend source. A cache miss means the data was not in cache, so the system must retrieve it from the original source (incurring higher latency) and—often—store it in the cache for next time. High cache hit rates are the goal, as each hit saves expensive work, improving throughput and response times (System design: Caching Strategies - DEV Community). Too many misses can negate the benefits of caching; in fact, if an application rarely reuses data, checking the cache can add unnecessary overhead. In such cases (almost always misses), performance might worsen because the app spends time looking in the cache and still ends up going to the database (What Is Caching? How Caching Works and Its Limitations.). Therefore, understanding data access patterns (to ensure a cache will be reused) is critical before introducing a cache.

Common Cache Eviction Policies:
Caches have limited space, so when they’re full, some data must be evicted (removed) to make room for new entries. Cache eviction policies decide which data to remove, and choosing the right policy can greatly impact cache effectiveness. Common strategies include LRU (Least Recently Used), LFU (Least Frequently Used), FIFO (First-In, First-Out), random eviction, and MRU (Most Recently Used).

Eviction Policy Summary: Different policies excel under different access patterns. For example, LRU works well when recently used data tends to be reused (temporal locality), LFU works when a subset of items are accessed much more frequently than others, and MRU can help in cases like stack or stream access patterns. Many caching systems use LRU (or variants like LRU-K or Windowed LFU) by default as it aligns with typical workloads, but tuning the eviction policy to your application’s access characteristics can improve hit rates.

(See table below for a quick comparison of eviction policies and their ideal use cases.)

| Eviction Policy | Principle | Ideal When… |
| --- | --- | --- |
| LRU (Least Recently Used) | Evicts the item unused for the longest time (What Is Caching? How Caching Works and Its Limitations.). Assumes long-unused data is least likely needed soon. | Recent access is a good predictor of near-future use (temporal locality) – e.g. user viewing history, recently used documents. |
| LFU (Least Frequently Used) | Evicts the item with the fewest accesses (What Is Caching? How Caching Works and Its Limitations.). Focuses on keeping the most popular items. | Access frequency is highly skewed (some items are hot, others rarely used) – e.g. popular products, top search queries. |
| FIFO (First-In, First-Out) | Evicts the oldest entry in the cache, with no regard to usage (System design: Caching Strategies - DEV Community). | Items naturally expire with time or one-time use is common – e.g. streaming data, queue processing where old data becomes irrelevant. |
| Random | Evicts a random item (System design: Caching Strategies - DEV Community). Does not track usage patterns. | Simplicity is paramount or caching at the hardware level – e.g. simple caches where implementing LRU/LFU is impractical. (Usually a fallback strategy.) |
| MRU (Most Recently Used) | Evicts the most recently accessed item first ([ByteByteGo Cache Eviction Policies](https://bytebytego.com/guides/most-popular-cache-eviction/)). | Frees items likely already used up in one-time or sequential workloads. |

Note that many real caches use a combination of strategies (e.g. TLRU – Time-aware LRU with TTL expiration, or segmented policies like Two-Queue which mix an LFU and LRU approach). The eviction policy should be chosen based on data access patterns to maximize the cache’s hit ratio.
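
To make LRU concrete, here is a minimal, illustrative sketch of an LRU cache built on Python's OrderedDict (the capacity, keys, and values are arbitrary; production systems would use a library or the cache server's built-in policy):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None                      # cache miss
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

# Usage: a 2-entry cache
cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now most recently used
cache.put("c", 3)    # evicts "b", the least recently used key
```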

Types of Caches

Caching can occur at various layers of a system. Broadly, we can categorize caches into client-side vs. server-side (including intermediary layers like CDNs and databases). Each type serves different purposes in a distributed architecture:

Client-Side Caching (Frontend)

Definition & Mechanism: Client-side caching stores data on the end-user’s device or in the user’s application process. This could be a web browser cache, mobile app cache, or any cache in the client application. When a client (browser/app) requests a resource, it can keep a copy locally so that subsequent requests for the same resource are served from local storage instead of over the network (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). For example, web browsers cache static assets (HTML, CSS, JavaScript, images) and even API responses. The next time you visit the same page, the browser can load files from its cache instantly rather than fetching from the server, drastically reducing page load time.
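
For example, a server can opt static assets into the browser cache with HTTP caching headers. A minimal sketch, assuming a Flask app (the route and max-age value are illustrative):

```python
from flask import Flask, send_from_directory, make_response

app = Flask(__name__)

@app.route("/assets/<path:filename>")
def serve_asset(filename):
    # Serve a static file and tell the browser it may reuse its local copy
    # for up to one day before asking the server again.
    resp = make_response(send_from_directory("static", filename))
    resp.headers["Cache-Control"] = "public, max-age=86400"
    return resp
```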

Common Client-Side Caches: the browser HTTP cache (static assets such as HTML, CSS, JavaScript, and images), browser storage APIs like localStorage/sessionStorage and IndexedDB, Service Worker caches used by PWAs for offline support, and in-memory caches inside mobile or desktop apps.

Use Cases & Examples: A classic example of client caching is YouTube’s video caching in its app/browser. YouTube will save recently watched video data, thumbnails, and metadata in the browser or app cache. This way, if you replay a video or navigate back to it, it loads smoothly without re-downloading everything (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). Similarly, single-page applications (SPAs) cache JSON API responses; for instance, a weather app might cache the last fetched weather data for your city so it can show something immediately while fetching the latest in the background. Client-side caching improves perceived performance and enables offline or flaky-network functionality.

Pros & Cons:
✔ Fast local access: Data served from the device’s own storage or memory is extremely fast, with zero network latency (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✔ Reduced server load: Fewer requests go to the server, since repeated access is satisfied locally (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✔ Offline capability: The app can function (at least partially) without network access using cached data (great for mobile apps, PWAs, etc.) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Data staleness: The cache might serve out-of-date data if the client is offline or hasn’t fetched updates (e.g., you might see yesterday’s data until a new fetch happens) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Storage limits: Browsers and devices limit how much can be stored. Large caches might be evicted by the browser or OS, and different devices have varying capacities (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Per-user only: A client-side cache is local to one user; it doesn’t help with data that could be shared across users. For shared caching, we rely on server/CDN caches.

Server-Side Caching (Backend)

Definition & Mechanism: Server-side caching happens on the server or in the infrastructure between the client and the primary data store. It refers to caches that the server (or a cluster of servers) uses to store responses or expensive computations so that future requests can be served faster. The first time a request is made, the server generates the response (perhaps by querying a database or performing calculations) and then stores that result in a cache (often an in-memory store like Redis or Memcached). Subsequent requests check the cache first and, if a cached result is available, return it directly without hitting the database (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).

Server-side caching can take many forms: in-process (local) memory caches inside each application server, shared in-memory stores such as Redis or Memcached, caches of rendered pages or page fragments, and reverse-proxy or API-gateway caches sitting in front of the application.

Use Cases & Examples: One common use case is caching database query results. Suppose an e-commerce site’s backend frequently needs to fetch product details or the result of a complex search query. Instead of querying the database for each user’s identical request, the server can cache the result in memory. For example, if user A searches for “Nike shoes” and the server fetches that from the DB, it can cache the list of Nike shoes. When user B (or A again) performs the same search, the server returns the cached results immediately (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). This dramatically cuts down database load and response time. Many websites cache user session data, user profiles, frequently viewed items, or computed aggregates on the server side.
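
A minimal sketch of that flow with a Redis client (redis-py); the key format, 5-minute TTL, and the run_search_query helper are illustrative placeholders, not any specific site's implementation:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def run_search_query(term: str):
    # Placeholder for the real (expensive) database query.
    return [{"name": f"{term} result"}]

def search_products(term: str):
    key = f"search:{term}"
    cached = r.get(key)
    if cached is not None:                   # cache hit: skip the database entirely
        return json.loads(cached)

    results = run_search_query(term)         # cache miss: query the database
    r.setex(key, 300, json.dumps(results))   # cache the result for 5 minutes
    return results
```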

A real-world example is Twitter: On the backend, Twitter caches things like trending topics, user timelines, or tweet objects in a fast in-memory store (and also uses CDNs for media). This prevents millions of expensive database or API calls, allowing Twitter to serve content to users quickly and handle huge volumes of reads (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). Another example is Amazon: Amazon’s product pages and search results employ heavy server-side caching. When millions of users search for a popular item (say “iPhone”), Amazon can serve the results from cache instead of querying the database each time, ensuring fast and reliable responses under massive load (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium).

Pros & Cons:
✔ Shared benefit: A server-side cache is typically shared by all users of the system. If one user’s request populates the cache, many others benefit from that cached data (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium). This is great for scaling reads in multi-user systems.
✔ Reduced database/API load: By catching repeated requests, it offloads work from the primary datastore or external APIs, which can improve overall throughput and allow the system to handle higher traffic (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✔ Centralized control: The caching behavior (invalidation, TTLs, etc.) can be managed in one place on the server, and it’s easier to update if underlying data changes (compared to invalidating many client caches).
✖ Consistency management: When data is updated in the database, the server-side cache may have stale data that must be invalidated or updated. Ensuring the cache doesn’t serve obsolete information can be complex (the classic cache invalidation problem) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Propagation delay: In some cases, especially with distributed caches, an update might take time to propagate to all cache nodes or may not immediately reflect to users, so recent changes might not appear instantly (trading consistency for performance) (Server-Side Caching vs. Client-Side Caching: A System Design Perspective | by Kumud Sharma | Medium).
✖ Overhead & cost: Running a large cache cluster (Redis nodes, etc.) is an additional component of your architecture. While it eases DB load, it’s another system to monitor, scale, and pay for. Also, very dynamic data might not benefit if it changes before cache hits occur (wasted caching).

Content Delivery Networks (CDNs) and Edge Caching

A CDN is a specialized type of cache that lives at the network edge (geographically closer to users). CDNs cache content on proxy servers distributed around the world (or a region) so that users can retrieve data from a nearby location rather than from the origin server every time (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). CDNs are often provided by third parties (like Cloudflare, Akamai, Amazon CloudFront, Google Cloud CDN), and they are heavily used for static assets and media, though they can also cache certain dynamic content.

Use Cases & Examples: CDNs are typically used for static assets (images, CSS/JavaScript, fonts, even whole HTML pages), for streaming media such as video segments and thumbnails, and increasingly for cacheable API responses. The examples below from Google, YouTube, and Amazon show edge caching at scale.

Benefits: The primary benefits of CDNs are reduced latency (since the data travels a shorter distance to the user) and offloading traffic from the origin servers. By “bringing the most requested data closer to users, Google reduces latency, improves speed, and saves backend resources.” (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community) In practice, a CDN can massively reduce load on your origin – many requests are served entirely by edge caches without ever reaching your servers. This also provides resilience: even if the core site is under high load, the CDN may continue serving cached content to users.

CDNs are essential for global scale systems. Companies like Amazon use CloudFront (a CDN) to distribute images, scripts, and even whole HTML pages for their e-commerce site, ensuring fast page loads globally. Facebook and Google have their own edge cache networks; for example, Google’s edge caching (part of Google’s CDN infrastructure) stores YouTube videos and even Google Search results in edge servers at ISPs and points-of-presence around the world (Part 5: Edge Caching & Google CDN – The Secret Behind Instant YouTube & Search - DEV Community). This edge layer (often called Google Global Cache in Google’s case) is why a search query or a YouTube video feels instant in many locations – much of the content was served from a local cache rather than halfway across the world.

One more advantage: CDNs and edge caches also absorb traffic spikes and can protect origin servers from traffic surges or certain attacks (CDNs often act as a layer of defense, caching and serving even in case of an origin failure for a short time).

Challenges: Not all content can be cached on CDNs easily—highly personalized or real-time data usually must go to the origin. CDNs also have to be configured with appropriate cache invalidation strategies (e.g., purging or using short TTL for content that updates frequently). There’s also eventual consistency to consider: a user update might invalidate some cached content, and there can be a brief window where one edge node has old data while others have new.

Database Caching (Query Caches, Redis, Memcached)

This refers to caching at or near the database layer. Many databases have internal caches (like MySQL’s query cache, or disk page caches) – those are important but typically automatic. In system design, when we say database caching we often mean using an external cache store to relieve the database. Redis and Memcached are two extremely popular technologies for this purpose. They are in-memory key-value stores often deployed as a layer in front of a database or used by the application to store frequently accessed data.

Memcached is a high-performance, distributed memory cache. It’s simple: it stores key→value pairs entirely in RAM, offering very fast lookups. Companies have used Memcached to cache database results, computed pages, or any object that is expensive to regenerate. A famous example is Facebook – Facebook leveraged Memcached to build a massive distributed cache for their social graph data. This cache tier handles billions of reads per second, dramatically reducing the load on Facebook’s MySQL databases (Scaling memcache at Facebook - Engineering at Meta). In fact, Facebook’s ability to scale to billions of users was aided greatly by their Memcached layer, which stores user sessions, profile info, and the results of complex queries so that the web servers rarely need to directly hit the databases for common reads (Scaling memcache at Facebook - Engineering at Meta).

Redis is another popular in-memory cache (which offers more features than Memcached, such as persistence, richer data types, and atomic operations). Redis is often used to cache database query results or as a fast lookup table for things like user authentication tokens, session data, or computed scores/rankings. For example, a web app might store the 100 most recent posts in a Redis list for a feed, or cache the result of a heavy analytic query. Redis is also commonly used as a session store or message cache. Many web frameworks allow plugging Redis in so that each page load can quickly fetch session info from Redis instead of a slower database (System design: Caching Strategies - DEV Community). Redis can persist to disk periodically, which gives it a bit more safety for data longevity than Memcached (which is purely in-memory).

Use Cases: Database caching is useful when you have read-heavy workloads. For instance, an online store might cache product details or inventory counts in Redis to avoid hitting the SQL database for each page view. If the inventory changes, the cache can be updated or invalidated. Another use case: results of expensive queries (say a complex JOIN or an aggregate) can be cached for a few minutes. Reporting dashboards often do this—compute a heavy report once, cache it, and serve the same result for subsequent requests until it expires.

Example – Amazon: Amazon’s architecture includes a caching layer (Amazon ElastiCache service supports Redis/Memcached) to offload reads from databases. Amazon has noted that caching the results of popular product queries (like a search or a product detail page) in a fast cache has been key to delivering quick results during high-traffic events (Caching in System Design: A Simple Guide with Real-World Examples | by Harsh Kumar Sharma | Medium). Rather than every user’s query hitting the underlying product database, many can be served from cache.

Distributed Cache considerations: Both Redis and Memcached can be clustered (distributed across multiple servers) to increase capacity and throughput. A distributed cache means that data is partitioned (and often replicated) across multiple nodes, and clients can connect to any node and retrieve data as needed. This allows the cache to scale horizontally and not be a single point of failure. In large systems, it’s common to have a distributed caching tier with several nodes: this tier can be treated as a shared fast lookup service by all application servers (What Is Caching? How Caching Works and Its Limitations.). The cache cluster might store a huge number of key-value entries representing many users’ data or many query results. This greatly improves scalability, because as demand grows you can add more cache nodes. (Of course, partitioning means a given key is on one node, so that node must handle queries for that subset of data.)
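
To illustrate partitioning, here is a simplified hash-ring sketch for picking which cache node owns a key; the node names and virtual-node count are arbitrary, and real clients (e.g. Redis Cluster or Memcached client libraries) handle this for you:

```python
import bisect
import hashlib

class HashRing:
    """Maps each key to one node; adding or removing nodes only remaps nearby keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise around the ring to the first node at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"))   # the node responsible for this key
```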

When not to use DB caching: If your data changes constantly and clients rarely read the same data twice, caching might not help. Also, if strong consistency is required (always seeing the latest data), caching must be designed carefully (potentially with very short TTLs or write-through schemes). We’ll discuss cache consistency soon.

In summary, database caching using tools like Redis/Memcached is a fundamental pattern to achieve high performance. It’s employed by virtually all big web platforms: Twitter caches tweets and timelines in memory, YouTube/Google caches video metadata and search results, Amazon caches product info, Facebook caches social data in Memcached, and so on. Each of these drastically reduces direct database load and allows these platforms to scale to millions or billions of operations per second.

Cache Strategies & Patterns

Designing caching in a system isn’t just about whether to cache, but also how the cache interacts with reads and writes. Several common caching strategies/patterns dictate this interaction. The main ones are Cache-Aside (Lazy Loading), Read-Through, Write-Through, Write-Around, and Write-Back (also called Write-Behind). Each pattern balances consistency, write complexity, and read performance differently. Let’s explain each with examples:

Cache-Aside (Lazy Loading) Pattern

In the cache-aside pattern, the application code is responsible for interacting with the cache. The cache sits alongside the data store, and the app pulls data into the cache as needed. On a cache miss, the application goes to the database (or source of truth), fetches the data, returns it to the user, and at the same time inserts it into the cache for next time (How Facebook served billions of requests per second Using Memcached). On a cache hit, of course, the data is returned directly from the cache. This is often called lazy loading because data is loaded into the cache only when first used, not in advance.

For writes or updates, in pure cache-aside, the application usually writes to the database and then invalidates the cache entry (or updates it). That is, the cache is updated after the data store is updated (or the cached entry is simply cleared, so the next read will fetch fresh data from the DB).

Example scenario: Imagine a social media app using cache-aside. When a user requests their profile, the service first checks a Redis cache for the profile data. If it’s not there, the service queries the primary database, gets the profile, and then stores that result in Redis with an appropriate key. Subsequent requests for that profile (by the same or even other services) hit Redis and avoid the database. If the user updates their profile, the service will write the changes to the database and invalidate the cache entry for that profile (so that next read will fetch the updated data from DB and then cache it anew).
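
A minimal cache-aside sketch along those lines, assuming a Redis client; the key scheme, TTL, and the db_get_profile / db_update_profile helpers are hypothetical stand-ins for real queries:

```python
import json
import redis

r = redis.Redis()
TTL_SECONDS = 600

# Hypothetical database helpers (stand-ins for real queries).
def db_get_profile(user_id: int) -> dict:
    return {"id": user_id, "name": "example"}

def db_update_profile(user_id: int, fields: dict) -> None:
    pass

def get_profile(user_id: int):
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                    # cache hit

    profile = db_get_profile(user_id)                # cache miss: go to the database
    r.setex(key, TTL_SECONDS, json.dumps(profile))   # populate the cache for next time
    return profile

def update_profile(user_id: int, fields: dict):
    db_update_profile(user_id, fields)               # write to the source of truth first
    r.delete(f"profile:{user_id}")                   # invalidate so the next read refetches
```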

Use cases: Cache-aside is the most widely used caching pattern (A Hitchhiker's Guide to Caching Patterns | Hazelcast) because of its simplicity and explicit control. It’s used whenever the application wants fine-grained control over cache contents. For example, Facebook’s Memcache usage is essentially cache-aside – the app servers check memcache first, on miss query MySQL and populate memcache (How Facebook served billions of requests per second Using Memcached). This pattern is also common in ORM data caching or in any scenario where you manually code caching.

Pros:
✔ Lazy loading: only data that is actually requested gets cached, so memory isn’t wasted on data nobody reads.
✔ Resilient reads: if the cache is unavailable, the application can still fall back to the database (at reduced performance).
✔ Explicit control: the application decides exactly what is cached, with which keys, and for how long.

Cons:
✖ Miss penalty: every cache miss costs three trips (check cache, query the database, populate the cache), so first reads are slower.
✖ Staleness risk: if the database is updated and the corresponding cache entry isn’t invalidated or expired, readers see old data.
✖ Scattered logic: cache population and invalidation code lives throughout the application, adding complexity and room for bugs.

Read-Through Cache Pattern

Read-through is similar to cache-aside in outcome (data gets loaded on demand), but the difference is the caching system itself automatically loads missing data from the backing store. In a read-through pattern, the application doesn’t directly fetch from the database on a miss; instead, it always interacts with the cache. If the data is not in cache, the cache service (assuming it’s a sophisticated cache or integrated middleware) will fetch from the database, populate itself, and return to the application (A Hitchhiker's Guide to Caching Patterns | Hazelcast).

In other words, the cache is given a “loader” capability. For example, some caching libraries allow you to register a callback or use an integrated data store connection so that whenever your code asks the cache for key “X” and X is not present, the cache layer itself knows how to get X from the database and update the cache.

Example scenario: Suppose you use a caching library in your application that supports read-through. Your code might call cache.get(userId). If the user data is cached, it returns. If not, the cache internally queries the users table in the DB (perhaps via a provided DAO or callback), stores the result in the cache, and then returns it. The application code is simpler (just get from cache), and it doesn’t have to explicitly manage cache population logic.
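
A small illustrative sketch of the idea: a cache wrapper that is given a loader callback and fetches from the backing store itself on a miss (the TTL and the load_user function are assumptions, not a particular library's API):

```python
import time
from typing import Any, Callable, Dict, Tuple

class ReadThroughCache:
    """The caller only talks to the cache; on a miss the cache invokes its loader."""

    def __init__(self, loader: Callable[[Any], Any], ttl_seconds: float = 300):
        self._loader = loader
        self._ttl = ttl_seconds
        self._store: Dict[Any, Tuple[float, Any]] = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.time():
            return entry[1]                              # fresh hit
        value = self._loader(key)                        # miss or expired: cache loads it itself
        self._store[key] = (time.time() + self._ttl, value)
        return value

# Hypothetical loader that would normally query the users table.
def load_user(user_id):
    return {"id": user_id, "name": "example"}

users = ReadThroughCache(load_user, ttl_seconds=60)
profile = users.get(42)   # application code never touches the database directly
```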

Use cases: Read-through caches are common in frameworks or cloud services. For instance, AWS DynamoDB Accelerator (DAX) is an example of a read-through cache in front of DynamoDB: the app queries DAX as if it were the database; DAX checks its cache and automatically fetches from DynamoDB on a miss (Caching patterns - Database Caching Strategies Using Redis). Similarly, an ORM’s second-level cache might be read-through – your code asks the cache for an object, and the cache layer goes to DB if needed. This pattern is good when you want to decouple cache logic from business logic, delegating the loading to a cache service or library.

Pros:
✔ Simpler application code: the app always reads from the cache, and the loading logic lives in one place (the cache layer or library).
✔ Consistent population: every miss is filled the same way, avoiding duplicated loader code across the codebase.

Cons:
✖ Requires support: you need a cache provider or library that accepts loaders, and the data access must be expressible as such a loader.
✖ Cold misses and staleness remain: the first request for any item still pays the miss penalty (unless the cache is warmed), and entries can be stale until they expire or are refreshed.

Write-Through Cache Pattern

In a write-through strategy, every write operation (create/update/delete) goes through the cache first, and the cache is responsible for writing it to the underlying database synchronously. Essentially, the application never directly writes to the database; it writes to the cache, which immediately writes to the database (Understanding write-through, write-around and write-back caching (with Python)). This ensures the cache is always up-to-date with the database — any data in the cache has been saved to the DB, and any new DB state is reflected in the cache.

How it works: When the app needs to update data (say change a user’s profile), using write-through it would update the cached value and the cache system would in the same operation write the new value to the database. Only after the database write succeeds would the operation be considered complete (Understanding write-through, write-around and write-back caching (with Python)). Because of this, data in cache is never stale relative to the database — they are in sync for writes.

Example scenario: Consider a financial system that caches account balances. Using write-through, when a transaction updates an account balance, the application writes the new balance to the cache. The cache then writes that through to the SQL database. This way, the cache always has the latest balance. Any read can be served from cache confidently, since no write is considered successful until it’s in the DB and cache. This is important for systems that cannot tolerate serving old data. For instance, many banking systems or payment systems would use a write-through or at least an immediate cache update approach to ensure consistency.

Another example: A user profile service might use write-through for critical fields. When a user updates their email, the system writes to the cache (which writes to DB). Subsequent reads get the new email from the cache.
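
A minimal sketch of the write-through idea: the write helper synchronously updates both the database and the cache before reporting success (here the application performs both steps itself, whereas a true write-through layer would hide them behind the cache API; db_save is a hypothetical stand-in):

```python
import json
import redis

r = redis.Redis()

# Hypothetical database write (stand-in for a real INSERT/UPDATE).
def db_save(key: str, value: dict) -> None:
    pass

def write_through(key: str, value: dict) -> None:
    db_save(key, value)              # persist to the source of truth...
    r.set(key, json.dumps(value))    # ...and update the cache in the same operation
    # Only now is the write considered successful: cache and DB agree.

def read(key: str):
    cached = r.get(key)
    return json.loads(cached) if cached is not None else None
```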

Pros:
✔ No stale reads: the cache and database stay in sync for written data, so reads from the cache always reflect completed writes.
✔ Warm cache: recently written data is already cached, which suits read-heavy workloads.

Cons:
✖ Slower writes: every write must complete in both the cache and the database before returning.
✖ Possible wasted space: data that is written but never read still occupies cache memory (often mitigated by pairing write-through with TTLs or an eviction policy).
✖ Added coupling: the write path must go through the caching layer, which adds complexity.

Use cases: Write-through is ideal when you have a high read-to-write ratio and you cannot afford stale reads. For example, caching session data in an e-commerce site: you write through the cache so that any change (like items added to a cart) goes to both the DB and the cache, and all servers see the updated cart immediately. It’s also used when writes are not too frequent or when slightly slower writes are acceptable in exchange for fast reads of up-to-date data (financial data, inventory counts in a limited-stock sale, etc.). Systems that use read-through caches often pair them with write-through to keep the cache hot.

Write-Around (Write-Around Cache)

Write-around is a strategy where writes skip the cache and go directly to the data store, and the cache is updated only on the next read. In practice, this means on an update, you evict the item from cache (or don’t cache it at all initially), and only when someone tries to read that data later do you fetch the new value from the DB into the cache. It’s called "write-around" because write operations go around the cache.

How it works: On data modification, the application writes to the database as normal. The cache, if it has the old value, may be invalidated/removed, but the new value is not written to the cache at that time (Understanding write-through, write-around and write-back caching (with Python)). Reads still check the cache: if the updated data wasn’t read yet (cache was empty or invalidated), the first read will miss, fetch the latest from DB, then populate the cache.

Example scenario: Suppose we have a news site and an article’s content is updated. With write-around, when an editor updates the article, the system stores the new content in the database and removes the cached copy of that article (if any). It does not put the new content into the cache immediately. Now, when a reader requests that article, the cache is empty, so the app fetches from DB (getting the updated content) and then caches it for subsequent readers. Essentially, the cache only sees the new data when it’s actually needed. If that article is never read again (or not read for a long time), we avoided populating the cache with data that no one used.
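
A sketch of write-around for that article example, assuming a Redis client; update_article_in_db and fetch_article_from_db are hypothetical helpers:

```python
import redis

r = redis.Redis()

# Hypothetical database helpers.
def update_article_in_db(article_id: int, content: str) -> None: ...
def fetch_article_from_db(article_id: int) -> str: return "article body"

def update_article(article_id: int, new_content: str) -> None:
    update_article_in_db(article_id, new_content)   # the write goes straight to the database
    r.delete(f"article:{article_id}")               # drop any stale copy; do NOT repopulate now

def get_article(article_id: int) -> str:
    key = f"article:{article_id}"
    cached = r.get(key)
    if cached is not None:
        return cached.decode()
    content = fetch_article_from_db(article_id)     # first read after the write misses...
    r.setex(key, 600, content)                      # ...and populates the cache on demand
    return content
```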

Pros:
✔ Lean cache: the cache isn’t flooded with data that is written once and rarely (or never) read, so space is reserved for data that is actually requested.
✔ Simple writes: updates go straight to the source of truth; the cache only needs to be invalidated, not updated.

Cons:
✖ Guaranteed miss after a write: the first read of recently written data always goes to the database, so it is served with higher latency.
✖ Poor fit for read-after-write workloads: if data is usually read right after being written, write-around gives up the benefit of having it pre-cached.

Use cases: Write-around is useful when write traffic is high and you don’t want to bloat the cache with data that might not be re-read soon. For example, logging or audit data might be written constantly but rarely read; you wouldn’t cache every log entry on write. Another scenario: bulk loads or backfills of data into a system – you might disable caching or use write-around during that process to avoid overwhelming the cache with one-time writes. It’s an acceptable approach if the pattern is “write once, read maybe later” as opposed to “write and then read immediately”. A specific use-case: a stream processing pipeline where results are continuously written to a database, but consumers mostly query recent data (so the cache naturally fills with recent reads, and older writes that aren’t read don’t occupy cache). Write-around ensures the cache is populated on demand, not by every single write.

Write-Back (Write-Behind) Cache Pattern

Write-back caching takes a different approach: on an update, data is written only to the cache (marking it as dirty) and the write to the database is deferred to a later time (asynchronous) (Understanding write-through, write-around and write-back caching (with Python)). The cache thus absorbs write operations and “buffers” them, writing to the backing store in the background (or on certain triggers). This is analogous to a disk write-back cache or CPU cache write-back: it’s very fast for writes, but introduces an inconsistency window.

How it works: When a value is updated, the cache is updated immediately in memory and the user’s operation returns success as soon as the cache is updated. The cache then queues the write to persistent storage. Sometime later (could be milliseconds or seconds), the cache flushes the queued writes to the database (this could be batched or one-by-one). If multiple writes happen to the same data in a short time, the database might only see the final state (the cache could coalesce writes).

Example scenario: Consider a high-throughput analytics system collecting sensor readings. Using write-back, each incoming reading is written to a cache (in-memory store) for quick acknowledgment. Periodically or asynchronously, these readings are written in bulk to a database or file system. If the system crashes before the write-back, some data may be lost, but this might be acceptable for the application (or mitigated by replication of the cache). Another example: a user likes/unlikes a post rapidly. A write-back cache might handle those toggles in memory and only occasionally persist the final count to the database, reducing write load.
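
A toy write-back sketch: updates land in memory immediately and a background thread flushes dirty entries in batches (the flush interval and persist_batch helper are illustrative; real systems add retries, ordering, and replication):

```python
import threading
import time

# Hypothetical bulk persistence call (e.g. a multi-row UPSERT).
def persist_batch(batch: dict) -> None:
    pass

class WriteBackCache:
    def __init__(self, flush_interval: float = 5.0):
        self._data = {}
        self._dirty = set()
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, args=(flush_interval,), daemon=True).start()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value    # the write completes as soon as the cache is updated
            self._dirty.add(key)       # remember that the database hasn't seen this yet

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def _flush_loop(self, interval: float):
        while True:
            time.sleep(interval)
            with self._lock:
                batch = {k: self._data[k] for k in self._dirty}
                self._dirty.clear()
            if batch:
                persist_batch(batch)   # deferred, batched write to the backing store
```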

Pros:
✔ Very fast writes: the operation completes as soon as the cache is updated, without waiting for the database.
✔ Lower write load: multiple updates to the same item can be coalesced and writes can be batched before hitting the database.

Cons:
✖ Risk of data loss: if the cache node fails before dirty entries are flushed, the un-persisted writes are gone (unless the cache is replicated or logs writes durably).
✖ Lagging source of truth: the database trails the cache, so anything reading the database directly (reports, other services) may see outdated data.
✖ More moving parts: flush scheduling, retries, and write ordering must be implemented correctly.

Use cases: Write-back is suitable when performance is paramount and occasional data loss can be tolerated (or there’s another mitigation for it). For example, caching analytics events or click logs: you might accept losing a few events on a crash in exchange for handling an extremely high event rate. Another example is in recommendation systems or ML, where you accumulate user interactions in a cache and periodically flush to a datastore; losing a tiny fraction might not be critical. Write-back is also used in combination with durable caches: e.g., Redis has a feature called “append-only file” which can be thought of as making write-behind safer (it logs every write to disk asynchronously). Systems like Cassandra or Dynomite (in-memory front for DB) could be configured in a write-back style for speed, but usually with replication for safety.

One can combine patterns too. For instance, refresh-ahead is a pattern where the cache proactively refreshes certain items before they expire (to avoid misses). Also, write-behind can be combined with read-through. In practice, many caching systems allow configuration of these behaviors. In summary (A Hitchhiker's Guide to Caching Patterns | Hazelcast): cache-aside and read-through govern how data enters the cache on reads (the application loads it itself versus the cache loading it via a loader), while write-through, write-around, and write-back govern how writes reach the backing store (synchronously through the cache, directly to the database with the cache bypassed, or asynchronously from the cache).

To illustrate, a typical usage might be: a cache-aside or read-through + write-through strategy for an e-commerce product cache (so reads are fast and always up-to-date after a purchase updates stock), whereas a logging system might use write-back caching to buffer writes for efficiency.

Common Challenges and Trade-offs

While caching improves performance, it introduces additional complexity. Two fundamental challenges are maintaining data consistency versus system availability, and determining when to invalidate or refresh cached data. Additionally, poorly managed caches can cause problems like the Thundering Herd phenomenon. Here we discuss these challenges and how to handle them:

Consistency vs. Availability (CAP Theorem Context)

In a distributed system, the CAP theorem tells us we can’t have perfect Consistency, Availability, and Partition tolerance at the same time. Caching often pushes a system toward favoring availability and performance at the expense of strict consistency. By design, caches store copies of data. If the source data changes, the cache could become stale (inconsistent with the source). Many caching systems opt to serve slightly stale data rather than make users wait, effectively choosing availability (always serve from cache if possible) over strong consistency (Navigating Consistency in Distributed Systems: Choosing the Right Trade-Offs | Hazelcast).

For example, consider a social network: if a user updates their profile picture, you could either (a) immediately invalidate all caches and ensure everyone sees the new picture (consistency) but perhaps stall responses while caches miss, or (b) allow some users to see the old picture for a minute until caches naturally expire (availability/performance). Many systems choose (b) unless it’s critical data. Web caching in general “ensures high-speed data access with tolerable temporary inconsistencies, such as serving stale… cached web pages.” (Navigating Consistency in Distributed Systems: Choosing the Right Trade-Offs | Hazelcast) In other words, a bit of staleness is accepted in exchange for speed and uptime.

Cache Consistency Models: cached data falls somewhere on a consistency spectrum. Strong consistency (the cache always reflects the latest write) requires synchronous invalidation or write-through and costs latency. Eventual consistency accepts a bounded window of staleness: TTL-based expiration guarantees data is never older than the TTL, and asynchronous invalidation messages shrink that window further. Most web-scale caches settle for eventual consistency with TTLs as a safety net, reserving stronger guarantees for critical data.

Distributed Cache and Partitioning: In distributed caches, if a network partition happens (some cache nodes can’t talk to others or to the database), we face a choice: do we continue serving data from the cache nodes we can access (which might be outdated if that partition can’t see new writes), or do we stop serving (which hurts availability)? Many cache clusters choose availability: a cache node will serve whatever it has, even if it’s potentially stale, rather than fail the request. This is another way caching systems lean toward the “AP” side (Available/Partition-tolerant) in CAP. For instance, a CDN node cut off from origin will serve slightly older content rather than nothing.

CAP/PACELC and Caching: Under PACELC (which extends CAP), even when no partitions (P) exist, there’s often a trade between latency (L) and consistency (C). Caches clearly favor low latency. They give fast responses (L) but not always the most up-to-date data (not C). The system designer can add measures to improve consistency (like shorter TTLs, or cache-busting on writes), but often the default is that caches may serve data that’s a few seconds or minutes old. This is usually acceptable for use cases like web pages, social media feeds, product catalogs, etc., but perhaps not for, say, stock trading prices or bank account balances where strict consistency is needed.

In summary, caching can violate strict consistency, so the trade-off must be managed. Techniques to mitigate inconsistency include careful cache invalidation (discussed next) and selecting which data to cache (some data might not be cached at all if it’s extremely critical to be up-to-date). Systems like Redis can also be configured for replication and cluster consistency modes (Redis Cluster offers certain consistency guarantees, and there are CRDTs for eventual consistency across geo-distributed caches). The key point is that introducing a cache means you now have multiple copies of data, so you have to decide how to keep them in sync or how much staleness you tolerate.

Cache Invalidation Strategies (Keeping Data Fresh)

“There are only two hard things in Computer Science: cache invalidation and naming things.” – a famous quote (by Phil Karlton) that underscores how challenging cache invalidation can be. Cache invalidation is the process of discarding or updating cache entries when they are no longer valid (i.e., underlying data changed). If you don’t invalidate properly, your cache will serve stale data indefinitely. If you invalidate too often or too soon, you might lose the performance benefits. Achieving the right balance is tricky (System design: Caching Strategies - DEV Community).

When to invalidate? There are a few common strategies:
Time-based expiration (TTL): every entry is stored with a time-to-live and is discarded (or refreshed) once it expires. Simple and robust, but data may be stale for up to the TTL.
Explicit (event-driven) invalidation: when the application writes to the source of truth, it also deletes or updates the corresponding cache entries, often by publishing invalidation messages to all cache nodes.
Write-through updates: writes go through the cache, so entries are updated in place rather than invalidated (see the write-through pattern above).
Key versioning: the cache key embeds a version or timestamp, so changed data produces a new key and stale entries simply age out.

Stale vs. Fresh Trade-off: Invalidation ultimately is about deciding how long you allow data to be possibly stale. If your strategy is TTL = 1 minute, you’re saying “I’m OK if data is up to 1 minute out-of-date in the worst case.” If you use explicit invalidation on changes, you aim for zero staleness, but you might miss something if an invalidation fails. A hybrid is common: use moderate TTLs as a safety net, and also push invalidation messages on critical changes. That way, in the worst case (message lost), data is only stale until TTL kicks in.

Example: Suppose an online shop caches product inventory counts for 10 minutes (TTL). If stock changes (say someone buys the last item), you might also send an invalidation to immediately remove the cached count for that product so the next buyer sees “out of stock” right away. But if that invalidation fails, the TTL ensures within 10 minutes the cache expires and the correct stock is fetched. Ten minutes of potentially showing stock when it’s gone might be an acceptable risk in that design (or maybe TTL is set shorter for such data).

Cache Invalidation is Hard: The difficulty lies in distributed environments and complex data relationships. It’s easy to forget to invalidate something, or to do it too broadly and reduce cache efficiency. Testing and monitoring are important – e.g., cache hit ratio monitoring can reveal if an invalidation strategy is accidentally causing frequent cache drops (thrashing).

The Thundering Herd Problem (Cache Stampede)

The Thundering Herd problem refers to a scenario where a cache miss (or cache invalidation) triggers many requests to the backend all at once, potentially overwhelming it. It typically occurs in high-traffic systems when popular data expires or is evicted from the cache: suddenly, dozens or thousands of clients all miss the cache and all those requests fall through to the database or upstream service simultaneously (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). This “herd” of requests can crash the database or significantly slow it down, defeating the purpose of the cache.

How it happens: Imagine a cached item that is very frequently accessed (say the homepage HTML of a news site). It has a TTL that just expired at time T. At time T+0, the first request comes in and sees cache miss, so it goes to regenerate the page from the database. But in a heavy load scenario, at time T+1ms, T+2ms, etc., many more requests arrive (there could be hundreds of users requesting that page around the same time). The cache is still empty (the first request is still fetching from DB), so all these requests also hit the database concurrently. The database now faces a sudden spike of queries (perhaps the exact same query) that can flood it (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). This is a cache stampede – the cache was protecting the DB when it was valid, but once invalidated, the protection temporarily vanished and the DB got hit with all the pent-up demand at once.

Another scenario: after a system restart or cache cluster failure, the cache may be cold. If the system suddenly comes online with full traffic, the initial wave can overload the database because nothing is in cache yet.

Why it’s problematic: The whole point of caching is to reduce load on the backend. A thundering herd is a worst-case where the absence of a cache entry causes a burst of backend load, which can potentially bring down the backend (and then everyone’s request fails, even though the data might have been something we could have served from cache if only one had gone through).

Mitigation Techniques: There are several strategies to mitigate thundering herd issues (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium):
Request coalescing (locking / “single flight”): on a miss, only one request acquires a lock and regenerates the value; the others wait briefly and then read the freshly populated cache.
Staggered (jittered) TTLs: add a small random offset to expiration times so that many hot keys don’t expire at the same instant.
Refresh-ahead / background refresh: proactively refresh popular entries shortly before they expire so users never see the miss.
Serve-stale-while-revalidate: keep serving the slightly stale value while a single request refreshes it in the background.
Cache warming: after a restart or deployment, pre-populate the hottest keys before taking full traffic.

Example resolution: Consider a weather site that caches weather data for 5 minutes (Stampede Control: Tackling the Thundering Herd Problem in Distributed Systems | by Rubihali | Medium). At expiry, 100 users trigger calls to the upstream API, overloading it. To fix this, the site implements a locking mechanism: when the cache is expired, the first user’s request sets a lock and fetches from the API; the other 99 users see the lock and instead of calling the API, they wait a moment. When the first request returns with fresh data (say after 1 second), it updates the cache and releases the lock. Now the 99 waiting requests all get the fresh weather data from the cache and proceed, without each hitting the API. Alternatively, the site could have given each cache entry a random TTL between 4.5 and 5.5 minutes, so not all locations expire exactly at once (if it was a global expiry). Or it could have a background job refresh popular cities’ weather just before expiry.
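
A hedged sketch of that locking approach using Redis SET NX as a short-lived mutex, plus a jittered TTL; the key names, timeouts, and fetch_weather_from_api helper are illustrative:

```python
import json
import random
import time
import redis

r = redis.Redis()

def fetch_weather_from_api(city: str) -> dict:
    return {"city": city, "temp_c": 20}          # placeholder for the real upstream API call

def get_weather(city: str):
    key = f"weather:{city}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: only the request that wins this lock goes to the upstream API.
    lock_key = f"lock:{key}"
    if r.set(lock_key, "1", nx=True, ex=10):
        try:
            data = fetch_weather_from_api(city)
            ttl = 300 + random.randint(0, 30)    # jitter so hot keys don't expire together
            r.setex(key, ttl, json.dumps(data))
            return data
        finally:
            r.delete(lock_key)

    # Everyone else waits briefly and re-reads the cache instead of stampeding the API.
    for _ in range(20):
        time.sleep(0.1)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return fetch_weather_from_api(city)          # fallback if the lock holder failed
```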

Thundering herd issues often become noticeable at scale. Many smaller systems might not experience it until a high load event or an outage (when caches reboot). It’s good practice in system design to mention strategies like locking or staggered TTLs to show awareness of this problem.

Other Challenges

(Aside from those mentioned, there are a couple of other common caching challenges worth noting briefly: cache coherence – keeping caches in sync in a distributed environment, which we touched on in consistency; memory management – ensuring the cache doesn’t grow unbounded and cause evictions at bad times; and cold start – when a cache is empty initially, how to warm it up or handle the load. Also over-caching: caching too aggressively (e.g., caching sensitive data on client that shouldn’t be cached, or using too much memory for rarely used data). But the major ones for an overview are consistency, invalidation, and thundering herd as discussed above.)

Real-World Examples and Use Cases

To illustrate how caching is employed in practice, let’s look at how a few major tech companies use caching to achieve the performance and scale they need:
Facebook: runs a massive distributed Memcached tier in front of its MySQL databases, serving billions of reads per second for social-graph data, sessions, and profiles (Scaling memcache at Facebook - Engineering at Meta).
Twitter: keeps tweets, user timelines, and trending topics in fast in-memory caches (and media on CDNs) so the read-heavy timeline path rarely touches the backing stores.
Amazon: caches product details and popular search results in a server-side cache (e.g. ElastiCache with Redis/Memcached) and serves images, scripts, and pages through CloudFront.
Google/YouTube: pushes videos, thumbnails, and search results to edge caches (Google Global Cache) at ISPs and points of presence worldwide, while clients cache recently watched content locally.

These examples highlight a pattern: caching is a cornerstone of scalability for large systems. Each company combines multiple caches: in-memory caches for database acceleration, CDN caches for static content, browser/client caches for front-end assets, etc. Without caching, the response times would be slower and the cost to serve each user would be much higher (due to repeated computations and database reads). Caching enables these companies to serve vast user populations with acceptable performance and reasonable infrastructure cost.

Emerging Trends in Caching

The fundamentals of caching have been stable for years, but new trends and technologies continue to evolve how caching is used in modern system architectures:

Distributed & Decentralized Caching:
As systems scale out, caches are no longer a single server or a simple key-value store; they are distributed across clusters and even geographically. Distributed caching (caches spread over multiple nodes) is now commonplace, but emerging practices focus on making these distributed caches more resilient and easier to use. For example, modern distributed caches (like Apache Ignite, Azure Cosmos DB’s integrated cache, or Redis Cluster) support partitioning, replication, and even transactional consistency across the cache cluster. There’s also interest in decentralized caching in P2P networks or using technologies like blockchain to cache and distribute content (for content delivery in a decentralized manner). Most practically, within microservice architectures, teams are adopting multi-level caches: an in-service (local) cache for ultra-fast access, backed by a larger distributed cache for sharing data across instances. This hierarchy can combine the speed of local memory with the size of a distributed cluster, and is facilitated by frameworks that synchronize local caches with central ones.

Another trend is using AI/ML to optimize caching: for instance, ML models that predict which data should be cached (beyond simple LRU) or to dynamically adjust TTLs based on access patterns. These are still cutting-edge but potentially powerful for adaptive caching.

Additionally, there’s focus on global caching – spanning caches across data centers. Systems like Cloudflare Workers KV or Akamai EdgeWorkers provide a key-value store that replicates data globally so you can cache state close to users worldwide (not just static files but even API responses or user data). This blurs the line between database and cache: some apps store data primarily in these globally distributed caches with eventual consistency.

Caching in Serverless Architectures:
Serverless computing (e.g., AWS Lambda, Google Cloud Functions) introduces new challenges for caching. In traditional servers, you often rely on in-memory caches on each server process that persist as long as the process is running. But with serverless, functions are short-lived and may run on demand, and you don’t have a long-running process to keep a cache warm (a cold start has an empty cache each time). Moreover, serverless scales out by spinning up more independent instances, each with its own ephemeral memory, so you can’t rely on a single in-memory cache shared across requests.

To address this, the caching layer in serverless is often moved out to managed services: e.g., using a service like Redis (hosted) as an external cache that all function instances talk to, or using CDNs and edge caches aggressively. Another strategy is to use the limited lifetime of a serverless container – for example, AWS Lambdas may be reused for multiple requests before being garbage-collected, so you can still use an in-memory static variable as a mini-cache within one Lambda instance. But you don’t control how long that lives or how many instances exist.
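
A sketch of that technique: a module-level dict that survives across invocations while the Lambda container stays warm (the handler shape, TTL, and load_config helper are illustrative, and the cache disappears whenever the container is recycled):

```python
import time

# Lives as long as this Lambda container stays warm; each new container starts empty.
_CACHE: dict = {}
_TTL_SECONDS = 60

def load_config() -> dict:
    # Placeholder for a call to a database, parameter store, etc.
    return {"feature_flag": True}

def handler(event, context):
    key = "config"
    entry = _CACHE.get(key)
    if entry is not None and time.time() - entry["at"] < _TTL_SECONDS:
        config = entry["value"]                          # warm invocation: reuse cached value
    else:
        config = load_config()                           # cold container or expired entry
        _CACHE[key] = {"value": config, "at": time.time()}
    return {"statusCode": 200, "body": str(config)}
```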

Impact: Serverless has actually increased the need for fast, scalable cache services because each stateless function invocation might otherwise hit the database. Many serverless apps incorporate an API Gateway cache (AWS API Gateway has a built-in caching feature) to avoid invoking the function at all for common requests. Others use edge platforms (like Cloudflare Workers) to do work at the edge. The consequence is the rise of fully managed caching services that fit the serverless paradigm: for example, Momento (a “serverless cache” startup) provides a cache service where you don’t manage nodes at all – you just use an API and it scales behind the scenes (Serverless Cache: The missing piece to go fully serverless - Momento). The idea is to avoid the need to provision a Redis cluster (which is a serverful concept) for your serverless app; instead, use a pay-per-use cache service that auto-scales, much like serverless databases do. Traditional cache servers have limits (max connections, fixed memory) that require overprovisioning in serverless environments; new services attempt to remove those limits by dynamically scaling the cache capacity and throughput (Serverless Cache: The missing piece to go fully serverless - Momento).

Also, serverless environments push more caching to the edge. If your backend is fully serverless, you might lean more on CDNs/edge caches to handle as much traffic as possible (because scaling up a lambda on a cache miss might be slower than serving from an edge cache). For instance, a serverless website might use Cloudflare Workers to cache HTML pages at the edge, and only call a lambda for infrequent cache misses.

Example: Consider a serverless e-commerce site: Product pages are rendered by Lambdas. Without caching, hitting the front page 1000 times would invoke 1000 Lambda executions (cost and latency). Instead, they might put CloudFront (CDN) in front with a TTL of a few seconds on pages, or use API Gateway caching. That way, perhaps only 1 in 1000 requests actually triggers a Lambda (the rest are cache hits). This significantly reduces cost (since serverless = pay per execution) and improves speed. As Yan Cui notes, “caching improves response time and also translates to cost savings in serverless (pay-per-use) since fewer function invocations are needed.” (All you need to know about caching for serverless applications | theburningmonk.com).

Edge Computing and Caching Convergence: Another emerging trend is running application logic on edge nodes (serverless at edge, like Cloudflare Workers, AWS Lambda@Edge) which can themselves use caching more dynamically. E.g., an edge function might personalize a cached response with slight tweaks instead of hitting origin.

Hardware and New Cache Technology: At the hardware level, new non-volatile memory (like Intel Optane, etc.) offers an intermediate speed tier – this might lead to new caching tiers (between RAM and disk). Also, content-addressable memory or specialized cache hardware might someday allow more efficient caching for specific workloads (though these are more research/enterprise areas).

Security and Isolation: Emerging thinking also includes caching with security isolation – e.g., multi-tenant caches that isolate data per user for privacy, or encrypted cache entries so that even if a cache is compromised, data isn’t exposed. This is a response to concerns of putting sensitive data in caches.

Conclusion of Trends: In summary, the trend is towards caches that are more distributed, more automated, and closer to the end-user. Distributed caches enable massive scale and high availability, and new “serverless” caching services abstract away the clustering so developers can cache without managing servers. Serverless computing has reinforced the importance of external caches (since in-process ephemeral caches are unreliable in that model) (Serverless Cache: The missing piece to go fully serverless - Momento). And as applications demand ever lower latency globally, caching at the edge (CDNs, edge keys, etc.) is expanding from just static files to dynamic data and even computed results. Caching continues to be a vital part of system design, evolving alongside new architecture paradigms.


In conclusion, caching remains a fundamental technique to achieve high performance and scalability in software systems. By understanding and combining the right types of caches (client vs server, CDN vs database cache) and using appropriate strategies (cache-aside, read/write policies) with awareness of consistency trade-offs, engineers can dramatically improve throughput and latency. The examples of tech giants demonstrate that clever use of caching is often the secret behind handling millions of users and data at scale. As infrastructure shifts towards distributed and serverless models, caching is likewise adapting – becoming more distributed, integrated, and intelligent – but its core goal is unchanged: store expensive-to-fetch data in a faster medium and serve it when needed, as reliably and fresh as possible. The challenge is mastering when and how to do this, which is why caching is both an art and science in system design.

system-design