SerialReads

Indexing & Denormalization Strategies (System Design Deep Dive)

Jun 08, 2025

Index Anatomy

B-Tree Indexes: Most relational databases use B+ Tree indexes, which store keys in sorted order on pages (nodes) with a high branching factor (fan-out). This high fan-out keeps the tree shallow, often only 3–4 levels deep even for millions of rows. In a B+ Tree, all real data records reside in the leaf nodes (with upper index pages holding only keys), and leaf pages are linked in order for efficient range scans. This structure ensures O(log n) lookup time by performing a few page reads following parent-child pointers down the tree. The trade-off is that each insert or delete might cause page splits or merges to maintain balance, adding some write cost.
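
For intuition about depth (an illustrative calculation, assuming a fan-out of roughly 500 keys per page): a three-level tree can address about 500³ ≈ 125 million leaf entries, and a fourth level pushes that to roughly 62 billion – which is why point lookups stay at a handful of page reads even on very large tables.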

LSM-Tree Indexes: Write-intensive systems (like many NoSQL stores) often use Log-Structured Merge Trees (LSM trees). An LSM-tree accumulates writes in an in-memory memtable (C0), then flushes them as sorted SSTable files on disk, merging into larger files in levels C1, C2, and so on. Background compaction processes merge and purge obsolete entries from these SSTables. LSM-trees optimize sequential writes (appends) and greatly reduce random write I/O, at the cost of read amplification – a read may have to check multiple files across levels. Conversely, B-trees tend to have higher write amplification (in-place page updates and splits) but lower read amplification for point queries. In practice, B-tree indexes are often better for read-heavy or mixed workloads, while LSM indexes shine for heavy write workloads or log ingestion.

Bitmap Indexes: Bitmap indexes use bit arrays to indicate the presence of a value in multiple rows. Unlike a B-tree where each index entry points to a single row, a bitmap index entry can cover many rows via a bitmap of row IDs. This is very efficient for low-cardinality columns (e.g. a gender field with values M/F) in analytic queries. Complex predicates on multiple bitmap indexes can be resolved by bitwise AND/OR operations on the bitmaps. The drawback is that updates can be expensive – changing a value requires modifying a bit in a potentially large bitmap – and concurrent updates may lock entire bitmaps. Thus, bitmap indexes are most often used in read-mostly data warehouses and are usually ill-suited for high-concurrency OLTP updates.
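
As a minimal sketch of the syntax (Oracle-style, with a hypothetical customers table):

    -- Bitmap indexes on low-cardinality columns
    CREATE BITMAP INDEX idx_cust_gender ON customers (gender);
    CREATE BITMAP INDEX idx_cust_region ON customers (region);

    -- A multi-predicate query can be answered by AND-ing the two bitmaps
    SELECT COUNT(*) FROM customers WHERE gender = 'F' AND region = 'WEST';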

Inverted Indexes: An inverted index maps content elements to the rows (or documents) that contain them. It’s called “inverted” because it flips the data paradigm – e.g. for a text column, the index might list each word (token) and which rows contain that word. Relational databases implement this for full-text search and JSON/array fields. For example, PostgreSQL’s GIN (Generalized Inverted Index) stores a set of (key, posting list) pairs, where each key is a token or value and the posting list is the set of row IDs that have that token. A single row can appear in many posting lists (if a document contains multiple indexed words). Inverted indexes allow fast membership queries (“find all rows that contain X”) at the cost of extra storage and slower writes (since one row insert may add many index entries). They are essential for efficiently querying composite data types – e.g. finding rows where a JSON column has a certain key or a text column contains a word – something B-trees cannot handle well.

Covering & Composite Indexes

Composite Indexes & Prefix Rules: A composite (multi-column) index indexes a concatenation of multiple columns. Such an index can accelerate queries filtering on all or just the leading subset of those columns. The order of columns in the index matters: databases can use any leftmost prefix of the index for lookups. For example, an index on (last_name, job_id, salary) can be used to search by last_name alone or last_name + job_id, but not by job_id without last_name. The index is most selective when filters on the leftmost columns are provided. Equality conditions on leading columns allow the index to jump directly to the matching range; additional range conditions on the next column further narrow the scan within that subset. However, if the first indexed column isn’t constrained, the index may be scanned in full (or not used at all) – e.g. a query on salary alone wouldn’t benefit from the (last_name, job_id, salary) index unless the optimizer does an index skip scan, which is a special case.
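
A minimal sketch of the prefix rule, using the example index from above on a hypothetical employees table:

    CREATE INDEX idx_emp_name_job_sal
        ON employees (last_name, job_id, salary);

    -- Can use the index: equality on the leading column
    SELECT * FROM employees WHERE last_name = 'King';

    -- Can use the index: leftmost prefix, with a range narrowing the scan further
    SELECT * FROM employees
     WHERE last_name = 'King' AND job_id = 'IT_PROG' AND salary > 5000;

    -- Generally cannot use the index (barring a skip scan): leading column unconstrained
    SELECT * FROM employees WHERE salary > 5000;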

Covering Indexes (Index-Only Scans): A covering index is one that includes all the columns a query needs, so the database can answer the query using the index alone, without touching the base table. Some databases achieve this by allowing extra included columns in the index that are not part of the search key (e.g. PostgreSQL INCLUDE or SQL Server non-key columns) – these make the index “wider” but cover more queries. When a query can be satisfied entirely by index entries, the engine performs an Index Only Scan, which traverses the index and returns results directly, avoiding expensive table lookups. This yields significant speedups especially when the table rows are large or when many rows are filtered out by the index. One caveat in MVCC databases is that an index-only scan must still check tuple visibility (e.g. PostgreSQL’s visibility map) to ensure a version is committed and not deleted – but this is usually much faster than reading the whole row.
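
A sketch of a covering index in PostgreSQL syntax (hypothetical orders table; SQL Server’s INCLUDE clause is analogous):

    -- customer_id is the search key; the INCLUDE columns only ride along in the leaf pages
    CREATE INDEX idx_orders_cust_covering
        ON orders (customer_id) INCLUDE (order_date, total_amount);

    -- Every referenced column lives in the index, so this can run as an Index Only Scan
    SELECT order_date, total_amount FROM orders WHERE customer_id = 42;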

Multi-Column Selectivity: The effectiveness of an index on multiple columns depends on the combined selectivity of those columns. Query planners estimate selectivity using column statistics. By default, many optimizers assume each filter is independent – meaning if one filter selects 10% of rows and another also 10%, together they estimate 1% of rows will match (0.1 * 0.1). In reality, columns can be correlated, and this assumption can mislead the planner. Modern databases offer extended statistics to capture correlations (or functional dependencies) so that multi-column selectivity estimates can be more accurate. If such stats are not in use, a composite index might be underutilized or a suboptimal plan chosen because the DB either overestimates or underestimates how restrictive combined conditions are. Understanding how your DB calculates combined predicate selectivity (and possibly creating multi-column stats) is key to anticipating whether a multi-column index will be chosen.
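
In PostgreSQL, for example, correlated columns can be covered by extended statistics (a sketch with a hypothetical addresses table, where city largely determines zip):

    -- Capture the functional dependency so combined selectivity isn't underestimated
    CREATE STATISTICS addr_city_zip_stats (dependencies) ON city, zip FROM addresses;
    ANALYZE addresses;  -- refresh statistics so the planner can use them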

Partial & Filtered Indexes

What Are Partial Indexes: A partial (filtered) index is an index built on a subset of rows that satisfy a given condition (predicate). The index only stores entries for rows meeting that condition. This is useful when a query pattern only cares about a portion of the data – for example, indexing only “active” records, or only rows where a certain column is NOT NULL. Partial indexes can be much smaller than full indexes, which yields faster scans for those queries and less storage and maintenance overhead. As a rule of thumb, “avoid indexing values you’re not querying”. For instance, if 95% of rows have status = 'inactive' and 5% are 'active' and frequently queried, you might create an index on ... WHERE status = 'active'. The database will then only index that hot subset – making the index tiny and very efficient for active-record lookups, without the bloat of all the inactive rows that queries rarely touch.
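
A minimal sketch of that pattern (hypothetical users table):

    -- Index only the hot 5% of rows that queries actually touch
    CREATE INDEX idx_users_active_login
        ON users (last_login) WHERE status = 'active';

    -- Matches the index predicate, so the small partial index can be used
    SELECT * FROM users
     WHERE status = 'active' AND last_login > now() - interval '7 days';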

Sparse Data and Hot Values: Partial indexes are ideal for sparse or skewed data distributions. They act like sparse indexes in NoSQL: e.g. in DynamoDB one can create a sparse GSI that only tracks items with a certain attribute present. In relational terms, you can index only where a column IS NOT NULL – skipping rows where the indexed key is absent or not useful. This improves performance if queries naturally exclude the majority of rows. Partial indexes can also accelerate hot values or ranges – for example, indexing only recent dates in a log table if most queries ask for recent data, or only rows with priority = 'HIGH'. By tailoring indexes to query patterns, you reduce unnecessary index size and write overhead on unneeded entries.

Maintenance Costs: The flip side is that partial indexes add complexity to writes. Every insert or update must evaluate the index predicate – if a row now meets the condition (or no longer meets it), the index must be updated accordingly. This check adds some CPU overhead. Additionally, if the predicate is based on a column that is frequently updated, rows jumping in or out of the index can incur extra index insert/delete operations. Partial indexes also require the query planner to be aware of the predicate: a query can only use the index if the filter matches the index’s predicate (or is a superset of it). Modern optimizers handle this, but it means that if you have highly specialized partial indexes, you must include the filtering condition in queries, otherwise the index won’t be considered. In summary, partial indexes trade lower space/read costs for higher write costs on the indexed subset. This trade-off is usually favorable when the indexed predicate is rare enough that full indexing would be wasteful, and when writes mostly affect the unindexed portion of data.

Secondary-Index Lag

Not all indexing systems are synchronous or strongly consistent. Secondary index lag refers to scenarios where an index update is decoupled from base data updates, causing a timing gap before the index reflects the latest data. This is common in distributed NoSQL systems for scalability. For example, Amazon DynamoDB’s Global Secondary Indexes (GSIs) are eventually consistent replicas of the base table: when you update the main table, the changes propagate to GSIs asynchronously. As a result, a recently written item might not immediately appear in GSI query results (read-after-write is not guaranteed on GSIs). This consistency gap is a crucial consideration when designing around secondary indexes – e.g. you may need to enforce monotonic reads through your application or use primary-key queries for freshest data.

Async Index Writes and Write Amplification: Many systems perform index maintenance off the critical path of user writes to improve throughput. For instance, search engines often batch new documents and update their inverted index periodically. In distributed databases, secondary indexes might be updated by background processes or at least in a separate partition or node than the primary data. The benefit is higher write throughput to the primary store, but the cost is twofold: (1) Temporal lag as noted, and (2) Write amplification. Every user write may result in multiple internal writes – one to the primary storage and one per index. With DynamoDB GSIs, each indexed attribute write propagates to the index’s storage, consuming additional throughput and latency. In effect, to speed up reads on an alternate access pattern, you pay a price on writes. This can also create failure modes where if an index’s write stream lags (say due to system load or network partition), the index can temporarily serve stale or incomplete results.

Mitigations: Systems often mitigate index lag via quorum reads or read repair. For example, Cassandra’s secondary indexes (which are local to each node) might require querying multiple replicas if data is not fully synchronized, or leveraging read repair to update an out-of-date index entry upon read. DynamoDB disallows strongly consistent reads on GSIs by design, so the application must tolerate eventual consistency or use the base table key for critical read-after-write scenarios. Another approach to reduce lag is to use transactional updates to indexes (some NewSQL databases do this so that index updates commit atomically with data), but this can reduce write throughput. Designing a system, one must weigh if the convenience of secondary access patterns is worth the complexity – in some cases, duplication or denormalization (maintaining a materialized view) might give more control over consistency than a built-in async index.

Denormalization Drivers

Traditional database normalization (e.g. 3NF – Third Normal Form) minimizes redundancy for integrity’s sake, but performance needs often motivate breaking these rules. Denormalization means intentionally introducing redundancy or grouping data to optimize read patterns. The key driver is the access pattern: if your application needs certain data together frequently, and especially in read-heavy scenarios, it may be faster to store that data together even if it violates normalization.

Read vs. Write Amplification: Denormalization is a classic space-time trade-off. By duplicating or co-locating data, you reduce the cost of reads (fewer joins or lookups), at the cost of additional work on writes (needing to update the data in multiple places). In a normalized schema, an update to a customer’s name is one write to the customer table; in a denormalized schema where the name is copied into many order records, that one logical change might require hundreds of writes (update each order) – this is write amplification. However, reading an order with customer info is just one query instead of a join – reducing read amplification. The choice depends on workload: normalization is ideal for write-heavy OLTP systems where data integrity and minimal writes are crucial, whereas denormalization often benefits read-heavy scenarios like analytics and dashboards. As one source puts it: Normalization is about not storing the same fact twice; denormalization is about not calculating the same fact twice.
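
A sketch of the contrast (hypothetical orders, customers, and denormalized orders_denorm tables):

    -- Normalized: the customer's name is stored once; reads pay for a join
    SELECT o.order_id, c.name
      FROM orders o
      JOIN customers c ON c.customer_id = o.customer_id
     WHERE o.order_id = 1001;

    -- Denormalized: the name is copied into each order; reads are a single lookup...
    SELECT order_id, customer_name FROM orders_denorm WHERE order_id = 1001;

    -- ...but one logical change now fans out across many rows (write amplification)
    UPDATE orders_denorm SET customer_name = 'New Name' WHERE customer_id = 42;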

Wide vs. Narrow Rows: This is especially pronounced in NoSQL and big data designs. A wide row (or wide-table) design packs a lot of attributes (or even repeated groups) into a single item, sometimes using arrays or JSON, so that a single read yields all needed info. A narrow (normalized) design would spread that info across related tables requiring multiple lookups or joins. For example, a relational 3NF design for an order might have separate tables for orders, order_items, customers, shipping_addresses. A denormalized variant might embed customer and address details in each order record (duplicating them for each order), and maybe even embed the items list as an array or separate columns (a wide table with columns for item1, item2, etc., or a JSON blob of items). The wide approach makes the read of an order very fast (no joins), at the cost of potential inconsistencies (e.g. if a customer changes address, old orders have stale data) and extra storage. Wide-column NoSQL databases like Bigtable, Cassandra, or HBase are explicitly designed to handle wide rows and even encourage a single row to contain a large collection of data (e.g. time series events for one user) for locality. They sacrifice some of the ad-hoc flexibility of SQL joins in favor of predetermined query patterns.

When to Break the Rules: Denormalization is justified when the benefit to query performance or simplicity outweighs the drawbacks of redundancy. Common cases include: Data Warehousing – star schemas and snowflake schemas deliberately duplicate dimensional data. A star schema has a central fact table and denormalized dimension tables (with redundancies) to speed up join performance. Analytical queries in a star schema run faster because they avoid joining many normalized tables; instead they join a few big, flat tables. Another case is Caching and Precomputation – e.g. storing aggregate values or summary tables that would otherwise require expensive computations. Modern distributed systems also push denormalization due to the lack of multi-node joins – e.g. DynamoDB or Cassandra expects you to design tables for each query pattern, often duplicating data per pattern (one table by customer, another by order date, etc.). In summary, normalize for correctness, denormalize for speed. If your app is OLTP with frequent updates and strong consistency needs, lean on normalization. If it’s OLAP or read-mostly, denormalize prudently for performance. You can also mix approaches: keep a normalized core for writes, and maintain derived denormalized tables for heavy reads (possibly updated via streams or triggers). The decision should always be informed by whether the reduction in read time is worth the complexity of extra writes and potential stale data.

Normal Forms Refresher (3NF, Star, Snowflake, NoSQL)

Third Normal Form (3NF): This is a classic relational design principle where each fact is stored once, and all non-key attributes depend only on the primary key. Tables in 3NF have no repeating groups, no partial dependencies on composite keys, and no transitive dependencies. This yields a lot of small, well-structured tables (often many joins needed for queries) but guarantees minimal redundancy and anomalies. 3NF (and higher normal forms) are great for transactional consistency and for modeling complex data without duplication.

Star Schema vs. Snowflake: These are denormalized models for analytics. A star schema has one large fact table (e.g. sales transactions) and several denormalized dimension tables (e.g. a product dimension that maybe combines product info and category in one table). In a pure star, each dimension is flattened into one table – which violates 3NF by duplicating attributes (e.g. the category name repeats on every product row). The benefit is simplicity and speed: fewer joins are needed to answer questions, since each dimension is pre-joined into one table. A snowflake schema is a slight normalization of the star: dimensions are normalized into multiple related tables (e.g. a products table and a separate categories table it links to). Snowflake reduces redundancy in dimensions but reintroduces more joins. Star schemas are generally considered denormalized (for query speed at the expense of redundancy) and are widely used in OLAP systems where fast reads matter more than update anomalies. For example, in a star schema a store dimension might include city and state; in a snowflake, city and state might be separate linked tables. Snowflake saves some space, but a star places city and state directly in the store table to avoid extra joins. Both star and snowflake are dimensionally modeled – user queries (e.g. in BI tools) benefit from intuitive “cubes” of data – but star is more denormalized than snowflake.
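
A small DDL sketch of the store-dimension example (hypothetical table and column names):

    -- Star: the dimension is flattened; city and state repeat on every store row
    CREATE TABLE dim_store (
        store_id   INT PRIMARY KEY,
        store_name TEXT,
        city       TEXT,
        state      TEXT
    );

    -- Snowflake: the same dimension normalized into linked tables (less redundancy, more joins)
    CREATE TABLE dim_city (
        city_id INT PRIMARY KEY,
        city    TEXT,
        state   TEXT
    );
    CREATE TABLE dim_store_sf (
        store_id   INT PRIMARY KEY,
        store_name TEXT,
        city_id    INT REFERENCES dim_city (city_id)
    );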

Wide-Column NoSQL: In wide-column (or “big table”) databases, we see a departure from 3NF entirely. The data model is often one giant table or a few tables that hold nested or repeated data. For instance, an order record in Cassandra might embed all the items and even some customer info as part of that record (perhaps as collections or JSON) instead of normalizing them into separate tables. The database may allow hundreds of columns per row, and different rows can have different sets of columns (sparse storage). This model is optimized for queries that are known in advance – you design the table so that one key lookup fetches exactly what that query needs. It’s essentially denormalization on write: you design your schema around queries, not around avoiding redundancy. Wide-column stores often sacrifice join capabilities altogether – any relationship that you might join in an RDBMS is usually pre-materialized into the table. The advantage is massive horizontal scalability and fast reads (no distributed joins), at the cost that the application is responsible for maintaining consistency of duplicated data. If an attribute (like a user’s name) is stored in multiple places for convenience, the onus is on your update logic to sync those.

When to break normalization rules: Aside from performance reasons, sometimes 3NF is broken for practicality. Example: storing derived data (like a total order amount) with the record, even though it can be calculated by summing line items (violating normalization), because it saves computation on reads. Another example: duplicating a denormalized cache of data from another service or system inside your database for faster access, accepting that it may lag the source of truth. These decisions should be documented and justified by clear needs (e.g. “using star schema because our BI queries need to join 10 tables in 3NF, which was too slow”). With great power comes great responsibility: once you break normalization, you need processes (or code) to manage data consistency, whether through carefully managed updates, triggers, or periodic rebuilds.
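
One common way to hold such derived data is a materialized summary that is rebuilt on a schedule (a PostgreSQL-style sketch with a hypothetical order_items table):

    -- Precomputed order totals, deliberately redundant with the line items
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT order_id, SUM(quantity * unit_price) AS total_amount
      FROM order_items
     GROUP BY order_id;

    -- Must be refreshed when line items change (cron job, trigger, or stream consumer)
    REFRESH MATERIALIZED VIEW order_totals;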

JSON/Array Columns & Functional Indexes

Modern relational databases blur the line between SQL and NoSQL by supporting JSON, array, and other complex types. Indexing these requires specialized approaches:

GIN & GiST Indexes: PostgreSQL and some others provide index types like GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) to handle complex data. A GIN index is ideal for cases where each row’s column value contains multiple elements and you need to query for containment of a specific element. JSON fields and array columns are a prime example – a GIN index on a JSONB column will index every key or value, allowing queries like “does this JSON contain key = X” to use the index. In practice, GIN is the go-to index for Postgres JSONB, array membership, and full-text search, because it inverts the document into a map of keys to rows. A GiST index, on the other hand, is a generalized search tree useful for spatial data, ranges, or hierarchical data. GiST can index data by a “bounding” strategy (e.g. bounding boxes for geometries) and is flexible for custom data types. For JSONB, PostgreSQL also offers a second GIN operator class (jsonb_path_ops) that indexes only containment paths – typically smaller and faster for containment queries, though less flexible than the default class for discrete key lookups. One downside of GIN is that it can be slower to update (lots of entries to insert/delete per row) and can be larger, but for static or search-heavy data it’s a game changer. Notably, GIN indexes only support bitmap index scans (they produce a set of matching row IDs, then the executor performs a bitmap heap scan) rather than regular index-to-heap lookups, but this is usually an internal detail – the main point is GIN enables queries that B-trees cannot do efficiently.
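
A sketch of GIN in PostgreSQL syntax (hypothetical events table with a JSONB payload column):

    -- Default GIN operator class: indexes every key and value in the document
    CREATE INDEX idx_events_payload ON events USING GIN (payload);

    -- Containment query that the GIN index can answer
    SELECT * FROM events WHERE payload @> '{"status": "failed"}';

    -- jsonb_path_ops variant: smaller, containment-only
    CREATE INDEX idx_events_payload_paths ON events USING GIN (payload jsonb_path_ops);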

Functional and Generated Column Indexes: A functional index (also called an expression index) is an index on the result of a function or expression, rather than directly on a column’s stored value. This is incredibly useful for indexing computed values or specific components of a complex column. Classic examples: indexing LOWER(email) to allow case-insensitive email lookups, or indexing a specific JSON field extracted via an expression (e.g. (json_col->>'name')). Oracle introduced function-based indexes long ago, and many databases have equivalents. MySQL historically lacked direct functional indexes, but as of v8.0 it supports indexes on generated columns – basically you add a virtual or stored generated column that computes the expression, then index that. The effect is the same: queries using that function can use the index. Functional indexes are powerful to optimize computed WHERE clauses (like searching by an encrypted or hashed value, a date truncated to day, etc.). They materialize the computation so that the database doesn’t have to compute it on every row at query time – it can just traverse the index. However, every insert/update now has the overhead of computing that function to update the index. This is usually worth it if the function is expensive or the selectivity gained is high. Some databases (Oracle, SQL Server) directly support indexing on expressions; PostgreSQL allows CREATE INDEX ... (function(col)); MySQL uses the generated column trick under the hood.
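
A sketch of both flavors (hypothetical users table):

    -- PostgreSQL expression index: case-insensitive email lookups
    CREATE INDEX idx_users_email_lower ON users (LOWER(email));
    SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';

    -- MySQL 8.0.13+ functional key part (backed internally by a hidden generated column);
    -- note the extra parentheses around the expression
    CREATE INDEX idx_users_email_lower ON users ((LOWER(email)));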

JSON/Array Indexing Strategies: There are a few ways to index JSON data in an SQL database: (1) Use a GIN/GiST as mentioned for general searches, (2) Use functional indexes on specific JSON attributes if you only care about certain keys. For example, if you have a JSONB column with a field "status", you might create a functional index on (jsonb_col->>'status') to quickly filter by status. This essentially treats that key as a normal column for indexing purposes. Another trick is to use generated columns to pull out important JSON fields and store them as their own columns (which can then be indexed normally). Many systems do this when they have semi-structured data: store the raw JSON for flexibility, but also store a few critical fields in dedicated columns for performance (possibly kept in sync via triggers or application logic). Similarly for arrays, you might index each element if order doesn’t matter (using GIN), or index specific positions if that’s meaningful.
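
A PostgreSQL-style sketch of pulling a hot JSON field out for indexing (same hypothetical events table as above):

    -- Option A: a stored generated column extracted from the JSON, indexed normally
    ALTER TABLE events
      ADD COLUMN status TEXT GENERATED ALWAYS AS (payload ->> 'status') STORED;
    CREATE INDEX idx_events_status ON events (status);

    -- Option B: an expression index on the key, with no extra column
    CREATE INDEX idx_events_status_expr ON events ((payload ->> 'status'));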

In short, modern databases give you the tools to index beyond flat relational data: you can index inside a JSON document or an array. The choice of GIN vs GiST vs functional indexes depends on your query needs – GIN is great for “does it contain X” queries across many keys, GiST is useful for range and proximity queries (like “find objects within this geometric area” or “JSON that is similar in structure”), and functional indexes are perfect for exact transformations (like extracting a text value or normalizing case/format). Combining these with partial index capabilities can even let you, say, index a JSON key only for rows where another field has a certain value, and so on – very powerful when used judiciously.

Query-Planner Interplay

Database query optimizers are cost-based: they decide whether to use an index or not based on estimates of work. Statistics guide these decisions. The planner uses stats (like number of rows, value distributions, correlation, etc.) to estimate how many rows a given query condition will match, and the cost of using a particular index vs. doing a sequential scan. If stats indicate that a condition is not very selective (say ~50% of the table), the optimizer may choose a full table scan as it expects that reading half the table may be cheaper than many index lookups. On the other hand, a very selective condition (e.g. 0.1% of rows) would likely get an index scan plan. Ensuring your stats are up-to-date (via ANALYZE or the database’s auto-stats) is critical; stale or default stats can lead the planner astray – for example, if a column’s histogram is outdated, the optimizer might think an index is unnecessary when it actually would help, or vice versa.

EXPLAIN Basics: Developers preparing for system design interviews should be comfortable reading an EXPLAIN plan. This shows the steps the planner chose – whether it’s using an Index Scan, Index Only Scan, Bitmap Index Scan, or a Sequential Scan, etc., along with estimated rows and costs. Understanding these operators is key to verifying if your indexes are effective. For example, an EXPLAIN might show a Bitmap Index Scan followed by a Bitmap Heap Scan for a query with multiple conditions – meaning the database is using the index to collect candidate row ids in memory, then fetching those rows efficiently. If you expected an index to be used but see a Seq Scan, it’s a clue either the optimizer didn’t find it advantageous or the query isn’t written to leverage it (e.g. applying a function without an index, or data type mismatch preventing index usage).
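
For example, in PostgreSQL (hypothetical orders query; the plan below is only an illustration of the output shape, not real numbers):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';

    -- Illustrative plan shape:
    --   Bitmap Heap Scan on orders ...
    --     Recheck Cond: (customer_id = 42)
    --     -> Bitmap Index Scan on idx_orders_customer ...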

Index Hinting Pitfalls: Some databases allow you to force or hint the use of a particular index in a query. While hints can be a last resort to fix a plan, they are generally discouraged in production code. Manually forcing an index can backfire: today it might be the best index, but as data grows or distributions change, the hinted index could become suboptimal while the optimizer might have chosen a better plan on its own. Moreover, hints add maintenance burden – if the index name changes or is removed, the queries will break. The best practice is to let the cost-based optimizer do its job, and only intervene if you’ve verified it cannot make the right decision on its own (perhaps due to lacking statistics or a bug). Even then, prefer higher-level fixes like creating better stats (e.g. extended statistics for correlated columns) or rewriting the query. Use hints sparingly and document them well.
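
For reference, hint syntax is vendor-specific – MySQL exposes it directly, while core PostgreSQL has no hints (extensions such as pg_hint_plan fill that gap). A MySQL-style sketch with a hypothetical index name:

    -- Force consideration of a specific index; brittle if the index is renamed or dropped
    SELECT * FROM orders USE INDEX (idx_orders_customer)
     WHERE customer_id = 42;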

Planner and Index Choices: Note that the planner doesn’t always choose the “obvious” index. If a query needs to sort data and an index can provide that order, the planner might choose an index scan even if it’s slightly more costly, to avoid a separate sort step. Conversely, for a query that needs a large percentage of the table, the planner might choose a sequential scan with a parallel plan, ignoring an index that would result in many random I/Os. Also, different databases have different capabilities – some can do index merge (combine multiple single-column indexes for a multi-condition query), some cannot. Understanding these nuances can help in system design: e.g., MySQL can use at most one index per table per query (except for index merge in some cases), whereas PostgreSQL can combine bitmap indexes arbitrarily. Such differences impact how you decide to add indexes or design composite ones.

In summary, treat the optimizer as an ally: feed it good statistics and reasonable query formulations. Use EXPLAIN to verify decisions. If you hit cases where it chooses poorly (and you can’t fix by stats or rewriting), then consider guiding it with hints or fixed plans, but do so with caution and monitoring.

Observability Signals

Tuning indexing and schema strategies isn’t a one-off task – you need to continuously observe how they perform in the real system. Key metrics and signals include index usage statistics (which indexes are actually scanned and which sit unused), slow-query logs and regressed EXPLAIN plans, index size and bloat relative to the base table, the write amplification and latency added by index maintenance, and lag on asynchronously maintained secondary indexes or derived tables.

In essence, robust indexing and denormalization strategies require a feedback loop: design based on assumptions, then monitor these signals to validate (or invalidate) those assumptions and adjust accordingly.

AWS Reference Touchpoints

Finally, these concepts map directly onto AWS services that often come up in system design discussions – most notably DynamoDB, whose global secondary indexes, sparse indexes, and per-query-pattern table designs embody the secondary-index and denormalization trade-offs described above.

To conclude, indexing and denormalization are critical techniques in system design for scaling reads and writes. The choice between normalized vs. denormalized, B-tree vs. LSM, or using a GSI vs. global table replication – all boil down to understanding access patterns and trade-offs. By mastering these concepts and knowing how to apply them in cloud services like AWS, one can design systems that are both performant and maintainable, explaining the decisions with solid reasoning in an interview setting.

system-design