Indexing & Denormalization Strategies (System Design Deep Dive)
Jun 08, 2025
Index Anatomy
B-Tree Indexes: Most relational databases use B+ Tree indexes, which store keys in sorted order on pages (nodes) with a high branching factor (fan-out). This high fan-out keeps the tree shallow, often only 3–4 levels deep even for millions of rows – for example, with a fan-out of roughly 500 keys per page, a three-level tree can already address on the order of 500³ ≈ 125 million entries. In a B+ Tree, all real data records reside in the leaf nodes (with upper index pages holding only keys), and leaf pages are linked in order for efficient range scans. This structure ensures O(log n) lookup time by performing a few page reads following parent-child pointers down the tree. The trade-off is that each insert or delete might cause page splits or merges to maintain balance, adding some write cost.
LSM-Tree Indexes: Write-intensive systems (like many NoSQL stores) often use Log-Structured Merge Trees (LSM trees). An LSM-tree accumulates writes in an in-memory memtable (C0), then flushes them as sorted SSTable files on disk, merging into larger files in levels C1, C2, etc. Background compaction processes merge and purge obsolete entries from these SSTables. LSM-trees optimize sequential writes (appends) and greatly reduce random write I/O, at the cost of read amplification – a read may have to check multiple files across levels. Conversely, B-trees tend to have higher write amplification (in-place page updates and splits) but lower read amplification for point queries. In practice, B-tree indexes are often better for read-heavy or mixed workloads, while LSM indexes shine for heavy write workloads or log ingestion.
Bitmap Indexes: Bitmap indexes use bit arrays to indicate the presence of a value in multiple rows. Unlike a B-tree where each index entry points to a single row, a bitmap index entry can cover many rows via a bitmap of row IDs. This is very efficient for low-cardinality columns (e.g. a gender field with values M/F) in analytic queries. Complex predicates on multiple bitmap indexes can be resolved by bitwise AND/OR operations on the bitmaps. The drawback is that updates can be expensive – changing a value requires modifying a bit in a potentially large bitmap – and concurrent updates may lock entire bitmaps. Thus, bitmap indexes are most often used in read-mostly data warehouses and are usually ill-suited for high-concurrency OLTP updates.
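A minimal sketch of how this looks in practice (Oracle syntax; the table and column names are hypothetical):

```sql
-- Bitmap indexes suit low-cardinality columns in a read-mostly warehouse table.
CREATE BITMAP INDEX idx_cust_gender ON customers (gender);
CREATE BITMAP INDEX idx_cust_region ON customers (region);

-- The optimizer can AND the two bitmaps together and only then fetch matching rows.
SELECT COUNT(*)
FROM   customers
WHERE  gender = 'F'
  AND  region = 'EMEA';
```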
Inverted Indexes: An inverted index maps content elements to the rows (or documents) that contain them. It’s called “inverted” because it flips the data paradigm – e.g. for a text column, the index might list each word (token) and which rows contain that word. Relational databases implement this for full-text search and JSON/array fields. For example, PostgreSQL’s GIN (Generalized Inverted Index) stores a set of (key, posting list) pairs, where each key is a token or value and the posting list is the set of row IDs that have that token. A single row can appear in many posting lists (if a document contains multiple indexed words). Inverted indexes allow fast membership queries (“find all rows that contain X”) at the cost of extra storage and slower writes (since one row insert may add many index entries). They are essential for efficiently querying composite data types – e.g. finding rows where a JSON column has a certain key or a text column contains a word – something B-trees cannot handle well.
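For instance (a PostgreSQL sketch; the docs table and body column are hypothetical), a GIN index over the tokenized form of a text column supports "which rows contain this word" queries:

```sql
-- Index each token of the body text; one row contributes many index entries.
CREATE INDEX idx_docs_body_fts
    ON docs
 USING gin (to_tsvector('english', body));

-- Membership query: find all documents containing the word 'indexing'.
SELECT id
FROM   docs
WHERE  to_tsvector('english', body) @@ to_tsquery('english', 'indexing');
```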
Covering & Composite Indexes
Composite Indexes & Prefix Rules: A composite (multi-column) index indexes a concatenation of multiple columns. Such an index can accelerate queries filtering on all or just the leading subset of those columns. The order of columns in the index matters: databases can use any leftmost prefix of the index for lookups. For example, an index on (last_name, job_id, salary) can be used to search by last_name alone or last_name + job_id, but not by job_id without last_name. The index is most selective when filters on the leftmost columns are provided. Equality conditions on leading columns allow the index to jump directly to the matching range; additional range conditions on the next column further narrow the scan within that subset. However, if the first indexed column isn’t constrained, the index may be scanned in full (or not used at all) – e.g. a query on salary alone wouldn’t benefit from the (last_name, job_id, salary) index unless the optimizer does an index skip scan, which is a special case.
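A quick sketch of the prefix rule (generic SQL; the employees table is hypothetical):

```sql
CREATE INDEX idx_emp_name_job_salary
    ON employees (last_name, job_id, salary);

-- Can use the index: the leftmost column (and the next one) is constrained.
SELECT * FROM employees WHERE last_name = 'Ng';
SELECT * FROM employees WHERE last_name = 'Ng' AND job_id = 'ENG';

-- Generally cannot use the index efficiently: the leading column is not constrained,
-- so the planner falls back to a full scan (or an index skip scan where supported).
SELECT * FROM employees WHERE salary > 150000;
```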
Covering Indexes (Index-Only Scans): A covering index is one that includes all the columns a query needs, so the database can answer the query using the index alone, without touching the base table. Some databases achieve this by allowing extra included columns in the index that are not part of the search key (e.g. PostgreSQL INCLUDE or SQL Server non-key columns) – these make the index “wider” but cover more queries. When a query can be satisfied entirely by index entries, the engine performs an Index Only Scan, which traverses the index and returns results directly, avoiding expensive table lookups. This yields significant speedups especially when the table rows are large or when many rows are filtered out by the index. One caveat in MVCC databases is that an index-only scan must still check tuple visibility (e.g. PostgreSQL’s visibility map) to ensure a version is committed and not deleted – but this is usually much faster than reading the whole row.
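For example (PostgreSQL 11+ syntax; the orders table and columns are hypothetical), INCLUDE carries a non-key column in the index entries so the query below can be answered with an index-only scan:

```sql
-- Key column: customer_id (what we search by). Payload column: total_amount.
CREATE INDEX idx_orders_customer_covering
    ON orders (customer_id)
    INCLUDE (total_amount);

-- Both referenced columns live in the index, so no heap fetch is needed
-- (subject to visibility-map checks under MVCC).
SELECT customer_id, total_amount
FROM   orders
WHERE  customer_id = 42;
```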
Multi-Column Selectivity: The effectiveness of an index on multiple columns depends on the combined selectivity of those columns. Query planners estimate selectivity using column statistics. By default, many optimizers assume each filter is independent – meaning if one filter selects 10% of rows and another also 10%, together they estimate 1% of rows will match (0.1 * 0.1). In reality, columns can be correlated, and this assumption can mislead the planner. Modern databases offer extended statistics to capture correlations (or functional dependencies) so that multi-column selectivity estimates can be more accurate. If such stats are not in use, a composite index might be underutilized or a suboptimal plan chosen because the DB either overestimates or underestimates how restrictive combined conditions are. Understanding how your DB calculates combined predicate selectivity (and possibly creating multi-column stats) is key to anticipating whether a multi-column index will be chosen.
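In PostgreSQL, for instance, extended statistics on correlated columns look like this (table and column names are hypothetical):

```sql
-- Tell the planner that city and zip_code are functionally related, so it stops
-- multiplying their selectivities as if the filters were independent.
CREATE STATISTICS stat_addr_city_zip (dependencies)
    ON city, zip_code
  FROM addresses;

ANALYZE addresses;  -- populate the new statistics object
```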
Partial & Filtered Indexes
What Are Partial Indexes: A partial (filtered) index is an index built on a subset of rows that satisfy a given condition (predicate). The index only stores entries for rows meeting that condition. This is useful when a query pattern only cares about a portion of the data – for example, indexing only “active” records, or only rows where a certain column is NOT NULL. Partial indexes can be much smaller than full indexes, which yields faster scans for those queries and less storage and maintenance overhead. As a rule of thumb, “avoid indexing values you’re not querying”. For instance, if 95% of rows have status = 'inactive' and 5% are 'active' and frequently queried, you might create an index on ... WHERE status = 'active'. The database will then only index that hot subset – making the index tiny and very efficient for active-record lookups, without the bloat of all the inactive rows that queries rarely touch.
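A minimal sketch (PostgreSQL syntax; the users table is hypothetical):

```sql
-- Only the ~5% of rows with status = 'active' are stored in the index.
CREATE INDEX idx_users_active
    ON users (last_login)
 WHERE status = 'active';

-- The query repeats the index predicate, so the planner can match it
-- and scan the small partial index instead of the whole table.
SELECT id, last_login
FROM   users
WHERE  status = 'active'
  AND  last_login > now() - interval '7 days';
```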
Sparse Data and Hot Values: Partial indexes are ideal for sparse or skewed data distributions. They act like sparse indexes in NoSQL: e.g. in DynamoDB one can create a sparse GSI that only tracks items with a certain attribute present. In relational terms, you can index only where a column IS NOT NULL – skipping rows where the indexed key is absent or not useful. This improves performance if queries naturally exclude the majority of rows. Partial indexes can also accelerate hot values or ranges – for example, indexing only recent dates in a log table if most queries ask for recent data, or only rows with priority = 'HIGH'. By tailoring indexes to query patterns, you reduce unnecessary index size and write overhead on unneeded entries.
Maintenance Costs: The flip side is that partial indexes add complexity to writes. Every insert or update must evaluate the index predicate – if a row now meets the condition (or no longer meets it), the index must be updated accordingly. This check adds some CPU overhead. Additionally, if the predicate is based on a column that is frequently updated, rows jumping in or out of the index can incur extra index insert/delete operations. Partial indexes also require the query planner to be aware of the predicate: a query can only use the index if its filter matches the index’s predicate or logically implies it (i.e. the query selects a subset of the indexed rows). Modern optimizers handle this, but it means that if you have highly specialized partial indexes, you must include the filtering condition in queries, otherwise the index won’t be considered. In summary, partial indexes trade lower space/read costs for higher write costs on the indexed subset. This trade-off is usually favorable when the indexed predicate is rare enough that full indexing would be wasteful, and when writes mostly affect the unindexed portion of data.
Secondary-Index Lag
Not all indexing systems are synchronous or strongly consistent. Secondary index lag refers to scenarios where an index update is decoupled from base data updates, causing a timing gap before the index reflects the latest data. This is common in distributed NoSQL systems for scalability. For example, Amazon DynamoDB’s Global Secondary Indexes (GSIs) are eventually consistent replicas of the base table: when you update the main table, the changes propagate to GSIs asynchronously. As a result, a recently written item might not immediately appear in GSI query results (read-after-write is not guaranteed on GSIs). This consistency gap is a crucial consideration when designing around secondary indexes – e.g. you may need to enforce monotonic reads through your application or use primary-key queries for freshest data.
Async Index Writes and Write Amplification: Many systems perform index maintenance off the critical path of user writes to improve throughput. For instance, search engines often batch new documents and update their inverted index periodically. In distributed databases, secondary indexes might be updated by background processes or at least in a separate partition or node than the primary data. The benefit is higher write throughput to the primary store, but the cost is twofold: (1) Temporal lag as noted, and (2) Write amplification. Every user write may result in multiple internal writes – one to the primary storage and one per index. With DynamoDB GSIs, each indexed attribute write propagates to the index’s storage, consuming additional throughput and latency. In effect, to speed up reads on an alternate access pattern, you pay a price on writes. This can also create failure modes where if an index’s write stream lags (say due to system load or network partition), the index can temporarily serve stale or incomplete results.
Mitigations: Systems often mitigate index lag via quorum reads or read repair. For example, Cassandra’s secondary indexes (which are local to each node) might require querying multiple replicas if data is not fully synchronized, or leveraging read repair to update an out-of-date index entry upon read. DynamoDB disallows strongly consistent reads on GSIs by design, so the application must tolerate eventual consistency or use the base table key for critical read-after-write scenarios. Another approach to reduce lag is to use transactional updates to indexes (some NewSQL databases do this so that index updates commit atomically with data), but this can reduce write throughput. When designing a system, one must weigh whether the convenience of secondary access patterns is worth the complexity – in some cases, duplication or denormalization (maintaining a materialized view) might give more control over consistency than a built-in async index.
Denormalization Drivers
Traditional database normalization (e.g. 3NF – Third Normal Form) minimizes redundancy for integrity’s sake, but performance needs often motivate breaking these rules. Denormalization means intentionally introducing redundancy or grouping data to optimize read patterns. The key driver is the access pattern: if your application needs certain data together frequently, and especially in read-heavy scenarios, it may be faster to store that data together even if it violates normalization.
Read vs. Write Amplification: Denormalization is a classic space-time trade-off. By duplicating or co-locating data, you reduce the cost of reads (fewer joins or lookups), at the cost of additional work on writes (needing to update the data in multiple places). In a normalized schema, an update to a customer’s name is one write to the customer table; in a denormalized schema where the name is copied into many order records, that one logical change might require hundreds of writes (update each order) – this is write amplification. However, reading an order with customer info is just one query instead of a join – reducing read amplification. The choice depends on workload: normalization is ideal for write-heavy OLTP systems where data integrity and minimal writes are crucial, whereas denormalization often benefits read-heavy scenarios like analytics and dashboards. As one source puts it: Normalization is about not storing the same fact twice; denormalization is about not calculating the same fact twice.
Wide vs. Narrow Rows: This is especially pronounced in NoSQL and big data designs. A wide row (or wide-table) design packs a lot of attributes (or even repeated groups) into a single item, sometimes using arrays or JSON, so that a single read yields all needed info. A narrow (normalized) design would spread that info across related tables requiring multiple lookups or joins. For example, a relational 3NF design for an order might have separate tables for orders, order_items, customers, shipping_addresses. A denormalized variant might embed customer and address details in each order record (duplicating them for each order), and maybe even embed the items list as an array or separate columns (a wide table with columns for item1, item2, etc., or a JSON blob of items). The wide approach makes the read of an order very fast (no joins), at the cost of potential inconsistencies (e.g. if a customer changes address, old orders have stale data) and extra storage. Wide-column NoSQL databases like Bigtable, Cassandra, or HBase are explicitly designed to handle wide rows and even encourage a single row to contain a large collection of data (e.g. time series events for one user) for locality. They sacrifice some of the ad-hoc flexibility of SQL joins in favor of predetermined query patterns.
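As an illustration (PostgreSQL-flavored SQL; all table and column names are hypothetical), the same order can be modeled narrow or wide:

```sql
-- Narrow / normalized (3NF-ish): reads require joins, but each fact lives in one place.
CREATE TABLE customers   (customer_id BIGINT PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders      (order_id BIGINT PRIMARY KEY, customer_id BIGINT REFERENCES customers);
CREATE TABLE order_items (order_id BIGINT, sku TEXT, qty INT);

-- Wide / denormalized: one read returns everything, but customer details are
-- duplicated into every order and must be kept in sync by the application.
CREATE TABLE orders_wide (
    order_id       BIGINT PRIMARY KEY,
    customer_name  TEXT,
    customer_city  TEXT,
    items          JSONB   -- embedded list of {sku, qty} objects
);
```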
When to Break the Rules: Denormalization is justified when the benefit to query performance or simplicity outweighs the drawbacks of redundancy. Common cases include: Data Warehousing – star schemas and snowflake schemas deliberately duplicate dimensional data. A star schema has a central fact table and denormalized dimension tables (with redundancies) to speed up join performance. Analytical queries in a star schema run faster because they avoid joining many normalized tables; instead they join a few big, flat tables. Another case is Caching and Precomputation – e.g. storing aggregate values or summary tables that would otherwise require expensive computations. Modern distributed systems also push denormalization due to the lack of multi-node joins – e.g. DynamoDB or Cassandra expects you to design tables for each query pattern, often duplicating data per pattern (one table by customer, another by order date, etc.). In summary, normalize for correctness, denormalize for speed. If your app is OLTP with frequent updates and strong consistency needs, lean on normalization. If it’s OLAP or read-mostly, denormalize prudently for performance. You can also mix approaches: keep a normalized core for writes, and maintain derived denormalized tables for heavy reads (possibly updated via streams or triggers). The decision should always be informed by whether the reduction in read time is worth the complexity of extra writes and potential stale data.
Normal Forms Refresher (3NF, Star, Snowflake, NoSQL)
Third Normal Form (3NF): This is a classic relational design principle where each fact is stored once, and all non-key attributes depend only on the primary key. Tables in 3NF have no repeating groups, no partial dependencies on composite keys, and no transitive dependencies. This yields a lot of small, well-structured tables (often many joins needed for queries) but guarantees minimal redundancy and anomalies. 3NF (and higher normal forms) are great for transactional consistency and for modeling complex data without duplication.
Star Schema vs. Snowflake: These are denormalized models for analytics. A star schema has one large fact table (e.g. sales transactions) and several denormalized dimension tables (e.g. a product dimension that maybe combines product info and category in one table). In a pure star, each dimension is flattened into one table – which violates 3NF by duplicating attributes (e.g. the category name repeats on every product row). The benefit is simplicity and speed: fewer joins are needed to answer questions, since each dimension is pre-joined into one table. A snowflake schema is a slight normalization of the star: dimensions are normalized into multiple related tables (e.g. a products table and a separate categories table it links to). Snowflake reduces redundancy in dimensions but reintroduces more joins. Star schemas are generally considered denormalized (for query speed at the expense of redundancy) and are widely used in OLAP systems where fast reads matter more than update anomalies. For example, in a star schema a store dimension might include city and state; in a snowflake, city and state might be separate linked tables. Snowflake saves some space, but a star places city and state directly in the store table to avoid extra joins. Both star and snowflake are dimensionally modeled – user queries (e.g. in BI tools) benefit from intuitive “cubes” of data – but star is more denormalized than snowflake.
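A compact sketch of the store example (generic SQL; names are hypothetical):

```sql
-- Star: the store dimension is flat, so city/state values repeat across stores.
CREATE TABLE dim_store (
    store_id   INT PRIMARY KEY,
    store_name TEXT,
    city       TEXT,
    state      TEXT
);

-- Snowflake: geography is split out, removing redundancy but adding a join.
CREATE TABLE dim_city     (city_id INT PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE dim_store_sf (
    store_id   INT PRIMARY KEY,
    store_name TEXT,
    city_id    INT REFERENCES dim_city
);

-- The fact table joins to whichever dimension variant is in use.
CREATE TABLE fact_sales (
    sale_id  BIGINT PRIMARY KEY,
    store_id INT,
    amount   NUMERIC
);
```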
Wide-Column NoSQL: In wide-column (or “big table”) databases, we see a departure from 3NF entirely. The data model is often one giant table or a few tables that hold nested or repeated data. For instance, an order record in Cassandra might embed all the items and even some customer info as part of that record (perhaps as collections or JSON) instead of normalizing them into separate tables. The database may allow hundreds of columns per row, and different rows can have different sets of columns (sparse storage). This model is optimized for queries that are known in advance – you design the table so that one key lookup fetches exactly what that query needs. It’s essentially denormalization on write: you design your schema around queries, not around avoiding redundancy. Wide-column stores often sacrifice join capabilities altogether – any relationship that you might join in an RDBMS is usually pre-materialized into the table. The advantage is massive horizontal scalability and fast reads (no distributed joins), at the cost that the application is responsible for maintaining consistency of duplicated data. If an attribute (like a user’s name) is stored in multiple places for convenience, the onus is on your update logic to sync those.
When to break normalization rules: Aside from performance reasons, sometimes 3NF is broken for practicality. Example: storing derived data (like a total order amount) with the record, even though it can be calculated by summing line items (violating normalization), because it saves computation on reads. Another example: duplicating a denormalized cache of data from another service or system inside your database for faster access, accepting that it may lag the source of truth. These decisions should be documented and justified by clear needs (e.g. “using star schema because our BI queries need to join 10 tables in 3NF, which was too slow”). With great power comes great responsibility: once you break normalization, you need processes (or code) to manage data consistency, whether through carefully managed updates, triggers, or periodic rebuilds.
JSON/Array Columns & Functional Indexes
Modern relational databases blur the line between SQL and NoSQL by supporting JSON, array, and other complex types. Indexing these requires specialized approaches:
GIN & GiST Indexes: PostgreSQL and some others provide index types like GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) to handle complex data. A GIN index is ideal for cases where each row’s column value contains multiple elements and you need to query for containment of a specific element. JSON fields and array columns are a prime example – a GIN index on a JSONB column will index every key or value, allowing queries like “does this JSON contain key = X” to use the index. In practice, GIN is the go-to index for Postgres JSONB, array membership, and full-text search, because it inverts the document into a map of keys to rows. A GiST index, on the other hand, is a generalized search tree useful for spatial data, ranges, or hierarchical data. GiST can index data by a “bounding” strategy (e.g. bounding boxes for geometries) and is flexible for custom data types. For JSONB, PostgreSQL also offers an alternative GIN operator class (jsonb_path_ops) that is smaller and faster for containment (@>) queries, while the default operator class is more common for discrete key lookups. One downside of GIN is that it can be slower to update (lots of entries to insert/delete per row) and can be bigger in size, but for static or search-heavy data it’s a game changer. Notably, GIN indexes only support bitmap index scans (they produce a set of matching row IDs, then the executor performs a bitmap heap scan) rather than regular index-to-heap lookups, but this is usually an internal detail – the main point is GIN enables queries that B-trees cannot do efficiently.
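A short PostgreSQL sketch (the events table and payload column are hypothetical):

```sql
-- Default GIN operator class: supports key-existence (?) and containment (@>) operators.
CREATE INDEX idx_events_payload ON events USING gin (payload);

-- Alternative operator class: smaller and faster, but limited to containment queries.
-- CREATE INDEX idx_events_payload_path ON events USING gin (payload jsonb_path_ops);

-- Both queries below can use the default GIN index:
SELECT * FROM events WHERE payload ? 'user_id';                 -- has this key?
SELECT * FROM events WHERE payload @> '{"status": "failed"}';   -- contains this key/value pair?
```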
Functional and Generated Column Indexes: A functional index (also called an expression index) is an index on the result of a function or expression, rather than directly on a column’s stored value. This is incredibly useful for indexing computed values or specific components of a complex column. Classic examples: indexing LOWER(email) to allow case-insensitive email lookups, or indexing a specific JSON field extracted via an expression (e.g. (json_col->>'name')). Oracle introduced function-based indexes long ago, and many databases have equivalents. MySQL historically lacked direct functional indexes, but as of v8.0 it supports indexes on generated columns – basically you add a virtual or stored generated column that computes the expression, then index that. The effect is the same: queries using that function can use the index. Functional indexes are powerful to optimize computed WHERE clauses (like searching by an encrypted or hashed value, a date truncated to day, etc.). They materialize the computation so that the database doesn’t have to compute it on every row at query time – it can just traverse the index. However, every insert/update now has the overhead of computing that function to update the index. This is usually worth it if the function is expensive or the selectivity gained is high. Some databases support this natively: Oracle indexes expressions directly, SQL Server indexes computed columns, PostgreSQL allows CREATE INDEX ... (function(col)), and MySQL uses the generated column trick under the hood.
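Two sketches of the same idea (the users table and email column are hypothetical): a PostgreSQL expression index, and the MySQL generated-column equivalent:

```sql
-- PostgreSQL: index the expression itself; note the extra parentheses.
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- A query such as WHERE LOWER(email) = 'a@example.com' can now use the index.

-- MySQL (5.7+/8.0): materialize the expression as a generated column, then index it.
ALTER TABLE users
    ADD COLUMN email_lower VARCHAR(255) GENERATED ALWAYS AS (LOWER(email)) STORED,
    ADD INDEX idx_users_email_lower (email_lower);
```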
JSON/Array Indexing Strategies: There are a few ways to index JSON data in an SQL database: (1) Use a GIN/GiST as mentioned for general searches, (2) Use functional indexes on specific JSON attributes if you only care about certain keys. For example, if you have a JSONB column with a field "status", you might create a functional index on (jsonb_col->>'status') to quickly filter by status. This essentially treats that key as a normal column for indexing purposes. Another trick is to use generated columns to pull out important JSON fields and store them as their own columns (which can then be indexed normally). Many systems do this when they have semi-structured data: store the raw JSON for flexibility, but also store a few critical fields in dedicated columns for performance (possibly kept in sync via triggers or application logic). Similarly for arrays, you might index each element if order doesn’t matter (using GIN), or index specific positions if that’s meaningful.
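For instance (PostgreSQL 12+; the events table and payload field are hypothetical), a generated column can promote one JSON key to a first-class, B-tree-indexable column:

```sql
-- Extract payload->>'status' into its own stored column, kept in sync automatically.
ALTER TABLE events
    ADD COLUMN status TEXT GENERATED ALWAYS AS (payload->>'status') STORED;

-- A plain B-tree index on the promoted column; equality filters on status
-- no longer need to parse the JSON at query time.
CREATE INDEX idx_events_status ON events (status);
```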
In short, modern databases give you the tools to index beyond flat relational data: you can index inside a JSON document or an array. The choice of GIN vs GiST vs functional indexes depends on your query needs – GIN is great for “does it contain X” queries across many keys, GiST is useful for range and proximity queries (like “find objects within this geometric area” or “JSON that is similar in structure”), and functional indexes are perfect for exact transformations (like extracting a text value or normalizing case/format). Combining these with partial index capabilities can even let you, say, index a JSON key only for rows where another field has a certain value, and so on – very powerful when used judiciously.
Query-Planner Interplay
Database query optimizers are cost-based: they decide whether to use an index or not based on estimates of work. Statistics guide these decisions. The planner uses stats (like number of rows, value distributions, correlation, etc.) to estimate how many rows a given query condition will match, and the cost of using a particular index vs. doing a sequential scan. If stats indicate that a condition is not very selective (say ~50% of the table), the optimizer may choose a full table scan as it expects that reading half the table may be cheaper than many index lookups. On the other hand, a very selective condition (e.g. 0.1% of rows) would likely get an index scan plan. Ensuring your stats are up-to-date (via ANALYZE or the database’s auto-stats) is critical; stale or default stats can lead the planner astray – for example, if a column’s histogram is outdated, the optimizer might think an index is unnecessary when it actually would help, or vice versa.
EXPLAIN Basics: Developers preparing for system design interviews should be comfortable reading an EXPLAIN plan. This shows the steps the planner chose – whether it’s using an Index Scan, Index Only Scan, Bitmap Index Scan, or a Sequential Scan, etc., along with estimated rows and costs. Understanding these operators is key to verifying if your indexes are effective. For example, an EXPLAIN might show a Bitmap Index Scan followed by a Bitmap Heap Scan for a query with multiple conditions – meaning the database is using the index to collect candidate row ids in memory, then fetching those rows efficiently. If you expected an index to be used but see a Seq Scan, it’s a clue either the optimizer didn’t find it advantageous or the query isn’t written to leverage it (e.g. applying a function without an index, or data type mismatch preventing index usage).
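A minimal way to check this in PostgreSQL (orders is a hypothetical table; the plan shape will vary with your data and statistics):

```sql
-- ANALYZE runs the query and reports actual row counts alongside the estimates;
-- BUFFERS shows how many pages came from cache vs. disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   orders
WHERE  customer_id = 42
  AND  status = 'ACTIVE';
-- Things to look for in the output:
--   * Index Scan / Index Only Scan / Bitmap Index Scan vs. Seq Scan
--   * estimated rows vs. actual rows (large gaps suggest stale statistics)
```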
Index Hinting Pitfalls: Some databases allow you to force or hint the use of a particular index in a query. While hints can be a last resort to fix a plan, they are generally discouraged in production code. Manually forcing an index can backfire: today it might be the best index, but as data grows or distributions change, the hinted index could become suboptimal while the optimizer might have chosen a better plan on its own. Moreover, hints add maintenance burden – if the index name changes or is removed, the queries will break. The best practice is to let the cost-based optimizer do its job, and only intervene if you’ve verified it cannot make the right decision on its own (perhaps due to lacking statistics or a bug). Even then, prefer higher-level fixes like creating better stats (e.g. extended statistics for correlated columns) or rewriting the query. Use hints sparingly and document them well.
Planner and Index Choices: Note that the planner doesn’t always choose the “obvious” index. If a query needs to sort data and an index can provide that order, the planner might choose an index scan even if it’s slightly more costly, to avoid a separate sort step. Conversely, for a query that needs a large percentage of the table, the planner might choose a sequential scan with a parallel plan, ignoring an index that would result in many random I/Os. Also, different databases have different capabilities – some can do index merge (combine multiple single-column indexes for a multi-condition query), some cannot. Understanding these nuances can help in system design: e.g., MySQL can use at most one index per table per query (except for index merge in some cases), whereas PostgreSQL can combine bitmap indexes arbitrarily. Such differences impact how you decide to add indexes or design composite ones.
In summary, treat the optimizer as an ally: feed it good statistics and reasonable query formulations. Use EXPLAIN to verify decisions. If you hit cases where it chooses poorly (and you can’t fix by stats or rewriting), then consider guiding it with hints or fixed plans, but do so with caution and monitoring.
Observability Signals
Tuning indexing and schema strategies isn’t a one-off task – you need to continuously observe how they perform in the real system. Key metrics and signals include:
- Cache & Index Hit Ratios: This measures how often data (or index pages) are served from memory cache versus disk. For example, PostgreSQL tracks an index hit ratio – the fraction of index reads that were satisfied from RAM. A high index hit ratio (close to 1.0 or 99%+) means your indexes mostly stay in memory, which is good for read performance. If this ratio drops, it may indicate your indexes are too large for cache or you need more memory. Similarly, a low cache hit ratio for tables means the server is doing a lot of disk I/O – perhaps adding an index could reduce full table scans, or perhaps the working set exceeds available RAM. Many monitoring tools surface these ratios as first-glance health metrics for databases (see the SQL sketch after this list for pulling these numbers from PostgreSQL).
- Index Usage and Scan Counts: Beyond cache hits, look at how often each index is used in queries. Most databases have stats like “index scans” count per index (e.g. pg_stat_user_indexes in Postgres shows number of index scans and tuples read). If an index is rarely used (zero or very few scans) but incurs overhead on writes, that’s a candidate for removal or reconsideration. On the other hand, if you see frequent sequential scans on large tables (in pg_stat_user_tables), that might indicate missing indexes. Tracking the ratio of index scans to table scans over time can reveal if queries are benefiting from indexes as expected.
- Index Bloat: Especially in MVCC databases, indexes can accumulate dead entries or empty space due to deletes/updates. Over time, an index might become much larger on disk than the actual live data it indexes, which can slow down scans. Tools like pgstattuple or pg_reorg can help measure and combat bloat. A classic sign of bloat is indexes with a lot of pages where dead_tuple_percent is high (meaning a lot of space taken by dead rows). Regular VACUUM maintenance in PostgreSQL helps keep bloat in check by marking dead index tuples reusable, but it doesn’t always shrink the index size without a reindex or vacuum full. Monitoring bloat metrics or at least index size vs. table size can inform when you need to intervene. In systems like Lucene or Elasticsearch (inverted indexes), “bloat” can occur as fragmentation of segments – they periodically merge segments to optimize this, analogous to vacuuming.
- Vacuum/Compaction Lag: For relational systems, if the autovacuum process can’t keep up (perhaps due to high write volume), you’ll see growing bloat as mentioned. You might monitor autovacuum run times, wait times, or queue depth. In LSM-based stores, compaction lag is a similar concept – if background compaction falls behind incoming writes, the number of SSTables grows, causing read slowdowns. Some DBs expose a “compaction pending” metric or a level 0 SSTable count. CockroachDB, for instance, warns of an “inverted LSM” when the higher levels contain more data than lower levels due to compaction lag. This leads to very inefficient reads (high read amplification). Thus, monitoring compaction activity and backlog is important for systems like Cassandra, RocksDB-based engines, etc. If you see compaction falling behind (or vacuum not keeping up), it may be time to tune those processes (e.g. allocate more IO bandwidth or tweak thresholds) or consider that the write volume is exceeding what the current design can handle (maybe time to scale out or use a different approach).
- Query Latency (P99): Average query times can be misleading; one needs to watch tail latencies, especially the 95th or 99th percentile (p99). Indexing issues often manifest in outlier queries. For example, if most queries use an index and are fast, but occasionally the planner chooses a bad plan (maybe due to a parameter that makes the index less efficient) and that query is slow – the averages might look fine but p99 latency will spike. By tracking p99, you can catch if some queries are doing expensive table scans or experiencing I/O stalls. If you add a new index intended to improve performance, the p95/p99 latency of relevant queries should drop, and if not, something is off. High variance in latency could indicate inconsistent index usage or stalls from things like lock contention or vacuums – but it often ties back to whether the indexing strategy is robust for all values. In an interview context, citing p99 latency as a metric shows you understand real-world performance, not just theoretical throughput.
- Throughput and CPU/IO utilization: Indexes change where the database spends resources. After adding an index, you might see overall CPU go up (due to faster execution doing more queries per second) or IO go down (less scanning). Monitoring resource metrics alongside indexing changes is key. If write CPU or IO jumps drastically after adding several indexes, that’s the write amplification we discussed quantifiably hitting – maybe it’s acceptable, or maybe an index wasn’t worth it. Tools like AWS CloudWatch, database performance views, or APM tools can correlate queries per second, buffer reads, etc., with schema changes.
- AWS specifics: If using AWS RDS/Aurora, you have specific CloudWatch metrics (e.g. for buffer cache hit ratio, replication lag which can affect indexes in a replica, etc.). DynamoDB provides metrics for consumed capacity on GSIs and throttling – if a GSI is too slow to update, you might see write throttles or increased latency on that index’s operations. Athena’s query metrics (via AWS console or logs) can show how much data was scanned for a query – this is a direct indicator of how effective your partitioning/projection strategy is. For example, if an Athena query scans 100GB when it should have only scanned 1GB with proper partition pruning, that’s a sign your partition projection or predicate pushdown isn’t working as intended.
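A couple of PostgreSQL snippets for the signals above (the views and columns are the standard statistics views; the idx_scan threshold is purely illustrative):

```sql
-- Index hit ratio: fraction of index block reads served from shared buffers.
SELECT sum(idx_blks_hit)::float
       / NULLIF(sum(idx_blks_hit + idx_blks_read), 0) AS index_hit_ratio
FROM   pg_statio_user_indexes;

-- Rarely used indexes: candidates to drop (they still cost writes and space).
SELECT schemaname, relname, indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  idx_scan < 50
ORDER  BY idx_scan;
```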
In essence, robust indexing and denormalization strategies require a feedback loop: design based on assumptions, then monitor these signals to validate (or invalidate) those assumptions and adjust accordingly.
AWS Reference Touchpoints
Finally, let’s connect these concepts with a few AWS-specific contexts often brought up in system design:
- Amazon RDS (PostgreSQL) – Covering vs. Partial Indexes: PostgreSQL on RDS supports both partial indexes and covering indexes (via INCLUDE columns on indexes since PostgreSQL 11). Knowing when to use one vs. the other is useful. A covering index (index with INCLUDE columns) is great when you have a frequently run SELECT that filters by column A but also needs column B from the table. Instead of making B part of the primary index key (which might bloat the index and affect its selectivity), you can INCLUDE B as a non-key column. This way the index still is primarily organized by A, but carries B along for the ride to support index-only retrieval. Use covering indexes to avoid hitting the heap for additional columns needed in a query (an index-only scan, as discussed). On the other hand, a partial index is used when not all rows need indexing – e.g. CREATE INDEX idx_active_orders ON orders(status) WHERE status = 'ACTIVE'. This index is much smaller and faster if most orders are not active. Sometimes you can even combine both: a partial covering index – index only a subset of rows, and include some extra columns to fully cover certain queries. On RDS PostgreSQL, these features can drastically improve performance if used appropriately. For instance, one might create a partial index on a “hot” partition of data (like the last 6 months) to speed up recent queries while keeping the index small. The AWS RDS platform doesn’t limit these PostgreSQL features – so they’re fully available to leverage in interview answers when discussing how to optimize a relational workload on AWS.
- DynamoDB – GSIs vs. LSIs: DynamoDB offers two types of secondary indexes: Global and Local. An LSI (Local Secondary Index) shares the same partition key as the base table but allows an alternate sort key. It is updated transactionally with the table (strongly consistent reads are allowed on LSIs), and counts against the base table’s storage quota. LSIs are useful when you need a different sorting or filtering on the same partition data. However, you only get 5 LSIs per table and must define them at table creation. A GSI has its own partition key (and optional sort key) and can be queried independently of the base table key. GSIs are more flexible (you can add them on existing tables, with different keys), but they come with eventual consistency and separate throughput provisioning. Under the hood, a GSI is essentially a denormalized copy of a projection of your table, potentially in a totally different partitioning scheme. This means write amplification: each write to the main table that affects indexed attributes will result in a write to the GSI’s table. If your GSIs have heavy write volume, you’ll pay in additional WCU (Write Capacity Units) and might experience increased latency for the write to fully propagate. A common design strategy is to denormalize into GSIs only what you absolutely need for query flexibility, because you’re incurring cost and eventually-consistent behavior. For instance, to support querying users by email, you might make email a GSI partition key (duplicating that data). In system design answers, it’s good to mention that DynamoDB GSIs allow rich query patterns but one must account for the propagation lag and extra costs – perhaps using DynamoDB Streams to monitor update flows if needed. The “Good, Bad, Ugly” of GSIs often boils down to: Good: flexible queries; Bad: cost and write overhead; Ugly: eventual consistency can return stale data if not handled. LSIs, being local, don’t add write cost (they use the base table’s throughput) but are inflexible and also limited in size (an item collection can’t exceed 10GB, across the main item and all its LSI entries). So if asked how to design a schema for, say, an e-commerce system on DynamoDB, you might say: use an LSI if you need an alternative sort on the same partition (e.g. query orders by date and by priority within the same customer partition) and use GSIs for different access patterns (e.g. lookup by email, lookup all orders by product ID, etc.), while being mindful of the trade-offs.
- Athena (and S3 data lakes) – Partitioning & Projection: AWS Athena is a query service on data stored in S3 (often in Parquet/ORC/JSON). It doesn’t use indexes like a traditional DB, but performance is driven by how data is partitioned and how queries can prune unnecessary data. Partitioning in Athena (through Hive Metastore or Glue) is akin to creating an “index” on a particular column by storing data in directory prefixes. For example, you might partition logs by date (.../dt=2025-06-01/ etc.). Athena will then scan only the folders relevant to a query’s date filter, skipping others. Partition Projection is an Athena feature where you can define partition patterns and values without having to explicitly crawl them, allowing Athena to infer which S3 paths to scan based on the query, again reducing I/O. Essentially, partition projection is an optimization to avoid both a metastore dependency and excessive S3 listings – it’s like telling Athena “trust me, data is partitioned by this scheme; when a query has WHERE dt = '2025-06-01', just go to that partition path.” This greatly speeds up query planning and avoids scanning irrelevant data. Projection in the context of Athena/Presto also refers to column projection: because formats like Parquet are columnar, Athena will read only the columns needed by the query (this is done automatically, known as predicate and projection pushdown). The idea is similar to a covering index: if your query only needs 3 out of 50 columns, a columnar format acts like an index that allows skipping the other 47 columns. The net effect is that good data layout in S3 (partitioning, sorting within files, using columnar storage) can act as a surrogate for indexing. For instance, instead of an index on “country” in a 10 billion row log dataset, you might partition the data by country and date, such that a query for WHERE country = 'US' AND date = '2025-06-01' only reads that slice of files – effectively achieving the efficiency of an index lookup in a traditional DB. Athena also recently introduced Iceberg and Glue Data Catalog improvements that allow secondary indexes or metadata pushdowns in some scenarios, but generally, thinking in terms of partitioning strategy is key. So in an interview, if Athena comes up, talk about how you’d partition data (and possibly use partition projection) to minimize scanned data (e.g. “For a multi-tenant dataset, partition by tenant_id or by event date, so queries can skip irrelevant partitions”). Also mention Athena statistics if relevant: Glue can maintain column stats that let Athena’s cost-based engine optimize joins and such, but partition pruning is the primary “index-like” tool (a DDL sketch of partition projection follows this list).
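A hedged Athena DDL sketch of date-based partition projection (the bucket, table, and column names are hypothetical; the property keys follow Athena’s partition-projection conventions):

```sql
CREATE EXTERNAL TABLE app_logs (
    request_id  string,
    country     string,
    latency_ms  int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/app_logs/'
TBLPROPERTIES (
    'projection.enabled'        = 'true',
    'projection.dt.type'        = 'date',
    'projection.dt.range'       = '2024-01-01,NOW',
    'projection.dt.format'      = 'yyyy-MM-dd',
    'storage.location.template' = 's3://my-bucket/app_logs/dt=${dt}/'
);
-- A filter such as WHERE dt = '2025-06-01' now maps straight to one S3 prefix,
-- so Athena reads only that partition's Parquet files.
```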
To conclude, indexing and denormalization are critical techniques in system design for scaling reads and writes. The choice between normalized vs. denormalized, B-tree vs. LSM, or using a GSI vs. global table replication – all boil down to understanding access patterns and trade-offs. By mastering these concepts and knowing how to apply them in cloud services like AWS, one can design systems that are both performant and maintainable, explaining the decisions with solid reasoning in an interview setting.