SerialReads

Database Foundations & Landscape

Jun 02, 2025

Database Foundations & Landscape

Why We Need Databases (The Core Problem)

Imagine keeping a bank’s ledger in a notebook. It works for one teller, but as transactions grow and multiple tellers work in parallel, consistency falls apart. We need a system that multiple people (or services) can use at scale without tripping over each other. This is the core role of a database: to safely store data, handle concurrent updates, and retrieve information efficiently – even as usage explodes. Databases shine where naive solutions (like simple file storage) break down under heavy load, complex queries, or errors. They ensure correctness (you don’t lose money in your account due to a crash) and performance (your social feed loads quickly despite millions of users). The challenge is that achieving this is hard: you must juggle reliability, speed, and scale all at once. As data grows, issues like slow queries, lost updates, or partial failures become serious. A solid mental model of database fundamentals – before diving into vendor specifics – helps in reasoning through these trade-offs on a whiteboard.

Building an Intuitive Mental Model

Think of a database as a highly organized library or ledger clerk for your app’s data. It doesn’t just put data on disk; it also enforces rules (no two users with the same username if that’s a key), coordinates multiple readers/writers (so one client’s update doesn’t corrupt another’s read), and optimizes access (like a library index speeding up book searches). This mental model guides many design decisions. For example, if you picture data as a ledger of transactions, you intuitively understand why atomicity (all or nothing updates) is vital – you wouldn’t want to subtract money from Alice without adding it to Bob in a transfer. Before we name-drop “ACID” or “NoSQL,” it helps to reason in these real-world terms. The goal is to internalize concepts: e.g., eventually, multiple notebooks need to sync up (eventual consistency) or a librarian might categorize books by subject or author (like choosing different data models and indexes). With this intuition, we can now explore technical foundations.

Data Models: Relational vs. NoSQL Families

Relational databases (SQL databases) organize data into tables (think Excel sheets) with predefined schemas. This schema-on-write approach means you decide the structure upfront and every entry conforms. It’s like a well-defined form you must fill out – great for consistency and complex querying (joins), but inflexible if your data is highly varied. In contrast, NoSQL databases embrace a variety of models for flexible or large-scale needs, often using schema-on-read (store data first, define structure when reading). The NoSQL family spans several categories:

- Key-value stores (e.g., Redis, DynamoDB): the simplest model – a key maps to a value. Extremely fast lookups and easy to partition, but queries beyond “get/put by key” are limited.
- Document stores (e.g., MongoDB): each record is a self-contained JSON-like document, so related data can be nested together and the schema can evolve per document.
- Wide-column stores (e.g., Cassandra, HBase): rows are grouped into partitions and can carry many sparse columns; built for huge write volumes and horizontal scale.
- Graph databases (e.g., Neo4j): nodes and relationships are first-class citizens, making traversals like “friends of friends” natural to express.

Each of these NoSQL types has different strengths – e.g., key-value and wide-column prioritize simple operations and scalability, sometimes at the cost of complex query capability or strict consistency. Relational databases, on the other hand, excel at structured queries and multi-object transactions, but historically were harder to scale out horizontally (though modern SQL engines and NewSQL have mitigated this). A useful way to think of it: Relational is schema-first, great for consistent, interrelated data (like a banking database with accounts and transfers), whereas NoSQL is often schema-flexible, great for fast-evolving or massive-scale data (like capturing billions of web click events or iterating quickly on a new app’s data model). Many systems today use a mix: e.g., user accounts in Postgres, activity logs in Cassandra, caching in Redis (a key-value store), and maybe a search index in Elasticsearch – each chosen for what it does best.

Consistency and CAP: ACID vs. BASE Models

When we say a database transaction is ACID, we mean it guarantees a set of properties that keep data correct despite errors or crashes:

- Atomicity: a transaction’s changes apply entirely or not at all – a transfer never debits Alice without crediting Bob.
- Consistency: each transaction moves the database from one valid state to another, honoring declared constraints and invariants.
- Isolation: concurrent transactions don’t see each other’s half-finished work; the outcome is as if they ran one at a time (to the degree set by the isolation level).
- Durability: once committed, data survives crashes or power loss (typically via a write-ahead log flushed to disk).

Relational databases typically adhere to ACID, which makes life easier when correctness matters – you trust the DB to handle consistency. However, in distributed systems and some NoSQL stores, strict ACID is relaxed in favor of availability or performance.
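
To make atomicity concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The accounts table, balances, and simulated failure are invented for illustration – real code would rely on a CHECK constraint or application logic rather than a hard-coded exception:

```python
import sqlite3

# Minimal sketch of an atomic transfer (table and data are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")  # abort mid-transfer
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Neither update is visible: the failed transfer rolled back as a single unit.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]
```

That all-or-nothing behavior is what keeps application code simple: it never has to clean up a half-applied transfer.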

This is where BASE comes in – an alternative philosophy often used in large-scale NoSQL systems. BASE stands for Basically Available, Soft state, Eventually consistent:

- Basically Available: the system keeps responding to reads and writes even when parts of it fail, possibly returning slightly stale data.
- Soft state: the system’s state may change over time even without new input, as replicas exchange updates in the background.
- Eventually consistent: if writes stop, all replicas converge to the same value – there is just no guarantee of exactly when.

BASE sacrifices the “always consistent right now” guarantee for the sake of partition tolerance and uptime. This ties into the famous CAP Theorem: in a distributed system, you can’t have Consistency, Availability, and Partition tolerance all perfectly at once – you must choose to forgo one. Here, consistency in CAP means all nodes see the same data at the same time, availability means the system can respond (perhaps with older data) even when some nodes are down, and partition tolerance means the system tolerates network splits without total failure. CA systems (e.g., a single-instance relational DB or a sync-master setup) choose consistency+availability over partition tolerance – they might just fail or stop on a network split. AP systems (like Dynamo-style NoSQL stores) choose availability + partition tolerance, accepting stale reads during a partition. CP systems (like MongoDB or Redis in certain modes) choose consistency + partition tolerance, meaning during a partition they may sacrifice availability (e.g., some operations stall) to keep data consistent. The intuitive take-away: know your requirements. For a chat application, maybe it’s okay if a message order is slightly out of sync for a moment (BASE/AP) as long as the service is up; for a bank ledger, you likely want ACID/CP – it’s better to deny service or fail over than show incorrect balances. Good design is about picking the right spot in this spectrum – many modern databases even offer tunable consistency (for example, Cassandra can be configured how many replicas must confirm a write). In interviews, expressing this trade-off thinking (“what do we lose if we want zero downtime across regions?”) is key.
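
The replica-counting idea behind tunable consistency reduces to a small piece of arithmetic. The sketch below is a toy check of the quorum overlap rule used by Dynamo-style stores, not any particular database’s implementation:

```python
# With N replicas, writes acknowledged by W of them, and reads consulting R of them,
# R + W > N guarantees every read overlaps at least one replica that saw the latest write.
def read_sees_latest_write(n_replicas: int, w: int, r: int) -> bool:
    """True if any R read replicas must intersect the W replicas that acknowledged the write."""
    return r + w > n_replicas

print(read_sees_latest_write(3, w=2, r=2))  # True  -> quorum writes + quorum reads: strong reads
print(read_sees_latest_write(3, w=1, r=1))  # False -> fastest, but reads may be stale (eventual consistency)
```

Lowering W and R buys latency and availability; raising them buys consistency – the same spectrum the CAP discussion describes, expressed as knobs.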

OLTP vs. OLAP Workloads (and the Rise of HTAP)

Not all database use cases are alike. Broadly, we differentiate OLTP (Online Transaction Processing) from OLAP (Online Analytical Processing). OLTP is the bread-and-butter of applications: transactional, fine-grained operations. Imagine an ATM system – each withdrawal is an OLTP transaction: small, quick, and touching only a few records, but the system handles thousands per second with low latency. Here, the emphasis is on latency (each operation must be fast) and concurrency (many operations at once). The data access pattern is typically random reads/writes of a few rows by key (e.g., fetch account by ID and update it).

OLAP, on the other hand, is about analytical queries on large data sets. Think of a business intelligence dashboard or a data warehouse query: “What was the total sales by region last quarter?” These queries crunch through millions of rows. Latency per query can be seconds or minutes, which is acceptable because they do heavy lifting and are typically issued by scheduled reports or analysts rather than by live user actions. The emphasis is on throughput (scanning lots of data efficiently) and aggregations. OLAP queries often do full table scans or large joins but not very frequently, whereas OLTP does many small queries continuously.

Because of these differences, systems historically specialized. OLTP databases (e.g., MySQL, PostgreSQL) are row-oriented and optimized for writes and random reads, while OLAP databases (e.g., data warehouse appliances, or modern engines like ClickHouse) typically use columnar storage and are optimized for big scans and compressibility. It was common to extract data from OLTP systems into a separate OLAP system for analytics (the ETL process), because each is tuned so differently.

However, maintaining two systems is painful (data duplication, lag between OLTP and OLAP). This has led to HTAP (Hybrid Transaction/Analytical Processing) systems that try to do both. Modern cloud data platforms and some databases (TiDB, SingleStore, etc.) aim to let you run analytical queries on the same system that handles transactions – often by mixing storage engines or replicating data under the hood. For example, one approach is to have a row-store and a column-store behind a unified SQL interface, so OLTP queries hit the row storage and OLAP queries can automatically use a columnar replica. The rise of in-memory computing and distributed SQL (NewSQL) also feeds HTAP – using memory or fast networks to soften the traditional trade-off. The bottom line: understand the workload. If an interviewer says the system must handle user actions and heavy analytics, you might propose either splitting the workload or using an HTAP technology. At least mention how row vs. column storage could come into play: OLTP tends to favor row-based storage (all fields of a record together, good for writing a whole record or reading one user’s info) whereas OLAP loves columnar (each column stored contiguously, good for reading one attribute for many records, like summing sales for all rows).

Storage Engines and Indexes: Rows, Columns, and Trees

Physical storage layout has a huge impact on performance. A row-store keeps all columns of a row together on disk. This is great for OLTP – when you fetch a row by its primary key, you get all its info in one disk I/O. A column-store keeps values of each column together (e.g., all customer names in one segment, all customer ages in another). This shines for OLAP – if you run SELECT SUM(amount) FROM sales WHERE region='West', a column-store can read just the amount and region columns, greatly reducing I/O and using CPU-efficient vectorized processing. Column stores also compress data well (adjacent values are similar), so they can scan millions of values in memory quickly. The trade-off is that writing a new record (which spans many columns) is costlier, and reading an entire entity (all columns) is slower because its fields are scattered across column segments. Many analytical DBs mitigate this with tricks like batching writes and partitioning.
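
A tiny in-memory illustration of the layout difference, with plain Python lists standing in for disk pages (not a real storage engine):

```python
# Row-store: each record's fields kept together.
rows = [
    {"order_id": 1, "region": "West", "amount": 30.0},
    {"order_id": 2, "region": "East", "amount": 12.5},
    {"order_id": 3, "region": "West", "amount": 7.25},
]

# Column-store: each attribute stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "region": ["West", "East", "West"],
    "amount": [30.0, 12.5, 7.25],
}

# OLTP-style access: fetch one whole record -> a single lookup in the row layout.
print(rows[1])

# OLAP-style access: SUM(amount) WHERE region='West' -> touches only two columns.
total = sum(a for a, r in zip(columns["amount"], columns["region"]) if r == "West")
print(total)  # 37.25
```

The column layout never touches order_id for the aggregate – exactly the I/O saving (and compression opportunity) columnar engines exploit at scale.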

Within either layout, we have to consider heap files vs. indexed (clustered) storage. A heap file is an unsorted pile of records – new inserts just go to any free space. It’s simple, but finding data requires an index or a full scan. A clustered index means the main table storage is ordered by a key (e.g., sorted by customer_id). This can make range queries on that key very fast (all records with customer_id between X and Y are contiguous on disk) and also inherently gives you an index on that key (usually a B-tree). The downside is that maintaining that order on inserts can cost extra work (might need to split pages, etc.), and you typically only get one clustered order (others must be secondary indexes). For example, InnoDB (MySQL’s engine) clusters by primary key. If your access pattern aligns (range scans by primary key), it’s a win. If not, it’s no worse than a heap with an index.

Secondary indexes are additional structures that allow fast lookup by other columns. Think of an index in a book that maps a term to pages. Commonly, databases use B-tree indexes for this. A B-tree is a balanced search tree that keeps keys sorted and allows logarithmic time search. It’s optimized for the block-based storage of disks: each node (page) might contain many keys, minimizing disk hits. B-trees excel at range queries and point lookups on sorted data. If you query “give me all users with name starting ‘Sm...’”, a B-tree on the name field can quickly seek to the first Sm and then read in order. Most relational DB indexes are B+ trees (a variant where all data is in leaf nodes and leaves are linked for in-order traversal).
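
Because SQLite stores tables and indexes as B-trees, it offers a convenient way to see the “seek to the first ‘Sm’ and read in order” behavior – the schema and data below are made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("Smith",), ("Smythe",), ("Jones",), ("Small",), ("Taylor",)])
conn.execute("CREATE INDEX idx_users_name ON users(name)")  # secondary B-tree index

# Prefix search expressed as a range: seek to 'Sm', read leaf entries in order until 'Sn'.
query = "SELECT name FROM users WHERE name >= 'Sm' AND name < 'Sn' ORDER BY name"
print(conn.execute(query).fetchall())  # [('Small',), ('Smith',), ('Smythe',)]

for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # the plan should report a SEARCH ... USING INDEX idx_users_name
```

The ORDER BY comes essentially for free here – the index already keeps names sorted, so no separate sort step is needed.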

An alternative found in many NoSQL/newer systems is the LSM (Log-Structured Merge) tree. An LSM isn’t a single tree but a design that defers sorting: new data is first buffered (in memory, sorted structure), and periodically flushed to disk in sequential order. Disk storage ends up as multiple sorted runs that are merged in the background. Writes are very fast (sequential I/O) and can be buffered, which is why LSM-based stores (like Cassandra, RocksDB, Scylla) can sustain high write rates. Reads, however, may need to check multiple sorted files and merge results (mitigated by structures like Bloom filters and partition indexes). LSMs tend to have higher write throughput but potentially more read amplification (data might be in several places). B-trees update pages in place (turning updates into random writes on disk), which is slower for writes but faster for reads (data lives in a single structure). With modern SSDs (which handle random I/O much better than HDDs), the B-tree vs LSM trade-off line has moved: random writes are less costly than before, though LSMs still shine for write-heavy workloads or where sequential access is paramount.
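
A heavily simplified sketch of the LSM idea (a real engine adds a write-ahead log, background compaction of the runs, and Bloom filters to skip runs that cannot contain a key):

```python
import bisect

class TinyLSM:
    """Toy LSM: a mutable memtable plus immutable sorted runs, newest first."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}      # newest writes, buffered in memory
        self.runs = []          # flushed, sorted [(key, value), ...] lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                # fast: no disk seek, just a buffer insert
        if len(self.memtable) >= self.memtable_limit:
            self.runs.insert(0, sorted(self.memtable.items()))  # flush as one sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                  # newest data wins
            return self.memtable[key]
        for run in self.runs:                     # may probe several runs: read amplification
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"user:{i}", {"n": i})
print(db.get("user:3"), "runs flushed:", len(db.runs))  # {'n': 3} runs flushed: 2
```

Writes only touch the memtable and append whole sorted runs (sequential I/O); reads may have to consult the memtable plus every run, which is the read-amplification trade-off described above.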

Databases also use hash indexes in some cases. A hash index is like a hash table on disk: very fast point lookups (given exactly a key, it tells you where the data is), but no ordering – you can’t get ranges or sorted traversal easily. Some systems use hash indexes for primary keys (Oracle has optional hash clusters; some NoSQL caches are essentially distributed hash tables). In interviews, if you mention index structures, remember: B-trees (good general-purpose, ordered), LSM (optimized for heavy writes, used in many big NoSQL engines), and hash (fast equality lookups, but limited use). Also note index maintenance costs – every index must be updated on writes, so more indexes = slower inserts.
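
A toy contrast, with plain Python structures standing in for on-disk index pages:

```python
from bisect import bisect_left, bisect_right

# Hash index: key -> record location. O(1) point lookups, but no notion of order.
hash_index = {"jones": 42, "small": 7, "smith": 17}
print(hash_index["jones"])                      # exact-match lookup: fast

# Ordered (B-tree-like) index: sorted keys support range and prefix queries.
sorted_keys = sorted(hash_index)
lo = bisect_left(sorted_keys, "sm")
hi = bisect_right(sorted_keys, "sn")
print(sorted_keys[lo:hi])                       # ['small', 'smith'] -- keys in the 'sm' prefix range

# The hash index cannot answer that prefix query without scanning every key.
```

This is why hash structures show up mainly for strict key-value access, while B-trees remain the general-purpose default.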

Query Planning and Optimization

When you issue a query (e.g., an SQL SELECT with joins), the database must decide how to execute it. This is the job of the query planner/optimizer. It’s like Google Maps for data: it finds possible routes to get your answer and chooses the fastest by estimated cost. The planner uses statistics about your data: how many rows in a table, value distributions, etc., to guess things like “this filter will hit ~100 rows” or “joining table A and B this way will yield ~1M results”. Based on these estimates, it picks an execution plan: which indexes to use, which table to scan first, which join algorithm (nested loop, hash join, etc.). Modern optimizers are cost-based: they assign a cost to each step (based on I/O, CPU, etc.) and seek the lowest total cost plan.

A simple example: Suppose you have SELECT * FROM Orders o JOIN Customers c ON o.cust_id = c.id WHERE c.region = 'West'. The optimizer might consider using an index on Customers.region to find Western customers, then looping through their orders, versus scanning Orders and looking up each customer, etc. If stats say only 5% of customers are West, using the index and then join might be cheapest. If 90% are West, maybe a different approach (like scanning orders and filtering customers) is chosen.
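
Most engines let you inspect the chosen plan with EXPLAIN. The snippet below uses SQLite’s EXPLAIN QUERY PLAN purely because it ships with Python – the schema is invented, and the exact plan text varies by engine and by the statistics available:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_id INTEGER REFERENCES customers(id));
    CREATE INDEX idx_customers_region ON customers(region);
    CREATE INDEX idx_orders_cust ON orders(cust_id);
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id
    WHERE c.region = 'West'
""").fetchall()
for _, _, _, detail in plan:
    print(detail)
# Typically something like:
#   SEARCH c USING INDEX idx_customers_region (region=?)
#   SEARCH o USING INDEX idx_orders_cust (cust_id=?)
# i.e., find the Western customers via the region index, then loop over each one's orders.
```

In PostgreSQL or MySQL the equivalent is EXPLAIN (or EXPLAIN ANALYZE to also run the query), whose output includes the estimated row counts the optimizer based its choice on.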

Cardinality estimation – estimating how many rows pass through each step – is where errors often occur. If the optimizer misestimates (e.g., thinks a filter is highly selective when it’s not, or fails to account for correlated conditions), it might pick a suboptimal plan. A common pitfall: multi-column correlations. Imagine a table with WHERE state = 'CA' AND city = 'San Francisco'. The optimizer might treat these independently and think “state=CA returns 10% of rows, city=SF returns 1%, combined maybe 0.1%” – but in reality San Francisco is in California, so the conditions are correlated and the combined selectivity is closer to 1%, not 0.1%. Such misestimates lead to wrong join choices or index usage. Other pitfalls include outdated stats (the data changed, but stats didn’t, so the planner’s estimates are off) and non-indexable predicates (using a function on a column can thwart index use). Interviewers might probe this by asking, for example, “why might a database not use an index even if one exists for a query?” – expected answer: the optimizer decides the indexed plan isn’t actually cheaper, perhaps due to low selectivity or misestimation, or because reading from the index then table is actually more I/O than a full table scan.

The key point: the database tries to automate query decisions. For a system design discussion, you should be aware that adding an index doesn’t guarantee speed – the query planner must choose to use it, and it might not if the cost model says not to. Also, the complexity of optimization is why database engines are hard to build – as one paper joked, it’s “harder than rocket science” to get right. But as a designer, focus on conceptual behaviors: e.g., a large join across distributed nodes might need data shuffling (expensive), so maybe pre-partition data to avoid that. Those kinds of insights show you understand what the optimizer wants to do.

Durability and Fault Tolerance: WAL, Checkpoints, and Hardware

To guarantee durability (the “D” in ACID), databases use a Write-Ahead Log (WAL). The idea is simple: before applying any changes to the main data files, record the intent in an append-only log on disk. This log is fsynced (force-written) to disk on commit. If the database crashes, it can replay the log to redo any operations that were committed but not fully applied to the data files. The WAL ensures that even if the process dies mid-flight, there’s a record of what was supposed to happen, which will be completed upon recovery. This is analogous to saving your work in a journal – if the system goes down, you open the journal and reconcile any incomplete entries with the main ledger.

Checkpointing complements WAL: periodically the database will flush dirty pages (in-memory changes) to disk and mark a checkpoint in the log. This way, the log doesn’t grow forever – on recovery you only need to replay from the last checkpoint. Checkpoints balance recovery time vs runtime overhead.

fsync (and similar calls) matters because operating systems cache disk writes. A commit that isn’t flushed to physical storage can be lost if the OS or machine crashes. So databases explicitly request flushes at critical moments (like commit) to ensure data has hit stable storage. Proper use of fsync (and hardware that honors it) is crucial; infamous bugs and long debates have arisen when disks lie about durability (write caching) or when certain file systems handle flush errors in surprising ways.
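
Here is a minimal sketch of the commit path just described – append to the log, fsync, then apply – plus the replay step recovery would run. The file name and record format are made up; a real engine also handles torn writes, checksums, and checkpoint-based log truncation:

```python
import json
import os

LOG_PATH = "wal.log"  # hypothetical append-only log file

def commit(change: dict, table: dict) -> None:
    with open(LOG_PATH, "a", encoding="utf-8") as log:
        log.write(json.dumps(change) + "\n")   # 1. record the intent in the log
        log.flush()
        os.fsync(log.fileno())                 # 2. force it to stable storage before acknowledging
    table[change["key"]] = change["value"]     # 3. only then apply to the main data structures

def recover() -> dict:
    """Replay the log after a crash to rebuild committed state."""
    table = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH, encoding="utf-8") as log:
            for line in log:
                change = json.loads(line)
                table[change["key"]] = change["value"]
    return table

table = {}
commit({"key": "alice", "value": 100}, table)
commit({"key": "bob", "value": 50}, table)
print(recover())  # {'alice': 100, 'bob': 50} -- recoverable even if step 3 never ran
```

A checkpoint in this sketch would amount to writing `table` out to its own durable file and truncating wal.log up to that point – exactly the recovery-time vs. runtime-overhead balance described above.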

Modern hardware has influenced these designs. With traditional spinning HDDs, sequential writes (like WAL appends) were much faster than random writes (seeking all over a disk). This made WAL + periodic flush (mostly sequential) a big win. SSDs have no seek penalty, drastically improving random I/O performance. This makes random-write B-tree updates more palatable and also makes fsync less costly (SSDs have fast internal caches), but the fundamental need for WAL hasn’t gone away. Meanwhile, persistent memory (NVM) and ultra-fast NVMe SSDs are starting to blur the line between memory and storage. Some systems can memory-map NVM and treat it as an always-persistent store, potentially eliminating the WAL (since you could update data “in place” durably if done carefully, or keep a memory-speed log). These trends haven’t removed the need for careful durability logic, but they do change the performance calculus. For example, flushing 8KB to a traditional disk costs millions of CPU cycles, while on NVMe or NVM it is far cheaper, so fsync is less of a bottleneck than it used to be. SSDs also narrow the read/write gap, so the LSM vs. B-tree choice becomes more workload-specific than strictly hardware-driven: an LSM on SSD still shines for massive write volumes, but the read penalty is less severe on fast SSDs, making B-trees competitive for many workloads.

In summary, durability in design interviews: mention WAL for sure (“on a transaction commit, the DB writes to a log on disk to guarantee it can recover”), mention that data pages are written later (and thus the need for checkpointing), and perhaps note that on modern cloud storage and SSDs, these operations are quite fast – and that cloud-native databases often take it further by replicating data to multiple nodes or regions (sometimes instead of or in addition to local fsync). For instance, some cloud DBs will replicate the WAL to another zone before confirming commit, adding a network durability layer (beyond our scope here, but good to hint at if high availability is a goal).

Bottlenecks and Golden Signals in Databases

Operating a database, you’ll want to watch certain key metrics (often called golden signals) to catch performance issues:

- Buffer/cache hit ratio: the fraction of reads served from memory rather than disk. A falling hit ratio usually means the working set no longer fits in RAM and latency is about to climb.
- Lock contention: time spent waiting on locks (and deadlock counts). Rising lock waits point to hot rows, long-running transactions, or missing indexes.
- Tail latency (P95/P99): the slowest percentiles of query latency. Averages hide pain – the 99th percentile shows what your unluckiest (and often most active) users experience.

Other typical signals include throughput (QPS), error rates (e.g., deadlocks or failed transactions), and resource saturation (CPU, disk IOPS). But the ones above (memory hit ratio, locks, P99 latency) are great to mention for relational systems. For a distributed NoSQL, you might also mention replication lag or split brain occurrences as things to watch, but that’s beyond core fundamentals.
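
Two of these signals reduce to quick arithmetic over raw counters. The numbers below are invented; the blks_hit/blks_read names mirror the counters PostgreSQL exposes in pg_stat_database, but any engine’s equivalents work the same way:

```python
# Buffer/cache hit ratio from cumulative hit/miss counters.
blks_hit, blks_read = 9_800_000, 200_000
hit_ratio = blks_hit / (blks_hit + blks_read)
print(f"buffer hit ratio: {hit_ratio:.1%}")  # 98.0% -- a drop often means the working set outgrew RAM

# Tail latency (P99) from a sample of request latencies in milliseconds.
latencies_ms = sorted([3, 4, 4, 5, 5, 6, 7, 9, 12, 250])  # one slow outlier
p99 = latencies_ms[min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))]
print(f"p99 latency: {p99} ms")              # 250 ms -- the average (~30 ms) hides the pain entirely
```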

Database Landscape at a Glance (Engines and their Sweet Spots)

Below is a comparison of some popular database engines, highlighting their primary data model and the workloads they fit best:

| Database Engine (Type) | Primary Data Model | Best-Fit Workload & Strengths |
| --- | --- | --- |
| PostgreSQL, MySQL (Relational RDBMS) | Relational tables (SQL, schema-on-write, ACID); row-store by default | Traditional OLTP (transactions, structured data). Great for strong consistency and complex queries/joins. PostgreSQL shines with complex analytics and extensibility; MySQL is popular for read-heavy web workloads (often with read replicas). |
| MongoDB (Document NoSQL) | Document store (JSON-like documents, schema-flexible) | Flexible schema use cases and rapid development. Suited for semi-structured data (e.g., content management, user profiles). Can handle high read/write throughput with eventual consistency; now also offers ACID transactions on a smaller scale. |
| DynamoDB (Key-Value / Wide-Column NoSQL) | Key-value with optional document support (partition-key and sort-key model; schema-on-read) | Serverless, cloud-native scale-out store. Ideal for simple key-based access patterns at massive scale (e.g., user preferences, IoT data). Excels in availability and auto-scaling. Less optimal for complex queries beyond key lookups or aggregates. |
| Cassandra (Wide-Column NoSQL) | Wide-column store (partitioned rows with many columns; tunable consistency) | High write throughput, multi-datacenter replication. Great for time-series, logging, and use cases requiring append-heavy workloads. Trades strict consistency for availability (AP by default). Best when data model can be denormalized into partition-centric queries. |
| Neo4j (Graph DB) | Graph model (nodes, relationships with properties) | Connected data queries (e.g., social networks, recommendation engines). Optimized for traversals like “friends-of-friends” or graph analytics. Suitable when relationships are central and you need flexible, depth-first queries that are hard to express in SQL. |
| ClickHouse (Analytical Columnar DB) | Columnar storage, SQL interface (distributed OLAP engine) | OLAP / analytical workloads. Excels at aggregating large datasets in real time (e.g., user event analytics, telemetry). High compression and vectorized execution give speed on big scans. Not designed for heavy transactional updates (append-oriented). |
(Note: Many other notable engines exist; this table focuses on a representative set. Each shines in a particular niche – choosing one in a design depends on the data model and access patterns needed.)

What Interviewers Often Probe (Follow-up Questions)


Final tip: Interviewers at the senior/staff level care about conceptual clarity. They might not expect you to recall obscure syntax, but they will probe whether you truly understand the why behind database design decisions. Focus on explaining trade-offs (e.g., “we choose X over Y because it improves throughput at the cost of latency, which in this scenario is acceptable”) and use analogies when appropriate. By mastering these fundamentals – from data models to durability – you’ll be equipped to tackle system design questions with confidence and authority. Good luck!

system-design