Designing Data-Intensive Applications
Martin Kleppmann's central thesis: data is the central challenge in modern systems — not compute. The book is a map for making sound decisions about storage, replication, consistency, and processing at scale.
Three properties every data system should strive for:
---
Data Models & Query Languages
Every data model embeds assumptions about how data is used. Choosing the wrong one creates friction at every layer.
Relational Model
Data organized as tables of rows. Joins connect related data across tables.
Best for: structured data with predictable relationships and complex queries.
-- Normalized: each fact stored once
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.region = 'SEA';
Document Model
Data stored as self-contained documents (JSON/BSON). Related data is nested or referenced.
Best for: data with a natural tree structure, where reads mostly fetch one document at a time.
{
"user_id": "u123",
"name": "Yudi",
"orders": [
{ "order_id": "o1", "total": 250000 }
]
}
Schema-on-read (documents) vs schema-on-write (relational): documents are flexible but push validation to the application.
Graph Model
Vertices (entities) and edges (relationships). Optimized for traversing many hops between entities.
Best for: social graphs, recommendation engines, fraud detection, knowledge graphs.
(User)-[:FOLLOWS]->(User)-[:PURCHASED]->(Product)-[:SIMILAR_TO]->(Product)
---
Storage & Retrieval
Log-Structured Storage (LSM-Tree + SSTables)
Writes always append to an in-memory structure (memtable), which is periodically flushed as an immutable sorted file (SSTable). Background compaction merges files.
Write → memtable (in RAM)
↓ (when full)
SSTable on disk
↓ (background)
Compaction (merge + deduplicate)
B-Tree Index
Breaks data into fixed-size pages (~4KB). Reads and writes traverse the tree from root to leaf.
Root → Branch pages → Leaf pages (actual data)
OLTP vs OLAP
| Property | OLTP | OLAP |
|---|---|---|
| Read pattern | Small rows, by key | Large scans, aggregations |
| Write pattern | Frequent, low-latency | Bulk loads / ETL |
| Dataset size | GB–TB | TB–PB |
| Used by | Application backends | Analysts, dashboards |
| Examples | PostgreSQL, MySQL | Redshift, BigQuery, Snowflake |
---
Encoding & Schema Evolution
Systems change over time. Old and new code must be able to coexist and exchange data — this is backward and forward compatibility.
Formats Compared
| Format | Schema required | Human-readable | Versioning |
|---|---|---|---|
| JSON / XML | No | Yes | Manual |
| Avro | Yes (in file or registry) | No | Strong |
| Protocol Buffers | Yes (.proto) | No | Strong |
| Thrift | Yes | No | Strong |
Avro Schema Evolution Rules
// v1 schema
{ "type": "record", "name": "User",
"fields": [{ "name": "id", "type": "string" }] }
// v2 schema — added field with default (backward compatible)
{ "type": "record", "name": "User",
"fields": [
{ "name": "id", "type": "string" },
{ "name": "email", "type": "string", "default": "" }
]
}
Rules:
---
Replication
Replication keeps copies of data on multiple nodes for fault tolerance and read scalability.
Leader–Follower (Single-Leader)
All writes go to the leader. Followers replicate and serve reads.
Client → Leader (write)
Leader → Follower 1 (replicate)
Leader → Follower 2 (replicate)
Client → Follower 1 (read)
Replication lag creates consistency problems:
Multi-Leader Replication
Multiple nodes accept writes. Useful for multi-datacenter or offline-capable apps.
Trade-off: write conflicts must be resolved. Common strategies:
Leaderless Replication (Dynamo-style)
Writes and reads go to multiple nodes simultaneously. Quorums determine validity.
Write quorum: w nodes must acknowledge
Read quorum: r nodes must respond
Guarantee: w + r > n (at least one node has the latest value)
Used by: Cassandra, DynamoDB, Riak
---
Partitioning (Sharding)
Split a large dataset across multiple nodes so each node holds a subset.
Key Range Partitioning
Assign contiguous ranges of keys to partitions.
Partition 1: keys A–F
Partition 2: keys G–M
Partition 3: keys N–Z
Hash Partitioning
Apply a hash function to the key, assign to partitions by hash range.
partition = hash(key) % num_partitions
Secondary Indexes on Partitioned Data
---
Transactions
A transaction groups multiple reads and writes into one logical unit that either fully commits or fully rolls back.
ACID
| Property | Meaning |
|---|---|
| Atomicity | All or nothing — partial writes never visible |
| Consistency | Invariants hold before and after (application-defined) |
| Isolation | Concurrent transactions don't interfere with each other |
| Durability | Committed data survives crashes |
Isolation Levels & Anomalies
| Isolation Level | Dirty Read | Non-repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented |
Serializable Snapshot Isolation (SSI): detects read-write conflicts at commit time and aborts transactions that would create anomalies. Optimistic — no locks held during transaction.
---
Distributed Systems Problems
Unreliable Networks
Messages can be lost, delayed, reordered, or duplicated. Timeouts do not tell you whether a request succeeded — only that you didn't get a response in time.
Request sent → network packet lost → timeout
Request sent → response lost → timeout (but write happened!)
Always design for idempotency: retrying a request produces the same result as running it once.
Unreliable Clocks
Consequence: you cannot use timestamps to determine the order of events across nodes. Use logical clocks (Lamport timestamps, vector clocks) instead.
Process Pauses
A process can be paused at any point (GC pause, OS scheduling, VM migration) and resume without knowing time has passed. A node that holds a lease cannot assume the lease is still valid after a pause.
---
Consistency & Consensus
Linearizability
The strongest consistency guarantee: the system behaves as if there is a single copy of the data, and all operations are atomic.
Without linearizability:
Client A writes x=1
Client B reads x=0 ← stale replica
Client C reads x=1 ← different answer at the same moment
With linearizability:
Once write is confirmed, all subsequent reads see x=1
Cost: requires coordination → high latency, availability loss under network partition (CAP theorem).
CAP Theorem
In the presence of a network Partition, a system must choose between:
Most real systems choose CP (ZooKeeper, etcd) or AP (Cassandra, DynamoDB) depending on the use case.
Consensus Algorithms
Consensus: getting multiple nodes to agree on a single value, even when some nodes fail.
Used for: leader election, distributed locks, atomic broadcast.
Raft (simpler to understand than Paxos):
1. Leader election — one node elected per term
2. Log replication — leader appends entry, replicates to followers
3. Commit — once majority acknowledges, entry is committed
4. All committed entries are durable even if leader crashes
Implementations: etcd, CockroachDB, TiKV
---
Batch Processing
Process a large, bounded dataset all at once. Outputs are derived datasets (indexes, aggregations, ML features).
MapReduce
Two-phase computation:
Input files
↓ Map (emit key-value pairs per record)
Shuffle (group all values for same key)
↓ Reduce (aggregate per key)
Output files
# Map: count word occurrences
def map(record):
for word in record.split():
emit(word, 1)
# Reduce: sum counts per word
def reduce(word, counts):
emit(word, sum(counts))
Modern dataflow engines (Spark, Flink, Beam) replace MapReduce: keep data in memory, support arbitrary DAGs (not just map → reduce), reuse intermediate results.
---
Stream Processing
Process a continuous, unbounded stream of events as they arrive.
Event Streams vs Batch
| Property | Batch | Stream |
|---|---|---|
| Input | Bounded (files) | Unbounded (events) |
| Latency | Minutes–hours | Milliseconds–seconds |
| Trigger | Scheduled run | Event arrival |
Key Concepts
Message broker (Kafka, Pulsar): durable log of events. Consumers can rewind and replay.
Producer → [Topic: order.created] → Consumer A (billing)
→ Consumer B (inventory)
→ Consumer C (analytics)
Stream-stream join: correlate events from two streams within a time window.
ad_impressions INNER JOIN ad_clicks
ON impression.ad_id = click.ad_id
AND click.time BETWEEN impression.time AND impression.time + 1h
Windowing: group events by time.
Exactly-Once Processing
Delivering each event exactly once (not zero, not two) requires:
---
Lambda & Kappa Architecture
Lambda Architecture
Combines batch and stream layers to serve both accurate historical results and low-latency recent results.
Input → Batch layer (recomputes full history) → Serving layer
→ Speed layer (processes recent events) → Serving layer
↓
Query merges both
Drawback: maintaining two code paths for the same logic.
Kappa Architecture
Eliminate the batch layer — use a replayable stream (Kafka) as the source of truth.
Input → Stream layer only (fast path)
Reprocess by replaying from Kafka offset when needed
Simpler, but requires a replayable event log and idempotent processing.
---
Key Takeaways
| Topic | Key Insight |
|---|---|
| Data models | Match your model to your access patterns — not to familiarity |
| Storage engines | LSM-trees optimize writes; B-trees optimize reads |
| Replication | Lag is unavoidable — design for eventual consistency or pay for strong |
| Partitioning | Hash for distribution, range for scans; both can create hot spots |
| Transactions | Isolation levels are trade-offs between correctness and performance |
| Distributed clocks | Never use wall-clock time to order events across machines |
| Consistency | Linearizability is expensive; choose the weakest guarantee you can tolerate |
| Consensus | Leader election and distributed locks require a real consensus algorithm |
| Batch processing | Designed for throughput; reprocessing is the built-in error recovery |
| Stream processing | Designed for latency; exactly-once requires idempotency + transactional writes |