Designing Data-Intensive Applications

Martin Kleppmann's central thesis: data is the central challenge in modern systems — not compute. The book is a map for making sound decisions about storage, replication, consistency, and processing at scale.

Three properties every data system should strive for:

Reliability — works correctly even under faults (hardware, software, human error)

Scalability — handles growth in data, traffic, or complexity

Maintainability — can be understood and modified by people other than the original author

---

Data Models & Query Languages

Every data model embeds assumptions about how data is used. Choosing the wrong one creates friction at every layer.

Relational Model

Data organized as tables of rows. Joins connect related data across tables.

Best for: structured data with predictable relationships and complex queries.

-- Normalized: each fact stored once
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.region = 'SEA';

Document Model

Data stored as self-contained documents (JSON/BSON). Related data is nested or referenced.

Best for: data with a natural tree structure, where reads mostly fetch one document at a time.

{
  "user_id": "u123",
  "name": "Yudi",
  "orders": [
    { "order_id": "o1", "total": 250000 }
  ]
}

Schema-on-read (documents) vs schema-on-write (relational): documents are flexible but push validation to the application.

Graph Model

Vertices (entities) and edges (relationships). Optimized for traversing many hops between entities.

Best for: social graphs, recommendation engines, fraud detection, knowledge graphs.

(User)-[:FOLLOWS]->(User)-[:PURCHASED]->(Product)-[:SIMILAR_TO]->(Product)

---

Storage & Retrieval

Log-Structured Storage (LSM-Tree + SSTables)

Writes always append to an in-memory structure (memtable), which is periodically flushed as an immutable sorted file (SSTable). Background compaction merges files.

Write → memtable (in RAM)
           ↓ (when full)
        SSTable on disk
           ↓ (background)
        Compaction (merge + deduplicate)

Writes are fast (sequential I/O)

Reads may touch multiple SSTables (mitigated with Bloom filters)

Used by: LevelDB, RocksDB, Cassandra, HBase

B-Tree Index

Breaks data into fixed-size pages (~4KB). Reads and writes traverse the tree from root to leaf.

Root → Branch pages → Leaf pages (actual data)

Reads are predictable (O(log n))

Writes do in-place updates (random I/O)

Standard in: PostgreSQL, MySQL, SQLite

OLTP vs OLAP

Property	OLTP	OLAP
Read pattern	Small rows, by key	Large scans, aggregations
Write pattern	Frequent, low-latency	Bulk loads / ETL
Dataset size	GB–TB	TB–PB
Used by	Application backends	Analysts, dashboards
Examples	PostgreSQL, MySQL	Redshift, BigQuery, Snowflake

Column-oriented storage excels at OLAP: stores each column separately so queries only read the columns they need.

---

Encoding & Schema Evolution

Systems change over time. Old and new code must be able to coexist and exchange data — this is backward and forward compatibility.

Formats Compared

Format	Schema required	Human-readable	Versioning
JSON / XML	No	Yes	Manual
Avro	Yes (in file or registry)	No	Strong
Protocol Buffers	Yes (.proto)	No	Strong
Thrift	Yes	No	Strong

Avro Schema Evolution Rules

// v1 schema
{ "type": "record", "name": "User",
  "fields": [{ "name": "id", "type": "string" }] }

// v2 schema — added field with default (backward compatible)
{ "type": "record", "name": "User",
  "fields": [
    { "name": "id",    "type": "string" },
    { "name": "email", "type": "string", "default": "" }
  ]
}

Rules:

Adding a field with a default is safe (old readers ignore it, new readers use the default when reading old data)

Removing a field without a default breaks backward compatibility

Never change a field's type

---

Replication

Replication keeps copies of data on multiple nodes for fault tolerance and read scalability.

Leader–Follower (Single-Leader)

All writes go to the leader. Followers replicate and serve reads.

Client → Leader (write)
Leader → Follower 1 (replicate)
Leader → Follower 2 (replicate)

Client → Follower 1 (read)

Replication lag creates consistency problems:

Read-your-writes — a user reads data they just wrote, but the follower hasn't caught up yet

Monotonic reads — a user sees data disappear because two reads hit different followers at different lag levels

Consistent prefix reads — causally related writes appear out of order

Multi-Leader Replication

Multiple nodes accept writes. Useful for multi-datacenter or offline-capable apps.

Trade-off: write conflicts must be resolved. Common strategies:

Last Write Wins (LWW) — simplest, loses data

Application-level merge — correct, complex

CRDTs (Conflict-free Replicated Data Types) — automatic merging for specific data structures

Leaderless Replication (Dynamo-style)

Writes and reads go to multiple nodes simultaneously. Quorums determine validity.

Write quorum:  w nodes must acknowledge
Read quorum:   r nodes must respond
Guarantee:     w + r > n  (at least one node has the latest value)

Used by: Cassandra, DynamoDB, Riak

---

Partitioning (Sharding)

Split a large dataset across multiple nodes so each node holds a subset.

Key Range Partitioning

Assign contiguous ranges of keys to partitions.

Partition 1: keys A–F
Partition 2: keys G–M
Partition 3: keys N–Z

Supports efficient range scans

Risk: hot spots if keys are not uniformly distributed (e.g., timestamps)

Hash Partitioning

Apply a hash function to the key, assign to partitions by hash range.

partition = hash(key) % num_partitions

Distributes load evenly

Range scans become expensive (data is scattered)

Used by: Cassandra (consistent hashing), MongoDB

Secondary Indexes on Partitioned Data

Local index (document-partitioned) — each partition maintains its own index; reads must query all partitions (scatter/gather)

Global index (term-partitioned) — index itself is partitioned; reads hit one partition, writes update multiple

---

Transactions

A transaction groups multiple reads and writes into one logical unit that either fully commits or fully rolls back.

ACID

Property	Meaning
Atomicity	All or nothing — partial writes never visible
Consistency	Invariants hold before and after (application-defined)
Isolation	Concurrent transactions don't interfere with each other
Durability	Committed data survives crashes

Isolation Levels & Anomalies

Isolation Level	Dirty Read	Non-repeatable Read	Phantom Read
Read Uncommitted	Possible	Possible	Possible
Read Committed	Prevented	Possible	Possible
Repeatable Read	Prevented	Prevented	Possible
Serializable	Prevented	Prevented	Prevented

Snapshot Isolation (used by PostgreSQL, MySQL InnoDB): each transaction reads from a consistent snapshot at its start time. Implemented via MVCC (Multi-Version Concurrency Control).

Serializable Snapshot Isolation (SSI): detects read-write conflicts at commit time and aborts transactions that would create anomalies. Optimistic — no locks held during transaction.

---

Distributed Systems Problems

Unreliable Networks

Messages can be lost, delayed, reordered, or duplicated. Timeouts do not tell you whether a request succeeded — only that you didn't get a response in time.

Request sent → network packet lost → timeout
Request sent → response lost      → timeout (but write happened!)

Always design for idempotency: retrying a request produces the same result as running it once.

Unreliable Clocks

Wall-clock time (time of day) can jump forward or backward (NTP sync, leap seconds)

Monotonic clocks only measure elapsed time; cannot compare across machines

Consequence: you cannot use timestamps to determine the order of events across nodes. Use logical clocks (Lamport timestamps, vector clocks) instead.

Process Pauses

A process can be paused at any point (GC pause, OS scheduling, VM migration) and resume without knowing time has passed. A node that holds a lease cannot assume the lease is still valid after a pause.

---

Consistency & Consensus

Linearizability

The strongest consistency guarantee: the system behaves as if there is a single copy of the data, and all operations are atomic.

Without linearizability:
  Client A writes x=1
  Client B reads x=0   ← stale replica
  Client C reads x=1   ← different answer at the same moment

With linearizability:
  Once write is confirmed, all subsequent reads see x=1

Cost: requires coordination → high latency, availability loss under network partition (CAP theorem).

CAP Theorem

In the presence of a network Partition, a system must choose between:

Consistency (linearizability) — refuse to respond rather than return stale data

Availability — respond with possibly stale data

Most real systems choose CP (ZooKeeper, etcd) or AP (Cassandra, DynamoDB) depending on the use case.

Consensus Algorithms

Consensus: getting multiple nodes to agree on a single value, even when some nodes fail.

Used for: leader election, distributed locks, atomic broadcast.

Raft (simpler to understand than Paxos):

1. Leader election — one node elected per term
2. Log replication — leader appends entry, replicates to followers
3. Commit — once majority acknowledges, entry is committed
4. All committed entries are durable even if leader crashes

Implementations: etcd, CockroachDB, TiKV

---

Batch Processing

Process a large, bounded dataset all at once. Outputs are derived datasets (indexes, aggregations, ML features).

MapReduce

Two-phase computation:

Input files
  ↓ Map (emit key-value pairs per record)
Shuffle (group all values for same key)
  ↓ Reduce (aggregate per key)
Output files

# Map: count word occurrences
def map(record):
    for word in record.split():
        emit(word, 1)

# Reduce: sum counts per word
def reduce(word, counts):
    emit(word, sum(counts))

Modern dataflow engines (Spark, Flink, Beam) replace MapReduce: keep data in memory, support arbitrary DAGs (not just map → reduce), reuse intermediate results.

---

Stream Processing

Process a continuous, unbounded stream of events as they arrive.

Event Streams vs Batch

Property	Batch	Stream
Input	Bounded (files)	Unbounded (events)
Latency	Minutes–hours	Milliseconds–seconds
Trigger	Scheduled run	Event arrival

Key Concepts

Message broker (Kafka, Pulsar): durable log of events. Consumers can rewind and replay.

Producer → [Topic: order.created] → Consumer A (billing)
                                  → Consumer B (inventory)
                                  → Consumer C (analytics)

Stream-stream join: correlate events from two streams within a time window.

ad_impressions INNER JOIN ad_clicks
  ON impression.ad_id = click.ad_id
  AND click.time BETWEEN impression.time AND impression.time + 1h

Windowing: group events by time.

Tumbling window — fixed, non-overlapping (e.g., each minute)

Sliding window — fixed size, overlapping (e.g., last 5 minutes, updated every minute)

Session window — grouped by inactivity gap (e.g., user session ends after 30 min idle)

Exactly-Once Processing

Delivering each event exactly once (not zero, not two) requires:

Idempotent consumers — reprocessing the same event has no additional effect

Transactional writes — write processing result and commit offset atomically

---

Lambda & Kappa Architecture

Lambda Architecture

Combines batch and stream layers to serve both accurate historical results and low-latency recent results.

Input → Batch layer  (recomputes full history)    → Serving layer
      → Speed layer  (processes recent events)    → Serving layer
                                                         ↓
                                                     Query merges both

Drawback: maintaining two code paths for the same logic.

Kappa Architecture

Eliminate the batch layer — use a replayable stream (Kafka) as the source of truth.

Input → Stream layer only (fast path)
        Reprocess by replaying from Kafka offset when needed

Simpler, but requires a replayable event log and idempotent processing.

---

Key Takeaways

Topic	Key Insight
Data models	Match your model to your access patterns — not to familiarity
Storage engines	LSM-trees optimize writes; B-trees optimize reads
Replication	Lag is unavoidable — design for eventual consistency or pay for strong
Partitioning	Hash for distribution, range for scans; both can create hot spots
Transactions	Isolation levels are trade-offs between correctness and performance
Distributed clocks	Never use wall-clock time to order events across machines
Consistency	Linearizability is expensive; choose the weakest guarantee you can tolerate
Consensus	Leader election and distributed locks require a real consensus algorithm
Batch processing	Designed for throughput; reprocessing is the built-in error recovery
Stream processing	Designed for latency; exactly-once requires idempotency + transactional writes

Resources

Designing Data-Intensive Applications — Martin Kleppmann

The Log: What every software engineer should know — Jay Kreps

Raft Consensus Algorithm

Designing Data-Intensive Applications

On This Page

Designing Data-Intensive Applications

Data Models & Query Languages

Relational Model

Document Model

Graph Model

Storage & Retrieval

Log-Structured Storage (LSM-Tree + SSTables)

B-Tree Index

OLTP vs OLAP

Encoding & Schema Evolution

Formats Compared

Avro Schema Evolution Rules

Replication

Leader–Follower (Single-Leader)

Multi-Leader Replication

Leaderless Replication (Dynamo-style)

Partitioning (Sharding)

Key Range Partitioning

Hash Partitioning

Secondary Indexes on Partitioned Data

Transactions

ACID

Isolation Levels & Anomalies

Distributed Systems Problems

Unreliable Networks

Unreliable Clocks

Process Pauses

Consistency & Consensus

Linearizability

CAP Theorem

Consensus Algorithms

Batch Processing

MapReduce

Stream Processing

Event Streams vs Batch

Key Concepts

Exactly-Once Processing

Lambda & Kappa Architecture

Lambda Architecture

Kappa Architecture

Key Takeaways

Resources

Topics

Found This Helpful?