Home / Notebooks / System Design
System Design
advanced

Designing Data-Intensive Applications

Core concepts from Martin Kleppmann's book on building reliable, scalable, and maintainable data systems

April 28, 2026
Updated regularly

Designing Data-Intensive Applications

Martin Kleppmann's central thesis: data is the central challenge in modern systems — not compute. The book is a map for making sound decisions about storage, replication, consistency, and processing at scale.

Three properties every data system should strive for:

  • Reliability — works correctly even under faults (hardware, software, human error)
  • Scalability — handles growth in data, traffic, or complexity
  • Maintainability — can be understood and modified by people other than the original author
  • ---

    Data Models & Query Languages

    Every data model embeds assumptions about how data is used. Choosing the wrong one creates friction at every layer.

    Relational Model

    Data organized as tables of rows. Joins connect related data across tables.

    Best for: structured data with predictable relationships and complex queries.

    -- Normalized: each fact stored once
    SELECT u.name, o.total
    FROM users u
    JOIN orders o ON o.user_id = u.id
    WHERE u.region = 'SEA';
    

    Document Model

    Data stored as self-contained documents (JSON/BSON). Related data is nested or referenced.

    Best for: data with a natural tree structure, where reads mostly fetch one document at a time.

    {
      "user_id": "u123",
      "name": "Yudi",
      "orders": [
        { "order_id": "o1", "total": 250000 }
      ]
    }
    

    Schema-on-read (documents) vs schema-on-write (relational): documents are flexible but push validation to the application.

    Graph Model

    Vertices (entities) and edges (relationships). Optimized for traversing many hops between entities.

    Best for: social graphs, recommendation engines, fraud detection, knowledge graphs.

    (User)-[:FOLLOWS]->(User)-[:PURCHASED]->(Product)-[:SIMILAR_TO]->(Product)
    

    ---

    Storage & Retrieval

    Log-Structured Storage (LSM-Tree + SSTables)

    Writes always append to an in-memory structure (memtable), which is periodically flushed as an immutable sorted file (SSTable). Background compaction merges files.

    Write → memtable (in RAM)
               ↓ (when full)
            SSTable on disk
               ↓ (background)
            Compaction (merge + deduplicate)
    
  • Writes are fast (sequential I/O)
  • Reads may touch multiple SSTables (mitigated with Bloom filters)
  • Used by: LevelDB, RocksDB, Cassandra, HBase
  • B-Tree Index

    Breaks data into fixed-size pages (~4KB). Reads and writes traverse the tree from root to leaf.

    Root → Branch pages → Leaf pages (actual data)
    
  • Reads are predictable (O(log n))
  • Writes do in-place updates (random I/O)
  • Standard in: PostgreSQL, MySQL, SQLite
  • OLTP vs OLAP

    PropertyOLTPOLAP
    Read patternSmall rows, by keyLarge scans, aggregations
    Write patternFrequent, low-latencyBulk loads / ETL
    Dataset sizeGB–TBTB–PB
    Used byApplication backendsAnalysts, dashboards
    ExamplesPostgreSQL, MySQLRedshift, BigQuery, Snowflake
    Column-oriented storage excels at OLAP: stores each column separately so queries only read the columns they need.

    ---

    Encoding & Schema Evolution

    Systems change over time. Old and new code must be able to coexist and exchange data — this is backward and forward compatibility.

    Formats Compared

    FormatSchema requiredHuman-readableVersioning
    JSON / XMLNoYesManual
    AvroYes (in file or registry)NoStrong
    Protocol BuffersYes (.proto)NoStrong
    ThriftYesNoStrong

    Avro Schema Evolution Rules

    // v1 schema
    { "type": "record", "name": "User",
      "fields": [{ "name": "id", "type": "string" }] }
    
    // v2 schema — added field with default (backward compatible)
    { "type": "record", "name": "User",
      "fields": [
        { "name": "id",    "type": "string" },
        { "name": "email", "type": "string", "default": "" }
      ]
    }
    

    Rules:

  • Adding a field with a default is safe (old readers ignore it, new readers use the default when reading old data)
  • Removing a field without a default breaks backward compatibility
  • Never change a field's type
  • ---

    Replication

    Replication keeps copies of data on multiple nodes for fault tolerance and read scalability.

    Leader–Follower (Single-Leader)

    All writes go to the leader. Followers replicate and serve reads.

    Client → Leader (write)
    Leader → Follower 1 (replicate)
    Leader → Follower 2 (replicate)
    
    Client → Follower 1 (read)
    

    Replication lag creates consistency problems:

  • Read-your-writes — a user reads data they just wrote, but the follower hasn't caught up yet
  • Monotonic reads — a user sees data disappear because two reads hit different followers at different lag levels
  • Consistent prefix reads — causally related writes appear out of order
  • Multi-Leader Replication

    Multiple nodes accept writes. Useful for multi-datacenter or offline-capable apps.

    Trade-off: write conflicts must be resolved. Common strategies:

  • Last Write Wins (LWW) — simplest, loses data
  • Application-level merge — correct, complex
  • CRDTs (Conflict-free Replicated Data Types) — automatic merging for specific data structures
  • Leaderless Replication (Dynamo-style)

    Writes and reads go to multiple nodes simultaneously. Quorums determine validity.

    Write quorum:  w nodes must acknowledge
    Read quorum:   r nodes must respond
    Guarantee:     w + r > n  (at least one node has the latest value)
    

    Used by: Cassandra, DynamoDB, Riak

    ---

    Partitioning (Sharding)

    Split a large dataset across multiple nodes so each node holds a subset.

    Key Range Partitioning

    Assign contiguous ranges of keys to partitions.

    Partition 1: keys A–F
    Partition 2: keys G–M
    Partition 3: keys N–Z
    
  • Supports efficient range scans
  • Risk: hot spots if keys are not uniformly distributed (e.g., timestamps)
  • Hash Partitioning

    Apply a hash function to the key, assign to partitions by hash range.

    partition = hash(key) % num_partitions
    
  • Distributes load evenly
  • Range scans become expensive (data is scattered)
  • Used by: Cassandra (consistent hashing), MongoDB
  • Secondary Indexes on Partitioned Data

  • Local index (document-partitioned) — each partition maintains its own index; reads must query all partitions (scatter/gather)
  • Global index (term-partitioned) — index itself is partitioned; reads hit one partition, writes update multiple
  • ---

    Transactions

    A transaction groups multiple reads and writes into one logical unit that either fully commits or fully rolls back.

    ACID

    PropertyMeaning
    AtomicityAll or nothing — partial writes never visible
    ConsistencyInvariants hold before and after (application-defined)
    IsolationConcurrent transactions don't interfere with each other
    DurabilityCommitted data survives crashes

    Isolation Levels & Anomalies

    Isolation LevelDirty ReadNon-repeatable ReadPhantom Read
    Read UncommittedPossiblePossiblePossible
    Read CommittedPreventedPossiblePossible
    Repeatable ReadPreventedPreventedPossible
    SerializablePreventedPreventedPrevented
    Snapshot Isolation (used by PostgreSQL, MySQL InnoDB): each transaction reads from a consistent snapshot at its start time. Implemented via MVCC (Multi-Version Concurrency Control).

    Serializable Snapshot Isolation (SSI): detects read-write conflicts at commit time and aborts transactions that would create anomalies. Optimistic — no locks held during transaction.

    ---

    Distributed Systems Problems

    Unreliable Networks

    Messages can be lost, delayed, reordered, or duplicated. Timeouts do not tell you whether a request succeeded — only that you didn't get a response in time.

    Request sent → network packet lost → timeout
    Request sent → response lost      → timeout (but write happened!)
    

    Always design for idempotency: retrying a request produces the same result as running it once.

    Unreliable Clocks

  • Wall-clock time (time of day) can jump forward or backward (NTP sync, leap seconds)
  • Monotonic clocks only measure elapsed time; cannot compare across machines
  • Consequence: you cannot use timestamps to determine the order of events across nodes. Use logical clocks (Lamport timestamps, vector clocks) instead.

    Process Pauses

    A process can be paused at any point (GC pause, OS scheduling, VM migration) and resume without knowing time has passed. A node that holds a lease cannot assume the lease is still valid after a pause.

    ---

    Consistency & Consensus

    Linearizability

    The strongest consistency guarantee: the system behaves as if there is a single copy of the data, and all operations are atomic.

    Without linearizability:
      Client A writes x=1
      Client B reads x=0   ← stale replica
      Client C reads x=1   ← different answer at the same moment
    
    With linearizability:
      Once write is confirmed, all subsequent reads see x=1
    

    Cost: requires coordination → high latency, availability loss under network partition (CAP theorem).

    CAP Theorem

    In the presence of a network Partition, a system must choose between:

  • Consistency (linearizability) — refuse to respond rather than return stale data
  • Availability — respond with possibly stale data
  • Most real systems choose CP (ZooKeeper, etcd) or AP (Cassandra, DynamoDB) depending on the use case.

    Consensus Algorithms

    Consensus: getting multiple nodes to agree on a single value, even when some nodes fail.

    Used for: leader election, distributed locks, atomic broadcast.

    Raft (simpler to understand than Paxos):

    1. Leader election — one node elected per term
    2. Log replication — leader appends entry, replicates to followers
    3. Commit — once majority acknowledges, entry is committed
    4. All committed entries are durable even if leader crashes
    

    Implementations: etcd, CockroachDB, TiKV

    ---

    Batch Processing

    Process a large, bounded dataset all at once. Outputs are derived datasets (indexes, aggregations, ML features).

    MapReduce

    Two-phase computation:

    Input files
      ↓ Map (emit key-value pairs per record)
    Shuffle (group all values for same key)
      ↓ Reduce (aggregate per key)
    Output files
    
    # Map: count word occurrences
    def map(record):
        for word in record.split():
            emit(word, 1)
    
    # Reduce: sum counts per word
    def reduce(word, counts):
        emit(word, sum(counts))
    

    Modern dataflow engines (Spark, Flink, Beam) replace MapReduce: keep data in memory, support arbitrary DAGs (not just map → reduce), reuse intermediate results.

    ---

    Stream Processing

    Process a continuous, unbounded stream of events as they arrive.

    Event Streams vs Batch

    PropertyBatchStream
    InputBounded (files)Unbounded (events)
    LatencyMinutes–hoursMilliseconds–seconds
    TriggerScheduled runEvent arrival

    Key Concepts

    Message broker (Kafka, Pulsar): durable log of events. Consumers can rewind and replay.

    Producer → [Topic: order.created] → Consumer A (billing)
                                      → Consumer B (inventory)
                                      → Consumer C (analytics)
    

    Stream-stream join: correlate events from two streams within a time window.

    ad_impressions INNER JOIN ad_clicks
      ON impression.ad_id = click.ad_id
      AND click.time BETWEEN impression.time AND impression.time + 1h
    

    Windowing: group events by time.

  • Tumbling window — fixed, non-overlapping (e.g., each minute)
  • Sliding window — fixed size, overlapping (e.g., last 5 minutes, updated every minute)
  • Session window — grouped by inactivity gap (e.g., user session ends after 30 min idle)
  • Exactly-Once Processing

    Delivering each event exactly once (not zero, not two) requires:

  • Idempotent consumers — reprocessing the same event has no additional effect
  • Transactional writes — write processing result and commit offset atomically
  • ---

    Lambda & Kappa Architecture

    Lambda Architecture

    Combines batch and stream layers to serve both accurate historical results and low-latency recent results.

    Input → Batch layer  (recomputes full history)    → Serving layer
          → Speed layer  (processes recent events)    → Serving layer
                                                             ↓
                                                         Query merges both
    

    Drawback: maintaining two code paths for the same logic.

    Kappa Architecture

    Eliminate the batch layer — use a replayable stream (Kafka) as the source of truth.

    Input → Stream layer only (fast path)
            Reprocess by replaying from Kafka offset when needed
    

    Simpler, but requires a replayable event log and idempotent processing.

    ---

    Key Takeaways

    TopicKey Insight
    Data modelsMatch your model to your access patterns — not to familiarity
    Storage enginesLSM-trees optimize writes; B-trees optimize reads
    ReplicationLag is unavoidable — design for eventual consistency or pay for strong
    PartitioningHash for distribution, range for scans; both can create hot spots
    TransactionsIsolation levels are trade-offs between correctness and performance
    Distributed clocksNever use wall-clock time to order events across machines
    ConsistencyLinearizability is expensive; choose the weakest guarantee you can tolerate
    ConsensusLeader election and distributed locks require a real consensus algorithm
    Batch processingDesigned for throughput; reprocessing is the built-in error recovery
    Stream processingDesigned for latency; exactly-once requires idempotency + transactional writes

    Resources

  • Designing Data-Intensive Applications — Martin Kleppmann
  • The Log: What every software engineer should know — Jay Kreps
  • Raft Consensus Algorithm
  • Topics

    System DesignDistributed SystemsDatabasesArchitectureScalability

    Found This Helpful?

    If you have questions or suggestions for improving these notes, I'd love to hear from you.