Home / Notebooks / System Design
System Design
intermediate

System Design Interview – An Insider's Guide

Frameworks, patterns, and classic problems from Alex Xu's system design interview guide

April 28, 2026
Updated regularly

System Design Interview – An Insider's Guide

Alex Xu's book is a practical playbook for system design interviews. Its real value is a repeatable framework for decomposing large, open-ended problems, backed by concrete examples of real-world systems.

---

The Interview Framework

Four steps to structure any system design interview:

Step 1 — Understand the problem (3–10 min)

Ask before designing. Never assume.

  • Who are the users? What do they do?
  • What scale? (DAU, QPS, storage)
  • What features are in scope?
  • What are the consistency, availability, latency requirements?
  • Step 2 — High-level design (10–15 min)

    Sketch the major components and data flow. Get buy-in before diving deep.

    Client → Load Balancer → API Servers → Cache → Database
                                         → Message Queue → Workers
                                         → CDN (for static assets)
    

    Step 3 — Deep dive (10–25 min)

    Pick 2–3 components that matter most for this problem and go deep:

  • Data model and schema
  • Critical algorithms or data structures
  • Bottlenecks and how to resolve them
  • Step 4 — Wrap up (3–5 min)

  • Identify bottlenecks and future scaling concerns
  • Summarize design decisions and trade-offs made
  • ---

    Back-of-Envelope Estimation

    Quick math to validate scale assumptions before designing.

    Common Numbers to Memorize

    ResourceLatency
    L1 cache reference0.5 ns
    Main memory reference100 ns
    SSD random read150 µs
    HDD seek10 ms
    Round-trip within same datacenter0.5 ms
    Round-trip US to Europe150 ms
    UnitValue
    1 million req/day~12 req/sec
    1 billion req/day~11,500 req/sec
    1 byte1 B
    1 KB10³ B
    1 MB10⁶ B
    1 GB10⁹ B
    1 TB10¹² B

    Example: Twitter-like System

    300M MAU, 50% daily active → 150M DAU
    Each user tweets 2x/day → 300M tweets/day
    Tweet size: 140 chars ≈ 280 bytes
    
    Write QPS = 300M / 86400 ≈ 3500 tweets/sec
    Peak QPS  = 3500 × 2     = 7000 tweets/sec
    
    Storage/day = 300M × 280B = 84 GB/day
    Storage/5yr = 84 × 365 × 5 ≈ 153 TB
    

    ---

    Building Blocks

    Load Balancer

    Distributes traffic across multiple servers. Prevents any single server from becoming a bottleneck.

    Client
      ↓
    Load Balancer (round-robin / least-connections / IP hash)
      ↓         ↓         ↓
    Server 1  Server 2  Server 3
    
  • Round-robin — even distribution, simple
  • Least connections — routes to server with fewest active connections
  • IP hash — same client always hits same server (useful for session stickiness)
  • CDN (Content Delivery Network)

    Caches static content (images, JS, CSS, video) at edge nodes geographically close to users.

    User (Jakarta) → CDN edge (Singapore) → Cache hit → served locally
                                          → Cache miss → Origin server → CDN → User
    

    Use CDN for: static assets, video content, anything that doesn't change per user.

    Caching

    Reduces database load. Always ask: what's the cache invalidation strategy?

    StrategyHow it worksBest for
    Cache-asideApp checks cache first, loads from DB on miss, writes to cacheRead-heavy workloads
    Write-throughWrite to cache and DB simultaneouslyData that must be consistent
    Write-backWrite to cache only, flush to DB asynchronouslyWrite-heavy, can tolerate brief data loss
    Read-throughCache sits in front of DB, handles misses automaticallySimpler app code
    Cache-aside:
      Read:  Cache hit → return
             Cache miss → DB → store in cache → return
      Write: Update DB → invalidate cache entry
    

    Eviction policies: LRU (Least Recently Used), LFU (Least Frequently Used), TTL-based.

    Database Replication

    Primary (writes) → Replica 1 (reads)
                    → Replica 2 (reads)
                    → Replica 3 (reads)
    
  • Improves read throughput and fault tolerance
  • Introduces replication lag — reads from replicas may be stale
  • Database Sharding

    Split data horizontally across multiple database instances.

    User IDs 0–9M   → Shard 1
    User IDs 10–19M → Shard 2
    User IDs 20–29M → Shard 3
    

    Challenges:

  • Resharding — moving data when a shard is full or imbalanced
  • Celebrity (hotspot) problem — one shard overloaded by popular data
  • Cross-shard joins — must be done in application layer
  • ---

    Consistent Hashing

    Solves the resharding problem when adding or removing nodes.

    Virtual ring: 0 ────────────────────── 360°
    
    Nodes placed at hash(nodeId) positions on the ring
    Keys assigned to the next clockwise node
    
    Adding a node: only keys between previous node and new node move
    Removing a node: only that node's keys move to its successor
    
  • Minimizes key redistribution when scaling
  • Virtual nodes (vnodes) — each physical node gets multiple positions on the ring to balance load
  • Used by: Cassandra, DynamoDB, Amazon ElastiCache

    ---

    Rate Limiter

    Protects services from being overwhelmed by too many requests.

    Algorithms

    Token Bucket — a bucket holds tokens, each request consumes one, tokens refill at a fixed rate.

    Bucket capacity: 10 tokens
    Refill rate: 2 tokens/sec
    
    Burst of 10 requests → consumed immediately
    Next request → must wait for refill
    

    Allows bursting. Used by: AWS, Stripe.

    Leaky Bucket — requests enter a queue, processed at a fixed output rate. Queue drops excess.

    Incoming: burst of 100 req
    Queue: holds 10
    Output: 2 req/sec steady
    
    → 90 requests dropped
    

    Smooths output. No bursting.

    Fixed Window Counter — count requests per fixed time window (e.g., per minute).

    Problem: burst at window boundary counts across two windows.

    Sliding Window Log — store timestamp of each request; count requests in the last N seconds.

    Most accurate, but memory-heavy.

    Sliding Window Counter — hybrid: fixed window counter + weighted count from previous window.

    Rate = current_window_count + prev_window_count × overlap_ratio
    

    Where to Store Counters

    Redis with atomic increment (INCR) + TTL is the standard solution for distributed rate limiting.

    ---

    Unique ID Generator

    Requirements

  • Globally unique
  • Sortable by time (optional but often required)
  • High throughput
  • Approaches

    UUID — 128-bit random identifier.

    550e8400-e29b-41d4-a716-446655440000
    

    Simple, no coordination. Not sortable, not compact.

    Database auto-increment — does not work across multiple databases.

    Snowflake ID (Twitter) — 64-bit integer composed of:

    | 1 bit (sign) | 41 bits (timestamp ms) | 10 bits (machine ID) | 12 bits (sequence) |
    
  • Sortable by time
  • ~4096 IDs/ms per machine
  • Requires synchronized clocks (NTP)
  • Used by: Twitter, Discord, Instagram

    ---

    URL Shortener

    Core flow:

    POST /shorten  { url: "https://long-url.com/..." }
      → generate short code
      → store { short_code → long_url } in DB
      → return "https://short.ly/abc123"
    
    GET /abc123
      → lookup short_code in cache
      → 301 (permanent) or 302 (temporary) redirect to long_url
    

    Short code generation:

    Base-62 encoding of auto-incremented ID
    Characters: [0-9 A-Z a-z] = 62 symbols
    
    ID = 11157310 → base62 → "B3D3"
    7 chars → 62^7 = 3.5 trillion URLs
    
  • Use 301 for SEO (browser caches redirect, reduces server load)
  • Use 302 for analytics (every click hits server, can track)
  • Schema:

    CREATE TABLE urls (
      id        BIGINT PRIMARY KEY AUTO_INCREMENT,
      short_key VARCHAR(8) UNIQUE NOT NULL,
      long_url  TEXT NOT NULL,
      created_at TIMESTAMP DEFAULT NOW()
    );
    

    ---

    News Feed (Timeline)

    Two core models for delivering posts to followers.

    Fanout on Write (Push)

    When a user posts, immediately push to all followers' feed caches.

    User A posts → fetch A's followers → write to each follower's feed cache
    
  • Reads are fast (pre-computed feed)
  • Writes are expensive for users with millions of followers (celebrities)
  • Fanout on Read (Pull)

    When a user opens their feed, fetch recent posts from everyone they follow.

    User B opens feed → fetch B's followees → merge their recent posts → return
    
  • Writes are cheap
  • Reads are slow and heavy
  • Hybrid (Used by Twitter, Instagram)

  • Regular users: fanout on write
  • Celebrity users (>N followers): fanout on read
  • Feed = precomputed_feed_cache + real-time_celebrity_posts (fetched on read)
    

    ---

    Chat System

    Transport

    Use WebSocket for full-duplex, low-latency messaging.

    Client ←→ WebSocket connection ←→ Chat Server
    

    HTTP long-polling is an alternative but wastes connections.

    Message Storage

  • Recent messages: in-memory cache (Redis) for fast retrieval
  • History: append-only storage (Cassandra or HBase) — high write throughput, time-range scans
  • Message ID requirements:

  • Unique within a chat
  • Sortable by time
  • Use Snowflake IDs or local sequence numbers per chat room.

    Presence (Online Status)

  • On connect: set user:{id}:status = online in Redis with a TTL
  • Heartbeat every 30s resets TTL
  • On disconnect / TTL expiry: status becomes offline
  • Fanout presence updates to friends via pub/sub (Redis Pub/Sub)
  • Group Chat Delivery

    Message sent to group → message stored → one copy per group, not per member
    Member reads → fetch messages after their last_read_message_id
    

    Avoid fanout to every member on send — instead track read position per member.

    ---

    Search Autocomplete

    Suggest completions as the user types.

    Trie Data Structure

    Input: "be"
    
            root
            ├── b
            │   └── e (count: 40)
            │       ├── a (count: 35)  → "bea..."
            │       └── e (count: 20)  → "bee..."
    
  • Each node stores prefix counts
  • Traverse to current prefix node, collect top-k children by count
  • Scaling

    Trie fits in memory for a single server. At scale:

  • Precompute top-k completions per prefix and cache them
  • Update asynchronously (rebuild trie from query logs daily/hourly, not on every search)
  • Shard trie by first character(s) of prefix
  • Shard 1: a–f prefixes
    Shard 2: g–m prefixes
    Shard 3: n–z prefixes
    

    ---

    Video Streaming (YouTube-like)

    Upload Flow

    Client → Chunked upload → Raw storage (S3)
                            → Transcoding workers (multiple resolutions: 360p, 720p, 1080p, 4K)
                            → Transcoded storage (S3)
                            → CDN distribution
                            → Metadata DB (title, description, author)
    

    Streaming Flow

    Client → CDN edge node → cache hit → video chunks served
                           → cache miss → origin S3 → CDN → client
    
    Protocol: DASH or HLS (adaptive bitrate — adjusts quality based on bandwidth)
    

    Why chunked upload?

  • Resumable on failure
  • Can parallelize transcoding per chunk
  • ---

    Notification System

    Deliver push notifications, SMS, and email across multiple channels.

    Architecture

    Event sources (API servers, schedulers)
      ↓
    Notification Service
      ↓
    Message Queue (decouple and buffer)
      ↓
    Workers per channel:
      - iOS/Android → APNs / FCM
      - SMS        → Twilio, SNS
      - Email      → SendGrid, SES
    

    Key Design Decisions

  • At-least-once delivery — use idempotent operations; recipients may get duplicates during retries
  • Priority queues — urgent alerts (security, payment) get a dedicated high-priority queue
  • Rate limiting — cap notifications per user per day to avoid spam
  • Opt-out support — check user preference before sending each notification
  • ---

    Design Trade-off Cheat Sheet

    DecisionOption AOption BPick A when…
    Consistency vs availabilityStrong (CP)Eventually consistent (AP)Finance, inventory
    SQL vs NoSQLRelationalDocument / Wide-columnStructured + joins vs scale + flexibility
    Cache invalidationTTLEvent-driven invalidationAcceptable staleness vs strict freshness
    FanoutOn write (push)On read (pull)Read-heavy vs write-heavy, few vs many followers
    Sync vs asyncSynchronous APIMessage queueLow latency vs high throughput / resilience
    Monolith vs microservicesSingle serviceIndependent servicesSmall team / early stage vs scale / team autonomy
    ---

    Resources

  • System Design Interview Vol. 1 — Alex Xu
  • System Design Interview Vol. 2 — Alex Xu & Sahn Lam
  • ByteByteGo Newsletter
  • Topics

    System DesignArchitectureScalabilityInterview

    Found This Helpful?

    If you have questions or suggestions for improving these notes, I'd love to hear from you.