System Design
advanced

Building Microservices

Key concepts from Sam Newman's definitive guide to designing, building, and operating microservice systems

April 28, 2026
Updated regularly

Sam Newman defines microservices as independently deployable services modeled around a business domain. The book is not about making services small — it is about making change safe, fast, and contained.

Five key properties of microservices:

  • Independently deployable — deploy one service without touching others
  • Modeled around a business domain — bounded by what the business cares about, not technical layers
  • Own their own state — no shared databases between services
  • Technology agnostic — each service chooses the right tool for its job
  • Small enough to be owned by a small team — typically a team that can be fed by two pizzas
    ---

    The Monolith Is Not the Enemy

    Newman distinguishes three monolith types:

    Single-process monolith — all code runs in one process. The most common starting point.

    Modular monolith — one process, but with explicit internal module boundaries. High cohesion inside modules, low coupling between them. Often the right architecture before microservices.

    Distributed monolith — multiple services but tightly coupled through shared databases or synchronous call chains. Gets the worst of both worlds: deployment complexity without autonomy.

    > Start with a modular monolith. Decompose into microservices only when the cost of coordination inside the monolith exceeds the cost of distribution.

    ---

    Modeling Services Around Domains

    Bounded Contexts (from DDD)

    Each microservice should own one bounded context — a clear boundary within which a domain model is internally consistent.

    Order Context:       "Customer" means billing address, payment method
    Marketing Context:   "Customer" means email preferences, campaign history
    
    → Two services, two models, two meanings of "Customer"
    → They communicate through well-defined contracts, not shared tables
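
To make the two meanings of "Customer" concrete, here is a minimal sketch of each context keeping its own model and exchanging only a small contract. The class names (`CustomerRegistered`, `OrderCustomer`, `MarketingCustomer`) and their fields are hypothetical illustrations, not taken from the book.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CustomerRegistered:
    """Hypothetical contract shared between contexts: no internal fields leak out."""
    customer_id: str
    email: str


@dataclass
class OrderCustomer:
    """'Customer' as the Order context sees it."""
    customer_id: str
    billing_address: str = ""
    payment_method: str = ""


@dataclass
class MarketingCustomer:
    """'Customer' as the Marketing context sees it."""
    customer_id: str
    email: str
    campaign_history: list[str] = field(default_factory=list)


def to_marketing_customer(event: CustomerRegistered) -> MarketingCustomer:
    # Each context builds its own model from the contract, not from a shared table.
    return MarketingCustomer(customer_id=event.customer_id, email=event.email)
```

The only thing the two contexts agree on is the contract; either side can change its internal model without the other noticing.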
    

    Finding Service Boundaries

    Use these signals:

  • High cohesion — things that change together should stay together
  • Loose coupling — services should need to know as little as possible about each other
  • Business capability — can a non-technical person name what this service does?
  • Team ownership — one team can fully understand and own it

    Volatility matters: separate stable code from frequently changing code into different services.

    ---

    Splitting the Monolith

    The Strangler Fig Pattern

    Incrementally migrate from a monolith by routing traffic to new services, one capability at a time.

    Phase 1:  All traffic → Monolith
    
    Phase 2:  Traffic → Proxy
                      ├── /orders → OrderService (new)
                      └── /*      → Monolith
    
    Phase 3:  Traffic → Proxy
                      ├── /orders   → OrderService
                      ├── /users    → UserService (new)
                      └── /*        → Monolith (shrinking)
    
    Phase N:  Monolith gone
    

    Never do a big-bang rewrite. Extract one capability at a time, validate, then continue.
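
The routing decision the proxy makes in each phase can be sketched as a prefix lookup that falls back to the monolith. This is only an illustration: the upstream URLs and the `resolve_upstream` function are assumptions, and a real deployment would normally express the same rules in the proxy's own configuration.

```python
# Hypothetical route table: one entry per capability already extracted.
MIGRATED_ROUTES = {
    "/orders": "http://order-service.internal",
    "/users": "http://user-service.internal",
}
MONOLITH = "http://monolith.internal"


def resolve_upstream(path: str) -> str:
    """Return the backend that should handle `path`; default to the monolith."""
    for prefix, upstream in MIGRATED_ROUTES.items():
        # Match the prefix itself or a sub-path, so "/ordersX" is not captured.
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    return MONOLITH
```

Migrating the next capability then means adding one entry to the table, and rolling back means removing it.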

    Database Decomposition

    Shared databases are the most dangerous form of coupling — they bind services to each other's schema.

    Steps to split:

    1. Identify which code owns which tables
    2. Add a seam: access shared tables only through an API owned by one service
    3. Separate the schemas in the same database (different schemas, same server)
    4. Move each schema to its own database
    

    Patterns for managing data that multiple services need:

  • Shared database (temporary) — acceptable as a migration step, not a target state
  • Data replication — service B gets its own copy, synced via events from service A
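  • API composition — service B asks service A at runtime (introduces temporal coupling: B needs A to be up)
  • CQRS — reads are served from a model kept separate from the write model; service A publishes events and service B builds its own read model from them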
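
As a rough sketch of the event-fed options above (data replication and a read model built from events), service B can keep a private copy of just the fields it needs. The event shape (`CustomerEmailChanged` with a `version` counter) is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CustomerEmailChanged:
    """Hypothetical event published by service A, the owner of customer data."""
    customer_id: str
    email: str
    version: int  # assumed to increase with every change to this customer


class CustomerEmailReadModel:
    """Service B's private copy, holding only the fields B needs."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[int, str]] = {}

    def apply(self, event: CustomerEmailChanged) -> None:
        current = self._rows.get(event.customer_id)
        # Ignore duplicates and out-of-order deliveries.
        if current is None or event.version > current[0]:
            self._rows[event.customer_id] = (event.version, event.email)

    def email_for(self, customer_id: str) -> Optional[str]:
        row = self._rows.get(customer_id)
        return row[1] if row else None
```

The copy is eventually consistent: B answers from its own data even when A is down, at the cost of possibly serving a slightly stale value.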
    ---

    Communication Styles

    The most consequential decision in microservice design: how do services talk to each other?

    Synchronous vs Asynchronous

                     Synchronous                       Asynchronous
    Caller waits?    Yes                               No
    Coupling         Temporal (both must be up)        Loose
    Complexity       Low                               High
    Failure model    Caller fails if callee is down    Caller continues; message queued

    Request–Response (Synchronous)

    REST over HTTP — simple, ubiquitous, human-readable, good for public APIs.

    POST /orders
      → 201 Created { "order_id": "o123" }
    
    GET  /orders/o123
      → 200 OK { "status": "processing" }
    

    gRPC — binary protocol (Protocol Buffers), strongly typed, fast. Best for internal service-to-service calls.

    service OrderService {
      rpc CreateOrder (CreateOrderRequest) returns (OrderResponse);
      rpc GetOrder    (GetOrderRequest)    returns (OrderResponse);
    }
    

    Event-Driven (Asynchronous)

    Services publish events to a broker; consumers subscribe independently.

    OrderService → publishes "OrderPlaced" → Message Broker (Kafka)
                                                  ├── InventoryService (reserves stock)
                                                  ├── BillingService (charges card)
                                                  └── NotificationService (sends email)
    

    Events decouple producers from consumers — OrderService does not know who reacts to OrderPlaced.
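
The decoupling can be illustrated with a toy in-memory broker. This is a sketch only: `InMemoryBroker` is a hypothetical stand-in, and unlike Kafka it delivers synchronously and keeps nothing durable.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a real broker; delivery here is synchronous and in-process."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The publisher never names its consumers; it only names the event.
        for handler in self._subscribers[event_type]:
            handler(payload)


broker = InMemoryBroker()
reserved, charged = [], []
broker.subscribe("OrderPlaced", lambda e: reserved.append(e["order_id"]))  # inventory
broker.subscribe("OrderPlaced", lambda e: charged.append(e["order_id"]))   # billing
broker.publish("OrderPlaced", {"order_id": "o123"})
```

Adding a third consumer is one more `subscribe` call; the publishing code does not change.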

    Event types:

  • Domain event — something that happened ("OrderPlaced", "PaymentFailed")
  • Notification event — thin signal, receiver fetches details via API
  • Event-carried state transfer — event includes all data needed; no callback required
    ---

    Sagas: Managing Distributed Transactions

    Once each service owns its own data, a single database transaction can no longer span service boundaries. Sagas coordinate multi-step workflows as a sequence of local transactions, with compensating actions that undo earlier steps on failure.

    Choreography (Event-Based)

    Each service reacts to events and emits its own events. No central coordinator.

    OrderService    → publishes OrderPlaced
    InventoryService → reserves stock → publishes StockReserved
    PaymentService  → charges card   → publishes PaymentTaken
    OrderService    → publishes OrderConfirmed
    
    On failure:
    PaymentService  → publishes PaymentFailed
    InventoryService → releases stock (compensating action)
    OrderService    → publishes OrderCancelled
    
  • Simple, decoupled
  • Hard to see the full workflow in one place (implicit coordination)

    Orchestration (Explicit Coordinator)

    A saga orchestrator tells each service what to do, handles failures explicitly.

    OrderSaga (orchestrator):
      1. Call InventoryService.Reserve()
      2. Call PaymentService.Charge()
      3. If PaymentService fails → Call InventoryService.Release()
      4. Emit OrderConfirmed or OrderFailed
    
  • Full workflow visible in one place
  • Easier to reason about, monitor, and debug
  • Risk of orchestrator becoming a God object

    > Newman recommends starting with choreography for simple workflows and using orchestration when the saga logic grows complex.
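
A minimal sketch of the orchestrated variant, assuming each step is given as a (name, action, compensation) triple. `run_saga` and the step names are hypothetical; a production orchestrator would also persist its progress so it can resume after a crash.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> str:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return f"OrderFailed at {name}"
        completed.append(step)
    return "OrderConfirmed"


log: list[str] = []


def charge_card() -> None:
    raise RuntimeError("card declined")  # simulated payment failure


result = run_saga([
    ("reserve", lambda: log.append("reserved"), lambda: log.append("released")),
    ("charge", charge_card, lambda: log.append("refunded")),
])
```

Only steps that completed are compensated, so the failed charge triggers a stock release but no refund.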

    ---

    Build and Deployment

    One Artifact Per Service

    Each service has its own repository (or module) and its own CI pipeline. A commit to OrderService should only trigger a build and deploy of OrderService.

    commit → CI: test + lint → artifact (Docker image) → push to registry → deploy
    

    Shared CI pipelines that build everything together defeat the purpose of independent deployability.

    Container and Kubernetes Basics

    Each microservice runs in its own container.

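    FROM python:3.12-slim
    WORKDIR /app
    # Install dependencies first so this layer is cached across code changes
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]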
    

    Kubernetes manages containers at scale:

    Deployment   → defines desired state (3 replicas of OrderService)
    Service      → stable network endpoint (load balances across replicas)
    Ingress      → routes external traffic to services
    ConfigMap    → externalized configuration
    Secret       → sensitive credentials
    

    GitOps

    Treat infrastructure as code in Git. The desired state of the cluster is declared in Git; an operator (ArgoCD, Flux) reconciles actual state to match.

    Developer merges PR → Git repo updated → ArgoCD detects diff → applies to cluster
    

    ---

    Testing Microservices

    Testing Pyramid

             ▲ Few
             │  End-to-End tests  (slow, brittle, expensive)
             │  Integration tests (medium speed, medium cost)
             │  Unit tests        (fast, cheap, many)
             ▼ Many
    

    For microservices, end-to-end tests across many services are extremely fragile. Invest in unit and integration tests per service. Use contract tests to verify integration points.

    Consumer-Driven Contract Testing

    Each consumer service defines what it expects from a producer. The producer runs these contracts as tests.

    OrderService (consumer) defines:
      "I expect GET /users/{id} to return { id, name, email }"
    
    UserService (producer) runs this contract in its CI pipeline:
      → confirms its API still satisfies OrderService's expectations
    

    Tools: Pact, Spring Cloud Contract

    This eliminates the need for a shared integration test environment for most scenarios.
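
The idea can be sketched without any tooling: the consumer publishes the fields it relies on, and the producer checks its real responses against them. This is not Pact's API; the contract format and `contract_violations` are hypothetical and exist only to show the shape of the check.

```python
# Hypothetical contract published by OrderService (the consumer).
ORDER_SERVICE_CONTRACT = {
    "request": {"method": "GET", "path": "/users/{id}"},
    "response_fields": {"id": str, "name": str, "email": str},
}


def contract_violations(contract: dict, response_body: dict) -> list[str]:
    """Return a list of problems; an empty list means the contract is satisfied."""
    problems = []
    for field_name, expected_type in contract["response_fields"].items():
        if field_name not in response_body:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(response_body[field_name], expected_type):
            problems.append(f"wrong type for field: {field_name}")
    return problems  # extra fields are allowed, so the producer can still evolve
```

In UserService's pipeline, a non-empty result would fail the build before an incompatible change reaches OrderService.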

    ---

    Observability

    In a distributed system, you cannot attach a debugger to one process and step through a request that crosses several services. Observability is how you understand what the system is doing.

    Three Pillars

    Logs — timestamped records of discrete events.

    Best practices:
    - Structured logs (JSON), not plain text
    - Consistent fields: timestamp, service, trace_id, level, message
    - Aggregate with a log platform (ELK, Loki + Grafana)
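
One way to get structured logs with those fields is a custom formatter on Python's standard `logging` module. The sketch below assumes the service name `order-service` and that callers pass `trace_id` through `extra`; both are illustrative choices.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),  # set via extra={...}
            "level": record.levelname,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="order-service"))
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order created", extra={"trace_id": "abc123"})
```

Because every service emits the same field names, the log platform can filter by `trace_id` across all of them.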
    

    Metrics — numeric measurements over time.

    Key metrics per service:
    - Request rate (requests/sec)
    - Error rate (errors/sec or %)
    - Latency (p50, p95, p99)
    - Saturation (CPU, memory, queue depth)
    
    USE method (for resources): Utilization, Saturation, Errors
    RED method (for services):  Rate, Errors, Duration
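
As a small worked example of the RED numbers, the sketch below computes rate, error ratio, and latency percentiles from raw samples for one time window. It uses the nearest-rank percentile definition; metrics backends usually estimate percentiles from histograms instead, so treat this as an illustration of the quantities, not of how a metrics system stores them.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(latencies_ms: list[float], errors: int, window_s: float) -> dict:
    """Rate, Errors, Duration for one service over one window (needs >= 1 request)."""
    total = len(latencies_ms)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

Tail percentiles matter because a mean hides the slow requests that users actually notice.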
    

    Distributed Tracing — follow a request across multiple services.

    User request → TraceID: abc123
      OrderService    [abc123, span: 1] 12ms
        → UserService [abc123, span: 2]  3ms
        → PaymentService [abc123, span: 3] 45ms  ← bottleneck visible
    

    Tools: OpenTelemetry (standard), Jaeger, Zipkin, Tempo

    Correlation IDs

    Propagate a unique ID through all calls in a request chain. Log it in every service. Allows reconstructing the full picture from logs when tracing is unavailable.

    import logging
    from uuid import uuid4
    logger = logging.getLogger(__name__)
    # Middleware: reuse the incoming trace ID or generate a new one
    trace_id = request.headers.get("X-Trace-Id") or str(uuid4())
    logger.info("handling request", extra={"trace_id": trace_id})
    

    ---

    Security

    Zero Trust

    Assume the network is compromised. Verify every request explicitly.

    Traditional:  trust everything inside the network perimeter
    Zero trust:   authenticate and authorize every service-to-service call
    

    mTLS (Mutual TLS)

    Both sides of a connection present certificates. Guarantees identity of caller and callee.

    OrderService ←→ PaymentService
      OS presents cert  → PS verifies
      PS presents cert  → OS verifies
    
    → Even if network is compromised, caller identity is guaranteed
    

    Service meshes (Istio, Linkerd) handle mTLS automatically — services don't need to implement it.

    JWT for User Identity

    Pass user identity between services via signed tokens.

    User authenticates → Identity Service issues JWT
    JWT included in all downstream calls
    Each service validates the JWT signature independently (no roundtrip)
    

    Do not re-validate user credentials at every service. Validate the token (signature and expiry), then trust its claims.
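
To show what "validates the signature independently" involves, here is a standard-library sketch of an HS256-signed token. It is an illustration under simplifying assumptions: the function names are hypothetical, a shared secret means every validating service could also mint tokens, and real systems would normally use a maintained JWT library with asymmetric keys so services hold only the public key.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(claims: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url(signature)}"


def validate_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check signature and expiry locally; raise ValueError if either fails."""
    try:
        header, body, signature = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(body))
    # A token without "exp" is treated as already expired.
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

No call to the Identity Service is needed at validation time, which is what removes the roundtrip.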

    ---

    Resiliency Patterns

    Distributed systems fail in partial ways. Design each service to degrade gracefully.

    Timeouts

    Always set timeouts on outbound calls. A call that never returns blocks a thread indefinitely.

    import requests
    response = requests.get(url, timeout=2.0)  # fail fast: 2 s limit on connecting and on each wait for data
    

    Retries with Exponential Backoff

    Retry transient failures, but space retries out to avoid hammering a degraded service.

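    import time
    class TransientError(Exception): ...      # placeholder: a retryable failure
    class MaxRetriesExceeded(Exception): ...

    def call_with_retries(call_service, attempts=3):
        for attempt in range(attempts):
            try:
                return call_service()
            except TransientError:
                if attempt < attempts - 1:  # no point sleeping after the last attempt
                    time.sleep(0.5 * 2 ** attempt)  # 0.5s, then 1s
        raise MaxRetriesExceeded()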
    

    Add jitter to prevent retry storms when many callers fail simultaneously.
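
One common way to add jitter, sometimes called "full jitter", is to pick each delay uniformly between zero and the exponential ceiling. The sketch below only computes the delays; `backoff_delays` and its default values are illustrative assumptions.

```python
import random
from typing import Optional


def backoff_delays(retries: int, base: float = 0.5, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Delay before each retry: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(retries)]
```

Because every caller draws a different delay, callers that failed at the same moment do not all retry at the same moment.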

    Circuit Breaker

    Stops calling a failing service, giving it time to recover.

    CLOSED  → calls pass through normally
             → failure rate exceeds threshold (e.g. 50% in 10 sec)
    OPEN    → calls fail immediately (no attempt made)
             → after timeout, one test call allowed
    HALF-OPEN → if test succeeds → CLOSED; if fails → OPEN again
    

    Libraries: resilience4j (Java), pybreaker (Python), opossum (Node.js)
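
The state machine can be sketched in a few lines. This simplified version trips on a count of consecutive failures instead of a failure rate over a time window, is not thread-safe, and is not the API of any of the libraries above; `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open")  # OPEN: fail fast
            # HALF-OPEN: fall through and allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

The injectable `clock` is there so the timeout behaviour can be tested without sleeping.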

    Bulkhead

    Isolate failures so they don't consume all shared resources.

    Without bulkhead:
      PaymentService slow → fills thread pool → all requests starve
    
    With bulkhead:
      PaymentService has its own thread pool (10 threads)
      Other services have separate pools
      PaymentService saturation only blocks payment calls
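
A lighter-weight variant of the same idea uses a semaphore instead of a dedicated thread pool: it caps how many calls to one dependency may be in flight and rejects the rest. `Bulkhead` and `BulkheadFullError` are hypothetical names for this sketch.

```python
import threading
from typing import Callable


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared workers."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], object]) -> object:
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full")
        try:
            return fn()
        finally:
            self._slots.release()


payment_bulkhead = Bulkhead(max_concurrent=10)  # mirrors the 10-thread pool above
```

Rejecting at the limit keeps the remaining workers free for calls to healthy dependencies.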
    

    Fallback

    Define what to return when a call fails — a default, cached response, or degraded result.

    ProductService fails during checkout
    → Fallback: return product name from order history cache
    → User can still complete checkout; images/descriptions missing
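
The checkout example might be sketched as a wrapper that tries the live call first and marks the result as degraded when it has to use the cache. The names here (`with_fallback`, `order_history_cache`) are hypothetical, and a real service would catch only transport errors instead of every exception.

```python
from typing import Callable, Optional


def with_fallback(primary: Callable[[], dict],
                  fallback: Callable[[], Optional[dict]]) -> dict:
    """Try the primary call; on failure return a degraded result marked as such."""
    try:
        return {**primary(), "degraded": False}
    except Exception:
        cached = fallback()
        if cached is None:
            raise  # nothing sensible to fall back to: surface the original error
        return {**cached, "degraded": True}


order_history_cache = {"p1": {"name": "Blue Mug"}}


def fetch_product() -> dict:
    raise ConnectionError("ProductService unavailable")  # simulated outage


product = with_fallback(fetch_product, lambda: order_history_cache.get("p1"))
```

The `degraded` flag lets the caller decide what to hide, for example images and descriptions, instead of failing the whole page.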
    

    ---

    Scaling

    Four Axes of Scaling

    Vertical scaling — bigger machine. Simple, limited, expensive.

    Horizontal scaling — more instances behind a load balancer. Requires stateless services.

    Data partitioning — split data across shards so each instance owns a subset.

    Functional decomposition — extract high-load capabilities into their own service.

    System is slow → profile → 80% of load is Search
    → Extract SearchService → scale it independently
    → Rest of system unaffected
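
For the data partitioning axis above, the simplest routing rule is a stable hash of the key modulo the number of shards. This is a sketch under that assumption; `shard_for` is a hypothetical helper, and because modulo hashing remaps most keys when the shard count changes, systems that reshard often tend to use consistent hashing instead.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Every instance computes the same shard for the same key, so no lookup table is needed to find where a customer's data lives.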
    

    Stateless Services

    Services that hold no in-memory state can be scaled horizontally by adding instances.

    Session state in memory  → breaks horizontal scaling (requests must hit same instance)
    Session state in Redis   → any instance can serve any user
    

    Caching at the Service Level

    Each service owns its own cache. Never share a cache between services.

    UserService: cache user profiles in Redis (TTL: 5 min)
    OrderService: cache order summaries in Redis (TTL: 1 min)
    
    → Independent invalidation
    → No cross-service cache dependency
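
The TTL behaviour can be shown with a tiny in-process cache. It is a stand-in for Redis keys with an expiry, not a Redis client; `TTLCache` is a hypothetical class, and each service would create and own its instance.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-process cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it so the caller refetches
            return None
        return value

    def set(self, key: str, value: object) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


user_profiles = TTLCache(ttl_seconds=300)   # owned by UserService only
order_summaries = TTLCache(ttl_seconds=60)  # owned by OrderService only
```

Each cache has its own TTL and is invalidated only by the service that owns the underlying data.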
    

    ---

    Key Takeaways

    Principle                    Rule
    Independent deployability    The only non-negotiable property of a microservice
    Domain boundaries            Model on business capability, not technical layers
    Own your data                No shared databases as a target state; tolerate one only as a migration step
    Strangler fig                Migrate incrementally; never big-bang rewrite
    Prefer async                 Event-driven communication reduces temporal coupling
    Sagas over transactions      Compensating actions replace distributed ACID
    Contract tests               Replace fragile end-to-end tests for integration verification
    Observe everything           Logs + metrics + tracing in every service
    Zero trust                   Authenticate every call; mTLS at the transport layer
    Fail gracefully              Timeouts, retries, circuit breakers, bulkheads on every outbound call
    ---

    Resources

  • Building Microservices 2nd ed. — Sam Newman
  • Monolith to Microservices — Sam Newman
  • OpenTelemetry
  • Pact Contract Testing