
System Design Series Part 5: Microservices Architecture

January 25, 2026 · Wasil Zafar · 85 min read

Master microservices architecture patterns for building scalable, maintainable systems. Learn service decomposition, API gateways, service mesh, containerization, and Kubernetes orchestration.

Table of Contents

  1. Microservices Fundamentals
  2. Service Decomposition
  3. Architecture Patterns
  4. Service Collaboration
  5. Communication & Discovery
  6. Reliability & Resilience
  7. Cross-Cutting Concerns
  8. Testing Microservices
  9. Deployment Patterns
  10. Observability Patterns
  11. UI Composition
  12. Serverless Architecture
  13. System Evolution & Refactoring
  14. Next Steps

Microservices Fundamentals

Series Navigation: This is Part 5 of the 15-part System Design Series. Review Part 4: Database Design & Sharding first.

Microservices architecture structures an application as a collection of loosely coupled, independently deployable services. Each service is owned by a small team and focuses on a specific business capability.

Overview diagram of microservices architecture showing multiple independent services with separate databases communicating through APIs and message queues
Microservices architecture overview — independently deployable services with separate data stores, communicating through well-defined APIs and async messaging
Key Insight: Microservices aren't always the answer. Start with a monolith and extract services as complexity grows and team boundaries become clear.

Core Principles

  • Single Responsibility: Each service does one thing well
  • Loose Coupling: Services interact through well-defined interfaces
  • High Cohesion: Related functionality grouped together
  • Independent Deployment: Deploy services without affecting others
  • Decentralized Data: Each service owns its data store
  • Failure Isolation: One service failure doesn't crash the system

Monolith vs Microservices

Understanding when to use each architecture is crucial:

Side-by-side comparison of monolithic architecture as a single deployable unit versus microservices as independent services with separate deployments and databases
Monolith vs microservices — a single deployable unit with shared database compared to independently deployable services with dedicated data stores
| Aspect | Monolith | Microservices |
| --- | --- | --- |
| Deployment | Deploy entire application | Deploy services independently |
| Scaling | Scale entire application | Scale individual services |
| Technology | Single tech stack | Polyglot (different stacks per service) |
| Team Structure | Large, cross-functional | Small, service-owning teams |
| Complexity | In-process calls | Network calls, distributed systems |
| Data | Shared database | Database per service |
| Testing | End-to-end easier | Integration testing complex |

When to Choose Monolith

  • Small team (< 10 developers)
  • Early-stage startup exploring product-market fit
  • Simple domain with clear boundaries
  • Need rapid development without distributed complexity
  • Limited DevOps expertise
# Monolith structure example
my_app/
├── app/
│   ├── models/
│   │   ├── user.py
│   │   ├── order.py
│   │   └── product.py
│   ├── services/
│   │   ├── user_service.py
│   │   ├── order_service.py
│   │   └── product_service.py
│   ├── api/
│   │   └── routes.py
│   └── database.py
├── tests/
└── requirements.txt

When to Choose Microservices

  • Large organization with multiple teams
  • Different parts of system have different scaling needs
  • Teams need to deploy independently
  • Complex domain with clear bounded contexts
  • Strong DevOps/infrastructure capabilities
# Microservices structure example
platform/
├── services/
│   ├── user-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── k8s/
│   ├── order-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── k8s/
│   ├── payment-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── k8s/
│   └── notification-service/
│       ├── src/
│       ├── Dockerfile
│       └── k8s/
├── api-gateway/
├── infrastructure/
└── docker-compose.yml
The Monolith First Approach: Many successful companies (Shopify, Etsy) started as monoliths and extracted services as they scaled. Don't prematurely optimize for microservices—the complexity isn't free.

Service Decomposition

Breaking a monolith into microservices requires thoughtful decomposition strategies:

Diagram showing DDD bounded contexts mapping to microservices, with an e-commerce domain split into user, catalog, order, inventory, payment, and shipping contexts
Service decomposition using bounded contexts — each DDD context maps to an independent microservice with clear ownership boundaries

Domain-Driven Design (DDD)

DDD provides a toolkit for decomposing complex domains into services. It works at two levels: strategic design (how domains relate) and tactical design (patterns within a domain).

Ubiquitous Language

Each bounded context develops its own ubiquitous language — a shared vocabulary between developers and domain experts. The same real-world concept may mean different things in different contexts:

| Term | Order Context | Shipping Context | Payment Context |
| --- | --- | --- | --- |
| Customer | Buyer placing an order | Recipient at a delivery address | Billing entity with payment methods |
| Product | Line item with quantity and price | Package with weight and dimensions | Not relevant |
| Address | Not stored (delegated) | Delivery destination with GPS coords | Billing address for fraud checks |
Key Insight: If the same term means different things in different parts of the codebase, you have found a context boundary. Each context should use its own model — never force a single "Customer" class to serve all contexts.
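The table above can be made concrete as separate per-context models. A minimal sketch (all class and field names here are ours, purely illustrative, not a prescribed API):

```python
# Context-specific "customer" models — hypothetical sketch.
# Each bounded context defines only the attributes it needs; there is
# no shared "god" Customer class spanning contexts.
from dataclasses import dataclass

# Order Context: the buyer placing an order
@dataclass
class Buyer:
    user_id: str
    display_name: str

# Shipping Context: the recipient at a delivery address
@dataclass
class Recipient:
    name: str
    street: str
    city: str
    gps: tuple  # (lat, lon) for delivery routing

# Payment Context: the billing entity with payment methods
@dataclass
class BillingEntity:
    customer_id: str
    billing_address: str
    payment_methods: list

# The same real-world person appears as three different models,
# each owned and evolved independently by its context.
buyer = Buyer(user_id="u-42", display_name="Alice")
recipient = Recipient(name="Alice", street="1 Main St",
                      city="Springfield", gps=(39.8, -89.6))
print(type(buyer).__name__, type(recipient).__name__)
```

Each context can now add, rename, or drop fields without coordinating with the others.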

Bounded Contexts & Context Mapping

Use bounded contexts from DDD to identify service boundaries. Each context encapsulates a coherent domain model:

E-Commerce Bounded Contexts

# Bounded Contexts for E-Commerce Platform

# User Context - User identity and preferences
class UserService:
    """Handles user registration, authentication, profiles"""
    def register_user(self, email, password): pass
    def authenticate(self, email, password): pass
    def get_profile(self, user_id): pass

# Catalog Context - Product information
class CatalogService:
    """Manages product listings, categories, search"""
    def get_product(self, product_id): pass
    def search_products(self, query): pass
    def get_category(self, category_id): pass

# Order Context - Order lifecycle
class OrderService:
    """Handles order creation, status, history"""
    def create_order(self, user_id, items): pass
    def get_order_status(self, order_id): pass
    def cancel_order(self, order_id): pass

# Inventory Context - Stock management
class InventoryService:
    """Manages stock levels, reservations"""
    def check_availability(self, product_id, quantity): pass
    def reserve_stock(self, product_id, quantity): pass
    def release_stock(self, product_id, quantity): pass

# Payment Context - Financial transactions
class PaymentService:
    """Processes payments, refunds"""
    def process_payment(self, order_id, amount): pass
    def refund_payment(self, payment_id): pass

# Shipping Context - Fulfillment
class ShippingService:
    """Handles shipping labels, tracking"""
    def create_shipment(self, order_id): pass
    def get_tracking(self, shipment_id): pass

Contexts interact through well-defined context mapping patterns:

| Mapping Pattern | Description | When to Use |
| --- | --- | --- |
| Shared Kernel | Two contexts share a small, jointly owned model | Closely collaborating teams with overlapping domain |
| Customer–Supplier | Upstream context provides what the downstream needs | Clear producer/consumer relationship between teams |
| Anti-Corruption Layer (ACL) | Translation layer that protects your model from external changes | Integrating with legacy systems or third-party APIs |
| Open Host Service | Published API with a well-defined protocol | Context serves many consumers with a stable contract |
| Conformist | Downstream adopts the upstream's model as-is | No leverage to influence the upstream team |
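The Anti-Corruption Layer is the pattern most teams reach for first. A minimal sketch, assuming a hypothetical legacy CRM with an awkward schema (all names here are illustrative):

```python
# Anti-Corruption Layer — hypothetical sketch.
# The ACL is the only place that knows the legacy schema, so legacy
# field names never leak into our bounded context's domain code.

class LegacyCrmClient:
    """Stand-in for a legacy system with cryptic column names."""
    def fetch(self, cust_no):
        return {"CUST_NO": cust_no, "NM": "Alice Smith",
                "EMAIL_ADDR": "alice@example.com", "STAT_CD": "A"}

class Customer:
    """Our context's clean model."""
    def __init__(self, customer_id, name, email, active):
        self.customer_id = customer_id
        self.name = name
        self.email = email
        self.active = active

class CustomerAcl:
    """Translates the legacy representation into our model."""
    def __init__(self, legacy_client):
        self.legacy = legacy_client

    def get_customer(self, customer_id):
        raw = self.legacy.fetch(customer_id)
        return Customer(
            customer_id=raw["CUST_NO"],
            name=raw["NM"],
            email=raw["EMAIL_ADDR"],
            active=(raw["STAT_CD"] == "A"),  # decode legacy status codes
        )

acl = CustomerAcl(LegacyCrmClient())
customer = acl.get_customer("C-100")
print(customer.name, customer.active)  # Alice Smith True
```

If the legacy schema changes, only the ACL translation changes; the domain model stays untouched.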

Tactical DDD — Building Blocks

Within a bounded context, DDD provides tactical patterns for modelling the domain:


Aggregates, Entities & Value Objects

# Tactical DDD Building Blocks

# Entity: has identity, mutable state
class Order:
    def __init__(self, order_id, customer_id):
        self.order_id = order_id         # Identity
        self.customer_id = customer_id
        self.items = []
        self.status = "pending"
    
    def add_item(self, product_id, quantity, price):
        self.items.append(OrderItem(product_id, quantity, price))
    
    def confirm(self):
        if not self.items:
            raise ValueError("Cannot confirm empty order")
        self.status = "confirmed"

# Value Object: no identity, immutable, compared by attributes
class Money:
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency
    
    def __eq__(self, other):
        return self.amount == other.amount and self.currency == other.currency
    
    def add(self, other):
        assert self.currency == other.currency, "Currency mismatch"
        return Money(self.amount + other.amount, self.currency)

# Aggregate Root: Order is the root, OrderItem is only accessed through it
class OrderItem:
    """Never accessed directly — always through Order aggregate"""
    def __init__(self, product_id, quantity, price):
        self.product_id = product_id
        self.quantity = quantity
        self.price = price  # Money value object

# Domain Event: something that happened in the domain
class OrderConfirmed:
    def __init__(self, order_id, customer_id, total):
        self.order_id = order_id
        self.customer_id = customer_id
        self.total = total
        self.occurred_at = "2025-06-01T10:30:00Z"

print("Aggregate: Order (root) -> OrderItem (child)")
print("Value Object: Money (immutable, no identity)")
print("Domain Event: OrderConfirmed (published to other contexts)")

Strategic vs. Tactical DDD

| Aspect | Strategic DDD | Tactical DDD |
| --- | --- | --- |
| Scope | System-wide, cross-team | Within a single bounded context |
| Key Concepts | Bounded contexts, context maps, ubiquitous language | Aggregates, entities, value objects, domain events |
| Goal | Find the right service boundaries | Model rich domain logic within a service |
| Who Leads | Architects + domain experts | Developers + domain experts |
| Impact of Getting It Wrong | Distributed monolith, team coupling | Anemic domain model, logic leaking into the services layer |

Decomposition Strategies

| Strategy | Description | When to Use |
| --- | --- | --- |
| By Business Capability | Align services with business functions | Clear business domains exist |
| By Subdomain | Core, supporting, and generic subdomains | Complex domain with varying importance |
| Strangler Fig | Gradually replace monolith pieces | Migrating an existing monolith |
| By Team | Conway's Law: match the org structure | Clear team boundaries |
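The Strangler Fig strategy is usually implemented as a routing facade in front of the monolith. A minimal sketch (service names and URLs are hypothetical):

```python
# Strangler Fig routing — hypothetical sketch.
# A facade routes extracted path prefixes to new services; everything
# else still hits the monolith. Migration = moving prefixes over time.

class StranglerFacade:
    def __init__(self, monolith_url):
        self.monolith_url = monolith_url
        self.extracted = {}  # path prefix -> new service base URL

    def extract(self, prefix, service_url):
        """Mark a path prefix as migrated to a new service."""
        self.extracted[prefix] = service_url

    def route(self, path):
        for prefix, service_url in self.extracted.items():
            if path.startswith(prefix):
                return f"{service_url}{path}"
        return f"{self.monolith_url}{path}"  # default: legacy monolith

facade = StranglerFacade("http://monolith:8000")
facade.extract("/orders", "http://order-service:8080")

print(facade.route("/orders/123"))  # handled by the new order service
print(facade.route("/reports/q1"))  # still handled by the monolith
```

Once all traffic for a capability flows to the new service, the corresponding monolith code can be deleted.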

Service Boundaries

Good service boundaries minimize inter-service communication:

Good Boundaries

# Good: Order service owns all order data
class OrderService:
    def create_order(self, user_id, items):
        # All order logic contained within service
        order = Order(user_id=user_id)
        for item in items:
            order.add_item(item)
        order.calculate_total()
        self.db.save(order)
        
        # Emit event for other services (async)
        self.event_bus.publish("order.created", order)
        return order
    
    def get_order_details(self, order_id):
        # No external calls needed
        return self.db.get(order_id)

Bad Boundaries (Distributed Monolith)

# Bad: Order service makes synchronous calls for every operation
class OrderService:
    def create_order(self, user_id, items):
        # Synchronous calls create tight coupling
        user = self.user_service.get_user(user_id)  # Network call
        
        for item in items:
            product = self.catalog_service.get_product(item.id)  # Network call
            price = self.pricing_service.get_price(item.id)  # Network call
            stock = self.inventory_service.check(item.id)  # Network call
        
        # Any service failure = order failure
        # Can't deploy independently
        # Latency compounds with each call

Signs of Bad Boundaries

  • Services need to call each other synchronously for simple operations
  • Circular dependencies between services
  • Shared database between services
  • Must deploy multiple services together
  • Same data modified by multiple services

Architecture Patterns

API Gateway

Single entry point that routes requests to appropriate services.

# API Gateway responsibilities
class APIGateway:
    def __init__(self):
        self.services = {
            "/users": "user-service:8080",
            "/orders": "order-service:8080",
            "/products": "catalog-service:8080"
        }
    
    def route_request(self, request):
        # 1. Authentication
        user = self.authenticate(request.headers.get("Authorization"))
        
        # 2. Rate limiting
        if self.rate_limiter.is_limited(user.id):
            return Response(status=429)
        
        # 3. Route to service
        service = self.get_service(request.path)
        
        # 4. Protocol translation (REST -> gRPC)
        response = service.forward(request)
        
        # 5. Response aggregation (if needed)
        return response
    
    def aggregate_product_page(self, product_id):
        """Combine data from multiple services"""
        product = self.catalog_service.get(product_id)
        reviews = self.review_service.get_for_product(product_id)
        inventory = self.inventory_service.check(product_id)
        
        return {
            **product,
            "reviews": reviews,
            "in_stock": inventory.available > 0
        }

Backend for Frontend (BFF)

Separate backend for each client type (web, mobile, IoT).

# BFF Pattern
# Each client gets optimized API

# Mobile BFF - Minimal data, pagination
class MobileBFF:
    def get_product_list(self, page=1, limit=20):
        products = self.catalog.get_products(page, limit)
        return [{
            "id": p.id,
            "name": p.name,
            "price": p.price,
            "thumbnail": p.images[0].url  # Only first image
        } for p in products]

# Web BFF - Rich data, full details
class WebBFF:
    def get_product_list(self):
        products = self.catalog.get_products()
        for product in products:
            product["reviews_summary"] = self.reviews.get_summary(product.id)
            product["availability"] = self.inventory.check(product.id)
            product["related"] = self.recommendations.get_related(product.id)
        return products

Database per Service

Each service owns its data store—no shared databases.

# Each service manages its own database
# Order Service - PostgreSQL for transactions
order_db = PostgreSQL("order-db")

# Catalog Service - MongoDB for flexible product data
catalog_db = MongoDB("catalog-db")

# Search Service - Elasticsearch for full-text search
search_db = Elasticsearch("search-cluster")

# Session Service - Redis for fast access
session_db = Redis("session-cache")

# Analytics Service - ClickHouse for time-series
analytics_db = ClickHouse("analytics-cluster")

Challenge: Cross-service queries require careful design (event sourcing, CQRS, or API composition).


Saga Pattern

Manage distributed transactions across services using compensating transactions.

# Saga Pattern for Order Creation
class OrderSaga:
    def execute(self, order_data):
        saga_log = []
        
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order_data.items)
            saga_log.append(("inventory", reservation.id))
            
            # Step 2: Process payment
            payment = self.payment.charge(order_data.user_id, order_data.total)
            saga_log.append(("payment", payment.id))
            
            # Step 3: Create order
            order = self.orders.create(order_data)
            saga_log.append(("order", order.id))
            
            # Step 4: Notify user
            self.notifications.send(order_data.user_id, "Order confirmed!")
            
            return order
            
        except Exception as e:
            # Compensating transactions (rollback)
            self.compensate(saga_log)
            raise e
    
    def compensate(self, saga_log):
        """Undo completed steps in reverse order"""
        for service, resource_id in reversed(saga_log):
            if service == "order":
                self.orders.cancel(resource_id)
            elif service == "payment":
                self.payment.refund(resource_id)
            elif service == "inventory":
                self.inventory.release(resource_id)
Saga Pattern — Choreography
sequenceDiagram
    participant OS as Order Service
    participant PS as Payment Service
    participant IS as Inventory Service
    participant SS as Shipping Service
    
    Note over OS,SS: Happy Path
    OS->>PS: OrderCreated Event
    PS->>IS: PaymentCompleted Event
    IS->>SS: InventoryReserved Event
    SS-->>OS: ShipmentScheduled Event
    
    Note over OS,SS: Compensation (Failure at Inventory)
    OS->>PS: OrderCreated Event
    PS->>IS: PaymentCompleted Event
    IS--xPS: InventoryFailed Event
    PS-->>OS: PaymentRefunded Event
    OS-->>OS: OrderCancelled
                        
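The choreography variant in the sequence diagram above has no central orchestrator: each service subscribes to events and emits its own. A minimal sketch using a toy in-memory event bus (all event and service names are illustrative):

```python
# Choreography-based saga — hypothetical sketch with a toy event bus.

class EventBus:
    def __init__(self):
        self.handlers = {}
        self.log = []  # record of every published event, for inspection

    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, data):
        self.log.append(event_type)
        for handler in self.handlers.get(event_type, []):
            handler(data)

bus = EventBus()

# Payment Service reacts to OrderCreated
def on_order_created(order):
    bus.publish("PaymentCompleted", order)
bus.subscribe("OrderCreated", on_order_created)

# Inventory Service reacts to PaymentCompleted
def on_payment_completed(order):
    if order["in_stock"]:
        bus.publish("InventoryReserved", order)
    else:
        bus.publish("InventoryFailed", order)  # triggers compensation
bus.subscribe("PaymentCompleted", on_payment_completed)

# Payment Service compensates when inventory fails
bus.subscribe("InventoryFailed",
              lambda order: bus.publish("PaymentRefunded", order))

# Happy path, then a failure path with compensation
bus.publish("OrderCreated", {"order_id": "o-1", "in_stock": True})
bus.publish("OrderCreated", {"order_id": "o-2", "in_stock": False})
print(bus.log)
```

Choreography keeps services decoupled, but the overall workflow is implicit in the event subscriptions, which can make it harder to trace than an explicit orchestrator.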

Circuit Breaker

Prevent cascading failures by stopping calls to failing services.

# Circuit Breaker Implementation
from enum import Enum
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject calls
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e
    
    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED
    
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
    
    def _should_attempt_reset(self):
        return (time.time() - self.last_failure_time) > self.recovery_timeout

# Usage
payment_circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
result = payment_circuit.call(payment_service.charge, user_id, amount)

Service Mesh

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It offloads common concerns from application code.

Service mesh architecture showing sidecar proxies attached to each microservice, with a control plane managing traffic routing, mTLS encryption, and observability
Service mesh architecture — sidecar proxies handle cross-cutting concerns like traffic management, mTLS encryption, and distributed tracing without modifying application code

Service Mesh Capabilities

  • Traffic Management: Load balancing, routing, retries
  • Security: mTLS encryption, authentication
  • Observability: Metrics, tracing, logging
  • Resilience: Circuit breaking, timeouts, rate limiting

Istio Service Mesh Example

# Istio VirtualService for traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: v2
          weight: 100
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10  # 10% canary traffic

---
# Istio DestinationRule for circuit breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s

Popular Service Meshes

| Service Mesh | Architecture | Best For |
| --- | --- | --- |
| Istio | Envoy sidecar | Full-featured, large deployments |
| Linkerd | Lightweight proxy | Simplicity, performance |
| Consul Connect | HashiCorp ecosystem | Multi-cloud, service discovery |
| AWS App Mesh | Envoy, AWS integration | AWS-native workloads |

Service Collaboration Patterns

When operations span multiple services, you need patterns for maintaining data consistency and composing data across service boundaries — without falling back to a shared database.

Data Ownership Strategies

| Pattern | Description | Pros | Cons |
| --- | --- | --- | --- |
| Database per Service | Each service owns a private database; no other service accesses it directly | Loose coupling, independent scaling, polyglot persistence | Cross-service queries harder, eventual consistency |
| Shared Database (anti-pattern) | Multiple services read/write the same database | Simple queries, strong consistency | Tight coupling, schema changes break multiple services, scaling bottleneck |
| Command-side Replica | A service maintains a read-only copy of another service's data, updated via events | Fast local reads, reduced inter-service calls | Data staleness, storage duplication |
Anti-pattern — Shared Database: When multiple services share a database, any schema change requires coordinating all consumer services. This creates a distributed monolith — you get the complexity of microservices without the benefits of independent deployment.

API Composition

Implement cross-service queries by invoking the services that own the data and performing an in-memory join:

API Composition Pattern

# API Composition — aggregate data from multiple services
import asyncio

class OrderDetailsComposer:
    """
    Composes a rich order view by querying multiple services
    and joining results in memory.
    """
    def __init__(self, order_svc, user_svc, product_svc):
        self.order_svc = order_svc
        self.user_svc = user_svc
        self.product_svc = product_svc
    
    async def get_order_details(self, order_id):
        # Fetch order first (owns the core data)
        order = await self.order_svc.get(order_id)
        
        # Parallel calls to other services
        user, products = await asyncio.gather(
            self.user_svc.get(order["user_id"]),
            self.product_svc.get_batch(
                [item["product_id"] for item in order["items"]]
            )
        )
        
        # In-memory join
        product_map = {p["id"]: p for p in products}
        return {
            "order_id": order["id"],
            "status": order["status"],
            "customer": {
                "name": user["name"],
                "email": user["email"]
            },
            "items": [
                {
                    **item,
                    "product_name": product_map[item["product_id"]]["name"],
                    "image_url": product_map[item["product_id"]]["image"]
                }
                for item in order["items"]
            ],
            "total": order["total"]
        }

# Typically implemented in the API Gateway or a BFF
print("API Composition: query N services, join in memory")
print("Best for: read-heavy queries spanning 2-4 services")

CQRS & Event Sourcing

For high-throughput systems with asymmetric read/write loads, CQRS (Command Query Responsibility Segregation) separates the write model from the read model. Combined with Event Sourcing, every state change is stored as an immutable event — enabling full audit trails and temporal queries.

| Pattern | How It Works | When to Use |
| --- | --- | --- |
| CQRS | Write commands go to a normalised write store; reads are served from denormalised, pre-computed (materialized) views | Read/write ratio > 10:1, complex queries across aggregates |
| Event Sourcing | Persist every state change as an immutable event; rebuild current state by replaying the event log | Audit-critical domains (finance, healthcare), complex event-driven workflows |
| Domain Events | Publish an event whenever data changes; other services subscribe and react asynchronously | Decoupled service communication, event-driven choreography |
Deep Dive: Full implementations of Event Sourcing and CQRS with code examples are covered in Part 7: Message Queues & Event-Driven Architecture.

Transactional Messaging

A critical challenge in microservices: how do you atomically update a database and publish an event? If you update the DB but the event publish fails (or vice versa), you get inconsistency.

Transactional Outbox Pattern

Write events to an outbox table in the same database transaction as the business data. A separate process reads the outbox and publishes events to the message broker.

# Transactional Outbox Pattern
import json
import uuid

class OrderService:
    def create_order(self, user_id, items, total):
        # Single database transaction
        with self.db.transaction() as tx:
            # 1. Write business data
            order_id = str(uuid.uuid4())
            tx.execute(
                "INSERT INTO orders (id, user_id, total, status) "
                "VALUES (%s, %s, %s, 'created')",
                (order_id, user_id, total)
            )
            for item in items:
                tx.execute(
                    "INSERT INTO order_items (order_id, product_id, qty, price) "
                    "VALUES (%s, %s, %s, %s)",
                    (order_id, item["product_id"], item["qty"], item["price"])
                )
            
            # 2. Write event to outbox (SAME transaction)
            tx.execute(
                "INSERT INTO outbox (id, aggregate_type, aggregate_id, "
                "event_type, payload, created_at) "
                "VALUES (%s, 'Order', %s, 'OrderCreated', %s, NOW())",
                (str(uuid.uuid4()), order_id, json.dumps({
                    "order_id": order_id,
                    "user_id": user_id,
                    "total": total,
                    "items": items
                }))
            )
            # Both writes commit or both roll back — atomicity guaranteed

# Separate process: OutboxRelay reads outbox, publishes to Kafka/RabbitMQ
class OutboxRelay:
    """Polls the outbox table and publishes events to the broker."""
    def relay(self):
        rows = self.db.query(
            "SELECT * FROM outbox WHERE published = FALSE "
            "ORDER BY created_at LIMIT 100"
        )
        for row in rows:
            self.broker.publish(row["event_type"], row["payload"])
            self.db.execute(
                "UPDATE outbox SET published = TRUE WHERE id = %s",
                (row["id"],)
            )

print("Outbox guarantees: DB write + event publish are atomic")
print("Relay methods: polling, CDC (Change Data Capture), log tailing")

Outbox Relay Strategies

| Strategy | How It Works | Latency | Complexity |
| --- | --- | --- | --- |
| Polling Publisher | Periodically query the outbox for unpublished events | Medium (polling interval) | Low: simple SQL queries |
| Transaction Log Tailing | Read the database's transaction log (WAL/binlog) and extract outbox inserts | Low (near real-time) | High: requires CDC tools (Debezium, Maxwell) |

Communication & Discovery

Microservices need to communicate with each other and with external clients. The choice of communication style affects coupling, latency, and resilience.

Communication Styles

| Style | Mechanism | Coupling | When to Use |
| --- | --- | --- | --- |
| Remote Procedure Invocation (RPI) | REST, gRPC, GraphQL; synchronous request/response | Higher (temporal coupling) | Queries, real-time interactions, simple CRUD |
| Asynchronous Messaging | Kafka, RabbitMQ, SQS; publish/subscribe or point-to-point | Lower (no temporal coupling) | Event-driven workflows, fire-and-forget, long-running operations |
| Domain-Specific Protocol | SMTP for email, MQTT for IoT, FIX for trading | Varies | Integrating with systems that use specialised protocols |
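The temporal-coupling difference between the two main styles can be shown in a few lines. A toy sketch (the service and queue classes are stand-ins, not a real broker API):

```python
# Sync (RPI) vs async messaging — hypothetical sketch of temporal coupling.

class NotificationService:
    def __init__(self):
        self.available = True

    def send(self, message):
        if not self.available:
            raise ConnectionError("notification-service is down")
        return "sent"

class Queue:
    """Toy message queue: accepting a publish does not require the
    consumer to be up right now."""
    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)  # succeeds even if consumer is down

notifications = NotificationService()
notifications.available = False  # simulate an outage
queue = Queue()

# RPI: the caller fails when the downstream is down (temporal coupling)
try:
    notifications.send("order o-1 confirmed")
except ConnectionError as e:
    print(f"sync call failed: {e}")

# Messaging: publish succeeds; the consumer processes after it recovers
queue.publish("order o-1 confirmed")
print(f"queued: {len(queue.messages)} message(s)")
```

With messaging, the producer's success depends only on the broker being up, not on every downstream consumer.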

Idempotent Consumer

In distributed systems, messages can be delivered more than once (at-least-once delivery). An idempotent consumer ensures that processing the same message multiple times produces the same result.

# Idempotent Consumer Pattern
class PaymentEventHandler:
    def __init__(self, db, payment_service):
        self.db = db
        self.payment_service = payment_service
    
    def handle_order_created(self, event):
        message_id = event["message_id"]
        
        # Check if we've already processed this message
        if self.db.exists("processed_messages", message_id):
            print(f"Duplicate message {message_id} — skipping")
            return  # Idempotent: no side effects
        
        # Process the message
        self.payment_service.charge(
            user_id=event["user_id"],
            amount=event["total"],
            order_id=event["order_id"]
        )
        
        # Record that we processed this message
        self.db.insert("processed_messages", {
            "message_id": message_id,
            "processed_at": "2026-03-26T10:00:00Z",
            "event_type": "OrderCreated"
        })

print("Idempotency key: message_id (UUID assigned by producer)")
print("Storage: processed_messages table with TTL cleanup")

Self-Contained Services

A self-contained service can handle synchronous requests without waiting for other services to respond. It achieves this by maintaining local replicas of the data it needs:

  • Subscribe to events from upstream services to keep a local read-only copy of the data
  • No synchronous calls during request handling — all data needed is available locally
  • Trade-off: Data may be slightly stale (eventual consistency) but the service never blocks on another service's availability
Self-Contained vs Distributed Monolith: A self-contained service fails gracefully when dependencies are down. A distributed monolith cascades failures through synchronous call chains. Prefer self-contained services for all read paths.
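The bullet points above can be sketched as a service that serves reads entirely from a local, event-maintained replica (class and event names are ours, purely illustrative):

```python
# Self-contained service — hypothetical sketch.
# The Order service keeps a local replica of product names, updated by
# Catalog events, so reads never make a synchronous call to Catalog.

class OrderService:
    def __init__(self):
        self.orders = {}
        self.product_names = {}  # local read-only replica of Catalog data

    # Event handler: keeps the replica fresh (eventually consistent)
    def on_product_updated(self, event):
        self.product_names[event["product_id"]] = event["name"]

    def create_order(self, order_id, product_id):
        self.orders[order_id] = {"product_id": product_id}

    def get_order(self, order_id):
        order = self.orders[order_id]
        # Served entirely from local data — this read path keeps working
        # even during a catalog-service outage.
        return {
            "order_id": order_id,
            "product_name": self.product_names.get(
                order["product_id"], "unknown"),
        }

svc = OrderService()
svc.on_product_updated({"product_id": "p-1", "name": "Mechanical Keyboard"})
svc.create_order("o-1", "p-1")
print(svc.get_order("o-1"))
```

The trade-off is visible in the `get` fallback: if the replica is stale or missing an entry, the service degrades gracefully rather than blocking on another service.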

Service Discovery

In a dynamic environment where service instances scale up/down and IP addresses change, service discovery lets services find each other:

| Pattern | How It Works | Example |
| --- | --- | --- |
| Client-Side Discovery | Client queries a service registry and uses a load-balancing algorithm to select an instance | Netflix Eureka + Ribbon |
| Server-Side Discovery | Client calls a router/load balancer, which queries the registry and forwards the request | AWS ALB, Kubernetes Service |
| Service Registry | Central database of service instance locations (host + port), health status, and metadata | Consul, Eureka, etcd, ZooKeeper |
| Self-Registration | A service instance registers itself with the registry on startup and deregisters on shutdown | Spring Cloud Eureka client |
| 3rd-Party Registration | A separate registrar (e.g., a container orchestrator) registers/deregisters instances automatically | Kubernetes (kubelet), AWS ECS |

Service Discovery in Practice

# Service Discovery — Client-Side (e.g., with Consul)
import random

class ServiceRegistry:
    """In production, this would be Consul, Eureka, or etcd."""
    def __init__(self):
        self.services = {}
    
    def register(self, service_name, instance_id, host, port):
        self.services.setdefault(service_name, []).append({
            "instance_id": instance_id,
            "host": host,
            "port": port,
            "healthy": True
        })
    
    def deregister(self, service_name, instance_id):
        self.services[service_name] = [
            i for i in self.services[service_name]
            if i["instance_id"] != instance_id
        ]
    
    def get_instances(self, service_name):
        instances = self.services.get(service_name, [])
        return [i for i in instances if i["healthy"]]

class ClientSideLoadBalancer:
    """Client picks an instance from the registry."""
    def __init__(self, registry):
        self.registry = registry
    
    def call(self, service_name, request):
        instances = self.registry.get_instances(service_name)
        if not instances:
            raise Exception(f"No healthy instances for {service_name}")
        
        # Round-robin, random, or weighted selection
        instance = random.choice(instances)
        url = f"http://{instance['host']}:{instance['port']}"
        print(f"Routing to {service_name} at {url}")
        return url

# Kubernetes replaces all of this with built-in DNS — a Service name
# resolves via cluster DNS: order-service.default.svc.cluster.local
registry = ServiceRegistry()
registry.register("order-service", "order-1", "10.0.1.5", 8080)
registry.register("order-service", "order-2", "10.0.1.6", 8080)
registry.register("order-service", "order-3", "10.0.1.7", 8080)

lb = ClientSideLoadBalancer(registry)
lb.call("order-service", "/api/orders/123")
Consul Eureka Kubernetes DNS

Reliability & Resilience

In a distributed system, failures aren't exceptional — they're expected. Resilience patterns prevent a single service failure from cascading across the entire system.

Pattern Purpose How It Works
Circuit Breaker Stop calling a failing service After N consecutive failures, "open" the circuit and return a fallback. Periodically retry ("half-open") to check recovery.
Bulkhead Isolate failures to one component Limit the number of concurrent calls to each dependency using thread pools or semaphores — a slow dependency can't exhaust all threads.
Retry with Backoff Handle transient failures Retry failed calls with exponential backoff + jitter (e.g., 100ms → 200ms → 400ms + random). Cap at max retries (typically 3).
Timeout Bound latency Set explicit timeouts on all outbound calls (e.g., 3s for sync, 30s for async). Never use "infinite" timeouts.
Fallback Degrade gracefully When a dependency fails, return cached data, default values, or a reduced-functionality response instead of an error.
Key Insight: Combine patterns in layers: Timeout → Retry → Circuit Breaker → Bulkhead → Fallback. Libraries like Resilience4j (Java) and Polly (.NET) implement all five patterns with declarative configuration. Service meshes like Istio handle these at the infrastructure level.
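The circuit breaker row in the table above can be sketched directly. This is a minimal single-threaded illustration, not a Resilience4j port — the thresholds and timeout values are arbitrary:

```python
import time

# Minimal circuit breaker sketch: after N consecutive failures the circuit
# "opens" and calls short-circuit to the fallback; after a recovery timeout
# one trial request is allowed ("half-open"). Thresholds are illustrative.
class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one trial request
            else:
                return fallback()         # short-circuit: don't call func
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"
        return result

cb = CircuitBreaker(failure_threshold=2, recovery_timeout=60.0)
def flaky():
    raise RuntimeError("downstream timeout")

print(cb.call(flaky, lambda: "cached response"))  # cached response
print(cb.call(flaky, lambda: "cached response"))  # cached response
print(cb.state)                                   # OPEN
```

A production implementation would also need thread safety, per-endpoint circuits, and sliding-window failure rates rather than a simple consecutive-failure counter.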

Cross-Cutting Concerns

Every microservice needs logging, health checks, configuration, security, and metrics. Without patterns for reuse, each team reinvents the wheel — leading to inconsistency and drift.

Pattern Description Example
Microservice Chassis A framework or library that handles cross-cutting concerns (logging, health checks, config, metrics, tracing) so service developers focus on business logic Spring Boot + Spring Cloud, Go Kit, Dapr
Externalized Configuration Store configuration (DB URLs, API keys, feature flags) outside the service binary — injected at runtime via environment variables or config servers Kubernetes ConfigMaps/Secrets, Spring Cloud Config, AWS Parameter Store, HashiCorp Vault
Service Template A project template implementing standard cross-cutting concerns. Developers copy and customise it to quickly create new services with consistent structure Cookiecutter templates, GitHub template repos, Yeoman generators
# Kubernetes ConfigMap — Externalized Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
data:
  DATABASE_URL: "postgresql://db:5432/orders"
  CACHE_TTL: "300"
  LOG_LEVEL: "info"
  FEATURE_NEW_CHECKOUT: "true"
---
# Inject into the pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: order-service
          envFrom:
            - configMapRef:
                name: order-service-config
            - secretRef:
                name: order-service-secrets  # Sensitive values

Testing Microservices

Testing distributed systems is fundamentally different from monolith testing. The testing pyramid for microservices adds contract and component tests between unit and end-to-end tests:

Test Type What It Verifies Scope Speed
Unit Tests Individual functions and classes within a service Single class/function Milliseconds
Service Component Test A service in isolation — test its endpoints with test doubles (mocks/stubs) for all external dependencies Entire service (isolated) Seconds
Consumer-Driven Contract Test A test suite written by the consumer team that defines the API contract it expects from a provider service. The provider runs these tests in its CI. Service API boundary Seconds
Consumer-Side Contract Test A test suite for a service client (API wrapper / SDK) verifying it can communicate correctly with the provider Client library Seconds
Integration Tests Interaction between 2-3 real services (no mocks) Service pair/group Minutes
End-to-End Tests Full user journey across all services Entire system Minutes–Hours
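The service component test row above can be sketched with a test double. The `create_order` function and `user_client` interface here are hypothetical stand-ins for a real service's logic and its HTTP client to user-service:

```python
# Service component test sketch: exercise the service's logic in isolation,
# replacing the real user-service HTTP client with a stub.
from unittest.mock import Mock

def create_order(user_client, user_id, items):
    """Business logic under test — depends on an injected user client."""
    user = user_client.get_user(user_id)
    if user is None:
        raise ValueError("unknown user")
    return {"user_id": user["id"], "items": items, "status": "created"}

def test_create_order_with_stubbed_user_service():
    # Test double replaces the network call to user-service
    user_client = Mock()
    user_client.get_user.return_value = {"id": "123", "name": "Jane"}

    order = create_order(user_client, "123", ["book"])

    assert order["status"] == "created"
    user_client.get_user.assert_called_once_with("123")

test_create_order_with_stubbed_user_service()
print("component test passed")
```

Because every external dependency is doubled, the test runs in milliseconds and fails only when this service's behaviour changes — not when a neighbour is flaky.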

Consumer-Driven Contract Test (Pact)

# Consumer-Driven Contract Test with Pact (pact-python)
# Written by the ORDER SERVICE team (consumer)
# Tests what it expects from the USER SERVICE (provider)
from pact import Consumer, Provider, Like

# Step 1: Consumer defines expected interactions
def test_get_user_contract():
    """Order service expects user-service to return user by ID."""
    pact = Consumer("order-service").has_pact_with(Provider("user-service"))
    
    # Define expected interaction
    pact.given("user 123 exists").upon_receiving(
        "a request for user 123"
    ).with_request(
        method="GET", path="/api/users/123"
    ).will_respond_with(
        status=200,
        body={
            "id": "123",
            "name": Like("Jane Doe"),      # Any string
            "email": Like("jane@test.com")  # Any string
        }
    )
    
    # Run test against mock provider
    with pact:
        user = order_service_client.get_user("123")
        assert user["id"] == "123"
        assert "name" in user
        assert "email" in user
    
    # Pact file generated → shared with provider team
    # Provider runs this contract in their CI pipeline
    print("Contract published to Pact Broker")
    print("Provider verifies contract on every build")
Pact Contract Testing
Testing Strategy: Aim for many unit tests, several component tests, contract tests for every service boundary, a handful of integration tests, and very few end-to-end tests. Contract tests catch breaking API changes early without the brittleness of full E2E suites.

Deployment Patterns

How you deploy services affects isolation, resource utilisation, and operational complexity:

Pattern Description Isolation Cost Efficiency
Multiple Instances per Host Run several service instances on the same physical/virtual host Low (shared resources, port conflicts) High (dense packing)
Service Instance per VM Each service instance runs in its own virtual machine High (full OS isolation) Low (VM overhead per instance)
Service Instance per Container Each instance runs in a container (Docker) — lightweight isolation with shared kernel Medium–High (namespace + cgroup isolation) High (fast startup, efficient resources)
Serverless Deployment Deploy functions (Lambda, Azure Functions) — no server management, scale-to-zero High (cloud-managed isolation) Very High for spiky/low traffic
Service Deployment Platform Use a platform (Kubernetes, ECS, Nomad) that abstracts infrastructure and provides service-level primitives (scaling, health checks, rolling updates) Configurable High (automated management)
Industry Standard: Most organisations have converged on container-per-service on Kubernetes as the default pattern. Serverless (Lambda/Cloud Functions) is used for event-driven glue logic and low-traffic services. See Cloud Computing: Containers & Kubernetes and Cloud Computing: Serverless Services for deeper coverage.

Observability Patterns

In a distributed system, you can't just SSH into one box and tail a log file the way you could with a monolith. Observability requires structured, centralised tooling across all services:

Pattern Purpose Tools
Log Aggregation Collect logs from all services into a central, searchable store ELK Stack, Loki + Grafana, CloudWatch Logs
Application Metrics Instrument code to gather statistics — request rate, error rate, latency percentiles Prometheus + Grafana, Datadog, CloudWatch Metrics
Distributed Tracing Assign a unique trace ID to each request and propagate it across services to track the full request lifecycle Jaeger, Zipkin, AWS X-Ray, OpenTelemetry
Health Check API Expose a /health endpoint for load balancers and orchestrators to probe service liveness and readiness Spring Boot Actuator, custom /healthz
Exception Tracking Report uncaught exceptions to a centralised service that aggregates, deduplicates, and alerts Sentry, Bugsnag, Rollbar
Audit Logging Record user actions (who did what, when) in an append-only audit database for compliance and forensics Custom audit service, AWS CloudTrail
Log Deployments & Changes Annotate metrics dashboards with deployment events — correlate performance changes with releases Grafana annotations, PagerDuty change events
Deep Dive: Full implementations of the three pillars of observability (logs, metrics, traces) with alerting strategies and SLO-based monitoring are covered in Part 10: Monitoring & Observability.
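Two of the patterns above — the health check API and trace ID propagation — can be sketched without any framework. The header name `X-Trace-Id` and the payload shape are illustrative conventions (real systems typically use the W3C `traceparent` header via OpenTelemetry):

```python
import uuid

# Observability sketches: a health-check payload for orchestrator probes, and
# trace-ID propagation across service hops. Names here are illustrative.
def health_check(db_ok, cache_ok):
    """Payload returned by a /health endpoint."""
    healthy = db_ok and cache_ok
    return {
        "status": "UP" if healthy else "DOWN",
        "checks": {"database": db_ok, "cache": cache_ok},
    }

def with_trace_id(incoming_headers):
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    outgoing = dict(incoming_headers)
    outgoing["X-Trace-Id"] = trace_id  # propagate to all downstream calls
    return trace_id, outgoing

print(health_check(db_ok=True, cache_ok=True)["status"])   # UP
print(health_check(db_ok=True, cache_ok=False)["status"])  # DOWN
trace_id, headers = with_trace_id({})  # edge service starts a new trace
```

The key discipline is the second function: every outbound call must carry the same trace ID it received, or the distributed trace breaks at that hop.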

UI Composition (Micro-Frontends)

When the backend is decomposed into microservices, how should the UI be structured? Two approaches for composing a single page from multiple teams' work:

Pattern How It Works Pros Cons
Server-Side Page Fragment Composition A gateway/server assembles a page from HTML fragments generated by different services (e.g., Nginx SSI, Edge Side Includes, Tailor by Zalando) Works without JavaScript, SEO-friendly, fast first paint Limited interactivity between fragments, server-side coupling
Client-Side UI Composition Each team ships an independent frontend module (React/Angular/Vue micro-app) loaded and orchestrated on the client via Module Federation, Single-SPA, or iframes Full team autonomy, rich interactivity, independent deployments Larger bundle size, runtime integration complexity, shared state management

Micro-Frontend Architecture

// Webpack Module Federation — micro-frontend setup
// Each team deploys independently, modules loaded at runtime

// Team A: Product Catalog (host application)
// webpack.config.js
const { ModuleFederationPlugin } = require("webpack").container;

module.exports = {
  plugins: [
    new ModuleFederationPlugin({
      name: "catalog_app",
      remotes: {
        // Other teams' micro-frontends loaded at runtime
        reviews: "reviews_app@https://reviews.cdn.example.com/remoteEntry.js",
        cart: "cart_app@https://cart.cdn.example.com/remoteEntry.js",
        recommendations: "recs_app@https://recs.cdn.example.com/remoteEntry.js"
      },
      shared: ["react", "react-dom"]  // Shared dependencies
    })
  ]
};

// In the host app — compose micro-frontends
// ProductPage.jsx
import React, { Suspense } from "react";
const ReviewsWidget = React.lazy(
  () => import("reviews/ReviewsWidget")
);
const CartButton = React.lazy(
  () => import("cart/AddToCartButton")
);
const Recommendations = React.lazy(
  () => import("recommendations/RelatedProducts")
);

function ProductPage({ productId }) {
  return (
    <div>
      <ProductDetails id={productId} />
      <Suspense fallback="Loading reviews...">
        <ReviewsWidget productId={productId} />
      </Suspense>
      <Suspense fallback="Loading cart...">
        <CartButton productId={productId} />
      </Suspense>
      <Suspense fallback="Loading recommendations...">
        <Recommendations productId={productId} />
      </Suspense>
    </div>
  );
}

console.log("Each widget deployed independently by different teams");
Module Federation Micro-Frontends Independent Deploy

Serverless Architecture

Key Insight: Serverless doesn't mean "no servers" — it means you don't manage servers. The cloud provider handles scaling, patching, and infrastructure automatically.

Function-as-a-Service (FaaS) is a cloud computing model where you deploy individual functions that execute in response to events. AWS Lambda, Azure Functions, and Google Cloud Functions are popular FaaS platforms.

Serverless FaaS architecture showing event sources triggering individual functions that auto-scale, with pay-per-execution billing and managed infrastructure
Serverless Function-as-a-Service architecture — event sources trigger stateless functions that auto-scale from zero with pay-per-execution billing

Serverless Benefits

  • No Server Management: Focus on code, not infrastructure
  • Auto-Scaling: Scale to zero or thousands instantly
  • Pay-per-Use: Only pay for actual execution time
  • Built-in HA: Multi-AZ by default

AWS Lambda Function Example

# AWS Lambda Function - Process uploaded images
import boto3
import json
import urllib.parse
from PIL import Image
import io

s3 = boto3.client('s3')

def lambda_handler(event, context):
    """Triggered when image uploaded to S3"""
    # Get uploaded file info (S3 URL-encodes object keys in event payloads)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    
    # Download image
    response = s3.get_object(Bucket=bucket, Key=key)
    image_content = response['Body'].read()
    
    # Process image (convert to RGB so PNG/RGBA uploads can be saved as JPEG)
    image = Image.open(io.BytesIO(image_content)).convert('RGB')
    thumbnail = image.resize((200, 200))
    
    # Save thumbnail
    buffer = io.BytesIO()
    thumbnail.save(buffer, format='JPEG')
    buffer.seek(0)
    
    thumbnail_key = f"thumbnails/{key}"
    s3.put_object(
        Bucket=bucket,
        Key=thumbnail_key,
        Body=buffer.getvalue(),
        ContentType='image/jpeg'
    )
    
    return {
        'statusCode': 200,
        'body': json.dumps({'thumbnail': thumbnail_key})
    }
AWS Lambda Event-Driven

Serverless Patterns

Event-Driven Processing

Functions triggered by events from various sources:

# serverless.yml - Event triggers
service: order-processor

functions:
  processOrder:
    handler: handler.process_order
    events:
      - sqs:
          arn: arn:aws:sqs:region:account:order-queue
          batchSize: 10
  
  sendNotification:
    handler: handler.send_notification
    events:
      - sns:
          arn: arn:aws:sns:region:account:order-notifications
  
  apiEndpoint:
    handler: handler.api
    events:
      - http:
          path: /orders
          method: post
  
  scheduledReport:
    handler: handler.generate_report
    events:
      - schedule: rate(1 day)

Function Composition

Chain functions using Step Functions or event choreography:

# AWS Step Functions - Order workflow
{
  "Comment": "Order Processing Workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:validateOrder",
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:checkInventory",
      "Next": "ProcessPayment",
      "Catch": [
        {
          "ErrorEquals": ["OutOfStockError"],
          "Next": "NotifyOutOfStock"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:processPayment",
      "Next": "FulfillOrder"
    },
    "FulfillOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:fulfillOrder",
      "End": true
    },
    "NotifyOutOfStock": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:notifyUser",
      "End": true
    }
  }
}

Serverless Tradeoffs

Benefit Tradeoff
No server management Less control over environment
Auto-scaling Cold start latency (100 ms to a few seconds)
Pay-per-execution Expensive for constant high load
Built-in HA Vendor lock-in
Rapid development Debugging/testing more complex
Event-driven simplicity Limited execution time (15 min max)
Cost Comparison: At low traffic, serverless is cheaper. But at ~1M requests/day with 200ms execution, EC2 often becomes more economical. Calculate your break-even point!
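The break-even arithmetic from the callout above can be worked through directly. The prices below are illustrative (roughly AWS us-east-1 list prices at the time of writing) — substitute your own region's rates and instance sizes:

```python
# Back-of-envelope serverless vs. EC2 cost comparison. All prices are
# illustrative approximations, not an authoritative AWS quote.
LAMBDA_PER_REQUEST = 0.20 / 1_000_000   # $ per invocation
LAMBDA_PER_GB_SECOND = 0.0000166667     # $ per GB-second of compute
EC2_PER_HOUR = 0.0416                   # e.g. one t3.medium, on-demand

def lambda_monthly_cost(requests_per_day, duration_ms, memory_gb=0.5):
    requests = requests_per_day * 30
    gb_seconds = requests * (duration_ms / 1000) * memory_gb
    return requests * LAMBDA_PER_REQUEST + gb_seconds * LAMBDA_PER_GB_SECOND

def ec2_monthly_cost(instance_count=1):
    return instance_count * EC2_PER_HOUR * 24 * 30

for rpd in (10_000, 1_000_000, 10_000_000):
    print(f"{rpd:>10} req/day: Lambda ${lambda_monthly_cost(rpd, 200):>9,.2f}"
          f"  vs  EC2 ${ec2_monthly_cost():,.2f}/month")
```

With 200 ms executions at 512 MB, Lambda comes out around $0.56/month at 10k requests/day versus roughly $30 for an always-on instance, but crosses over near the 1M requests/day mark — matching the callout's rule of thumb.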

When to Use Serverless

  • ✓ Variable/unpredictable traffic
  • ✓ Event-driven workloads
  • ✓ Short-running tasks (< 15 minutes)
  • ✓ Rapid prototyping
  • ✓ Background jobs (image processing, data ETL)
  • ✗ Long-running processes
  • ✗ Consistent high-throughput workloads
  • ✗ Latency-sensitive applications (cold starts)

System Evolution & Refactoring

Most real-world systems aren't built from scratch — they evolve from monoliths into distributed architectures. This section covers proven strategies for modernising legacy systems incrementally, without big-bang rewrites.

The Strangler Fig Pattern

Named after the tropical strangler fig that grows around a host tree until it replaces it entirely. The pattern routes traffic to a new service while the legacy module still runs, progressively migrating functionality.


Strangler Fig Implementation

# Strangler Fig Pattern — incremental monolith decomposition

# Step 1: Intercept requests at the gateway
# (forward_to and the call_* helpers below stand in for real HTTP forwarding)
class StranglerProxy:
    def __init__(self):
        self.migrated_routes = {
            "/api/users": "http://user-service:8080",
            "/api/auth":  "http://auth-service:8080",
        }
        self.legacy_url = "http://monolith:3000"
    
    def route(self, request):
        for prefix, target in self.migrated_routes.items():
            if request.path.startswith(prefix):
                return forward_to(target, request)
        # Not yet migrated — send to monolith
        return forward_to(self.legacy_url, request)

# Step 2: Verify parity with shadow traffic
class ShadowTrafficVerifier:
    def compare(self, request):
        legacy_response = call_legacy(request)
        new_response = call_new_service(request)
        if legacy_response != new_response:
            log_discrepancy(request, legacy_response, new_response)
        return legacy_response  # Serve legacy until verified

# Step 3: Cutover — update routing, decommission legacy module
proxy = StranglerProxy()
print("Migrated routes:", list(proxy.migrated_routes.keys()))
print("Remaining legacy:", proxy.legacy_url)
Strangler Fig Migration

Legacy Modernisation Strategies

Strategy Approach Risk Best For
Strangler Fig Route-by-route migration behind a proxy Low — rollback is instant Monolith with clear API boundaries
Branch by Abstraction Introduce abstraction layer, swap implementation behind it Low–Medium Replacing internal libraries or data layers
Parallel Run Run old and new systems simultaneously, compare outputs Low (verification heavy) Financial or safety-critical systems
Big Bang Rewrite Replace entire system at once Very High Almost never recommended — high failure rate
Warning — The Second System Effect: Complete rewrites almost always take longer than estimated and often re-introduce the same problems. Prefer incremental modernisation strategies that deliver value continuously.

Feature Flags for Migration

Feature flags (feature toggles) let you control which code path runs — old or new — without deploying new code:

# Feature flags for safe migration rollout
import hashlib

class FeatureFlags:
    def __init__(self):
        self.flags = {
            "use_new_payment_service": {
                "enabled": True,
                "rollout_percentage": 25,  # 25% of traffic uses new service
                "allowed_users": ["internal-testers"],
            }
        }
    
    def is_enabled(self, flag_name, user_id=None):
        flag = self.flags.get(flag_name, {})
        if not flag.get("enabled"):
            return False
        if user_id in flag.get("allowed_users", []):
            return True  # Always enable for test users
        if user_id is None:
            return False
        # Percentage-based rollout — hashlib gives a stable bucket across
        # processes (Python's built-in hash() is randomized per process)
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        return bucket < flag.get("rollout_percentage", 0)

# Usage: gradually shift traffic from legacy to new service
flags = FeatureFlags()

def process_payment(order_id, user_id):
    if flags.is_enabled("use_new_payment_service", user_id):
        return new_payment_service.process(order_id)
    return legacy_monolith.process_payment(order_id)

print("25% rollout:", flags.is_enabled("use_new_payment_service", "user-123"))
print("Test user:", flags.is_enabled("use_new_payment_service", "internal-testers"))

Database Evolution — Expand-Contract Pattern

When migrating data ownership from a monolith to a service, the expand-contract pattern prevents downtime:


Expand-Contract Migration

  1. Expand: Add the new schema (new column, new table, or new service DB) alongside the old one. Both are written to simultaneously.
  2. Migrate: Backfill historical data from old to new schema. Verify data integrity.
  3. Contract: Remove writes to the old schema. Old consumers now read from the new schema. Drop the old column/table after a cooling-off period.
Zero-Downtime Schema Migration
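The three phases above can be sketched as a dual-writing repository. The stores, phase names, and `OrderRepository` class are illustrative — in practice the "stores" are the monolith's table and the new service's database:

```python
# Expand-contract sketch: during "expand" the service writes both schemas,
# "migrate" backfills and verifies, "contract" stops writing the old one.
# All names here are illustrative.
class OrderRepository:
    def __init__(self):
        self.old_store = {}    # monolith table (to be retired)
        self.new_store = {}    # new service-owned database
        self.phase = "expand"  # expand -> migrate -> contract

    def save(self, order_id, order):
        if self.phase in ("expand", "migrate"):
            self.old_store[order_id] = order  # keep legacy readers working
        self.new_store[order_id] = order      # always write the new schema

    def backfill(self):
        """Migrate phase: copy historical rows old -> new, then verify."""
        for order_id, order in self.old_store.items():
            self.new_store.setdefault(order_id, order)
        return self.old_store.keys() <= self.new_store.keys()

    def contract(self):
        """Contract phase: stop writing the old schema."""
        self.phase = "contract"

repo = OrderRepository()
repo.save("o1", {"total": 10})   # expand: dual-write
assert repo.backfill()            # migrate: backfill + verify
repo.contract()
repo.save("o2", {"total": 20})   # contract: new schema only
print("o2" in repo.old_store)    # False — legacy writes stopped
```

The cooling-off period matters: only drop the old column or table once nothing has read or written it for long enough that a rollback is inconceivable.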

Blue-Green & Canary Deployments for Migration

  • Blue-Green: Run two identical environments. Route all traffic to "blue" (legacy). Deploy the new version to "green." Switch the load balancer to "green" once validated. Rollback = switch back to "blue."
  • Canary: Deploy the new version to a small subset of servers (1–5%). Monitor error rates and latency. Gradually increase traffic (10% → 25% → 50% → 100%). Automated rollback if metrics breach thresholds.
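The canary progression described above can be sketched as a simple step controller. The step percentages, thresholds, and the `metrics_at_step` callback are all illustrative — real deployments would read these from Prometheus or the deployment platform:

```python
# Canary rollout sketch: step traffic up only while error rate and p99
# latency stay within thresholds; roll back on the first breach.
# All numbers are illustrative.
CANARY_STEPS = [1, 5, 10, 25, 50, 100]  # % of traffic to the new version

def advance_canary(metrics_at_step, max_error_rate=0.01, max_p99_ms=500):
    """Walk the steps; return final traffic % (0 means rolled back)."""
    current = 0
    for step in CANARY_STEPS:
        m = metrics_at_step(step)  # e.g. query metrics at this traffic %
        if m["error_rate"] > max_error_rate or m["p99_ms"] > max_p99_ms:
            print(f"Rollback at {step}% (errors={m['error_rate']:.1%})")
            return 0
        current = step
    return current  # 100 => canary promoted to full traffic

# Healthy canary climbs to 100%
healthy = lambda step: {"error_rate": 0.002, "p99_ms": 180}
print(advance_canary(healthy))  # 100

# Faulty canary rolls back at the first step
faulty = lambda step: {"error_rate": 0.08, "p99_ms": 180}
print(advance_canary(faulty))   # 0
```

Tools like Argo Rollouts and Flagger automate exactly this loop, including the metric queries and the rollback.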
Migration Success Metrics: Track error rate delta (new vs. old), p99 latency comparison, data consistency checks, and rollback count per migration phase. A migration is "complete" when the legacy path has had zero traffic for 30+ days and can be safely decommissioned.

Next Steps

Microservices Architecture Plan Generator

Design your microservices decomposition with bounded contexts, communication patterns, and deployment topology. Download as Word, Excel, PDF, or PowerPoint.
