
System Design Series Part 5: Microservices Architecture

January 25, 2026 · Wasil Zafar · 85 min read

Master microservices architecture patterns for building scalable, maintainable systems. Learn service decomposition, API gateways, service mesh, containerization, and Kubernetes orchestration.

Table of Contents

  1. Microservices Fundamentals
  2. Service Decomposition
  3. Architecture Patterns
  4. Service Collaboration
  5. Communication & Discovery
  6. Reliability & Resilience
  7. Cross-Cutting Concerns
  8. Testing Microservices
  9. Deployment Patterns
  10. Observability Patterns
  11. UI Composition
  12. Serverless Architecture
  13. System Evolution & Refactoring
  14. Next Steps

Microservices Fundamentals

Series Navigation: This is Part 5 of the 15-part System Design Series. Review Part 4: Database Design & Sharding first.

Microservices architecture structures an application as a collection of loosely coupled, independently deployable services. Each service is owned by a small team and focuses on a specific business capability.

Overview diagram of microservices architecture showing multiple independent services with separate databases communicating through APIs and message queues
Microservices architecture overview — independently deployable services with separate data stores, communicating through well-defined APIs and async messaging
Key Insight: Microservices aren't always the answer. Start with a monolith and extract services as complexity grows and team boundaries become clear.

Core Principles

  • Single Responsibility: Each service does one thing well
  • Loose Coupling: Services interact through well-defined interfaces
  • High Cohesion: Related functionality grouped together
  • Independent Deployment: Deploy services without affecting others
  • Decentralized Data: Each service owns its data store
  • Failure Isolation: One service failure doesn't crash the system

Monolith vs Microservices

Understanding when to use each architecture is crucial:

Side-by-side comparison of monolithic architecture as a single deployable unit versus microservices as independent services with separate deployments and databases
Monolith vs microservices — a single deployable unit with shared database compared to independently deployable services with dedicated data stores
| Aspect | Monolith | Microservices |
| --- | --- | --- |
| Deployment | Deploy entire application | Deploy services independently |
| Scaling | Scale entire application | Scale individual services |
| Technology | Single tech stack | Polyglot (different stacks per service) |
| Team Structure | Large, cross-functional | Small, service-owning teams |
| Complexity | In-process calls | Network calls, distributed systems |
| Data | Shared database | Database per service |
| Testing | End-to-end easier | Integration testing complex |

When to Choose Monolith

  • Small team (< 10 developers)
  • Early-stage startup exploring product-market fit
  • Simple domain with clear boundaries
  • Need rapid development without distributed complexity
  • Limited DevOps expertise
# Monolith structure example
my_app/
├── app/
│   ├── models/
│   │   ├── user.py
│   │   ├── order.py
│   │   └── product.py
│   ├── services/
│   │   ├── user_service.py
│   │   ├── order_service.py
│   │   └── product_service.py
│   ├── api/
│   │   └── routes.py
│   └── database.py
├── tests/
└── requirements.txt

When to Choose Microservices

  • Large organization with multiple teams
  • Different parts of system have different scaling needs
  • Teams need to deploy independently
  • Complex domain with clear bounded contexts
  • Strong DevOps/infrastructure capabilities
# Microservices structure example
platform/
├── services/
│   ├── user-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── k8s/
│   ├── order-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── k8s/
│   ├── payment-service/
│   │   ├── src/
│   │   ├── Dockerfile
│   │   └── k8s/
│   └── notification-service/
│       ├── src/
│       ├── Dockerfile
│       └── k8s/
├── api-gateway/
├── infrastructure/
└── docker-compose.yml
The Monolith First Approach: Many successful companies (Shopify, Etsy) started as monoliths and extracted services as they scaled. Don't prematurely optimize for microservices—the complexity isn't free.

Service Decomposition

Breaking a monolith into microservices requires thoughtful decomposition strategies:

Diagram showing DDD bounded contexts mapping to microservices, with an e-commerce domain split into user, catalog, order, inventory, payment, and shipping contexts
Service decomposition using bounded contexts — each DDD context maps to an independent microservice with clear ownership boundaries

Domain-Driven Design (DDD)

DDD provides a toolkit for decomposing complex domains into services. It works at two levels: strategic design (how domains relate) and tactical design (patterns within a domain).

Ubiquitous Language

Each bounded context develops its own ubiquitous language — a shared vocabulary between developers and domain experts. The same real-world concept may mean different things in different contexts:

| Term | Order Context | Shipping Context | Payment Context |
| --- | --- | --- | --- |
| Customer | Buyer placing an order | Recipient at a delivery address | Billing entity with payment methods |
| Product | Line item with quantity and price | Package with weight and dimensions | Not relevant |
| Address | Not stored (delegated) | Delivery destination with GPS coords | Billing address for fraud checks |
Key Insight: If the same term means different things in different parts of the codebase, you have found a context boundary. Each context should use its own model — never force a single "Customer" class to serve all contexts.
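The table above can be made concrete as separate per-context models. A minimal sketch (all class and field names here are ours, purely illustrative, not a prescribed API):

```python
# Context-specific "customer" models — hypothetical sketch.
# Each bounded context defines only the attributes it needs; there is
# no shared "god" Customer class spanning contexts.
from dataclasses import dataclass

# Order Context: the buyer placing an order
@dataclass
class Buyer:
    user_id: str
    display_name: str

# Shipping Context: the recipient at a delivery address
@dataclass
class Recipient:
    name: str
    street: str
    city: str
    gps: tuple  # (lat, lon) for delivery routing

# Payment Context: the billing entity with payment methods
@dataclass
class BillingEntity:
    customer_id: str
    billing_address: str
    payment_methods: list

# The same real-world person appears as three different models,
# each owned and evolved independently by its context.
buyer = Buyer(user_id="u-42", display_name="Alice")
recipient = Recipient(name="Alice", street="1 Main St",
                      city="Springfield", gps=(39.8, -89.6))
print(type(buyer).__name__, type(recipient).__name__)
```

Each context can now add, rename, or drop fields without coordinating with the others.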

Bounded Contexts & Context Mapping

Use bounded contexts from DDD to identify service boundaries. Each context encapsulates a coherent domain model:

E-Commerce Bounded Contexts

# Bounded Contexts for E-Commerce Platform

# User Context - User identity and preferences
class UserService:
    """Handles user registration, authentication, profiles"""
    def register_user(self, email, password): pass
    def authenticate(self, email, password): pass
    def get_profile(self, user_id): pass

# Catalog Context - Product information
class CatalogService:
    """Manages product listings, categories, search"""
    def get_product(self, product_id): pass
    def search_products(self, query): pass
    def get_category(self, category_id): pass

# Order Context - Order lifecycle
class OrderService:
    """Handles order creation, status, history"""
    def create_order(self, user_id, items): pass
    def get_order_status(self, order_id): pass
    def cancel_order(self, order_id): pass

# Inventory Context - Stock management
class InventoryService:
    """Manages stock levels, reservations"""
    def check_availability(self, product_id, quantity): pass
    def reserve_stock(self, product_id, quantity): pass
    def release_stock(self, product_id, quantity): pass

# Payment Context - Financial transactions
class PaymentService:
    """Processes payments, refunds"""
    def process_payment(self, order_id, amount): pass
    def refund_payment(self, payment_id): pass

# Shipping Context - Fulfillment
class ShippingService:
    """Handles shipping labels, tracking"""
    def create_shipment(self, order_id): pass
    def get_tracking(self, shipment_id): pass

Contexts interact through well-defined context mapping patterns:

| Mapping Pattern | Description | When to Use |
| --- | --- | --- |
| Shared Kernel | Two contexts share a small, jointly owned model | Closely collaborating teams with overlapping domain |
| Customer–Supplier | Upstream context provides what the downstream needs | Clear producer/consumer relationship between teams |
| Anti-Corruption Layer (ACL) | Translation layer that protects your model from external changes | Integrating with legacy systems or third-party APIs |
| Open Host Service | Published API with a well-defined protocol | Context serves many consumers with a stable contract |
| Conformist | Downstream adopts the upstream's model as-is | No leverage to influence the upstream team |
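The Anti-Corruption Layer is the pattern most teams reach for first. A minimal sketch, assuming a hypothetical legacy CRM with an awkward schema (all names here are illustrative):

```python
# Anti-Corruption Layer — hypothetical sketch.
# The ACL is the only place that knows the legacy schema, so legacy
# field names never leak into our bounded context's domain code.

class LegacyCrmClient:
    """Stand-in for a legacy system with cryptic column names."""
    def fetch(self, cust_no):
        return {"CUST_NO": cust_no, "NM": "Alice Smith",
                "EMAIL_ADDR": "alice@example.com", "STAT_CD": "A"}

class Customer:
    """Our context's clean model."""
    def __init__(self, customer_id, name, email, active):
        self.customer_id = customer_id
        self.name = name
        self.email = email
        self.active = active

class CustomerAcl:
    """Translates the legacy representation into our model."""
    def __init__(self, legacy_client):
        self.legacy = legacy_client

    def get_customer(self, customer_id):
        raw = self.legacy.fetch(customer_id)
        return Customer(
            customer_id=raw["CUST_NO"],
            name=raw["NM"],
            email=raw["EMAIL_ADDR"],
            active=(raw["STAT_CD"] == "A"),  # decode legacy status codes
        )

acl = CustomerAcl(LegacyCrmClient())
customer = acl.get_customer("C-100")
print(customer.name, customer.active)  # Alice Smith True
```

If the legacy schema changes, only the ACL translation changes; the domain model stays untouched.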

Tactical DDD — Building Blocks

Within a bounded context, DDD provides tactical patterns for modelling the domain:


Aggregates, Entities & Value Objects

# Tactical DDD Building Blocks

# Entity: has identity, mutable state
class Order:
    def __init__(self, order_id, customer_id):
        self.order_id = order_id         # Identity
        self.customer_id = customer_id
        self.items = []
        self.status = "pending"
    
    def add_item(self, product_id, quantity, price):
        self.items.append(OrderItem(product_id, quantity, price))
    
    def confirm(self):
        if not self.items:
            raise ValueError("Cannot confirm empty order")
        self.status = "confirmed"

# Value Object: no identity, immutable, compared by attributes
class Money:
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency
    
    def __eq__(self, other):
        return self.amount == other.amount and self.currency == other.currency
    
    def add(self, other):
        assert self.currency == other.currency, "Currency mismatch"
        return Money(self.amount + other.amount, self.currency)

# Aggregate Root: Order is the root, OrderItem is only accessed through it
class OrderItem:
    """Never accessed directly — always through Order aggregate"""
    def __init__(self, product_id, quantity, price):
        self.product_id = product_id
        self.quantity = quantity
        self.price = price  # Money value object

# Domain Event: something that happened in the domain
class OrderConfirmed:
    def __init__(self, order_id, customer_id, total):
        self.order_id = order_id
        self.customer_id = customer_id
        self.total = total
        self.occurred_at = "2025-06-01T10:30:00Z"

print("Aggregate: Order (root) -> OrderItem (child)")
print("Value Object: Money (immutable, no identity)")
print("Domain Event: OrderConfirmed (published to other contexts)")

Strategic vs. Tactical DDD

| Aspect | Strategic DDD | Tactical DDD |
| --- | --- | --- |
| Scope | System-wide, cross-team | Within a single bounded context |
| Key Concepts | Bounded contexts, context maps, ubiquitous language | Aggregates, entities, value objects, domain events |
| Goal | Find the right service boundaries | Model rich domain logic within a service |
| Who Leads | Architects + domain experts | Developers + domain experts |
| Impact of Getting It Wrong | Distributed monolith, team coupling | Anemic domain model, logic leaking into the services layer |

Decomposition Strategies

| Strategy | Description | When to Use |
| --- | --- | --- |
| By Business Capability | Align services with business functions | Clear business domains exist |
| By Subdomain | Core, supporting, and generic subdomains | Complex domain with varying importance |
| Strangler Fig | Gradually replace monolith pieces | Migrating an existing monolith |
| By Team | Conway's Law: match the org structure | Clear team boundaries |
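The Strangler Fig strategy is usually implemented as a routing facade in front of the monolith. A minimal sketch (service names and URLs are hypothetical):

```python
# Strangler Fig routing — hypothetical sketch.
# A facade routes extracted path prefixes to new services; everything
# else still hits the monolith. Migration = moving prefixes over time.

class StranglerFacade:
    def __init__(self, monolith_url):
        self.monolith_url = monolith_url
        self.extracted = {}  # path prefix -> new service base URL

    def extract(self, prefix, service_url):
        """Mark a path prefix as migrated to a new service."""
        self.extracted[prefix] = service_url

    def route(self, path):
        for prefix, service_url in self.extracted.items():
            if path.startswith(prefix):
                return f"{service_url}{path}"
        return f"{self.monolith_url}{path}"  # default: legacy monolith

facade = StranglerFacade("http://monolith:8000")
facade.extract("/orders", "http://order-service:8080")

print(facade.route("/orders/123"))  # handled by the new order service
print(facade.route("/reports/q1"))  # still handled by the monolith
```

Once all traffic for a capability flows to the new service, the corresponding monolith code can be deleted.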

Service Boundaries

Good service boundaries minimize inter-service communication:

Good Boundaries

# Good: Order service owns all order data
class OrderService:
    def create_order(self, user_id, items):
        # All order logic contained within service
        order = Order(user_id=user_id)
        for item in items:
            order.add_item(item)
        order.calculate_total()
        self.db.save(order)
        
        # Emit event for other services (async)
        self.event_bus.publish("order.created", order)
        return order
    
    def get_order_details(self, order_id):
        # No external calls needed
        return self.db.get(order_id)

Bad Boundaries (Distributed Monolith)

# Bad: Order service makes synchronous calls for every operation
class OrderService:
    def create_order(self, user_id, items):
        # Synchronous calls create tight coupling
        user = self.user_service.get_user(user_id)  # Network call
        
        for item in items:
            product = self.catalog_service.get_product(item.id)  # Network call
            price = self.pricing_service.get_price(item.id)  # Network call
            stock = self.inventory_service.check(item.id)  # Network call
        
        # Any service failure = order failure
        # Can't deploy independently
        # Latency compounds with each call

Signs of Bad Boundaries

  • Services need to call each other synchronously for simple operations
  • Circular dependencies between services
  • Shared database between services
  • Must deploy multiple services together
  • Same data modified by multiple services

Architecture Patterns

API Gateway

Single entry point that routes requests to appropriate services.

# API Gateway responsibilities
class APIGateway:
    def __init__(self):
        self.services = {
            "/users": "user-service:8080",
            "/orders": "order-service:8080",
            "/products": "catalog-service:8080"
        }
    
    def route_request(self, request):
        # 1. Authentication
        user = self.authenticate(request.headers.get("Authorization"))
        
        # 2. Rate limiting
        if self.rate_limiter.is_limited(user.id):
            return Response(status=429)
        
        # 3. Route to service
        service = self.get_service(request.path)
        
        # 4. Protocol translation (REST -> gRPC)
        response = service.forward(request)
        
        # 5. Response aggregation (if needed)
        return response
    
    def aggregate_product_page(self, product_id):
        """Combine data from multiple services"""
        product = self.catalog_service.get(product_id)
        reviews = self.review_service.get_for_product(product_id)
        inventory = self.inventory_service.check(product_id)
        
        return {
            **product,
            "reviews": reviews,
            "in_stock": inventory.available > 0
        }

Backend for Frontend (BFF)

Separate backend for each client type (web, mobile, IoT).

# BFF Pattern
# Each client gets optimized API

# Mobile BFF - Minimal data, pagination
class MobileBFF:
    def get_product_list(self, page=1, limit=20):
        products = self.catalog.get_products(page, limit)
        return [{
            "id": p.id,
            "name": p.name,
            "price": p.price,
            "thumbnail": p.images[0].url  # Only first image
        } for p in products]

# Web BFF - Rich data, full details
class WebBFF:
    def get_product_list(self):
        products = self.catalog.get_products()
        for product in products:
            product["reviews_summary"] = self.reviews.get_summary(product.id)
            product["availability"] = self.inventory.check(product.id)
            product["related"] = self.recommendations.get_related(product.id)
        return products

Database per Service

Each service owns its data store—no shared databases.

# Each service manages its own database
# Order Service - PostgreSQL for transactions
order_db = PostgreSQL("order-db")

# Catalog Service - MongoDB for flexible product data
catalog_db = MongoDB("catalog-db")

# Search Service - Elasticsearch for full-text search
search_db = Elasticsearch("search-cluster")

# Session Service - Redis for fast access
session_db = Redis("session-cache")

# Analytics Service - ClickHouse for time-series
analytics_db = ClickHouse("analytics-cluster")

Challenge: Cross-service queries require careful design (event sourcing, CQRS, or API composition).


Saga Pattern

Manage distributed transactions across services using compensating transactions.

# Saga Pattern for Order Creation
class OrderSaga:
    def execute(self, order_data):
        saga_log = []
        
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order_data.items)
            saga_log.append(("inventory", reservation.id))
            
            # Step 2: Process payment
            payment = self.payment.charge(order_data.user_id, order_data.total)
            saga_log.append(("payment", payment.id))
            
            # Step 3: Create order
            order = self.orders.create(order_data)
            saga_log.append(("order", order.id))
            
            # Step 4: Notify user
            self.notifications.send(order_data.user_id, "Order confirmed!")
            
            return order
            
        except Exception as e:
            # Compensating transactions (rollback)
            self.compensate(saga_log)
            raise e
    
    def compensate(self, saga_log):
        """Undo completed steps in reverse order"""
        for service, resource_id in reversed(saga_log):
            if service == "order":
                self.orders.cancel(resource_id)
            elif service == "payment":
                self.payment.refund(resource_id)
            elif service == "inventory":
                self.inventory.release(resource_id)
Saga Pattern — Choreography
sequenceDiagram
    participant OS as Order Service
    participant PS as Payment Service
    participant IS as Inventory Service
    participant SS as Shipping Service
    
    Note over OS,SS: Happy Path
    OS->>PS: OrderCreated Event
    PS->>IS: PaymentCompleted Event
    IS->>SS: InventoryReserved Event
    SS-->>OS: ShipmentScheduled Event
    
    Note over OS,SS: Compensation (Failure at Inventory)
    OS->>PS: OrderCreated Event
    PS->>IS: PaymentCompleted Event
    IS--xPS: InventoryFailed Event
    PS-->>OS: PaymentRefunded Event
    OS-->>OS: OrderCancelled
                        
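The choreography variant in the sequence diagram above has no central orchestrator: each service subscribes to events and emits its own. A minimal sketch using a toy in-memory event bus (all event and service names are illustrative):

```python
# Choreography-based saga — hypothetical sketch with a toy event bus.

class EventBus:
    def __init__(self):
        self.handlers = {}
        self.log = []  # record of every published event, for inspection

    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, data):
        self.log.append(event_type)
        for handler in self.handlers.get(event_type, []):
            handler(data)

bus = EventBus()

# Payment Service reacts to OrderCreated
def on_order_created(order):
    bus.publish("PaymentCompleted", order)
bus.subscribe("OrderCreated", on_order_created)

# Inventory Service reacts to PaymentCompleted
def on_payment_completed(order):
    if order["in_stock"]:
        bus.publish("InventoryReserved", order)
    else:
        bus.publish("InventoryFailed", order)  # triggers compensation
bus.subscribe("PaymentCompleted", on_payment_completed)

# Payment Service compensates when inventory fails
bus.subscribe("InventoryFailed",
              lambda order: bus.publish("PaymentRefunded", order))

# Happy path, then a failure path with compensation
bus.publish("OrderCreated", {"order_id": "o-1", "in_stock": True})
bus.publish("OrderCreated", {"order_id": "o-2", "in_stock": False})
print(bus.log)
```

Choreography keeps services decoupled, but the overall workflow is implicit in the event subscriptions, which can make it harder to trace than an explicit orchestrator.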

Circuit Breaker

Prevent cascading failures by stopping calls to failing services.

# Circuit Breaker Implementation
from enum import Enum
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject calls
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED
    
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e
    
    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED
    
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
    
    def _should_attempt_reset(self):
        return (time.time() - self.last_failure_time) > self.recovery_timeout

# Usage
payment_circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
result = payment_circuit.call(payment_service.charge, user_id, amount)

Service Mesh

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It offloads common concerns from application code.

Service mesh architecture showing sidecar proxies attached to each microservice, with a control plane managing traffic routing, mTLS encryption, and observability
Service mesh architecture — sidecar proxies handle cross-cutting concerns like traffic management, mTLS encryption, and distributed tracing without modifying application code

Service Mesh Capabilities

  • Traffic Management: Load balancing, routing, retries
  • Security: mTLS encryption, authentication
  • Observability: Metrics, tracing, logging
  • Resilience: Circuit breaking, timeouts, rate limiting

Istio Service Mesh Example

# Istio VirtualService for traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: v2
          weight: 100
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10  # 10% canary traffic

---
# Istio DestinationRule for circuit breaking
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s

Popular Service Meshes

| Service Mesh | Architecture | Best For |
| --- | --- | --- |
| Istio | Envoy sidecar | Full-featured, large deployments |
| Linkerd | Lightweight proxy | Simplicity, performance |
| Consul Connect | HashiCorp ecosystem | Multi-cloud, service discovery |
| AWS App Mesh | Envoy, AWS integration | AWS-native workloads |

Service Collaboration Patterns

When operations span multiple services, you need patterns for maintaining data consistency and composing data across service boundaries — without falling back to a shared database.

Data Ownership Strategies

| Pattern | Description | Pros | Cons |
| --- | --- | --- | --- |
| Database per Service | Each service owns a private database; no other service accesses it directly | Loose coupling, independent scaling, polyglot persistence | Cross-service queries harder, eventual consistency |
| Shared Database (anti-pattern) | Multiple services read/write the same database | Simple queries, strong consistency | Tight coupling, schema changes break multiple services, scaling bottleneck |
| Command-side Replica | A service maintains a read-only copy of another service's data, updated via events | Fast local reads, reduced inter-service calls | Data staleness, storage duplication |
Anti-pattern — Shared Database: When multiple services share a database, any schema change requires coordinating all consumer services. This creates a distributed monolith — you get the complexity of microservices without the benefits of independent deployment.

API Composition

Implement cross-service queries by invoking the services that own the data and performing an in-memory join:

API Composition Pattern

# API Composition — aggregate data from multiple services
import asyncio

class OrderDetailsComposer:
    """
    Composes a rich order view by querying multiple services
    and joining results in memory.
    """
    def __init__(self, order_svc, user_svc, product_svc):
        self.order_svc = order_svc
        self.user_svc = user_svc
        self.product_svc = product_svc
    
    async def get_order_details(self, order_id):
        # Fetch order first (owns the core data)
        order = await self.order_svc.get(order_id)
        
        # Parallel calls to other services
        user, products = await asyncio.gather(
            self.user_svc.get(order["user_id"]),
            self.product_svc.get_batch(
                [item["product_id"] for item in order["items"]]
            )
        )
        
        # In-memory join
        product_map = {p["id"]: p for p in products}
        return {
            "order_id": order["id"],
            "status": order["status"],
            "customer": {
                "name": user["name"],
                "email": user["email"]
            },
            "items": [
                {
                    **item,
                    "product_name": product_map[item["product_id"]]["name"],
                    "image_url": product_map[item["product_id"]]["image"]
                }
                for item in order["items"]
            ],
            "total": order["total"]
        }

# Typically implemented in the API Gateway or a BFF
print("API Composition: query N services, join in memory")
print("Best for: read-heavy queries spanning 2-4 services")

CQRS & Event Sourcing

For high-throughput systems with asymmetric read/write loads, CQRS (Command Query Responsibility Segregation) separates the write model from the read model. Combined with Event Sourcing, every state change is stored as an immutable event — enabling full audit trails and temporal queries.

| Pattern | How It Works | When to Use |
| --- | --- | --- |
| CQRS | Write commands go to a normalised write store; reads are served from denormalised, pre-computed (materialized) views | Read/write ratio > 10:1, complex queries across aggregates |
| Event Sourcing | Persist every state change as an immutable event; rebuild current state by replaying the event log | Audit-critical domains (finance, healthcare), complex event-driven workflows |
| Domain Events | Publish an event whenever data changes; other services subscribe and react asynchronously | Decoupled service communication, event-driven choreography |
Deep Dive: Full implementations of Event Sourcing and CQRS with code examples are covered in Part 7: Message Queues & Event-Driven Architecture.

Transactional Messaging

A critical challenge in microservices: how do you atomically update a database and publish an event? If you update the DB but the event publish fails (or vice versa), you get inconsistency.

Transactional Outbox Pattern

Write events to an outbox table in the same database transaction as the business data. A separate process reads the outbox and publishes events to the message broker.

# Transactional Outbox Pattern
import json
import uuid

class OrderService:
    def create_order(self, user_id, items, total):
        # Single database transaction
        with self.db.transaction() as tx:
            # 1. Write business data
            order_id = str(uuid.uuid4())
            tx.execute(
                "INSERT INTO orders (id, user_id, total, status) "
                "VALUES (%s, %s, %s, 'created')",
                (order_id, user_id, total)
            )
            for item in items:
                tx.execute(
                    "INSERT INTO order_items (order_id, product_id, qty, price) "
                    "VALUES (%s, %s, %s, %s)",
                    (order_id, item["product_id"], item["qty"], item["price"])
                )
            
            # 2. Write event to outbox (SAME transaction)
            tx.execute(
                "INSERT INTO outbox (id, aggregate_type, aggregate_id, "
                "event_type, payload, created_at) "
                "VALUES (%s, 'Order', %s, 'OrderCreated', %s, NOW())",
                (str(uuid.uuid4()), order_id, json.dumps({
                    "order_id": order_id,
                    "user_id": user_id,
                    "total": total,
                    "items": items
                }))
            )
            # Both writes commit or both roll back — atomicity guaranteed

# Separate process: OutboxRelay reads outbox, publishes to Kafka/RabbitMQ
class OutboxRelay:
    """Polls the outbox table and publishes events to the broker."""
    def relay(self):
        rows = self.db.query(
            "SELECT * FROM outbox WHERE published = FALSE "
            "ORDER BY created_at LIMIT 100"
        )
        for row in rows:
            self.broker.publish(row["event_type"], row["payload"])
            self.db.execute(
                "UPDATE outbox SET published = TRUE WHERE id = %s",
                (row["id"],)
            )

print("Outbox guarantees: DB write + event publish are atomic")
print("Relay methods: polling, CDC (Change Data Capture), log tailing")

Outbox Relay Strategies

| Strategy | How It Works | Latency | Complexity |
| --- | --- | --- | --- |
| Polling Publisher | Periodically query the outbox for unpublished events | Medium (polling interval) | Low: simple SQL queries |
| Transaction Log Tailing | Read the database's transaction log (WAL/binlog) and extract outbox inserts | Low (near real-time) | High: requires CDC tools (Debezium, Maxwell) |

Communication & Discovery

Microservices need to communicate with each other and with external clients. The choice of communication style affects coupling, latency, and resilience.

Communication Styles

| Style | Mechanism | Coupling | When to Use |
| --- | --- | --- | --- |
| Remote Procedure Invocation (RPI) | REST, gRPC, GraphQL; synchronous request/response | Higher (temporal coupling) | Queries, real-time interactions, simple CRUD |
| Asynchronous Messaging | Kafka, RabbitMQ, SQS; publish/subscribe or point-to-point | Lower (no temporal coupling) | Event-driven workflows, fire-and-forget, long-running operations |
| Domain-Specific Protocol | SMTP for email, MQTT for IoT, FIX for trading | Varies | Integrating with systems that use specialised protocols |
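The temporal-coupling difference between the two main styles can be shown in a few lines. A toy sketch (the service and queue classes are stand-ins, not a real broker API):

```python
# Sync (RPI) vs async messaging — hypothetical sketch of temporal coupling.

class NotificationService:
    def __init__(self):
        self.available = True

    def send(self, message):
        if not self.available:
            raise ConnectionError("notification-service is down")
        return "sent"

class Queue:
    """Toy message queue: accepting a publish does not require the
    consumer to be up right now."""
    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)  # succeeds even if consumer is down

notifications = NotificationService()
notifications.available = False  # simulate an outage
queue = Queue()

# RPI: the caller fails when the downstream is down (temporal coupling)
try:
    notifications.send("order o-1 confirmed")
except ConnectionError as e:
    print(f"sync call failed: {e}")

# Messaging: publish succeeds; the consumer processes after it recovers
queue.publish("order o-1 confirmed")
print(f"queued: {len(queue.messages)} message(s)")
```

With messaging, the producer's success depends only on the broker being up, not on every downstream consumer.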

Idempotent Consumer

In distributed systems, messages can be delivered more than once (at-least-once delivery). An idempotent consumer ensures that processing the same message multiple times produces the same result.

# Idempotent Consumer Pattern
class PaymentEventHandler:
    def __init__(self, db, payment_service):
        self.db = db
        self.payment_service = payment_service
    
    def handle_order_created(self, event):
        message_id = event["message_id"]
        
        # Check if we've already processed this message
        if self.db.exists("processed_messages", message_id):
            print(f"Duplicate message {message_id} — skipping")
            return  # Idempotent: no side effects
        
        # Process the message
        self.payment_service.charge(
            user_id=event["user_id"],
            amount=event["total"],
            order_id=event["order_id"]
        )
        
        # Record that we processed this message
        self.db.insert("processed_messages", {
            "message_id": message_id,
            "processed_at": "2026-03-26T10:00:00Z",
            "event_type": "OrderCreated"
        })

print("Idempotency key: message_id (UUID assigned by producer)")
print("Storage: processed_messages table with TTL cleanup")

Self-Contained Services

A self-contained service can handle synchronous requests without waiting for other services to respond. It achieves this by maintaining local replicas of the data it needs:

  • Subscribe to events from upstream services to keep a local read-only copy of the data
  • No synchronous calls during request handling — all data needed is available locally
  • Trade-off: Data may be slightly stale (eventual consistency) but the service never blocks on another service's availability
Self-Contained vs Distributed Monolith: A self-contained service fails gracefully when dependencies are down. A distributed monolith cascades failures through synchronous call chains. Prefer self-contained services for all read paths.
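The bullet points above can be sketched as a service that serves reads entirely from a local, event-maintained replica (class and event names are ours, purely illustrative):

```python
# Self-contained service — hypothetical sketch.
# The Order service keeps a local replica of product names, updated by
# Catalog events, so reads never make a synchronous call to Catalog.

class OrderService:
    def __init__(self):
        self.orders = {}
        self.product_names = {}  # local read-only replica of Catalog data

    # Event handler: keeps the replica fresh (eventually consistent)
    def on_product_updated(self, event):
        self.product_names[event["product_id"]] = event["name"]

    def create_order(self, order_id, product_id):
        self.orders[order_id] = {"product_id": product_id}

    def get_order(self, order_id):
        order = self.orders[order_id]
        # Served entirely from local data — this read path keeps working
        # even during a catalog-service outage.
        return {
            "order_id": order_id,
            "product_name": self.product_names.get(
                order["product_id"], "unknown"),
        }

svc = OrderService()
svc.on_product_updated({"product_id": "p-1", "name": "Mechanical Keyboard"})
svc.create_order("o-1", "p-1")
print(svc.get_order("o-1"))
```

The trade-off is visible in the `get` fallback: if the replica is stale or missing an entry, the service degrades gracefully rather than blocking on another service.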

Service Discovery

In a dynamic environment where service instances scale up/down and IP addresses change, service discovery lets services find each other:

| Pattern | How It Works | Example |
| --- | --- | --- |
| Client-Side Discovery | Client queries a service registry and uses a load-balancing algorithm to select an instance | Netflix Eureka + Ribbon |
| Server-Side Discovery | Client calls a router/load balancer, which queries the registry and forwards the request | AWS ALB, Kubernetes Service |
| Service Registry | Central database of service instance locations (host + port), health status, and metadata | Consul, Eureka, etcd, ZooKeeper |
| Self-Registration | A service instance registers itself with the registry on startup and deregisters on shutdown | Spring Cloud Eureka client |
| 3rd-Party Registration | A separate registrar (e.g., a container orchestrator) registers/deregisters instances automatically | Kubernetes (kubelet), AWS ECS |

Service Discovery in Practice

# Service Discovery — Client-Side (e.g., with Consul)
import random

class ServiceRegistry:
    """In production, this would be Consul, Eureka, or etcd."""
    def __init__(self):
        self.services = {}
    
    def register(self, service_name, instance_id, host, port):
        self.services.setdefault(service_name, []).append({
            "instance_id": instance_id,
            "host": host,
            "port": port,
            "healthy": True
        })
    
    def deregister(self, service_name, instance_id):
        self.services[service_name] = [
            i for i in self.services[service_name]
            if i["instance_id"] != instance_id
        ]
    
    def get_instances(self, service_name):
        instances = self.services.get(service_name, [])
        return [i for i in instances if i["healthy"]]

class ClientSideLoadBalancer:
    """Client picks an instance from the registry."""
    def __init__(self, registry):
        self.registry = registry
    
    def call(self, service_name, request):
        instances = self.registry.get_instances(service_name)
        if not instances:
            raise Exception(f"No healthy instances for {service_name}")
        
        # Round-robin, random, or weighted selection
        instance = random.choice(instances)
        url = f"http://{instance['host']}:{instance['port']}"
        print(f"Routing to {service_name} at {url}")
        return url

# Kubernetes replaces all of this with built-in DNS — a Service name
# resolves via cluster DNS: order-service.default.svc.cluster.local
registry = ServiceRegistry()
registry.register("order-service", "order-1", "10.0.1.5", 8080)
registry.register("order-service", "order-2", "10.0.1.6", 8080)
registry.register("order-service", "order-3", "10.0.1.7", 8080)

lb = ClientSideLoadBalancer(registry)
lb.call("order-service", "/api/orders/123")
Consul Eureka Kubernetes DNS

Reliability & Resilience

In a distributed system, failures aren't exceptional — they're expected. Resilience patterns prevent a single service failure from cascading across the entire system.

Pattern Purpose How It Works
Circuit Breaker Stop calling a failing service After N consecutive failures, "open" the circuit and return a fallback. Periodically retry ("half-open") to check recovery.
Bulkhead Isolate failures to one component Limit the number of concurrent calls to each dependency using thread pools or semaphores — a slow dependency can't exhaust all threads.
Retry with Backoff Handle transient failures Retry failed calls with exponential backoff + jitter (e.g., 100ms → 200ms → 400ms + random). Cap at max retries (typically 3).
Timeout Bound latency Set explicit timeouts on all outbound calls (e.g., 3s for sync, 30s for async). Never use "infinite" timeouts.
Fallback Degrade gracefully When a dependency fails, return cached data, default values, or a reduced-functionality response instead of an error.
Key Insight: Combine patterns in layers: Timeout → Retry → Circuit Breaker → Bulkhead → Fallback. Libraries like Resilience4j (Java) and Polly (.NET) implement all five patterns with declarative configuration. Service meshes like Istio handle these at the infrastructure level.
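The circuit breaker row in the table above can be sketched directly. This is a minimal single-threaded illustration, not a Resilience4j port — the thresholds and timeout values are arbitrary:

```python
import time

# Minimal circuit breaker sketch: after N consecutive failures the circuit
# "opens" and calls short-circuit to the fallback; after a recovery timeout
# one trial request is allowed ("half-open"). Thresholds are illustrative.
class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one trial request
            else:
                return fallback()         # short-circuit: don't call func
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"
        return result

cb = CircuitBreaker(failure_threshold=2, recovery_timeout=60.0)
def flaky():
    raise RuntimeError("downstream timeout")

print(cb.call(flaky, lambda: "cached response"))  # cached response
print(cb.call(flaky, lambda: "cached response"))  # cached response
print(cb.state)                                   # OPEN
```

A production implementation would also need thread safety, per-endpoint circuits, and sliding-window failure rates rather than a simple consecutive-failure counter.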

Cross-Cutting Concerns

Every microservice needs logging, health checks, configuration, security, and metrics. Without patterns for reuse, each team reinvents the wheel — leading to inconsistency and drift.

Pattern Description Example
Microservice Chassis A framework or library that handles cross-cutting concerns (logging, health checks, config, metrics, tracing) so service developers focus on business logic Spring Boot + Spring Cloud, Go Kit, Dapr
Externalized Configuration Store configuration (DB URLs, API keys, feature flags) outside the service binary — injected at runtime via environment variables or config servers Kubernetes ConfigMaps/Secrets, Spring Cloud Config, AWS Parameter Store, HashiCorp Vault
Service Template A project template implementing standard cross-cutting concerns. Developers copy and customise it to quickly create new services with consistent structure Cookiecutter templates, GitHub template repos, Yeoman generators
# Kubernetes ConfigMap — Externalized Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
data:
  DATABASE_URL: "postgresql://db:5432/orders"
  CACHE_TTL: "300"
  LOG_LEVEL: "info"
  FEATURE_NEW_CHECKOUT: "true"
---
# Inject into the pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: order-service
          envFrom:
            - configMapRef:
                name: order-service-config
            - secretRef:
                name: order-service-secrets  # Sensitive values

Testing Microservices

Testing distributed systems is fundamentally different from monolith testing. The testing pyramid for microservices adds contract and component tests between unit and end-to-end tests:

Test Type What It Verifies Scope Speed
Unit Tests Individual functions and classes within a service Single class/function Milliseconds
Service Component Test A service in isolation — test its endpoints with test doubles (mocks/stubs) for all external dependencies Entire service (isolated) Seconds
Consumer-Driven Contract Test A test suite written by the consumer team that defines the API contract it expects from a provider service. The provider runs these tests in its CI. Service API boundary Seconds
Consumer-Side Contract Test A test suite for a service client (API wrapper / SDK) verifying it can communicate correctly with the provider Client library Seconds
Integration Tests Interaction between 2-3 real services (no mocks) Service pair/group Minutes
End-to-End Tests Full user journey across all services Entire system Minutes–Hours
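The service component test row above can be sketched with a test double. The `create_order` function and `user_client` interface here are hypothetical stand-ins for a real service's logic and its HTTP client to user-service:

```python
# Service component test sketch: exercise the service's logic in isolation,
# replacing the real user-service HTTP client with a stub.
from unittest.mock import Mock

def create_order(user_client, user_id, items):
    """Business logic under test — depends on an injected user client."""
    user = user_client.get_user(user_id)
    if user is None:
        raise ValueError("unknown user")
    return {"user_id": user["id"], "items": items, "status": "created"}

def test_create_order_with_stubbed_user_service():
    # Test double replaces the network call to user-service
    user_client = Mock()
    user_client.get_user.return_value = {"id": "123", "name": "Jane"}

    order = create_order(user_client, "123", ["book"])

    assert order["status"] == "created"
    user_client.get_user.assert_called_once_with("123")

test_create_order_with_stubbed_user_service()
print("component test passed")
```

Because every external dependency is doubled, the test runs in milliseconds and fails only when this service's behaviour changes — not when a neighbour is flaky.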

Consumer-Driven Contract Test (Pact)

# Consumer-Driven Contract Test with Pact (pact-python)
# Written by the ORDER SERVICE team (consumer)
# Tests what it expects from the USER SERVICE (provider)
from pact import Consumer, Provider, Like

# Step 1: Consumer defines expected interactions
def test_get_user_contract():
    """Order service expects user-service to return user by ID."""
    pact = Consumer("order-service").has_pact_with(Provider("user-service"))
    
    # Define expected interaction
    pact.given("user 123 exists").upon_receiving(
        "a request for user 123"
    ).with_request(
        method="GET", path="/api/users/123"
    ).will_respond_with(
        status=200,
        body={
            "id": "123",
            "name": Like("Jane Doe"),      # Any string
            "email": Like("jane@test.com")  # Any string
        }
    )
    
    # Run test against mock provider
    with pact:
        user = order_service_client.get_user("123")
        assert user["id"] == "123"
        assert "name" in user
        assert "email" in user
    
    # Pact file generated → shared with provider team
    # Provider runs this contract in their CI pipeline
    print("Contract published to Pact Broker")
    print("Provider verifies contract on every build")
Pact Contract Testing
Testing Strategy: Aim for many unit tests, several component tests, contract tests for every service boundary, a handful of integration tests, and very few end-to-end tests. Contract tests catch breaking API changes early without the brittleness of full E2E suites.

Deployment Patterns

How you deploy services affects isolation, resource utilisation, and operational complexity:

Pattern Description Isolation Cost Efficiency
Multiple Instances per Host Run several service instances on the same physical/virtual host Low (shared resources, port conflicts) High (dense packing)
Service Instance per VM Each service instance runs in its own virtual machine High (full OS isolation) Low (VM overhead per instance)
Service Instance per Container Each instance runs in a container (Docker) — lightweight isolation with shared kernel Medium–High (namespace + cgroup isolation) High (fast startup, efficient resources)
Serverless Deployment Deploy functions (Lambda, Azure Functions) — no server management, scale-to-zero High (cloud-managed isolation) Very High for spiky/low traffic
Service Deployment Platform Use a platform (Kubernetes, ECS, Nomad) that abstracts infrastructure and provides service-level primitives (scaling, health checks, rolling updates) Configurable High (automated management)
Industry Standard: Most organisations have converged on container-per-service on Kubernetes as the default pattern. Serverless (Lambda/Cloud Functions) is used for event-driven glue logic and low-traffic services. See Cloud Computing: Containers & Kubernetes and Cloud Computing: Serverless Services for deeper coverage.

Observability Patterns

In a distributed system, you can't just SSH into one box and tail a log file the way you could with a monolith. Observability requires structured, centralised tooling across all services:

Pattern Purpose Tools
Log Aggregation Collect logs from all services into a central, searchable store ELK Stack, Loki + Grafana, CloudWatch Logs
Application Metrics Instrument code to gather statistics — request rate, error rate, latency percentiles Prometheus + Grafana, Datadog, CloudWatch Metrics
Distributed Tracing Assign a unique trace ID to each request and propagate it across services to track the full request lifecycle Jaeger, Zipkin, AWS X-Ray, OpenTelemetry
Health Check API Expose a /health endpoint for load balancers and orchestrators to probe service liveness and readiness Spring Boot Actuator, custom /healthz
Exception Tracking Report uncaught exceptions to a centralised service that aggregates, deduplicates, and alerts Sentry, Bugsnag, Rollbar
Audit Logging Record user actions (who did what, when) in an append-only audit database for compliance and forensics Custom audit service, AWS CloudTrail
Log Deployments & Changes Annotate metrics dashboards with deployment events — correlate performance changes with releases Grafana annotations, PagerDuty change events
Deep Dive: Full implementations of the three pillars of observability (logs, metrics, traces) with alerting strategies and SLO-based monitoring are covered in Part 10: Monitoring & Observability.
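Two of the patterns above — the health check API and trace ID propagation — can be sketched without any framework. The header name `X-Trace-Id` and the payload shape are illustrative conventions (real systems typically use the W3C `traceparent` header via OpenTelemetry):

```python
import uuid

# Observability sketches: a health-check payload for orchestrator probes, and
# trace-ID propagation across service hops. Names here are illustrative.
def health_check(db_ok, cache_ok):
    """Payload returned by a /health endpoint."""
    healthy = db_ok and cache_ok
    return {
        "status": "UP" if healthy else "DOWN",
        "checks": {"database": db_ok, "cache": cache_ok},
    }

def with_trace_id(incoming_headers):
    """Reuse the caller's trace ID, or start a new trace at the edge."""
    trace_id = incoming_headers.get("X-Trace-Id") or uuid.uuid4().hex
    outgoing = dict(incoming_headers)
    outgoing["X-Trace-Id"] = trace_id  # propagate to all downstream calls
    return trace_id, outgoing

print(health_check(db_ok=True, cache_ok=True)["status"])   # UP
print(health_check(db_ok=True, cache_ok=False)["status"])  # DOWN
trace_id, headers = with_trace_id({})  # edge service starts a new trace
```

The key discipline is the second function: every outbound call must carry the same trace ID it received, or the distributed trace breaks at that hop.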

UI Composition (Micro-Frontends)

When the backend is decomposed into microservices, how should the UI be structured? Two approaches for composing a single page from multiple teams' work:

Pattern How It Works Pros Cons
Server-Side Page Fragment Composition A gateway/server assembles a page from HTML fragments generated by different services (e.g., Nginx SSI, Edge Side Includes, Tailor by Zalando) Works without JavaScript, SEO-friendly, fast first paint Limited interactivity between fragments, server-side coupling
Client-Side UI Composition Each team ships an independent frontend module (React/Angular/Vue micro-app) loaded and orchestrated on the client via Module Federation, Single-SPA, or iframes Full team autonomy, rich interactivity, independent deployments Larger bundle size, runtime integration complexity, shared state management

Micro-Frontend Architecture

// Webpack Module Federation — micro-frontend setup
// Each team deploys independently, modules loaded at runtime

// Team A: Product Catalog (host application)
// webpack.config.js
const { ModuleFederationPlugin } = require("webpack").container;

module.exports = {
  plugins: [
    new ModuleFederationPlugin({
      name: "catalog_app",
      remotes: {
        // Other teams' micro-frontends loaded at runtime
        reviews: "reviews_app@https://reviews.cdn.example.com/remoteEntry.js",
        cart: "cart_app@https://cart.cdn.example.com/remoteEntry.js",
        recommendations: "recs_app@https://recs.cdn.example.com/remoteEntry.js"
      },
      shared: ["react", "react-dom"]  // Shared dependencies
    })
  ]
};

// In the host app — compose micro-frontends
// ProductPage.jsx
import React, { Suspense } from "react";
const ReviewsWidget = React.lazy(
  () => import("reviews/ReviewsWidget")
);
const CartButton = React.lazy(
  () => import("cart/AddToCartButton")
);
const Recommendations = React.lazy(
  () => import("recommendations/RelatedProducts")
);

function ProductPage({ productId }) {
  return (
    <div>
      <ProductDetails id={productId} />
      <Suspense fallback="Loading reviews...">
        <ReviewsWidget productId={productId} />
      </Suspense>
      <Suspense fallback="Loading cart...">
        <CartButton productId={productId} />
      </Suspense>
      <Suspense fallback="Loading recommendations...">
        <Recommendations productId={productId} />
      </Suspense>
    </div>
  );
}

console.log("Each widget deployed independently by different teams");
Module Federation Micro-Frontends Independent Deploy

Serverless Architecture

Key Insight: Serverless doesn't mean "no servers" — it means you don't manage servers. The cloud provider handles scaling, patching, and infrastructure automatically.

Function-as-a-Service (FaaS) is a cloud computing model where you deploy individual functions that execute in response to events. AWS Lambda, Azure Functions, and Google Cloud Functions are popular FaaS platforms.

Serverless FaaS architecture showing event sources triggering individual functions that auto-scale, with pay-per-execution billing and managed infrastructure
Serverless Function-as-a-Service architecture — event sources trigger stateless functions that auto-scale from zero with pay-per-execution billing

Serverless Benefits

  • No Server Management: Focus on code, not infrastructure
  • Auto-Scaling: Scale to zero or thousands instantly
  • Pay-per-Use: Only pay for actual execution time
  • Built-in HA: Multi-AZ by default

AWS Lambda Function Example

# AWS Lambda Function - Process uploaded images
import boto3
import json
import urllib.parse
from PIL import Image
import io

s3 = boto3.client('s3')

def lambda_handler(event, context):
    """Triggered when image uploaded to S3"""
    # Get uploaded file info (S3 URL-encodes object keys in event payloads)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    
    # Download image
    response = s3.get_object(Bucket=bucket, Key=key)
    image_content = response['Body'].read()
    
    # Process image (convert to RGB so PNG/RGBA uploads can be saved as JPEG)
    image = Image.open(io.BytesIO(image_content)).convert('RGB')
    thumbnail = image.resize((200, 200))
    
    # Save thumbnail
    buffer = io.BytesIO()
    thumbnail.save(buffer, format='JPEG')
    buffer.seek(0)
    
    thumbnail_key = f"thumbnails/{key}"
    s3.put_object(
        Bucket=bucket,
        Key=thumbnail_key,
        Body=buffer.getvalue(),
        ContentType='image/jpeg'
    )
    
    return {
        'statusCode': 200,
        'body': json.dumps({'thumbnail': thumbnail_key})
    }
AWS Lambda Event-Driven

Serverless Patterns

Event-Driven Processing

Functions triggered by events from various sources:

# serverless.yml - Event triggers
service: order-processor

functions:
  processOrder:
    handler: handler.process_order
    events:
      - sqs:
          arn: arn:aws:sqs:region:account:order-queue
          batchSize: 10
  
  sendNotification:
    handler: handler.send_notification
    events:
      - sns:
          arn: arn:aws:sns:region:account:order-notifications
  
  apiEndpoint:
    handler: handler.api
    events:
      - http:
          path: /orders
          method: post
  
  scheduledReport:
    handler: handler.generate_report
    events:
      - schedule: rate(1 day)

Function Composition

Chain functions using Step Functions or event choreography:

# AWS Step Functions - Order workflow
{
  "Comment": "Order Processing Workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:validateOrder",
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:checkInventory",
      "Next": "ProcessPayment",
      "Catch": [
        {
          "ErrorEquals": ["OutOfStockError"],
          "Next": "NotifyOutOfStock"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:processPayment",
      "Next": "FulfillOrder"
    },
    "FulfillOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:fulfillOrder",
      "End": true
    },
    "NotifyOutOfStock": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:notifyUser",
      "End": true
    }
  }
}

Serverless Tradeoffs

Benefit Tradeoff
No server management Less control over environment
Auto-scaling Cold start latency (100 ms to a few seconds)
Pay-per-execution Expensive for constant high load
Built-in HA Vendor lock-in
Rapid development Debugging/testing more complex
Event-driven simplicity Limited execution time (15 min max)
Cost Comparison: At low traffic, serverless is cheaper. But at ~1M requests/day with 200ms execution, EC2 often becomes more economical. Calculate your break-even point!
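The break-even arithmetic from the callout above can be worked through directly. The prices below are illustrative (roughly AWS us-east-1 list prices at the time of writing) — substitute your own region's rates and instance sizes:

```python
# Back-of-envelope serverless vs. EC2 cost comparison. All prices are
# illustrative approximations, not an authoritative AWS quote.
LAMBDA_PER_REQUEST = 0.20 / 1_000_000   # $ per invocation
LAMBDA_PER_GB_SECOND = 0.0000166667     # $ per GB-second of compute
EC2_PER_HOUR = 0.0416                   # e.g. one t3.medium, on-demand

def lambda_monthly_cost(requests_per_day, duration_ms, memory_gb=0.5):
    requests = requests_per_day * 30
    gb_seconds = requests * (duration_ms / 1000) * memory_gb
    return requests * LAMBDA_PER_REQUEST + gb_seconds * LAMBDA_PER_GB_SECOND

def ec2_monthly_cost(instance_count=1):
    return instance_count * EC2_PER_HOUR * 24 * 30

for rpd in (10_000, 1_000_000, 10_000_000):
    print(f"{rpd:>10} req/day: Lambda ${lambda_monthly_cost(rpd, 200):>9,.2f}"
          f"  vs  EC2 ${ec2_monthly_cost():,.2f}/month")
```

With 200 ms executions at 512 MB, Lambda comes out around $0.56/month at 10k requests/day versus roughly $30 for an always-on instance, but crosses over near the 1M requests/day mark — matching the callout's rule of thumb.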

When to Use Serverless

  • ✓ Variable/unpredictable traffic
  • ✓ Event-driven workloads
  • ✓ Short-running tasks (< 15 minutes)
  • ✓ Rapid prototyping
  • ✓ Background jobs (image processing, data ETL)
  • ✗ Long-running processes
  • ✗ Consistent high-throughput workloads
  • ✗ Latency-sensitive applications (cold starts)

System Evolution & Refactoring

Most real-world systems aren't built from scratch — they evolve from monoliths into distributed architectures. This section covers proven strategies for modernising legacy systems incrementally, without big-bang rewrites.

The Strangler Fig Pattern

Named after the tropical strangler fig that grows around a host tree until it replaces it entirely. The pattern routes traffic to a new service while the legacy module still runs, progressively migrating functionality.


Strangler Fig Implementation

# Strangler Fig Pattern — incremental monolith decomposition

# Step 1: Intercept requests at the gateway
# (forward_to and the call_* helpers below stand in for real HTTP forwarding)
class StranglerProxy:
    def __init__(self):
        self.migrated_routes = {
            "/api/users": "http://user-service:8080",
            "/api/auth":  "http://auth-service:8080",
        }
        self.legacy_url = "http://monolith:3000"
    
    def route(self, request):
        for prefix, target in self.migrated_routes.items():
            if request.path.startswith(prefix):
                return forward_to(target, request)
        # Not yet migrated — send to monolith
        return forward_to(self.legacy_url, request)

# Step 2: Verify parity with shadow traffic
class ShadowTrafficVerifier:
    def compare(self, request):
        legacy_response = call_legacy(request)
        new_response = call_new_service(request)
        if legacy_response != new_response:
            log_discrepancy(request, legacy_response, new_response)
        return legacy_response  # Serve legacy until verified

# Step 3: Cutover — update routing, decommission legacy module
proxy = StranglerProxy()
print("Migrated routes:", list(proxy.migrated_routes.keys()))
print("Remaining legacy:", proxy.legacy_url)
Strangler Fig Migration

Legacy Modernisation Strategies

Strategy Approach Risk Best For
Strangler Fig Route-by-route migration behind a proxy Low — rollback is instant Monolith with clear API boundaries
Branch by Abstraction Introduce abstraction layer, swap implementation behind it Low–Medium Replacing internal libraries or data layers
Parallel Run Run old and new systems simultaneously, compare outputs Low (verification heavy) Financial or safety-critical systems
Big Bang Rewrite Replace entire system at once Very High Almost never recommended — high failure rate
Warning — The Second System Effect: Complete rewrites almost always take longer than estimated and often re-introduce the same problems. Prefer incremental modernisation strategies that deliver value continuously.

Feature Flags for Migration

Feature flags (feature toggles) let you control which code path runs — old or new — without deploying new code:

# Feature flags for safe migration rollout
import hashlib

class FeatureFlags:
    def __init__(self):
        self.flags = {
            "use_new_payment_service": {
                "enabled": True,
                "rollout_percentage": 25,  # 25% of traffic uses new service
                "allowed_users": ["internal-testers"],
            }
        }
    
    def is_enabled(self, flag_name, user_id=None):
        flag = self.flags.get(flag_name, {})
        if not flag.get("enabled"):
            return False
        if user_id in flag.get("allowed_users", []):
            return True  # Always enable for test users
        if user_id is None:
            return False
        # Percentage-based rollout — hashlib gives a stable bucket across
        # processes (Python's built-in hash() is randomized per process)
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        return bucket < flag.get("rollout_percentage", 0)

# Usage: gradually shift traffic from legacy to new service
flags = FeatureFlags()

def process_payment(order_id, user_id):
    if flags.is_enabled("use_new_payment_service", user_id):
        return new_payment_service.process(order_id)
    return legacy_monolith.process_payment(order_id)

print("25% rollout:", flags.is_enabled("use_new_payment_service", "user-123"))
print("Test user:", flags.is_enabled("use_new_payment_service", "internal-testers"))

Database Evolution — Expand-Contract Pattern

When migrating data ownership from a monolith to a service, the expand-contract pattern prevents downtime:


Expand-Contract Migration

  1. Expand: Add the new schema (new column, new table, or new service DB) alongside the old one. Both are written to simultaneously.
  2. Migrate: Backfill historical data from old to new schema. Verify data integrity.
  3. Contract: Remove writes to the old schema. Old consumers now read from the new schema. Drop the old column/table after a cooling-off period.
Zero-Downtime Schema Migration
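The three phases above can be sketched as a dual-writing repository. The stores, phase names, and `OrderRepository` class are illustrative — in practice the "stores" are the monolith's table and the new service's database:

```python
# Expand-contract sketch: during "expand" the service writes both schemas,
# "migrate" backfills and verifies, "contract" stops writing the old one.
# All names here are illustrative.
class OrderRepository:
    def __init__(self):
        self.old_store = {}    # monolith table (to be retired)
        self.new_store = {}    # new service-owned database
        self.phase = "expand"  # expand -> migrate -> contract

    def save(self, order_id, order):
        if self.phase in ("expand", "migrate"):
            self.old_store[order_id] = order  # keep legacy readers working
        self.new_store[order_id] = order      # always write the new schema

    def backfill(self):
        """Migrate phase: copy historical rows old -> new, then verify."""
        for order_id, order in self.old_store.items():
            self.new_store.setdefault(order_id, order)
        return self.old_store.keys() <= self.new_store.keys()

    def contract(self):
        """Contract phase: stop writing the old schema."""
        self.phase = "contract"

repo = OrderRepository()
repo.save("o1", {"total": 10})   # expand: dual-write
assert repo.backfill()            # migrate: backfill + verify
repo.contract()
repo.save("o2", {"total": 20})   # contract: new schema only
print("o2" in repo.old_store)    # False — legacy writes stopped
```

The cooling-off period matters: only drop the old column or table once nothing has read or written it for long enough that a rollback is inconceivable.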

Blue-Green & Canary Deployments for Migration

  • Blue-Green: Run two identical environments. Route all traffic to "blue" (legacy). Deploy the new version to "green." Switch the load balancer to "green" once validated. Rollback = switch back to "blue."
  • Canary: Deploy the new version to a small subset of servers (1–5%). Monitor error rates and latency. Gradually increase traffic (10% → 25% → 50% → 100%). Automated rollback if metrics breach thresholds.
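The canary progression described above can be sketched as a simple step controller. The step percentages, thresholds, and the `metrics_at_step` callback are all illustrative — real deployments would read these from Prometheus or the deployment platform:

```python
# Canary rollout sketch: step traffic up only while error rate and p99
# latency stay within thresholds; roll back on the first breach.
# All numbers are illustrative.
CANARY_STEPS = [1, 5, 10, 25, 50, 100]  # % of traffic to the new version

def advance_canary(metrics_at_step, max_error_rate=0.01, max_p99_ms=500):
    """Walk the steps; return final traffic % (0 means rolled back)."""
    current = 0
    for step in CANARY_STEPS:
        m = metrics_at_step(step)  # e.g. query metrics at this traffic %
        if m["error_rate"] > max_error_rate or m["p99_ms"] > max_p99_ms:
            print(f"Rollback at {step}% (errors={m['error_rate']:.1%})")
            return 0
        current = step
    return current  # 100 => canary promoted to full traffic

# Healthy canary climbs to 100%
healthy = lambda step: {"error_rate": 0.002, "p99_ms": 180}
print(advance_canary(healthy))  # 100

# Faulty canary rolls back at the first step
faulty = lambda step: {"error_rate": 0.08, "p99_ms": 180}
print(advance_canary(faulty))   # 0
```

Tools like Argo Rollouts and Flagger automate exactly this loop, including the metric queries and the rollback.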
Migration Success Metrics: Track error rate delta (new vs. old), p99 latency comparison, data consistency checks, and rollback count per migration phase. A migration is "complete" when the legacy path has had zero traffic for 30+ days and can be safely decommissioned.

Next Steps

Microservices Architecture Plan Generator

Design your microservices decomposition with bounded contexts, communication patterns, and deployment topology. Download as Word, Excel, PDF, or PowerPoint.
