Back to Systems Thinking & Architecture Mastery Series

Part 10: Queueing & Caching Architecture

May 15, 2026 Wasil Zafar 30 min read

Queues decouple producers from consumers. Caches trade freshness for speed. Together, they're the two most powerful performance multipliers in distributed systems — and the source of some of the most subtle bugs. This module covers the theory and practice of both.

Table of Contents

  1. Module 16: Queueing Systems
  2. Module 17: Caching Architecture
  3. Case Studies
  4. Conclusion & Next Steps

Module 16: Queueing Systems

Queues are everywhere — HTTP request buffers, database connection pools, Kafka topics, email delivery pipelines, print spoolers. Any time a producer generates work faster than a consumer can process it, a queue forms. Understanding queueing theory helps you predict when systems will break before they break.

Queueing Theory Basics

Queueing theory gives us mathematical models for systems where requests arrive, wait, get processed, and depart. The fundamental parameters:

  • λ (lambda) — Arrival rate: How many requests arrive per unit time (e.g., 100 requests/second)
  • μ (mu) — Service rate: How many requests one server can process per unit time (e.g., 120 requests/second)
  • ρ (rho) — Utilization: λ/μ — the fraction of time the server is busy. Must be < 1 for the system to be stable.
  • L — Average number in system: Requests being served + requests waiting
  • W — Average time in system: Service time + wait time
  • Lq — Average queue length: Requests waiting (not yet being served)
  • Wq — Average wait time: Time spent waiting before service begins
The Stability Condition: If ρ ≥ 1 (arrival rate ≥ service rate), the queue grows without bound. The system is unstable — latency increases to infinity. This is the mathematical proof that you cannot sustainably run a system at 100% utilization. Every production system needs headroom.

Little's Law: The Universal Queueing Relationship

Little's Law is one of the most powerful and general results in queueing theory. It applies to any stable system regardless of arrival distribution, service distribution, or number of servers:

Little's Law: L = λ × W
The average number of items in a system (L) equals the average arrival rate (λ) times the average time each item spends in the system (W). No assumptions about distributions required — it works for any stable queue.

Practical applications of Little's Law:

  • Capacity planning: If you need to handle 1,000 req/s (λ) and each request takes 200ms (W), you need 200 concurrent connections in your system (L = 1000 × 0.2)
  • Thread pool sizing: L tells you how many threads/goroutines/connections you need
  • Diagnosing bottlenecks: If L is growing but λ is constant, then W is increasing — something is getting slower
  • SLA validation: If your SLA requires W ≤ 500ms and λ = 2,000 req/s, you need infrastructure that can sustain L = 1,000 concurrent requests
"""
Little's Law Calculator — Capacity Planning Tool

L = λ × W (Items in system = Arrival rate × Time in system)

Use this to determine:
- How many concurrent connections your system needs
- Thread pool sizes for web servers
- Connection pool sizes for databases
- Whether your infrastructure can meet SLA requirements
"""

def littles_law_analysis(arrival_rate, service_time_ms):
    """
    Analyze system capacity requirements using Little's Law.
    
    Args:
        arrival_rate: Requests per second (λ)
        service_time_ms: Average time per request in milliseconds (W)
    
    Returns:
        Dictionary with capacity requirements
    """
    W = service_time_ms / 1000  # Convert to seconds
    L = arrival_rate * W         # Average items in system
    
    # Utilization assuming single server
    service_rate = 1000 / service_time_ms  # Requests one server handles/s
    rho_single = arrival_rate / service_rate
    
    # Minimum servers needed for stability (ρ < 1 per server)
    import math
    min_servers = math.ceil(arrival_rate / service_rate)
    
    # Recommended servers (target 70% utilization for latency headroom)
    recommended_servers = math.ceil(arrival_rate / (service_rate * 0.70))
    
    return {
        "arrival_rate": arrival_rate,
        "service_time_ms": service_time_ms,
        "avg_concurrent_requests": round(L, 1),
        "min_servers_for_stability": min_servers,
        "recommended_servers_70pct": recommended_servers,
        "utilization_per_server": round(arrival_rate / (recommended_servers * service_rate) * 100, 1),
    }


# Scenario 1: Web API
print("=" * 60)
print("SCENARIO 1: REST API Service")
print("=" * 60)
api = littles_law_analysis(arrival_rate=2000, service_time_ms=50)
print(f"  Arrival rate (λ):        {api['arrival_rate']} req/s")
print(f"  Service time (W):        {api['service_time_ms']}ms per request")
print(f"  Concurrent requests (L): {api['avg_concurrent_requests']}")
print(f"  Min servers needed:      {api['min_servers_for_stability']}")
print(f"  Recommended (70% util):  {api['recommended_servers_70pct']}")
print(f"  Utilization per server:  {api['utilization_per_server']}%")

# Scenario 2: Database connection pool
print(f"\n{'=' * 60}")
print("SCENARIO 2: Database Connection Pool")
print("=" * 60)
db = littles_law_analysis(arrival_rate=500, service_time_ms=20)
print(f"  Query rate (λ):          {db['arrival_rate']} queries/s")
print(f"  Avg query time (W):      {db['service_time_ms']}ms")
print(f"  Connections needed (L):  {db['avg_concurrent_requests']}")
print(f"  → Set pool size to:      {int(db['avg_concurrent_requests'] * 1.5)} (1.5x headroom)")

# Scenario 3: Message queue consumer
print(f"\n{'=' * 60}")
print("SCENARIO 3: Kafka Consumer Group")
print("=" * 60)
kafka = littles_law_analysis(arrival_rate=10000, service_time_ms=100)
print(f"  Message rate (λ):        {kafka['arrival_rate']} msg/s")
print(f"  Processing time (W):     {kafka['service_time_ms']}ms per message")
print(f"  Messages in-flight (L):  {kafka['avg_concurrent_requests']}")
print(f"  Min consumers needed:    {kafka['min_servers_for_stability']}")
print(f"  Recommended consumers:   {kafka['recommended_servers_70pct']}")

# The key insight
print(f"\n{'=' * 60}")
print("KEY INSIGHT: Why 100% utilization destroys latency")
print("=" * 60)
print("\nAs utilization (ρ) approaches 1.0, wait time → infinity:")
for rho in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    # M/M/1 queue: Wq = ρ / (μ(1-ρ)) — wait time formula
    mu = 20  # service rate: 20 req/s (50ms each)
    Wq_ms = (rho / (mu * (1 - rho))) * 1000
    print(f"  ρ = {rho:.2f} → avg wait time: {Wq_ms:>8.1f}ms")

The Hockey Stick: Why Systems Break at High Utilization

The relationship between utilization and latency is not linear — it's exponential. This is the single most important graph in system design:

Utilization vs. Latency — The Hockey Stick Curve
flowchart LR
    subgraph CURVE["Latency vs. Utilization (M/M/1 Queue)"]
        direction TB
        LOW["ρ = 50%
Latency: 2× service time
✓ Comfortable headroom"] MED["ρ = 70%
Latency: 3.3× service time
✓ Normal operating range"] HIGH["ρ = 85%
Latency: 6.7× service time
⚠ Warning zone"] CRIT["ρ = 95%
Latency: 20× service time
🔴 Danger zone"] DEAD["ρ = 99%
Latency: 100× service time
💀 System failing"] end LOW --> MED MED --> HIGH HIGH --> CRIT CRIT --> DEAD style LOW fill:#e8f4f4,stroke:#3B9797,color:#132440 style MED fill:#e8f4f4,stroke:#3B9797,color:#132440 style HIGH fill:#f0f4f8,stroke:#16476A,color:#132440 style CRIT fill:#fdf0f0,stroke:#BF092F,color:#132440 style DEAD fill:#fdf0f0,stroke:#BF092F,color:#132440
Kingman's Formula (G/G/1 approximation): For real-world systems with variable arrival and service times, average wait time ≈ (ρ / (1 - ρ)) × (c²_a + c²_s) / 2 × average_service_time, where c²_a and c²_s are the squared coefficients of variation for arrivals and service. The key insight: variability multiplies the latency penalty. Bursty traffic (high c²_a) makes the hockey stick curve even steeper.

Practical implications:

  • Target 70% utilization for latency-sensitive services (APIs, databases)
  • Target 85% utilization for throughput-oriented batch processing
  • Never exceed 90% sustained — any traffic spike will push you into the exponential zone
  • Auto-scale triggers should fire at 70-75%, not 90% — by the time you hit 90%, latency is already degraded and new instances take time to warm up

Backpressure Patterns: When to Push Back

Backpressure is the mechanism by which a system signals upstream that it cannot accept more work. Without backpressure, an overloaded consumer's queue grows until memory is exhausted, then crashes — taking everything with it.

Backpressure strategies (from least to most aggressive):

  • Buffer & batch: Absorb bursts in a bounded buffer, process in batches. Works for short spikes but fails for sustained overload.
  • Signal upstream (flow control): Tell the producer to slow down. TCP does this with window size; Kafka consumers commit offsets to control fetch rate; reactive streams use request(n).
  • Load shed: Drop lowest-priority requests. Return 503 (Service Unavailable) with Retry-After header. Priority queues help — drop analytics events before dropping payment processing.
  • Circuit breaker: Stop accepting requests entirely for a cooldown period. Fast-fail instead of slow-fail — it's better to reject quickly than to accept and time out.
  • Adaptive concurrency: Dynamically adjust the concurrency limit based on measured latency. Netflix's concurrency-limits library implements this — reduce parallelism when latency increases.

Dead-Letter Queues: Handling Poison Messages

A poison message is a message that consistently fails processing — malformed data, schema mismatch, downstream dependency failure. Without a dead-letter queue (DLQ), this message blocks the queue forever (infinite retry loop) or gets silently dropped (data loss).

DLQ pattern:

  • Consumer attempts to process message
  • Processing fails → increment retry counter
  • After N failures (typically 3-5), move message to the DLQ
  • Main queue continues processing — no blocking
  • Operators investigate DLQ messages, fix the issue, and replay them
Dead-Letter Queue Flow — Isolating Poison Messages
flowchart TD
    PROD["Producer"] -->|"publish"| QUEUE["Main Queue"]
    QUEUE -->|"consume"| CONSUMER["Consumer"]
    CONSUMER -->|"success"| ACK["Acknowledge
(remove from queue)"] CONSUMER -->|"failure"| RETRY{"Retry count
< max?"} RETRY -->|"Yes"| QUEUE RETRY -->|"No (poison msg)"| DLQ["Dead-Letter Queue"] DLQ -->|"alert"| OPS["Operations Team"] OPS -->|"fix & replay"| QUEUE style PROD fill:#e8f4f4,stroke:#3B9797,color:#132440 style QUEUE fill:#e8f4f4,stroke:#3B9797,color:#132440 style CONSUMER fill:#f0f4f8,stroke:#16476A,color:#132440 style ACK fill:#e8f4f4,stroke:#3B9797,color:#132440 style DLQ fill:#fdf0f0,stroke:#BF092F,color:#132440 style OPS fill:#f0f4f8,stroke:#16476A,color:#132440

Consumer Lag: The Early Warning System

Consumer lag is the difference between what's been produced and what's been consumed. It's the most important metric for any queue-based system — growing lag means your consumers can't keep up.

# Monitor Kafka consumer lag — the #1 queue health metric
# Consumer lag = latest offset - consumer's committed offset

# Check lag for all consumer groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
    --describe --all-groups

# Output columns: GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG
# Example:
# order-processors  orders  0  15234  15240  6
# order-processors  orders  1  14891  14950  59  ← GROWING LAG!

# Set up alerting thresholds:
# - LAG > 1000 messages: WARNING (consumer falling behind)
# - LAG > 10000 messages: CRITICAL (consumer significantly behind)
# - LAG growing for > 5 minutes: ALERT (sustained overload)

# Monitor lag continuously with prometheus-style metrics
echo "Checking lag every 10 seconds..."
while true; do
    LAG=$(kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
        --describe --group order-processors 2>/dev/null \
        | awk 'NR>1 {sum += $6} END {print sum}')
    
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    echo "$TIMESTAMP | Total lag: $LAG messages"
    
    if [ "$LAG" -gt 10000 ]; then
        echo "  ⚠️  CRITICAL: Consumer lag exceeds 10,000!"
        echo "  Consider: adding consumers, increasing batch size, or investigating slow processing"
    fi
    
    sleep 10
done

Module 17: Caching Architecture

Caching trades freshness for speed. Every cache is a bet: "this data probably hasn't changed since we last looked." When the bet pays off (cache hit), latency drops from hundreds of milliseconds to single-digit milliseconds. When it fails (stale data served), users see outdated information.

The art of caching is managing this tradeoff — getting the performance benefit while minimizing staleness impact.

Caching Layers: The Speed Pyramid

Modern systems stack multiple cache layers. Each layer trades capacity for speed:

Caching Layers Pyramid — From Fastest to Most Capacious
flowchart TD
    subgraph LAYERS["Cache Layers (top = fastest, bottom = most capacity)"]
        direction TB
        L1["Browser Cache
~0ms | Per-user
Cache-Control, ETag, Service Worker"] L2["CDN / Edge Cache
~5-20ms | Per-region
CloudFront, Fastly, Cloudflare"] L3["Reverse Proxy Cache
~1-5ms | Per-datacenter
Varnish, NGINX cache"] L4["Application Cache
~1-10ms | In-process or distributed
Redis, Memcached, local memory"] L5["Database Cache
~5-50ms | Query result cache
MySQL query cache, materialized views"] L6["Database (Source of Truth)
~10-100ms | Persistent storage
PostgreSQL, DynamoDB, etc."] end L1 -->|"MISS"| L2 L2 -->|"MISS"| L3 L3 -->|"MISS"| L4 L4 -->|"MISS"| L5 L5 -->|"MISS"| L6 style L1 fill:#e8f4f4,stroke:#3B9797,color:#132440 style L2 fill:#e8f4f4,stroke:#3B9797,color:#132440 style L3 fill:#f0f4f8,stroke:#16476A,color:#132440 style L4 fill:#f0f4f8,stroke:#16476A,color:#132440 style L5 fill:#fdf0f0,stroke:#BF092F,color:#132440 style L6 fill:#fdf0f0,stroke:#BF092F,color:#132440

Each layer's characteristics:

Reference
Cache Layer Comparison
Layer Latency Capacity Invalidation Scope
Browser 0ms (local) ~50-300MB TTL, ETag, manual (Cache-Control: no-cache) Per-user, per-device
CDN 5-20ms Terabytes TTL, purge API, versioned URLs Per-region, shared across users
Reverse Proxy 1-5ms RAM/SSD of proxy TTL, BAN/PURGE commands Per-datacenter
Application (Redis) 1-10ms GB to TB cluster TTL, explicit DELETE, pub/sub Application-wide
Database 5-50ms Configured pool Auto (on write), manual FLUSH Per-database
Caching Performance Layers

Caching Patterns: Read & Write Strategies

The choice of caching pattern determines consistency guarantees and failure behavior:

Cache-Aside (Lazy Loading):

  • Application checks cache first. On miss, reads from DB, then writes to cache.
  • Pros: Only caches data that's actually requested. Cache failure doesn't break reads (fallback to DB).
  • Cons: First request always slow (cache miss). Stale data if DB is updated without invalidating cache.
  • Best for: General-purpose read caching, user profiles, product catalogs.
Cache-Aside Pattern — Application Manages Cache
flowchart TD
    APP["Application"] -->|"1. GET key"| CACHE["Cache (Redis)"]
    CACHE -->|"2a. HIT → return data"| APP
    CACHE -->|"2b. MISS"| APP
    APP -->|"3. Query DB"| DB["Database"]
    DB -->|"4. Return data"| APP
    APP -->|"5. SET key (with TTL)"| CACHE

    style APP fill:#e8f4f4,stroke:#3B9797,color:#132440
    style CACHE fill:#f0f4f8,stroke:#16476A,color:#132440
    style DB fill:#fdf0f0,stroke:#BF092F,color:#132440
                            

Read-Through:

  • Cache sits in front of DB. Application only talks to cache. On miss, the cache loads from DB.
  • Pros: Simpler application code (no cache management logic). Consistent interface.
  • Cons: Cache library must support DB integration. Initial latency on miss includes cache-to-DB round trip.

Write-Through:

  • Every write goes to cache AND database synchronously. Cache is always up-to-date.
  • Pros: Cache never stale. Reads always hit cache after first write.
  • Cons: Write latency increases (two writes per operation). Caches data that may never be read.

Write-Behind (Write-Back):

  • Write to cache immediately, asynchronously flush to database in batches.
  • Pros: Ultra-fast writes. Batching reduces DB load. Great for high-write workloads.
  • Cons: Data loss risk — if cache crashes before flushing, writes are lost. Complexity of ensuring eventual consistency.
  • Best for: Analytics counters, view counts, non-critical metrics where occasional loss is acceptable.
# Redis Cache Configuration — Production-Ready Setup
# Suitable for application-layer caching (cache-aside pattern)

# Redis server configuration (redis.conf)
# Memory management
maxmemory 4gb
maxmemory-policy allkeys-lru    # Evict least-recently-used keys when full

# Persistence (RDB snapshots + AOF for durability)
save 900 1          # Snapshot if 1 key changed in 900 seconds
save 300 10         # Snapshot if 10 keys changed in 300 seconds
save 60 10000       # Snapshot if 10000 keys changed in 60 seconds
appendonly yes      # Enable append-only file
appendfsync everysec  # Fsync every second (balance of durability and performance)

# Connection limits
maxclients 10000
timeout 300         # Close idle connections after 5 minutes
tcp-keepalive 60    # TCP keepalive every 60 seconds

# Performance tuning
hz 10               # Server tick rate (10 Hz default)
lazyfree-lazy-eviction yes    # Background eviction (non-blocking)
lazyfree-lazy-expire yes      # Background expiry
lazyfree-lazy-server-del yes  # Background deletion

# Cluster configuration (for horizontal scaling)
# cluster-enabled yes
# cluster-config-file nodes.conf
# cluster-node-timeout 5000

---
# Application-side Redis client config (Python example as YAML for clarity)
# Using redis-py with connection pooling

redis_config:
  host: "redis-primary.internal"
  port: 6379
  db: 0
  password: "${REDIS_AUTH_TOKEN}"  # From environment variable
  
  # Connection pool settings
  max_connections: 50        # Match your expected concurrency
  socket_timeout: 5          # 5s timeout for operations
  socket_connect_timeout: 2  # 2s timeout for initial connection
  retry_on_timeout: true     # Auto-retry on timeout
  
  # Health check
  health_check_interval: 30  # Ping every 30s to keep connections alive

  # Default TTL strategy by data type
  ttl_defaults:
    user_session: 3600       # 1 hour
    user_profile: 300        # 5 minutes
    product_catalog: 600     # 10 minutes
    search_results: 60       # 1 minute
    rate_limit_counter: 60   # 1 minute window

Cache Invalidation: The Hardest Problem

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton. He wasn't joking. Cache invalidation is hard because you're managing two copies of data and ensuring they stay synchronized across distributed systems with network delays.

Invalidation Strategies:

  • TTL (Time-To-Live): Set an expiration time. After TTL, key is automatically deleted. Simple but imprecise — data might change before TTL expires (stale) or TTL might expire unnecessarily (miss penalty). Choose TTL based on acceptable staleness: 5s for stock prices, 5min for user profiles, 1hr for product descriptions.
  • Event-Driven Invalidation: When data changes, publish an invalidation event. Subscribers delete/update their cached copies. Precise but requires reliable event delivery (what if the invalidation message is lost?).
  • Versioned Keys: Include a version in the cache key (user:42:v7). When data changes, increment version. Old keys expire naturally via TTL. No explicit invalidation needed — elegant but wastes cache space with stale versioned keys.
  • Write-Invalidate: On every write, explicitly delete the cache key. Next read triggers a cache miss → fresh load from DB. Simple and consistent but means writes always cause a subsequent read to be slow.
  • Write-Update: On every write, update the cache with the new value. No subsequent miss penalty. But: what if two concurrent writes race? Cache might end up with older value.

Cache Stampede Prevention: The Thundering Herd

A cache stampede (thundering herd) occurs when a popular cache key expires and hundreds/thousands of concurrent requests simultaneously miss the cache and hit the database. The database, designed to handle the load with the cache, buckles under the sudden flood.

Prevention techniques:

  • Mutex/Lock: First request to miss acquires a lock, fetches from DB, and repopulates cache. All other requests wait (or return stale data) until the lock is released. Simple but adds latency for waiting requests.
  • Probabilistic Early Expiration (PER): Before TTL actually expires, each request has a small probability of refreshing the cache proactively. As expiry approaches, the probability increases. Result: one request refreshes early, preventing the stampede.
  • Pre-warming: Background job refreshes popular keys before they expire. Requires knowing which keys are "hot" — typically tracked via access frequency counters.
  • Stale-While-Revalidate: Serve stale cached data immediately, trigger an async background refresh. User gets fast (slightly stale) response; cache gets updated for next request. HTTP stale-while-revalidate directive implements this at the CDN layer.
"""
Probabilistic Early Expiration (PER) — Cache Stampede Prevention

Instead of all requests discovering the cache expired simultaneously,
each request independently decides whether to refresh early.
The probability of refreshing increases as TTL approaches expiration.

Algorithm: refresh if (current_time - (ttl - beta * log(random()))) > expiry_time
Based on: "Optimal Probabilistic Cache Stampede Prevention" (Vattani et al., 2015)
"""

import time
import math
import random

class ProbabilisticCache:
    """
    Cache with probabilistic early expiration to prevent stampedes.
    """
    
    def __init__(self, beta=1.0):
        """
        Args:
            beta: Controls refresh eagerness.
                  Higher beta = earlier refresh (more conservative).
                  Typical values: 0.5 to 2.0. Start with 1.0.
        """
        self.beta = beta
        self.store = {}  # key -> (value, expiry_time, ttl_duration)
    
    def set(self, key, value, ttl_seconds):
        """Store value with TTL."""
        expiry = time.time() + ttl_seconds
        self.store[key] = (value, expiry, ttl_seconds)
    
    def get(self, key, recompute_func):
        """
        Get value with probabilistic early refresh.
        
        If the key is "close to expiring", there's an increasing
        probability that THIS request will trigger a refresh —
        before the actual expiry causes a stampede.
        """
        if key not in self.store:
            # Cold miss — must compute
            value = recompute_func()
            self.set(key, value, ttl_seconds=300)  # Default 5min TTL
            return value, "MISS"
        
        value, expiry_time, ttl_duration = self.store[key]
        current_time = time.time()
        
        # Check if actually expired
        if current_time >= expiry_time:
            value = recompute_func()
            self.set(key, value, ttl_duration)
            return value, "EXPIRED"
        
        # Probabilistic early expiration check
        # Gap = time remaining until expiry
        time_remaining = expiry_time - current_time
        
        # XFetch algorithm: refresh if random delay exceeds remaining time
        # As time_remaining shrinks, probability of refresh increases
        random_delay = self.beta * ttl_duration * math.log(random.random())
        # Note: log(random()) is negative, so we negate it
        effective_remaining = time_remaining + random_delay  # random_delay is negative
        
        if effective_remaining <= 0:
            # This request "wins" the probabilistic race — refresh early
            value = recompute_func()
            self.set(key, value, ttl_duration)
            return value, "EARLY_REFRESH"
        
        # Normal cache hit — return cached value
        return value, "HIT"


# Simulation: 1000 concurrent requests near expiry
cache = ProbabilisticCache(beta=1.0)

def expensive_db_query():
    """Simulates a 200ms database query."""
    time.sleep(0.001)  # Simplified for demo
    return {"user": "Alice", "score": 42}

# Pre-populate cache that's about to expire (2 seconds remaining of 300s TTL)
cache.store["popular_key"] = (
    {"user": "Alice", "score": 42},
    time.time() + 2,   # Expires in 2 seconds
    300                  # Original TTL was 300s
)

# Simulate 100 near-simultaneous requests
results = {"HIT": 0, "EARLY_REFRESH": 0, "EXPIRED": 0, "MISS": 0}
for _ in range(100):
    _, status = cache.get("popular_key", expensive_db_query)
    results[status] += 1

print("=== Probabilistic Early Expiration Results ===")
print(f"  Total requests:    100")
print(f"  Cache hits:        {results['HIT']}")
print(f"  Early refreshes:   {results['EARLY_REFRESH']}")
print(f"  Actual expires:    {results['EXPIRED']}")
print(f"\n  Without PER: ALL 100 requests would hit DB simultaneously")
print(f"  With PER:    Only {results['EARLY_REFRESH']} request(s) refreshed early")
print(f"               Remaining {results['HIT']} got fast cache hits")

Cache Consistency in Distributed Systems

When you have multiple application servers each with local caches, or a distributed cache cluster (Redis Cluster), keeping caches consistent becomes a coordination problem:

  • Write-invalidate across nodes: When one server updates the DB, it must notify all other servers to invalidate their local copies. Use Redis pub/sub or a dedicated invalidation channel.
  • Lease-based consistency: Cache entries include a "lease" (short-lived token). To refresh, you must present the lease. If a stale entry tries to refresh with an expired lease, it must re-read from the source. Facebook's Memcached uses this pattern.
  • Versioned cache reads: Include a version counter in the database row and in the cache entry. On read, compare versions — if cache version is stale, treat as miss.
The Fundamental Cache Consistency Dilemma: You can have strong consistency (cache always matches DB) or high availability (cache never blocks on DB), but maximizing both is impossible in the presence of network partitions. This is CAP theorem applied to caching. Most systems choose eventual consistency with bounded staleness — data is at most N seconds old.
# Cache Hit Rate Monitoring — The #1 Cache Health Metric
# Target: >95% hit rate for application caches, >80% for CDN

# Redis cache statistics
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"
# keyspace_hits:4892731
# keyspace_misses:231456

# Calculate hit rate
redis-cli INFO stats | awk -F: '
    /keyspace_hits/ {hits=$2}
    /keyspace_misses/ {misses=$2}
    END {
        total = hits + misses
        rate = (total > 0) ? hits/total * 100 : 0
        printf "Cache Hit Rate: %.2f%% (%d hits, %d misses)\n", rate, hits, misses
        if (rate < 90) print "⚠️  WARNING: Hit rate below 90% — investigate!"
        if (rate < 80) print "🔴 CRITICAL: Hit rate below 80% — cache may be misconfigured"
    }
'

# Monitor evictions (sign of insufficient memory)
redis-cli INFO stats | grep evicted_keys
# If evicted_keys is growing: increase maxmemory or review TTL strategy

# Memory usage breakdown
redis-cli INFO memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation"
# used_memory_human: 2.45G
# maxmemory_human: 4.00G
# mem_fragmentation_ratio: 1.12  (ideal: 1.0-1.5; >2.0 = fragmented)

# Top keys by memory usage (identify cache hogs)
redis-cli --bigkeys
# Shows largest keys per data type — useful for identifying
# keys that should be split or have shorter TTLs

echo ""
echo "=== CDN Cache Hit Rate (Cloudflare example) ==="
echo "Check via API or dashboard:"
echo "  Bandwidth saved (cached): target > 80%"
echo "  Requests cached: target > 85%"
echo "  If low: check Cache-Control headers, vary headers, query string handling"

Case Studies

Facebook's Memcached at Scale (TAO)

Case Study 2008 – Present
Facebook TAO: Billions of Reads, Millions of Writes

Facebook's social graph (users, friendships, posts, likes) generates billions of read queries per second. Their caching architecture evolved from simple Memcached to a custom distributed cache called TAO (The Associations and Objects cache).

The Problem:

  • MySQL couldn't handle the read volume (billions/sec), but writes are manageable (millions/sec)
  • Simple Memcached caching had thundering herd problems — hot keys (celebrity profiles, viral posts) caused stampedes
  • Cache invalidation across data centers introduced race conditions (write in DC-1, stale read from cache in DC-2)

TAO's Architecture:

  • Read-through cache: Clients never talk to MySQL directly. TAO handles miss → DB read → cache populate transparently.
  • Leader-follower per region: One TAO leader per region handles writes and invalidations. Followers serve reads. Cross-region invalidation flows through leaders.
  • Lease mechanism: On cache miss, client gets a "lease" (token). Only the lease holder can populate the cache. All other concurrent misses wait — eliminates thundering herd completely.
  • Bounded staleness: Cross-region replication has a known delay (typically <1 second). The system accepts this staleness for read availability.
  • Graph-aware cache: TAO understands objects and associations natively. "Get all friends of user:42" is a single cache operation, not N individual key lookups.

Scale: TAO handles >10 billion reads/second with a 99.9%+ cache hit rate. A single cache miss costs ~1ms (Memcached) vs ~10ms (MySQL). Without TAO, Facebook would need 1000× more database capacity.

Memcached Thundering Herd Lease

Cloudflare CDN: Caching at the Edge

Case Study 2010 – Present
Cloudflare — 300+ PoPs, Millisecond Global Response

Cloudflare operates 300+ Points of Presence (PoPs) globally, caching and serving content within milliseconds of any user. Their architecture demonstrates CDN caching at internet scale.

Architecture Highlights:

  • Tiered caching: When a PoP has a cache miss, it doesn't go directly to the origin server. It first checks a regional "upper-tier" cache. Only if the upper tier also misses does the request reach the origin. This reduces origin load by ~90%.
  • Argo Smart Routing: When requests must reach the origin, Cloudflare uses real-time network intelligence to route through the fastest path — not the shortest geographic distance, but the lowest-latency network path.
  • Cache Reserve (R2-backed): Persistent cache tier backed by object storage. Even after TTL expiry, content can be served from Cache Reserve while revalidating with the origin — implementing stale-while-revalidate at CDN scale.
  • Workers at the edge: Custom JavaScript/WASM runs at every PoP. Enables cache key manipulation, conditional caching logic, and A/B testing — all at the edge without origin round trips.
  • Cache Everything with custom TTLs: By default, Cloudflare only caches static assets. With page rules or Cache-Control headers, any response (including API responses) can be cached with fine-grained TTL control.

Stampede Prevention: Cloudflare implements "request coalescing" — when multiple requests for the same uncached resource arrive simultaneously, only ONE request is forwarded to the origin. All other requests wait for that single response and share it. This eliminates the thundering herd at the CDN layer.

Lesson: CDN caching isn't just for static files. With proper Cache-Control headers and edge logic, even dynamic API responses can be cached at the edge — turning a 200ms origin response into a 5ms edge response.

CDN Edge Computing Tiered Caching

Conclusion & Next Steps

Modules 16 and 17 covered the two most powerful performance multipliers in distributed systems — message queues and caches.

The key takeaways:

  • Little's Law (L = λW) is universal. Use it for capacity planning — it tells you exactly how many concurrent connections, threads, or consumers you need. No assumptions about distributions required.
  • The hockey stick curve is your enemy. Latency explodes as utilization approaches 100%. Target 70% for latency-sensitive services. Auto-scale at 75%, not 90%.
  • Backpressure prevents cascading failure. A system that can push back (reject, shed load, signal upstream) survives overload. A system that blindly accepts everything eventually crashes under its own queue.
  • Dead-letter queues isolate poison messages. Without DLQs, one malformed message can block an entire queue indefinitely. Always configure a DLQ with alerting.
  • Cache-aside is the default pattern. Unless you have specific requirements for write-through or write-behind, cache-aside (lazy loading) gives the best balance of simplicity, performance, and fault tolerance.
  • Cache stampedes are real. Probabilistic early expiration, mutex locks, or stale-while-revalidate — pick one. If your cache serves a hot key with 1000+ requests/second, a naive TTL expiration will DDoS your own database.
  • Cache invalidation is a coordination problem. Accept bounded staleness rather than pursuing impossible strong consistency across distributed caches. Design for eventual consistency with clear staleness bounds.

Next in the Series

In Part 11: Distributed Data & Internet-Scale, we'll explore how to design data systems that operate across multiple regions — distributed consensus (Raft, Paxos), replication strategies (leader-follower, multi-leader, leaderless), sharding patterns, and the engineering behind internet-scale systems serving billions of users.