
System Design Series Part 3: Load Balancing & Caching

January 25, 2026 · Wasil Zafar · 35 min read

Master load balancing algorithms and caching strategies for building high-performance distributed systems. Learn round-robin, least connections, Redis, Memcached, and CDN implementation patterns.

Table of Contents

  1. Load Balancing
  2. Caching Strategies
  3. Content Delivery Networks

Load Balancing

Series Navigation: This is Part 3 of the 15-part System Design Series. Review Part 2: Scalability first.

Load balancing is the process of distributing network traffic across multiple servers to ensure no single server bears too much demand. This improves responsiveness and availability of applications.

Key Insight: A good load balancer is invisible to users—it seamlessly routes requests while handling server failures and traffic spikes.

Why Load Balancing Matters

Without load balancing, a single server handles all incoming requests, creating a single point of failure. When traffic exceeds the server's capacity or the server fails, your entire application becomes unavailable.

Load balancers solve this by:

  • Distributing traffic across multiple servers
  • Detecting failures and routing around unhealthy servers
  • Enabling scaling by adding/removing servers dynamically
  • Improving performance through optimal server selection
  • Providing SSL termination to offload encryption from backend servers

Types of Load Balancers

Load balancers operate at different layers of the OSI model:

Layer 4 (Transport Layer) Load Balancer

Operates at the TCP/UDP level. Routes traffic based on IP address and port number without inspecting packet contents.

  • Pros: Very fast, low latency, simple to configure
  • Cons: No content-based routing, limited visibility
  • Use cases: High-throughput applications, database connections
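NGINX can also act as a Layer 4 balancer through its stream module. A minimal sketch (the hostnames and the PostgreSQL port here are illustrative):

```nginx
# Layer 4 (TCP) load balancing with the NGINX stream module
stream {
    upstream postgres_servers {
        server db1.example.com:5432;
        server db2.example.com:5432;
    }

    server {
        listen 5432;
        proxy_pass postgres_servers;  # raw TCP forwarding, no content inspection
    }
}
```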

Layer 7 (Application Layer) Load Balancer

Operates at the HTTP/HTTPS level. Can inspect request content and make intelligent routing decisions.

  • Pros: Content-based routing, SSL termination, request modification
  • Cons: Higher latency, more resource-intensive
  • Use cases: Web applications, API gateways, microservices
# NGINX Layer 7 Load Balancer Configuration
upstream api_servers {
    server api1.example.com:8080;
    server api2.example.com:8080;
    server api3.example.com:8080;
}

upstream static_servers {
    server static1.example.com:80;
    server static2.example.com:80;
}

server {
    listen 443 ssl;
    server_name example.com;
    
    # Route API requests to API servers
    location /api/ {
        proxy_pass http://api_servers;
    }
    
    # Route static content to static servers
    location /static/ {
        proxy_pass http://static_servers;
    }
}

Global vs. Local Load Balancing

| Feature | Local Load Balancing | Global Load Balancing (GSLB) |
| --- | --- | --- |
| Scope | Single data center | Multiple data centers/regions |
| Routing Decision | Server health, capacity | Geographic location, latency, availability |
| Technology | HAProxy, NGINX, AWS ALB | DNS-based, Cloudflare, AWS Route 53 |
| Failover | Within data center | Across regions/continents |
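The core of a GSLB routing decision can be sketched as picking the healthy region with the lowest measured latency to the client. The region names and latency numbers below are illustrative:

```python
# Simplified GSLB routing: choose the healthy region with the lowest
# measured latency to the client (all names and numbers are illustrative).
def pick_region(latency_ms, healthy):
    """latency_ms: {region: latency}; healthy: set of healthy regions."""
    candidates = {r: l for r, l in latency_ms.items() if r in healthy}
    if not candidates:
        return None  # total outage: nothing to route to
    return min(candidates, key=candidates.get)

latencies = {'us-east': 40, 'eu-west': 120, 'ap-south': 210}
print(pick_region(latencies, {'us-east', 'eu-west', 'ap-south'}))  # us-east
print(pick_region(latencies, {'eu-west', 'ap-south'}))  # failover: eu-west
```

Real GSLB systems layer on health checks, capacity limits, and DNS TTL handling, but the selection step reduces to this kind of filter-then-minimize decision.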

Load Balancing Algorithms

The choice of algorithm significantly impacts how traffic is distributed and overall system performance.

Round Robin

Distributes requests sequentially across servers in a circular manner. Simple and effective when all servers have equal capacity.

# Round Robin Implementation
class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.current_index = 0
    
    def get_server(self):
        server = self.servers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.servers)
        return server

# Usage
balancer = RoundRobinBalancer(['server1', 'server2', 'server3'])
for i in range(6):
    print(f"Request {i+1} -> {balancer.get_server()}")
# Output: server1, server2, server3, server1, server2, server3

Best for: Homogeneous server environments with similar capacity.


Weighted Round Robin

Similar to round robin but assigns weights based on server capacity. Higher-capacity servers receive more requests.

# Weighted Round Robin Implementation
class WeightedRoundRobinBalancer:
    def __init__(self, servers_with_weights):
        # servers_with_weights: [('server1', 3), ('server2', 2), ('server3', 1)]
        self.servers = []
        for server, weight in servers_with_weights:
            self.servers.extend([server] * weight)
        self.current_index = 0
    
    def get_server(self):
        server = self.servers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.servers)
        return server

# Usage: server1 gets 3x traffic, server2 gets 2x, server3 gets 1x
balancer = WeightedRoundRobinBalancer([
    ('powerful-server', 5),
    ('medium-server', 3),
    ('small-server', 1)
])

Best for: Heterogeneous environments with varying server capacities.


Least Connections

Routes new requests to the server with the fewest active connections. Ideal when request processing times vary significantly.

# Least Connections Implementation
import threading

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.servers = {server: 0 for server in servers}
        self.lock = threading.Lock()
    
    def get_server(self):
        with self.lock:
            # Find server with minimum connections
            server = min(self.servers, key=self.servers.get)
            self.servers[server] += 1
            return server
    
    def release_server(self, server):
        with self.lock:
            self.servers[server] = max(0, self.servers[server] - 1)

# Usage
balancer = LeastConnectionsBalancer(['server1', 'server2', 'server3'])
server = balancer.get_server()
# ... process request ...
balancer.release_server(server)

Best for: Variable request processing times, long-lived connections.


IP Hash

Uses a hash of the client's IP address to determine which server receives the request. Ensures the same client always reaches the same server (session persistence).

# IP Hash Implementation
import hashlib

class IPHashBalancer:
    def __init__(self, servers):
        self.servers = servers
    
    def get_server(self, client_ip):
        # Create hash of client IP
        hash_value = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        # Map to server index
        server_index = hash_value % len(self.servers)
        return self.servers[server_index]

# Usage
balancer = IPHashBalancer(['server1', 'server2', 'server3'])
print(balancer.get_server('192.168.1.100'))  # Always same server
print(balancer.get_server('10.0.0.50'))      # May be different server

Best for: Applications requiring session persistence without sticky sessions.


Consistent Hashing

Advanced hashing that minimizes redistribution when servers are added or removed. Essential for distributed caches.

# Consistent Hashing Implementation
import hashlib
from bisect import bisect_left

class ConsistentHashBalancer:
    def __init__(self, servers, virtual_nodes=100):
        self.ring = []
        self.server_map = {}
        
        for server in servers:
            for i in range(virtual_nodes):
                key = f"{server}:{i}"
                hash_val = self._hash(key)
                self.ring.append(hash_val)
                self.server_map[hash_val] = server
        
        self.ring.sort()
    
    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def get_server(self, key):
        if not self.ring:
            return None
        
        hash_val = self._hash(key)
        idx = bisect_left(self.ring, hash_val) % len(self.ring)
        return self.server_map[self.ring[idx]]

# When a server is added/removed, only ~1/N keys are redistributed

Best for: Distributed caching, sharded databases, dynamic server pools.


Algorithm Comparison

| Algorithm | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Round Robin | Equal servers | Simple, predictable | Ignores server load |
| Weighted Round Robin | Mixed capacity | Capacity-aware | Static weights |
| Least Connections | Variable load | Adaptive | Connection-tracking overhead |
| IP Hash | Session persistence | Consistent routing | Uneven distribution |
| Consistent Hash | Dynamic pools | Minimal redistribution | Complex implementation |

Health Checks

Health checks ensure load balancers only route traffic to healthy servers. There are two main approaches:

Passive Health Checks

Monitor ongoing traffic to detect failures. If a server returns errors or times out, it's marked unhealthy.

# NGINX Passive Health Check
upstream backend {
    server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
}
# After 3 failures, server is marked down for 30 seconds

Active Health Checks

Periodically probe servers with dedicated health endpoints to verify availability.

# Python Health Check Endpoint
from flask import Flask, jsonify
import psutil

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Comprehensive health check endpoint"""
    health = {
        'status': 'healthy',
        'checks': {
            'cpu_percent': psutil.cpu_percent(),
            'memory_percent': psutil.virtual_memory().percent,
            'disk_percent': psutil.disk_usage('/').percent,
            'database': check_database_connection(),
            'cache': check_cache_connection()
        }
    }
    
    # Mark unhealthy if any resource is critical
    if (health['checks']['cpu_percent'] > 90 or
        health['checks']['memory_percent'] > 90):
        health['status'] = 'unhealthy'
        return jsonify(health), 503
    
    return jsonify(health), 200

def check_database_connection():
    try:
        # Attempt database query
        return 'connected'
    except Exception:
        return 'disconnected'

def check_cache_connection():
    try:
        # Attempt cache ping
        return 'connected'
    except Exception:
        return 'disconnected'
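On the load balancer side, active checking comes down to probing each server on an interval and applying a failure threshold before marking it down, mirroring NGINX's max_fails behavior. A minimal sketch of that bookkeeping; the actual probe (e.g. an HTTP GET to /health) is injected by the caller and is an assumption here:

```python
# Active health-check bookkeeping: a server is marked down after
# `max_fails` consecutive failed probes and recovers on one success.
class HealthTracker:
    def __init__(self, servers, max_fails=3):
        self.max_fails = max_fails
        self.fails = {s: 0 for s in servers}

    def record(self, server, probe_ok):
        if probe_ok:
            self.fails[server] = 0      # one success resets the counter
        else:
            self.fails[server] += 1

    def healthy_servers(self):
        return [s for s, f in self.fails.items() if f < self.max_fails]

tracker = HealthTracker(['api1', 'api2'], max_fails=3)
for _ in range(3):
    tracker.record('api2', probe_ok=False)  # three consecutive failures
print(tracker.healthy_servers())  # ['api1']
```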

Best Practice: Use both passive and active health checks. Active checks catch issues before traffic is affected; passive checks provide real-time failure detection.

Caching Strategies

Caching stores frequently accessed data in fast storage (memory) to reduce latency and database load. It's one of the most impactful optimizations in system design.

Why Caching Matters

Consider these typical latency numbers:

  • L1 cache reference: 0.5 ns
  • L2 cache reference: 7 ns
  • Main memory reference: 100 ns
  • SSD read: 150,000 ns (150 µs)
  • Network round trip: 500,000 ns (500 µs)
  • Database query: 10,000,000 ns (10 ms)

Impact Example: If 90% of requests are cache hits (100 µs) instead of database queries (10 ms), you've reduced average latency from 10 ms to 1.09 ms—a 9x improvement!
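The arithmetic behind that estimate can be checked directly:

```python
# Expected latency with a 90% cache hit rate, assuming a 100 µs cache hit
# and a 10 ms database query (the numbers from the list above).
hit_rate = 0.90
cache_hit_ms = 0.1   # 100 µs
db_query_ms = 10.0   # 10 ms

avg_ms = hit_rate * cache_hit_ms + (1 - hit_rate) * db_query_ms
print(f"{avg_ms:.2f} ms")                     # 1.09 ms
print(f"{db_query_ms / avg_ms:.1f}x faster")  # ~9.2x
```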

Cache Layers

| Layer | Location | Latency | Example |
| --- | --- | --- | --- |
| Browser Cache | Client | ~1 ms | HTTP cache headers, localStorage |
| CDN Cache | Edge | ~10-50 ms | Cloudflare, CloudFront |
| Application Cache | Server | ~1-10 ms | Redis, Memcached |
| Database Cache | Database | ~5-20 ms | MySQL query cache, PostgreSQL buffer |

Cache Patterns

Different caching patterns serve different consistency and performance requirements.

Cache-Aside (Lazy Loading)

Application checks cache first; on miss, fetches from database and populates cache.

# Cache-Aside Pattern
import redis
import json

class CacheAsideService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
        self.cache_ttl = 3600  # 1 hour
    
    def get_user(self, user_id):
        cache_key = f"user:{user_id}"
        
        # 1. Check cache first
        cached = self.cache.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # 2. Cache miss - fetch from database
        user = self.database.query("SELECT * FROM users WHERE id = %s", (user_id,))  # parameterized to avoid SQL injection
        
        # 3. Populate cache for next time
        self.cache.setex(cache_key, self.cache_ttl, json.dumps(user))
        
        return user
    
    def update_user(self, user_id, data):
        # Update database first
        self.database.update("users", user_id, data)
        
        # Invalidate cache (will be repopulated on next read)
        self.cache.delete(f"user:{user_id}")

Pros: Simple, only caches what's needed, resilient to cache failures
Cons: Cache miss penalty, potential for stale data


Write-Through

Data is written to cache and database simultaneously. Ensures cache is always consistent.

# Write-Through Pattern
import json
import redis

class WriteThroughService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
    
    def save_user(self, user_id, data):
        cache_key = f"user:{user_id}"
        
        # Write to both cache and database atomically
        try:
            # Write to database
            self.database.insert_or_update("users", user_id, data)
            
            # Write to cache (only if DB write succeeds)
            self.cache.set(cache_key, json.dumps(data))
            
            return True
        except Exception as e:
            # If anything fails, invalidate cache
            self.cache.delete(cache_key)
            raise e
    
    def get_user(self, user_id):
        # Cache is always up-to-date
        cached = self.cache.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
        
        # Fallback to database (should rarely happen)
        return self.database.query("SELECT * FROM users WHERE id = %s", (user_id,))

Pros: Strong consistency, no stale data
Cons: Higher write latency, writes to cache even if data never read


Write-Behind (Write-Back)

Data is written to cache immediately, then asynchronously persisted to database. Improves write performance but risks data loss.

# Write-Behind Pattern with Queue
import threading
import queue
import time
import json
import redis

class WriteBehindService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
        self.write_queue = queue.Queue()
        self._start_background_writer()
    
    def _start_background_writer(self):
        def writer():
            while True:
                try:
                    # Batch writes every 100ms
                    time.sleep(0.1)
                    batch = []
                    while not self.write_queue.empty() and len(batch) < 100:
                        batch.append(self.write_queue.get_nowait())
                    
                    if batch:
                        self.database.batch_insert("users", batch)
                except Exception as e:
                    print(f"Write-behind error: {e}")
        
        thread = threading.Thread(target=writer, daemon=True)
        thread.start()
    
    def save_user(self, user_id, data):
        # Immediately write to cache
        self.cache.set(f"user:{user_id}", json.dumps(data))
        
        # Queue for async database write
        self.write_queue.put((user_id, data))
        
        return True  # Return immediately

Pros: Very fast writes, batching reduces database load
Cons: Risk of data loss, eventual consistency


Refresh-Ahead

Proactively refresh cache entries before they expire, preventing cache misses.

# Refresh-Ahead Pattern
import threading
import json
import redis

class RefreshAheadService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
        self.cache_ttl = 3600  # 1 hour
        self.refresh_threshold = 0.75  # Refresh when 75% expired
    
    def get_user(self, user_id):
        cache_key = f"user:{user_id}"
        
        # Get cached value with TTL
        cached = self.cache.get(cache_key)
        ttl = self.cache.ttl(cache_key)
        
        if cached:
            # Check if approaching expiration
            if ttl < self.cache_ttl * (1 - self.refresh_threshold):
                # Trigger async refresh
                threading.Thread(
                    target=self._refresh_cache,
                    args=(user_id,)
                ).start()
            
            return json.loads(cached)
        
        # Cache miss - fetch and cache
        return self._fetch_and_cache(user_id)
    
    def _refresh_cache(self, user_id):
        """Background refresh without blocking the request"""
        self._fetch_and_cache(user_id)
    
    def _fetch_and_cache(self, user_id):
        user = self.database.query("SELECT * FROM users WHERE id = %s", (user_id,))
        self.cache.setex(f"user:{user_id}", self.cache_ttl, json.dumps(user))
        return user

Pros: Minimizes cache misses, predictable latency
Cons: Complex implementation, may refresh unused data


Cache Eviction Policies

When cache is full, eviction policies determine which entries to remove:

  • LRU (Least Recently Used): Evicts the least recently accessed items. Most common choice.
  • LFU (Least Frequently Used): Evicts items accessed least often. Good for static popularity distributions.
  • FIFO (First In First Out): Evicts oldest items. Simple but not optimal.
  • TTL (Time To Live): Items expire after set duration. Good for time-sensitive data.
  • Random: Randomly evicts items. Surprisingly effective and simple.
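As an illustration, LRU can be implemented in a few lines with an ordered dictionary — a simplified sketch, not how production caches like Redis implement it internally:

```python
from collections import OrderedDict

# Minimal LRU cache: recently used entries move to the end of the dict,
# and the least recently used entry (at the front) is evicted when full.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')        # 'a' is now the most recently used
cache.put('c', 3)     # evicts 'b', the least recently used
print(list(cache.data))  # ['a', 'c']
```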

Redis & Memcached

The two most popular distributed caching solutions have different strengths:

| Feature | Redis | Memcached |
| --- | --- | --- |
| Data Types | Strings, Lists, Sets, Hashes, Sorted Sets, Streams | Strings only |
| Persistence | RDB snapshots, AOF logs | None (pure cache) |
| Replication | Primary-replica, Redis Cluster | None built-in |
| Memory Efficiency | Good | Excellent (slab allocation) |
| Use Cases | Sessions, leaderboards, pub/sub, queues | Simple caching, high throughput |

Redis Examples

# Redis Advanced Use Cases
import time
import uuid
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# 1. Rate Limiting with Sliding Window
def is_rate_limited(user_id, limit=100, window_seconds=60):
    key = f"rate_limit:{user_id}"
    current_time = int(time.time())
    window_start = current_time - window_seconds
    
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # Remove old entries
    pipe.zadd(key, {str(current_time): current_time})  # Add current
    pipe.zcard(key)  # Count entries
    pipe.expire(key, window_seconds)  # Set expiry
    results = pipe.execute()
    
    return results[2] > limit  # True if rate limited

# 2. Distributed Lock
def acquire_lock(lock_name, timeout=10):
    lock_key = f"lock:{lock_name}"
    identifier = str(uuid.uuid4())
    
    if r.set(lock_key, identifier, nx=True, ex=timeout):
        return identifier  # Lock acquired
    return None  # Lock held by another process

def release_lock(lock_name, identifier):
    lock_key = f"lock:{lock_name}"
    # Lua script for atomic check-and-delete
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    r.eval(script, 1, lock_key, identifier)

# 3. Leaderboard with Sorted Sets
def update_score(user_id, score):
    r.zadd("leaderboard", {user_id: score})

def get_top_players(count=10):
    return r.zrevrange("leaderboard", 0, count-1, withscores=True)

def get_user_rank(user_id):
    rank = r.zrevrank("leaderboard", user_id)
    return rank + 1 if rank is not None else None  # 1-indexed; None if absent

When to Choose: Use Redis when you need rich data structures, persistence, or pub/sub. Use Memcached for simple key-value caching at massive scale with minimal overhead.

Content Delivery Networks

A CDN is a geographically distributed network of proxy servers that cache content close to end users, reducing latency and offloading traffic from origin servers.

How CDNs Work

  1. User requests content (e.g., image, video, JavaScript file)
  2. DNS routes to nearest edge server (Points of Presence - PoPs)
  3. Edge server checks cache:
    • Cache hit: Returns content immediately (~10-50ms)
    • Cache miss: Fetches from origin, caches, then returns
  4. Content served to user from geographically close server
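The flow above can be sketched as a pull CDN's edge cache: serve from cache on a hit, otherwise fetch from the origin and store the result. The origin fetcher here is a stand-in, and real edge caches add TTLs and eviction:

```python
# Simplified pull-CDN edge cache: on a miss, fetch from origin,
# cache the response, and serve it; later requests hit the cache.
class EdgeCache:
    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # injected origin fetcher
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1        # cache hit: ~10-50 ms on a real CDN
            return self.store[url]
        self.misses += 1          # cache miss: go back to the origin
        content = self.fetch_from_origin(url)
        self.store[url] = content
        return content

edge = EdgeCache(lambda url: f"<contents of {url}>")
edge.get('/static/logo.png')   # miss: fetched from origin
edge.get('/static/logo.png')   # hit: served from the edge
print(edge.hits, edge.misses)  # 1 1
```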

CDN Benefits

  • Reduced Latency: Content served from edge servers close to users
  • Origin Offload: CDN handles majority of traffic
  • DDoS Protection: Distributed infrastructure absorbs attacks
  • High Availability: Redundant edge servers ensure uptime
  • Cost Savings: Reduced bandwidth from origin

Push vs. Pull CDNs

| Feature | Pull CDN | Push CDN |
| --- | --- | --- |
| Content Population | On-demand (first request triggers fetch) | Pre-uploaded to edge servers |
| Best For | Dynamic content, frequently changing sites | Static content, known assets |
| Traffic Spikes | May cause origin overload on cache miss | Handles spikes well (content pre-cached) |
| Storage Cost | Only popular content cached | All content replicated everywhere |

CDN Cache Headers

Control CDN caching behavior with HTTP headers:

# Setting Cache Headers in Python/Flask
from flask import Flask, make_response

app = Flask(__name__)

@app.route('/static/image.jpg')
def serve_image():
    response = make_response(get_image())
    
    # Cache for 1 year (immutable assets with versioned URLs)
    response.headers['Cache-Control'] = 'public, max-age=31536000, immutable'
    
    return response

@app.route('/api/user-profile')
def user_profile():
    response = make_response(get_user_data())
    
    # Private data - don't cache on CDN, only browser
    response.headers['Cache-Control'] = 'private, max-age=300'
    
    return response

@app.route('/api/prices')
def prices():
    response = make_response(get_prices())
    
    # Revalidate every request (CDN checks if-modified-since)
    response.headers['Cache-Control'] = 'public, no-cache'
    response.headers['ETag'] = calculate_etag(get_prices())
    
    return response

Common CDN Providers

  • Cloudflare: Global network, DDoS protection, edge computing (Workers)
  • AWS CloudFront: Deep AWS integration, Lambda@Edge for edge compute
  • Akamai: Largest network, enterprise-focused, advanced security
  • Fastly: Real-time purging, edge computing (Compute@Edge)
  • Google Cloud CDN: GCP integration, Anycast network

Real-World Impact: Netflix uses its own CDN (Open Connect) with servers in ISP networks, achieving 95%+ cache hit rates and reducing origin traffic by orders of magnitude.

Cache Invalidation Strategies

The famous saying "There are only two hard things in Computer Science: cache invalidation and naming things" highlights the challenge of keeping caches current:

  • TTL-based: Content expires after set duration. Simple but may serve stale content.
  • Versioned URLs: Include version/hash in URL (e.g., style.v2.css, bundle.a1b2c3.js). Instant invalidation by changing URL.
  • Purge API: Explicitly invalidate specific URLs or patterns via CDN API.
  • Tag-based: Tag content (e.g., "product-123") and purge by tag when data changes.
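For example, versioned URLs can be generated from a short content hash, so any change to the file produces a new URL and stale CDN copies are simply never requested again:

```python
import hashlib

# Versioned asset URL: embed a short content hash in the filename so a
# changed file gets a brand-new URL, bypassing stale CDN copies.
def versioned_url(path, content):
    digest = hashlib.md5(content).hexdigest()[:8]
    name, ext = path.rsplit('.', 1)
    return f"{name}.{digest}.{ext}"

print(versioned_url('static/bundle.js', b'console.log("v1");'))
print(versioned_url('static/bundle.js', b'console.log("v2");'))  # different URL
```

Build tools like webpack and Vite apply this same idea automatically when emitting hashed asset filenames.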

Next Steps

Continue with Part 4 of the 15-part System Design Series.
