
System Design Series Part 3: Load Balancing & Caching

January 25, 2026 · Wasil Zafar · 35 min read

Master load balancing algorithms and caching strategies for building high-performance distributed systems. Learn round-robin, least connections, Redis, Memcached, and CDN implementation patterns.

Table of Contents

  1. Load Balancing
  2. Caching Strategies
  3. Content Delivery Networks

Load Balancing

Series Navigation: This is Part 3 of the 15-part System Design Series. Review Part 2: Scalability first.

Load balancing is the process of distributing network traffic across multiple servers to ensure no single server bears too much demand. This improves responsiveness and availability of applications.

Key Insight: A good load balancer is invisible to users—it seamlessly routes requests while handling server failures and traffic spikes.

Why Load Balancing Matters

Without load balancing, a single server handles all incoming requests, creating a single point of failure. When traffic exceeds the server's capacity or the server fails, your entire application becomes unavailable.

Load balancers solve this by:

  • Distributing traffic across multiple servers
  • Detecting failures and routing around unhealthy servers
  • Enabling scaling by adding/removing servers dynamically
  • Improving performance through optimal server selection
  • Providing SSL termination to offload encryption from backend servers

Types of Load Balancers

Load balancers operate at different layers of the OSI model:

Layer 4 (Transport Layer) Load Balancer

Operates at the TCP/UDP level. Routes traffic based on IP address and port number without inspecting packet contents.

  • Pros: Very fast, low latency, simple to configure
  • Cons: No content-based routing, limited visibility
  • Use cases: High-throughput applications, database connections
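NGINX can also act as a Layer 4 balancer through its stream module. A minimal sketch (the hostnames and the PostgreSQL port here are illustrative):

```nginx
# Layer 4 (TCP) load balancing with the NGINX stream module
stream {
    upstream postgres_servers {
        server db1.example.com:5432;
        server db2.example.com:5432;
    }

    server {
        listen 5432;
        proxy_pass postgres_servers;  # raw TCP forwarding, no content inspection
    }
}
```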

Layer 7 (Application Layer) Load Balancer

Operates at the HTTP/HTTPS level. Can inspect request content and make intelligent routing decisions.

  • Pros: Content-based routing, SSL termination, request modification
  • Cons: Higher latency, more resource-intensive
  • Use cases: Web applications, API gateways, microservices
# NGINX Layer 7 Load Balancer Configuration
upstream api_servers {
    server api1.example.com:8080;
    server api2.example.com:8080;
    server api3.example.com:8080;
}

upstream static_servers {
    server static1.example.com:80;
    server static2.example.com:80;
}

server {
    listen 443 ssl;
    server_name example.com;
    
    # Route API requests to API servers
    location /api/ {
        proxy_pass http://api_servers;
    }
    
    # Route static content to static servers
    location /static/ {
        proxy_pass http://static_servers;
    }
}

Global vs. Local Load Balancing

| Feature | Local Load Balancing | Global Load Balancing (GSLB) |
| --- | --- | --- |
| Scope | Single data center | Multiple data centers/regions |
| Routing Decision | Server health, capacity | Geographic location, latency, availability |
| Technology | HAProxy, NGINX, AWS ALB | DNS-based, Cloudflare, AWS Route 53 |
| Failover | Within data center | Across regions/continents |
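The core of a GSLB routing decision can be sketched as picking the healthy region with the lowest measured latency to the client. The region names and latency numbers below are illustrative:

```python
# Simplified GSLB routing: choose the healthy region with the lowest
# measured latency to the client (all names and numbers are illustrative).
def pick_region(latency_ms, healthy):
    """latency_ms: {region: latency}; healthy: set of healthy regions."""
    candidates = {r: l for r, l in latency_ms.items() if r in healthy}
    if not candidates:
        return None  # total outage: nothing to route to
    return min(candidates, key=candidates.get)

latencies = {'us-east': 40, 'eu-west': 120, 'ap-south': 210}
print(pick_region(latencies, {'us-east', 'eu-west', 'ap-south'}))  # us-east
print(pick_region(latencies, {'eu-west', 'ap-south'}))  # failover: eu-west
```

Real GSLB systems layer on health checks, capacity limits, and DNS TTL handling, but the selection step reduces to this kind of filter-then-minimize decision.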

Load Balancing Algorithms

The choice of algorithm significantly impacts how traffic is distributed and overall system performance.

Round Robin

Distributes requests sequentially across servers in a circular manner. Simple and effective when all servers have equal capacity.

# Round Robin Implementation
class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.current_index = 0
    
    def get_server(self):
        server = self.servers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.servers)
        return server

# Usage
balancer = RoundRobinBalancer(['server1', 'server2', 'server3'])
for i in range(6):
    print(f"Request {i+1} -> {balancer.get_server()}")
# Output: server1, server2, server3, server1, server2, server3

Best for: Homogeneous server environments with similar capacity.


Weighted Round Robin

Similar to round robin but assigns weights based on server capacity. Higher-capacity servers receive more requests.

# Weighted Round Robin Implementation
class WeightedRoundRobinBalancer:
    def __init__(self, servers_with_weights):
        # servers_with_weights: [('server1', 3), ('server2', 2), ('server3', 1)]
        self.servers = []
        for server, weight in servers_with_weights:
            self.servers.extend([server] * weight)
        self.current_index = 0
    
    def get_server(self):
        server = self.servers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.servers)
        return server

# Usage: server1 gets 3x traffic, server2 gets 2x, server3 gets 1x
balancer = WeightedRoundRobinBalancer([
    ('powerful-server', 5),
    ('medium-server', 3),
    ('small-server', 1)
])

Best for: Heterogeneous environments with varying server capacities.


Least Connections

Routes new requests to the server with the fewest active connections. Ideal when request processing times vary significantly.

# Least Connections Implementation
import threading

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.servers = {server: 0 for server in servers}
        self.lock = threading.Lock()
    
    def get_server(self):
        with self.lock:
            # Find server with minimum connections
            server = min(self.servers, key=self.servers.get)
            self.servers[server] += 1
            return server
    
    def release_server(self, server):
        with self.lock:
            self.servers[server] = max(0, self.servers[server] - 1)

# Usage
balancer = LeastConnectionsBalancer(['server1', 'server2', 'server3'])
server = balancer.get_server()
# ... process request ...
balancer.release_server(server)

Best for: Variable request processing times, long-lived connections.


IP Hash

Uses a hash of the client's IP address to determine which server receives the request. Ensures the same client always reaches the same server (session persistence).

# IP Hash Implementation
import hashlib

class IPHashBalancer:
    def __init__(self, servers):
        self.servers = servers
    
    def get_server(self, client_ip):
        # Create hash of client IP
        hash_value = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        # Map to server index
        server_index = hash_value % len(self.servers)
        return self.servers[server_index]

# Usage
balancer = IPHashBalancer(['server1', 'server2', 'server3'])
print(balancer.get_server('192.168.1.100'))  # Always same server
print(balancer.get_server('10.0.0.50'))      # May be different server

Best for: Applications requiring session persistence without sticky sessions.


Consistent Hashing

Advanced hashing that minimizes redistribution when servers are added or removed. Essential for distributed caches.

# Consistent Hashing Implementation
import hashlib
from bisect import bisect_left

class ConsistentHashBalancer:
    def __init__(self, servers, virtual_nodes=100):
        self.ring = []
        self.server_map = {}
        
        for server in servers:
            for i in range(virtual_nodes):
                key = f"{server}:{i}"
                hash_val = self._hash(key)
                self.ring.append(hash_val)
                self.server_map[hash_val] = server
        
        self.ring.sort()
    
    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def get_server(self, key):
        if not self.ring:
            return None
        
        hash_val = self._hash(key)
        idx = bisect_left(self.ring, hash_val) % len(self.ring)
        return self.server_map[self.ring[idx]]

# When a server is added/removed, only ~1/N keys are redistributed

Best for: Distributed caching, sharded databases, dynamic server pools.


Algorithm Comparison

| Algorithm | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Round Robin | Equal servers | Simple, predictable | Ignores server load |
| Weighted Round Robin | Mixed capacity | Capacity-aware | Static weights |
| Least Connections | Variable load | Adaptive | Connection-tracking overhead |
| IP Hash | Session persistence | Consistent routing | Uneven distribution |
| Consistent Hash | Dynamic pools | Minimal redistribution | Complex implementation |

Health Checks

Health checks ensure load balancers only route traffic to healthy servers. There are two main approaches:

Passive Health Checks

Monitor ongoing traffic to detect failures. If a server returns errors or times out, it's marked unhealthy.

# NGINX Passive Health Check
upstream backend {
    server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
    server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
}
# After 3 failures, server is marked down for 30 seconds

Active Health Checks

Periodically probe servers with dedicated health endpoints to verify availability.

# Python Health Check Endpoint
from flask import Flask, jsonify
import psutil

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Comprehensive health check endpoint"""
    health = {
        'status': 'healthy',
        'checks': {
            'cpu_percent': psutil.cpu_percent(),
            'memory_percent': psutil.virtual_memory().percent,
            'disk_percent': psutil.disk_usage('/').percent,
            'database': check_database_connection(),
            'cache': check_cache_connection()
        }
    }
    
    # Mark unhealthy if any resource is critical
    if (health['checks']['cpu_percent'] > 90 or
        health['checks']['memory_percent'] > 90):
        health['status'] = 'unhealthy'
        return jsonify(health), 503
    
    return jsonify(health), 200

def check_database_connection():
    try:
        # Attempt database query
        return 'connected'
    except Exception:
        return 'disconnected'

def check_cache_connection():
    try:
        # Attempt cache ping
        return 'connected'
    except Exception:
        return 'disconnected'
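On the load balancer side, active checking comes down to probing each server on an interval and applying a failure threshold before marking it down, mirroring NGINX's max_fails behavior. A minimal sketch of that bookkeeping; the actual probe (e.g. an HTTP GET to /health) is injected by the caller and is an assumption here:

```python
# Active health-check bookkeeping: a server is marked down after
# `max_fails` consecutive failed probes and recovers on one success.
class HealthTracker:
    def __init__(self, servers, max_fails=3):
        self.max_fails = max_fails
        self.fails = {s: 0 for s in servers}

    def record(self, server, probe_ok):
        if probe_ok:
            self.fails[server] = 0      # one success resets the counter
        else:
            self.fails[server] += 1

    def healthy_servers(self):
        return [s for s, f in self.fails.items() if f < self.max_fails]

tracker = HealthTracker(['api1', 'api2'], max_fails=3)
for _ in range(3):
    tracker.record('api2', probe_ok=False)  # three consecutive failures
print(tracker.healthy_servers())  # ['api1']
```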

Best Practice: Use both passive and active health checks. Active checks catch issues before traffic is affected; passive checks provide real-time failure detection.

Caching Strategies

Caching stores frequently accessed data in fast storage (memory) to reduce latency and database load. It's one of the most impactful optimizations in system design.

Why Caching Matters

Consider these typical latency numbers:

  • L1 cache reference: 0.5 ns
  • L2 cache reference: 7 ns
  • Main memory reference: 100 ns
  • SSD read: 150,000 ns (150 µs)
  • Network round trip: 500,000 ns (500 µs)
  • Database query: 10,000,000 ns (10 ms)

Impact Example: If 90% of requests are cache hits (100 µs) instead of database queries (10 ms), you've reduced average latency from 10 ms to 1.09 ms—a 9x improvement!
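The arithmetic behind that estimate can be checked directly:

```python
# Expected latency with a 90% cache hit rate, assuming a 100 µs cache hit
# and a 10 ms database query (the numbers from the list above).
hit_rate = 0.90
cache_hit_ms = 0.1   # 100 µs
db_query_ms = 10.0   # 10 ms

avg_ms = hit_rate * cache_hit_ms + (1 - hit_rate) * db_query_ms
print(f"{avg_ms:.2f} ms")                     # 1.09 ms
print(f"{db_query_ms / avg_ms:.1f}x faster")  # ~9.2x
```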

Cache Layers

| Layer | Location | Latency | Example |
| --- | --- | --- | --- |
| Browser Cache | Client | ~1 ms | HTTP cache headers, localStorage |
| CDN Cache | Edge | ~10-50 ms | Cloudflare, CloudFront |
| Application Cache | Server | ~1-10 ms | Redis, Memcached |
| Database Cache | Database | ~5-20 ms | MySQL query cache, PostgreSQL buffer |

Cache Patterns

Different caching patterns serve different consistency and performance requirements.

Cache-Aside (Lazy Loading)

Application checks cache first; on miss, fetches from database and populates cache.

# Cache-Aside Pattern
import redis
import json

class CacheAsideService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
        self.cache_ttl = 3600  # 1 hour
    
    def get_user(self, user_id):
        cache_key = f"user:{user_id}"
        
        # 1. Check cache first
        cached = self.cache.get(cache_key)
        if cached:
            return json.loads(cached)
        
        # 2. Cache miss - fetch from database
        user = self.database.query("SELECT * FROM users WHERE id = %s", (user_id,))  # parameterized to avoid SQL injection
        
        # 3. Populate cache for next time
        self.cache.setex(cache_key, self.cache_ttl, json.dumps(user))
        
        return user
    
    def update_user(self, user_id, data):
        # Update database first
        self.database.update("users", user_id, data)
        
        # Invalidate cache (will be repopulated on next read)
        self.cache.delete(f"user:{user_id}")

Pros: Simple, only caches what's needed, resilient to cache failures
Cons: Cache miss penalty, potential for stale data


Write-Through

Data is written to cache and database simultaneously. Ensures cache is always consistent.

# Write-Through Pattern
import json
import redis

class WriteThroughService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
    
    def save_user(self, user_id, data):
        cache_key = f"user:{user_id}"
        
        # Write to both cache and database atomically
        try:
            # Write to database
            self.database.insert_or_update("users", user_id, data)
            
            # Write to cache (only if DB write succeeds)
            self.cache.set(cache_key, json.dumps(data))
            
            return True
        except Exception as e:
            # If anything fails, invalidate cache
            self.cache.delete(cache_key)
            raise e
    
    def get_user(self, user_id):
        # Cache is always up-to-date
        cached = self.cache.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
        
        # Fallback to database (should rarely happen)
        return self.database.query("SELECT * FROM users WHERE id = %s", (user_id,))

Pros: Strong consistency, no stale data
Cons: Higher write latency, writes to cache even if data never read


Write-Behind (Write-Back)

Data is written to cache immediately, then asynchronously persisted to database. Improves write performance but risks data loss.

# Write-Behind Pattern with Queue
import threading
import queue
import time
import json
import redis

class WriteBehindService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
        self.write_queue = queue.Queue()
        self._start_background_writer()
    
    def _start_background_writer(self):
        def writer():
            while True:
                try:
                    # Batch writes every 100ms
                    time.sleep(0.1)
                    batch = []
                    while not self.write_queue.empty() and len(batch) < 100:
                        batch.append(self.write_queue.get_nowait())
                    
                    if batch:
                        self.database.batch_insert("users", batch)
                except Exception as e:
                    print(f"Write-behind error: {e}")
        
        thread = threading.Thread(target=writer, daemon=True)
        thread.start()
    
    def save_user(self, user_id, data):
        # Immediately write to cache
        self.cache.set(f"user:{user_id}", json.dumps(data))
        
        # Queue for async database write
        self.write_queue.put((user_id, data))
        
        return True  # Return immediately

Pros: Very fast writes, batching reduces database load
Cons: Risk of data loss, eventual consistency


Refresh-Ahead

Proactively refresh cache entries before they expire, preventing cache misses.

# Refresh-Ahead Pattern
import threading
import json
import redis

class RefreshAheadService:
    def __init__(self):
        self.cache = redis.Redis(host='localhost', port=6379)
        self.cache_ttl = 3600  # 1 hour
        self.refresh_threshold = 0.75  # Refresh when 75% expired
    
    def get_user(self, user_id):
        cache_key = f"user:{user_id}"
        
        # Get cached value with TTL
        cached = self.cache.get(cache_key)
        ttl = self.cache.ttl(cache_key)
        
        if cached:
            # Check if approaching expiration
            if ttl < self.cache_ttl * (1 - self.refresh_threshold):
                # Trigger async refresh
                threading.Thread(
                    target=self._refresh_cache,
                    args=(user_id,)
                ).start()
            
            return json.loads(cached)
        
        # Cache miss - fetch and cache
        return self._fetch_and_cache(user_id)
    
    def _refresh_cache(self, user_id):
        """Background refresh without blocking the request"""
        self._fetch_and_cache(user_id)
    
    def _fetch_and_cache(self, user_id):
        user = self.database.query("SELECT * FROM users WHERE id = %s", (user_id,))
        self.cache.setex(f"user:{user_id}", self.cache_ttl, json.dumps(user))
        return user

Pros: Minimizes cache misses, predictable latency
Cons: Complex implementation, may refresh unused data


Cache Eviction Policies

When cache is full, eviction policies determine which entries to remove:

  • LRU (Least Recently Used): Evicts the least recently accessed items. Most common choice.
  • LFU (Least Frequently Used): Evicts items accessed least often. Good for static popularity distributions.
  • FIFO (First In First Out): Evicts oldest items. Simple but not optimal.
  • TTL (Time To Live): Items expire after set duration. Good for time-sensitive data.
  • Random: Randomly evicts items. Surprisingly effective and simple.
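As an illustration, LRU can be implemented in a few lines with an ordered dictionary — a simplified sketch, not how production caches like Redis implement it internally:

```python
from collections import OrderedDict

# Minimal LRU cache: recently used entries move to the end of the dict,
# and the least recently used entry (at the front) is evicted when full.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')        # 'a' is now the most recently used
cache.put('c', 3)     # evicts 'b', the least recently used
print(list(cache.data))  # ['a', 'c']
```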

Redis & Memcached

The two most popular distributed caching solutions have different strengths:

| Feature | Redis | Memcached |
| --- | --- | --- |
| Data Types | Strings, Lists, Sets, Hashes, Sorted Sets, Streams | Strings only |
| Persistence | RDB snapshots, AOF logs | None (pure cache) |
| Replication | Primary-replica, Redis Cluster | None built-in |
| Memory Efficiency | Good | Excellent (slab allocation) |
| Use Cases | Sessions, leaderboards, pub/sub, queues | Simple caching, high throughput |

Redis Examples

# Redis Advanced Use Cases
import time
import uuid
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# 1. Rate Limiting with Sliding Window
def is_rate_limited(user_id, limit=100, window_seconds=60):
    key = f"rate_limit:{user_id}"
    current_time = int(time.time())
    window_start = current_time - window_seconds
    
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # Remove old entries
    pipe.zadd(key, {str(current_time): current_time})  # Add current
    pipe.zcard(key)  # Count entries
    pipe.expire(key, window_seconds)  # Set expiry
    results = pipe.execute()
    
    return results[2] > limit  # True if rate limited

# 2. Distributed Lock
def acquire_lock(lock_name, timeout=10):
    lock_key = f"lock:{lock_name}"
    identifier = str(uuid.uuid4())
    
    if r.set(lock_key, identifier, nx=True, ex=timeout):
        return identifier  # Lock acquired
    return None  # Lock held by another process

def release_lock(lock_name, identifier):
    lock_key = f"lock:{lock_name}"
    # Lua script for atomic check-and-delete
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    r.eval(script, 1, lock_key, identifier)

# 3. Leaderboard with Sorted Sets
def update_score(user_id, score):
    r.zadd("leaderboard", {user_id: score})

def get_top_players(count=10):
    return r.zrevrange("leaderboard", 0, count-1, withscores=True)

def get_user_rank(user_id):
    rank = r.zrevrank("leaderboard", user_id)
    return rank + 1 if rank is not None else None  # 1-indexed; None if absent

When to Choose: Use Redis when you need rich data structures, persistence, or pub/sub. Use Memcached for simple key-value caching at massive scale with minimal overhead.

Content Delivery Networks

A CDN is a geographically distributed network of proxy servers that cache content close to end users, reducing latency and offloading traffic from origin servers.

How CDNs Work

  1. User requests content (e.g., image, video, JavaScript file)
  2. DNS routes to nearest edge server (Points of Presence - PoPs)
  3. Edge server checks cache:
    • Cache hit: Returns content immediately (~10-50ms)
    • Cache miss: Fetches from origin, caches, then returns
  4. Content served to user from geographically close server
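The flow above can be sketched as a pull CDN's edge cache: serve from cache on a hit, otherwise fetch from the origin and store the result. The origin fetcher here is a stand-in, and real edge caches add TTLs and eviction:

```python
# Simplified pull-CDN edge cache: on a miss, fetch from origin,
# cache the response, and serve it; later requests hit the cache.
class EdgeCache:
    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin  # injected origin fetcher
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1        # cache hit: ~10-50 ms on a real CDN
            return self.store[url]
        self.misses += 1          # cache miss: go back to the origin
        content = self.fetch_from_origin(url)
        self.store[url] = content
        return content

edge = EdgeCache(lambda url: f"<contents of {url}>")
edge.get('/static/logo.png')   # miss: fetched from origin
edge.get('/static/logo.png')   # hit: served from the edge
print(edge.hits, edge.misses)  # 1 1
```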

CDN Benefits

  • Reduced Latency: Content served from edge servers close to users
  • Origin Offload: CDN handles majority of traffic
  • DDoS Protection: Distributed infrastructure absorbs attacks
  • High Availability: Redundant edge servers ensure uptime
  • Cost Savings: Reduced bandwidth from origin

Push vs. Pull CDNs

| Feature | Pull CDN | Push CDN |
| --- | --- | --- |
| Content Population | On-demand (first request triggers fetch) | Pre-uploaded to edge servers |
| Best For | Dynamic content, frequently changing sites | Static content, known assets |
| Traffic Spikes | May cause origin overload on cache miss | Handles spikes well (content pre-cached) |
| Storage Cost | Only popular content cached | All content replicated everywhere |

CDN Cache Headers

Control CDN caching behavior with HTTP headers:

# Setting Cache Headers in Python/Flask
from flask import Flask, make_response

app = Flask(__name__)

@app.route('/static/image.jpg')
def serve_image():
    response = make_response(get_image())
    
    # Cache for 1 year (immutable assets with versioned URLs)
    response.headers['Cache-Control'] = 'public, max-age=31536000, immutable'
    
    return response

@app.route('/api/user-profile')
def user_profile():
    response = make_response(get_user_data())
    
    # Private data - don't cache on CDN, only browser
    response.headers['Cache-Control'] = 'private, max-age=300'
    
    return response

@app.route('/api/prices')
def prices():
    response = make_response(get_prices())
    
    # Revalidate every request (CDN checks if-modified-since)
    response.headers['Cache-Control'] = 'public, no-cache'
    response.headers['ETag'] = calculate_etag(get_prices())
    
    return response

Common CDN Providers

  • Cloudflare: Global network, DDoS protection, edge computing (Workers)
  • AWS CloudFront: Deep AWS integration, Lambda@Edge for edge compute
  • Akamai: Largest network, enterprise-focused, advanced security
  • Fastly: Real-time purging, edge computing (Compute@Edge)
  • Google Cloud CDN: GCP integration, Anycast network

Real-World Impact: Netflix uses its own CDN (Open Connect) with servers in ISP networks, achieving 95%+ cache hit rates and reducing origin traffic by orders of magnitude.

Cache Invalidation Strategies

The famous saying "There are only two hard things in Computer Science: cache invalidation and naming things" highlights the challenge of keeping caches current:

  • TTL-based: Content expires after set duration. Simple but may serve stale content.
  • Versioned URLs: Include version/hash in URL (e.g., style.v2.css, bundle.a1b2c3.js). Instant invalidation by changing URL.
  • Purge API: Explicitly invalidate specific URLs or patterns via CDN API.
  • Tag-based: Tag content (e.g., "product-123") and purge by tag when data changes.
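For example, versioned URLs can be generated from a short content hash, so any change to the file produces a new URL and stale CDN copies are simply never requested again:

```python
import hashlib

# Versioned asset URL: embed a short content hash in the filename so a
# changed file gets a brand-new URL, bypassing stale CDN copies.
def versioned_url(path, content):
    digest = hashlib.md5(content).hexdigest()[:8]
    name, ext = path.rsplit('.', 1)
    return f"{name}.{digest}.{ext}"

print(versioned_url('static/bundle.js', b'console.log("v1");'))
print(versioned_url('static/bundle.js', b'console.log("v2");'))  # different URL
```

Build tools like webpack and Vite apply this same idea automatically when emitting hashed asset filenames.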

Next Steps

Continue with Part 4 of the 15-part System Design Series.
