
Complete Database Mastery Part 11: Scaling & Distributed Systems

January 31, 2026 Wasil Zafar 45 min read

Master database scaling and distributed systems. Learn replication architectures, sharding strategies, CAP theorem, consistent hashing, distributed consensus algorithms, and horizontal vs vertical scaling techniques.

Table of Contents

  1. Introduction
  2. Scaling Approaches
  3. Replication Architectures
  4. Sharding Strategies
  5. CAP & PACELC
  6. Distributed Consensus
  7. Case Studies
  8. Conclusion & Next Steps

Introduction: Why Scaling Matters

Database scaling enables systems to handle growing workloads. As data volumes and user counts increase, understanding horizontal and vertical scaling, replication, and sharding becomes essential for system architects.

Series Context: This is Part 11 of 15 in the Complete Database Mastery series. We're tackling the architecture challenges of large-scale systems.

Think of scaling like upgrading a restaurant: Vertical scaling = get a bigger kitchen. Horizontal scaling = open more locations. Both have tradeoffs.

Scaling Approaches

Vertical Scaling (Scale Up)

Add more power to a single machine: more CPU, RAM, faster storage.

Vertical Scaling Characteristics

| Pros | Cons |
| --- | --- |
| Simple: no code changes | Hardware limits (can't scale forever) |
| No distributed complexity | Single point of failure |
| ACID transactions work normally | Expensive at the high end |
| Lower latency (no network hops) | Downtime during upgrades |

Horizontal Scaling (Scale Out)

Add more machines to distribute the load. More complex but virtually unlimited.

-- Example: the application routes queries based on user ID
-- Shard 1: users 1-1,000,000
-- Shard 2: users 1,000,001-2,000,000
-- etc.

-- Application logic determines routing (pseudocode, integer division)
-- user_id = 1500000
-- shard = ((user_id - 1) / 1000000) + 1   -- Shard 2

When to Scale Horizontally: When you've maximized vertical scaling, need geographic distribution, require fault tolerance, or have predictable sharding keys.
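The routing rule above can be written as a small function. This is an illustrative sketch; `range_shard` is a hypothetical helper name, not part of any framework:

```python
def range_shard(user_id: int, users_per_shard: int = 1_000_000) -> int:
    """Map a user ID to its shard number.

    Shard 1 holds users 1..1,000,000, shard 2 holds 1,000,001..2,000,000, etc.
    """
    return (user_id - 1) // users_per_shard + 1

print(range_shard(1_500_000))  # 2
```

Note the `user_id - 1` offset: without it, user 1,000,000 (the last user of shard 1) would incorrectly land on shard 2.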

Replication Architectures

Replication copies data across multiple servers for redundancy and read scaling.

Primary-Replica (Master-Slave)

One server handles writes; replicas handle reads. The most common pattern.

# PostgreSQL Streaming Replication Setup

# Primary (postgresql.conf)
wal_level = replica
max_wal_senders = 10
synchronous_standby_names = 'replica1'   # omit for asynchronous replication

# Create replication user (run in psql on the primary)
CREATE USER replicator REPLICATION LOGIN PASSWORD 'secret';

# Replica: take a base backup from the primary (shell)
# -R writes primary_conninfo and creates standby.signal automatically
pg_basebackup -h primary -D /var/lib/postgresql/data -U replicator -P -R

# Replica (postgresql.conf), set manually if -R was not used
primary_conninfo = 'host=primary port=5432 user=replicator password=secret'

Read/Write Splitting

# Application-level read/write routing
import random

class DatabaseRouter:
    def __init__(self, primary_connection, replica_connections):
        self.primary = primary_connection
        self.replicas = replica_connections

    def get_connection(self, query_type):
        if query_type == 'write':
            return self.primary
        # Randomly pick a replica to spread read load
        return random.choice(self.replicas)

# Or use a proxy such as PgBouncer, ProxySQL, or HAProxy

Multi-Master

Multiple servers accept writes. Complex but enables geographic distribution.

# MySQL Group Replication (Multi-Primary Mode)
# Each node can accept writes; conflicts resolved automatically

SET GLOBAL group_replication_single_primary_mode = OFF;
SET GLOBAL group_replication_enforce_update_everywhere_checks = ON;
START GROUP_REPLICATION;

Multi-Master Challenges: Conflict resolution (last-write-wins vs merge), increased latency for consistency, complexity in debugging. Use only when truly needed.
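A minimal sketch of last-write-wins conflict resolution; `last_write_wins` is a hypothetical helper, and real systems must also contend with clock skew between nodes:

```python
def last_write_wins(local: dict, remote: dict) -> dict:
    """Resolve a multi-master conflict by keeping the version
    with the newest timestamp (last-write-wins)."""
    return local if local["ts"] >= remote["ts"] else remote

# Two replicas updated the same row concurrently
a = {"value": "alice@old.com", "ts": 1700000000.0}
b = {"value": "alice@new.com", "ts": 1700000005.0}
winner = last_write_wins(a, b)  # b wins: newer timestamp
```

Merge-based resolution (e.g. CRDTs) avoids silently discarding one write, at the cost of more complex data types.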

Synchronous vs Asynchronous Replication

Replication Modes

| Mode | Behavior | Tradeoff |
| --- | --- | --- |
| Synchronous | Wait for replica acknowledgment | Zero data loss, higher latency |
| Asynchronous | Return immediately after primary write | Lower latency, potential data loss |
| Semi-synchronous | Wait for at least one replica | Balance of both |
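The three modes can be sketched as a toy write path; `Replica` and `write` are illustrative names for this sketch, not a real driver API:

```python
class Replica:
    """Toy replica that applies writes to an in-memory log."""
    def __init__(self):
        self.log = []

    def apply(self, value) -> bool:
        self.log.append(value)
        return True  # acknowledgment

def write(value, replicas, mode="async"):
    # Primary applies the write first (not shown), then replicates
    if mode == "sync":
        # Block until every replica acks: zero data loss, highest latency
        assert all(r.apply(value) for r in replicas)
    elif mode == "semi-sync":
        # Block for one ack; the rest catch up later (not modeled here)
        assert replicas[0].apply(value)
    # mode == "async": return immediately; replicas apply later (not modeled)
    return "ok"
```

The caller's latency grows with the number of acknowledgments it waits for, which is exactly the tradeoff in the table above.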

Sharding Strategies

Sharding partitions data across multiple databases. Each shard holds a subset of the data.

Range-Based Sharding

Partition by value ranges. Good for sequential access patterns.

-- Range sharding by date
-- Shard 2023: orders from 2023
-- Shard 2024: orders from 2024

-- Routing logic (pseudocode)
IF order_date >= '2024-01-01' THEN
    USE shard_2024;
ELSE
    USE shard_2023;
END IF;

Hotspot Risk: If most queries hit recent data, the newest shard becomes overloaded. Consider time-based archiving.

Hash-Based Sharding

Hash the sharding key for even distribution. Best for random access patterns.

# Hash-based shard routing
# Note: Python's built-in hash() is stable for ints but randomized per
# process for strings; use hashlib for a stable general-purpose hash
def get_shard(user_id, num_shards=4):
    return hash(user_id) % num_shards

# user_id=12345 → shard 1  (12345 % 4 == 1)
# user_id=67890 → shard 2
# Even distribution regardless of user_id patterns

Consistent Hashing

Minimizes data movement when adding/removing shards. Essential for dynamic scaling.

# Consistent hashing concept
# Nodes are placed on a hash ring (e.g. positions 0 to 2^32 - 1)
# Keys hash to a ring position and route to the nearest clockwise node

# Without consistent hashing (shard = hash % N):
# Adding a shard remaps most keys to different shards

# With consistent hashing:
# Adding a shard moves only ~1/N of the data
# (where N = resulting number of shards)
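The ring described above can be sketched in a few dozen lines; `ConsistentHashRing` is an illustrative implementation, not a library API. It uses "virtual nodes" (many ring positions per physical shard), a standard trick to smooth out the key distribution:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Stable hash onto the ring (0 .. 2^32 - 1); MD5 used only for placement
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes        # virtual nodes per physical node
        self.ring = {}              # ring position -> node name
        self.positions = []         # sorted ring positions
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            pos = ring_hash(f"{node}:{i}")
            self.ring[pos] = node
            bisect.insort(self.positions, pos)

    def get_node(self, key: str) -> str:
        # Route to the nearest clockwise virtual node, wrapping at the top
        pos = ring_hash(key)
        idx = bisect.bisect(self.positions, pos) % len(self.positions)
        return self.ring[self.positions[idx]]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
shard = ring.get_node("user:12345")  # always maps to the same shard
```

Calling `add_node("shard-d")` afterwards reassigns only the keys whose ring positions now fall closest to the new node's virtual nodes; the rest stay put.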

Sharding Key Selection

| Good Keys | Bad Keys |
| --- | --- |
| user_id (for user-centric apps) | Auto-increment ID (sequential hotspots) |
| tenant_id (for SaaS) | Date alone (recent-data hotspot) |
| geographic_region | Low-cardinality fields (status, type) |

CAP Theorem & PACELC

CAP Theorem

In a distributed system, you can only guarantee two of three properties:

CAP Properties

  • Consistency (C): All nodes see the same data at the same time
  • Availability (A): Every request receives a response (success or failure)
  • Partition Tolerance (P): System continues despite network failures

Since network partitions are inevitable, the real choice is:

During a network partition:

CP (Consistency + Partition Tolerance):
- Refuse requests if consistency can't be guaranteed
- Examples: PostgreSQL, MongoDB (with majority writes)
- Use case: Financial transactions, inventory systems

AP (Availability + Partition Tolerance):
- Serve requests even with potentially stale data
- Examples: Cassandra, DynamoDB (eventual consistency)
- Use case: Social media feeds, analytics

PACELC Extension

CAP only describes partition behavior. PACELC adds: "Else (no partition), what's the Latency vs Consistency tradeoff?"

PACELC: If Partition → choose A or C
        Else → choose Latency or Consistency

Examples:
- DynamoDB: PA/EL (Available during partition, low latency normally)
- PostgreSQL: PC/EC (Consistent always, but slower)
- Cassandra: PA/EL (Available, low latency, eventual consistency)

Distributed Consensus

How do distributed nodes agree on the state of data? Consensus algorithms solve this.

Common Consensus Algorithms

| Algorithm | Used By | Key Idea |
| --- | --- | --- |
| Paxos | Google Spanner, Chubby | Classic, complex, proven |
| Raft | etcd, CockroachDB, Consul | Understandable Paxos alternative |
| Zab | ZooKeeper | Optimized for primary-backup |

Raft Consensus (Simplified):
1. Leader Election: Nodes vote for a leader
2. Log Replication: Leader appends entries, replicates to followers
3. Commit: Entry committed when majority acknowledges
4. Leader fails → New election → New leader

Quorum = floor(N/2) + 1
- 3 nodes: need 2 to agree
- 5 nodes: need 3 to agree
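The quorum rule translates directly into code (illustrative helper names):

```python
def quorum(n: int) -> int:
    """Minimum number of nodes required for majority agreement: floor(n/2) + 1."""
    return n // 2 + 1

def can_commit(acks: int, cluster_size: int) -> bool:
    # A Raft entry commits once a majority of the cluster has acknowledged it
    return acks >= quorum(cluster_size)

print(quorum(3))         # 2
print(quorum(5))         # 3
print(can_commit(2, 5))  # False: only 2 of the required 3 acks
```

Note that growing a cluster from 3 to 4 nodes raises the quorum from 2 to 3 without improving fault tolerance, which is why consensus clusters usually have an odd number of nodes.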

Case Studies

Database Architecture Examples

| System | Architecture | CAP Choice |
| --- | --- | --- |
| Google Spanner | Globally distributed, TrueTime | CP (strong consistency) |
| Amazon DynamoDB | Multi-region, eventual consistency | AP (configurable) |
| CockroachDB | Raft consensus, serializable | CP (SQL + distributed) |
| Cassandra | Masterless, tunable consistency | AP (default) |

Modern Trend: NewSQL databases like CockroachDB, Spanner, and YugabyteDB offer the best of both worlds: SQL semantics with horizontal scaling.

Conclusion & Next Steps

Distributed databases are the backbone of internet-scale applications. Understanding CAP tradeoffs, replication topologies, and sharding strategies is essential for architects building resilient systems.

Next in the Series

In Part 12: Cloud Databases & Managed Services, we'll explore AWS, Azure, and GCP database offerings that implement these distributed concepts.
