The CAP Theorem
In 2000, Eric Brewer conjectured (later proven by Gilbert and Lynch in 2002) that a distributed data store can provide at most two out of three guarantees simultaneously:
Consistency (C)
Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability — the system behaves as if there's only one copy of the data.
Analogy: A bank account balance. If you deposit $100, the next read must show the updated balance — never the old one. Two people checking the balance simultaneously must see the same number.
Availability (A)
Every request receives a non-error response, without guarantee that it contains the most recent write. The system always responds — it never hangs or returns an error due to internal issues.
Analogy: A 24/7 customer service line. You always get an answer when you call, though the information might occasionally be slightly out of date.
Partition Tolerance (P)
The system continues to operate despite network partitions — arbitrary message loss or delay between nodes. Since network partitions are inevitable in any real distributed system (cables get cut, switches fail, data centers lose connectivity), partition tolerance is not optional.
The Impossible Triangle
flowchart TD
CAP[CAP Theorem] --> CP[CP: Consistency + Partition Tolerance]
CAP --> AP[AP: Availability + Partition Tolerance]
CAP --> CA[CA: Consistency + Availability]
CP --> CP_ex["etcd, ZooKeeper, HBase\n❌ Unavailable during partition"]
AP --> AP_ex["Cassandra, DynamoDB, CouchDB\n❌ May return stale data"]
CA --> CA_ex["Single-node RDBMS\n❌ Cannot survive partitions"]
During a network partition:
- CP systems refuse to respond rather than return potentially stale data (they sacrifice availability for correctness)
- AP systems always respond but may return stale data (they sacrifice consistency for uptime)
- CA systems don't exist in distributed environments (they can't handle partitions)
CAP in Practice
CP Systems (Consistency + Partition Tolerance)
CP systems guarantee that every read returns the most recent write, even during partitions — but they may become unavailable (refuse reads/writes) when they can't confirm consistency.
etcd: Choosing Consistency
etcd (Kubernetes' state store) is a CP system. During a network partition, if a Raft quorum cannot be reached, etcd refuses writes entirely. The API Server will return errors rather than accept a write that might conflict with another partition's state. This means your kubectl apply will fail, but you'll never have two conflicting versions of a Deployment spec.
Apache ZooKeeper: Coordination Service
ZooKeeper provides distributed locking, leader election, and configuration management. It uses the ZAB (ZooKeeper Atomic Broadcast) protocol — a Paxos variant. During a partition, the minority side cannot serve reads or writes, ensuring clients never see stale coordination data (which could cause split-brain in applications relying on it).
AP Systems (Availability + Partition Tolerance)
AP systems always respond to requests, even during partitions, accepting that responses may contain stale or conflicting data that must be resolved later.
Apache Cassandra: Always Available
Cassandra has no single leader — any node can accept writes. During a partition, both sides continue accepting writes independently. When the partition heals, Cassandra reconciles conflicts using "last-write-wins" (timestamp-based). This means writes are never rejected, but you might temporarily read stale data or lose a concurrent write.
DynamoDB: Shopping Cart Philosophy
Amazon's original Dynamo paper explicitly chose availability over consistency for the shopping cart. Their reasoning: it's better to show a slightly outdated cart (which might have an old item still in it) than to show an error page. Customers can always remove an item, but a failed add-to-cart is a lost sale. This business trade-off drove the technical architecture.
Beyond CAP: PACELC
CAP only describes behavior during partitions, but most of the time the network is healthy. The PACELC extension (Daniel Abadi, 2012) adds: "Else (when no partition), choose between Latency and Consistency."
| System | During Partition (PAC) | Else (ELC) | Full Classification |
|---|---|---|---|
| etcd | PC (consistent) | EC (consistent) | PC/EC |
| Cassandra | PA (available) | EL (low latency) | PA/EL |
| MongoDB | PC (consistent) | EL (low latency) | PC/EL |
| DynamoDB | PA (available) | EL (low latency) | PA/EL |
Replication Strategies
Synchronous Replication
In synchronous replication, a write is not acknowledged to the client until all (or a quorum of) replicas confirm they've stored it. This guarantees consistency but increases write latency.
# Synchronous replication timeline:
#
# Client → Leader: "Write x=5"
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
# Follower A → Leader: "ACK"
# Follower B → Leader: "ACK"
# Leader → Client: "Success" ← Only AFTER all replicas confirm
#
# Total latency = network RTT to slowest follower + disk write
# If Follower B is in another region (50ms away):
# Write latency ≥ 50ms (dominated by network distance)
# PostgreSQL synchronous replication config:
# synchronous_commit = on
# synchronous_standby_names = 'follower_a, follower_b'
Advantages: Strong consistency, no data loss on leader failure
Disadvantages: High write latency (limited by slowest replica), reduced availability (one slow replica blocks all writes)
Asynchronous Replication
In asynchronous replication, the leader acknowledges the write immediately after its own local write, then propagates to followers in the background.
# Asynchronous replication timeline:
#
# Client → Leader: "Write x=5"
# Leader writes locally
# Leader → Client: "Success" ← Immediate (no waiting for followers)
#
# ... later (milliseconds to seconds) ...
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
#
# Risk: If leader crashes between local write and replication,
# the write is LOST. Followers never received it.
# MySQL asynchronous replication (default):
# The binlog is shipped to replicas asynchronously
# SHOW SLAVE STATUS\G
# Seconds_Behind_Master: 2 ← replication lag
Advantages: Low write latency, high availability (followers don't block writes)
Disadvantages: Possible data loss on leader failure, stale reads from followers
Semi-Synchronous Replication
A middle ground: acknowledge after at least one follower confirms (not all). This is Raft's approach — write to a majority (quorum), tolerate minority failures.
# Semi-synchronous (quorum) replication:
# Cluster: Leader + Follower A + Follower B (3 nodes, quorum = 2)
#
# Client → Leader: "Write x=5"
# Leader writes locally ← 1 write
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
# Follower A → Leader: "ACK" ← 2 writes (quorum reached!)
# Leader → Client: "Success" ← Doesn't wait for Follower B
#
# Follower B might be slow or partitioned — doesn't matter.
# The write is durable with 2/3 copies.
# This is exactly how etcd (Raft) works:
# - 3-node etcd: write committed after 2 nodes confirm
# - 5-node etcd: write committed after 3 nodes confirm
Consistency Models
Strong Consistency (Linearizability)
The strongest guarantee: every operation appears to happen atomically at a single point in time, and all clients see operations in the same order. After a write completes, all subsequent reads (from any client, to any replica) return the new value.
kubectl get deployment from any client will see it. This is critical for correctness — controllers must see the latest state to make correct scheduling decisions.
Eventual Consistency
The weakest useful guarantee: if no new writes are made, all replicas will eventually converge to the same value. The "eventually" might be milliseconds or minutes — no upper bound is guaranteed.
Eventual consistency is acceptable when:
- Stale reads don't cause incorrect behavior (social media feeds, product catalogs)
- High availability is more important than immediate consistency (e-commerce)
- Writes are rare relative to reads (DNS, CDN caches)
Causal Consistency
A middle ground: operations that are causally related are seen in order by all nodes, but concurrent (unrelated) operations may be seen in different orders. This captures the intuition that "if I post a message and then edit it, everyone should see the edit after the original" without requiring global ordering of all operations.
| Model | Guarantee | Performance | Example Use |
|---|---|---|---|
| Strong (Linearizable) | All ops appear atomic and ordered | Slowest (requires coordination) | Bank transfers, K8s state |
| Causal | Related ops are ordered | Medium | Social media comments |
| Eventual | All replicas converge eventually | Fastest (no coordination) | Shopping carts, DNS |
Conflict Resolution
When replicas diverge (in AP systems or during partitions), conflicts must eventually be resolved. Common strategies:
- Last-Write-Wins (LWW): Use timestamps to pick the "latest" write. Simple but can silently drop writes. Used by Cassandra.
- Vector Clocks: Track causal relationships between writes. Detect true conflicts (concurrent writes) vs. sequential updates. Used by original Dynamo.
- CRDTs (Conflict-Free Replicated Data Types): Data structures that are mathematically guaranteed to converge regardless of operation order. No conflicts possible by design. Used by Redis (CRDT counters), Riak.
- Application-level resolution: Present conflicts to the application/user for manual resolution. Used by CouchDB (revision conflicts).
# Example: Last-Write-Wins conflict
#
# Partition occurs at T=0
# Node A (partition 1): "set x=apple" at T=1 (clock: 1001)
# Node B (partition 2): "set x=banana" at T=2 (clock: 1002)
#
# Partition heals at T=3
# LWW resolution: x = "banana" (higher timestamp wins)
# The "apple" write is silently discarded!
#
# Problem: If clocks are skewed, the "wrong" write might win.
# NTP clock drift can be 100ms+ between data centers.
Kubernetes & CAP
Kubernetes is fundamentally a CP system. Here's how CAP manifests across its architecture:
| Component | CAP Choice | Behavior During Partition |
|---|---|---|
| etcd (state store) | CP | Refuses writes without quorum |
| API Server | CP (inherits from etcd) | Returns errors for write operations |
| kubelet (node agent) | AP-ish | Continues running existing pods even if API Server unreachable |
| kube-proxy (networking) | AP-ish | Uses last-known service endpoints |
| DNS (CoreDNS) | AP (cached) | Serves cached DNS records |
Exercises
kubectl apply commands? (b) Can the isolated node serve kubectl get reads? (c) Are existing pods affected on worker nodes connected to the isolated control plane node? (d) What would happen if etcd were AP instead of CP?
Conclusion
The CAP theorem isn't about choosing two properties from a menu — it's about understanding that network partitions force a choice between consistency and availability. Most real systems aren't purely CP or AP; they make nuanced trade-offs at different levels of their architecture.
Kubernetes embraces CP for its control plane (etcd must be consistent for correct orchestration) while designing the data plane to degrade gracefully (running workloads continue even without the control plane). This architectural split — strong consistency for coordination, eventual consistency for execution — is a pattern you'll see throughout distributed systems.
In Part 4, we'll explore service discovery and communication — how ephemeral containers find each other in a world where IP addresses constantly change, and how to build resilient inter-service communication with retries, circuit breakers, and message queues.