Back to Distributed Systems & Kubernetes Series

Part 3: CAP Theorem & Replication

May 14, 2026 Wasil Zafar 35 min read

You cannot have it all. The CAP theorem defines the fundamental trade-off every distributed system must make — and understanding it explains why Kubernetes, databases, and cloud services behave the way they do.

Table of Contents

  1. The CAP Theorem
  2. CAP in Practice
  3. Replication Strategies
  4. Consistency Models
  5. Conflict Resolution
  6. Kubernetes & CAP
  7. Exercises
  8. Conclusion

The CAP Theorem

In 2000, Eric Brewer conjectured (later proven by Gilbert and Lynch in 2002) that a distributed data store can provide at most two out of three guarantees simultaneously:

Consistency (C)

Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability — the system behaves as if there's only one copy of the data.

Analogy: A bank account balance. If you deposit $100, the next read must show the updated balance — never the old one. Two people checking the balance simultaneously must see the same number.

Availability (A)

Every request receives a non-error response, without guarantee that it contains the most recent write. The system always responds — it never hangs or returns an error due to internal issues.

Analogy: A 24/7 customer service line. You always get an answer when you call, though the information might occasionally be slightly out of date.

Partition Tolerance (P)

The system continues to operate despite network partitions — arbitrary message loss or delay between nodes. Since network partitions are inevitable in any real distributed system (cables get cut, switches fail, data centers lose connectivity), partition tolerance is not optional.

Critical Insight: Since network partitions will happen (P is mandatory), the real choice is between Consistency and Availability during a partition. When the network is healthy, you can have all three. The CAP trade-off only matters when things go wrong.

The Impossible Triangle

CAP Theorem Trade-offs
flowchart TD
    CAP[CAP Theorem] --> CP[CP: Consistency + Partition Tolerance]
    CAP --> AP[AP: Availability + Partition Tolerance]
    CAP --> CA[CA: Consistency + Availability]
    CP --> CP_ex["etcd, ZooKeeper, HBase\n❌ Unavailable during partition"]
    AP --> AP_ex["Cassandra, DynamoDB, CouchDB\n❌ May return stale data"]
    CA --> CA_ex["Single-node RDBMS\n❌ Cannot survive partitions"]
                            

During a network partition:

  • CP systems refuse to respond rather than return potentially stale data (they sacrifice availability for correctness)
  • AP systems always respond but may return stale data (they sacrifice consistency for uptime)
  • CA systems don't exist in distributed environments (they can't handle partitions)

CAP in Practice

CP Systems (Consistency + Partition Tolerance)

CP systems guarantee that every read returns the most recent write, even during partitions — but they may become unavailable (refuse reads/writes) when they can't confirm consistency.

CP Example etcd / Kubernetes
etcd: Choosing Consistency

etcd (Kubernetes' state store) is a CP system. During a network partition, if a Raft quorum cannot be reached, etcd refuses writes entirely. The API Server will return errors rather than accept a write that might conflict with another partition's state. This means your kubectl apply will fail, but you'll never have two conflicting versions of a Deployment spec.

Raft Consensus Linearizable Quorum Required
CP Example ZooKeeper
Apache ZooKeeper: Coordination Service

ZooKeeper provides distributed locking, leader election, and configuration management. It uses the ZAB (ZooKeeper Atomic Broadcast) protocol — a Paxos variant. During a partition, the minority side cannot serve reads or writes, ensuring clients never see stale coordination data (which could cause split-brain in applications relying on it).

ZAB Protocol Distributed Locking

AP Systems (Availability + Partition Tolerance)

AP systems always respond to requests, even during partitions, accepting that responses may contain stale or conflicting data that must be resolved later.

AP Example Cassandra
Apache Cassandra: Always Available

Cassandra has no single leader — any node can accept writes. During a partition, both sides continue accepting writes independently. When the partition heals, Cassandra reconciles conflicts using "last-write-wins" (timestamp-based). This means writes are never rejected, but you might temporarily read stale data or lose a concurrent write.

Leaderless Last-Write-Wins Tunable Consistency
AP Example Amazon DynamoDB
DynamoDB: Shopping Cart Philosophy

Amazon's original Dynamo paper explicitly chose availability over consistency for the shopping cart. Their reasoning: it's better to show a slightly outdated cart (which might have an old item still in it) than to show an error page. Customers can always remove an item, but a failed add-to-cart is a lost sale. This business trade-off drove the technical architecture.

Eventually Consistent Business-Driven

Beyond CAP: PACELC

CAP only describes behavior during partitions, but most of the time the network is healthy. The PACELC extension (Daniel Abadi, 2012) adds: "Else (when no partition), choose between Latency and Consistency."

System During Partition (PAC) Else (ELC) Full Classification
etcd PC (consistent) EC (consistent) PC/EC
Cassandra PA (available) EL (low latency) PA/EL
MongoDB PC (consistent) EL (low latency) PC/EL
DynamoDB PA (available) EL (low latency) PA/EL

Replication Strategies

Synchronous Replication

In synchronous replication, a write is not acknowledged to the client until all (or a quorum of) replicas confirm they've stored it. This guarantees consistency but increases write latency.

# Synchronous replication timeline:
#
# Client → Leader: "Write x=5"
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
# Follower A → Leader: "ACK"
# Follower B → Leader: "ACK"
# Leader → Client: "Success"    ← Only AFTER all replicas confirm
#
# Total latency = network RTT to slowest follower + disk write
# If Follower B is in another region (50ms away):
#   Write latency ≥ 50ms (dominated by network distance)

# PostgreSQL synchronous replication config:
# synchronous_commit = on
# synchronous_standby_names = 'follower_a, follower_b'

Advantages: Strong consistency, no data loss on leader failure

Disadvantages: High write latency (limited by slowest replica), reduced availability (one slow replica blocks all writes)

Asynchronous Replication

In asynchronous replication, the leader acknowledges the write immediately after its own local write, then propagates to followers in the background.

# Asynchronous replication timeline:
#
# Client → Leader: "Write x=5"
# Leader writes locally
# Leader → Client: "Success"    ← Immediate (no waiting for followers)
#
# ... later (milliseconds to seconds) ...
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
#
# Risk: If leader crashes between local write and replication,
# the write is LOST. Followers never received it.

# MySQL asynchronous replication (default):
# The binlog is shipped to replicas asynchronously
# SHOW SLAVE STATUS\G
# Seconds_Behind_Master: 2   ← replication lag

Advantages: Low write latency, high availability (followers don't block writes)

Disadvantages: Possible data loss on leader failure, stale reads from followers

Semi-Synchronous Replication

A middle ground: acknowledge after at least one follower confirms (not all). This is Raft's approach — write to a majority (quorum), tolerate minority failures.

# Semi-synchronous (quorum) replication:
# Cluster: Leader + Follower A + Follower B (3 nodes, quorum = 2)
#
# Client → Leader: "Write x=5"
# Leader writes locally          ← 1 write
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
# Follower A → Leader: "ACK"    ← 2 writes (quorum reached!)
# Leader → Client: "Success"    ← Doesn't wait for Follower B
#
# Follower B might be slow or partitioned — doesn't matter.
# The write is durable with 2/3 copies.

# This is exactly how etcd (Raft) works:
# - 3-node etcd: write committed after 2 nodes confirm
# - 5-node etcd: write committed after 3 nodes confirm

Consistency Models

Strong Consistency (Linearizability)

The strongest guarantee: every operation appears to happen atomically at a single point in time, and all clients see operations in the same order. After a write completes, all subsequent reads (from any client, to any replica) return the new value.

Kubernetes uses strong consistency for its state store. When you create a Deployment, any subsequent kubectl get deployment from any client will see it. This is critical for correctness — controllers must see the latest state to make correct scheduling decisions.

Eventual Consistency

The weakest useful guarantee: if no new writes are made, all replicas will eventually converge to the same value. The "eventually" might be milliseconds or minutes — no upper bound is guaranteed.

Eventual consistency is acceptable when:

  • Stale reads don't cause incorrect behavior (social media feeds, product catalogs)
  • High availability is more important than immediate consistency (e-commerce)
  • Writes are rare relative to reads (DNS, CDN caches)

Causal Consistency

A middle ground: operations that are causally related are seen in order by all nodes, but concurrent (unrelated) operations may be seen in different orders. This captures the intuition that "if I post a message and then edit it, everyone should see the edit after the original" without requiring global ordering of all operations.

Model Guarantee Performance Example Use
Strong (Linearizable) All ops appear atomic and ordered Slowest (requires coordination) Bank transfers, K8s state
Causal Related ops are ordered Medium Social media comments
Eventual All replicas converge eventually Fastest (no coordination) Shopping carts, DNS

Conflict Resolution

When replicas diverge (in AP systems or during partitions), conflicts must eventually be resolved. Common strategies:

  • Last-Write-Wins (LWW): Use timestamps to pick the "latest" write. Simple but can silently drop writes. Used by Cassandra.
  • Vector Clocks: Track causal relationships between writes. Detect true conflicts (concurrent writes) vs. sequential updates. Used by original Dynamo.
  • CRDTs (Conflict-Free Replicated Data Types): Data structures that are mathematically guaranteed to converge regardless of operation order. No conflicts possible by design. Used by Redis (CRDT counters), Riak.
  • Application-level resolution: Present conflicts to the application/user for manual resolution. Used by CouchDB (revision conflicts).
# Example: Last-Write-Wins conflict
#
# Partition occurs at T=0
# Node A (partition 1): "set x=apple"  at T=1 (clock: 1001)
# Node B (partition 2): "set x=banana" at T=2 (clock: 1002)
#
# Partition heals at T=3
# LWW resolution: x = "banana" (higher timestamp wins)
# The "apple" write is silently discarded!
#
# Problem: If clocks are skewed, the "wrong" write might win.
# NTP clock drift can be 100ms+ between data centers.

Kubernetes & CAP

Kubernetes is fundamentally a CP system. Here's how CAP manifests across its architecture:

Component CAP Choice Behavior During Partition
etcd (state store) CP Refuses writes without quorum
API Server CP (inherits from etcd) Returns errors for write operations
kubelet (node agent) AP-ish Continues running existing pods even if API Server unreachable
kube-proxy (networking) AP-ish Uses last-known service endpoints
DNS (CoreDNS) AP (cached) Serves cached DNS records
Graceful Degradation: Kubernetes is cleverly designed so that while the control plane is CP (new changes require consensus), the data plane is AP-ish. During a control plane outage, existing pods keep running, services keep routing, and DNS keeps resolving. You just can't make changes. This is the best of both worlds — correctness for mutations, availability for running workloads.

Exercises

Exercise 1 — Classify Systems: For each system below, determine whether it's CP or AP and explain why: (a) A banking ledger system, (b) A social media "likes" counter, (c) A flight booking system, (d) A DNS resolver, (e) A distributed lock service. Consider: what happens if each returns stale data?
Exercise 2 — Kubernetes Scenario: Your 3-node etcd cluster suffers a network partition: 1 node on one side, 2 on the other. (a) Which side can accept kubectl apply commands? (b) Can the isolated node serve kubectl get reads? (c) Are existing pods affected on worker nodes connected to the isolated control plane node? (d) What would happen if etcd were AP instead of CP?
Exercise 3 — Design Decision: You're building a global e-commerce inventory system. You have warehouses in 3 regions. A customer in Asia buys the last unit while a customer in Europe tries to buy it 50ms later. (a) If you choose CP, what's the user experience? (b) If you choose AP, what might happen? (c) What hybrid approach would you use? Consider: overselling 1 item vs. showing "out of stock" incorrectly to the European customer.

Conclusion

The CAP theorem isn't about choosing two properties from a menu — it's about understanding that network partitions force a choice between consistency and availability. Most real systems aren't purely CP or AP; they make nuanced trade-offs at different levels of their architecture.

Kubernetes embraces CP for its control plane (etcd must be consistent for correct orchestration) while designing the data plane to degrade gracefully (running workloads continue even without the control plane). This architectural split — strong consistency for coordination, eventual consistency for execution — is a pattern you'll see throughout distributed systems.

In Part 4, we'll explore service discovery and communication — how ephemeral containers find each other in a world where IP addresses constantly change, and how to build resilient inter-service communication with retries, circuit breakers, and message queues.