CAP Theorem & Replication - Part 3

Part 2 showed how consensus algorithms let nodes agree on a sequence of operations. But consensus has a cost — it requires waiting for a quorum, which means latency. The CAP theorem formalises this trade-off: you can't have perfect consistency, availability, and partition tolerance simultaneously. Understanding CAP explains why systems like etcd, Cassandra, and DynamoDB behave so differently under pressure.

The CAP Theorem

In 2000, Eric Brewer conjectured (later proven by Gilbert and Lynch in 2002) that a distributed data store can provide at most two out of three guarantees simultaneously:

Consistency (C)

Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability — the system behaves as if there's only one copy of the data.

Analogy: A bank account balance. If you deposit $100, the next read must show the updated balance — never the old one. Two people checking the balance simultaneously must see the same number.

Availability (A)

Every request receives a non-error response, without guarantee that it contains the most recent write. The system always responds — it never hangs or returns an error due to internal issues.

Analogy: A 24/7 customer service line. You always get an answer when you call, though the information might occasionally be slightly out of date.

Partition Tolerance (P)

The system continues to operate despite network partitions — arbitrary message loss or delay between nodes. Since network partitions are inevitable in any real distributed system (cables get cut, switches fail, data centers lose connectivity), partition tolerance is not optional.

                            
                            Critical Insight: Since network partitions will happen (P is mandatory), the real choice is between Consistency and Availability during a partition. When the network is healthy, you can have all three. The CAP trade-off only matters when things go wrong.
                        

The Impossible Triangle

CAP Theorem — Pick Two (During a Partition)

flowchart TD
    CAP["CAP Theorem\nIn a distributed system during a network partition,\nyou can guarantee at most TWO of three properties"]

    CAP --> CP["CP\nConsistency + Partition Tolerance"]
    CAP --> AP["AP\nAvailability + Partition Tolerance"]
    CAP --> CA["CA\nConsistency + Availability"]

    CP --> CP_ex["etcd — K8s state store (Raft consensus)\nZooKeeper — Distributed coordination (ZAB protocol)\nHBase — Hadoop database (single-master writes)"]
    CP --> CP_cost["❌ Trade-off: Refuses reads/writes\nwhen quorum is unreachable"]

    AP --> AP_ex["Cassandra — Wide-column DB (leaderless, any node writes)\nDynamoDB — AWS key-value (shopping cart philosophy)\nCouchDB — Document DB (multi-master replication)"]
    AP --> AP_cost["❌ Trade-off: May return stale or\nconflicting data during partition"]

    CA --> CA_ex["PostgreSQL (single-node) — Traditional RDBMS\nMySQL (single-node) — No replication needed\nSQLite — Embedded, single-process only"]
    CA --> CA_cost["❌ Trade-off: Cannot survive network\npartitions (not truly distributed)"]

    style CP fill:#e8f4f4,stroke:#3B9797,color:#132440
    style AP fill:#fff5f5,stroke:#BF092F,color:#132440
    style CA fill:#f0f4f8,stroke:#16476A,color:#132440
    style CP_cost fill:#fff,stroke:#3B9797,color:#64748b
    style AP_cost fill:#fff,stroke:#BF092F,color:#64748b
    style CA_cost fill:#fff,stroke:#16476A,color:#64748b

                            Key Insight: Network partitions will happen in any distributed system. Since you can't avoid P, the real choice is between C (refuse requests to stay correct) and A (always respond, accept stale data). CA systems avoid this choice by not distributing at all — which means they can't scale horizontally or survive node failures.
                        

During a network partition:

CP systems refuse to respond rather than return potentially stale data (they sacrifice availability for correctness)
AP systems always respond but may return stale data (they sacrifice consistency for uptime)
CA systems don't exist in distributed environments (they can't handle partitions)

CAP in Practice

CP Systems (Consistency + Partition Tolerance)

CP systems guarantee that every read returns the most recent write, even during partitions — but they may become unavailable (refuse reads/writes) when they can't confirm consistency.

CP Example etcd / Kubernetes

etcd: Choosing Consistency

etcd (Kubernetes' state store) is a CP system. During a network partition, if a Raft quorum cannot be reached, etcd refuses writes entirely. The API Server will return errors rather than accept a write that might conflict with another partition's state. This means your kubectl apply will fail, but you'll never have two conflicting versions of a Deployment spec.

Raft Consensus Linearizable Quorum Required

CP Example ZooKeeper

Apache ZooKeeper: Coordination Service

ZooKeeper provides distributed locking, leader election, and configuration management. It uses the ZAB (ZooKeeper Atomic Broadcast) protocol — a Paxos variant. During a partition, the minority side cannot serve reads or writes, ensuring clients never see stale coordination data (which could cause split-brain in applications relying on it).

ZAB Protocol Distributed Locking

AP Systems (Availability + Partition Tolerance)

AP systems always respond to requests, even during partitions, accepting that responses may contain stale or conflicting data that must be resolved later.

AP Example Cassandra

Apache Cassandra: Always Available

Cassandra has no single leader — any node can accept writes. During a partition, both sides continue accepting writes independently. When the partition heals, Cassandra reconciles conflicts using "last-write-wins" (timestamp-based). This means writes are never rejected, but you might temporarily read stale data or lose a concurrent write.

Leaderless Last-Write-Wins Tunable Consistency

AP Example Amazon DynamoDB

DynamoDB: Shopping Cart Philosophy

Amazon's original Dynamo paper explicitly chose availability over consistency for the shopping cart. Their reasoning: it's better to show a slightly outdated cart (which might have an old item still in it) than to show an error page. Customers can always remove an item, but a failed add-to-cart is a lost sale. This business trade-off drove the technical architecture.

Eventually Consistent Business-Driven

Beyond CAP: PACELC

CAP only describes behavior during partitions — but networks are healthy 99.9% of the time. If CAP were the whole story, every CP system would feel identical during normal operation. They don't: etcd prioritises correctness even at the cost of latency, while MongoDB offers fast local reads that may be slightly stale. The PACELC extension (Daniel Abadi, 2012) captures this: "Else (when no partition), choose between Latency and Consistency."

System	During Partition (PAC)	Else (ELC)	Full Classification
etcd	PC (consistent)	EC (consistent)	PC/EC
Cassandra	PA (available)	EL (low latency)	PA/EL
MongoDB	PC (consistent)	EL (low latency)	PC/EL
DynamoDB	PA (available)	EL (low latency)	PA/EL

Replication Strategies

Synchronous Replication

In synchronous replication, a write is not acknowledged to the client until all (or a quorum of) replicas confirm they've stored it. This guarantees consistency but increases write latency.

# Synchronous replication timeline:
#
# Client → Leader: "Write x=5"
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
# Follower A → Leader: "ACK"
# Follower B → Leader: "ACK"
# Leader → Client: "Success"    ← Only AFTER all replicas confirm
#
# Total latency = network RTT to slowest follower + disk write
# If Follower B is in another region (50ms away):
#   Write latency ≥ 50ms (dominated by network distance)

# PostgreSQL synchronous replication config:
# synchronous_commit = on
# synchronous_standby_names = 'follower_a, follower_b'

Advantages: Strong consistency, no data loss on leader failure

Disadvantages: High write latency (limited by slowest replica), reduced availability (one slow replica blocks all writes)

Asynchronous Replication

In asynchronous replication, the leader acknowledges the write immediately after its own local write, then propagates to followers in the background.

# Asynchronous replication timeline:
#
# Client → Leader: "Write x=5"
# Leader writes locally
# Leader → Client: "Success"    ← Immediate (no waiting for followers)
#
# ... later (milliseconds to seconds) ...
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
#
# Risk: If leader crashes between local write and replication,
# the write is LOST. Followers never received it.

# MySQL asynchronous replication (default):
# The binlog is shipped to replicas asynchronously
# SHOW SLAVE STATUS\G
# Seconds_Behind_Master: 2   ← replication lag

Advantages: Low write latency, high availability (followers don't block writes)

Disadvantages: Possible data loss on leader failure, stale reads from followers

Semi-Synchronous Replication

A middle ground: acknowledge after at least one follower confirms (not all). This is Raft's approach — write to a majority (quorum), tolerate minority failures.

# Semi-synchronous (quorum) replication:
# Cluster: Leader + Follower A + Follower B (3 nodes, quorum = 2)
#
# Client → Leader: "Write x=5"
# Leader writes locally          ← 1 write
# Leader → Follower A: "Replicate x=5"
# Leader → Follower B: "Replicate x=5"
# Follower A → Leader: "ACK"    ← 2 writes (quorum reached!)
# Leader → Client: "Success"    ← Doesn't wait for Follower B
#
# Follower B might be slow or partitioned — doesn't matter.
# The write is durable with 2/3 copies.

# This is exactly how etcd (Raft) works:
# - 3-node etcd: write committed after 2 nodes confirm
# - 5-node etcd: write committed after 3 nodes confirm

                            
                            The Key Insight: Your replication strategy determines your consistency model. Synchronous replication (wait for all) enables strong consistency — but at high latency cost. Asynchronous replication (ack immediately) forces eventual consistency — fast but clients may read stale data. Semi-synchronous/quorum (Raft) gives a middle ground: strong consistency within the quorum, without waiting for every replica. This is why Kubernetes chose quorum-based Raft for etcd — it provides strong consistency for control plane state without the latency penalty of fully synchronous replication across all nodes.
                        

Consistency Models

Strong Consistency (Linearizability)

The strongest guarantee: every operation appears to happen atomically at a single point in time, and all clients see operations in the same order. After a write completes, all subsequent reads (from any client, to any replica) return the new value.

                            
                            Kubernetes uses strong consistency for its state store. When you create a Deployment, any subsequent kubectl get deployment from any client will see it. This is critical for correctness — controllers must see the latest state to make correct scheduling decisions.
                        

Eventual Consistency

The weakest useful guarantee: if no new writes are made, all replicas will eventually converge to the same value. The "eventually" might be milliseconds or minutes — no upper bound is guaranteed.

Eventual consistency is acceptable when:

Stale reads don't cause incorrect behavior (social media feeds, product catalogs)
High availability is more important than immediate consistency (e-commerce)
Writes are rare relative to reads (DNS, CDN caches)

Causal Consistency

A middle ground: operations that are causally related are seen in order by all nodes, but concurrent (unrelated) operations may be seen in different orders. This captures the intuition that "if I post a message and then edit it, everyone should see the edit after the original" without requiring global ordering of all operations.

Model	Guarantee	Performance	Example Use
Strong (Linearizable)	All ops appear atomic and ordered	Slowest (requires coordination)	Bank transfers, K8s state
Causal	Related ops are ordered	Medium	Social media comments
Eventual	All replicas converge eventually	Fastest (no coordination)	Shopping carts, DNS

Conflict Resolution

When replicas diverge (in AP systems or during partitions), conflicts must eventually be resolved. Common strategies:

Last-Write-Wins (LWW): Use timestamps to pick the "latest" write. Simple but can silently drop writes. Used by Cassandra.
Vector Clocks: Track causal relationships between writes. Detect true conflicts (concurrent writes) vs. sequential updates. Used by original Dynamo.
CRDTs (Conflict-Free Replicated Data Types): Data structures that are mathematically guaranteed to converge regardless of operation order. No conflicts possible by design. Used by Redis (CRDT counters), Riak.
Application-level resolution: Present conflicts to the application/user for manual resolution. Used by CouchDB (revision conflicts).

# Example: Last-Write-Wins conflict
#
# Partition occurs at T=0
# Node A (partition 1): "set x=apple"  at T=1 (clock: 1001)
# Node B (partition 2): "set x=banana" at T=2 (clock: 1002)
#
# Partition heals at T=3
# LWW resolution: x = "banana" (higher timestamp wins)
# The "apple" write is silently discarded!
#
# Problem: If clocks are skewed, the "wrong" write might win.
# NTP clock drift can be 100ms+ between data centers.

Kubernetes & CAP

Kubernetes is fundamentally a CP system. Here's how CAP manifests across its architecture:

Component	CAP Choice	Behavior During Partition
etcd (state store)	CP	Refuses writes without quorum
API Server	CP (inherits from etcd)	Returns errors for write operations
kubelet (node agent)	AP-ish	Continues running existing pods even if API Server unreachable
kube-proxy (networking)	AP-ish	Uses last-known service endpoints
DNS (CoreDNS)	AP (cached)	Serves cached DNS records

                            
                            Graceful Degradation: Kubernetes is cleverly designed so that while the control plane is CP (new changes require consensus), the data plane is AP-ish. During a control plane outage, existing pods keep running, services keep routing, and DNS keeps resolving. You just can't make changes. This is the best of both worlds — correctness for mutations, availability for running workloads.
                        

Exercises

                            
                            Exercise 1 — Classify Systems: For each system below, determine whether it's CP or AP and explain why: (a) A banking ledger system, (b) A social media "likes" counter, (c) A flight booking system, (d) A DNS resolver, (e) A distributed lock service. Consider: what happens if each returns stale data?
                        

                            
                            Exercise 2 — Kubernetes Scenario: Your 3-node etcd cluster suffers a network partition: 1 node on one side, 2 on the other. (a) Which side can accept kubectl apply commands? (b) Can the isolated node serve kubectl get reads? (c) Are existing pods affected on worker nodes connected to the isolated control plane node? (d) What would happen if etcd were AP instead of CP?
                        

                            
                            Exercise 3 — Design Decision: You're building a global e-commerce inventory system. You have warehouses in 3 regions. A customer in Asia buys the last unit while a customer in Europe tries to buy it 50ms later. (a) If you choose CP, what's the user experience? (b) If you choose AP, what might happen? (c) What hybrid approach would you use? Consider: overselling 1 item vs. showing "out of stock" incorrectly to the European customer.
                        

Conclusion

The CAP theorem isn't about choosing two properties from a menu — it's about understanding that network partitions force a choice between consistency and availability. Most real systems aren't purely CP or AP; they make nuanced trade-offs at different levels of their architecture.

Kubernetes embraces CP for its control plane (etcd must be consistent for correct orchestration) while designing the data plane to degrade gracefully (running workloads continue even without the control plane). This architectural split — strong consistency for coordination, eventual consistency for execution — is a pattern you'll see throughout distributed systems.

In Part 4, we'll explore service discovery and communication — how ephemeral containers find each other in a world where IP addresses constantly change, and how to build resilient inter-service communication with retries, circuit breakers, and message queues.

Previous Part 2: Consensus Algorithms Next Part 4: Service Discovery & Communication

Cookie Consent

Part 3: CAP Theorem & Replication

Table of Contents

The CAP Theorem

Consistency (C)

Availability (A)

Partition Tolerance (P)

The Impossible Triangle

CAP in Practice

CP Systems (Consistency + Partition Tolerance)

etcd: Choosing Consistency

Apache ZooKeeper: Coordination Service

AP Systems (Availability + Partition Tolerance)

Apache Cassandra: Always Available

DynamoDB: Shopping Cart Philosophy

Beyond CAP: PACELC

Replication Strategies

Synchronous Replication

Asynchronous Replication

Semi-Synchronous Replication

Consistency Models

Strong Consistency (Linearizability)

Eventual Consistency

Causal Consistency

Conflict Resolution

Kubernetes & CAP

Exercises

Conclusion

Cookie Consent

Part 3: CAP Theorem & Replication

Table of Contents

The CAP Theorem

Consistency (C)

Availability (A)

Partition Tolerance (P)

The Impossible Triangle

CAP in Practice

CP Systems (Consistency + Partition Tolerance)

etcd: Choosing Consistency

Apache ZooKeeper: Coordination Service

AP Systems (Availability + Partition Tolerance)

Apache Cassandra: Always Available

DynamoDB: Shopping Cart Philosophy

Beyond CAP: PACELC

Replication Strategies

Synchronous Replication

Asynchronous Replication

Semi-Synchronous Replication

Consistency Models

Strong Consistency (Linearizability)

Eventual Consistency

Causal Consistency

Conflict Resolution

Kubernetes & CAP

Exercises

Conclusion

Continue the Series

Part 2: Consensus Algorithms

Part 4: Service Discovery & Communication

Part 10: Kubernetes Storage