The Consensus Problem
Why Consensus Matters
Imagine three database servers that need to agree on the order of transactions. Client A sends "set x=1" to Server 1, while Client B simultaneously sends "set x=2" to Server 2. Without consensus, each server might apply these operations in a different order, leading to permanent disagreement about the value of x.
Consensus algorithms solve this by ensuring all nodes agree on the same sequence of operations, even when some nodes fail or messages are delayed. This is the foundation of:
- Leader election: Who coordinates the cluster?
- Atomic broadcast: Delivering messages to all nodes in the same order
- Distributed locking: Ensuring only one process holds a lock
- State machine replication: Keeping replicas identical
kubectl apply goes through Raft before being considered committed. Understanding Raft means understanding why Kubernetes requires an odd number of etcd nodes and what happens during control plane failures.
Required Properties
A correct consensus algorithm must guarantee three properties:
| Property | Meaning | Why It Matters |
|---|---|---|
| Agreement | All non-faulty nodes decide on the same value | No conflicting state across replicas |
| Validity | The decided value was proposed by some node | Prevents fabricating values from nowhere |
| Termination | All non-faulty nodes eventually reach a decision | System doesn't hang forever |
FLP Impossibility
In 1985, Fischer, Lynch, and Paterson proved that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system where even one node might crash (the FLP impossibility result). This sounds devastating, but practical systems work around it using:
- Randomization: Break symmetry with random timeouts (Raft uses this)
- Partial synchrony: Assume the network eventually delivers messages within some bound
- Failure detectors: Use timeouts as imperfect crash detectors
Raft and Paxos both rely on partial synchrony — they guarantee safety (agreement) always, and guarantee liveness (termination) as long as the network eventually stabilizes.
Raft Consensus Algorithm
Overview & Design Goals
Raft was designed in 2014 by Diego Ongaro and John Ousterhout with one primary goal: understandability. Paxos, the previous standard, was notoriously difficult to understand and implement correctly. Raft achieves equivalent correctness guarantees while being dramatically easier to reason about.
Raft decomposes consensus into three sub-problems:
- Leader Election: Choose one node to coordinate
- Log Replication: The leader replicates operations to followers
- Safety: Ensure replicated logs remain consistent
stateDiagram-v2
[*] --> Follower
Follower --> Candidate : Election timeout (no heartbeat)
Candidate --> Leader : Receives majority votes
Candidate --> Follower : Discovers higher term leader
Candidate --> Candidate : Election timeout (split vote)
Leader --> Follower : Discovers higher term
Every node in a Raft cluster is in one of three states at any time: Follower, Candidate, or Leader. Time is divided into terms — monotonically increasing integers that act as logical clocks.
Leader Election
Raft uses a heartbeat mechanism to trigger elections:
- All nodes start as Followers
- The Leader sends periodic heartbeats (empty AppendEntries RPCs)
- If a Follower doesn't receive a heartbeat within its election timeout (randomized, typically 150–300ms), it becomes a Candidate
- The Candidate increments its term, votes for itself, and requests votes from all other nodes
- If it receives votes from a majority, it becomes Leader
- If another node wins (higher term), it reverts to Follower
- If no one wins (split vote), timeout and try again with a new term
# Raft election example with 5 nodes (etcd cluster):
#
# Term 1: Node A is leader, sends heartbeats every 100ms
#
# Term 2: Node A crashes
# - Node C election timeout fires first (randomized: 180ms)
# - Node C increments term to 2, votes for self
# - Node C sends RequestVote to B, D, E
# - B, D, E grant vote (haven't voted in term 2 yet)
# - Node C has 4 votes (self + B + D + E) out of 5 → MAJORITY
# - Node C becomes Leader for term 2
# - Node C begins sending heartbeats
#
# Total election time: ~200-300ms (one round-trip + timeout)
# In Kubernetes etcd, view the current leader:
etcdctl endpoint status --write-out=table
# +----------------+------------------+---------+---------+
# | ENDPOINT | ID | IS_LEADER | TERM |
# +----------------+------------------+---------+---------+
# | 10.0.0.1:2379 | 8e9e05c52164694d | false | 5 |
# | 10.0.0.2:2379 | 91bc3c398fb3c146 | true | 5 |
# | 10.0.0.3:2379 | fd422379fda50e48 | false | 5 |
# +----------------+------------------+---------+---------+
Log Replication
Once a leader is elected, it handles all client requests. Each request becomes a log entry that must be replicated to a majority before being "committed":
- Client sends a write request to the Leader
- Leader appends the entry to its log
- Leader sends AppendEntries RPC to all Followers
- Followers append to their log and acknowledge
- Once a majority acknowledges, the Leader commits the entry
- Leader notifies Followers of the commit
- Leader responds to the client
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: Write "set x=5"
L->>L: Append to log (index 4, term 2)
L->>F1: AppendEntries(index 4, "set x=5")
L->>F2: AppendEntries(index 4, "set x=5")
F1->>L: ACK (log index 4)
Note over L: Majority (self + F1) = 2/3 ✓
L->>L: Commit index 4
L->>C: Success
F2->>L: ACK (late but accepted)
L->>F1: Heartbeat (commitIndex=4)
L->>F2: Heartbeat (commitIndex=4)
Quorum Mechanics
A quorum is the minimum number of nodes that must agree for a decision to be valid. In Raft, the quorum is ⌊N/2⌋ + 1 (a strict majority):
| Cluster Size (N) | Quorum Required | Tolerates Failures | Use Case |
|---|---|---|---|
| 1 | 1 | 0 | Development only |
| 3 | 2 | 1 | Small production clusters |
| 5 | 3 | 2 | Standard production |
| 7 | 4 | 3 | High availability (rarely needed) |
Safety Guarantees
Raft guarantees that once a log entry is committed, it will never be lost or contradicted, even across leader changes. This is enforced through:
- Election restriction: A candidate can only win if its log is at least as up-to-date as a majority of nodes
- Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical
- Leader completeness: A committed entry will be present in every future leader's log
These properties ensure that committed writes survive leader elections, network partitions, and node restarts — exactly what you need for a system like etcd that stores your entire Kubernetes cluster state.
Paxos (Conceptual)
Historical Context
Paxos was described by Leslie Lamport in 1989 (published 1998) as the first practical consensus algorithm. It's provably correct and the theoretical foundation for most consensus systems. However, it's notoriously difficult to understand — Lamport's original paper used a fictional Greek island parliament as an analogy, and the community spent years arguing about what it actually meant.
Two-Phase Protocol
Basic Paxos uses two phases to reach agreement on a single value:
Phase 1 (Prepare):
- A proposer picks a unique proposal number N
- Sends "Prepare(N)" to all acceptors
- Each acceptor promises not to accept proposals with numbers less than N
- If the acceptor already accepted a value, it returns that value
Phase 2 (Accept):
- If the proposer receives promises from a majority, it sends "Accept(N, value)"
- The value is either the highest-numbered previously accepted value, or the proposer's own value if no prior value exists
- Acceptors accept if they haven't promised to a higher number
- Once a majority accepts, the value is chosen
Raft vs Paxos
| Aspect | Raft | Paxos |
|---|---|---|
| Design goal | Understandability | Theoretical correctness |
| Leader | Strong leader (all writes go through leader) | Any node can propose (leaderless variant) |
| Log ordering | Entries committed in order | Gaps allowed (Multi-Paxos fills them) |
| Implementations | etcd, Consul, CockroachDB | Google Chubby, Apache ZooKeeper (ZAB variant) |
| Complexity | ~2,000 lines (reference implementation) | Varies widely; often 5,000+ lines |
Split-Brain Problems
What is Split-Brain?
A split-brain occurs when a network partition divides a cluster into two (or more) groups, each believing the other has failed. Without proper consensus, both sides may elect their own leader and accept writes independently, causing permanent data divergence.
flowchart TD
subgraph Partition A
A1[Node 1 - thinks it is Leader]
A2[Node 2 - Follower]
end
subgraph Partition B
B3[Node 3 - thinks it is Leader]
B4[Node 4 - Follower]
B5[Node 5 - Follower]
end
A1 -.->|Network Partition| B3
A1 -->|Heartbeats| A2
B3 -->|Heartbeats| B4
B3 -->|Heartbeats| B5
In this 5-node example, nodes 1–2 are partitioned from nodes 3–5. Node 1 was the original leader but can't reach a majority (needs 3). Nodes 3–5 elect Node 3 as the new leader (has 3 = majority). The quorum requirement naturally resolves this — only the partition with a majority can make progress.
Prevention Strategies
- Quorum-based decisions: Only the majority partition can commit (Raft's approach)
- Fencing tokens: Monotonically increasing tokens that invalidate stale leaders
- Lease-based leadership: Leaders hold time-bounded leases; must renew to stay leader
- STONITH (Shoot The Other Node In The Head): Hardware-level fencing for the minority partition
Consensus in Kubernetes (etcd)
etcd is Kubernetes' single source of truth — every object (Pods, Services, Deployments) is stored as a key-value pair in etcd. The Raft consensus algorithm ensures this state is consistently replicated across all etcd nodes.
# Check etcd cluster health:
etcdctl endpoint health --cluster
# https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 2.5ms
# https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.1ms
# https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 1.9ms
# View etcd member list:
etcdctl member list --write-out=table
# +------------------+---------+---------+------------------------+------------------------+
# | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
# +------------------+---------+---------+------------------------+------------------------+
# | 8e9e05c52164694d | started | etcd-1 | https://10.0.0.1:2380 | https://10.0.0.1:2379 |
# | 91bc3c398fb3c146 | started | etcd-2 | https://10.0.0.2:2380 | https://10.0.0.2:2379 |
# | fd422379fda50e48 | started | etcd-3 | https://10.0.0.3:2380 | https://10.0.0.3:2379 |
# +------------------+---------+---------+------------------------+------------------------+
# Watch Raft term changes (indicates leader elections):
etcdctl endpoint status --write-out=json | python3 -m json.tool
# Look for "raftTerm" field — increments indicate elections occurred
When you run kubectl apply -f deployment.yaml, here's what happens at the consensus level:
- API Server receives the request and validates it
- API Server sends a write proposal to the etcd leader
- etcd leader appends the entry to its Raft log
- Leader replicates the entry to follower etcd nodes
- Once a majority acknowledges, the entry is committed
- etcd responds to the API Server with success
- API Server responds to kubectl with the created resource
etcd_disk_wal_fsync_duration_seconds — if P99 exceeds 10ms, investigate immediately.
Exercises
etcdctl endpoint status --write-out=table and identify: (a) Which node is the leader, (b) The current Raft term, (c) The Raft index (total committed entries). Then restart the leader node and observe: How quickly does a new election happen? What's the new term number?
Conclusion
Consensus algorithms solve the fundamental challenge of distributed agreement — how do unreliable machines, communicating over unreliable networks, agree on anything? Raft provides an elegant solution through strong leadership, sequential log replication, and quorum-based commits.
For Kubernetes practitioners, understanding Raft explains:
- Why etcd needs odd numbers of nodes (quorum math)
- Why losing majority etcd nodes is catastrophic (no quorum = no writes)
- Why etcd performance matters (every write is a Raft round-trip)
- Why leader elections cause brief write pauses (~300ms)
- Why etcd backups are critical (committed state = cluster state)
In Part 3, we'll explore the CAP theorem — the fundamental trade-off between consistency, availability, and partition tolerance that governs all distributed system design decisions, including those made by Kubernetes.