Back to Distributed Systems & Kubernetes Series

Part 2: Consensus Algorithms

May 14, 2026 Wasil Zafar 40 min read

Consensus is the heart of distributed coordination. How do machines agree on shared state when networks are unreliable and nodes can fail?

Table of Contents

  1. The Consensus Problem
  2. Raft Consensus Algorithm
  3. Paxos (Conceptual)
  4. Split-Brain Problems
  5. Consensus in Kubernetes (etcd)
  6. Exercises
  7. Conclusion

The Consensus Problem

Why Consensus Matters

Imagine three database servers that need to agree on the order of transactions. Client A sends "set x=1" to Server 1, while Client B simultaneously sends "set x=2" to Server 2. Without consensus, each server might apply these operations in a different order, leading to permanent disagreement about the value of x.

Consensus algorithms solve this by ensuring all nodes agree on the same sequence of operations, even when some nodes fail or messages are delayed. This is the foundation of:

  • Leader election: Who coordinates the cluster?
  • Atomic broadcast: Delivering messages to all nodes in the same order
  • Distributed locking: Ensuring only one process holds a lock
  • State machine replication: Keeping replicas identical
Kubernetes Connection: etcd — Kubernetes' brain — uses the Raft consensus algorithm to replicate cluster state across control plane nodes. Every kubectl apply goes through Raft before being considered committed. Understanding Raft means understanding why Kubernetes requires an odd number of etcd nodes and what happens during control plane failures.

Required Properties

A correct consensus algorithm must guarantee three properties:

Property Meaning Why It Matters
Agreement All non-faulty nodes decide on the same value No conflicting state across replicas
Validity The decided value was proposed by some node Prevents fabricating values from nowhere
Termination All non-faulty nodes eventually reach a decision System doesn't hang forever

FLP Impossibility

In 1985, Fischer, Lynch, and Paterson proved that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system where even one node might crash (the FLP impossibility result). This sounds devastating, but practical systems work around it using:

  • Randomization: Break symmetry with random timeouts (Raft uses this)
  • Partial synchrony: Assume the network eventually delivers messages within some bound
  • Failure detectors: Use timeouts as imperfect crash detectors

Raft and Paxos both rely on partial synchrony — they guarantee safety (agreement) always, and guarantee liveness (termination) as long as the network eventually stabilizes.

Raft Consensus Algorithm

Overview & Design Goals

Raft was designed in 2014 by Diego Ongaro and John Ousterhout with one primary goal: understandability. Paxos, the previous standard, was notoriously difficult to understand and implement correctly. Raft achieves equivalent correctness guarantees while being dramatically easier to reason about.

Raft decomposes consensus into three sub-problems:

  1. Leader Election: Choose one node to coordinate
  2. Log Replication: The leader replicates operations to followers
  3. Safety: Ensure replicated logs remain consistent
Raft Node States
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : Election timeout (no heartbeat)
    Candidate --> Leader : Receives majority votes
    Candidate --> Follower : Discovers higher term leader
    Candidate --> Candidate : Election timeout (split vote)
    Leader --> Follower : Discovers higher term
                            

Every node in a Raft cluster is in one of three states at any time: Follower, Candidate, or Leader. Time is divided into terms — monotonically increasing integers that act as logical clocks.

Leader Election

Raft uses a heartbeat mechanism to trigger elections:

  1. All nodes start as Followers
  2. The Leader sends periodic heartbeats (empty AppendEntries RPCs)
  3. If a Follower doesn't receive a heartbeat within its election timeout (randomized, typically 150–300ms), it becomes a Candidate
  4. The Candidate increments its term, votes for itself, and requests votes from all other nodes
  5. If it receives votes from a majority, it becomes Leader
  6. If another node wins (higher term), it reverts to Follower
  7. If no one wins (split vote), timeout and try again with a new term
Why Random Timeouts? Without randomization, all followers would timeout simultaneously, all become candidates, and split the vote every time — livelock. Random election timeouts (e.g., 150ms–300ms) ensure one node almost always times out first, wins the election quickly, and establishes leadership before others try.
# Raft election example with 5 nodes (etcd cluster):
# 
# Term 1: Node A is leader, sends heartbeats every 100ms
# 
# Term 2: Node A crashes
#   - Node C election timeout fires first (randomized: 180ms)
#   - Node C increments term to 2, votes for self
#   - Node C sends RequestVote to B, D, E
#   - B, D, E grant vote (haven't voted in term 2 yet)
#   - Node C has 4 votes (self + B + D + E) out of 5 → MAJORITY
#   - Node C becomes Leader for term 2
#   - Node C begins sending heartbeats
#
# Total election time: ~200-300ms (one round-trip + timeout)

# In Kubernetes etcd, view the current leader:
etcdctl endpoint status --write-out=table
# +----------------+------------------+---------+---------+
# |    ENDPOINT    |        ID        | IS_LEADER | TERM  |
# +----------------+------------------+---------+---------+
# | 10.0.0.1:2379  | 8e9e05c52164694d |   false   |   5   |
# | 10.0.0.2:2379  | 91bc3c398fb3c146 |    true   |   5   |
# | 10.0.0.3:2379  | fd422379fda50e48 |   false   |   5   |
# +----------------+------------------+---------+---------+

Log Replication

Once a leader is elected, it handles all client requests. Each request becomes a log entry that must be replicated to a majority before being "committed":

  1. Client sends a write request to the Leader
  2. Leader appends the entry to its log
  3. Leader sends AppendEntries RPC to all Followers
  4. Followers append to their log and acknowledge
  5. Once a majority acknowledges, the Leader commits the entry
  6. Leader notifies Followers of the commit
  7. Leader responds to the client
Raft Log Replication Flow
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    
    C->>L: Write "set x=5"
    L->>L: Append to log (index 4, term 2)
    L->>F1: AppendEntries(index 4, "set x=5")
    L->>F2: AppendEntries(index 4, "set x=5")
    F1->>L: ACK (log index 4)
    Note over L: Majority (self + F1) = 2/3 ✓
    L->>L: Commit index 4
    L->>C: Success
    F2->>L: ACK (late but accepted)
    L->>F1: Heartbeat (commitIndex=4)
    L->>F2: Heartbeat (commitIndex=4)
                            

Quorum Mechanics

A quorum is the minimum number of nodes that must agree for a decision to be valid. In Raft, the quorum is ⌊N/2⌋ + 1 (a strict majority):

Cluster Size (N) Quorum Required Tolerates Failures Use Case
1 1 0 Development only
3 2 1 Small production clusters
5 3 2 Standard production
7 4 3 High availability (rarely needed)
Why Odd Numbers? A 4-node cluster requires 3 for quorum and tolerates only 1 failure — same as a 3-node cluster! You added a node for no additional fault tolerance. Even numbers also increase the chance of split votes. This is why Kubernetes etcd recommendations are always 3, 5, or 7 nodes — never 2, 4, or 6.

Safety Guarantees

Raft guarantees that once a log entry is committed, it will never be lost or contradicted, even across leader changes. This is enforced through:

  • Election restriction: A candidate can only win if its log is at least as up-to-date as a majority of nodes
  • Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical
  • Leader completeness: A committed entry will be present in every future leader's log

These properties ensure that committed writes survive leader elections, network partitions, and node restarts — exactly what you need for a system like etcd that stores your entire Kubernetes cluster state.

Paxos (Conceptual)

Historical Context

Paxos was described by Leslie Lamport in 1989 (published 1998) as the first practical consensus algorithm. It's provably correct and the theoretical foundation for most consensus systems. However, it's notoriously difficult to understand — Lamport's original paper used a fictional Greek island parliament as an analogy, and the community spent years arguing about what it actually meant.

Two-Phase Protocol

Basic Paxos uses two phases to reach agreement on a single value:

Phase 1 (Prepare):

  1. A proposer picks a unique proposal number N
  2. Sends "Prepare(N)" to all acceptors
  3. Each acceptor promises not to accept proposals with numbers less than N
  4. If the acceptor already accepted a value, it returns that value

Phase 2 (Accept):

  1. If the proposer receives promises from a majority, it sends "Accept(N, value)"
  2. The value is either the highest-numbered previously accepted value, or the proposer's own value if no prior value exists
  3. Acceptors accept if they haven't promised to a higher number
  4. Once a majority accepts, the value is chosen

Raft vs Paxos

Aspect Raft Paxos
Design goal Understandability Theoretical correctness
Leader Strong leader (all writes go through leader) Any node can propose (leaderless variant)
Log ordering Entries committed in order Gaps allowed (Multi-Paxos fills them)
Implementations etcd, Consul, CockroachDB Google Chubby, Apache ZooKeeper (ZAB variant)
Complexity ~2,000 lines (reference implementation) Varies widely; often 5,000+ lines

Split-Brain Problems

What is Split-Brain?

A split-brain occurs when a network partition divides a cluster into two (or more) groups, each believing the other has failed. Without proper consensus, both sides may elect their own leader and accept writes independently, causing permanent data divergence.

Split-Brain Scenario
flowchart TD
    subgraph Partition A
        A1[Node 1 - thinks it is Leader]
        A2[Node 2 - Follower]
    end
    subgraph Partition B
        B3[Node 3 - thinks it is Leader]
        B4[Node 4 - Follower]
        B5[Node 5 - Follower]
    end
    A1 -.->|Network Partition| B3
    A1 -->|Heartbeats| A2
    B3 -->|Heartbeats| B4
    B3 -->|Heartbeats| B5
                            

In this 5-node example, nodes 1–2 are partitioned from nodes 3–5. Node 1 was the original leader but can't reach a majority (needs 3). Nodes 3–5 elect Node 3 as the new leader (has 3 = majority). The quorum requirement naturally resolves this — only the partition with a majority can make progress.

Prevention Strategies

  • Quorum-based decisions: Only the majority partition can commit (Raft's approach)
  • Fencing tokens: Monotonically increasing tokens that invalidate stale leaders
  • Lease-based leadership: Leaders hold time-bounded leases; must renew to stay leader
  • STONITH (Shoot The Other Node In The Head): Hardware-level fencing for the minority partition
Kubernetes Protection: etcd's Raft implementation prevents split-brain by requiring quorum for all writes. If a control plane loses contact with the majority, its API Server becomes read-only — it cannot accept new deployments or changes. This is why losing 2 of 3 etcd nodes makes the cluster unable to accept writes (no quorum).

Consensus in Kubernetes (etcd)

etcd is Kubernetes' single source of truth — every object (Pods, Services, Deployments) is stored as a key-value pair in etcd. The Raft consensus algorithm ensures this state is consistently replicated across all etcd nodes.

# Check etcd cluster health:
etcdctl endpoint health --cluster
# https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 2.5ms
# https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.1ms
# https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 1.9ms

# View etcd member list:
etcdctl member list --write-out=table
# +------------------+---------+---------+------------------------+------------------------+
# |        ID        | STATUS  |  NAME   |       PEER ADDRS       |      CLIENT ADDRS      |
# +------------------+---------+---------+------------------------+------------------------+
# | 8e9e05c52164694d | started | etcd-1  | https://10.0.0.1:2380  | https://10.0.0.1:2379  |
# | 91bc3c398fb3c146 | started | etcd-2  | https://10.0.0.2:2380  | https://10.0.0.2:2379  |
# | fd422379fda50e48 | started | etcd-3  | https://10.0.0.3:2380  | https://10.0.0.3:2379  |
# +------------------+---------+---------+------------------------+------------------------+

# Watch Raft term changes (indicates leader elections):
etcdctl endpoint status --write-out=json | python3 -m json.tool
# Look for "raftTerm" field — increments indicate elections occurred

When you run kubectl apply -f deployment.yaml, here's what happens at the consensus level:

  1. API Server receives the request and validates it
  2. API Server sends a write proposal to the etcd leader
  3. etcd leader appends the entry to its Raft log
  4. Leader replicates the entry to follower etcd nodes
  5. Once a majority acknowledges, the entry is committed
  6. etcd responds to the API Server with success
  7. API Server responds to kubectl with the created resource
Production Warning: etcd performance directly impacts Kubernetes responsiveness. Raft requires disk fsync for every committed entry. Slow disks (especially network-attached storage) cause high commit latency, leader elections, and cluster instability. Always use local SSDs for etcd in production. Monitor etcd_disk_wal_fsync_duration_seconds — if P99 exceeds 10ms, investigate immediately.

Exercises

Exercise 1 — Trace a Leader Election: You have a 5-node Raft cluster in term 3. The leader (Node A) crashes. Nodes B through E have election timeouts of 200ms, 180ms, 250ms, and 190ms respectively. Trace the election step by step: Who becomes the candidate first? What term do they propose? How many votes do they need? What if Node C's vote request arrives at Node E before Node D's request?
Exercise 2 — Quorum Math: Your production Kubernetes cluster has 3 etcd nodes. (a) What's the quorum? (b) How many nodes can fail? (c) If you add a 4th node, does fault tolerance improve? (d) Why do Kubernetes best practices recommend 5 etcd nodes for large clusters but never 4 or 6?
Exercise 3 — Split-Brain Scenario: A 5-node etcd cluster suffers a network partition: nodes 1–2 on one side, nodes 3–5 on the other. Node 1 was the leader. (a) Can nodes 1–2 accept writes? Why or why not? (b) What happens on the 3–5 side? (c) When the partition heals, how is state reconciled? (d) What would happen without quorum requirements?
Exercise 4 — Practical: If you have access to a Kubernetes cluster, run etcdctl endpoint status --write-out=table and identify: (a) Which node is the leader, (b) The current Raft term, (c) The Raft index (total committed entries). Then restart the leader node and observe: How quickly does a new election happen? What's the new term number?

Conclusion

Consensus algorithms solve the fundamental challenge of distributed agreement — how do unreliable machines, communicating over unreliable networks, agree on anything? Raft provides an elegant solution through strong leadership, sequential log replication, and quorum-based commits.

For Kubernetes practitioners, understanding Raft explains:

  • Why etcd needs odd numbers of nodes (quorum math)
  • Why losing majority etcd nodes is catastrophic (no quorum = no writes)
  • Why etcd performance matters (every write is a Raft round-trip)
  • Why leader elections cause brief write pauses (~300ms)
  • Why etcd backups are critical (committed state = cluster state)

In Part 3, we'll explore the CAP theorem — the fundamental trade-off between consistency, availability, and partition tolerance that governs all distributed system design decisions, including those made by Kubernetes.