Consensus Algorithms - Part 2

Part 1 established that distributed systems require coordination — multiple machines must agree on shared facts to act as one system. Consensus is the technical solution to coordination. In this part, we'll see how machines actually reach agreement using the Raft algorithm — the same protocol that powers Kubernetes' state store (etcd).

The Consensus Problem

Why Consensus Matters

Imagine three database servers that need to agree on the order of transactions. Client A sends "set x=1" to Server 1, while Client B simultaneously sends "set x=2" to Server 2. Without consensus, each server might apply these operations in a different order, leading to permanent disagreement about the value of x.

Consensus algorithms solve this by ensuring all nodes agree on the same sequence of operations, even when some nodes fail or messages are delayed. This is the foundation of:

Leader election: Who coordinates the cluster?
Atomic broadcast: Delivering messages to all nodes in the same order
Distributed locking: Ensuring only one process holds a lock
State machine replication: Keeping replicas identical

                            
                            Kubernetes Connection: etcd — Kubernetes' brain — uses the Raft consensus algorithm to replicate cluster state across control plane nodes. Every kubectl apply goes through Raft before being considered committed. Understanding Raft means understanding why Kubernetes requires an odd number of etcd nodes and what happens during control plane failures.
                        

Required Properties

A correct consensus algorithm must guarantee three properties:

Property	Meaning	Why It Matters
Agreement	All non-faulty nodes decide on the same value	No conflicting state across replicas
Validity	The decided value was proposed by some node	Prevents fabricating values from nowhere
Termination	All non-faulty nodes eventually reach a decision	System doesn't hang forever

FLP Impossibility

In 1985, Fischer, Lynch, and Paterson proved that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system where even one node might crash (the FLP impossibility result). This sounds devastating, but practical systems work around it using:

Randomization: Break symmetry with random timeouts (Raft uses this)
Partial synchrony: Assume the network eventually delivers messages within some bound
Failure detectors: Use timeouts as imperfect crash detectors

Raft and Paxos both rely on partial synchrony — they guarantee safety (agreement) always, and guarantee liveness (termination) as long as the network eventually stabilizes.

Consensus Approaches

Before diving into Raft's mechanics, it helps to know the landscape. There are three major families of consensus algorithms, each suited to different trust and failure models:

Approach	How It Works	Tolerates	Examples
Leader-based	One node coordinates; others follow	Crash failures (minority)	Raft, ZAB (ZooKeeper)
Leaderless	Any node can propose; quorum decides	Crash failures (minority)	Paxos, EPaxos
Byzantine Fault Tolerant (BFT)	Handles nodes that lie or act maliciously	Up to ⅓ malicious nodes	PBFT, Tendermint

We focus on leader-based consensus (Raft) because Kubernetes uses it. Leader-based algorithms are simpler to implement and faster in the common case (no conflicts) — the leader makes decisions unilaterally, followers just replicate. The downside: if the leader fails, the cluster pauses briefly during a new election.

                            
                            Key Concept — The Replicated Log: At its core, consensus means agreeing on a sequence of commands. A "log" is that sequence — each entry is {index, term, command}. If all nodes have the same log in the same order, they compute the same state. Raft's job is to ensure every node's log stays identical, even when nodes crash or the network drops messages.
                        

Raft Consensus Algorithm

Overview & Design Goals

Raft was designed in 2014 by Diego Ongaro and John Ousterhout with one primary goal: understandability. Paxos, the previous standard, was notoriously difficult to understand and implement correctly. Raft achieves equivalent correctness guarantees while being dramatically easier to reason about.

Raft decomposes consensus into three sub-problems:

Leader Election: Choose one node to coordinate
Log Replication: The leader replicates operations to followers
Safety: Ensure replicated logs remain consistent

Raft Node States

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : Election timeout (no heartbeat)
    Candidate --> Leader : Receives majority votes
    Candidate --> Follower : Discovers higher term leader
    Candidate --> Candidate : Election timeout (split vote)
    Leader --> Follower : Discovers higher term

Every node in a Raft cluster is in one of three states at any time: Follower, Candidate, or Leader. Time is divided into terms — monotonically increasing integers that act as logical clocks.

Leader Election

Raft uses a heartbeat mechanism to trigger elections:

All nodes start as Followers
The Leader sends periodic heartbeats (empty AppendEntries RPCs)
If a Follower doesn't receive a heartbeat within its election timeout (randomized, typically 150–300ms), it becomes a Candidate
The Candidate increments its term, votes for itself, and requests votes from all other nodes
If it receives votes from a majority, it becomes Leader
If another node wins (higher term), it reverts to Follower
If no one wins (split vote), timeout and try again with a new term

                            
                            Why Random Timeouts? Without randomization, all followers would timeout simultaneously, all become candidates, and split the vote every time — livelock. Random election timeouts (e.g., 150ms–300ms) ensure one node almost always times out first, wins the election quickly, and establishes leadership before others try.
                        

# Raft election example with 5 nodes (etcd cluster):
# 
# Term 1: Node A is leader, sends heartbeats every 100ms
# 
# Term 2: Node A crashes
#   - Node C election timeout fires first (randomized: 180ms)
#   - Node C increments term to 2, votes for self
#   - Node C sends RequestVote to B, D, E
#   - B, D, E grant vote (haven't voted in term 2 yet)
#   - Node C has 4 votes (self + B + D + E) out of 5 → MAJORITY
#   - Node C becomes Leader for term 2
#   - Node C begins sending heartbeats
#
# Total election time: ~200-300ms (one round-trip + timeout)

# In Kubernetes etcd, view the current leader:
# (Install: apt install etcd-client  OR  snap install etcd)
etcdctl endpoint status --write-out=table
# +----------------+------------------+---------+---------+
# |    ENDPOINT    |        ID        | IS_LEADER | TERM  |
# +----------------+------------------+---------+---------+
# | 10.0.0.1:2379  | 8e9e05c52164694d |   false   |   5   |
# | 10.0.0.2:2379  | 91bc3c398fb3c146 |    true   |   5   |
# | 10.0.0.3:2379  | fd422379fda50e48 |   false   |   5   |
# +----------------+------------------+---------+---------+

Log Replication

Once a leader is elected, it handles all client requests. Each request becomes a log entry that must be replicated to a majority before being "committed":

Client sends a write request to the Leader
Leader appends the entry to its log
Leader sends AppendEntries RPC to all Followers
Followers append to their log and acknowledge
Once a majority acknowledges, the Leader commits the entry
Leader notifies Followers of the commit
Leader responds to the client

Raft Log Replication Flow

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    
    C->>L: Write "set x=5"
    L->>L: Append to log (index 4, term 2)
    L->>F1: AppendEntries(index 4, "set x=5")
    L->>F2: AppendEntries(index 4, "set x=5")
    F1->>L: ACK (log index 4)
    Note over L: Majority (self + F1) = 2/3 ✓
    L->>L: Commit index 4
    L->>C: Success
    F2->>L: ACK (late but accepted)
    L->>F1: Heartbeat (commitIndex=4)
    L->>F2: Heartbeat (commitIndex=4)

Quorum Mechanics

A quorum is the minimum number of nodes that must agree for a decision to be valid. In Raft, the quorum is ⌊N/2⌋ + 1 (a strict majority):

Cluster Size (N)	Quorum Required	Tolerates Failures	Use Case
1	1	0	Development only
3	2	1	Small production clusters
5	3	2	Standard production
7	4	3	High availability (rarely needed)

                            
                            Why Odd Numbers? A 4-node cluster requires 3 for quorum and tolerates only 1 failure — same as a 3-node cluster! You added a node for no additional fault tolerance. Even numbers also increase the chance of split votes. This is why Kubernetes etcd recommendations are always 3, 5, or 7 nodes — never 2, 4, or 6.
                        

Safety Guarantees

Raft guarantees that once a log entry is committed, it will never be lost or contradicted, even across leader changes. This is enforced through:

Election restriction: A candidate can only win if its log is at least as up-to-date as a majority of nodes
Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical
Leader completeness: A committed entry will be present in every future leader's log

These properties ensure that committed writes survive leader elections, network partitions, and node restarts — exactly what you need for a system like etcd that stores your entire Kubernetes cluster state.

Interactive: Raft Protocol Simulation

Step through a complete Raft lifecycle — from initial followers to leader election, log replication, network partition, and recovery. Use the Next button (or arrow keys while hovering) to advance each stage.

Raft Consensus — Live Simulation

Follower Candidate Leader Log Entry

All Nodes Start as Followers

A fresh Raft cluster begins with every node in the Follower state. No leader exists yet. Click Next to begin the election process.

t = 0 ms Step 1 / 14

Paxos (Conceptual)

Historical Context

Paxos was described by Leslie Lamport in 1989 (published 1998) as the first practical consensus algorithm. It's provably correct and the theoretical foundation for most consensus systems. However, it's notoriously difficult to understand — Lamport's original paper used a fictional Greek island parliament as an analogy, and the community spent years arguing about what it actually meant.

Two-Phase Protocol

Basic Paxos uses two phases to reach agreement on a single value:

Phase 1 (Prepare):

A proposer picks a unique proposal number N
Sends "Prepare(N)" to all acceptors
Each acceptor promises not to accept proposals with numbers less than N
If the acceptor already accepted a value, it returns that value

Phase 2 (Accept):

If the proposer receives promises from a majority, it sends "Accept(N, value)"
The value is either the highest-numbered previously accepted value, or the proposer's own value if no prior value exists
Acceptors accept if they haven't promised to a higher number
Once a majority accepts, the value is chosen

Raft vs Paxos

Aspect	Raft	Paxos
Design goal	Understandability	Theoretical correctness
Leader	Strong leader (all writes go through leader)	Any node can propose (leaderless variant)
Log ordering	Entries committed in order	Gaps allowed (Multi-Paxos fills them)
Implementations	etcd, Consul, CockroachDB	Google Chubby, Apache ZooKeeper (ZAB variant)
Complexity	~2,000 lines (reference implementation)	Varies widely; often 5,000+ lines

Split-Brain Problems

What is Split-Brain?

A split-brain occurs when a network partition divides a cluster into two (or more) groups, each believing the other has failed. Without proper consensus, both sides may elect their own leader and accept writes independently, causing permanent data divergence.

Split-Brain Scenario

flowchart TD
    subgraph Partition A
        A1[Node 1 - thinks it is Leader]
        A2[Node 2 - Follower]
    end
    subgraph Partition B
        B3[Node 3 - thinks it is Leader]
        B4[Node 4 - Follower]
        B5[Node 5 - Follower]
    end
    A1 -.->|Network Partition| B3
    A1 -->|Heartbeats| A2
    B3 -->|Heartbeats| B4
    B3 -->|Heartbeats| B5

In this 5-node example, nodes 1–2 are partitioned from nodes 3–5. Node 1 was the original leader but can't reach a majority (needs 3). Nodes 3–5 elect Node 3 as the new leader (has 3 = majority). The quorum requirement naturally resolves this — only the partition with a majority can make progress.

Prevention Strategies

Quorum-based decisions: Only the majority partition can commit (Raft's approach)
Fencing tokens: Monotonically increasing tokens that invalidate stale leaders
Lease-based leadership: Leaders hold time-bounded leases; must renew to stay leader
STONITH (Shoot The Other Node In The Head): Hardware-level fencing for the minority partition

                            
                            Kubernetes Protection: etcd's Raft implementation prevents split-brain by requiring quorum for all writes. If a control plane loses contact with the majority, its API Server becomes read-only — it cannot accept new deployments or changes. This is why losing 2 of 3 etcd nodes makes the cluster unable to accept writes (no quorum).
                        

Consensus in Kubernetes (etcd)

etcd is Kubernetes' single source of truth — every object (Pods, Services, Deployments) is stored as a key-value pair in etcd. The Raft consensus algorithm ensures this state is consistently replicated across all etcd nodes.

                            Prerequisites: The commands below require a running etcd cluster. If you have a Kubernetes cluster, etcd is already running on control-plane nodes. To spin up a local test cluster with Docker:
                            docker run -d --name etcd -p 2379:2379 quay.io/coreos/etcd:v3.5.12 /usr/local/bin/etcd --advertise-client-urls http://0.0.0.0:2379 --listen-client-urls http://0.0.0.0:2379
                        

# Install etcdctl (if not available):
# Ubuntu/Debian: sudo apt install etcd-client
# Snap:          sudo snap install etcd
# From binary:   https://github.com/etcd-io/etcd/releases

# Check etcd cluster health:
etcdctl endpoint health --cluster
# https://10.0.0.1:2379 is healthy: successfully committed proposal: took = 2.5ms
# https://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.1ms
# https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 1.9ms

# View etcd member list:
etcdctl member list --write-out=table
# +------------------+---------+---------+------------------------+------------------------+
# |        ID        | STATUS  |  NAME   |       PEER ADDRS       |      CLIENT ADDRS      |
# +------------------+---------+---------+------------------------+------------------------+
# | 8e9e05c52164694d | started | etcd-1  | https://10.0.0.1:2380  | https://10.0.0.1:2379  |
# | 91bc3c398fb3c146 | started | etcd-2  | https://10.0.0.2:2380  | https://10.0.0.2:2379  |
# | fd422379fda50e48 | started | etcd-3  | https://10.0.0.3:2380  | https://10.0.0.3:2379  |
# +------------------+---------+---------+------------------------+------------------------+

# Watch Raft term changes (indicates leader elections):
etcdctl endpoint status --write-out=json | python3 -m json.tool
# Look for "raftTerm" field — increments indicate elections occurred

When you run kubectl apply -f deployment.yaml, here's what happens at the consensus level:

API Server receives the request and validates it
API Server sends a write proposal to the etcd leader
etcd leader appends the entry to its Raft log
Leader replicates the entry to follower etcd nodes
Once a majority acknowledges, the entry is committed
etcd responds to the API Server with success
API Server responds to kubectl with the created resource

                            
                            Production Warning: etcd performance directly impacts Kubernetes responsiveness. Raft requires disk fsync for every committed entry. Slow disks (especially network-attached storage) cause high commit latency, leader elections, and cluster instability. Always use local SSDs for etcd in production. Monitor etcd_disk_wal_fsync_duration_seconds — if P99 exceeds 10ms, investigate immediately.
                        

Exercises

                            
                            Exercise 1 — Trace a Leader Election: You have a 5-node Raft cluster in term 3. The leader (Node A) crashes. Nodes B through E have election timeouts of 200ms, 180ms, 250ms, and 190ms respectively. Trace the election step by step: Who becomes the candidate first? What term do they propose? How many votes do they need? What if Node C's vote request arrives at Node E before Node D's request?
                        

                            
                            Exercise 2 — Quorum Math: Your production Kubernetes cluster has 3 etcd nodes. (a) What's the quorum? (b) How many nodes can fail? (c) If you add a 4th node, does fault tolerance improve? (d) Why do Kubernetes best practices recommend 5 etcd nodes for large clusters but never 4 or 6?
                        

                            
                            Exercise 3 — Split-Brain Scenario: A 5-node etcd cluster suffers a network partition: nodes 1–2 on one side, nodes 3–5 on the other. Node 1 was the leader. (a) Can nodes 1–2 accept writes? Why or why not? (b) What happens on the 3–5 side? (c) When the partition heals, how is state reconciled? (d) What would happen without quorum requirements?
                        

                            
                            Exercise 4 — Practical: If you have access to a Kubernetes cluster, run etcdctl endpoint status --write-out=table and identify: (a) Which node is the leader, (b) The current Raft term, (c) The Raft index (total committed entries). Then restart the leader node and observe: How quickly does a new election happen? What's the new term number?
                        

Conclusion

Consensus algorithms solve the fundamental challenge of distributed agreement — how do unreliable machines, communicating over unreliable networks, agree on anything? Raft provides an elegant solution through strong leadership, sequential log replication, and quorum-based commits.

For Kubernetes practitioners, understanding Raft explains:

Why etcd needs odd numbers of nodes (quorum math)
Why losing majority etcd nodes is catastrophic (no quorum = no writes)
Why etcd performance matters (every write is a Raft round-trip)
Why leader elections cause brief write pauses (~300ms)
Why etcd backups are critical (committed state = cluster state)

In Part 3, we'll explore the CAP theorem — the fundamental trade-off between consistency, availability, and partition tolerance that governs all distributed system design decisions, including those made by Kubernetes.

Previous Part 1: Distributed Systems Foundations Next Part 3: CAP Theorem & Replication

Cookie Consent

Part 2: Consensus Algorithms

Table of Contents

The Consensus Problem

Why Consensus Matters

Required Properties

FLP Impossibility

Consensus Approaches

Raft Consensus Algorithm

Overview & Design Goals

Leader Election

Log Replication

Quorum Mechanics

Safety Guarantees

Interactive: Raft Protocol Simulation

Raft Consensus — Live Simulation

Paxos (Conceptual)

Historical Context

Two-Phase Protocol

Raft vs Paxos

Split-Brain Problems

What is Split-Brain?

Prevention Strategies

Consensus in Kubernetes (etcd)

Exercises

Conclusion

Cookie Consent

Part 2: Consensus Algorithms

Table of Contents

The Consensus Problem

Why Consensus Matters

Required Properties

FLP Impossibility

Consensus Approaches

Raft Consensus Algorithm

Overview & Design Goals

Leader Election

Log Replication

Quorum Mechanics

Safety Guarantees

Interactive: Raft Protocol Simulation

Raft Consensus — Live Simulation

Paxos (Conceptual)

Historical Context

Two-Phase Protocol

Raft vs Paxos

Split-Brain Problems

What is Split-Brain?

Prevention Strategies

Consensus in Kubernetes (etcd)

Exercises

Conclusion

Continue the Series

Part 1: Distributed Systems Foundations

Part 3: CAP Theorem & Replication

Part 11: Kubernetes Internals