Distributed Coordination & Consistency — Systems Thinking & Architecture Mastery Part 14

Module 28: Distributed Coordination

Why Coordination Is Hard

In a single-machine system, coordination is trivial — use a mutex, a semaphore, or a compare-and-swap instruction. In a distributed system, there is no shared memory, no global clock, and no guarantee that messages arrive in order (or at all). This makes seemingly simple questions enormously difficult:

Who is the leader? (What if two nodes both think they're the leader?)
What is the current value? (What if different nodes have different values?)
Has this operation been committed? (What if the node that acknowledged it crashes?)
What happened first? (With no global clock, "first" has no absolute meaning)

                            
                            The Three Impossibilities: (1) FLP Impossibility: No deterministic consensus algorithm can guarantee progress in an asynchronous system where even one process can crash. (2) CAP Theorem: You cannot simultaneously guarantee Consistency, Availability, and Partition tolerance. (3) Two Generals Problem: No protocol can guarantee agreement over an unreliable channel using a finite number of messages.
                        

These impossibility results don't mean coordination is impossible — they mean you must make tradeoffs. Every distributed system chooses where to compromise: accept that consensus might be slow (sacrifice liveness), accept that reads might be stale (sacrifice consistency), or accept that the system might be unavailable during partitions (sacrifice availability).

The Consensus Problem

The consensus problem is deceptively simple to state: get a group of nodes to agree on a single value, even if some nodes crash. Formally, consensus requires three properties:

Agreement: All non-faulty nodes decide on the same value
Validity: The decided value was proposed by some node (you can't just always return "42")
Termination: All non-faulty nodes eventually decide (the system makes progress)

Consensus is the foundation for everything else — leader election, atomic broadcast, distributed locks, replicated state machines, and configuration management all reduce to consensus.

Raft: Understandable Consensus

Raft was designed explicitly to be understandable (unlike Paxos). It decomposes consensus into three sub-problems: leader election, log replication, and safety.

Raft Leader Election Sequence

sequenceDiagram
    participant N1 as Node 1
(Follower)
    participant N2 as Node 2
(Candidate)
    participant N3 as Node 3
(Follower)

    Note over N1,N3: Term 1: Node X was leader (crashed)

    Note over N2: Election timeout expires
→ Becomes Candidate
→ Increments term to 2

    N2->>N1: RequestVote(term=2, lastLogIndex=5)
    N2->>N3: RequestVote(term=2, lastLogIndex=5)

    N1-->>N2: VoteGranted(term=2) ✓
    N3-->>N2: VoteGranted(term=2) ✓

    Note over N2: Received majority (2/3 votes)
→ Becomes Leader (term 2)

    N2->>N1: AppendEntries(heartbeat, term=2)
    N2->>N3: AppendEntries(heartbeat, term=2)

    Note over N1,N3: Followers reset election timer
Leader maintains authority

Raft's key mechanisms:

Leader election: When a follower's election timer expires (randomized 150-300ms), it becomes a candidate and requests votes. First candidate to receive majority votes wins. Randomized timers prevent split votes.
Log replication: The leader accepts client requests, appends them to its log, and replicates to followers via AppendEntries RPCs. An entry is "committed" when a majority of nodes have it.
Safety: A candidate can only win if its log is at least as up-to-date as the majority. This ensures the elected leader always has all committed entries.

Here's a simplified Raft state machine implementation:

"""
Raft Consensus - Simplified State Machine
Demonstrates the core state transitions and leader election logic.
"""
import random
import time
from enum import Enum
from dataclasses import dataclass, field
from typing import Optional


class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class LogEntry:
    term: int
    index: int
    command: str


@dataclass
class RaftNode:
    """Simplified Raft node demonstrating core state machine logic."""
    node_id: str
    cluster_size: int = 5

    # Persistent state
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)

    # Volatile state
    state: NodeState = NodeState.FOLLOWER
    commit_index: int = 0
    last_applied: int = 0

    # Leader state (reinitialized on election)
    next_index: dict = field(default_factory=dict)
    match_index: dict = field(default_factory=dict)

    # Election state
    votes_received: int = 0
    election_timeout: float = 0.0

    def reset_election_timer(self):
        """Randomized timeout prevents split votes."""
        self.election_timeout = time.time() + random.uniform(0.150, 0.300)

    @property
    def majority(self) -> int:
        """Majority quorum for this cluster size."""
        return (self.cluster_size // 2) + 1

    def start_election(self):
        """Follower timeout expired → become candidate."""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.votes_received = 1  # Vote for self
        self.reset_election_timer()
        print(f"[{self.node_id}] Starting election for term {self.current_term}")
        # In real implementation: send RequestVote RPCs to all peers

    def handle_vote_response(self, vote_granted: bool, term: int):
        """Process a vote response from a peer."""
        if term > self.current_term:
            # Discovered higher term → step down
            self.current_term = term
            self.state = NodeState.FOLLOWER
            self.voted_for = None
            return

        if self.state != NodeState.CANDIDATE:
            return  # No longer a candidate

        if vote_granted:
            self.votes_received += 1
            if self.votes_received >= self.majority:
                self.become_leader()

    def become_leader(self):
        """Won election → initialize leader state."""
        self.state = NodeState.LEADER
        print(f"[{self.node_id}] Elected leader for term {self.current_term}")
        # Initialize next_index for each peer to end of log
        last_log_index = len(self.log)
        # In real implementation: send initial empty AppendEntries (heartbeat)

    def handle_request_vote(self, candidate_id: str, term: int,
                            last_log_index: int, last_log_term: int) -> bool:
        """Decide whether to grant vote to a candidate."""
        if term < self.current_term:
            return False  # Reject stale term

        if term > self.current_term:
            self.current_term = term
            self.state = NodeState.FOLLOWER
            self.voted_for = None

        # Grant vote if: haven't voted yet AND candidate's log is up-to-date
        if self.voted_for is None or self.voted_for == candidate_id:
            if self._is_log_up_to_date(last_log_index, last_log_term):
                self.voted_for = candidate_id
                self.reset_election_timer()
                return True

        return False

    def _is_log_up_to_date(self, last_log_index: int, last_log_term: int) -> bool:
        """Check if candidate's log is at least as up-to-date as ours."""
        my_last_term = self.log[-1].term if self.log else 0
        my_last_index = len(self.log)

        if last_log_term != my_last_term:
            return last_log_term > my_last_term
        return last_log_index >= my_last_index

    def append_entry(self, command: str) -> Optional[LogEntry]:
        """Leader accepts a client command."""
        if self.state != NodeState.LEADER:
            return None  # Only leader accepts writes

        entry = LogEntry(
            term=self.current_term,
            index=len(self.log) + 1,
            command=command
        )
        self.log.append(entry)
        print(f"[{self.node_id}] Appended entry: {command} (index={entry.index})")
        # In real implementation: replicate to followers via AppendEntries
        return entry


# Demonstration
if __name__ == "__main__":
    node = RaftNode(node_id="node-1", cluster_size=5)
    print(f"Initial state: {node.state.value}, term={node.current_term}")

    # Simulate election timeout
    node.start_election()
    print(f"After election start: {node.state.value}, term={node.current_term}")

    # Simulate receiving majority votes
    node.handle_vote_response(vote_granted=True, term=1)
    node.handle_vote_response(vote_granted=True, term=1)
    print(f"After majority: {node.state.value}")

    # Leader accepts writes
    entry = node.append_entry("SET x = 42")
    print(f"Log length: {len(node.log)}, commit_index: {node.commit_index}")

Paxos: The Foundational Algorithm

Paxos (by Leslie Lamport, 1989) is the first proven-correct consensus algorithm. It's notoriously difficult to understand, which is why Raft was created as an alternative. However, understanding Paxos's core insight is valuable:

                            
                            Paxos Core Insight: Consensus is achieved through two rounds: (1) Prepare: A proposer asks a majority of acceptors to promise not to accept older proposals. (2) Accept: If a majority promised, the proposer sends the value to accept. A value is chosen when a majority of acceptors accept it. The key insight: if a proposer learns that an acceptor already accepted a value, it must propose THAT value — this prevents conflicts.
                        

Paxos roles:

Proposers: Propose values. Multiple proposers can exist simultaneously (causes dueling proposals that slow but don't break consensus).
Acceptors: Vote on proposals. A majority must agree for consensus. Each acceptor tracks the highest proposal number it has promised and the highest it has accepted.
Learners: Learn the chosen value. In practice, all nodes are learners — they need to know what was decided.

Paxos vs. Raft comparison:

Paxos allows multiple proposers simultaneously (more available, harder to understand)
Raft restricts to a single leader (simpler, slight availability tradeoff during elections)
Both tolerate ⌊(n-1)/2⌋ failures in a cluster of n nodes
Most production systems use Raft (etcd, CockroachDB, TiKV) or Multi-Paxos variants (Chubby, Spanner)

Quorum Systems

A quorum is the minimum number of nodes that must participate in an operation for it to be valid. The fundamental insight of quorum systems is that any two quorums must overlap — this guarantees that a read quorum will always see the most recent write.

Quorum Overlap Guarantees Consistency (W + R > N)

flowchart TD
    subgraph cluster["5-Node Cluster (N=5)"]
        N1["Node 1"]
        N2["Node 2"]
        N3["Node 3"]
        N4["Node 4"]
        N5["Node 5"]
    end

    subgraph write["Write Quorum (W=3)"]
        W1["Node 1 ✓"]
        W2["Node 2 ✓"]
        W3["Node 3 ✓"]
    end

    subgraph read["Read Quorum (R=3)"]
        R1["Node 3 ✓"]
        R2["Node 4 ✓"]
        R3["Node 5 ✓"]
    end

    W3 -.-|"Overlap!
Guarantees read
sees latest write"| R1

    style write fill:#3B9797,color:#fff
    style read fill:#16476A,color:#fff
    style cluster fill:#f8f9fa,color:#132440

The quorum formula: W + R > N (where W = write quorum, R = read quorum, N = total nodes)

"""
Quorum Calculator — Explore different quorum configurations
and their tradeoffs for distributed systems.
"""


def analyze_quorum(n: int, w: int, r: int) -> dict:
    """Analyze a quorum configuration for a cluster of size n."""
    strong_consistency = (w + r) > n
    fault_tolerance_writes = n - w    # Can lose this many nodes and still write
    fault_tolerance_reads = n - r     # Can lose this many nodes and still read
    fault_tolerance_both = min(fault_tolerance_writes, fault_tolerance_reads)

    return {
        "cluster_size": n,
        "write_quorum": w,
        "read_quorum": r,
        "strong_consistency": strong_consistency,
        "write_fault_tolerance": fault_tolerance_writes,
        "read_fault_tolerance": fault_tolerance_reads,
        "overall_fault_tolerance": fault_tolerance_both,
        "write_latency": "Higher (wait for W nodes)",
        "read_latency": "Higher (wait for R nodes)" if r > 1 else "Low (single node)",
    }


# Common configurations
configs = [
    # Strong consistency: W + R > N
    ("Strong Consistency (majority)", 5, 3, 3),
    ("Write-heavy (fast reads)", 5, 4, 2),
    ("Read-heavy (fast writes)", 5, 2, 4),

    # Eventual consistency: W + R <= N
    ("Eventual (fast writes, fast reads)", 5, 1, 1),
    ("Single-leader replication", 3, 1, 3),

    # DynamoDB-style tunable
    ("DynamoDB default", 3, 2, 2),
]

print("=" * 70)
print(f"{'Config':<35} {'N':>2} {'W':>2} {'R':>2} {'Strong?':<8} {'Fault Tol':>9}")
print("=" * 70)

for name, n, w, r in configs:
    result = analyze_quorum(n, w, r)
    strong = "YES" if result["strong_consistency"] else "NO"
    print(f"{name:<35} {n:>2} {w:>2} {r:>2} {strong:<8} {result['overall_fault_tolerance']:>5} nodes")

print("=" * 70)
print("\nKey insight: W + R > N ensures every read sees the latest write.")
print("Tradeoff: Higher W = slower writes but faster reads (and vice versa).")

Leader Election Patterns

Leader election is a specific application of consensus — agreeing on which node is "in charge." Different systems use different approaches:

Raft election: Randomized timeouts + majority vote. Simple, well-understood, used in etcd/Consul.
Bully algorithm: Highest-ID node always wins. Simple but slower (multiple rounds of messages). Used in older systems.
ZooKeeper recipes: Create ephemeral sequential nodes; lowest sequence number is leader. Watches notify on leadership changes. Used in Kafka (pre-KRaft), HBase.
Lease-based: Leader holds a time-limited lease. Must renew before expiry or another node takes over. Used in Google Chubby, DynamoDB.

An etcd cluster configuration demonstrates Raft-based coordination in practice:

# etcd-cluster.yaml — 3-node etcd cluster for Kubernetes
# Uses Raft consensus for leader election and log replication
apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-config
  namespace: kube-system
data:
  etcd.conf.yaml: |
    name: etcd-node-1
    data-dir: /var/lib/etcd

    # Cluster configuration
    initial-cluster: >-
      etcd-node-1=https://etcd-1.etcd.kube-system.svc:2380,
      etcd-node-2=https://etcd-2.etcd.kube-system.svc:2380,
      etcd-node-3=https://etcd-3.etcd.kube-system.svc:2380
    initial-cluster-state: new
    initial-cluster-token: k8s-etcd-cluster

    # Client communication
    listen-client-urls: https://0.0.0.0:2379
    advertise-client-urls: https://etcd-1.etcd.kube-system.svc:2379

    # Peer communication (Raft replication)
    listen-peer-urls: https://0.0.0.0:2380
    initial-advertise-peer-urls: https://etcd-1.etcd.kube-system.svc:2380

    # Raft tuning
    heartbeat-interval: 100        # ms — leader sends heartbeats
    election-timeout: 1000         # ms — follower starts election if no heartbeat

    # Snapshot and compaction
    snapshot-count: 10000          # Snapshot after 10k entries
    auto-compaction-mode: periodic
    auto-compaction-retention: "1h"

    # Security
    client-transport-security:
      cert-file: /etc/etcd/pki/server.crt
      key-file: /etc/etcd/pki/server.key
      trusted-ca-file: /etc/etcd/pki/ca.crt
      client-cert-auth: true
    peer-transport-security:
      cert-file: /etc/etcd/pki/peer.crt
      key-file: /etc/etcd/pki/peer.key
      trusted-ca-file: /etc/etcd/pki/ca.crt
      client-cert-auth: true

Verify cluster health with a simple script:

#!/bin/bash
# etcd-health-check.sh — Verify etcd cluster consensus health
# Run periodically to detect split-brain or lagging members

ETCD_ENDPOINTS="https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379"
ETCD_CACERT="/etc/etcd/pki/ca.crt"
ETCD_CERT="/etc/etcd/pki/client.crt"
ETCD_KEY="/etc/etcd/pki/client.key"

echo "=== etcd Cluster Health Check ==="
echo "Timestamp: $(date -u)"
echo ""

# Check endpoint health
echo "--- Endpoint Health ---"
etcdctl endpoint health \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_CACERT" \
    --cert="$ETCD_CERT" \
    --key="$ETCD_KEY" \
    --write-out=table

echo ""

# Check endpoint status (shows leader, raft term, raft index)
echo "--- Endpoint Status (Raft State) ---"
etcdctl endpoint status \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_CACERT" \
    --cert="$ETCD_CERT" \
    --key="$ETCD_KEY" \
    --write-out=table

echo ""

# Check member list
echo "--- Cluster Members ---"
etcdctl member list \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_CACERT" \
    --cert="$ETCD_CERT" \
    --key="$ETCD_KEY" \
    --write-out=table

echo ""

# Verify consensus — write and read back
echo "--- Consensus Verification ---"
TEST_KEY="/health-check/$(date +%s)"
TEST_VALUE="healthy-$(hostname)"

etcdctl put "$TEST_KEY" "$TEST_VALUE" \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_CACERT" \
    --cert="$ETCD_CERT" \
    --key="$ETCD_KEY" 2>/dev/null

READ_VALUE=$(etcdctl get "$TEST_KEY" --print-value-only \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_CACERT" \
    --cert="$ETCD_CERT" \
    --key="$ETCD_KEY" 2>/dev/null)

if [ "$READ_VALUE" == "$TEST_VALUE" ]; then
    echo "✓ Consensus OK: Write and read-back successful"
else
    echo "✗ CONSENSUS FAILURE: Written '$TEST_VALUE', read '$READ_VALUE'"
    exit 1
fi

# Cleanup test key
etcdctl del "$TEST_KEY" \
    --endpoints="$ETCD_ENDPOINTS" \
    --cacert="$ETCD_CACERT" \
    --cert="$ETCD_CERT" \
    --key="$ETCD_KEY" > /dev/null 2>&1

echo ""
echo "=== Health Check Complete ==="

Module 29: Consistency Models

The Consistency Spectrum

Consistency models define what guarantees a distributed system provides about the order and visibility of operations. They form a spectrum from strongest (easiest to reason about, hardest to implement) to weakest (hardest to reason about, easiest to implement).

Consistency Model Spectrum (Strongest → Weakest)

flowchart LR
    A["Linearizability
(Strongest)
Single global order
Real-time guarantee"] --> B["Sequential
Consistency
Single global order
No real-time"]
    B --> C["Causal
Consistency
Respects causality
Concurrent ops unordered"]
    C --> D["Eventual
Consistency
(Weakest)
Converges eventually
No ordering"]

    style A fill:#132440,color:#fff
    style B fill:#16476A,color:#fff
    style C fill:#3B9797,color:#fff
    style D fill:#7CCBCB,color:#132440

The tradeoff is always: stronger consistency = higher latency and/or lower availability. Linearizability requires coordination on every operation. Eventual consistency allows any node to respond immediately without coordination.

Linearizability (Strong Consistency)

Linearizability is the strongest single-object consistency model. It provides the illusion that there is only one copy of the data, and all operations are atomic and instantaneous:

Every operation appears to take effect at a single point in time between its invocation and response
All operations are totally ordered, consistent with real-time ordering
If operation A completes before operation B starts, then A appears before B in the total order

                            
                            Linearizability Intuition: Imagine a single-threaded server processing requests one at a time. Linearizability guarantees that a distributed system behaves as if it were that single-threaded server, even though requests are actually handled by multiple nodes. Once a write is acknowledged, ALL subsequent reads (from any client, any node) MUST see that write.
                        

Where linearizability is required:

Leader election (if two nodes both think they're leader, data corruption follows)
Distributed locks (a lock must be held by exactly one holder)
Unique constraints (user signup — must not create two accounts with same email)
Financial transactions (account balance must reflect all debits/credits atomically)

Cost of linearizability: Requires coordination (consensus) on every write and potentially on every read. In a geo-distributed system, this means cross-region network round trips on every operation — adding 50-200ms+ latency.

Sequential Consistency

Sequential consistency is slightly weaker than linearizability — it guarantees a single total order of operations that is consistent with each client's individual order, but NOT necessarily with real-time:

All nodes see the same order of operations (total order exists)
Each client's operations appear in the order they were issued
But: operation A can "appear" after operation B even if A finished before B in real-time

The practical difference: with linearizability, if you write a value and then call your friend who reads it, they're guaranteed to see your write. With sequential consistency, they might not — the write might not yet be "visible" in the total order even though it completed in real-time.

Causal Consistency

Causal consistency preserves the ordering of causally related operations while allowing concurrent (unrelated) operations to be seen in any order. Two operations are causally related if one could have influenced the other:

Happens-Before Causal Ordering

sequenceDiagram
    participant Alice
    participant Bob
    participant Carol

    Note over Alice,Carol: Causal chain (must be ordered)

    Alice->>Alice: Write: "Meeting at 3pm" (op1)
    Alice->>Bob: Read by Bob → sees "Meeting at 3pm" (op2)
    Bob->>Bob: Write: "I'll be there!" (op3)

    Note over Alice,Carol: op1 → op2 → op3 is causal chain
All nodes must see op1 before op3

    Note over Carol: Carol must see "Meeting at 3pm"
BEFORE "I'll be there!"
(preserves causality)

    Note over Alice,Carol: Concurrent operations (no causal order)

    Alice->>Alice: Write: "Nice weather today" (op4)
    Carol->>Carol: Write: "Going to lunch" (op5)

    Note over Alice,Carol: op4 and op5 are concurrent
Nodes can see them in ANY order

The "happens-before" relation (→):

If A and B are in the same process and A comes before B, then A → B
If A is a send and B is the corresponding receive, then A → B
Transitivity: if A → B and B → C, then A → C
If neither A → B nor B → A, then A and B are concurrent (A ‖ B)

Why causal consistency matters: It's the strongest consistency model that can be achieved without sacrificing availability during network partitions (it lives on the "AP" side of CAP but provides meaningful ordering guarantees). Systems like MongoDB (with causal consistency sessions) and COPS use it.

Eventual Consistency

Eventual consistency is the weakest useful guarantee: if no new writes occur, all replicas will eventually converge to the same state. It says nothing about how long "eventually" takes or what clients see during convergence.

Guarantee: Given sufficient time without new writes, all reads will return the last written value
No guarantee about: What you see during convergence, the order of intermediate states, or how long convergence takes

Challenges of eventual consistency:

Write conflicts: Two nodes accept conflicting writes simultaneously. Resolution strategies: last-writer-wins (LWW), vector clocks, CRDTs, application-level merge
Read-your-writes violation: You write a value, immediately read, and get the OLD value (your write hasn't propagated yet)
Monotonic reads violation: You read value V2, then read again and get older value V1 (routed to a lagging replica)

Session Guarantees

Session guarantees add practical usability on top of eventual consistency without requiring full strong consistency:

Read-your-writes: After your write completes, your subsequent reads always see that write (or a later one). Other clients may not see it yet, but YOU always do. Implemented by routing reads to the same replica that accepted your write, or by tracking write timestamps.
Monotonic reads: Once you've read a value, subsequent reads never return an older value. You never "go back in time." Implemented by tracking the latest version you've seen and rejecting staler responses.
Monotonic writes: Your writes are applied in the order you issued them. Write A before write B guarantees A is applied first everywhere.
Writes-follow-reads: If you read a value and then write based on it, your write is ordered after the read value. Prevents "writing into the past."

                            
                            Practical Advice: Most applications need at least read-your-writes and monotonic reads to provide a reasonable user experience. Without these, users see confusing behavior (post a comment → it disappears → it reappears). DynamoDB, Cassandra, and MongoDB all offer session-level consistency options to provide these guarantees without the cost of global strong consistency.
                        

Real-World Systems Mapped to Consistency Models

Understanding where real systems fall on the consistency spectrum helps you make informed architecture decisions:

System	Default Consistency	Strongest Available	Notes
Google Spanner	Linearizable	Linearizable	Uses TrueTime (atomic clocks + GPS) for global strong consistency
CockroachDB	Serializable	Serializable	Raft consensus per range; global strong consistency
etcd	Linearizable	Linearizable	Raft consensus; all reads go through leader by default
DynamoDB	Eventual	Strong (per-table)	Configurable per-request; strong reads cost 2x
Cassandra	Eventual	Tunable (QUORUM)	W + R > N achieves strong; per-query consistency level
MongoDB	Eventual (reads from secondary)	Linearizable	readConcern: "linearizable" + writeConcern: "majority"
Redis (cluster)	Eventual	Eventual	Async replication; WAIT command for synchronous ack
S3	Strong (since 2020)	Strong	Read-after-write consistency for all operations

Case Studies

etcd: The Brain of Kubernetes

Case Study CoreOS/CNCF · 2013—Present

etcd: Raft Consensus in Production

etcd is a distributed key-value store that uses Raft consensus to provide strong consistency guarantees. It's the coordination backbone of Kubernetes — storing all cluster state (pods, services, configs, secrets).

Why Kubernetes chose etcd:

Strong consistency required: If two API servers simultaneously try to schedule a pod, exactly one must win. Eventual consistency would cause duplicate scheduling.
Watch mechanism: Controllers watch etcd for changes and reconcile state. This requires linearizable reads — a controller must never miss an update.
Small dataset, high coordination: Kubernetes metadata is small (GBs not TBs) but requires strong coordination guarantees — perfect fit for consensus.

etcd's architecture decisions:

Typically 3 or 5 nodes (odd number avoids split-brain; more nodes = slower consensus)
All writes go through the Raft leader (serial, strongly consistent)
Reads can be serializable (through leader) or stale (from any node) — Kubernetes uses serializable
Snapshotting every 10,000 entries to bound log size and speed recovery
Lease-based TTL for ephemeral keys (used for leader election, service discovery)

Production lessons:

etcd performance is bounded by disk fsync latency (writes must be persisted before acknowledging)
Network latency between etcd members directly impacts write latency (Raft requires majority ack)
Large values (>1.5MB) cause performance degradation — store references, not blobs
Regular defragmentation required as revisions accumulate

etcd Raft Kubernetes Strong Consistency

Google Spanner: Global Strong Consistency

Case Study Google · 2012—Present

Spanner: TrueTime and External Consistency

Google Spanner achieves the seemingly impossible: globally distributed, strongly consistent, highly available database. It violates the "you must choose two of three" intuition from CAP by using a novel approach to time.

The TrueTime innovation:

Every Google data center has atomic clocks and GPS receivers
TrueTime API returns an interval [earliest, latest] instead of a single timestamp
The interval width (uncertainty) is typically 1-7ms
Spanner waits out the uncertainty before committing: if commit timestamp is T, wait until TrueTime says "definitely past T" — this guarantees that if transaction A commits before B starts, A's timestamp < B's timestamp

How Spanner achieves strong consistency globally:

Data split into "splits" (shards), each managed by a Paxos group
Single-split transactions use Paxos within the group (fast)
Cross-split transactions use 2PC (two-phase commit) coordinated by a transaction manager
TrueTime ensures that committed transactions have globally meaningful timestamps
Read-only transactions at a timestamp T can read from any replica with data >= T (no coordination needed for reads!)

The cost: Writes incur the TrueTime wait (1-7ms) plus Paxos replication latency. For global writes, this means ~10-50ms per transaction. But reads (the majority of operations) are lock-free and can be served from local replicas — making reads fast even across continents.

Google Spanner TrueTime Global Consistency Paxos

Conclusion & Next Steps

The key takeaways from this module:

Coordination is fundamentally hard. FLP impossibility, CAP theorem, and the Two Generals Problem prove that perfect distributed coordination is impossible. Every system makes tradeoffs — understand which tradeoffs YOUR system makes.
Raft is the practical choice for consensus. It's understandable, well-proven (etcd, CockroachDB, TiKV), and handles the common case well. Use Raft (via etcd or embedded libraries) unless you have specific reasons not to.
Quorum math is simple but powerful. W + R > N guarantees you'll always read the latest write. Adjust W and R to optimize for your read/write ratio. Most systems default to majority quorums (W = R = ⌈(N+1)/2⌉).
Choose the weakest consistency that's correct. Linearizability is expensive. If your application only needs causal consistency (most social apps), or eventual consistency with session guarantees (most e-commerce), use that — it's faster, cheaper, and more available.
Session guarantees are the practical minimum. Read-your-writes and monotonic reads are needed for basic usability. Without them, users see confusing behavior. Most databases offer these at low cost.
Real systems are configurable. DynamoDB, Cassandra, and MongoDB let you choose consistency per-operation. Use strong consistency for critical paths (payments, leader election) and eventual for the rest (feeds, analytics, caches).
Time is the enemy of distributed systems. Google solved it with atomic clocks (Spanner). Everyone else uses logical clocks, version vectors, or accepts the limitations. If your system spans continents, you must explicitly design for clock skew.

Next in the Series

In Part 15: Transactions, Time & Messaging, we'll build on coordination fundamentals to explore distributed transactions (2PC, Saga pattern), logical and physical time (vector clocks, hybrid clocks), and messaging patterns for reliable asynchronous communication.

Previous Part 13: Chaos & Disaster Recovery Next Part 15: Transactions, Time & Messaging

Cookie Consent

Part 14: Distributed Coordination & Consistency

Table of Contents

Module 28: Distributed Coordination

Why Coordination Is Hard

The Consensus Problem

Raft: Understandable Consensus

Paxos: The Foundational Algorithm

Quorum Systems

Leader Election Patterns

Module 29: Consistency Models

The Consistency Spectrum

Linearizability (Strong Consistency)

Sequential Consistency

Causal Consistency

Eventual Consistency

Session Guarantees

Real-World Systems Mapped to Consistency Models

Case Studies

etcd: The Brain of Kubernetes

etcd: Raft Consensus in Production

Google Spanner: Global Strong Consistency

Spanner: TrueTime and External Consistency

Conclusion & Next Steps

Next in the Series

Cookie Consent

Part 14: Distributed Coordination & Consistency

Table of Contents

Module 28: Distributed Coordination

Why Coordination Is Hard

The Consensus Problem

Raft: Understandable Consensus

Paxos: The Foundational Algorithm

Quorum Systems

Leader Election Patterns

Module 29: Consistency Models

The Consistency Spectrum

Linearizability (Strong Consistency)

Sequential Consistency

Causal Consistency

Eventual Consistency

Session Guarantees

Real-World Systems Mapped to Consistency Models

Case Studies

etcd: The Brain of Kubernetes

etcd: Raft Consensus in Production

Google Spanner: Global Strong Consistency

Spanner: TrueTime and External Consistency

Conclusion & Next Steps

Next in the Series

Related Articles in This Series

Part 13: Chaos Engineering & Disaster Recovery

Part 15: Transactions, Time & Messaging

Part 16: Security Architecture