Module 28: Distributed Coordination
Why Coordination Is Hard
In a single-machine system, coordination is trivial — use a mutex, a semaphore, or a compare-and-swap instruction. In a distributed system, there is no shared memory, no global clock, and no guarantee that messages arrive in order (or at all). This makes seemingly simple questions enormously difficult:
- Who is the leader? (What if two nodes both think they're the leader?)
- What is the current value? (What if different nodes have different values?)
- Has this operation been committed? (What if the node that acknowledged it crashes?)
- What happened first? (With no global clock, "first" has no absolute meaning)
These impossibility results don't mean coordination is impossible — they mean you must make tradeoffs. Every distributed system chooses where to compromise: accept that consensus might be slow (sacrifice liveness), accept that reads might be stale (sacrifice consistency), or accept that the system might be unavailable during partitions (sacrifice availability).
The Consensus Problem
The consensus problem is deceptively simple to state: get a group of nodes to agree on a single value, even if some nodes crash. Formally, consensus requires three properties:
- Agreement: All non-faulty nodes decide on the same value
- Validity: The decided value was proposed by some node (you can't just always return "42")
- Termination: All non-faulty nodes eventually decide (the system makes progress)
Consensus is the foundation for everything else — leader election, atomic broadcast, distributed locks, replicated state machines, and configuration management all reduce to consensus.
Raft: Understandable Consensus
Raft was designed explicitly to be understandable (unlike Paxos). It decomposes consensus into three sub-problems: leader election, log replication, and safety.
sequenceDiagram
participant N1 as Node 1
(Follower)
participant N2 as Node 2
(Candidate)
participant N3 as Node 3
(Follower)
Note over N1,N3: Term 1: Node X was leader (crashed)
Note over N2: Election timeout expires
→ Becomes Candidate
→ Increments term to 2
N2->>N1: RequestVote(term=2, lastLogIndex=5)
N2->>N3: RequestVote(term=2, lastLogIndex=5)
N1-->>N2: VoteGranted(term=2) ✓
N3-->>N2: VoteGranted(term=2) ✓
Note over N2: Received majority (2/3 votes)
→ Becomes Leader (term 2)
N2->>N1: AppendEntries(heartbeat, term=2)
N2->>N3: AppendEntries(heartbeat, term=2)
Note over N1,N3: Followers reset election timer
Leader maintains authority
Raft's key mechanisms:
- Leader election: When a follower's election timer expires (randomized 150-300ms), it becomes a candidate and requests votes. First candidate to receive majority votes wins. Randomized timers prevent split votes.
- Log replication: The leader accepts client requests, appends them to its log, and replicates to followers via AppendEntries RPCs. An entry is "committed" when a majority of nodes have it.
- Safety: A candidate can only win if its log is at least as up-to-date as the majority. This ensures the elected leader always has all committed entries.
Here's a simplified Raft state machine implementation:
"""
Raft Consensus - Simplified State Machine
Demonstrates the core state transitions and leader election logic.
"""
import random
import time
from enum import Enum
from dataclasses import dataclass, field
from typing import Optional
class NodeState(Enum):
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
@dataclass
class LogEntry:
term: int
index: int
command: str
@dataclass
class RaftNode:
"""Simplified Raft node demonstrating core state machine logic."""
node_id: str
cluster_size: int = 5
# Persistent state
current_term: int = 0
voted_for: Optional[str] = None
log: list = field(default_factory=list)
# Volatile state
state: NodeState = NodeState.FOLLOWER
commit_index: int = 0
last_applied: int = 0
# Leader state (reinitialized on election)
next_index: dict = field(default_factory=dict)
match_index: dict = field(default_factory=dict)
# Election state
votes_received: int = 0
election_timeout: float = 0.0
def reset_election_timer(self):
"""Randomized timeout prevents split votes."""
self.election_timeout = time.time() + random.uniform(0.150, 0.300)
@property
def majority(self) -> int:
"""Majority quorum for this cluster size."""
return (self.cluster_size // 2) + 1
def start_election(self):
"""Follower timeout expired → become candidate."""
self.state = NodeState.CANDIDATE
self.current_term += 1
self.voted_for = self.node_id
self.votes_received = 1 # Vote for self
self.reset_election_timer()
print(f"[{self.node_id}] Starting election for term {self.current_term}")
# In real implementation: send RequestVote RPCs to all peers
def handle_vote_response(self, vote_granted: bool, term: int):
"""Process a vote response from a peer."""
if term > self.current_term:
# Discovered higher term → step down
self.current_term = term
self.state = NodeState.FOLLOWER
self.voted_for = None
return
if self.state != NodeState.CANDIDATE:
return # No longer a candidate
if vote_granted:
self.votes_received += 1
if self.votes_received >= self.majority:
self.become_leader()
def become_leader(self):
"""Won election → initialize leader state."""
self.state = NodeState.LEADER
print(f"[{self.node_id}] Elected leader for term {self.current_term}")
# Initialize next_index for each peer to end of log
last_log_index = len(self.log)
# In real implementation: send initial empty AppendEntries (heartbeat)
def handle_request_vote(self, candidate_id: str, term: int,
last_log_index: int, last_log_term: int) -> bool:
"""Decide whether to grant vote to a candidate."""
if term < self.current_term:
return False # Reject stale term
if term > self.current_term:
self.current_term = term
self.state = NodeState.FOLLOWER
self.voted_for = None
# Grant vote if: haven't voted yet AND candidate's log is up-to-date
if self.voted_for is None or self.voted_for == candidate_id:
if self._is_log_up_to_date(last_log_index, last_log_term):
self.voted_for = candidate_id
self.reset_election_timer()
return True
return False
def _is_log_up_to_date(self, last_log_index: int, last_log_term: int) -> bool:
"""Check if candidate's log is at least as up-to-date as ours."""
my_last_term = self.log[-1].term if self.log else 0
my_last_index = len(self.log)
if last_log_term != my_last_term:
return last_log_term > my_last_term
return last_log_index >= my_last_index
def append_entry(self, command: str) -> Optional[LogEntry]:
"""Leader accepts a client command."""
if self.state != NodeState.LEADER:
return None # Only leader accepts writes
entry = LogEntry(
term=self.current_term,
index=len(self.log) + 1,
command=command
)
self.log.append(entry)
print(f"[{self.node_id}] Appended entry: {command} (index={entry.index})")
# In real implementation: replicate to followers via AppendEntries
return entry
# Demonstration
if __name__ == "__main__":
node = RaftNode(node_id="node-1", cluster_size=5)
print(f"Initial state: {node.state.value}, term={node.current_term}")
# Simulate election timeout
node.start_election()
print(f"After election start: {node.state.value}, term={node.current_term}")
# Simulate receiving majority votes
node.handle_vote_response(vote_granted=True, term=1)
node.handle_vote_response(vote_granted=True, term=1)
print(f"After majority: {node.state.value}")
# Leader accepts writes
entry = node.append_entry("SET x = 42")
print(f"Log length: {len(node.log)}, commit_index: {node.commit_index}")
Paxos: The Foundational Algorithm
Paxos (by Leslie Lamport, 1989) is the first proven-correct consensus algorithm. It's notoriously difficult to understand, which is why Raft was created as an alternative. However, understanding Paxos's core insight is valuable:
Paxos roles:
- Proposers: Propose values. Multiple proposers can exist simultaneously (causes dueling proposals that slow but don't break consensus).
- Acceptors: Vote on proposals. A majority must agree for consensus. Each acceptor tracks the highest proposal number it has promised and the highest it has accepted.
- Learners: Learn the chosen value. In practice, all nodes are learners — they need to know what was decided.
Paxos vs. Raft comparison:
- Paxos allows multiple proposers simultaneously (more available, harder to understand)
- Raft restricts to a single leader (simpler, slight availability tradeoff during elections)
- Both tolerate ⌊(n-1)/2⌋ failures in a cluster of n nodes
- Most production systems use Raft (etcd, CockroachDB, TiKV) or Multi-Paxos variants (Chubby, Spanner)
Quorum Systems
A quorum is the minimum number of nodes that must participate in an operation for it to be valid. The fundamental insight of quorum systems is that any two quorums must overlap — this guarantees that a read quorum will always see the most recent write.
flowchart TD
subgraph cluster["5-Node Cluster (N=5)"]
N1["Node 1"]
N2["Node 2"]
N3["Node 3"]
N4["Node 4"]
N5["Node 5"]
end
subgraph write["Write Quorum (W=3)"]
W1["Node 1 ✓"]
W2["Node 2 ✓"]
W3["Node 3 ✓"]
end
subgraph read["Read Quorum (R=3)"]
R1["Node 3 ✓"]
R2["Node 4 ✓"]
R3["Node 5 ✓"]
end
W3 -.-|"Overlap!
Guarantees read
sees latest write"| R1
style write fill:#3B9797,color:#fff
style read fill:#16476A,color:#fff
style cluster fill:#f8f9fa,color:#132440
The quorum formula: W + R > N (where W = write quorum, R = read quorum, N = total nodes)
"""
Quorum Calculator — Explore different quorum configurations
and their tradeoffs for distributed systems.
"""
def analyze_quorum(n: int, w: int, r: int) -> dict:
"""Analyze a quorum configuration for a cluster of size n."""
strong_consistency = (w + r) > n
fault_tolerance_writes = n - w # Can lose this many nodes and still write
fault_tolerance_reads = n - r # Can lose this many nodes and still read
fault_tolerance_both = min(fault_tolerance_writes, fault_tolerance_reads)
return {
"cluster_size": n,
"write_quorum": w,
"read_quorum": r,
"strong_consistency": strong_consistency,
"write_fault_tolerance": fault_tolerance_writes,
"read_fault_tolerance": fault_tolerance_reads,
"overall_fault_tolerance": fault_tolerance_both,
"write_latency": "Higher (wait for W nodes)",
"read_latency": "Higher (wait for R nodes)" if r > 1 else "Low (single node)",
}
# Common configurations
configs = [
# Strong consistency: W + R > N
("Strong Consistency (majority)", 5, 3, 3),
("Write-heavy (fast reads)", 5, 4, 2),
("Read-heavy (fast writes)", 5, 2, 4),
# Eventual consistency: W + R <= N
("Eventual (fast writes, fast reads)", 5, 1, 1),
("Single-leader replication", 3, 1, 3),
# DynamoDB-style tunable
("DynamoDB default", 3, 2, 2),
]
print("=" * 70)
print(f"{'Config':<35} {'N':>2} {'W':>2} {'R':>2} {'Strong?':<8} {'Fault Tol':>9}")
print("=" * 70)
for name, n, w, r in configs:
result = analyze_quorum(n, w, r)
strong = "YES" if result["strong_consistency"] else "NO"
print(f"{name:<35} {n:>2} {w:>2} {r:>2} {strong:<8} {result['overall_fault_tolerance']:>5} nodes")
print("=" * 70)
print("\nKey insight: W + R > N ensures every read sees the latest write.")
print("Tradeoff: Higher W = slower writes but faster reads (and vice versa).")
Leader Election Patterns
Leader election is a specific application of consensus — agreeing on which node is "in charge." Different systems use different approaches:
- Raft election: Randomized timeouts + majority vote. Simple, well-understood, used in etcd/Consul.
- Bully algorithm: Highest-ID node always wins. Simple but slower (multiple rounds of messages). Used in older systems.
- ZooKeeper recipes: Create ephemeral sequential nodes; lowest sequence number is leader. Watches notify on leadership changes. Used in Kafka (pre-KRaft), HBase.
- Lease-based: Leader holds a time-limited lease. Must renew before expiry or another node takes over. Used in Google Chubby, DynamoDB.
An etcd cluster configuration demonstrates Raft-based coordination in practice:
# etcd-cluster.yaml — 3-node etcd cluster for Kubernetes
# Uses Raft consensus for leader election and log replication
apiVersion: v1
kind: ConfigMap
metadata:
name: etcd-config
namespace: kube-system
data:
etcd.conf.yaml: |
name: etcd-node-1
data-dir: /var/lib/etcd
# Cluster configuration
initial-cluster: >-
etcd-node-1=https://etcd-1.etcd.kube-system.svc:2380,
etcd-node-2=https://etcd-2.etcd.kube-system.svc:2380,
etcd-node-3=https://etcd-3.etcd.kube-system.svc:2380
initial-cluster-state: new
initial-cluster-token: k8s-etcd-cluster
# Client communication
listen-client-urls: https://0.0.0.0:2379
advertise-client-urls: https://etcd-1.etcd.kube-system.svc:2379
# Peer communication (Raft replication)
listen-peer-urls: https://0.0.0.0:2380
initial-advertise-peer-urls: https://etcd-1.etcd.kube-system.svc:2380
# Raft tuning
heartbeat-interval: 100 # ms — leader sends heartbeats
election-timeout: 1000 # ms — follower starts election if no heartbeat
# Snapshot and compaction
snapshot-count: 10000 # Snapshot after 10k entries
auto-compaction-mode: periodic
auto-compaction-retention: "1h"
# Security
client-transport-security:
cert-file: /etc/etcd/pki/server.crt
key-file: /etc/etcd/pki/server.key
trusted-ca-file: /etc/etcd/pki/ca.crt
client-cert-auth: true
peer-transport-security:
cert-file: /etc/etcd/pki/peer.crt
key-file: /etc/etcd/pki/peer.key
trusted-ca-file: /etc/etcd/pki/ca.crt
client-cert-auth: true
Verify cluster health with a simple script:
#!/bin/bash
# etcd-health-check.sh — Verify etcd cluster consensus health
# Run periodically to detect split-brain or lagging members
ETCD_ENDPOINTS="https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379"
ETCD_CACERT="/etc/etcd/pki/ca.crt"
ETCD_CERT="/etc/etcd/pki/client.crt"
ETCD_KEY="/etc/etcd/pki/client.key"
echo "=== etcd Cluster Health Check ==="
echo "Timestamp: $(date -u)"
echo ""
# Check endpoint health
echo "--- Endpoint Health ---"
etcdctl endpoint health \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$ETCD_CACERT" \
--cert="$ETCD_CERT" \
--key="$ETCD_KEY" \
--write-out=table
echo ""
# Check endpoint status (shows leader, raft term, raft index)
echo "--- Endpoint Status (Raft State) ---"
etcdctl endpoint status \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$ETCD_CACERT" \
--cert="$ETCD_CERT" \
--key="$ETCD_KEY" \
--write-out=table
echo ""
# Check member list
echo "--- Cluster Members ---"
etcdctl member list \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$ETCD_CACERT" \
--cert="$ETCD_CERT" \
--key="$ETCD_KEY" \
--write-out=table
echo ""
# Verify consensus — write and read back
echo "--- Consensus Verification ---"
TEST_KEY="/health-check/$(date +%s)"
TEST_VALUE="healthy-$(hostname)"
etcdctl put "$TEST_KEY" "$TEST_VALUE" \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$ETCD_CACERT" \
--cert="$ETCD_CERT" \
--key="$ETCD_KEY" 2>/dev/null
READ_VALUE=$(etcdctl get "$TEST_KEY" --print-value-only \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$ETCD_CACERT" \
--cert="$ETCD_CERT" \
--key="$ETCD_KEY" 2>/dev/null)
if [ "$READ_VALUE" == "$TEST_VALUE" ]; then
echo "✓ Consensus OK: Write and read-back successful"
else
echo "✗ CONSENSUS FAILURE: Written '$TEST_VALUE', read '$READ_VALUE'"
exit 1
fi
# Cleanup test key
etcdctl del "$TEST_KEY" \
--endpoints="$ETCD_ENDPOINTS" \
--cacert="$ETCD_CACERT" \
--cert="$ETCD_CERT" \
--key="$ETCD_KEY" > /dev/null 2>&1
echo ""
echo "=== Health Check Complete ==="
Module 29: Consistency Models
The Consistency Spectrum
Consistency models define what guarantees a distributed system provides about the order and visibility of operations. They form a spectrum from strongest (easiest to reason about, hardest to implement) to weakest (hardest to reason about, easiest to implement).
flowchart LR
A["Linearizability
(Strongest)
Single global order
Real-time guarantee"] --> B["Sequential
Consistency
Single global order
No real-time"]
B --> C["Causal
Consistency
Respects causality
Concurrent ops unordered"]
C --> D["Eventual
Consistency
(Weakest)
Converges eventually
No ordering"]
style A fill:#132440,color:#fff
style B fill:#16476A,color:#fff
style C fill:#3B9797,color:#fff
style D fill:#7CCBCB,color:#132440
The tradeoff is always: stronger consistency = higher latency and/or lower availability. Linearizability requires coordination on every operation. Eventual consistency allows any node to respond immediately without coordination.
Linearizability (Strong Consistency)
Linearizability is the strongest single-object consistency model. It provides the illusion that there is only one copy of the data, and all operations are atomic and instantaneous:
- Every operation appears to take effect at a single point in time between its invocation and response
- All operations are totally ordered, consistent with real-time ordering
- If operation A completes before operation B starts, then A appears before B in the total order
Where linearizability is required:
- Leader election (if two nodes both think they're leader, data corruption follows)
- Distributed locks (a lock must be held by exactly one holder)
- Unique constraints (user signup — must not create two accounts with same email)
- Financial transactions (account balance must reflect all debits/credits atomically)
Cost of linearizability: Requires coordination (consensus) on every write and potentially on every read. In a geo-distributed system, this means cross-region network round trips on every operation — adding 50-200ms+ latency.
Sequential Consistency
Sequential consistency is slightly weaker than linearizability — it guarantees a single total order of operations that is consistent with each client's individual order, but NOT necessarily with real-time:
- All nodes see the same order of operations (total order exists)
- Each client's operations appear in the order they were issued
- But: operation A can "appear" after operation B even if A finished before B in real-time
The practical difference: with linearizability, if you write a value and then call your friend who reads it, they're guaranteed to see your write. With sequential consistency, they might not — the write might not yet be "visible" in the total order even though it completed in real-time.
Causal Consistency
Causal consistency preserves the ordering of causally related operations while allowing concurrent (unrelated) operations to be seen in any order. Two operations are causally related if one could have influenced the other:
sequenceDiagram
participant Alice
participant Bob
participant Carol
Note over Alice,Carol: Causal chain (must be ordered)
Alice->>Alice: Write: "Meeting at 3pm" (op1)
Alice->>Bob: Read by Bob → sees "Meeting at 3pm" (op2)
Bob->>Bob: Write: "I'll be there!" (op3)
Note over Alice,Carol: op1 → op2 → op3 is causal chain
All nodes must see op1 before op3
Note over Carol: Carol must see "Meeting at 3pm"
BEFORE "I'll be there!"
(preserves causality)
Note over Alice,Carol: Concurrent operations (no causal order)
Alice->>Alice: Write: "Nice weather today" (op4)
Carol->>Carol: Write: "Going to lunch" (op5)
Note over Alice,Carol: op4 and op5 are concurrent
Nodes can see them in ANY order
The "happens-before" relation (→):
- If A and B are in the same process and A comes before B, then A → B
- If A is a send and B is the corresponding receive, then A → B
- Transitivity: if A → B and B → C, then A → C
- If neither A → B nor B → A, then A and B are concurrent (A ‖ B)
Why causal consistency matters: It's the strongest consistency model that can be achieved without sacrificing availability during network partitions (it lives on the "AP" side of CAP but provides meaningful ordering guarantees). Systems like MongoDB (with causal consistency sessions) and COPS use it.
Eventual Consistency
Eventual consistency is the weakest useful guarantee: if no new writes occur, all replicas will eventually converge to the same state. It says nothing about how long "eventually" takes or what clients see during convergence.
- Guarantee: Given sufficient time without new writes, all reads will return the last written value
- No guarantee about: What you see during convergence, the order of intermediate states, or how long convergence takes
Challenges of eventual consistency:
- Write conflicts: Two nodes accept conflicting writes simultaneously. Resolution strategies: last-writer-wins (LWW), vector clocks, CRDTs, application-level merge
- Read-your-writes violation: You write a value, immediately read, and get the OLD value (your write hasn't propagated yet)
- Monotonic reads violation: You read value V2, then read again and get older value V1 (routed to a lagging replica)
Session Guarantees
Session guarantees add practical usability on top of eventual consistency without requiring full strong consistency:
- Read-your-writes: After your write completes, your subsequent reads always see that write (or a later one). Other clients may not see it yet, but YOU always do. Implemented by routing reads to the same replica that accepted your write, or by tracking write timestamps.
- Monotonic reads: Once you've read a value, subsequent reads never return an older value. You never "go back in time." Implemented by tracking the latest version you've seen and rejecting staler responses.
- Monotonic writes: Your writes are applied in the order you issued them. Write A before write B guarantees A is applied first everywhere.
- Writes-follow-reads: If you read a value and then write based on it, your write is ordered after the read value. Prevents "writing into the past."
Real-World Systems Mapped to Consistency Models
Understanding where real systems fall on the consistency spectrum helps you make informed architecture decisions:
| System | Default Consistency | Strongest Available | Notes |
|---|---|---|---|
| Google Spanner | Linearizable | Linearizable | Uses TrueTime (atomic clocks + GPS) for global strong consistency |
| CockroachDB | Serializable | Serializable | Raft consensus per range; global strong consistency |
| etcd | Linearizable | Linearizable | Raft consensus; all reads go through leader by default |
| DynamoDB | Eventual | Strong (per-table) | Configurable per-request; strong reads cost 2x |
| Cassandra | Eventual | Tunable (QUORUM) | W + R > N achieves strong; per-query consistency level |
| MongoDB | Eventual (reads from secondary) | Linearizable | readConcern: "linearizable" + writeConcern: "majority" |
| Redis (cluster) | Eventual | Eventual | Async replication; WAIT command for synchronous ack |
| S3 | Strong (since 2020) | Strong | Read-after-write consistency for all operations |
Case Studies
etcd: The Brain of Kubernetes
etcd: Raft Consensus in Production
etcd is a distributed key-value store that uses Raft consensus to provide strong consistency guarantees. It's the coordination backbone of Kubernetes — storing all cluster state (pods, services, configs, secrets).
Why Kubernetes chose etcd:
- Strong consistency required: If two API servers simultaneously try to schedule a pod, exactly one must win. Eventual consistency would cause duplicate scheduling.
- Watch mechanism: Controllers watch etcd for changes and reconcile state. This requires linearizable reads — a controller must never miss an update.
- Small dataset, high coordination: Kubernetes metadata is small (GBs not TBs) but requires strong coordination guarantees — perfect fit for consensus.
etcd's architecture decisions:
- Typically 3 or 5 nodes (odd number avoids split-brain; more nodes = slower consensus)
- All writes go through the Raft leader (serial, strongly consistent)
- Reads can be serializable (through leader) or stale (from any node) — Kubernetes uses serializable
- Snapshotting every 10,000 entries to bound log size and speed recovery
- Lease-based TTL for ephemeral keys (used for leader election, service discovery)
Production lessons:
- etcd performance is bounded by disk fsync latency (writes must be persisted before acknowledging)
- Network latency between etcd members directly impacts write latency (Raft requires majority ack)
- Large values (>1.5MB) cause performance degradation — store references, not blobs
- Regular defragmentation required as revisions accumulate
Google Spanner: Global Strong Consistency
Spanner: TrueTime and External Consistency
Google Spanner achieves the seemingly impossible: globally distributed, strongly consistent, highly available database. It violates the "you must choose two of three" intuition from CAP by using a novel approach to time.
The TrueTime innovation:
- Every Google data center has atomic clocks and GPS receivers
- TrueTime API returns an interval [earliest, latest] instead of a single timestamp
- The interval width (uncertainty) is typically 1-7ms
- Spanner waits out the uncertainty before committing: if commit timestamp is T, wait until TrueTime says "definitely past T" — this guarantees that if transaction A commits before B starts, A's timestamp < B's timestamp
How Spanner achieves strong consistency globally:
- Data split into "splits" (shards), each managed by a Paxos group
- Single-split transactions use Paxos within the group (fast)
- Cross-split transactions use 2PC (two-phase commit) coordinated by a transaction manager
- TrueTime ensures that committed transactions have globally meaningful timestamps
- Read-only transactions at a timestamp T can read from any replica with data >= T (no coordination needed for reads!)
The cost: Writes incur the TrueTime wait (1-7ms) plus Paxos replication latency. For global writes, this means ~10-50ms per transaction. But reads (the majority of operations) are lock-free and can be served from local replicas — making reads fast even across continents.
Conclusion & Next Steps
The key takeaways from this module:
- Coordination is fundamentally hard. FLP impossibility, CAP theorem, and the Two Generals Problem prove that perfect distributed coordination is impossible. Every system makes tradeoffs — understand which tradeoffs YOUR system makes.
- Raft is the practical choice for consensus. It's understandable, well-proven (etcd, CockroachDB, TiKV), and handles the common case well. Use Raft (via etcd or embedded libraries) unless you have specific reasons not to.
- Quorum math is simple but powerful. W + R > N guarantees you'll always read the latest write. Adjust W and R to optimize for your read/write ratio. Most systems default to majority quorums (W = R = ⌈(N+1)/2⌉).
- Choose the weakest consistency that's correct. Linearizability is expensive. If your application only needs causal consistency (most social apps), or eventual consistency with session guarantees (most e-commerce), use that — it's faster, cheaper, and more available.
- Session guarantees are the practical minimum. Read-your-writes and monotonic reads are needed for basic usability. Without them, users see confusing behavior. Most databases offer these at low cost.
- Real systems are configurable. DynamoDB, Cassandra, and MongoDB let you choose consistency per-operation. Use strong consistency for critical paths (payments, leader election) and eventual for the rest (feeds, analytics, caches).
- Time is the enemy of distributed systems. Google solved it with atomic clocks (Spanner). Everyone else uses logical clocks, version vectors, or accepts the limitations. If your system spans continents, you must explicitly design for clock skew.
Next in the Series
In Part 15: Transactions, Time & Messaging, we'll build on coordination fundamentals to explore distributed transactions (2PC, Saga pattern), logical and physical time (vector clocks, hybrid clocks), and messaging patterns for reliable asynchronous communication.