Module 30: Distributed Transactions
Two-Phase Commit (2PC)
In a monolithic system, a single database transaction guarantees ACID properties — atomicity, consistency, isolation, durability. When data spans multiple services or databases, we need a protocol to coordinate atomic operations across participants. Two-Phase Commit (2PC) is the classical solution.
Phase 1 — Prepare (Voting):
- Coordinator sends
PREPAREto all participants - Each participant executes the transaction locally (but does NOT commit)
- Each participant writes to its WAL (Write-Ahead Log) for durability
- Each participant responds with
VOTE_COMMITorVOTE_ABORT
Phase 2 — Commit (Decision):
- If ALL votes are
VOTE_COMMIT→ coordinator sendsGLOBAL_COMMIT - If ANY vote is
VOTE_ABORT→ coordinator sendsGLOBAL_ABORT - Participants execute commit or rollback and acknowledge
- Coordinator marks transaction complete after all acknowledgements
sequenceDiagram
participant C as Coordinator
participant P1 as Participant A
participant P2 as Participant B
participant P3 as Participant C
Note over C,P3: Phase 1: Prepare
C->>P1: PREPARE
C->>P2: PREPARE
C->>P3: PREPARE
P1->>C: VOTE_COMMIT
P2->>C: VOTE_COMMIT
P3->>C: VOTE_COMMIT
Note over C,P3: Phase 2: Commit
C->>P1: GLOBAL_COMMIT
C->>P2: GLOBAL_COMMIT
C->>P3: GLOBAL_COMMIT
P1->>C: ACK
P2->>C: ACK
P3->>C: ACK
Note over C: Transaction Complete
The Blocking Problem
2PC has a fundamental flaw: it is a blocking protocol. If the coordinator crashes after sending PREPARE but before sending the commit/abort decision, all participants are stuck in an uncertain state — they've voted but don't know the outcome. They cannot commit (another participant might have voted abort), and they cannot abort (the coordinator might have decided to commit).
Why 2PC is rarely used in modern microservices:
- Performance: Synchronous protocol — all participants must respond before progress
- Availability: Single coordinator is a SPOF; coordinator failure blocks everyone
- Latency: Two network round-trips minimum; locks held throughout
- Coupling: All participants must be available simultaneously
- Heterogeneity: All participants must support the same 2PC protocol (XA)
Saga Patterns
A saga is a sequence of local transactions where each step has a corresponding compensating transaction that undoes its effect. Instead of one atomic distributed transaction, you execute a chain of local ACID transactions. If any step fails, previously completed steps are undone by their compensations — in reverse order.
There are two orchestration approaches for sagas:
1. Choreography (Event-Driven): Each service listens for events and publishes its own events. No central coordinator. Services react to events independently.
2. Orchestration (Command-Driven): A central orchestrator tells each service what to do and manages the saga state. The orchestrator knows the complete workflow and handles failures.
flowchart LR
subgraph Choreography
direction LR
A1[Order Service] -->|OrderCreated| B1[Payment Service]
B1 -->|PaymentCharged| C1[Inventory Service]
C1 -->|ItemsReserved| D1[Shipping Service]
D1 -->|ShipmentScheduled| A1
end
subgraph Orchestration
direction LR
O[Saga Orchestrator] -->|ChargePayment| B2[Payment Service]
B2 -->|Success| O
O -->|ReserveItems| C2[Inventory Service]
C2 -->|Success| O
O -->|ScheduleShip| D2[Shipping Service]
D2 -->|Success| O
end
Choreography trade-offs:
- ✅ Loose coupling — services don't know about each other
- ✅ Simple for small sagas (3-4 steps)
- ❌ Hard to understand the overall flow (distributed logic)
- ❌ Difficult to add new steps or change ordering
- ❌ Cyclic dependencies can emerge
Orchestration trade-offs:
- ✅ Centralized workflow logic — easy to understand and modify
- ✅ Easier error handling and compensation
- ✅ Better for complex sagas (5+ steps)
- ❌ Orchestrator becomes a single point of coupling
- ❌ Risk of becoming a "god service" with too much logic
Compensating Transactions
The key to saga reliability is well-designed compensating transactions. A compensating transaction semantically undoes the effect of a completed step — it doesn't literally "undo" (you can't un-send an email), but restores the system to an equivalent consistent state.
Rules for compensating transactions:
- Must be idempotent: Compensation might be triggered multiple times (retries)
- Must be commutative: Order shouldn't matter if multiple compensations run
- Cannot fail permanently: If compensation fails, it must be retried until success
- Must be semantically reversible: The domain must support the reversal (refunds, cancellations)
import enum
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Awaitable
class SagaStepStatus(enum.Enum):
PENDING = "pending"
COMPLETED = "completed"
COMPENSATED = "compensated"
FAILED = "failed"
@dataclass
class SagaStep:
name: str
action: Callable[..., Awaitable[None]]
compensation: Callable[..., Awaitable[None]]
status: SagaStepStatus = SagaStepStatus.PENDING
@dataclass
class SagaOrchestrator:
"""Executes saga steps in order; compensates on failure."""
saga_id: str
steps: list[SagaStep] = field(default_factory=list)
completed: list[SagaStep] = field(default_factory=list)
async def execute(self, context: dict) -> bool:
for step in self.steps:
try:
print(f"[{self.saga_id}] Executing: {step.name}")
await step.action(context)
step.status = SagaStepStatus.COMPLETED
self.completed.append(step)
except Exception as e:
print(f"[{self.saga_id}] FAILED at: {step.name} — {e}")
step.status = SagaStepStatus.FAILED
await self._compensate(context)
return False
print(f"[{self.saga_id}] Saga completed successfully")
return True
async def _compensate(self, context: dict) -> None:
"""Compensate in reverse order."""
for step in reversed(self.completed):
try:
print(f"[{self.saga_id}] Compensating: {step.name}")
await step.compensation(context)
step.status = SagaStepStatus.COMPENSATED
except Exception as e:
# Must retry — compensations cannot fail permanently
print(f"[{self.saga_id}] Compensation failed: {step.name} — retrying")
await asyncio.sleep(1)
await step.compensation(context)
step.status = SagaStepStatus.COMPENSATED
# Example usage: Order processing saga
async def charge_payment(ctx):
print(f" Charging ${ctx['amount']} to card {ctx['card'][-4:]}")
async def refund_payment(ctx):
print(f" Refunding ${ctx['amount']} to card {ctx['card'][-4:]}")
async def reserve_inventory(ctx):
print(f" Reserving {ctx['items']} items")
async def release_inventory(ctx):
print(f" Releasing {ctx['items']} items back to stock")
async def schedule_shipping(ctx):
raise RuntimeError("Shipping service unavailable")
async def cancel_shipping(ctx):
print(" Cancelling shipment")
async def main():
saga = SagaOrchestrator(
saga_id="order-12345",
steps=[
SagaStep("charge_payment", charge_payment, refund_payment),
SagaStep("reserve_inventory", reserve_inventory, release_inventory),
SagaStep("schedule_shipping", schedule_shipping, cancel_shipping),
]
)
context = {"amount": 99.99, "card": "4111111111111111", "items": 3}
success = await saga.execute(context)
print(f"Order result: {'Success' if success else 'Rolled back'}")
asyncio.run(main())
Module 31: Time in Distributed Systems
Clock Drift & NTP Limitations
Every computer has a quartz oscillator that ticks at a slightly different rate. Over time, clocks drift — they diverge from true time and from each other. NTP (Network Time Protocol) periodically resynchronizes clocks, but has fundamental accuracy limits.
Clock anomalies that break systems:
- Clock skew: Two machines disagree on the current time
- Clock jump: NTP correction causes time to leap forward (or backward with
slewmode) - Leap seconds: UTC inserts a 61st second — some systems crash, others repeat a second
- Monotonicity violation:
gettimeofday()can return earlier values after NTP adjustments
Lamport Timestamps
Leslie Lamport's 1978 paper established that we don't need physical time to order events — we need logical time. A Lamport timestamp is a simple integer counter that provides a partial ordering of events in a distributed system.
max(local, received) + 1. This guarantees: if event A happened-before event B, then L(A) < L(B). But the converse is NOT true — L(A) < L(B) does NOT mean A happened before B (concurrent events can have any order).
class LamportClock:
"""Lamport logical clock for event ordering."""
def __init__(self, node_id: str):
self.node_id = node_id
self.time = 0
def tick(self) -> int:
"""Local event: increment clock."""
self.time += 1
return self.time
def send(self) -> tuple[int, str]:
"""Send event: increment and return timestamp."""
self.time += 1
return (self.time, self.node_id)
def receive(self, sender_time: int) -> int:
"""Receive event: merge clocks and increment."""
self.time = max(self.time, sender_time) + 1
return self.time
# Demonstrate causal ordering
node_a = LamportClock("A")
node_b = LamportClock("B")
# Node A does local work
node_a.tick() # A.time = 1
node_a.tick() # A.time = 2
# Node A sends message to B
msg_time, sender = node_a.send() # A.time = 3
print(f"A sends at t={msg_time}")
# Node B does local work independently
node_b.tick() # B.time = 1
# Node B receives A's message
node_b.receive(msg_time) # B.time = max(1, 3) + 1 = 4
print(f"B receives, updates to t={node_b.time}")
# Guaranteed: send(3) happened-before receive(4)
# But B's tick(1) and A's tick(2) are CONCURRENT — no ordering possible
print(f"A clock: {node_a.time}, B clock: {node_b.time}")
Vector Clocks
Lamport clocks can't detect concurrency — they give a total order where concurrent events get arbitrary ordering. Vector clocks solve this by maintaining a vector of counters (one per node), enabling detection of causal ordering AND concurrency.
[t₁, t₂, ..., tₙ] of size N (number of nodes). Entry i represents the latest known time of node i. Comparison rules: V(A) < V(B) iff every entry in A ≤ corresponding entry in B, and at least one is strictly less. If neither V(A) < V(B) nor V(B) < V(A), the events are concurrent — a conflict that must be resolved.
flowchart TD
A1["Node A: [1,0,0]"] --> A2["Node A: [2,0,0]"]
A2 -->|"send"| B2["Node B: [2,1,0]"]
B1["Node B: [0,1,0]"] --> B2
B2 --> B3["Node B: [2,2,0]"]
A2 --> A3["Node A: [3,0,0]"]
C1["Node C: [0,0,1]"] --> C2["Node C: [0,0,2]"]
A3 --> CONFLICT{"A3=[3,0,0] vs C2=[0,0,2]
CONCURRENT — conflict!"}
C2 --> CONFLICT
B3 --> CAUSAL{"B3=[2,2,0] > A1=[1,0,0]
B3 causally after A1"}
Hybrid Logical Clocks (HLC)
Hybrid Logical Clocks combine the best of physical and logical time. An HLC timestamp has two parts: a physical component (wall clock, bounded by NTP uncertainty) and a logical component (for ordering events within the same physical time). This gives you causal ordering while keeping timestamps close to real time.
Google TrueTime takes a different approach — it uses atomic clocks and GPS receivers to bound clock uncertainty. TrueTime returns an interval [earliest, latest] instead of a point in time. Spanner waits until the uncertainty window passes before committing, guaranteeing linearizability despite clock drift. This requires specialized hardware that most systems don't have.
CockroachDB's HLC provides a practical alternative:
- Physical component: wall clock, bounded by
max_offset(default 500ms) - Logical component: counter for events at the same physical timestamp
- If physical time moves forward, reset logical counter to 0
- If two events have same physical time, logical counter provides ordering
- Transactions that might violate ordering trigger "uncertainty restarts"
Module 32: Distributed Messaging Systems
Delivery Guarantees
Message delivery in distributed systems has three semantic levels, each with different trade-offs between reliability, performance, and complexity:
At-Most-Once (Fire-and-Forget):
- Message is sent once with no confirmation or retry
- Message may be lost (network failure, consumer crash)
- Zero duplicate messages — simplest semantics
- Use case: metrics, logging, non-critical notifications
- Implementation: UDP, async publish without ACK
At-Least-Once (ACK + Retry):
- Producer retries until it receives acknowledgement
- Guarantees delivery but may produce duplicates
- Consumer must be idempotent — processing same message twice must be safe
- Use case: most business transactions, event sourcing
- Implementation: Kafka with
acks=all, RabbitMQ with publisher confirms
Exactly-Once (Hardest Problem):
- Each message is processed exactly one time — no loss, no duplicates
- Technically impossible across network boundaries (Two Generals Problem)
- Achieved in practice through idempotent processing + deduplication
- Use case: financial transactions, inventory updates, billing
- Implementation: Kafka transactions, transactional outbox + deduplication table
Exactly-Once Semantics in Practice
Kafka achieves exactly-once semantics (EOS) through the combination of an idempotent producer (assigns sequence numbers to detect duplicates at the broker) and transactional consumers (atomic read-process-write within a single Kafka transaction).
# Kafka producer configuration for exactly-once semantics
producer:
bootstrap.servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
# Idempotent producer: broker deduplicates by producer ID + sequence number
enable.idempotence: true
acks: all # Wait for ALL in-sync replicas
retries: 2147483647 # Infinite retries (bounded by timeout)
max.in.flight.requests.per.connection: 5 # Safe with idempotence
# Transactional producer: atomic multi-partition writes
transactional.id: "order-processor-1"
# Reliability settings
delivery.timeout.ms: 120000 # 2 minutes total delivery timeout
request.timeout.ms: 30000 # Individual request timeout
consumer:
bootstrap.servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
group.id: "order-processors"
# Read-committed: only see messages from committed transactions
isolation.level: read_committed
# Manual offset management (committed atomically with processing)
enable.auto.commit: false
# Start from earliest if no committed offset exists
auto.offset.reset: earliest
Transactional Outbox Pattern
The transactional outbox pattern solves the dual-write problem: how do you atomically update a database AND publish an event? If you update the DB first and the event publish fails, the event is lost. If you publish first and the DB write fails, you sent a lie.
The solution: write the event to an "outbox" table in the same database transaction as the business data. A separate process (relay/poller or CDC connector) reads the outbox and publishes to the message broker. The database transaction guarantees atomicity between the data change and the event record.
flowchart LR
subgraph "Same DB Transaction"
A[Service] -->|"1. Write data"| B[(Orders Table)]
A -->|"2. Write event"| C[(Outbox Table)]
end
C -->|"3. Poll/CDC"| D[Relay Process]
D -->|"4. Publish"| E[Message Broker]
E -->|"5. Deliver"| F[Consumer]
D -->|"6. Mark sent"| C
#!/bin/bash
# Monitor Kafka consumer group lag — detect if consumers are falling behind
BOOTSTRAP="kafka-1:9092"
GROUP="order-processors"
TOPIC="orders"
echo "=== Consumer Group Lag Report ==="
echo "Group: $GROUP | Topic: $TOPIC"
echo "================================="
# Get consumer group lag using kafka-consumer-groups CLI
kafka-consumer-groups.sh \
--bootstrap-server "$BOOTSTRAP" \
--group "$GROUP" \
--describe 2>/dev/null | \
awk 'NR>1 {
partition=$3; current=$4; end=$5; lag=$6;
total_lag += lag;
printf " Partition %s: offset=%s end=%s lag=%s\n", partition, current, end, lag
}
END { printf "\n TOTAL LAG: %d messages\n", total_lag }'
# Alert if lag exceeds threshold
THRESHOLD=10000
TOTAL_LAG=$(kafka-consumer-groups.sh \
--bootstrap-server "$BOOTSTRAP" \
--group "$GROUP" \
--describe 2>/dev/null | \
awk 'NR>1 {sum+=$6} END {print sum}')
if [ "$TOTAL_LAG" -gt "$THRESHOLD" ]; then
echo "⚠️ WARNING: Consumer lag ($TOTAL_LAG) exceeds threshold ($THRESHOLD)"
echo " Possible causes: slow consumer, consumer crash, rebalancing"
fi
Case Studies
Uber: Saga-Based Payment Processing
Uber's trip payment involves multiple services: ride pricing, payment authorization, driver payout, promotions, and receipts. They use an orchestrated saga with their internal "Cadence" workflow engine (now open-sourced as Temporal).
The saga flow:
- Calculate fare — pricing service determines final amount
- Authorize payment — charge rider's payment method (hold, not capture)
- Apply promotions — deduct credits, promo codes
- Capture payment — finalize the charge
- Pay driver — credit driver's account
- Generate receipt — create and send receipt
Compensation chain: If step 4 (capture) fails after step 2 (auth), the compensation releases the authorization hold. If step 5 (driver pay) fails, the compensation reverses the capture and refunds the rider. Each compensation is idempotent and can be retried.
Key design decisions:
- Orchestrator persists saga state to survive crashes (Temporal durability)
- Each step has configurable retry policies and timeouts
- Compensations run in reverse order with at-least-once delivery
- Idempotency keys prevent duplicate charges during retries
CockroachDB: Hybrid Logical Clocks
CockroachDB provides serializable isolation across a distributed SQL database without specialized hardware (unlike Google Spanner's atomic clocks). It uses Hybrid Logical Clocks (HLC) to order transactions.
How HLC works in CockroachDB:
- Each timestamp is
(wall_time, logical_counter) wall_timeis synchronized via NTP (bounded by--max-offset, default 500ms)logical_counterdifferentiates events at the same wall time- Causally-related events always get ordered timestamps (Lamport property)
Uncertainty interval: When a transaction reads a key, it might encounter a value written by a concurrent transaction on a different node. If that value's timestamp falls within the reader's uncertainty window [read_ts, read_ts + max_offset], CockroachDB triggers an "uncertainty restart" — it bumps the transaction's timestamp above the uncertain value and retries.
Trade-off: Tighter max_offset (better NTP) = fewer uncertainty restarts = better performance. CockroachDB recommends nodes within 500ms clock offset, achievable on any modern cloud with NTP.
Conclusion & Next Steps
The key takeaways:
- Avoid distributed transactions when possible. 2PC is blocking and fragile. Prefer sagas with compensating transactions for cross-service consistency — accept eventual consistency in exchange for availability.
- Choreography for simple flows, orchestration for complex. 3-4 step sagas work well with events. 5+ steps need a central orchestrator to manage complexity and make the flow visible.
- Compensating transactions must be idempotent and cannot fail permanently. Design every saga step with its undo operation from day one. Test compensation paths as rigorously as the happy path.
- Never trust wall clocks for ordering. Physical clocks drift, jump, and disagree. Use logical clocks (Lamport for total order, vector clocks for causal + conflict detection) or hybrid clocks (HLC for real-time approximation with causal guarantees).
- Exactly-once is idempotent at-least-once. True exactly-once across network boundaries is impossible. Achieve it in practice with idempotent consumers + deduplication (idempotency keys, dedup tables, Kafka transactions).
- The outbox pattern solves dual-write. Never update a database and publish an event in separate transactions. Write both atomically to the DB, then relay events asynchronously. This is the foundation of reliable event-driven architecture.
- Monitor consumer lag. In at-least-once systems, growing lag indicates consumers can't keep up — leading to stale data, unbounded memory growth, and eventually message loss if retention expires.
Next in the Series
In Part 16: Telemetry & Performance Modeling, we'll master the three pillars of observability — metrics, traces, and logs — plus performance modeling with USE/RED methods, capacity planning, and SLI/SLO/SLA reliability frameworks.