Transactions, Time & Messaging — Systems Thinking & Architecture Mastery Part 15

Module 30: Distributed Transactions

Two-Phase Commit (2PC)

In a monolithic system, a single database transaction guarantees ACID properties — atomicity, consistency, isolation, durability. When data spans multiple services or databases, we need a protocol to coordinate atomic operations across participants. Two-Phase Commit (2PC) is the classical solution.

                            
                            2PC Protocol: A coordinator node orchestrates the transaction across multiple participant nodes in two phases — a prepare phase where participants vote, and a commit phase where the coordinator makes the final decision. All participants must agree, or the entire transaction is aborted.
                        

Phase 1 — Prepare (Voting):

Coordinator sends PREPARE to all participants
Each participant executes the transaction locally (but does NOT commit)
Each participant writes to its WAL (Write-Ahead Log) for durability
Each participant responds with VOTE_COMMIT or VOTE_ABORT

Phase 2 — Commit (Decision):

If ALL votes are VOTE_COMMIT → coordinator sends GLOBAL_COMMIT
If ANY vote is VOTE_ABORT → coordinator sends GLOBAL_ABORT
Participants execute commit or rollback and acknowledge
Coordinator marks transaction complete after all acknowledgements

Two-Phase Commit Protocol

sequenceDiagram
    participant C as Coordinator
    participant P1 as Participant A
    participant P2 as Participant B
    participant P3 as Participant C

    Note over C,P3: Phase 1: Prepare
    C->>P1: PREPARE
    C->>P2: PREPARE
    C->>P3: PREPARE
    P1->>C: VOTE_COMMIT
    P2->>C: VOTE_COMMIT
    P3->>C: VOTE_COMMIT

    Note over C,P3: Phase 2: Commit
    C->>P1: GLOBAL_COMMIT
    C->>P2: GLOBAL_COMMIT
    C->>P3: GLOBAL_COMMIT
    P1->>C: ACK
    P2->>C: ACK
    P3->>C: ACK
    Note over C: Transaction Complete

The Blocking Problem

2PC has a fundamental flaw: it is a blocking protocol. If the coordinator crashes after sending PREPARE but before sending the commit/abort decision, all participants are stuck in an uncertain state — they've voted but don't know the outcome. They cannot commit (another participant might have voted abort), and they cannot abort (the coordinator might have decided to commit).

                            
                            Coordinator Failure Scenarios: If the coordinator fails after Phase 1, participants hold locks indefinitely while waiting for the decision. This creates a blocking state that can cascade into system-wide deadlocks. Three-phase commit (3PC) adds a "pre-commit" phase to reduce blocking, but at the cost of additional round trips and still cannot handle network partitions.
                        

Why 2PC is rarely used in modern microservices:

Performance: Synchronous protocol — all participants must respond before progress
Availability: Single coordinator is a SPOF; coordinator failure blocks everyone
Latency: Two network round-trips minimum; locks held throughout
Coupling: All participants must be available simultaneously
Heterogeneity: All participants must support the same 2PC protocol (XA)

Saga Patterns

A saga is a sequence of local transactions where each step has a corresponding compensating transaction that undoes its effect. Instead of one atomic distributed transaction, you execute a chain of local ACID transactions. If any step fails, previously completed steps are undone by their compensations — in reverse order.

                            
                            Saga vs 2PC: 2PC provides atomicity — all or nothing. Sagas provide eventual consistency — the system will eventually reach a consistent state, either by completing all steps or by compensating all completed steps. The trade-off: sagas sacrifice isolation (intermediate states are visible) for availability and performance.
                        

There are two orchestration approaches for sagas:

1. Choreography (Event-Driven): Each service listens for events and publishes its own events. No central coordinator. Services react to events independently.

2. Orchestration (Command-Driven): A central orchestrator tells each service what to do and manages the saga state. The orchestrator knows the complete workflow and handles failures.

Saga: Choreography vs Orchestration

flowchart LR
    subgraph Choreography
        direction LR
        A1[Order Service] -->|OrderCreated| B1[Payment Service]
        B1 -->|PaymentCharged| C1[Inventory Service]
        C1 -->|ItemsReserved| D1[Shipping Service]
        D1 -->|ShipmentScheduled| A1
    end

    subgraph Orchestration
        direction LR
        O[Saga Orchestrator] -->|ChargePayment| B2[Payment Service]
        B2 -->|Success| O
        O -->|ReserveItems| C2[Inventory Service]
        C2 -->|Success| O
        O -->|ScheduleShip| D2[Shipping Service]
        D2 -->|Success| O
    end

Choreography trade-offs:

✅ Loose coupling — services don't know about each other
✅ Simple for small sagas (3-4 steps)
❌ Hard to understand the overall flow (distributed logic)
❌ Difficult to add new steps or change ordering
❌ Cyclic dependencies can emerge

Orchestration trade-offs:

✅ Centralized workflow logic — easy to understand and modify
✅ Easier error handling and compensation
✅ Better for complex sagas (5+ steps)
❌ Orchestrator becomes a single point of coupling
❌ Risk of becoming a "god service" with too much logic

Compensating Transactions

The key to saga reliability is well-designed compensating transactions. A compensating transaction semantically undoes the effect of a completed step — it doesn't literally "undo" (you can't un-send an email), but restores the system to an equivalent consistent state.

Rules for compensating transactions:

Must be idempotent: Compensation might be triggered multiple times (retries)
Must be commutative: Order shouldn't matter if multiple compensations run
Cannot fail permanently: If compensation fails, it must be retried until success
Must be semantically reversible: The domain must support the reversal (refunds, cancellations)

import enum
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Awaitable


class SagaStepStatus(enum.Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    COMPENSATED = "compensated"
    FAILED = "failed"


@dataclass
class SagaStep:
    name: str
    action: Callable[..., Awaitable[None]]
    compensation: Callable[..., Awaitable[None]]
    status: SagaStepStatus = SagaStepStatus.PENDING


@dataclass
class SagaOrchestrator:
    """Executes saga steps in order; compensates on failure."""
    saga_id: str
    steps: list[SagaStep] = field(default_factory=list)
    completed: list[SagaStep] = field(default_factory=list)

    async def execute(self, context: dict) -> bool:
        for step in self.steps:
            try:
                print(f"[{self.saga_id}] Executing: {step.name}")
                await step.action(context)
                step.status = SagaStepStatus.COMPLETED
                self.completed.append(step)
            except Exception as e:
                print(f"[{self.saga_id}] FAILED at: {step.name} — {e}")
                step.status = SagaStepStatus.FAILED
                await self._compensate(context)
                return False
        print(f"[{self.saga_id}] Saga completed successfully")
        return True

    async def _compensate(self, context: dict) -> None:
        """Compensate in reverse order."""
        for step in reversed(self.completed):
            try:
                print(f"[{self.saga_id}] Compensating: {step.name}")
                await step.compensation(context)
                step.status = SagaStepStatus.COMPENSATED
            except Exception as e:
                # Must retry — compensations cannot fail permanently
                print(f"[{self.saga_id}] Compensation failed: {step.name} — retrying")
                await asyncio.sleep(1)
                await step.compensation(context)
                step.status = SagaStepStatus.COMPENSATED


# Example usage: Order processing saga
async def charge_payment(ctx):
    print(f"  Charging ${ctx['amount']} to card {ctx['card'][-4:]}")

async def refund_payment(ctx):
    print(f"  Refunding ${ctx['amount']} to card {ctx['card'][-4:]}")

async def reserve_inventory(ctx):
    print(f"  Reserving {ctx['items']} items")

async def release_inventory(ctx):
    print(f"  Releasing {ctx['items']} items back to stock")

async def schedule_shipping(ctx):
    raise RuntimeError("Shipping service unavailable")

async def cancel_shipping(ctx):
    print("  Cancelling shipment")


async def main():
    saga = SagaOrchestrator(
        saga_id="order-12345",
        steps=[
            SagaStep("charge_payment", charge_payment, refund_payment),
            SagaStep("reserve_inventory", reserve_inventory, release_inventory),
            SagaStep("schedule_shipping", schedule_shipping, cancel_shipping),
        ]
    )
    context = {"amount": 99.99, "card": "4111111111111111", "items": 3}
    success = await saga.execute(context)
    print(f"Order result: {'Success' if success else 'Rolled back'}")


asyncio.run(main())

Module 31: Time in Distributed Systems

Clock Drift & NTP Limitations

Every computer has a quartz oscillator that ticks at a slightly different rate. Over time, clocks drift — they diverge from true time and from each other. NTP (Network Time Protocol) periodically resynchronizes clocks, but has fundamental accuracy limits.

                            
                            Why You Can't Trust Wall Clocks: NTP accuracy is typically 1-10ms on LAN, 10-100ms across the internet. Quartz crystals drift ~10-20 ppm (parts per million), meaning a clock drifts 0.86-1.73 seconds per day. If two events happen 5ms apart on different machines, you cannot determine their order using wall clocks alone. This breaks any system that uses timestamps for ordering (last-write-wins, conflict resolution, expiration checks).
                        

Clock anomalies that break systems:

Clock skew: Two machines disagree on the current time
Clock jump: NTP correction causes time to leap forward (or backward with slew mode)
Leap seconds: UTC inserts a 61st second — some systems crash, others repeat a second
Monotonicity violation: gettimeofday() can return earlier values after NTP adjustments

Lamport Timestamps

Leslie Lamport's 1978 paper established that we don't need physical time to order events — we need logical time. A Lamport timestamp is a simple integer counter that provides a partial ordering of events in a distributed system.

                            
                            Lamport Clock Rules: (1) Before executing an event, increment the counter. (2) When sending a message, include the counter value. (3) When receiving a message, set the counter to max(local, received) + 1. This guarantees: if event A happened-before event B, then L(A) < L(B). But the converse is NOT true — L(A) < L(B) does NOT mean A happened before B (concurrent events can have any order).
                        

class LamportClock:
    """Lamport logical clock for event ordering."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.time = 0

    def tick(self) -> int:
        """Local event: increment clock."""
        self.time += 1
        return self.time

    def send(self) -> tuple[int, str]:
        """Send event: increment and return timestamp."""
        self.time += 1
        return (self.time, self.node_id)

    def receive(self, sender_time: int) -> int:
        """Receive event: merge clocks and increment."""
        self.time = max(self.time, sender_time) + 1
        return self.time


# Demonstrate causal ordering
node_a = LamportClock("A")
node_b = LamportClock("B")

# Node A does local work
node_a.tick()  # A.time = 1
node_a.tick()  # A.time = 2

# Node A sends message to B
msg_time, sender = node_a.send()  # A.time = 3
print(f"A sends at t={msg_time}")

# Node B does local work independently
node_b.tick()  # B.time = 1

# Node B receives A's message
node_b.receive(msg_time)  # B.time = max(1, 3) + 1 = 4
print(f"B receives, updates to t={node_b.time}")

# Guaranteed: send(3) happened-before receive(4)
# But B's tick(1) and A's tick(2) are CONCURRENT — no ordering possible
print(f"A clock: {node_a.time}, B clock: {node_b.time}")

Vector Clocks

Lamport clocks can't detect concurrency — they give a total order where concurrent events get arbitrary ordering. Vector clocks solve this by maintaining a vector of counters (one per node), enabling detection of causal ordering AND concurrency.

                            
                            Vector Clock Semantics: Each node maintains a vector [t₁, t₂, ..., tₙ] of size N (number of nodes). Entry i represents the latest known time of node i. Comparison rules: V(A) < V(B) iff every entry in A ≤ corresponding entry in B, and at least one is strictly less. If neither V(A) < V(B) nor V(B) < V(A), the events are concurrent — a conflict that must be resolved.
                        

Vector Clock Divergence & Conflict Detection

flowchart TD
    A1["Node A: [1,0,0]"] --> A2["Node A: [2,0,0]"]
    A2 -->|"send"| B2["Node B: [2,1,0]"]
    B1["Node B: [0,1,0]"] --> B2
    B2 --> B3["Node B: [2,2,0]"]
    A2 --> A3["Node A: [3,0,0]"]
    C1["Node C: [0,0,1]"] --> C2["Node C: [0,0,2]"]

    A3 --> CONFLICT{"A3=[3,0,0] vs C2=[0,0,2]
CONCURRENT — conflict!"}
    C2 --> CONFLICT
    B3 --> CAUSAL{"B3=[2,2,0] > A1=[1,0,0]
B3 causally after A1"}

Hybrid Logical Clocks (HLC)

Hybrid Logical Clocks combine the best of physical and logical time. An HLC timestamp has two parts: a physical component (wall clock, bounded by NTP uncertainty) and a logical component (for ordering events within the same physical time). This gives you causal ordering while keeping timestamps close to real time.

Google TrueTime takes a different approach — it uses atomic clocks and GPS receivers to bound clock uncertainty. TrueTime returns an interval [earliest, latest] instead of a point in time. Spanner waits until the uncertainty window passes before committing, guaranteeing linearizability despite clock drift. This requires specialized hardware that most systems don't have.

CockroachDB's HLC provides a practical alternative:

Physical component: wall clock, bounded by max_offset (default 500ms)
Logical component: counter for events at the same physical timestamp
If physical time moves forward, reset logical counter to 0
If two events have same physical time, logical counter provides ordering
Transactions that might violate ordering trigger "uncertainty restarts"

Module 32: Distributed Messaging Systems

Delivery Guarantees

Message delivery in distributed systems has three semantic levels, each with different trade-offs between reliability, performance, and complexity:

At-Most-Once (Fire-and-Forget):

Message is sent once with no confirmation or retry
Message may be lost (network failure, consumer crash)
Zero duplicate messages — simplest semantics
Use case: metrics, logging, non-critical notifications
Implementation: UDP, async publish without ACK

At-Least-Once (ACK + Retry):

Producer retries until it receives acknowledgement
Guarantees delivery but may produce duplicates
Consumer must be idempotent — processing same message twice must be safe
Use case: most business transactions, event sourcing
Implementation: Kafka with acks=all, RabbitMQ with publisher confirms

Exactly-Once (Hardest Problem):

Each message is processed exactly one time — no loss, no duplicates
Technically impossible across network boundaries (Two Generals Problem)
Achieved in practice through idempotent processing + deduplication
Use case: financial transactions, inventory updates, billing
Implementation: Kafka transactions, transactional outbox + deduplication table

Exactly-Once Semantics in Practice

Kafka achieves exactly-once semantics (EOS) through the combination of an idempotent producer (assigns sequence numbers to detect duplicates at the broker) and transactional consumers (atomic read-process-write within a single Kafka transaction).

# Kafka producer configuration for exactly-once semantics
producer:
  bootstrap.servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"

  # Idempotent producer: broker deduplicates by producer ID + sequence number
  enable.idempotence: true
  acks: all                    # Wait for ALL in-sync replicas
  retries: 2147483647          # Infinite retries (bounded by timeout)
  max.in.flight.requests.per.connection: 5  # Safe with idempotence

  # Transactional producer: atomic multi-partition writes
  transactional.id: "order-processor-1"

  # Reliability settings
  delivery.timeout.ms: 120000  # 2 minutes total delivery timeout
  request.timeout.ms: 30000    # Individual request timeout

consumer:
  bootstrap.servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
  group.id: "order-processors"

  # Read-committed: only see messages from committed transactions
  isolation.level: read_committed

  # Manual offset management (committed atomically with processing)
  enable.auto.commit: false

  # Start from earliest if no committed offset exists
  auto.offset.reset: earliest

Transactional Outbox Pattern

The transactional outbox pattern solves the dual-write problem: how do you atomically update a database AND publish an event? If you update the DB first and the event publish fails, the event is lost. If you publish first and the DB write fails, you sent a lie.

The solution: write the event to an "outbox" table in the same database transaction as the business data. A separate process (relay/poller or CDC connector) reads the outbox and publishes to the message broker. The database transaction guarantees atomicity between the data change and the event record.

Transactional Outbox Pattern Flow

flowchart LR
    subgraph "Same DB Transaction"
        A[Service] -->|"1. Write data"| B[(Orders Table)]
        A -->|"2. Write event"| C[(Outbox Table)]
    end
    C -->|"3. Poll/CDC"| D[Relay Process]
    D -->|"4. Publish"| E[Message Broker]
    E -->|"5. Deliver"| F[Consumer]
    D -->|"6. Mark sent"| C

#!/bin/bash
# Monitor Kafka consumer group lag — detect if consumers are falling behind

BOOTSTRAP="kafka-1:9092"
GROUP="order-processors"
TOPIC="orders"

echo "=== Consumer Group Lag Report ==="
echo "Group: $GROUP | Topic: $TOPIC"
echo "================================="

# Get consumer group lag using kafka-consumer-groups CLI
kafka-consumer-groups.sh \
    --bootstrap-server "$BOOTSTRAP" \
    --group "$GROUP" \
    --describe 2>/dev/null | \
    awk 'NR>1 {
        partition=$3; current=$4; end=$5; lag=$6;
        total_lag += lag;
        printf "  Partition %s: offset=%s end=%s lag=%s\n", partition, current, end, lag
    }
    END { printf "\n  TOTAL LAG: %d messages\n", total_lag }'

# Alert if lag exceeds threshold
THRESHOLD=10000
TOTAL_LAG=$(kafka-consumer-groups.sh \
    --bootstrap-server "$BOOTSTRAP" \
    --group "$GROUP" \
    --describe 2>/dev/null | \
    awk 'NR>1 {sum+=$6} END {print sum}')

if [ "$TOTAL_LAG" -gt "$THRESHOLD" ]; then
    echo "⚠️  WARNING: Consumer lag ($TOTAL_LAG) exceeds threshold ($THRESHOLD)"
    echo "   Possible causes: slow consumer, consumer crash, rebalancing"
fi

Case Studies

Uber: Saga-Based Payment Processing

Case Study Uber — Distributed Payment Saga

Uber's trip payment involves multiple services: ride pricing, payment authorization, driver payout, promotions, and receipts. They use an orchestrated saga with their internal "Cadence" workflow engine (now open-sourced as Temporal).

The saga flow:

Calculate fare — pricing service determines final amount
Authorize payment — charge rider's payment method (hold, not capture)
Apply promotions — deduct credits, promo codes
Capture payment — finalize the charge
Pay driver — credit driver's account
Generate receipt — create and send receipt

Compensation chain: If step 4 (capture) fails after step 2 (auth), the compensation releases the authorization hold. If step 5 (driver pay) fails, the compensation reverses the capture and refunds the rider. Each compensation is idempotent and can be retried.

Key design decisions:

Orchestrator persists saga state to survive crashes (Temporal durability)
Each step has configurable retry policies and timeouts
Compensations run in reverse order with at-least-once delivery
Idempotency keys prevent duplicate charges during retries

Saga Orchestration Temporal Idempotency

CockroachDB: Hybrid Logical Clocks

Case Study CockroachDB — HLC for Serializable Transactions

CockroachDB provides serializable isolation across a distributed SQL database without specialized hardware (unlike Google Spanner's atomic clocks). It uses Hybrid Logical Clocks (HLC) to order transactions.

How HLC works in CockroachDB:

Each timestamp is (wall_time, logical_counter)
wall_time is synchronized via NTP (bounded by --max-offset, default 500ms)
logical_counter differentiates events at the same wall time
Causally-related events always get ordered timestamps (Lamport property)

Uncertainty interval: When a transaction reads a key, it might encounter a value written by a concurrent transaction on a different node. If that value's timestamp falls within the reader's uncertainty window [read_ts, read_ts + max_offset], CockroachDB triggers an "uncertainty restart" — it bumps the transaction's timestamp above the uncertain value and retries.

Trade-off: Tighter max_offset (better NTP) = fewer uncertainty restarts = better performance. CockroachDB recommends nodes within 500ms clock offset, achievable on any modern cloud with NTP.

HLC Serializable Clock Uncertainty Distributed SQL

Conclusion & Next Steps

The key takeaways:

Avoid distributed transactions when possible. 2PC is blocking and fragile. Prefer sagas with compensating transactions for cross-service consistency — accept eventual consistency in exchange for availability.
Choreography for simple flows, orchestration for complex. 3-4 step sagas work well with events. 5+ steps need a central orchestrator to manage complexity and make the flow visible.
Compensating transactions must be idempotent and cannot fail permanently. Design every saga step with its undo operation from day one. Test compensation paths as rigorously as the happy path.
Never trust wall clocks for ordering. Physical clocks drift, jump, and disagree. Use logical clocks (Lamport for total order, vector clocks for causal + conflict detection) or hybrid clocks (HLC for real-time approximation with causal guarantees).
Exactly-once is idempotent at-least-once. True exactly-once across network boundaries is impossible. Achieve it in practice with idempotent consumers + deduplication (idempotency keys, dedup tables, Kafka transactions).
The outbox pattern solves dual-write. Never update a database and publish an event in separate transactions. Write both atomically to the DB, then relay events asynchronously. This is the foundation of reliable event-driven architecture.
Monitor consumer lag. In at-least-once systems, growing lag indicates consumers can't keep up — leading to stale data, unbounded memory growth, and eventually message loss if retention expires.

Next in the Series

In Part 16: Telemetry & Performance Modeling, we'll master the three pillars of observability — metrics, traces, and logs — plus performance modeling with USE/RED methods, capacity planning, and SLI/SLO/SLA reliability frameworks.

Previous Part 14: Distributed Coordination Next Part 16: Telemetry & Performance

Cookie Consent

Part 15: Transactions, Time & Messaging

Table of Contents

Module 30: Distributed Transactions

Two-Phase Commit (2PC)

The Blocking Problem

Saga Patterns

Compensating Transactions

Module 31: Time in Distributed Systems

Clock Drift & NTP Limitations

Lamport Timestamps

Vector Clocks

Hybrid Logical Clocks (HLC)

Module 32: Distributed Messaging Systems

Delivery Guarantees

Exactly-Once Semantics in Practice

Transactional Outbox Pattern

Case Studies

Uber: Saga-Based Payment Processing

CockroachDB: Hybrid Logical Clocks

Conclusion & Next Steps

Next in the Series

Cookie Consent

Part 15: Transactions, Time & Messaging

Table of Contents

Module 30: Distributed Transactions

Two-Phase Commit (2PC)

The Blocking Problem

Saga Patterns

Compensating Transactions

Module 31: Time in Distributed Systems

Clock Drift & NTP Limitations

Lamport Timestamps

Vector Clocks

Hybrid Logical Clocks (HLC)

Module 32: Distributed Messaging Systems

Delivery Guarantees

Exactly-Once Semantics in Practice

Transactional Outbox Pattern

Case Studies

Uber: Saga-Based Payment Processing

CockroachDB: Hybrid Logical Clocks

Conclusion & Next Steps

Next in the Series

Related Articles in This Series

Part 14: Distributed Coordination & Consistency

Part 16: Telemetry & Performance Modeling

Part 17: Evolutionary Architecture & Conway's Law