
System Design Series Part 13: Distributed Systems Deep Dive

January 25, 2026 Wasil Zafar 55 min read

Deep dive into distributed systems including consensus algorithms (Paxos, Raft), distributed coordination (ZooKeeper), distributed file systems (HDFS, GFS), and building fault-tolerant distributed architectures at scale.

Table of Contents

  1. Distributed Systems Fundamentals
  2. Consensus Algorithms
  3. Distributed Coordination
  4. Distributed File Systems
  5. Distributed Clocks & Ordering
  6. Next Steps

Distributed Systems Fundamentals

Series Navigation: This is Part 13 of the 15-part System Design Series. Review Part 12: Low-Level Design first.

Distributed Systems are systems where components are located on different networked computers and communicate by passing messages. They present unique challenges around consistency, availability, and partition tolerance that don't exist in single-machine systems.

Key Insight: Distributed systems fail in ways that single-machine systems never do. Understanding consensus algorithms and coordination patterns is essential for building reliable distributed architectures.

Key Challenges in Distributed Systems

  • Network Failures: Messages can be lost, delayed, duplicated, or reordered. Mitigations: retries, timeouts, idempotency, sequence numbers.
  • Partial Failures: Some nodes fail while others keep running. Mitigations: heartbeats, failure detectors, consensus protocols.
  • Clock Skew: There is no global clock; different machines report different times. Mitigations: logical clocks, vector clocks, TrueTime.
  • Consistency: Replicas must be kept in sync. Mitigations: consensus algorithms, quorums, eventual consistency.
  • Split-Brain: A network partition can produce multiple leaders. Mitigations: fencing tokens, leader leases, consensus.
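Several of these mitigations work together in practice. As a minimal sketch (all names hypothetical), an idempotency key lets a server deduplicate retried requests, so the retries used against network failures never apply an operation twice:

```python
class IdempotentServer:
    """Deduplicates retried requests using client-supplied idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency_key -> cached result

    def handle(self, idempotency_key, operation):
        # A retried request with the same key returns the cached result
        # instead of re-applying the operation.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = operation()
        self._results[idempotency_key] = result
        return result

# Usage: a retry of the same request is applied only once
counter = {"value": 0}

def increment():
    counter["value"] += 1
    return counter["value"]

server = IdempotentServer()
server.handle("req-123", increment)  # applies the operation
server.handle("req-123", increment)  # deduplicated: returns cached result
print(counter["value"])  # 1
```

Real systems persist the key-to-result map (with expiry) so deduplication survives server restarts.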

The Eight Fallacies of Distributed Computing

# Assumptions that engineers wrongly make about distributed systems:

"""
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Reality: All of these assumptions are FALSE in production systems.
"""

# Example: Handling unreliable networks
import time

class ReliableClient:
    def __init__(self, max_retries=3, timeout=5):
        self.max_retries = max_retries
        self.timeout = timeout
    
    def send_with_retry(self, request):
        """Retry with exponential backoff on timeouts, fixed delay on connection errors."""
        for attempt in range(self.max_retries):
            try:
                # _send is the transport-specific call (HTTP, gRPC, etc.)
                return self._send(request, timeout=self.timeout)
            except TimeoutError:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s...
            except ConnectionError:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(1)

Consensus Algorithms

Consensus algorithms allow distributed nodes to agree on a single value, even when some nodes fail. They are the foundation of replicated state machines and distributed databases.

  • Agreement: All non-faulty nodes decide on the same value
  • Validity: The decided value was proposed by some node
  • Termination: All non-faulty nodes eventually decide

Paxos Algorithm

Key Insight: Paxos is notoriously difficult to understand and implement. It's often referred to as the algorithm that "makes your head hurt." Raft was designed as a more understandable alternative.

Paxos Roles

  • Proposers: Propose values to be agreed upon
  • Acceptors: Vote on proposals and remember accepted values
  • Learners: Learn the decided value

Basic Paxos Flow

# Basic Paxos (Single-Decree Paxos)
"""
Phase 1: Prepare
1. Proposer selects proposal number n
2. Proposer sends Prepare(n) to majority of acceptors
3. Acceptors respond with Promise(n, v) if n > highest seen
   - Include previously accepted value v (if any)

Phase 2: Accept
1. If proposer receives majority promises:
   - If any promise included accepted value, use highest-numbered one
   - Otherwise, use proposer's own value
2. Proposer sends Accept(n, v) to acceptors
3. Acceptors accept if n >= highest promised
4. If majority accept, value is chosen
"""

class PaxosAcceptor:
    def __init__(self, node_id):
        self.node_id = node_id
        self.promised_id = None
        self.accepted_id = None
        self.accepted_value = None
    
    def receive_prepare(self, proposal_id):
        """Phase 1b: Respond to Prepare"""
        if self.promised_id is None or proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return {
                'promised': True,
                'accepted_id': self.accepted_id,
                'accepted_value': self.accepted_value
            }
        return {'promised': False}
    
    def receive_accept(self, proposal_id, value):
        """Phase 2b: Respond to Accept"""
        if self.promised_id is None or proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return {'accepted': True}
        return {'accepted': False}

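To complement the acceptor above, here is a minimal proposer sketch. It drives both phases against in-process acceptor objects (a stand-in for real RPCs); the condensed `Acceptor` mirrors the `PaxosAcceptor` logic so the example runs on its own:

```python
class Acceptor:
    """Condensed version of the PaxosAcceptor above, for a runnable demo."""
    def __init__(self):
        self.promised_id = None
        self.accepted_id = None
        self.accepted_value = None

    def receive_prepare(self, pid):
        if self.promised_id is None or pid > self.promised_id:
            self.promised_id = pid
            return {'promised': True, 'accepted_id': self.accepted_id,
                    'accepted_value': self.accepted_value}
        return {'promised': False}

    def receive_accept(self, pid, value):
        if self.promised_id is None or pid >= self.promised_id:
            self.promised_id = self.accepted_id = pid
            self.accepted_value = value
            return {'accepted': True}
        return {'accepted': False}


class PaxosProposer:
    def __init__(self, proposal_id, acceptors):
        self.proposal_id = proposal_id
        self.acceptors = acceptors

    def propose(self, value):
        """Run both Paxos phases; returns the chosen value or None."""
        majority = len(self.acceptors) // 2 + 1

        # Phase 1: Prepare
        promises = [a.receive_prepare(self.proposal_id) for a in self.acceptors]
        granted = [p for p in promises if p['promised']]
        if len(granted) < majority:
            return None

        # If any acceptor already accepted a value, adopt the
        # highest-numbered one instead of our own
        accepted = [p for p in granted if p['accepted_id'] is not None]
        if accepted:
            value = max(accepted, key=lambda p: p['accepted_id'])['accepted_value']

        # Phase 2: Accept
        acks = [a.receive_accept(self.proposal_id, value) for a in self.acceptors]
        if sum(1 for ack in acks if ack['accepted']) >= majority:
            return value
        return None


acceptors = [Acceptor() for _ in range(3)]
proposer = PaxosProposer(proposal_id=1, acceptors=acceptors)
print(proposer.propose("value-A"))  # chosen by a majority of acceptors
```

Note the key safety behavior: a later proposer with a higher proposal number is forced to adopt any already-accepted value, so a chosen value can never be overwritten.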
Raft Algorithm

Raft was designed for understandability. It separates consensus into three sub-problems:

  • Leader Election: Choose one leader at a time
  • Log Replication: Leader replicates its log to followers
  • Safety: Never return stale or incorrect data

Raft Implementation

from enum import Enum
import random
import time

class State(Enum):
    FOLLOWER = 1
    CANDIDATE = 2
    LEADER = 3

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers
        
        # Persistent state
        self.current_term = 0
        self.voted_for = None
        self.log = []  # List of (term, command)
        
        # Volatile state
        self.commit_index = 0
        self.last_applied = 0
        self.state = State.FOLLOWER
        
        # Leader state (reinitialized after election)
        self.next_index = {}
        self.match_index = {}
        
        # Timing
        self.election_timeout = self._random_timeout()
        self.last_heartbeat = time.time()
    
    def _random_timeout(self):
        return random.uniform(150, 300) / 1000  # 150-300ms
    
    def _last_log_term(self):
        return self.log[-1][0] if self.log else 0
    
    def on_election_timeout(self):
        """Start election if no heartbeat received"""
        self.state = State.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        votes_received = 1  # Vote for self
        
        # Request votes from all peers
        for peer in self.peers:
            response = self.send_request_vote(peer)
            if response and response['vote_granted']:
                votes_received += 1
        
        # Check if won election (majority of the full cluster: peers + self)
        if votes_received > (len(self.peers) + 1) // 2:
            self._become_leader()
        else:
            self.state = State.FOLLOWER
    
    def _become_leader(self):
        """Transition to leader state"""
        self.state = State.LEADER
        # Initialize leader state
        for peer in self.peers:
            self.next_index[peer] = len(self.log) + 1
            self.match_index[peer] = 0
        # Send initial heartbeats
        self.send_heartbeats()
    
    def handle_request_vote(self, candidate_term, candidate_id, last_log_idx, last_log_term):
        """Handle RequestVote RPC"""
        if candidate_term < self.current_term:
            return {'term': self.current_term, 'vote_granted': False}
        
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.state = State.FOLLOWER
            self.voted_for = None
        
        # Check if candidate's log is at least as up-to-date
        log_ok = (last_log_term > self._last_log_term() or 
                  (last_log_term == self._last_log_term() and 
                   last_log_idx >= len(self.log)))
        
        if (self.voted_for is None or self.voted_for == candidate_id) and log_ok:
            self.voted_for = candidate_id
            return {'term': self.current_term, 'vote_granted': True}
        
        return {'term': self.current_term, 'vote_granted': False}
    
    def handle_append_entries(self, leader_term, leader_id, prev_log_idx, 
                               prev_log_term, entries, leader_commit):
        """Handle AppendEntries RPC (heartbeat + log replication)"""
        if leader_term < self.current_term:
            return {'term': self.current_term, 'success': False}
        
        self.last_heartbeat = time.time()
        self.state = State.FOLLOWER
        
        if leader_term > self.current_term:
            self.current_term = leader_term
            self.voted_for = None
        
        # Check log consistency
        if prev_log_idx > 0:
            if len(self.log) < prev_log_idx:
                return {'term': self.current_term, 'success': False}
            if self.log[prev_log_idx - 1][0] != prev_log_term:
                return {'term': self.current_term, 'success': False}
        
        # Append new entries
        for i, entry in enumerate(entries):
            idx = prev_log_idx + i
            if idx < len(self.log):
                if self.log[idx][0] != entry[0]:
                    self.log = self.log[:idx]
                    self.log.append(entry)
            else:
                self.log.append(entry)
        
        # Update commit index
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log))
        
        return {'term': self.current_term, 'success': True}

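One piece the sketch above omits is how the leader advances its commit index: an entry is committed once it is replicated on a majority and belongs to the leader's current term. A standalone sketch of that rule (parameters mirror the RaftNode fields above):

```python
def advance_commit_index(current_term, log, commit_index, match_index):
    """Return the new commit index for a Raft leader.

    log is a list of (term, command); match_index maps each follower to the
    highest log index (1-based) known to be replicated on it. The leader
    always has every entry, so it counts as one replica implicitly.
    """
    cluster_size = len(match_index) + 1  # followers + leader
    for n in range(len(log), commit_index, -1):  # try highest index first
        # Raft safety rule: only commit entries from the current term directly
        if log[n - 1][0] != current_term:
            continue
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicas > cluster_size // 2:
            return n
    return commit_index

# Example: 5-node cluster; entry 2 (term 3) is on the leader + 2 followers
log = [(1, 'a'), (3, 'b'), (3, 'c')]
match = {'n2': 2, 'n3': 2, 'n4': 1, 'n5': 0}
print(advance_commit_index(current_term=3, log=log, commit_index=0,
                           match_index=match))  # 2
```

The current-term restriction is the subtle part: entries from older terms are only committed indirectly, once a current-term entry after them commits.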
Byzantine Fault Tolerance (BFT)

BFT algorithms handle malicious nodes (Byzantine faults), not just crash faults. They're used in blockchain and high-security systems.

  • Paxos/Raft: tolerates f < n/2 crash faults with O(n) messages. Used in databases and coordination services.
  • PBFT: tolerates f < n/3 Byzantine faults with O(n²) messages. Used in permissioned blockchains.
  • Proof of Work: requires a majority (51%) of honest compute with O(n) messages. Used in Bitcoin and public chains.

PBFT Flow

# PBFT (Practical Byzantine Fault Tolerance)
"""
Requires 3f + 1 nodes to tolerate f Byzantine faults

Phases:
1. Pre-Prepare: Primary broadcasts client request
2. Prepare: Each replica broadcasts PREPARE message
3. Commit: After 2f PREPARE messages, broadcast COMMIT
4. Reply: After 2f+1 COMMIT messages, execute and reply

View Change: If primary is faulty, replicas elect new primary
"""

class PBFTNode:
    def __init__(self, node_id, total_nodes):
        self.node_id = node_id
        self.n = total_nodes
        self.f = (total_nodes - 1) // 3  # Max faulty nodes
        self.view = 0
        self.sequence_number = 0
        self.prepared = {}   # (v, n) -> set of nodes
        self.committed = {}  # (v, n) -> set of nodes
        self.log = []
    
    def is_primary(self):
        return self.node_id == self.view % self.n
    
    def handle_request(self, request):
        if not self.is_primary():
            return self.forward_to_primary(request)
        
        self.sequence_number += 1
        # Broadcast PRE-PREPARE
        msg = {
            'type': 'PRE-PREPARE',
            'view': self.view,
            'seq': self.sequence_number,
            'digest': hash(request),
            'request': request
        }
        self.broadcast(msg)
    
    def handle_pre_prepare(self, msg):
        # Verify and broadcast PREPARE
        if self.verify_pre_prepare(msg):
            prepare_msg = {
                'type': 'PREPARE',
                'view': msg['view'],
                'seq': msg['seq'],
                'digest': msg['digest'],
                'node_id': self.node_id
            }
            self.broadcast(prepare_msg)
    
    def handle_prepare(self, msg):
        key = (msg['view'], msg['seq'])
        if key not in self.prepared:
            self.prepared[key] = set()
        self.prepared[key].add(msg['node_id'])
        
        # If received 2f prepares (including self), enter COMMIT phase
        if len(self.prepared[key]) >= 2 * self.f:
            commit_msg = {
                'type': 'COMMIT',
                'view': msg['view'],
                'seq': msg['seq'],
                'digest': msg['digest'],
                'node_id': self.node_id
            }
            self.broadcast(commit_msg)

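The sketch above stops at broadcasting COMMIT; the last step is counting COMMIT messages and executing the request. A hedged continuation in the same spirit, written as a standalone function (`execute` is an assumed state-machine hook, not part of the original):

```python
def handle_commit(msg, committed, f, executed, execute):
    """Count COMMIT messages; execute once 2f + 1 replicas have committed.

    committed maps (view, seq) -> set of node ids; executed tracks which
    (view, seq) pairs were already applied, so late duplicates are harmless.
    """
    key = (msg['view'], msg['seq'])
    committed.setdefault(key, set()).add(msg['node_id'])
    if len(committed[key]) >= 2 * f + 1 and key not in executed:
        executed.add(key)
        execute(msg)  # apply to the state machine, then reply to the client

# Usage: with f = 1, execution happens only after the third COMMIT
applied = []
committed, executed = {}, set()
for node in range(3):
    handle_commit({'view': 0, 'seq': 1, 'node_id': node},
                  committed, f=1, executed=executed, execute=applied.append)
print(len(applied))  # 1: executed exactly once, after the third COMMIT
```
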
Distributed Coordination

Coordination services provide primitives for building distributed systems: configuration management, leader election, distributed locks, and group membership.

Apache ZooKeeper

ZooKeeper provides a hierarchical namespace (like a filesystem) with strong consistency guarantees. It's used by Kafka, HBase, and many other distributed systems.

ZooKeeper Patterns

import json
import time

from kazoo.client import KazooClient

# Connect to ZooKeeper ensemble
zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Configuration Management
def store_config(path, config):
    """Store configuration in ZooKeeper"""
    zk.ensure_path(path)
    zk.set(path, json.dumps(config).encode())

def watch_config(path, callback):
    """Watch for configuration changes"""
    @zk.DataWatch(path)
    def watch_config_node(data, stat):
        if data:
            config = json.loads(data.decode())
            callback(config)

# Distributed Lock
lock = zk.Lock("/locks/my-resource", "client-1")
with lock:
    # Critical section - only one client at a time
    perform_exclusive_operation()

# Leader Election
election = zk.Election("/election/my-service", "server-1")

def leader_function():
    """Called when this node becomes leader"""
    print("I am the leader!")
    while True:
        # Do leader work
        time.sleep(1)

# Run for election (blocks until elected or cancelled)
election.run(leader_function)

# Service Discovery
def register_service(service_name, host, port):
    """Register a service instance"""
    path = f"/services/{service_name}"
    zk.ensure_path(path)
    # Ephemeral node: auto-deleted if client disconnects
    zk.create(
        f"{path}/instance-",
        f"{host}:{port}".encode(),
        ephemeral=True,
        sequence=True
    )

def discover_services(service_name):
    """Get all instances of a service"""
    path = f"/services/{service_name}"
    children = zk.get_children(path)
    instances = []
    for child in children:
        data, _ = zk.get(f"{path}/{child}")
        instances.append(data.decode())
    return instances

etcd

etcd is a distributed key-value store using Raft consensus. It's the backbone of Kubernetes for storing cluster state.

etcd Operations

import etcd3

# Connect to etcd cluster
etcd = etcd3.client(host='etcd1', port=2379)

# Key-Value Operations
etcd.put('/config/database/host', 'db.example.com')
value, metadata = etcd.get('/config/database/host')

# Transactions (atomic operations)
etcd.transaction(
    compare=[
        etcd.transactions.value('/config/version') == '1'
    ],
    success=[
        etcd.transactions.put('/config/database/host', 'new-db.example.com'),
        etcd.transactions.put('/config/version', '2')
    ],
    failure=[
        etcd.transactions.get('/config/version')
    ]
)

# Watch for changes
events_iterator, cancel = etcd.watch('/config/', prefix=True)
for event in events_iterator:
    print(f"Key: {event.key}, Value: {event.value}")

# Lease-based ephemeral keys (auto-expire)
lease = etcd.lease(ttl=30)  # 30 second lease
etcd.put('/services/api/instance-1', '192.168.1.10:8080', lease=lease)

# Keep lease alive (heartbeat)
while True:
    lease.refresh()
    time.sleep(10)

# Distributed Lock
lock = etcd.lock('/locks/my-resource', ttl=30)
with lock:
    # Critical section
    pass

Leader Election Patterns

Leader Election with Fencing

# Leader Election with Fencing Tokens
"""
Problem: Network partitions can cause "split-brain" (multiple leaders)
Solution: Fencing tokens ensure only the true leader can make changes

Fencing token: Monotonically increasing number given to each new leader
Storage systems reject requests with old fencing tokens
"""

class LeaderElection:
    def __init__(self, zk_client, election_path, node_id):
        self.zk = zk_client
        self.path = election_path
        self.node_id = node_id
        self.is_leader = False
        self.fencing_token = 0
    
    def run_election(self):
        """Participate in leader election"""
        # Create sequential ephemeral node
        self.zk.ensure_path(self.path)
        self.my_node = self.zk.create(
            f"{self.path}/candidate-",
            self.node_id.encode(),
            ephemeral=True,
            sequence=True
        )
        
        self._check_leadership()
    
    def _check_leadership(self):
        """Check if this node is the leader"""
        children = sorted(self.zk.get_children(self.path))
        my_seq = self.my_node.split('-')[-1]
        
        if children[0].endswith(my_seq):
            # I am the leader!
            self._become_leader()
        else:
            # Watch the node before me
            my_index = next(i for i, c in enumerate(children) if c.endswith(my_seq))
            predecessor = children[my_index - 1]
            
            @self.zk.DataWatch(f"{self.path}/{predecessor}")
            def watch_predecessor(data, stat):
                if stat is None:  # Predecessor deleted
                    self._check_leadership()
    
    def _become_leader(self):
        """Transition to leader with new fencing token"""
        # Increment fencing token (stored in ZK so it survives leader changes)
        token_path = f"{self.path}/fencing_token"
        self.zk.ensure_path(token_path)
        token_data, stat = self.zk.get(token_path)
        self.fencing_token = (int(token_data.decode()) if token_data else 0) + 1
        self.zk.set(token_path, str(self.fencing_token).encode())
        
        self.is_leader = True
        print(f"Became leader with fencing token {self.fencing_token}")
    
    def perform_leader_action(self, storage, operation):
        """Perform action only if still valid leader"""
        if not self.is_leader:
            raise NotLeaderError()
        
        # Include fencing token in all storage operations
        storage.execute(operation, fencing_token=self.fencing_token)
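Fencing only works if the storage system enforces the token. A minimal sketch of that storage side (hypothetical class, matching the fencing_token parameter used above):

```python
class StaleTokenError(Exception):
    pass


class FencedStorage:
    """Rejects writes carrying a fencing token older than the newest seen."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def execute(self, operation, fencing_token):
        # A write from a deposed leader arrives with a stale token and is
        # rejected, even if that leader still believes it is in charge.
        if fencing_token < self.highest_token:
            raise StaleTokenError(
                f"token {fencing_token} < {self.highest_token}")
        self.highest_token = fencing_token
        key, value = operation  # operation modeled as a (key, value) write
        self.data[key] = value

# Usage: the new leader (token 2) writes; the deposed leader (token 1) cannot
storage = FencedStorage()
storage.execute(("x", 1), fencing_token=2)
try:
    storage.execute(("x", 99), fencing_token=1)
except StaleTokenError:
    print("stale write rejected")
```
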

Distributed File Systems

Distributed file systems store files across multiple machines, providing scalability, fault tolerance, and high throughput for large datasets.

HDFS (Hadoop Distributed File System)

HDFS is designed for large files (GB to TB) with write-once, read-many access patterns. It's optimized for batch processing workloads.

HDFS Architecture

# HDFS Architecture
"""
Components:
- NameNode: Metadata server (single master)
  - File namespace (directory tree)
  - Block-to-DataNode mapping
  - Persists to fsimage + edit log
  
- DataNode: Block storage (many workers)
  - Stores actual data blocks (default 128MB)
  - Sends heartbeats to NameNode
  - Replication factor (default 3)
  
- Secondary NameNode: Checkpoint helper
  - Merges fsimage + edit log periodically
  - NOT a failover node!
"""

# Write Path
"""
1. Client contacts NameNode, requests to create file
2. NameNode creates entry in namespace, returns block ID and DataNode list
3. Client writes to first DataNode (DN1)
4. DN1 forwards to DN2, DN2 forwards to DN3 (pipeline)
5. Acks flow back: DN3 → DN2 → DN1 → Client
6. Client requests next block, repeat
7. Client closes file, NameNode persists metadata
"""

# Read Path
"""
1. Client contacts NameNode, requests file location
2. NameNode returns list of blocks and their DataNode locations
3. Client reads blocks from nearest DataNode (rack-aware)
4. If DataNode fails, client retries with next replica
"""

# Rack Awareness
"""
Default replication placement policy:
- 1st replica: Local rack, same node as client (or random)
- 2nd replica: Different rack
- 3rd replica: Same rack as 2nd, different node

Benefits:
- Survives rack failure (power, switch)
- Good write performance (local + one remote)
- Good read performance (can read from local rack)
"""

# Python HDFS Client
from hdfs import InsecureClient

client = InsecureClient('http://namenode:50070', user='hadoop')

# Write file
with client.write('/user/data/myfile.txt', overwrite=True) as writer:
    writer.write(b'Hello HDFS!')

# Read file
with client.read('/user/data/myfile.txt') as reader:
    content = reader.read()

# List directory
files = client.list('/user/data/')

# Get file status
status = client.status('/user/data/myfile.txt')
print(f"Size: {status['length']}, Replication: {status['replication']}")

GFS (Google File System)

GFS (2003) inspired HDFS. Key design choices:

  • Large blocks (64MB): Reduces metadata, amortizes seek time
  • Append-only: Optimized for append, not random writes
  • Relaxed consistency: At-least-once semantics for appends
  • Co-located compute: Move computation to data
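At-least-once append semantics mean a record can land in the file more than once (a retried append after a partial failure), so GFS-style applications deduplicate on read using record IDs embedded in their own record format. A sketch of that reader-side idea (the record layout is illustrative, not the actual GFS format):

```python
def read_records(records):
    """Yield each logical record once, skipping duplicates from retried appends.

    records is a sequence of (record_id, payload) pairs as a writer may have
    appended them, possibly with repeats from at-least-once retries.
    """
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # duplicate produced by a retried append
        seen.add(record_id)
        yield payload

# A retried append duplicated record "r2"; the reader sees each payload once
appended = [("r1", "alpha"), ("r2", "beta"), ("r2", "beta"), ("r3", "gamma")]
print(list(read_records(appended)))  # ['alpha', 'beta', 'gamma']
```
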

Object Storage (S3 Architecture)

Object storage (like S3) provides a flat namespace of buckets and objects, optimized for web-scale workloads with HTTP access.

S3-Style Architecture

# Object Storage Architecture
"""
Key Differences from File Systems:
- Flat namespace (bucket + key), not hierarchical
- Objects are immutable (replace, not update)
- Historically eventually consistent (S3 has offered strong read-after-write consistency since 2020)
- HTTP API (GET, PUT, DELETE, LIST)
- Designed for massive scale (exabytes)

Components:
- Load Balancer: Route requests
- API Gateway: Authentication, rate limiting
- Metadata Service: Object metadata, bucket policies
- Storage Nodes: Erasure-coded data chunks
"""

import hashlib
from datetime import datetime

class ObjectStorageService:
    def __init__(self, metadata_db, storage_cluster):
        self.metadata = metadata_db
        self.storage = storage_cluster
    
    def put_object(self, bucket, key, data, metadata=None):
        """Store object with erasure coding"""
        # Validate bucket exists and permissions
        if not self.metadata.bucket_exists(bucket):
            raise BucketNotFoundError()
        
        # Calculate content hash for integrity
        content_hash = hashlib.sha256(data).hexdigest()
        
        # Erasure code the data (e.g., Reed-Solomon)
        # Split into k data chunks, compute m parity chunks
        # Can recover from any m chunk failures
        chunks = self._erasure_encode(data, k=6, m=3)
        
        # Distribute chunks across storage nodes
        chunk_locations = []
        for i, chunk in enumerate(chunks):
            nodes = self.storage.select_nodes(num=3)  # Store each chunk on 3 nodes
            for node in nodes:
                node.store_chunk(bucket, key, i, chunk)
            chunk_locations.append(nodes)
        
        # Store metadata
        self.metadata.put_object_metadata(
            bucket, key,
            size=len(data),
            content_hash=content_hash,
            chunk_locations=chunk_locations,
            user_metadata=metadata,
            created_at=datetime.now()
        )
        
        return content_hash
    
    def get_object(self, bucket, key):
        """Retrieve object, reconstructing from erasure coding if needed"""
        meta = self.metadata.get_object_metadata(bucket, key)
        if not meta:
            raise ObjectNotFoundError()
        
        # Fetch chunks (only need k of k+m)
        chunks = []
        for i, locations in enumerate(meta['chunk_locations']):
            for node in locations:
                try:
                    chunk = node.get_chunk(bucket, key, i)
                    chunks.append((i, chunk))
                    break
                except ChunkNotFoundError:
                    continue  # Try next replica
        
        # Decode data
        data = self._erasure_decode(chunks, k=6, m=3)
        
        # Verify integrity
        if hashlib.sha256(data).hexdigest() != meta['content_hash']:
            raise DataCorruptionError()
        
        return data, meta

Distributed Clocks & Ordering

In distributed systems, there's no global clock. Physical clocks drift, and network delays are unpredictable. We need logical clocks to establish event ordering.

Lamport Clocks

# Lamport Logical Clocks
"""
Rules:
1. Each process has a counter, starts at 0
2. Before each event, increment counter
3. When sending message, include counter value
4. When receiving, set counter = max(local, received) + 1

Property: If a → b (a happened-before b), then L(a) < L(b)
Limitation: L(a) < L(b) does NOT imply a → b
"""

class LamportClock:
    def __init__(self):
        self.time = 0
    
    def tick(self):
        """Local event occurred"""
        self.time += 1
        return self.time
    
    def send(self):
        """Prepare timestamp for outgoing message"""
        self.tick()
        return self.time
    
    def receive(self, msg_timestamp):
        """Update clock on message receipt"""
        self.time = max(self.time, msg_timestamp) + 1
        return self.time

# Example: Three processes
p1_clock = LamportClock()
p2_clock = LamportClock()
p3_clock = LamportClock()

# P1 sends to P2
t1 = p1_clock.send()  # t1 = 1
print(f"P1 sends at time {t1}")

# P2 receives and sends to P3
t2 = p2_clock.receive(t1)  # t2 = 2
print(f"P2 receives at time {t2}")
t3 = p2_clock.send()  # t3 = 3
print(f"P2 sends at time {t3}")

# P3 receives
t4 = p3_clock.receive(t3)  # t4 = 4
print(f"P3 receives at time {t4}")

Vector Clocks

Vector clocks extend Lamport clocks to detect concurrent events (events where neither happened-before the other).

Vector Clock Implementation

# Vector Clocks
"""
Each process maintains a vector of counters, one per process.
Rules:
1. On local event: increment own counter
2. On send: include entire vector
3. On receive: take element-wise max, then increment own counter

Comparison:
- V1 < V2 (happened-before): All V1[i] <= V2[i] and at least one V1[i] < V2[i]
- V1 || V2 (concurrent): Neither V1 < V2 nor V2 < V1
"""

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.vector = [0] * num_nodes
    
    def tick(self):
        """Local event"""
        self.vector[self.node_id] += 1
        return self.copy()
    
    def send(self):
        """Prepare timestamp for message"""
        return self.tick()
    
    def receive(self, other_vector):
        """Merge on message receipt"""
        for i in range(len(self.vector)):
            self.vector[i] = max(self.vector[i], other_vector[i])
        self.vector[self.node_id] += 1
        return self.copy()
    
    def copy(self):
        return list(self.vector)
    
    @staticmethod
    def compare(v1, v2):
        """
        Returns:
        -1 if v1 < v2 (v1 happened before v2)
         1 if v1 > v2 (v2 happened before v1)
         0 if v1 || v2 (concurrent)
        """
        less = any(v1[i] < v2[i] for i in range(len(v1)))
        greater = any(v1[i] > v2[i] for i in range(len(v1)))
        
        if less and not greater:
            return -1
        elif greater and not less:
            return 1
        else:
            return 0  # Concurrent or equal

# Example: Detecting conflicts in distributed database
def resolve_conflict(value1, vc1, value2, vc2):
    """Resolve write conflict using vector clocks"""
    comparison = VectorClock.compare(vc1, vc2)
    
    if comparison == -1:
        return value2  # v2 is newer
    elif comparison == 1:
        return value1  # v1 is newer
    else:
        # Concurrent writes - need application-level resolution
        return merge_values(value1, value2)  # e.g., CRDTs

Google's TrueTime: Google Spanner uses TrueTime, which provides globally synchronized time with bounded uncertainty. This enables external consistency (real-time ordering) without vector clocks, but requires atomic clocks and GPS receivers in every datacenter.

Next Steps

You now have a deep understanding of distributed systems internals! Continue to Part 14 to learn about Authentication & Security patterns including OAuth, JWT, and advanced security practices.
