What is a Distributed System?
Definition
A distributed system is multiple computers coordinating to appear as one system. That single sentence contains the entire discipline. Every challenge, every algorithm, every tool in this series flows from this deceptively simple idea.
Think of it this way: your laptop is a single system. It has one CPU, one memory space, and one disk. When you save a file, it's either saved or it isn't. There's no ambiguity. But what happens when you need more power than one machine can provide? What happens when you need your system to survive hardware failure? You distribute.
Single Machine vs Distributed
On a single machine, communication between components happens through shared memory or local function calls — essentially instantaneous and perfectly reliable. In a distributed system, components communicate over the network, introducing three fundamental uncertainties:
| Aspect | Single Machine | Distributed System |
|---|---|---|
| Communication | Shared memory (nanoseconds) | Network messages (milliseconds+) |
| Failure mode | Total (crashes entirely) | Partial (some nodes fail) |
| Time | Single clock source | Multiple unsynchronized clocks |
| State | Consistent (one copy) | Potentially inconsistent (replicas) |
| Ordering | Deterministic | Non-deterministic |
Why Distribute?
Given these challenges, why build distributed systems at all? Three compelling reasons:
Horizontal Scaling
Single machines hit physical limits. You cannot buy a server with 10,000 CPU cores. But you can network 1,000 ten-core machines. Google processes over 8.5 billion searches per day — no single machine could handle that. Instead, thousands of machines work together, each handling a fraction of the load.
Fault Tolerance Through Redundancy
Hardware fails. Hard drives have a 2–5% annual failure rate. If your entire system runs on one machine and that machine dies, you're offline. Distributing across multiple machines means the system survives individual failures. Netflix's systems are designed so that even losing an entire AWS availability zone doesn't affect users.
Latency Reduction
The speed of light imposes a 67ms minimum round-trip between New York and London. By placing servers close to users worldwide, you reduce latency. A CDN serves content from the nearest edge node rather than a single origin server thousands of miles away.
Core Challenges
Every distributed system must confront five fundamental challenges. Understanding these deeply is the key to understanding why Kubernetes works the way it does.
Network Failures
Networks are unreliable. Packets get lost, delayed, duplicated, or delivered out of order. A switch fails, a cable gets cut, a router's buffer overflows. Unlike a function call that either returns or throws an exception, a network request has three possible outcomes:
- Success — the request arrived and the response came back
- Failure — something went wrong (but which side?)
- Unknown — the request may or may not have arrived; the response may be in transit
That third state — unknown — is what makes distributed systems fundamentally different from local programming. When you call a function locally, you always know whether it executed. Over a network, you sometimes cannot tell.
Partial Failures
In a single machine, failure is total — the machine either works or it doesn't. In a distributed system, some components fail while others continue operating. This creates scenarios far more complex than total failure:
- Node A thinks Node B is dead, but Node B is actually fine — just slow to respond
- A database write succeeds on 2 of 3 replicas — is that a success or failure?
- A service processes a request but crashes before sending the response — did it happen or not?
flowchart TD
A[Client sends request] --> B{Network delivers?}
B -->|Yes| C{Server processes?}
B -->|No - Lost| D[Client times out]
B -->|Delayed| E[Client times out, then arrives late]
C -->|Yes| F{Response delivered?}
C -->|No - Crash| G[Server fails mid-process]
F -->|Yes| H[Success ✓]
F -->|No - Lost| I[Client times out despite success]
D --> J[Client retries — but was it processed?]
G --> J
I --> J
Latency
Network latency is not just slow — it's variable. A request that takes 1ms usually might take 500ms during a garbage collection pause or network congestion. This variability is far more dangerous than consistently slow communication because it makes timeouts unreliable as failure detectors.
Consider: if you set a timeout of 100ms, requests taking 150ms look identical to failed requests. You might kill a healthy node that's just temporarily slow, causing more load on remaining nodes, triggering more timeouts — a cascading failure born from a conservative timeout.
Consistency
When data is replicated across multiple nodes, writes to different replicas may arrive in different orders. What does "the current value" mean when three replicas hold three different values?
# Timeline showing inconsistency:
# Time 0: All replicas have value = "A"
# Time 1: Client writes value = "B" to Replica 1
# Time 2: Client reads from Replica 2 → gets "A" (stale!)
# Time 3: Replica 1 propagates "B" to Replica 2
# Time 4: Client reads from Replica 2 → gets "B" (consistent)
# The window between Time 1 and Time 3 is the
# "inconsistency window" — reads may return stale data.
Different systems choose different consistency guarantees depending on their needs — a theme we'll explore deeply in Part 3 when we cover the CAP theorem.
Coordination
How do multiple machines agree on who's the leader? How do they decide the order of operations? How do they detect failures? Coordination is expensive — it requires communication, and communication is slow and unreliable. The art of distributed systems design is minimizing coordination while maintaining correctness.
Fundamental Principles
Nodes & Clusters
A node is any individual machine (physical or virtual) participating in a distributed system. A cluster is a group of nodes working together toward a common purpose. Clusters can be homogeneous (all nodes do the same work) or heterogeneous (different roles for different nodes).
In Kubernetes, this maps directly: worker nodes run workloads, control plane nodes manage the cluster. The cluster is the unit of administration — you deploy applications to a cluster, not to individual machines.
# A simple cluster topology:
#
# Control Plane (3 nodes for HA):
# - master-1: API Server, Scheduler, Controller Manager, etcd
# - master-2: API Server, Scheduler, Controller Manager, etcd
# - master-3: API Server, Scheduler, Controller Manager, etcd
#
# Worker Nodes (N nodes for workloads):
# - worker-1: kubelet, kube-proxy, container runtime
# - worker-2: kubelet, kube-proxy, container runtime
# - worker-N: kubelet, kube-proxy, container runtime
# View nodes in a Kubernetes cluster:
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# master-1 Ready control-plane 10d v1.30.0
# master-2 Ready control-plane 10d v1.30.0
# master-3 Ready control-plane 10d v1.30.0
# worker-1 Ready 10d v1.30.0
# worker-2 Ready 10d v1.30.0
Consensus
Consensus is how distributed nodes agree on a single value or decision. It's the hardest problem in distributed computing and the foundation of reliable systems. Without consensus, nodes can disagree on who's the leader, what data is authoritative, or whether a transaction committed.
We'll dedicate all of Part 2 to consensus algorithms (Raft, Paxos), but here's the key intuition: consensus requires a quorum — a majority of nodes must agree before a decision is final. With 3 nodes, you need 2 to agree. With 5 nodes, you need 3. This is why production Kubernetes runs odd numbers of etcd nodes.
Replication
Replication means maintaining copies of data or services across multiple nodes. It serves two purposes:
- Availability: If one copy is lost, others remain
- Performance: Reads can be served from any copy, distributing load
The challenge is keeping replicas consistent. There are two fundamental approaches:
| Approach | How It Works | Trade-off |
|---|---|---|
| Synchronous | Write to all replicas before acknowledging | Strong consistency, higher latency |
| Asynchronous | Acknowledge after writing to one, propagate later | Lower latency, eventual consistency |
In Kubernetes, replication appears everywhere: ReplicaSets maintain multiple pod copies, etcd replicates state across control plane nodes, and ConfigMaps are distributed to all nodes that need them.
Fault Tolerance
Fault tolerance is a system's ability to continue operating correctly despite component failures. It's not about preventing failures (impossible) — it's about designing systems that tolerate them gracefully.
Key patterns for fault tolerance:
- Redundancy: Run multiple instances so failures don't cause outages
- Detection: Quickly identify failed components (heartbeats, health checks)
- Recovery: Automatically replace failed components (self-healing)
- Isolation: Contain failures so they don't cascade (bulkheads)
The Evolution to Containers
Bare Metal to Cloud
Understanding the evolution helps explain why Kubernetes exists and why it's designed the way it is:
flowchart LR
A[Bare Metal] --> B[Virtual Machines]
B --> C[Containers]
C --> D[Container Orchestration]
D --> E[Platform Engineering]
Bare Metal Era (1990s–2000s): One application per physical server. Massive waste — average utilization was 10–15%. Provisioning took weeks. Failures meant hardware replacement.
Virtualization Era (2000s–2010s): Multiple VMs per server improved utilization to 40–60%. VMware, then AWS, made provisioning faster (minutes instead of weeks). But VMs are heavy — each runs a full OS kernel, consuming significant CPU and memory overhead.
Container Era (2013–present): Containers share the host kernel, starting in milliseconds with minimal overhead. Docker (2013) made containers accessible. Utilization jumped to 60–80%. But managing thousands of containers across hundreds of machines created new problems.
The Container Revolution
Containers solved the "works on my machine" problem through isolation without the overhead of full virtualization:
# A container is just a process with isolation:
# - Namespace isolation (PID, network, filesystem, users)
# - cgroup resource limits (CPU, memory, I/O)
# - Layered filesystem (overlay2)
# Compare startup times:
# VM: 30-60 seconds (boot full OS)
# Container: 50-500 milliseconds (start process)
# Compare overhead:
# VM: 1-2 GB RAM for OS alone
# Container: ~10 MB overhead
# This is why containers enable microservices:
# You can run 100 containers where you'd run 5 VMs
The Need for Orchestration
Containers made deploying individual services easy. But running a distributed system of hundreds of containers introduced new questions:
- Which machine should this container run on? (Scheduling)
- What if the container crashes? (Self-healing)
- How do containers find each other? (Service discovery)
- How do I update without downtime? (Rolling updates)
- How do I scale based on load? (Autoscaling)
- How do I manage secrets and configuration? (Config management)
These are all distributed systems problems repackaged for the container era. Kubernetes was Google's answer, open-sourced in 2014, based on 15 years of internal experience running containers at scale with Borg and Omega.
Real-World Examples
Google Search: Distributed Systems at Scale
A single Google search query touches over 1,000 machines in under 0.5 seconds. The web index is sharded across thousands of servers. Multiple replicas serve each shard for availability. A failed server is instantly replaced by its replica. This is distributed systems in action: sharding for scale, replication for availability, and consensus for coordination.
Netflix: Resilience Through Chaos
Netflix runs thousands of microservices across multiple AWS regions. They pioneered "Chaos Engineering" with Chaos Monkey — randomly killing production instances to ensure the system survives failures. Their architecture embodies every principle we've discussed: redundancy, isolation, graceful degradation, and self-healing. If one microservice fails, fallbacks ensure users still see content (even if recommendations are slightly stale).
Amazon DynamoDB: Choosing Availability
Amazon's shopping cart was built on Dynamo (precursor to DynamoDB), which chose availability over consistency. During the 2004 holiday season, a brief outage cost millions. Amazon decided that showing a slightly outdated cart was better than showing no cart at all. This trade-off — eventual consistency for high availability — is a core distributed systems design decision we'll explore in the CAP theorem (Part 3).
Exercises
tc or sleep in the server). Observe how timeout-based failure detection produces false positives. What timeout value balances fast detection against false alarms?
Conclusion
Distributed systems are not optional complexity — they're the only way to build systems that scale beyond single machines and survive hardware failures. Every challenge we've discussed — network unreliability, partial failures, consistency, coordination — will reappear throughout this series as we see how Kubernetes addresses each one.
The mental model to carry forward: