Why This Matters
You write Python. You deploy Docker containers. You query databases and call APIs. But here's a question worth sitting with: when your Python script calls open("data.csv") and reads a line — what actually happens? How does text stored magnetically on a spinning disk (or in floating-gate transistors on an SSD) become a Python string object in your variable? How many layers of software and hardware are involved? How does the CPU even know when the disk is done?
Most working engineers can't answer this — not because they're not talented, but because modern tooling is designed to hide it. You don't need to know to build most things. Until you do. Until a production system starts behaving strangely, until you need to debug a latency spike or a mysterious OOM kill, until you're designing something at a scale where these layers start to matter.
Why Foundations Beat Frameworks
Frameworks change. The JVM, Node.js, Python's CPython, Go's runtime — these are all high-level environments that abstract over the same underlying machine. A developer who deeply understands that underlying machine can work productively in any environment. A developer who only knows the framework is limited by it and is lost when it misbehaves.
The goal of this series is not to make you a kernel developer (though Part 21 will take you deep into container internals). The goal is to give you a mental model accurate enough that you can reason about system behaviour from first principles. When you understand what a system call actually is, you read strace output differently. When you understand how the CPU cache works, you write hot loops differently. When you understand how TCP actually establishes a connection, you debug network problems differently.
The Layers of a Computer System
The most powerful idea in computer science is abstraction through layering. Each layer hides the complexity of the layer below it and provides a clean interface to the layer above. This is how a billion transistors become a Python dictionary.
flowchart TD
A["User / Applications
Python, browsers, CLI tools"]
B["Runtime / Standard Library
CPython, glibc, JVM, .NET CLR"]
C["System Calls (syscall interface)
open(), read(), write(), fork(), mmap()"]
D["Operating System Kernel
Process scheduler, memory manager, VFS, networking"]
E["Device Drivers
Disk drivers, NIC drivers, GPU drivers"]
F["Hardware Abstraction Layer
BIOS/UEFI, firmware, microcode"]
G["Physical Hardware
CPU, RAM, NVMe, NIC, GPU"]
A --> B --> C --> D --> E --> F --> G
G -.->|Interrupts| D
D -.->|Results| C
C -.->|Return values| B
B -.->|Objects/data| A
style A fill:#3B9797,color:#fff,stroke:#3B9797
style B fill:#16476A,color:#fff,stroke:#16476A
style C fill:#132440,color:#fff,stroke:#132440
style D fill:#BF092F,color:#fff,stroke:#BF092F
style E fill:#16476A,color:#fff,stroke:#16476A
style F fill:#132440,color:#fff,stroke:#132440
style G fill:#3B9797,color:#fff,stroke:#3B9797
Each arrow going down represents a call or request. Each dotted arrow going up represents a return or event. The entire lifetime of a program is a dance between these layers — your code asking the layer below to do things, and results flowing back up.
Abstraction as Engineering
Consider writing a file. From a Python developer's perspective, it looks like this:
with open("/tmp/output.txt", "w") as f:
f.write("hello world\n")
But the actual sequence of events is far richer:
- Python's open() runs CPython's io module, which calls the kernel's openat() system call through the C library's thin wrapper (CPython 3 does not go through fopen())
- The kernel's Virtual File System (VFS) resolves the path /tmp/output.txt through the directory tree
- The VFS dispatches to the file system driver (ext4, for example), which manages inodes and blocks
- The kernel returns a file descriptor, which Python wraps in a file object
- f.write() fills a user-space buffer; when the file is flushed or closed, a write() system call copies the bytes into the kernel's page cache and returns
- Later, during writeback (or immediately on fsync()), the block layer sends a write request to the NVMe or SATA driver
- The driver programs the disk controller via memory-mapped I/O registers
- The disk (or SSD) performs the write and signals completion via an interrupt
- The interrupt handler in the kernel marks the operation complete
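The descriptor-level part of this sequence can be exercised directly from Python. The sketch below bypasses the buffered file object and uses the os module's thin wrappers over the corresponding system calls. The helper name write_via_syscall_wrappers and the temporary path are hypothetical, and os.fsync() is included only to show how a program asks the kernel to push data to the device before continuing.

```python
import os
import tempfile

def write_via_syscall_wrappers(path: str, payload: bytes) -> int:
    """Write payload with os.open/os.write/os.fsync/os.close; return bytes written."""
    # os.open returns a raw integer file descriptor, not a file object.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, payload)   # one write() system call
        os.fsync(fd)                      # ask the kernel to flush to the device
    finally:
        os.close(fd)
    return written

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output.txt")
    n_written = write_via_syscall_wrappers(demo_path, b"hello world\n")
    with open(demo_path, "rb") as f:
        demo_contents = f.read()

print(n_written, demo_contents)
```

Running this under strace should show one openat, one write, one fsync, and one close for the helper, which makes the mapping from Python call to system call easy to follow.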
The CPU — Brain of the Machine
The Central Processing Unit is the component that actually executes instructions. Everything else in the system exists to either give the CPU data to work on or to act on the CPU's outputs.
A modern CPU core contains:
- Registers — tiny, ultra-fast storage locations directly inside the CPU die. A 64-bit CPU has general-purpose registers (rax, rbx, rcx... on x86-64), special-purpose registers (rip for the instruction pointer, rsp for the stack pointer, rflags for condition codes), and floating-point/SIMD registers. Accessing a register takes under 1 nanosecond.
- ALU (Arithmetic Logic Unit) — performs integer arithmetic, bitwise operations, and comparisons.
- FPU (Floating-Point Unit) — handles IEEE 754 floating-point arithmetic.
- Control Unit — decodes instructions and orchestrates the execution units.
- Cache hierarchy — L1 (per-core, ~32 KB, ~1 ns), L2 (per-core, ~256 KB, ~4 ns), L3 (shared across cores, ~8-64 MB, ~20-40 ns).
The fundamental operation of a CPU is the Fetch-Decode-Execute cycle:
flowchart LR
F["FETCH
Load instruction
at address in RIP
from memory/cache"]
D["DECODE
Identify opcode
and operands
(which registers, addressing mode)"]
E["EXECUTE
ALU/FPU performs
the operation
Result written to register/memory"]
U["UPDATE RIP
RIP += instruction length
(or = jump target
for branches)"]
F --> D --> E --> U --> F
style F fill:#3B9797,color:#fff,stroke:#3B9797
style D fill:#16476A,color:#fff,stroke:#16476A
style E fill:#BF092F,color:#fff,stroke:#BF092F
style U fill:#132440,color:#fff,stroke:#132440
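The cycle in the diagram can be mimicked with a toy interpreter. This is a sketch of the idea only: the four-opcode instruction set (LOAD, ADD, DEC, JNZ, HALT) and the register names r0 and r1 are invented for illustration and do not correspond to x86-64.

```python
def run(program, max_steps=1000):
    """Execute a toy program; return the register file when HALT is reached."""
    regs = {"r0": 0, "r1": 0}
    ip = 0  # instruction pointer, playing the role of RIP
    for _ in range(max_steps):
        instr = program[ip]              # FETCH the instruction at ip
        op, *args = instr                # DECODE into opcode and operands
        next_ip = ip + 1                 # default: fall through to the next one
        if op == "LOAD":                 # EXECUTE
            regs[args[0]] = args[1]
        elif op == "ADD":
            regs[args[0]] += regs[args[1]]
        elif op == "DEC":
            regs[args[0]] -= 1
        elif op == "JNZ":                # branch: overwrite the next ip
            if regs[args[0]] != 0:
                next_ip = args[1]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown opcode {op!r}")
        ip = next_ip                     # UPDATE the instruction pointer
    raise RuntimeError("step limit exceeded")

# Sum 3 + 2 + 1 into r0 using a countdown loop.
program = [
    ("LOAD", "r0", 0),
    ("LOAD", "r1", 3),
    ("ADD", "r0", "r1"),   # index 2: loop body starts here
    ("DEC", "r1"),
    ("JNZ", "r1", 2),      # jump back to index 2 while r1 != 0
    ("HALT",),
]
result = run(program)
print(result)
```

The loop body is the whole story: every iteration fetches, decodes, executes, and then either advances the pointer or replaces it with a jump target.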
Pipelining and Modern CPU Tricks
A naive CPU would complete one instruction before starting the next. Modern CPUs use instruction pipelining: while one instruction is in the Execute stage, the next one is already being decoded, and the one after that is already being fetched. In the ideal case this lets the core finish one instruction every clock cycle, even though each individual instruction still takes several cycles to pass through all the stages.
Beyond pipelining, modern CPUs employ several performance optimisations that are critical to understand when reasoning about program performance:
Key CPU Optimisations and Their Implications
Out-of-order execution: The CPU reorders instructions dynamically to avoid stalls — for example, if instruction 3 depends on the result of instruction 2 (which is waiting on a cache miss), the CPU might execute instructions 4 and 5 first. This means the programmer's view of instruction order and the actual hardware execution order can differ.
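Branch prediction: When the CPU encounters a conditional branch (the machine-code form of an if statement), it guesses which way the branch will go and speculatively executes that path before the condition is resolved. If the guess is wrong, the speculative work is discarded and the pipeline restarts on the correct path, which typically costs a dozen or more cycles. Speculation is also the root of Spectre-class vulnerabilities: speculatively executed code can touch data it should not have access to, and although the results are discarded, side effects such as cache state remain measurable.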
Superscalar execution: A single core can execute multiple independent instructions simultaneously using multiple execution units. A modern Intel/AMD core can retire 3-6 instructions per clock cycle in ideal conditions.
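The throughput gain from pipelining can be checked with simple arithmetic. The sketch below assumes an idealised pipeline with no stalls and one instruction entering per cycle; the stage count of 4 simply mirrors the four boxes in the diagram above and is not a claim about real hardware, which uses many more stages.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction fills the pipeline; each later one finishes a cycle apart."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1

slow = unpipelined_cycles(1000, 4)
fast = pipelined_cycles(1000, 4)
print(slow, fast, round(slow / fast, 2))
```

For long instruction streams the speed-up approaches the number of stages, which is why stalls caused by cache misses and mispredicted branches are so costly: they drain the pipeline and forfeit that gain.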
Memory — The Speed Hierarchy
Memory in a computer is not a single uniform thing — it's a hierarchy of storage technologies, each offering a trade-off between speed, capacity, and cost. This hierarchy is one of the most important concepts for performance engineering.
| Level | Size (typical) | Latency | Bandwidth | Volatile? |
|---|---|---|---|---|
| CPU Registers | ~1 KB (16-32 registers) | <1 ns | N/A (operands are read directly) | Yes |
| L1 Cache | 32–64 KB per core | ~1–2 ns (3-5 cycles) | ~1 TB/s | Yes |
| L2 Cache | 256 KB – 1 MB per core | ~3–7 ns (10-20 cycles) | ~400 GB/s | Yes |
| L3 Cache | 8–64 MB shared | ~20–40 ns (40-80 cycles) | ~200 GB/s | Yes |
| RAM (DRAM) | 8–512 GB | ~60–100 ns | ~50 GB/s | Yes |
| NVMe SSD | 256 GB – 8 TB | ~100–200 µs | ~7 GB/s | No |
| HDD | 1 TB – 20 TB | ~5–10 ms | ~200 MB/s | No |
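The gaps between these levels are easier to feel when rescaled to human time. The sketch below stretches time so that 1 ns becomes 1 second. The representative latencies are single values picked from within the table's ranges for illustration, not measurements.

```python
# Representative latencies in nanoseconds, chosen from within the table's ranges.
LATENCY_NS = {
    "L1 cache": 1,
    "RAM": 100,
    "NVMe SSD": 100_000,      # 100 microseconds
    "HDD": 10_000_000,        # 10 milliseconds
}

def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    for unit, size in (("d", 86_400), ("h", 3_600), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} s"

# With 1 ns scaled to 1 s, a latency of N ns reads as N seconds.
scaled = {name: humanize(ns) for name, ns in LATENCY_NS.items()}
for name, text in scaled.items():
    print(f"{name:10s} {text}")
```

On this scale an L1 hit is a one-second glance, a trip to RAM is a short coffee break, and a single HDD seek takes months.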
Cache and Locality of Reference
The CPU cache works by exploiting two principles:
- Temporal locality: If you accessed a memory location recently, you will likely access it again soon. The cache keeps recently used data.
- Spatial locality: If you accessed a memory location, you will likely access nearby locations soon. The cache loads entire cache lines (typically 64 bytes) at once — so accessing element 0 of an array also pre-fetches elements 1 through 7 (at 8 bytes each) into the cache.
This is why iterating through an array row-by-row is dramatically faster than column-by-column in a 2D array stored in row-major order. The former exhibits spatial locality; the latter does not.
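The row-versus-column claim can be illustrated without timing anything, by counting how often consecutive accesses land on a different cache line. This is a deliberately crude model: it assumes 64-byte lines and 8-byte elements, as above, and it ignores the fact that a real cache holds many lines at once, so it shows the access pattern rather than predicting real miss counts. The function name line_changes is hypothetical.

```python
CACHE_LINE = 64   # bytes per cache line (typical value)
ELEM = 8          # bytes per array element

def line_changes(rows: int, cols: int, by_rows: bool) -> int:
    """Count accesses that touch a different cache line than the previous access."""
    changes = 0
    last_line = None
    outer, inner = (rows, cols) if by_rows else (cols, rows)
    for i in range(outer):
        for j in range(inner):
            r, c = (i, j) if by_rows else (j, i)
            addr = (r * cols + c) * ELEM      # row-major address of element (r, c)
            line = addr // CACHE_LINE
            if line != last_line:
                changes += 1
                last_line = line
    return changes

row_wise = line_changes(64, 64, by_rows=True)
col_wise = line_changes(64, 64, by_rows=False)
print(row_wise, col_wise)
```

Row-wise traversal moves to a new line once every eight elements, while column-wise traversal jumps a full row (512 bytes here) on every access and so never reuses the line it just loaded.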
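# Demonstrate cache effects: time a sequential vs random access pattern
# Sequential access is cache-friendly. In CPython the gap is modest because interpreter
# overhead dominates; in C or NumPy it can reach an order of magnitude for large arrays.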
python3 -c "
import time
n = 10_000_000
a = list(range(n))
# Sequential access — spatial locality
t0 = time.perf_counter()
s = sum(a[i] for i in range(n))
t1 = time.perf_counter()
print(f'Sequential: {t1-t0:.3f}s, sum={s}')
# Random access — poor locality
import random
idx = list(range(n))
random.shuffle(idx)
t0 = time.perf_counter()
s = sum(a[idx[i]] for i in range(n))
t1 = time.perf_counter()
print(f'Random: {t1-t0:.3f}s, sum={s}')
"
Storage — Persistence Beyond Power
RAM is fast but volatile — every byte disappears when power is cut. Storage (NVMe SSDs, SATA SSDs, HDDs) is persistent — data survives power cycles. This is the fundamental distinction between working memory and persistent memory.
From the OS's perspective, storage devices are block devices — they accept read/write requests in fixed-size blocks (typically 512 bytes or 4 KB). The file system layer (ext4, NTFS, APFS) builds the familiar file and directory abstraction on top of this raw block interface. We'll explore file systems deeply in Part 7.
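Because the device only understands whole blocks, the kernel has to translate every byte range into block numbers. A minimal sketch of that arithmetic follows, assuming a 4 KB block size by default; the function name blocks_touched is hypothetical.

```python
def blocks_touched(offset: int, length: int, block_size: int = 4096) -> range:
    """Return the block indices covered by `length` bytes starting at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)

print(list(blocks_touched(0, 4096)))       # exactly one full block
print(list(blocks_touched(4090, 10)))      # a small write that straddles two blocks
print(list(blocks_touched(100, 10, 512)))  # same idea with 512-byte blocks
```

The second call shows why small unaligned writes are expensive: ten bytes force the system to deal with two complete blocks.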
I/O — The World Outside the CPU
A computer system is not just a CPU and memory in isolation. It interacts with the external world through I/O — keyboards, displays, network cards, USB devices, and more. These interactions are handled through a combination of buses, device controllers, and device drivers.
The key I/O mechanisms are:
- Memory-Mapped I/O (MMIO): Device registers are mapped into the CPU's address space. The CPU writes to specific memory addresses that are actually device registers — this is how the kernel talks to hardware controllers.
- DMA (Direct Memory Access): For bulk data transfers (e.g., reading a disk block, receiving a network packet), the device controller writes data directly into RAM without involving the CPU for each byte. The CPU is only involved to initiate the transfer and to process the completion interrupt.
- Port I/O: On x86 systems, special IN/OUT CPU instructions directly address device ports. Less common in modern systems but still used for legacy hardware.
Interrupts — Hardware Talks to Software
How does the CPU know when a key is pressed? Or when a network packet has arrived? Or when a disk read has completed? It doesn't continuously check — that would waste enormous CPU time. Instead, hardware uses interrupts.
An interrupt is an asynchronous signal from a hardware device to the CPU saying "something needs your attention." When an interrupt fires:
sequenceDiagram
participant HW as Hardware Device
participant CPU as CPU
participant IDT as Interrupt Descriptor Table
participant ISR as Interrupt Service Routine (Kernel)
participant Proc as Interrupted Process
Proc->>CPU: Executing user code (e.g., calculating something)
HW->>CPU: Assert interrupt line (IRQ N)
CPU->>CPU: Finish current instruction
CPU->>CPU: Save processor state (registers, RIP) to kernel stack
CPU->>IDT: Look up handler for IRQ N
IDT->>ISR: Dispatch to Interrupt Service Routine
ISR->>ISR: Handle event (read NIC data, mark I/O complete)
ISR->>CPU: IRET instruction — restore saved state
CPU->>Proc: Resume executing user code
There are two types of interrupts:
- Hardware interrupts (IRQs): Fired by hardware devices — keyboard press, NIC packet received, timer tick, disk I/O complete. The Linux kernel's interrupt handler is registered in the IDT (Interrupt Descriptor Table).
- Software interrupts / exceptions: Fired by the CPU itself in response to special conditions — divide-by-zero, page fault (accessing memory not yet mapped), illegal instruction. These trigger the kernel to take action.
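The timer interrupt deserves special mention: on Linux, a periodic timer interrupt fires at a rate chosen when the kernel is built (CONFIG_HZ). Each timer interrupt is an opportunity for the kernel's scheduler to decide whether to continue running the current process or switch to a different one. Without interrupts, one process could monopolise the CPU forever.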
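The save, look up, dispatch, restore sequence can be modelled in a few lines. The sketch below is a toy simulation, not how a kernel is written: the ToyCPU class, its handlers dictionary (standing in for the IDT), and the pending map of "which IRQ arrives during which instruction" are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCPU:
    ip: int = 0
    log: list = field(default_factory=list)
    handlers: dict = field(default_factory=dict)   # stands in for the IDT

    def register(self, irq: int, handler) -> None:
        self.handlers[irq] = handler

    def run(self, n_instructions: int, pending: dict) -> list:
        """pending maps an instruction index to the IRQ raised while it executes."""
        while self.ip < n_instructions:
            current = self.ip
            self.log.append(f"user instr {current}")   # finish the current instruction
            self.ip += 1
            irq = pending.get(current)
            if irq is not None:
                saved_ip = self.ip          # save processor state
                self.handlers[irq](self)    # look up and dispatch the handler
                self.ip = saved_ip          # restore state (the IRET step)
        return self.log

cpu = ToyCPU()
cpu.register(1, lambda c: c.log.append("ISR: key press"))
trace = cpu.run(3, pending={1: 1})
print(trace)
```

The interrupted "program" never notices the detour: it resumes at exactly the instruction it would have executed next.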
System Calls — Software Talks to the Kernel
We've seen how hardware talks to the kernel via interrupts. Now let's look at how user-space programs talk to the kernel: through system calls.
The CPU operates in different privilege levels (called "rings" in x86 architecture). Ring 0 is kernel mode — code running here has unrestricted access to hardware, all memory, and all instructions. Ring 3 is user mode — code running here can only access its own memory and cannot execute privileged instructions. Your application runs in Ring 3. The kernel runs in Ring 0.
flowchart LR
subgraph US["User Space (Ring 3)"]
A["Application Code
(Python, Go, Java, C)"]
B["C Standard Library
(glibc / musl)"]
end
subgraph KS["Kernel Space (Ring 0)"]
C["Syscall Handler"]
D["VFS / Networking / Memory Manager"]
E["Device Drivers"]
end
subgraph HW["Hardware"]
F["CPU / RAM / NIC / Disk"]
end
A -->|"open(), read(), write()
(via wrapper in glibc)"| B
B -->|"syscall instruction
+ syscall number"| C
C --> D --> E --> F
style US fill:#f8f9fa,stroke:#3B9797,color:#132440
style KS fill:#132440,stroke:#3B9797,color:#fff
style HW fill:#16476A,stroke:#3B9797,color:#fff
When your code calls open(), what actually happens is:
- The glibc wrapper puts the syscall number for openat (257 on x86-64 Linux) into register rax
- Arguments go into rdi, rsi, rdx, etc.
- The syscall instruction is executed, which triggers a transition from Ring 3 to Ring 0
- The kernel's syscall handler dispatches to the correct kernel function
- The kernel performs the requested operation and puts the return value in rax
- The sysret instruction switches back to Ring 3 and the user-space code continues
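The steps above can be tried from Python by asking the C library to issue a system call by number instead of by name. This is a sketch under stated assumptions: it uses getpid rather than openat because it takes no arguments, the numbers 39 (x86-64) and 172 (aarch64) come from the Linux syscall tables, and the helper raw_getpid is hypothetical. On any other platform it returns None rather than guessing.

```python
import ctypes
import os
import platform

# getpid's syscall number per architecture, taken from the Linux syscall tables.
SYS_GETPID = {"x86_64": 39, "aarch64": 172}

def raw_getpid():
    """Invoke getpid by syscall number; return None where this sketch does not apply."""
    number = SYS_GETPID.get(platform.machine())
    if platform.system() != "Linux" or number is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)    # handle to the already-loaded C library
    libc.syscall.restype = ctypes.c_long
    # libc's syscall() loads the number and arguments into registers,
    # then executes the architecture's syscall instruction.
    return libc.syscall(ctypes.c_long(number))

pid_via_syscall = raw_getpid()
print(pid_via_syscall, os.getpid())
```

On a supported Linux machine both values match, which confirms that os.getpid() and the raw number reach the same kernel function.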
Tracing System Calls with strace
One of the most powerful debugging tools available on Linux is strace, which intercepts and logs every system call a process makes. This makes the normally hidden communication between user space (Ring 3) and the kernel visible.
# Trace all system calls made by the 'ls' command
strace ls /tmp
# Show only file-related system calls (filter by category)
strace -e trace=file ls /tmp
# Show only network-related system calls
strace -e trace=network curl -s https://example.com -o /dev/null
# Trace a running process by PID (pgrep -n picks the newest match, so strace gets exactly one PID)
strace -p "$(pgrep -n python3)"
# Show timing for each syscall (-T) and wall-clock timestamps (-t)
strace -T -t ls /tmp
# Summarise syscall counts and time (useful for profiling)
strace -c ls /tmp
What strace Reveals
When you run strace python3 -c "open('/tmp/x.txt', 'w')", you'll see something like this in the output (among many other calls):
openat(AT_FDCWD, "/tmp/x.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7ffd...) = -1 ENOTTY
fstat(3, {...}) = 0
write(3, "", 0) = 0
close(3) = 0
Notice: Python's open() uses the kernel's openat() syscall (not plain open) and sets O_CLOEXEC, so the FD is closed automatically if the process later calls exec() and does not leak into programs it launches. The return value = 3 is the file descriptor number. The ioctl(TCGETS) call is Python checking whether the FD is a terminal (it's not, so the call fails with ENOTTY, "not a typewriter").
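Both observations can be confirmed from inside Python without strace. The sketch below opens a file in a temporary directory (a hypothetical path) and asks the os module whether the descriptor is inheritable and whether it refers to a terminal.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.txt")
    with open(path, "w") as f:
        fd = f.fileno()
        # False means the close-on-exec flag is set on this descriptor.
        inheritable = os.get_inheritable(fd)
        # The terminal check that shows up as ioctl(TCGETS) in the trace.
        is_terminal = f.isatty()

print(fd, inheritable, is_terminal)
```

Descriptors created by Python are non-inheritable by default, which is the language-level counterpart of the O_CLOEXEC flag seen in the trace.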
Putting It All Together
Let's trace exactly what happens when a Python web server handles a single HTTP request — a journey that touches every layer of the system:
flowchart TD
A["NIC receives Ethernet frame
→ DMA writes packet to RAM
→ NIC fires interrupt"]
B["Kernel interrupt handler runs
→ TCP/IP stack processes segment
→ Data placed in socket receive buffer"]
C["Application's accept() / recv() syscall
→ Blocked process is woken up
→ Data copied from kernel buffer to user buffer"]
D["Python HTTP server parses request
→ Calls route handler
→ Handler calls open() to read template file"]
E["open() → openat() syscall
→ VFS resolves path → ext4 driver
→ Block layer → NVMe driver → SSD read"]
F["SSD writes data into the page cache via DMA
→ SSD fires interrupt on completion
→ read() syscall returns data to Python"]
G["Python renders response
→ send() syscall writes response
→ TCP/IP stack queues segment
→ NIC sends Ethernet frame"]
A --> B --> C --> D --> E --> F --> G
style A fill:#3B9797,color:#fff
style B fill:#16476A,color:#fff
style C fill:#BF092F,color:#fff
style D fill:#3B9797,color:#fff
style E fill:#16476A,color:#fff
style F fill:#BF092F,color:#fff
style G fill:#3B9797,color:#fff
This entire sequence — from NIC interrupt to response sent — typically completes in under 1 millisecond if all data is in RAM. Each step involves a layer transition: hardware interrupts triggering kernel code, kernel code transitioning to user space, user space making system calls back into the kernel. The CPU switches between user mode and kernel mode dozens of times per request.
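The user-space half of this journey fits in a short sketch. To stay self-contained it swaps the NIC and TCP stack for socket.socketpair(), a connected pair of local sockets, so only the recv, open, read, and send steps are real; the handler name, the template file, and the request text are hypothetical.

```python
import os
import socket
import tempfile

def handle_one_request(conn: socket.socket, template_path: str) -> str:
    """Read one request, serve the template file, and return the request line."""
    request = conn.recv(4096)                 # recv(): kernel buffer to user buffer
    request_line = request.split(b"\r\n", 1)[0].decode("ascii")
    with open(template_path, "rb") as f:      # openat() followed by read()
        body = f.read()
    header = b"HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body)
    conn.sendall(header + body)               # send(): user buffer to kernel buffer
    return request_line

with tempfile.TemporaryDirectory() as tmp:
    template = os.path.join(tmp, "index.html")
    with open(template, "wb") as f:
        f.write(b"<h1>hello</h1>")

    client, server = socket.socketpair()
    try:
        client.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
        request_line = handle_one_request(server, template)
        reply = client.recv(4096)
    finally:
        client.close()
        server.close()

print(request_line)
print(reply)
```

Running the script under strace -e trace=network,file shows each commented step as a separate crossing between user mode and kernel mode.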
Exercises
These exercises build the intuition that makes the theory stick. Most of them rely on Linux-only tools (strace, /proc, /sys); a macOS alternative is noted where one exists.
Exercise 1 — See Your System Calls
# On Linux: install strace if needed
# sudo apt install strace (Ubuntu/Debian)
# sudo dnf install strace (Fedora/RHEL)
# 1. Trace a simple Python program and count syscalls
strace -c python3 -c "print('hello')" 2>&1 | tail -20
# 2. How many openat() calls does Python make just to print hello?
# (Hint: it loads many .pyc files and shared libraries)
strace -e trace=openat python3 -c "print('hello')" 2>&1 | grep -c openat
Exercise 2 — Inspect the Memory Hierarchy
# View your CPU's cache sizes on Linux
getconf LEVEL1_DCACHE_SIZE # L1 data cache in bytes
getconf LEVEL2_CACHE_SIZE # L2 cache
getconf LEVEL3_CACHE_SIZE # L3 cache
# Alternative: detailed cache info from the kernel (level, type, and size of each cache)
for d in /sys/devices/system/cpu/cpu0/cache/index*; do
  echo "L$(cat "$d/level") $(cat "$d/type"): $(cat "$d/size")"
done
# On macOS
sysctl hw.l1icachesize hw.l1dcachesize hw.l2cachesize hw.l3cachesize
Exercise 3 — See Interrupts in Action
# Watch interrupt counts change in real-time (Linux)
# Column 1 is the IRQ number, columns 2+ are per-CPU counts
watch -n 0.5 cat /proc/interrupts
# While watching, move your mouse or type in another terminal
# Notice which IRQ line increments with keyboard/mouse input
# See the timer interrupts specifically (IRQ 0 and/or LOC, the per-CPU local timer)
# Lines in /proc/interrupts are right-aligned, so allow leading spaces
grep -E '^ *(0|LOC):' /proc/interrupts
Exercise 4 — Observe Context Switches
# See how often the OS switches between processes
# cs = context switches per second, in = interrupts per second
vmstat 1 5
# More detail: voluntary vs involuntary context switches for a process
# Run a process and inspect its context switches
sleep 10 &
PID=$!
grep ctxt /proc/$PID/status
wait
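The same counters are available from inside a Python process through the standard resource module (Unix only). The sketch below samples them before and after a few short sleeps; the helper name ctx_switches is hypothetical, and the exact numbers depend on the machine and its load.

```python
import resource
import time

def ctx_switches():
    """Return (voluntary, involuntary) context switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw

before = ctx_switches()
for _ in range(5):
    time.sleep(0.01)      # each sleep gives up the CPU voluntarily
after = ctx_switches()

voluntary_delta = after[0] - before[0]
involuntary_delta = after[1] - before[1]
print(f"voluntary: +{voluntary_delta}, involuntary: +{involuntary_delta}")
```

Sleeping or blocking on I/O shows up as voluntary switches, while being preempted by the scheduler when a time slice runs out shows up as involuntary ones.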
Conclusion & Next Steps
You've just built a foundational mental model of a computer system — one that will serve you across every specialisation in this series and beyond. The key ideas to carry forward:
- Layered abstraction is the core engineering principle — each layer hides complexity and provides a clean interface. Violations of layer boundaries are the source of most interesting bugs.
- The CPU is fast; memory is slow; disk is geological. Every performance-sensitive design decision flows from the memory hierarchy.
- Interrupts are asynchronous — hardware tells software "I'm done" via interrupts rather than the CPU polling. This is what makes multitasking and efficient I/O possible.
- System calls are the boundary between your code and the OS. They involve a privilege-level transition (Ring 3 to Ring 0) and are the mechanism through which all I/O, networking, memory allocation, and process management happen.