Part 1: What Is a Computer System?

Why This Matters

You write Python. You deploy Docker containers. You query databases and call APIs. But here's a question worth sitting with: when your Python script calls open("data.csv") and reads a line — what actually happens? How does text stored magnetically on a spinning disk (or in floating-gate transistors on an SSD) become a Python string object in your variable? How many layers of software and hardware are involved? How does the CPU even know when the disk is done?

Most working engineers can't answer this, not for lack of talent but because modern tooling is designed to hide it. You don't need to know to build most things. Until you do. Until a production system starts behaving strangely, until you need to debug a latency spike or a mysterious OOM kill, until you're designing something at a scale where these layers start to matter.

                            
Core Insight: Performance problems, security vulnerabilities, and mysterious bugs that "shouldn't be possible" almost always trace back to a gap between what a developer assumed and what the system actually does. This series closes that gap.
                        

Why Foundations Beat Frameworks

Frameworks change. The JVM, Node.js, Python's CPython, Go's runtime — these are all high-level environments that abstract over the same underlying machine. A developer who deeply understands that underlying machine can work productively in any environment. A developer who only knows the framework is limited by it and is lost when it misbehaves.

The goal of this series is not to make you a kernel developer (though Part 21 will take you deep into container internals). The goal is to give you a mental model accurate enough that you can reason about system behaviour from first principles. When you understand what a system call actually is, you read strace output differently. When you understand how the CPU cache works, you write hot loops differently. When you understand how TCP actually establishes a connection, you debug network problems differently.

The Layers of a Computer System

The most powerful idea in computer science is abstraction through layering. Each layer hides the complexity of the layer below it and provides a clean interface to the layer above. This is how a billion transistors become a Python dictionary.

The Seven Layers of a Computer System

flowchart TD
    A["User / Applications
Python, browsers, CLI tools"]
    B["Runtime / Standard Library
CPython, glibc, JVM, .NET CLR"]
    C["System Calls (syscall interface)
open(), read(), write(), fork(), mmap()"]
    D["Operating System Kernel
Process scheduler, memory manager, VFS, networking"]
    E["Device Drivers
Disk drivers, NIC drivers, GPU drivers"]
    F["Hardware Abstraction Layer
BIOS/UEFI, firmware, microcode"]
    G["Physical Hardware
CPU, RAM, NVMe, NIC, GPU"]

    A --> B --> C --> D --> E --> F --> G
    G -.->|Interrupts| D
    D -.->|Results| C
    C -.->|Return values| B
    B -.->|Objects/data| A

    style A fill:#3B9797,color:#fff,stroke:#3B9797
    style B fill:#16476A,color:#fff,stroke:#16476A
    style C fill:#132440,color:#fff,stroke:#132440
    style D fill:#BF092F,color:#fff,stroke:#BF092F
    style E fill:#16476A,color:#fff,stroke:#16476A
    style F fill:#132440,color:#fff,stroke:#132440
    style G fill:#3B9797,color:#fff,stroke:#3B9797

Each arrow going down represents a call or request. Each dotted arrow going up represents a return or event. The entire lifetime of a program is a dance between these layers — your code asking the layer below to do things, and results flowing back up.

Abstraction as Engineering

Consider writing a file. From a Python developer's perspective, it looks like this:

with open("/tmp/output.txt", "w") as f:
    f.write("hello world\n")

But the actual sequence of events is far richer:

Python's open() goes through CPython's io module, which calls the C library's open() wrapper directly (not the buffered fopen())
The wrapper issues the kernel's openat() system call
The kernel's Virtual File System (VFS) resolves the path /tmp/output.txt through the directory tree
The VFS dispatches to the concrete file system driver (ext4, for example), which manages inodes and blocks
The new file descriptor is returned to Python and wrapped in a file object
f.write() buffers the text in user space; when the buffer is flushed (here, on close), a write() system call copies it into the kernel's page cache and returns
Later, or immediately if the program calls fsync(), the kernel's writeback path hands the dirty pages to the block layer, which sends a write request to the NVMe or SATA driver
The driver programs the disk controller via memory-mapped I/O registers
The disk (or SSD) performs the write and signals completion via an interrupt, and the interrupt handler in the kernel marks the operation complete
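The upper half of that sequence can be walked by hand with Python's os module, which exposes thin wrappers over the underlying system calls. This is a minimal sketch rather than what open() does internally line for line; the scratch path is a hypothetical stand-in for /tmp/output.txt.

```python
import os
import tempfile

# Hypothetical scratch path for this sketch (stands in for /tmp/output.txt).
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# os.open() wraps the open/openat system call and returns a raw integer
# file descriptor rather than a Python file object.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    # os.write() issues write(2): the bytes are copied into the kernel's page cache.
    written = os.write(fd, b"hello world\n")
    # os.fsync() asks the kernel to push the cached data down to the storage device.
    os.fsync(fd)
finally:
    os.close(fd)

print(f"fd={fd}, bytes written={written}")
```

Running the script under strace should show the corresponding openat, write, fsync, and close calls one after another.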

                            
                            Analogy: Each layer is like a department in a large company. The CEO (application) says "publish the report." The VP (standard library) translates that into a work order. The manager (kernel) figures out which team (driver) owns the task. The team (driver) talks to the physical equipment (hardware). Completion signals flow back up the chain. The CEO never sees the physical equipment — and doesn't need to.
                        

The CPU — Brain of the Machine

The Central Processing Unit is the component that actually executes instructions. Everything else in the system exists to either give the CPU data to work on or to act on the CPU's outputs.

A modern CPU core contains:

Registers — tiny, ultra-fast storage locations directly inside the CPU die. A 64-bit CPU has general-purpose registers (rax, rbx, rcx... on x86-64), special-purpose registers (rip for the instruction pointer, rsp for the stack pointer, rflags for condition codes), and floating-point/SIMD registers. Accessing a register takes under 1 nanosecond.
ALU (Arithmetic Logic Unit) — performs integer arithmetic, bitwise operations, and comparisons.
FPU (Floating-Point Unit) — handles IEEE 754 floating-point arithmetic.
Control Unit — decodes instructions and orchestrates the execution units.
Cache hierarchy — L1 (per-core, ~32 KB, ~1 ns), L2 (per-core, ~256 KB, ~4 ns), L3 (shared across cores, ~8-64 MB, ~20-40 ns).

The fundamental operation of a CPU is the Fetch-Decode-Execute cycle:

Fetch-Decode-Execute Cycle

flowchart LR
    F["FETCH
Load instruction
at address in RIP
from memory/cache"]
    D["DECODE
Identify opcode
and operands
(which registers, addressing mode)"]
    E["EXECUTE
ALU/FPU performs
the operation
Result written to register/memory"]
    U["UPDATE RIP
RIP += instruction length
(or = jump target
for branches)"]

    F --> D --> E --> U --> F

    style F fill:#3B9797,color:#fff,stroke:#3B9797
    style D fill:#16476A,color:#fff,stroke:#16476A
    style E fill:#BF092F,color:#fff,stroke:#BF092F
    style U fill:#132440,color:#fff,stroke:#132440
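The cycle can be mimicked in a few lines of Python. The sketch below is a toy interpreter, not a model of any real CPU: the four-instruction set, the register names r0 and r1, and the tuple encoding are all invented for illustration.

```python
# Toy machine: the instruction set and register names are invented for illustration.
program = [
    ("LOAD", "r0", 5),          # r0 = 5
    ("LOAD", "r1", 7),          # r1 = 7
    ("ADD", "r0", "r1"),        # r0 = r0 + r1
    ("JMP_IF_ZERO", "r0", 0),   # jump to instruction 0 if r0 == 0 (not taken here)
    ("HALT",),
]

def run(program):
    regs = {"r0": 0, "r1": 0}
    ip = 0  # instruction pointer, playing the role of RIP
    while True:
        instr = program[ip]              # FETCH the instruction at ip
        opcode, *operands = instr        # DECODE into opcode and operands
        next_ip = ip + 1                 # default: fall through to the next one
        if opcode == "LOAD":             # EXECUTE
            reg, value = operands
            regs[reg] = value
        elif opcode == "ADD":
            dst, src = operands
            regs[dst] += regs[src]
        elif opcode == "JMP_IF_ZERO":
            reg, target = operands
            if regs[reg] == 0:
                next_ip = target         # branch taken: ip becomes the jump target
        elif opcode == "HALT":
            return regs
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        ip = next_ip                     # UPDATE the instruction pointer

print(run(program))  # {'r0': 12, 'r1': 7}
```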

Pipelining and Modern CPU Tricks

A naive CPU would complete one instruction before starting the next. Modern CPUs use instruction pipelining: while one instruction is in the Execute stage, the next one is already being decoded, and the one after that is already being fetched. In the ideal case this lets the core finish one instruction every clock cycle, even though each individual instruction still takes several cycles from fetch to completion.
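The benefit is easy to quantify for an idealised pipeline. The sketch below assumes every stage takes exactly one cycle and nothing ever stalls; the four-stage depth simply mirrors the four boxes in the diagram above and is not a claim about any real core.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return n_stages + (n_instructions - 1)

n, stages = 1000, 4
slow = unpipelined_cycles(n, stages)   # 4000
fast = pipelined_cycles(n, stages)     # 1003
print(f"unpipelined: {slow} cycles, pipelined: {fast} cycles, speedup: {slow / fast:.2f}x")
```

For long instruction streams the speedup approaches the number of stages, which is why real pipelines are considerably deeper than four stages.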

Beyond pipelining, modern CPUs employ several performance optimisations that are critical to understand when reasoning about program performance:

CPU Architecture

Key CPU Optimisations and Their Implications

Out-of-order execution: The CPU reorders instructions dynamically to avoid stalls — for example, if instruction 3 depends on the result of instruction 2 (which is waiting on a cache miss), the CPU might execute instructions 4 and 5 first. This means the programmer's view of instruction order and the actual hardware execution order can differ.

Branch prediction: When the CPU encounters an if statement (a conditional branch), it guesses which way the branch will go and speculatively executes that path before the condition is resolved. This is why Spectre-class vulnerabilities exist: speculatively executed code can touch data it should not have access to, and although the architectural results are discarded when the guess turns out wrong, side effects such as cache state remain measurable.

Superscalar execution: A single core can execute multiple independent instructions simultaneously using multiple execution units. A modern Intel/AMD core can retire 3-6 instructions per clock cycle in ideal conditions.
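To give a feel for how branch prediction behaves, here is a sketch of a classic textbook scheme, the two-bit saturating counter. Real predictors are far more elaborate and their details are not covered in this article, so treat this purely as an illustrative model.

```python
def two_bit_predictor_accuracy(outcomes):
    """Simulate one two-bit saturating counter over a sequence of branch outcomes.

    States 0-1 predict "not taken", states 2-3 predict "taken".
    """
    state = 2  # start in "weakly taken"
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move towards 3 on taken branches and towards 0 on not-taken ones.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken nine times, then falls through once, repeated 100 times.
loop_pattern = ([True] * 9 + [False]) * 100
# A branch that alternates on every execution.
alternating = [True, False] * 500

print(two_bit_predictor_accuracy(loop_pattern))  # 0.9
print(two_bit_predictor_accuracy(alternating))   # 0.5
```

Regular patterns such as loop branches are predicted well; a branch that flips every time defeats this simple scheme, which is one reason unpredictable branches in hot loops are expensive.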


Memory — The Speed Hierarchy

Memory in a computer is not a single uniform thing — it's a hierarchy of storage technologies, each offering a trade-off between speed, capacity, and cost. This hierarchy is one of the most important concepts for performance engineering.

Level	Size (typical)	Latency	Bandwidth	Volatile?
CPU Registers	~1 KB (16-32 registers)	<1 ns	Unlimited (direct)	Yes
L1 Cache	32–64 KB per core	~1–3 ns (3-5 cycles)	~1 TB/s	Yes
L2 Cache	256 KB – 1 MB per core	~4–10 ns (10-20 cycles)	~400 GB/s	Yes
L3 Cache	8–64 MB shared	~20–40 ns (40-80 cycles)	~200 GB/s	Yes
RAM (DRAM)	8–512 GB	~60–100 ns	~50 GB/s	Yes
NVMe SSD	256 GB – 8 TB	~100–200 µs	~7 GB/s	No
HDD	1 TB – 20 TB	~5–10 ms	~200 MB/s	No
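As a rough sanity check on the table, the latencies can be converted into clock cycles spent waiting. The sketch assumes a 3 GHz clock and picks one representative latency from each row's range; both choices are assumptions made for illustration.

```python
CLOCK_HZ = 3_000_000_000  # assumed 3 GHz core clock

# One representative latency per level, in nanoseconds, taken from the ranges in the table.
latencies_ns = {
    "L1 cache": 1,
    "L2 cache": 5,
    "L3 cache": 20,
    "RAM": 100,
    "NVMe SSD": 100_000,      # 100 microseconds
    "HDD": 10_000_000,        # 10 milliseconds
}

def cycles_waiting(latency_ns, clock_hz=CLOCK_HZ):
    # cycles = seconds * cycles-per-second, kept in integer arithmetic
    return latency_ns * clock_hz // 1_000_000_000

for level, ns in latencies_ns.items():
    print(f"{level:>9}: {cycles_waiting(ns):>12,} cycles")
```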

                            
The Million-Cycle Problem: A cache miss to RAM takes ~100 ns. At 3 GHz, that is ~300 clock cycles the core spends waiting instead of executing instructions. An HDD seek takes ~10 ms, which is 30 million wasted cycles. Writing cache-friendly code is one of the highest-leverage performance optimisations available to application developers.
                        

Cache and Locality of Reference

The CPU cache works by exploiting two principles:

Temporal locality: If you accessed a memory location recently, you will likely access it again soon. The cache keeps recently used data.
Spatial locality: If you accessed a memory location, you will likely access nearby locations soon. The cache loads entire cache lines (typically 64 bytes) at once, so accessing element 0 of a line-aligned array of 8-byte values also brings elements 1 through 7 into the cache.

This is why iterating through an array row-by-row is dramatically faster than column-by-column in a 2D array stored in row-major order. The former exhibits spatial locality; the latter does not.
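The row-versus-column claim can be checked without a stopwatch by counting cache-line misses in a small simulation. The sketch below models a tiny fully associative LRU cache, which is a simplification of real hardware; the 64-line capacity and the 256 x 256 matrix of 8-byte elements are arbitrary assumptions.

```python
from collections import OrderedDict

LINE_BYTES = 64   # typical cache line size
ELEM_BYTES = 8    # e.g. a 64-bit integer or a double

def count_misses(rows, cols, order, cache_lines=64):
    """Count cache-line misses when traversing a row-major rows x cols matrix."""
    cache = OrderedDict()  # line number -> True, oldest entry first (LRU order)
    misses = 0
    if order == "row":
        coords = ((r, c) for r in range(rows) for c in range(cols))
    else:
        coords = ((r, c) for c in range(cols) for r in range(rows))
    for r, c in coords:
        # Row-major layout: element (r, c) lives at byte offset (r * cols + c) * ELEM_BYTES.
        line = (r * cols + c) * ELEM_BYTES // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses

print("row-by-row misses:      ", count_misses(256, 256, "row"))  # 8192
print("column-by-column misses:", count_misses(256, 256, "col"))  # 65536
```

Row-by-row traversal misses once per cache line (one access in eight), while column-by-column traversal misses on every access because each line is evicted before its neighbouring elements are reached.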

# Demonstrate cache effects: time a sequential vs a shuffled access pattern.
# Both loops do identical work per element; only the visiting order differs.
# In CPython the interpreter overhead dominates, so expect a modest gap
# (typically a small multiple), not orders of magnitude.
python3 -c "
import random
import time

n = 10_000_000
a = list(range(n))

seq_idx = list(range(n))   # visit elements in order (good spatial locality)
rand_idx = seq_idx.copy()
random.shuffle(rand_idx)   # visit the same elements in random order

for label, idx in (('Sequential', seq_idx), ('Random', rand_idx)):
    t0 = time.perf_counter()
    s = sum(a[j] for j in idx)
    t1 = time.perf_counter()
    print(f'{label}: {t1-t0:.3f}s, sum={s}')
"

Storage — Persistence Beyond Power

RAM is fast but volatile — every byte disappears when power is cut. Storage (NVMe SSDs, SATA SSDs, HDDs) is persistent — data survives power cycles. This is the fundamental distinction between working memory and persistent memory.

From the OS's perspective, storage devices are block devices — they accept read/write requests in fixed-size blocks (typically 512 bytes or 4 KB). The file system layer (ext4, NTFS, APFS) builds the familiar file and directory abstraction on top of this raw block interface. We'll explore file systems deeply in Part 7.
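One consequence of the block interface is that even a tiny read touches whole blocks. The sketch below maps a byte range onto block numbers; the 4 KB block size is an assumption taken from the typical value mentioned above, and the function name is made up for this example.

```python
BLOCK_SIZE = 4096  # assumed 4 KB blocks

def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return the block numbers that the byte range [offset, offset + length) touches."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)

print(list(blocks_for_range(0, 1)))        # [0]     one byte still costs a whole block
print(list(blocks_for_range(4000, 200)))   # [0, 1]  a small read straddling a boundary
print(list(blocks_for_range(8192, 4096)))  # [2]     an aligned, block-sized read
```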

                            
Why SSDs Changed Everything: Traditional HDDs have a mechanical read/write head that must physically move to the correct track, which is where the 5-10 ms latency comes from. NVMe SSDs are purely electronic: they use floating-gate NAND flash cells and communicate over the PCIe bus, achieving latencies roughly 50-100x lower. This is why many workloads that were once I/O-bound (databases, containers starting up) became CPU-bound after SSD adoption.
                        

I/O — The World Outside the CPU

A computer system is not just a CPU and memory in isolation. It interacts with the external world through I/O — keyboards, displays, network cards, USB devices, and more. These interactions are handled through a combination of buses, device controllers, and device drivers.

The key I/O mechanisms are:

Memory-Mapped I/O (MMIO): Device registers are mapped into the CPU's address space. The CPU writes to specific memory addresses that are actually device registers — this is how the kernel talks to hardware controllers.
DMA (Direct Memory Access): For bulk data transfers (e.g., reading a disk block, receiving a network packet), the device controller writes data directly into RAM without involving the CPU for each byte. The CPU is only involved to initiate the transfer and to process the completion interrupt.
Port I/O: On x86 systems, special IN/OUT CPU instructions directly address device ports. Less common in modern systems but still used for legacy hardware.

Interrupts — Hardware Talks to Software

How does the CPU know when a key is pressed? Or when a network packet has arrived? Or when a disk read has completed? It doesn't continuously check — that would waste enormous CPU time. Instead, hardware uses interrupts.

An interrupt is an asynchronous signal from a hardware device to the CPU saying "something needs your attention." When an interrupt fires:

Interrupt Handling Flow

sequenceDiagram
    participant HW as Hardware Device
    participant CPU as CPU
    participant IDT as Interrupt Descriptor Table
    participant ISR as Interrupt Service Routine (Kernel)
    participant Proc as Interrupted Process

    Proc->>CPU: Executing user code (e.g., calculating something)
    HW->>CPU: Assert interrupt line (IRQ N)
    CPU->>CPU: Finish current instruction
    CPU->>CPU: Save processor state (registers, RIP) to kernel stack
    CPU->>IDT: Look up handler for IRQ N
    IDT->>ISR: Dispatch to Interrupt Service Routine
    ISR->>ISR: Handle event (read NIC data, mark I/O complete)
    ISR->>CPU: IRET instruction — restore saved state
    CPU->>Proc: Resume executing user code

There are two types of interrupts:

Hardware interrupts (IRQs): Fired by hardware devices — keyboard press, NIC packet received, timer tick, disk I/O complete. The Linux kernel's interrupt handler is registered in the IDT (Interrupt Descriptor Table).
Software interrupts / exceptions: Fired by the CPU itself in response to special conditions — divide-by-zero, page fault (accessing memory not yet mapped), illegal instruction. These trigger the kernel to take action.
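User-space code cannot install real interrupt handlers, but Unix signals offer a loose analogy: a handler is registered once and then runs asynchronously whenever the event fires, without the main code polling for it. The sketch below uses Python's signal module with an interval timer; it is Unix-only, must run in the main thread, and is an analogy rather than how the kernel's own handlers are written.

```python
import signal
import time

state = {"ticks": 0}

def on_timer(signum, frame):
    # Runs asynchronously, in between the main loop's bytecodes, on each SIGALRM.
    state["ticks"] += 1

# Register the handler, loosely analogous to installing an entry in the IDT.
signal.signal(signal.SIGALRM, on_timer)
# Ask the kernel to deliver SIGALRM every 10 ms (first delivery after 10 ms).
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)

work_done = 0
deadline = time.monotonic() + 2.0  # safety net so the sketch always terminates
while state["ticks"] < 5 and time.monotonic() < deadline:
    work_done += 1  # "user code" that never asks the timer whether it has fired

signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
print(f"handled {state['ticks']} timer signals while doing {work_done} units of work")
```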

                            
Key Insight: Interrupts are what make preemptive multitasking possible. A periodic timer interrupt fires at a rate fixed when the kernel is built (CONFIG_HZ on Linux, commonly 100, 250, or 1000 times per second). Each timer interrupt is an opportunity for the kernel's scheduler to decide whether to continue running the current process or switch to a different one. Without interrupts, one process could monopolise the CPU forever.
                        

System Calls — Software Talks to the Kernel

We've seen how hardware talks to the kernel via interrupts. Now let's look at how user-space programs talk to the kernel: through system calls.

The CPU operates in different privilege levels (called "rings" in x86 architecture). Ring 0 is kernel mode — code running here has unrestricted access to hardware, all memory, and all instructions. Ring 3 is user mode — code running here can only access its own memory and cannot execute privileged instructions. Your application runs in Ring 3. The kernel runs in Ring 0.

User Space vs Kernel Space

flowchart LR
    subgraph US["User Space (Ring 3)"]
        A["Application Code
(Python, Go, Java, C)"]
        B["C Standard Library
(glibc / musl)"]
    end
    subgraph KS["Kernel Space (Ring 0)"]
        C["Syscall Handler"]
        D["VFS / Networking / Memory Manager"]
        E["Device Drivers"]
    end
    subgraph HW["Hardware"]
        F["CPU / RAM / NIC / Disk"]
    end

    A -->|"open(), read(), write()
(via wrapper in glibc)"| B
    B -->|"syscall instruction
+ syscall number"| C
    C --> D --> E --> F

    style US fill:#f8f9fa,stroke:#3B9797,color:#132440
    style KS fill:#132440,stroke:#3B9797,color:#fff
    style HW fill:#16476A,stroke:#3B9797,color:#fff

When your code calls open(), what actually happens is:

The glibc wrapper puts the syscall number for openat (257 on x86-64 Linux) into register rax
Arguments go into rdi, rsi, rdx, r10, r8, and r9, in that order
The syscall instruction is executed — this triggers a transition from Ring 3 to Ring 0
The kernel's syscall handler dispatches to the correct kernel function
The kernel performs the requested operation, puts the return value in rax
The sysret instruction switches back to Ring 3 and the user-space code continues
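The same mechanism can be exercised from Python by skipping the named wrapper and handing a raw syscall number to libc's generic syscall() function through ctypes. This sketch assumes x86-64 Linux, where getpid is syscall number 39; on any other platform the hypothetical helper below simply returns None.

```python
import ctypes
import os
import platform

SYS_GETPID_X86_64 = 39  # getpid's syscall number on x86-64 Linux only

def raw_getpid():
    """Invoke getpid via libc's syscall(), or return None off x86-64 Linux."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc's syscall() loads the number into rax and executes the syscall instruction.
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))

print("raw syscall result:", raw_getpid())
print("os.getpid():       ", os.getpid())
```

On a matching system both lines print the same process ID, since os.getpid() ends up at the same kernel entry point.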

Tracing System Calls with strace

One of the most powerful debugging tools available on Linux is strace, which intercepts and logs every system call a process makes. This makes the normally hidden user-space-to-kernel (Ring 3 to Ring 0) communication visible.

# Trace all system calls made by the 'ls' command
strace ls /tmp

# Show only file-related system calls (filter by category)
strace -e trace=file ls /tmp

# Show only network-related system calls
strace -e trace=network curl -s https://example.com -o /dev/null

# Trace a running process by PID (attach to existing process)
# pgrep -n picks the newest matching process, so exactly one PID is passed
strace -p "$(pgrep -n python3)"

# Show timing for each syscall (-T) and wall-clock timestamps (-t)
strace -T -t ls /tmp

# Summarise syscall counts and time (useful for profiling)
strace -c ls /tmp

Hands-On

What strace Reveals

When you run strace python3 -c "open('/tmp/x.txt', 'w')", you'll see something like this in the output (among many other calls):

openat(AT_FDCWD, "/tmp/x.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
ioctl(3, TCGETS, 0x7ffd...) = -1 ENOTTY
fstat(3, {...}) = 0
write(3, "", 0) = 0
close(3) = 0

Notice: Python's open() uses the kernel's openat() syscall (not plain open) and sets O_CLOEXEC, so the FD is closed automatically if the process later calls exec(). The return value = 3 is the file descriptor number. The ioctl(TCGETS) call is Python checking whether the FD is a terminal (it's not, so the kernel returns ENOTTY, historically "not a typewriter").
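The same three facts can be confirmed from inside Python without strace. This is a small sketch using a hypothetical scratch file rather than /tmp/x.txt; os.get_inheritable() reports whether the close-on-exec flag is clear, and os.isatty() performs the terminal check.

```python
import os
import tempfile

# Hypothetical scratch file for this sketch (stands in for /tmp/x.txt).
path = os.path.join(tempfile.mkdtemp(), "x.txt")

with open(path, "w") as f:
    fd = f.fileno()                       # the integer the kernel handed back
    inheritable = os.get_inheritable(fd)  # False means close-on-exec is set
    is_terminal = os.isatty(fd)           # a regular file is not a terminal

print(f"fd={fd}, inheritable={inheritable}, is_terminal={is_terminal}")
```

In a fresh interpreter the descriptor is usually 3, because 0, 1, and 2 are already taken by stdin, stdout, and stderr.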


Putting It All Together

Let's trace exactly what happens when a Python web server handles a single HTTP request — a journey that touches every layer of the system:

An HTTP Request — All Layers Involved

flowchart TD
    A["NIC receives Ethernet frame
→ DMA writes packet to RAM
→ NIC fires interrupt"]
    B["Kernel interrupt handler runs
→ TCP/IP stack processes segment
→ Data placed in socket receive buffer"]
    C["Application's accept() / recv() syscall
→ Blocked process is woken up
→ Data copied from kernel buffer to user buffer"]
    D["Python HTTP server parses request
→ Calls route handler
→ Handler calls open() to read template file"]
    E["open() → openat() syscall
→ VFS resolves path → ext4 driver
→ Block layer → NVMe driver → SSD read"]
    F["SSD DMAs the data into the page cache
→ SSD fires interrupt on completion
→ read() syscall returns data to Python"]
    G["Python renders response
→ send() syscall writes response
→ TCP/IP stack queues segment
→ NIC sends Ethernet frame"]

    A --> B --> C --> D --> E --> F --> G

    style A fill:#3B9797,color:#fff
    style B fill:#16476A,color:#fff
    style C fill:#BF092F,color:#fff
    style D fill:#3B9797,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#BF092F,color:#fff
    style G fill:#3B9797,color:#fff

This entire sequence — from NIC interrupt to response sent — typically completes in under 1 millisecond if all data is in RAM. Each step involves a layer transition: hardware interrupts triggering kernel code, kernel code transitioning to user space, user space making system calls back into the kernel. The CPU switches between user mode and kernel mode dozens of times per request.
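The user-space portion of this journey (receive, parse, respond) can be sketched with a connected socket pair. Note the simplification: socket.socketpair() creates a local pair of sockets, so no NIC, TCP/IP stack, or template file is involved, and the request and response text are made up for the example. What it does show is that every hop across the user/kernel boundary is a send or recv system call.

```python
import socket

# A connected pair of local sockets stands in for the client and server ends.
server_side, client_side = socket.socketpair()
try:
    # "Client": the request bytes are copied into a kernel socket buffer.
    client_side.sendall(b"GET /hello HTTP/1.1\r\nHost: example.com\r\n\r\n")

    # "Server": recv() copies the bytes from the kernel buffer into user space.
    request = server_side.recv(4096)
    request_line = request.split(b"\r\n", 1)[0].decode("ascii")
    method, target, version = request_line.split(" ")

    # Route handling and rendering happen purely in user space.
    body = f"you asked for {target}\n".encode("ascii")
    header = f"HTTP/1.1 200 OK\r\nContent-Length: {len(body)}\r\n\r\n".encode("ascii")

    # The response goes back through the kernel with another syscall.
    server_side.sendall(header + body)
    reply = client_side.recv(4096)
finally:
    server_side.close()
    client_side.close()

print(reply.decode("ascii"))
```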

                            
                            Why This Mental Model Pays Off: Now when a performance engineer says "this endpoint is I/O bound," you understand exactly what that means — the CPU is mostly idle, waiting for disk or network I/O to complete, while processes are blocked in system calls waiting for kernel buffers to fill. The fix isn't to use a faster language or more RAM — it's to restructure I/O patterns or add caching.
                        

Exercises

These exercises build the intuition that makes the theory stick. Most of them are Linux-specific (strace and /proc do not exist on macOS); a macOS alternative is noted where one is available.

Exercise 1 — See Your System Calls

# On Linux: install strace if needed
# sudo apt install strace  (Ubuntu/Debian)
# sudo dnf install strace  (Fedora/RHEL)

# 1. Trace a simple Python program and count syscalls
strace -c python3 -c "print('hello')" 2>&1 | tail -20

# 2. How many openat() calls does Python make just to print hello?
# (Hint: it loads many .pyc files and shared libraries)
strace -e trace=openat python3 -c "print('hello')" 2>&1 | grep -c openat

Exercise 2 — Inspect the Memory Hierarchy

# View your CPU's cache sizes on Linux
getconf LEVEL1_DCACHE_SIZE  # L1 data cache in bytes
getconf LEVEL2_CACHE_SIZE   # L2 cache
getconf LEVEL3_CACHE_SIZE   # L3 cache

# Alternative: detailed CPU info from the kernel
cat /sys/devices/system/cpu/cpu0/cache/index0/size  # L1
cat /sys/devices/system/cpu/cpu0/cache/index1/size  # L1i or L2
cat /sys/devices/system/cpu/cpu0/cache/index2/size  # L2 or L3
cat /sys/devices/system/cpu/cpu0/cache/index3/size  # L3

# On macOS
sysctl hw.l1icachesize hw.l1dcachesize hw.l2cachesize hw.l3cachesize

Exercise 3 — See Interrupts in Action

# Watch interrupt counts change in real-time (Linux)
# Column 1 is the IRQ number, columns 2+ are per-CPU counts
watch -n 0.5 cat /proc/interrupts

# While watching, move your mouse or type in another terminal
# Notice which IRQ line increments with keyboard/mouse input

# See the timer interrupt specifically (usually IRQ 0 or LOC)
# IRQ labels are right-aligned, so allow leading spaces before them
grep -E "^ *(0|LOC):" /proc/interrupts

Exercise 4 — Observe Context Switches

# See how often the OS switches between processes
# cs = context switches per second, in = interrupts per second
vmstat 1 5

# More detail: voluntary vs involuntary context switches for a process
# Run a process and inspect its context switches
sleep 10 &
PID=$!
cat /proc/$PID/status | grep ctxt
wait
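A process can also inspect its own counters. The sketch below uses the standard resource module (Unix-only) to read the voluntary and involuntary context-switch counts before and after a short sleep; the 50 ms duration is arbitrary, and the exact numbers will vary from run to run.

```python
import resource
import time

def context_switches():
    """Return (voluntary, involuntary) context-switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw

vol_before, invol_before = context_switches()
time.sleep(0.05)  # blocking in a syscall gives up the CPU voluntarily
vol_after, invol_after = context_switches()

print(f"voluntary:   {vol_before} -> {vol_after}")
print(f"involuntary: {invol_before} -> {invol_after}")
```

Replacing the sleep with a busy loop should shift the growth from the voluntary counter towards the involuntary one, since the scheduler then has to preempt the process.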

Conclusion & Next Steps

You've just built a foundational mental model of a computer system — one that will serve you across every specialisation in this series and beyond. The key ideas to carry forward:

Layered abstraction is the core engineering principle — each layer hides complexity and provides a clean interface. Violations of layer boundaries are the source of most interesting bugs.
The CPU is fast; memory is slow; disk is geological. Every performance-sensitive design decision flows from the memory hierarchy.
Interrupts are asynchronous — hardware tells software "I'm done" via interrupts rather than the CPU polling. This is what makes multitasking and efficient I/O possible.
System calls are the boundary between your code and the OS. They involve a privilege-level transition (Ring 3 to Ring 0) and are the mechanism through which all I/O, networking, memory allocation, and process management happens.

Next Part 2: How Programs Actually Run
