Part 2: How Programs Actually Run

May 13, 2026 Wasil Zafar 18 min read

From source code to CPU — the complete journey through compilation, linking, loading, and the memory layout that every running process inhabits.

Table of Contents

  1. The Program Lifecycle
  2. Compilation — Four Phases
  3. The ELF Binary Format
  4. Static vs Dynamic Linking
  5. The Loader — From File to Process
  6. Process Memory Layout
  7. Runtime Environments
  8. Exercises
  9. Conclusion & Next Steps

The Program Lifecycle

You type python3 app.py and press Enter. Within milliseconds your web server is listening on port 8080. But what happened in between? How does a text file of Python instructions become a process consuming CPU cycles and RAM? The answer involves more machinery than most developers ever see.

Why This Matters: Understanding how programs load explains container startup times, why microservices have cold-start latency, why some binaries are 5 MB and others 50 MB, what "shared libraries" and "dependency hell" actually mean at the binary level, and why Docker images benefit from minimal base images. It also illuminates a whole class of security vulnerabilities in how dynamic linking works.

Compiled vs Interpreted Languages

Programs are text written by humans — source code. But CPUs execute machine code: binary instructions like 0x48 0x89 0xE5 (which is mov rbp, rsp in x86-64 assembly). There are two primary strategies for bridging this gap:

Ahead-of-Time (AOT) Compilation: The source code is translated to machine code before the program runs. Languages like C, C++, Rust, and Go use this approach. The compiler runs once, producing a native binary. Execution is fast because the CPU runs the binary directly.

Interpretation / Just-in-Time (JIT) Compilation: The source code (or an intermediate bytecode) is executed by a runtime, which interprets it or translates it to machine code while the program runs. Python, JavaScript, and Java use variations of this approach. Start-up may be slower, but runtimes with a JIT compiler (V8, the JVM, PyPy) approach native speed on hot code paths.

Compiled vs Interpreted Program Lifecycle
flowchart LR
    subgraph Compiled["AOT Compiled (C, Go, Rust)"]
        direction TB
        SC1["Source Code (.c, .go, .rs)"]
        BC1["Compiler (gcc, go build, rustc)"]
        OBJ["Object Files (.o)"]
        LNK["Linker (ld)"]
        BIN["Native Binary (ELF)"]
        EXE1["OS loads + executes directly"]
        SC1 --> BC1 --> OBJ --> LNK --> BIN --> EXE1
    end
    subgraph Interpreted["Interpreted / JIT (Python, JVM, JS)"]
        direction TB
        SC2["Source Code (.py, .java, .js)"]
        BC2["Bytecode Compiler"]
        BYTE["Bytecode (.pyc, .class)"]
        RT["Runtime (CPython, JVM, V8)"]
        JIT2["JIT Compiler (optional)"]
        EXE2["Machine Code Execution"]
        SC2 --> BC2 --> BYTE --> RT --> JIT2 --> EXE2
    end

    style Compiled fill:#f8f9fa,stroke:#3B9797
    style Interpreted fill:#f8f9fa,stroke:#16476A
                            

Compilation — Four Phases

When you compile a C program with gcc -o hello hello.c, four distinct phases happen in sequence:

# Let's trace the compilation of a trivial C program
# Create a simple program
cat > /tmp/hello.c << 'EOF'
#include <stdio.h>

#define GREETING "Hello"

int main() {
    printf("%s, World!\n", GREETING);
    return 0;
}
EOF

# Phase 1: Preprocessing — expands macros, includes, #ifdef
# Output: hello.i (preprocessed C, no macros)
gcc -E /tmp/hello.c -o /tmp/hello.i
wc -l /tmp/hello.i   # ~800 lines — stdio.h expanded!

# Phase 2: Compilation — C source to assembly (.s)
gcc -S /tmp/hello.i -o /tmp/hello.s
cat /tmp/hello.s      # Human-readable assembly instructions

# Phase 3: Assembly — assembly (.s) to machine code object file (.o)
gcc -c /tmp/hello.s -o /tmp/hello.o
file /tmp/hello.o     # "ELF 64-bit LSB relocatable object"
nm /tmp/hello.o       # List symbols: U printf (undefined), T main (text)

# Phase 4: Linking — combine .o files + libraries into executable
gcc /tmp/hello.o -o /tmp/hello
file /tmp/hello       # "ELF 64-bit LSB pie executable"

Object Files and Symbols

Object files (.o) are the output of the assembly phase (gcc -c runs preprocessing, compilation, and assembly in one step). Each source file compiles to one object file. An object file contains:

  • Machine code for the functions defined in that source file
  • Data for global and static variables
  • A symbol table listing all symbols the file defines and all symbols it references but doesn't define
  • Relocation entries — placeholders for addresses that can't be resolved until linking

The linker's job is to take multiple object files and libraries, resolve all undefined symbol references, assign final memory addresses, and produce a single executable (or shared library).
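The symbol-resolution step can be sketched as a toy model in Python. This is not how ld is implemented: the ObjectFile class, the resolve_symbols function, the object names, and the addresses are all hypothetical, chosen only to show undefined references being matched to definitions and given final addresses.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    """Toy stand-in for a .o file: what it defines and what it still needs."""
    name: str
    defined: dict = field(default_factory=dict)   # symbol -> offset within this object
    undefined: set = field(default_factory=set)   # symbols referenced but not defined


def resolve_symbols(objects, base_address=0x1000, object_size=0x100):
    """Give each object a load address, then check every reference has a definition."""
    symbol_table = {}
    for index, obj in enumerate(objects):
        load_address = base_address + index * object_size
        for symbol, offset in obj.defined.items():
            if symbol in symbol_table:
                raise ValueError(f"duplicate definition of {symbol!r}")
            symbol_table[symbol] = load_address + offset

    # A real linker reports "undefined reference" at exactly this point.
    for obj in objects:
        missing = obj.undefined - set(symbol_table)
        if missing:
            raise LookupError(f"{obj.name}: undefined reference to {sorted(missing)}")
    return symbol_table


# Hypothetical inputs: hello.o needs printf, a second object provides it.
hello = ObjectFile("hello.o", defined={"main": 0x0}, undefined={"printf"})
libc = ObjectFile("libc_printf.o", defined={"printf": 0x20})

for symbol, address in sorted(resolve_symbols([hello, libc]).items()):
    print(f"{symbol:8s} -> {address:#x}")
```

Linking hello alone raises LookupError, which mirrors the "undefined reference to printf" error a real linker prints when a library is missing from the command line.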

The ELF Binary Format

On Linux (and most Unix-like systems), executables, object files, and shared libraries all use the ELF format (Executable and Linkable Format). Understanding ELF is understanding the "native language" that the OS uses to load and run programs.

An ELF file has three main structural elements:

  1. ELF Header — at offset 0, always starts with magic bytes 0x7f E L F. Describes the file type (executable, shared library, object file), architecture (x86-64, ARM64), entry point address, and locations of segment/section tables.
  2. Program Header Table (Segments) — used by the OS loader. Describes which parts of the file should be mapped into memory, at what addresses, with what permissions (read/write/execute).
  3. Section Header Table (Sections) — used by the linker and debugger. Describes fine-grained divisions of the file content.
# Inspect an ELF binary with readelf and objdump
# First, create a simple binary to examine
cat > /tmp/demo.c << 'EOF'
#include <stdio.h>
int global_var = 42;
const char* message = "hello";
int main() { printf("%s %d\n", message, global_var); return 0; }
EOF
gcc -o /tmp/demo /tmp/demo.c

# View the ELF header — magic bytes, type, architecture, entry point
readelf -h /tmp/demo | head -20

# View program headers (segments) — how the OS maps the binary into memory
readelf -l /tmp/demo

# View section headers — .text, .data, .bss, .rodata etc.
readelf -S /tmp/demo

# View the symbol table — all named symbols with their addresses
readelf -s /tmp/demo | grep -E "FUNC|OBJECT"

# Disassemble the .text section (machine code -> assembly)
objdump -d /tmp/demo | head -40
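To make the ELF header fields concrete, here is a minimal Python sketch that decodes a few of them by hand. It assumes a 64-bit little-endian file; parse_elf_header is a hypothetical helper name, and only a handful of type and machine codes are listed. readelf -h remains the tool to trust for real work.

```python
import struct
import sys

# A few e_type and e_machine codes from the ELF specification (not exhaustive).
ELF_TYPES = {1: "REL (relocatable object)", 2: "EXEC (executable)", 3: "DYN (shared object or PIE)"}
ELF_MACHINES = {0x3E: "x86-64", 0xB7: "AArch64"}


def parse_elf_header(data: bytes) -> dict:
    """Decode selected fields from the first 64 bytes of a 64-bit little-endian ELF file."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: bad magic bytes")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")

    # After the 16-byte e_ident block: type, machine, version, entry, phoff, shoff.
    e_type, e_machine, _version, e_entry, e_phoff, e_shoff = struct.unpack_from("<HHIQQQ", data, 16)
    return {
        "type": ELF_TYPES.get(e_type, f"unknown ({e_type})"),
        "machine": ELF_MACHINES.get(e_machine, f"unknown ({e_machine:#x})"),
        "entry": e_entry,                 # virtual address where execution starts
        "program_headers_at": e_phoff,    # file offset of the segment table
        "section_headers_at": e_shoff,    # file offset of the section table
    }


if __name__ == "__main__":
    # Pass one or more ELF files on the command line, e.g. /tmp/demo.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, parse_elf_header(f.read(64)))
```

Saved under any file name and run with /tmp/demo as its argument, it should report the same entry point that readelf -h shows.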

Key ELF Sections

Section   | Contents                         | Permissions              | Example
.text     | Compiled machine code            | Read + Execute           | Your function bodies
.rodata   | Read-only data                   | Read only                | String literals ("Hello"), const arrays
.data     | Initialised global/static vars   | Read + Write             | int global = 42;
.bss      | Uninitialised global/static vars | Read + Write (zeroed)    | int counter; (0 at startup)
.plt      | Procedure Linkage Table          | Read + Execute           | Stubs for dynamic library calls
.got      | Global Offset Table              | Read + Write (then Read) | Resolved addresses for shared lib functions
.dynamic  | Dynamic linking info             | Read                     | Required shared libs (NEEDED entries)
.symtab   | Symbol table                     | Read                     | Function/variable names and addresses
.debug_*  | Debug information (DWARF)        | Read                     | Source-line mappings (stripped in release builds)

Why .bss Exists: Uninitialised global variables (static int counters[1000000]) would waste 4 MB in the binary file if stored literally. Instead, .bss stores only the size of the region — the kernel zeroes the actual memory pages when loading the process. A 4 MB zeroed array adds only a few bytes to the binary on disk.

Static vs Dynamic Linking

When your program calls printf(), where does that code come from? The answer depends on how the binary was linked.

Static Linking — Self-Contained Binaries

In static linking, all library code your program uses is copied directly into the final binary by the linker. The result is a self-contained executable that doesn't depend on any external library at runtime.

# Static vs dynamic binary comparison
cat > /tmp/static_test.c << 'EOF'
#include <stdio.h>
int main() { printf("hello\n"); return 0; }
EOF

# Dynamic binary (default — links against shared libc)
gcc -o /tmp/hello_dynamic /tmp/static_test.c
ls -lh /tmp/hello_dynamic    # ~16 KB

# Static binary (includes all library code)
gcc -static -o /tmp/hello_static /tmp/static_test.c
ls -lh /tmp/hello_static     # ~800 KB: the libc code it needs is copied in

# See what shared libraries a dynamic binary requires
ldd /tmp/hello_dynamic
# linux-vdso.so.1, libc.so.6, /lib64/ld-linux-x86-64.so.2

ldd /tmp/hello_static
# statically linked (no shared lib dependencies)
Container Engineering

Static Linking and Scratch Containers

When you see a Docker image using FROM scratch (the empty base image), the application binary inside must be statically linked, because there's no C library or any other shared library in the image to link against at runtime. Go programs that don't use cgo compile to statically linked binaries on Linux (building with CGO_ENABLED=0 guarantees this), which is why Go is so popular for containerised microservices. A Go web server can run in a Docker image that's literally just the binary, with a total image size of 5-15 MB.

Languages like Python or Java can't use FROM scratch directly because they depend on the interpreter/JVM and its dependencies at runtime. This is why Python container images typically run to 100 MB or more even on a slim base (python:slim), while Go images can be under 10 MB.

Dynamic Linking and Shared Libraries

In dynamic linking, the library code is not copied into the binary. Instead, the binary stores a reference to the library (e.g., libc.so.6), and the dynamic linker loads the library at runtime when the program starts.

Advantages of dynamic linking:

  • Smaller binaries (the library exists once on disk, not in every binary)
  • Shared memory: if 50 processes use libc, the OS maps the same physical memory pages into all 50 address spaces — only one copy in RAM
  • Library updates without recompiling applications (e.g., a security patch to libc benefits all programs immediately)

Disadvantages:

  • "Dependency hell": if the required library version isn't installed, the program fails to start
  • Slightly slower first call to each library function (resolved lazily at runtime)
  • Security attack surface (LD_PRELOAD injection, rpath manipulation)
# Explore shared library dependencies
# List all shared libraries a binary needs
ldd /usr/bin/python3
# Shows: libpython3.x.so, libm.so, libz.so, libc.so, ld-linux...

# See which library provides a specific symbol
# (useful when debugging "symbol not found" errors)
nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep " printf"

# Check if a library is already loaded in memory (shared between processes)
# Look at /proc/PID/maps for a running process
cat /proc/self/maps | grep "\.so"

# Inspect NEEDED entries (required libs) in an ELF binary
readelf -d /usr/bin/python3 | grep NEEDED

PLT, GOT, and Lazy Binding

When dynamically linked code calls an external function (like printf), the address of that function isn't known at compile time — it depends on where libc.so is loaded in memory. The dynamic linker solves this via two data structures in every dynamically linked binary:

  • PLT (Procedure Linkage Table): Stubs in the .plt section. When your code calls printf, it actually calls a PLT stub.
  • GOT (Global Offset Table): A table of addresses in the .got.plt section. Initially, each GOT entry points back into the PLT (the resolver). After the first call, the entry is overwritten with the actual address of printf in libc.

This is lazy binding — function addresses are resolved on first call, not all upfront at program start. The tradeoff: program startup is faster (don't resolve all symbols immediately), but the first call to each library function pays a small overhead for resolution.
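The patch-on-first-call behaviour can be mimicked in a few lines of Python. This is a toy simulation rather than real PLT/GOT machinery: the LIBC dictionary, the got dictionary, and the make_resolver_stub and plt_call helpers are hypothetical stand-ins for the shared library, the GOT slots, the resolver, and the PLT stub.

```python
# Toy "shared library": symbol name -> real implementation.
LIBC = {"puts": lambda text: f"libc puts: {text}"}

resolver_calls = []   # records every trip through the slow path
got = {}              # Global Offset Table: symbol name -> current call target


def make_resolver_stub(symbol):
    """Initial GOT entry: look the symbol up, patch the GOT slot, forward the call."""
    def stub(*args):
        resolver_calls.append(symbol)
        got[symbol] = LIBC[symbol]   # overwrite the slot with the real function
        return got[symbol](*args)
    return stub


def plt_call(symbol, *args):
    """PLT stub: always jump through whatever the GOT slot currently holds."""
    return got[symbol](*args)


got["puts"] = make_resolver_stub("puts")   # state before the first call

print(plt_call("puts", "first call"))      # slow path: resolver runs and patches the GOT
print(plt_call("puts", "second call"))     # fast path: slot already holds the real function
print("resolver ran", len(resolver_calls), "time(s)")
```

The caller never changes: both calls go through plt_call, and only the contents of the table differ. That indirection is why a writable GOT is attractive to an attacker.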

Security Note (GOT Overwrite Attacks): Because the GOT is a writable table of function pointers, it has historically been a target for memory corruption exploits. An attacker who can write to the GOT can redirect a call to printf() to any function they choose, including one that spawns a shell. The modern mitigation is RELRO (RELocation Read-Only), which comes in two levels. Partial RELRO makes relocated data such as .dynamic and the non-PLT part of the GOT read-only once startup relocation is done, but leaves .got.plt writable so lazy binding still works. Full RELRO resolves all symbols eagerly at startup (disabling lazy binding) and then makes the entire GOT read-only.

The Loader — From File to Process

When you run a program (e.g., type ./hello in a shell), the shell calls the execve() system call. This is the moment a file on disk becomes a running process. Here's what happens:

The Loader: From execve() to main()
sequenceDiagram
    participant SH as Shell
    participant KN as Kernel
    participant LD as ld-linux.so (Dynamic Linker)
    participant LB as Shared Libraries (libc, etc.)
    participant PR as Program's main()

    SH->>KN: execve("./hello", argv, envp)
    KN->>KN: Read ELF header, check magic bytes
    KN->>KN: Map ELF segments into new address space<br/>(LOAD segments → mmap with correct permissions)
    KN->>LD: Map ld-linux.so (the dynamic linker)<br/>and jump to its entry point
    LD->>LD: Parse .dynamic section<br/>Find all NEEDED shared libraries
    LD->>LB: Load each shared library (mmap the .so files)
    LD->>LD: Resolve all symbol references<br/>Update GOT entries with real addresses
    LD->>LD: Run library initialisation code<br/>(.init_array, __libc_csu_init)
    LD->>PR: Jump to program entry point<br/>(which calls main())
    PR->>PR: Your code runs!

# See the dynamic linker in action
# The LD_DEBUG environment variable enables verbose dynamic linker output

# Show all libraries being loaded
LD_DEBUG=libs /tmp/hello_dynamic 2>&1 | head -20

# Show all symbol bindings being resolved
LD_DEBUG=bindings /tmp/hello_dynamic 2>&1 | head -20

# Show the entire loading process
LD_DEBUG=all /tmp/hello_dynamic 2>&1 | head -50

# Measure program startup overhead — how long before main() is called?
# Use LD_DEBUG=statistics to see linker timing
LD_DEBUG=statistics /tmp/hello_dynamic 2>&1
The vDSO: You may see linux-vdso.so.1 in ldd output — it doesn't exist as a file on disk. The vDSO (virtual Dynamic Shared Object) is a small shared library that the kernel maps directly into every process's address space. It provides fast implementations of certain frequently-called syscalls (like clock_gettime()) that can be executed without actually crossing the user/kernel boundary — reducing the cost from ~100 ns to ~10 ns.

Process Memory Layout

Once the loader has done its work, the process's virtual address space has a well-defined layout. On x86-64 Linux with the default 4-level paging, user space spans 128 TiB (47 bits of address), but only small portions are actually mapped:

Process Virtual Address Space Layout (x86-64 Linux)
flowchart TD
    A["Kernel Space<br/>Top of address space<br/>Not accessible from user space<br/>(Page fault if accessed)"]
    B["Stack<br/>Grows downward<br/>Local variables, return addresses, saved registers<br/>Default 8 MB limit (ulimit -s)"]
    C["Memory-Mapped Region<br/>Shared libraries mapped here<br/>mmap() allocations<br/>(libc.so, libpthread.so, etc.)"]
    D["Heap<br/>Grows upward<br/>Dynamic allocations: malloc(), new<br/>Managed by the allocator (glibc malloc, jemalloc)"]
    E[".bss<br/>Uninitialised global/static vars<br/>(zero-filled by OS)"]
    F[".data<br/>Initialised global/static variables<br/>int x = 42;"]
    G[".rodata<br/>Read-only data: string literals, const arrays<br/>Attempting write → SIGSEGV"]
    H[".text<br/>Program machine code<br/>Read + Execute only<br/>Attempting write → SIGSEGV (W^X protection)"]
    A --- B --- C --- D --- E --- F --- G --- H
    style A fill:#132440,color:#fff
    style B fill:#BF092F,color:#fff
    style C fill:#16476A,color:#fff
    style D fill:#3B9797,color:#fff
    style E fill:#16476A,color:#fff
    style F fill:#3B9797,color:#fff
    style G fill:#16476A,color:#fff
    style H fill:#132440,color:#fff

# Inspect the memory map of a running process
# Start a long-running process
python3 -c "import time; time.sleep(60)" &
PID=$!

# View its memory map: addr range, permissions, offset, device, inode, name
cat /proc/$PID/maps

# Summarise memory usage by category
cat /proc/$PID/smaps_rollup

# See the same information with human-readable sizes
pmap -x $PID | tail -20

kill $PID
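Each line of /proc/PID/maps follows the field order named in the comment above, so it is easy to parse programmatically. The sketch below is one possible way to do it in Python; Mapping and parse_maps_line are hypothetical names, and the sample lines are made-up values in the kernel's format, not output from a real process.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str

    @property
    def size_kib(self) -> int:
        return (self.end - self.start) // 1024


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps: range, perms, offset, device, inode, [path]."""
    fields = line.split(maxsplit=5)
    addr_range, perms, offset = fields[0], fields[1], fields[2]
    # fields[3] is the device and fields[4] the inode; anonymous mappings have no path.
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr_range.split("-"))
    return Mapping(start, end, perms, int(offset, 16), path)


# Illustrative sample lines (made-up addresses).
SAMPLE = """\
55d4c8a00000-55d4c8a01000 r-xp 00001000 08:01 1234 /tmp/demo
7f2b1c000000-7f2b1c021000 rw-p 00000000 00:00 0
7ffd3a1e0000-7ffd3a201000 rw-p 00000000 00:00 0 [stack]
"""

for raw in SAMPLE.splitlines():
    m = parse_maps_line(raw)
    print(f"{m.start:#014x} {m.perms} {m.size_kib:6d} KiB  {m.path or '(anonymous)'}")
```

On Linux, feeding it the lines of /proc/self/maps instead of SAMPLE shows the layout of the Python interpreter itself.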

ASLR and Memory Randomisation

ASLR (Address Space Layout Randomisation) is a security feature that randomises the base addresses of the stack, heap, shared library mappings, and (for position-independent executables) the program itself every time a program is launched. An attacker who can trigger a buffer overflow can then no longer predict where injected shellcode, or the library code they want to jump to, will sit in memory.

# Check if ASLR is enabled on your Linux system
# 0 = disabled, 1 = partial, 2 = full ASLR
cat /proc/sys/kernel/randomize_va_space   # Should be 2

# Observe ASLR in action: the address of a heap-allocated object changes each run
for i in $(seq 1 5); do
    python3 -c "import ctypes; print(hex(id(ctypes.c_int(1))))"
done
# You'll see different addresses each time

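# For testing, temporarily disable ASLR for one process
# (older setarch versions require the architecture argument, so pass it explicitly)
setarch "$(uname -m)" -R python3 -c "import ctypes; print(hex(id(ctypes.c_int(1))))"
setarch "$(uname -m)" -R python3 -c "import ctypes; print(hex(id(ctypes.c_int(1))))"
# Typically the same address both times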

Runtime Environments

Not all programs are compiled AOT to native machine code. Many popular languages use a runtime environment — a managed execution engine that interprets or JIT-compiles the program while handling memory management, type checking, and other services.

JIT Compilation — The Best of Both Worlds

Modern runtimes like V8 (JavaScript), the JVM (Java/Kotlin), and PyPy (Python) use Just-in-Time compilation to close the performance gap with native code:

JIT Compilation: V8 JavaScript Engine
flowchart LR
    JS["JavaScript Source"] --> Parse["Parser<br/>(AST generation)"]
    Parse --> Ignition["Ignition Interpreter<br/>(fast startup, generates bytecode)"]
    Ignition -->|"Hot function detected<br/>(called many times)"| TurboFan
    TurboFan["TurboFan JIT Compiler<br/>(optimised machine code)"] -->|"Type assumption violated<br/>(deoptimisation)"| Ignition
    TurboFan --> Native["Native Machine Code<br/>(near C-speed performance)"]
    style JS fill:#f8f9fa,stroke:#3B9797
    style Ignition fill:#3B9797,color:#fff
    style TurboFan fill:#BF092F,color:#fff
    style Native fill:#132440,color:#fff

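The tier-up and deoptimisation cycle in the diagram can be imitated with a small Python class. This is a toy model, not V8's actual mechanism: TieredFunction, HOT_THRESHOLD, and the two lambdas are hypothetical, and the "optimised" path is just a second Python function guarded by a type check.

```python
HOT_THRESHOLD = 3  # hypothetical value; real engines use their own heuristics


class TieredFunction:
    """Toy model of interpreter -> optimised code -> deoptimisation."""

    def __init__(self, generic, specialised, expected_type):
        self.generic = generic            # slow path: works for any input
        self.specialised = specialised    # fast path: valid only for expected_type
        self.expected_type = expected_type
        self.calls = 0
        self.optimised = False
        self.log = []                     # which tier handled each call

    def __call__(self, value):
        if self.optimised:
            if isinstance(value, self.expected_type):
                self.log.append("optimised")
                return self.specialised(value)
            # Type assumption violated: drop the fast path and start counting again.
            self.optimised = False
            self.calls = 0
            self.log.append("deopt")
            return self.generic(value)

        self.calls += 1
        self.log.append("interpreted")
        if self.calls >= HOT_THRESHOLD:
            self.optimised = True         # the function is hot: use the fast path next time
        return self.generic(value)


double = TieredFunction(
    generic=lambda v: v + v,        # works for ints and strings alike
    specialised=lambda v: v << 1,   # integer-only shortcut
    expected_type=int,
)

results = [double(v) for v in (1, 2, 3, 4, "ab", 5)]
print(results)
print(double.log)
```

The string argument is what forces the fallback: the shortcut is only correct while the type assumption holds, so the guard has to be checked on every optimised call.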
Case Study

CPython Is Not "Interpreted Python" — It's a Bytecode VM

Many developers think Python is "interpreted" in the sense that the source code is read and executed line-by-line like a script. This is not accurate. CPython:

  1. Compiles your .py file to CPython bytecode (.pyc files in __pycache__/)
  2. The CPython interpreter — itself a C program compiled to native binary — executes this bytecode in a loop
  3. Each bytecode instruction (LOAD_FAST, CALL, etc.; CALL was named CALL_FUNCTION before Python 3.11) dispatches to a block of C code inside the interpreter's evaluation loop

This is why NumPy operations are "fast Python" — NumPy functions are implemented in C and called via the bytecode VM with minimal overhead. The slow part of Python is not the bytecode loop itself but the overhead per Python object operation (type checking, reference counting, dict lookups for attributes).
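The "executes this bytecode in a loop" step can be illustrated with a tiny stack machine. This is a simplified sketch, not CPython's real evaluation loop: run and PROGRAM are hypothetical names, and the opcode names are borrowed from older CPython versions purely for familiarity.

```python
def run(bytecode, local_vars):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while True:
        opcode, arg = bytecode[pc]
        pc += 1
        if opcode == "LOAD_FAST":          # push a local variable
            stack.append(local_vars[arg])
        elif opcode == "LOAD_CONST":       # push a constant
            stack.append(arg)
        elif opcode == "BINARY_ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "BINARY_MULTIPLY":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode!r}")


# Hand-written program equivalent to: return (x + 2) * y
PROGRAM = [
    ("LOAD_FAST", "x"),
    ("LOAD_CONST", 2),
    ("BINARY_ADD", None),
    ("LOAD_FAST", "y"),
    ("BINARY_MULTIPLY", None),
    ("RETURN_VALUE", None),
]

print(run(PROGRAM, {"x": 3, "y": 4}))  # (3 + 2) * 4
```

Every instruction pays for a fetch, a dispatch, and stack traffic. That per-operation cost is what a single call into a C-implemented routine avoids when it processes a whole array at once.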

# Inspect CPython bytecode for a simple function
python3 -c "
import dis

def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)

# Disassemble the function to CPython bytecode
dis.dis(factorial)
"
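# Output shows opcodes such as LOAD_FAST (load a local variable), COMPARE_OP,
# POP_JUMP_IF_FALSE, and BINARY_OP; exact names vary between Python versions.

# View the .pyc bytecode cache files CPython generates on import.
# factorial_module is a placeholder name; the import is skipped if no such module exists.
python3 -c "import factorial_module" 2>/dev/null || true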

# Find a .pyc file and inspect its header
find /usr/lib/python3* -name "*.pyc" -size +10k | head -1 | xargs python3 -c "
import sys, marshal, dis, struct

path = sys.argv[1]
with open(path, 'rb') as f:
    magic = f.read(16)  # magic + flags + timestamp + size
    code = marshal.loads(f.read())
print('Bytecode for:', path)
dis.dis(code)
" 2>/dev/null | head -30

Exercises

Exercise 1 — Explore Your Own Binary

# Compile a C program and explore its ELF structure
cat > /tmp/explore.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>

char global_init[] = "I am in .data";
char global_uninit[100];  // will be in .bss
const char* literal = "I am in .rodata";

int main(int argc, char* argv[]) {
    char* heap = malloc(100);  // heap allocation
    printf("Stack var address: %p\n", &argc);
    printf("Heap alloc address: %p\n", heap);
    printf("Text (main) address: %p\n", (void*)main);
    printf(".data address: %p\n", global_init);
    printf(".rodata address: %p\n", literal);
    printf(".bss address: %p\n", global_uninit);
    free(heap);
    return 0;
}
EOF
gcc -o /tmp/explore /tmp/explore.c
/tmp/explore
# Observe how addresses match the virtual memory layout diagram

Exercise 2 — Static vs Dynamic Size

# Compare sizes and startup time of static vs dynamic binary
cat > /tmp/bench.c << 'EOF'
#include <stdio.h>
#include <time.h>
int main() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    printf("%ld.%09ld\n", ts.tv_sec, ts.tv_nsec);
    return 0;
}
EOF

gcc -O2 -o /tmp/bench_dynamic /tmp/bench.c
gcc -O2 -static -o /tmp/bench_static /tmp/bench.c

echo "=== File sizes ==="
ls -lh /tmp/bench_dynamic /tmp/bench_static

echo "=== Startup times (100 runs each, total wall-clock time) ==="
# Time the whole loop from outside: the value each run prints is the system's
# monotonic clock reading, not the cost of starting the process.
time (for i in $(seq 1 100); do /tmp/bench_dynamic > /dev/null; done)
time (for i in $(seq 1 100); do /tmp/bench_static > /dev/null; done)

Exercise 3 — Library Interception with LD_PRELOAD

# LD_PRELOAD lets you inject a shared library before all others
# This is used legitimately (e.g., tcmalloc, mimalloc for better malloc)
# and maliciously (rootkit-style hooking)
# Here we safely observe how it works

# Create a shared library that wraps malloc
cat > /tmp/track_malloc.c << 'EOF'
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

void* malloc(size_t size) {
    static void* (*real_malloc)(size_t) = NULL;
    if (!real_malloc) real_malloc = dlsym(RTLD_NEXT, "malloc");
    void* ptr = real_malloc(size);
    fprintf(stderr, "malloc(%zu) = %p\n", size, ptr);
    return ptr;
}
EOF
gcc -shared -fPIC -o /tmp/track_malloc.so /tmp/track_malloc.c -ldl

# Use LD_PRELOAD to inject our tracking malloc into any program
LD_PRELOAD=/tmp/track_malloc.so python3 -c "x = [1,2,3]" 2>&1 | head -10

Conclusion & Next Steps

You now understand the complete journey from source code to a running process. The key concepts to carry forward:

  • Compilation is four phases: preprocessing, compilation, assembly, linking — each producing an intermediate artefact. -E, -S, -c flags stop GCC at each phase.
  • ELF is the universal binary format on Linux — readelf and objdump let you inspect any binary. Every Docker image is full of ELF files.
  • Dynamic linking saves memory but adds complexity — shared libraries, the dynamic linker (ld-linux.so), PLT/GOT, and the RELRO security hardening all exist to manage this.
  • The process memory layout is deterministic (modulo ASLR): text at the bottom, data/bss above, heap grows up, stack grows down, shared libraries somewhere in the middle.
  • Runtimes like CPython and the JVM are themselves native ELF binaries that implement a virtual machine — Python bytecode runs inside a C program.