The Program Lifecycle
You type python3 app.py and press Enter. Within milliseconds your web server is listening on port 8080. But what happened in between? How does a text file of Python instructions become a process consuming CPU cycles and RAM? The answer involves more machinery than most developers ever see.
Compiled vs Interpreted Languages
Programs are text written by humans — source code. But CPUs execute machine code: binary instructions like 0x48 0x89 0xE5 (which is mov rbp, rsp in x86-64 assembly). There are two primary strategies for bridging this gap:
Ahead-of-Time (AOT) Compilation: The source code is translated to machine code before the program runs. Languages like C, C++, Rust, and Go use this approach. The compiler runs once, producing a native binary. Execution is fast because the CPU runs the binary directly.
Interpretation / Just-in-Time (JIT) Compilation: The source code (or an intermediate bytecode) is executed by an interpreter, or translated to machine code while the program runs. Python, JavaScript, and Java use variations of this approach. Start-up may be slower, but modern runtimes use JIT compilation to approach native speeds for hot code paths.
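To make the gap between text and machine code concrete, the three bytes quoted earlier (0x48 0x89 0xE5) can be decoded by hand. The sketch below is a minimal illustration that handles only this one encoding shape (a REX prefix, opcode 0x89, and a register-to-register ModRM byte); the function name decode_mov is hypothetical, and this is nowhere near a real disassembler.

```python
# Register numbers 0-7 as used in the ModRM byte for 64-bit operands
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_mov(code: bytes) -> str:
    """Decode REX.W + 0x89 + ModRM (register-direct form only)."""
    rex, opcode, modrm = code  # exactly three bytes expected
    # REX prefix layout is 0100WRXB; W=1 selects 64-bit operand size
    if rex & 0xF0 != 0x40 or not rex & 0x08:
        raise ValueError("expected a REX prefix with W=1")
    if rex & 0x07:
        raise ValueError("REX.R/X/B extensions are not handled in this sketch")
    if opcode != 0x89:
        raise ValueError("only opcode 0x89 (MOV r/m64, r64) is handled")
    mod = modrm >> 6          # top two bits: addressing mode
    reg = (modrm >> 3) & 7    # middle three bits: source register
    rm = modrm & 7            # low three bits: destination register
    if mod != 0b11:
        raise ValueError("only the register-to-register form is handled")
    return f"mov {REGS64[rm]}, {REGS64[reg]}"

print(decode_mov(bytes([0x48, 0x89, 0xE5])))  # mov rbp, rsp
```

The ModRM byte 0xE5 is 11 100 101 in binary: mode 3 (register direct), source register 4 (rsp), destination register 5 (rbp).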
flowchart LR
subgraph Compiled["AOT Compiled (C, Go, Rust)"]
direction TB
SC1["Source Code (.c, .go, .rs)"]
BC1["Compiler (gcc, go build, rustc)"]
OBJ["Object Files (.o)"]
LNK["Linker (ld)"]
BIN["Native Binary (ELF)"]
EXE1["OS loads + executes directly"]
SC1 --> BC1 --> OBJ --> LNK --> BIN --> EXE1
end
subgraph Interpreted["Interpreted / JIT (Python, JVM, JS)"]
direction TB
SC2["Source Code (.py, .java, .js)"]
BC2["Bytecode Compiler"]
BYTE["Bytecode (.pyc, .class)"]
RT["Runtime (CPython, JVM, V8)"]
JIT2["JIT Compiler (optional)"]
EXE2["Machine Code Execution"]
SC2 --> BC2 --> BYTE --> RT --> JIT2 --> EXE2
end
style Compiled fill:#f8f9fa,stroke:#3B9797
style Interpreted fill:#f8f9fa,stroke:#16476A
Compilation — Four Phases
When you compile a C program with gcc -o hello hello.c, four distinct phases happen in sequence:
# Let's trace the compilation of a trivial C program
# Create a simple program
cat > /tmp/hello.c << 'EOF'
#include <stdio.h>
#define GREETING "Hello"
int main() {
printf("%s, World!\n", GREETING);
return 0;
}
EOF
# Phase 1: Preprocessing — expands macros, includes, #ifdef
# Output: hello.i (preprocessed C, no macros)
gcc -E /tmp/hello.c -o /tmp/hello.i
wc -l /tmp/hello.i # ~800 lines — stdio.h expanded!
# Phase 2: Compilation — C source to assembly (.s)
gcc -S /tmp/hello.i -o /tmp/hello.s
cat /tmp/hello.s # Human-readable assembly instructions
# Phase 3: Assembly — assembly (.s) to machine code object file (.o)
gcc -c /tmp/hello.s -o /tmp/hello.o
file /tmp/hello.o # "ELF 64-bit LSB relocatable object"
nm /tmp/hello.o # List symbols: U printf (undefined), T main (text)
# Phase 4: Linking — combine .o files + libraries into executable
gcc /tmp/hello.o -o /tmp/hello
file /tmp/hello # "ELF 64-bit LSB pie executable"
Object Files and Symbols
Object files (.o) are the output of the assembly phase (phase 3 above). Each source file compiles to one object file. An object file contains:
- Machine code for the functions defined in that source file
- Data for global and static variables
- A symbol table listing all symbols the file defines and all symbols it references but doesn't define
- Relocation entries — placeholders for addresses that can't be resolved until linking
The linker's job is to take multiple object files and libraries, resolve all undefined symbol references, assign final memory addresses, and produce a single executable (or shared library).
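The linker's core bookkeeping can be sketched in a few lines. The model below is a toy, not how ld is implemented: each "object file" is a plain dict, the base address 0x401000 is a hypothetical choice, and relocation patching is left out entirely. It only shows the two steps named above, assigning addresses to defined symbols and checking that every reference resolves.

```python
BASE_ADDRESS = 0x401000  # hypothetical load address for the toy image

def link(objects):
    """Assign addresses to defined symbols, then verify all references resolve.

    Each object is a dict: {"name": str, "defines": {symbol: size_in_bytes},
    "references": [symbol, ...]}. Returns a symbol -> address table.
    """
    addresses = {}
    cursor = BASE_ADDRESS
    # Pass 1: lay out every defined symbol one after another
    for obj in objects:
        for symbol, size in obj["defines"].items():
            if symbol in addresses:
                raise ValueError(f"duplicate symbol: {symbol}")
            addresses[symbol] = cursor
            cursor += size
    # Pass 2: every referenced symbol must now have an address
    for obj in objects:
        for symbol in obj["references"]:
            if symbol not in addresses:
                raise ValueError(f"undefined reference to '{symbol}' in {obj['name']}")
    return addresses

hello_o = {"name": "hello.o", "defines": {"main": 32}, "references": ["printf"]}
libc_a = {"name": "libc.a", "defines": {"printf": 128}, "references": []}

table = link([hello_o, libc_a])
print({symbol: hex(addr) for symbol, addr in table.items()})
```

Linking hello_o on its own raises an "undefined reference" error, which mirrors the message a real linker prints when a library is missing from the command line.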
The ELF Binary Format
On Linux (and most Unix-like systems), executables, object files, and shared libraries all use the ELF format (Executable and Linkable Format). Understanding ELF is understanding the "native language" that the OS uses to load and run programs.
An ELF file has three main structural elements:
- ELF Header: at offset 0, always starts with the magic bytes 0x7f E L F. Describes the file type (executable, shared library, object file), architecture (x86-64, ARM64), entry point address, and locations of the segment/section tables.
- Program Header Table (Segments): used by the OS loader. Describes which parts of the file should be mapped into memory, at what addresses, with what permissions (read/write/execute).
- Section Header Table (Sections): used by the linker and debugger. Describes fine-grained divisions of the file content.
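The ELF header fields listed above sit at fixed offsets, so they can be read with a single struct unpack. This is a minimal sketch that assumes a 64-bit little-endian file and decodes only a handful of type and machine values; parse_elf_header and the sample header values are illustrative, not taken from a real binary.

```python
import struct

# ELF64 file header: 16-byte e_ident followed by 13 fixed-width fields (64 bytes total)
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
ELF_TYPES = {1: "REL (object file)", 2: "EXEC (executable)",
             3: "DYN (shared object or PIE)", 4: "CORE"}
MACHINES = {0x3E: "x86-64", 0xB7: "AArch64"}

def parse_elf_header(data: bytes) -> dict:
    if len(data) < ELF64_HEADER.size or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:  # EI_CLASS = 64-bit, EI_DATA = little-endian
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    (_ident, e_type, e_machine, _version, e_entry, e_phoff, e_shoff, _flags,
     _ehsize, _phentsize, e_phnum, _shentsize, e_shnum, _shstrndx) = ELF64_HEADER.unpack_from(data)
    return {
        "type": ELF_TYPES.get(e_type, hex(e_type)),
        "machine": MACHINES.get(e_machine, hex(e_machine)),
        "entry": e_entry,
        "program_headers": e_phnum,
        "section_headers": e_shnum,
    }

# A synthetic header so the example runs anywhere; the values are made up
sample = ELF64_HEADER.pack(b"\x7fELF\x02\x01\x01" + bytes(9),
                           3, 0x3E, 1, 0x1060, 64, 0, 0, 64, 56, 13, 64, 31, 30)
info = parse_elf_header(sample)
print(info["type"], info["machine"], hex(info["entry"]))  # DYN (shared object or PIE) x86-64 0x1060
```

On a Linux machine, passing open("/tmp/demo", "rb").read(64) instead of the synthetic sample should report the same fields that readelf -h prints.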
# Inspect an ELF binary with readelf and objdump
# First, create a simple binary to examine
cat > /tmp/demo.c << 'EOF'
#include <stdio.h>
int global_var = 42;
const char* message = "hello";
int main() { printf("%s %d\n", message, global_var); return 0; }
EOF
gcc -o /tmp/demo /tmp/demo.c
# View the ELF header — magic bytes, type, architecture, entry point
readelf -h /tmp/demo | head -20
# View program headers (segments) — how the OS maps the binary into memory
readelf -l /tmp/demo
# View section headers — .text, .data, .bss, .rodata etc.
readelf -S /tmp/demo
# View the symbol table — all named symbols with their addresses
readelf -s /tmp/demo | grep -E "FUNC|OBJECT"
# Disassemble the .text section (machine code -> assembly)
objdump -d /tmp/demo | head -40
Key ELF Sections
| Section | Contents | Permissions | Example |
|---|---|---|---|
| .text | Compiled machine code | Read + Execute | Your function bodies |
| .rodata | Read-only data | Read only | String literals ("Hello"), const arrays |
| .data | Initialised global/static vars | Read + Write | int global = 42; |
| .bss | Uninitialised global/static vars | Read + Write (zeroed) | int counter; (0 at startup) |
| .plt | Procedure Linkage Table | Read + Execute | Stubs for dynamic library calls |
| .got | Global Offset Table | Read + Write (then Read) | Resolved addresses for shared lib functions |
| .dynamic | Dynamic linking info | Read | Required shared libs (NEEDED entries) |
| .symtab | Symbol table | Read | Function/variable names and addresses |
| .debug_* | Debug information (DWARF) | Read | Source-line mappings (stripped in release builds) |
A large zero-initialised array (e.g., static int counters[1000000]) would waste 4 MB in the binary file if stored literally. Instead, .bss records only the size of the region, and the kernel supplies zero-filled memory pages when loading the process. A 4 MB zeroed array adds only a few bytes to the binary on disk.
Static vs Dynamic Linking
When your program calls printf(), where does that code come from? The answer depends on how the binary was linked.
Static Linking — Self-Contained Binaries
In static linking, all library code your program uses is copied directly into the final binary by the linker. The result is a self-contained executable that doesn't depend on any external library at runtime.
# Static vs dynamic binary comparison
cat > /tmp/static_test.c << 'EOF'
#include <stdio.h>
int main() { printf("hello\n"); return 0; }
EOF
# Dynamic binary (default — links against shared libc)
gcc -o /tmp/hello_dynamic /tmp/static_test.c
ls -lh /tmp/hello_dynamic # ~16 KB
# Static binary (includes all library code)
gcc -static -o /tmp/hello_static /tmp/static_test.c
ls -lh /tmp/hello_static # ~800 KB — all of libc included!
# See what shared libraries a dynamic binary requires
ldd /tmp/hello_dynamic
# linux-vdso.so.1, libc.so.6, /lib64/ld-linux-x86-64.so.2
ldd /tmp/hello_static
# statically linked (no shared lib dependencies)
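One way to see the difference without ldd is to look for a PT_INTERP program header, which names the dynamic linker a binary asks for; a fully static binary has none. The sketch below assumes a 64-bit little-endian ELF image, and both interpreter_path and the fake_elf helper (which builds a synthetic image so the example runs anywhere) are hypothetical names.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header (64 bytes)
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header (56 bytes)
PT_INTERP = 3                              # segment type holding the dynamic linker path

def interpreter_path(data: bytes):
    """Return the requested dynamic linker path, or None if there is no PT_INTERP."""
    if data[:6] != b"\x7fELF\x02\x01":
        raise ValueError("this sketch handles only 64-bit little-endian ELF")
    fields = EHDR.unpack_from(data)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for index in range(e_phnum):
        p_type, _flags, p_offset, _vaddr, _paddr, p_filesz, _memsz, _align = \
            PHDR.unpack_from(data, e_phoff + index * e_phentsize)
        if p_type == PT_INTERP:
            return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None

def fake_elf(interp):
    """Build a tiny synthetic ELF image, with or without a PT_INTERP segment."""
    ident = b"\x7fELF\x02\x01\x01" + bytes(9)
    if interp is None:
        return EHDR.pack(ident, 2, 0x3E, 1, 0, 0, 0, 0, 64, 56, 0, 64, 0, 0)
    blob = interp.encode() + b"\0"
    header = EHDR.pack(ident, 3, 0x3E, 1, 0, 64, 0, 0, 64, 56, 1, 64, 0, 0)
    phdr = PHDR.pack(PT_INTERP, 4, 64 + 56, 0, 0, len(blob), len(blob), 1)
    return header + phdr + blob

print(interpreter_path(fake_elf("/lib64/ld-linux-x86-64.so.2")))  # /lib64/ld-linux-x86-64.so.2
print(interpreter_path(fake_elf(None)))                           # None
```

Feeding it the bytes of /tmp/hello_dynamic and /tmp/hello_static from the commands above should give the ld-linux path for the first and None for the second.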
Static Linking and Scratch Containers
When you see a Docker image using FROM scratch (the empty base image), the application binary inside must be statically linked, because there's no C library or any other shared library in the image to link against at runtime. Go programs compile to statically linked binaries by default on Linux when cgo is not in use, which is why Go is so popular for containerised microservices. A Go web server can run in a Docker image that's literally just the binary: 5-15 MB total image size.
Languages like Python or Java can't use FROM scratch directly because they depend on the interpreter/JVM and its dependencies at runtime. This is why Python container images are typically 100 MB or more, even with python:slim, while Go images can be under 10 MB.
Dynamic Linking and Shared Libraries
In dynamic linking, the library code is not copied into the binary. Instead, the binary stores a reference to the library (e.g., libc.so.6), and the dynamic linker loads the library at runtime when the program starts.
Advantages of dynamic linking:
- Smaller binaries (the library exists once on disk, not in every binary)
- Shared memory: if 50 processes use libc, the OS maps the same physical memory pages into all 50 address spaces, so the library's code exists only once in RAM
- Library updates without recompiling applications (e.g., a security patch to libc benefits every program the next time it starts)
Disadvantages:
- "Dependency hell": if the required library version isn't installed, the program fails to start
- Slightly slower first call to each library function (resolved lazily at runtime)
- Security attack surface (LD_PRELOAD injection, rpath manipulation)
# Explore shared library dependencies
# List all shared libraries a binary needs
ldd /usr/bin/python3
# Shows: libpython3.x.so, libm.so, libz.so, libc.so, ld-linux...
# See which library provides a specific symbol
# (useful when debugging "symbol not found" errors)
nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep " printf"
# Check if a library is already loaded in memory (shared between processes)
# Look at /proc/PID/maps for a running process
cat /proc/self/maps | grep "\.so"
# Inspect NEEDED entries (required libs) in an ELF binary
readelf -d /usr/bin/python3 | grep NEEDED
PLT, GOT, and Lazy Binding
When dynamically linked code calls an external function (like printf), the address of that function isn't known at compile time — it depends on where libc.so is loaded in memory. The dynamic linker solves this via two data structures in every dynamically linked binary:
- PLT (Procedure Linkage Table): Stubs in the .plt section. When your code calls printf, it actually calls a PLT stub.
- GOT (Global Offset Table): A table of addresses in the .got.plt section. Initially, each GOT entry points back into the PLT (the resolver). After the first call, the entry is overwritten with the actual address of printf in libc.
This is lazy binding — function addresses are resolved on first call, not all upfront at program start. The tradeoff: program startup is faster (don't resolve all symbols immediately), but the first call to each library function pays a small overhead for resolution.
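The PLT/GOT handshake can be modelled with a dictionary of callables. This is a simulation of the idea only: a Python dict stands in for the GOT, a closure stands in for the resolver, and names such as LIBRARY, plt_call, and make_resolver are invented for the sketch.

```python
# Stand-in for libc: symbol name -> the "real" function
LIBRARY = {"printf": lambda text: f"printed: {text}"}

got = {}             # Global Offset Table: symbol -> callable currently bound
resolver_calls = []  # records every trip through the slow resolution path

def make_resolver(symbol):
    """Initial GOT entry: look the symbol up, patch the GOT, then forward the call."""
    def resolve_then_call(*args):
        resolver_calls.append(symbol)
        target = LIBRARY[symbol]   # the lookup the dynamic linker would perform
        got[symbol] = target       # overwrite the GOT entry with the real address
        return target(*args)
    return resolve_then_call

def plt_call(symbol, *args):
    """PLT stub: jump through whatever the GOT entry currently holds."""
    return got[symbol](*args)

got["printf"] = make_resolver("printf")  # state at program start under lazy binding

print(plt_call("printf", "first call"))   # goes through the resolver
print(plt_call("printf", "second call"))  # goes straight to the real function
print(resolver_calls)                     # ['printf']
```

Eager binding would correspond to filling got with the entries from LIBRARY before the first call, so the resolver never runs after startup.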
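Because the GOT is writable while lazy binding is in use, an attacker who can write to arbitrary memory can overwrite a GOT entry and redirect printf() to any function they choose, including one that spawns a shell. Modern mitigations include partial RELRO (RELocation Read-Only), which makes the non-PLT part of the GOT read-only once startup relocations have been applied, and Full RELRO, which resolves all symbols eagerly at startup (eliminating lazy binding) and makes the entire GOT read-only.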
The Loader — From File to Process
When you run a program (e.g., type ./hello in a shell), the shell calls the execve() system call. This is the moment a file on disk becomes a running process. Here's what happens:
sequenceDiagram
participant SH as Shell
participant KN as Kernel
participant LD as ld-linux.so (Dynamic Linker)
participant LB as Shared Libraries (libc, etc.)
participant PR as Program's main()
SH->>KN: execve("./hello", argv, envp)
KN->>KN: Read ELF header, check magic bytes
KN->>KN: Map ELF segments into new address space<br/>(LOAD segments → mmap with correct permissions)
KN->>LD: Map ld-linux.so (the dynamic linker)<br/>and jump to its entry point
LD->>LD: Parse .dynamic section<br/>Find all NEEDED shared libraries
LD->>LB: Load each shared library (mmap the .so files)
LD->>LD: Resolve all symbol references<br/>Update GOT entries with real addresses
LD->>LD: Run library initialisation code<br/>(.init_array, __libc_csu_init)
LD->>PR: Jump to program entry point<br/>(which calls main())
PR->>PR: Your code runs!
# See the dynamic linker in action
# The LD_DEBUG environment variable enables verbose dynamic linker output
# Show all libraries being loaded
LD_DEBUG=libs /tmp/hello_dynamic 2>&1 | head -20
# Show all symbol bindings being resolved
LD_DEBUG=bindings /tmp/hello_dynamic 2>&1 | head -20
# Show the entire loading process
LD_DEBUG=all /tmp/hello_dynamic 2>&1 | head -50
# Measure program startup overhead — how long before main() is called?
# Use LD_DEBUG=statistics to see linker timing
LD_DEBUG=statistics /tmp/hello_dynamic 2>&1
You may have noticed linux-vdso.so.1 in the ldd output. It doesn't exist as a file on disk. The vDSO (virtual Dynamic Shared Object) is a small shared library that the kernel maps directly into every process's address space. It provides fast implementations of certain frequently called syscalls (like clock_gettime()) that run without crossing the user/kernel boundary, reducing the cost from roughly 100 ns to roughly 10 ns.
Process Memory Layout
Once the loader has done its work, the process's virtual address space has a well-defined layout. On 64-bit Linux (x86-64 with 4-level paging), user space spans 128 TiB of virtual addresses, but only small portions are actually mapped:
flowchart TD
A["Kernel Space<br/>Top of address space<br/>Not accessible from user space<br/>(Page fault if accessed)"]
B["Stack<br/>Grows downward<br/>Local variables, return addresses, saved registers<br/>Default 8 MB limit (ulimit -s)"]
C["Memory-Mapped Region<br/>Shared libraries mapped here<br/>mmap() allocations<br/>(libc.so, libpthread.so, etc.)"]
D["Heap<br/>Grows upward<br/>Dynamic allocations: malloc(), new<br/>Managed by the allocator (glibc malloc, jemalloc)"]
E[".bss<br/>Uninitialised global/static vars<br/>(zero-filled by OS)"]
F[".data<br/>Initialised global/static variables<br/>int x = 42;"]
G[".rodata<br/>Read-only data: string literals, const arrays<br/>Attempting write → SIGSEGV"]
H[".text<br/>Program machine code<br/>Read + Execute only<br/>Attempting write → SIGSEGV (W^X protection)"]
A --- B --- C --- D --- E --- F --- G --- H
style A fill:#132440,color:#fff
style B fill:#BF092F,color:#fff
style C fill:#16476A,color:#fff
style D fill:#3B9797,color:#fff
style E fill:#16476A,color:#fff
style F fill:#3B9797,color:#fff
style G fill:#16476A,color:#fff
style H fill:#132440,color:#fff
# Inspect the memory map of a running process
# Start a long-running process
python3 -c "import time; time.sleep(60)" &
PID=$!
# View its memory map: addr range, permissions, offset, device, inode, name
cat /proc/$PID/maps
# Summarise memory usage by category
cat /proc/$PID/smaps_rollup
# See the same information with human-readable sizes
pmap -x $PID | tail -20
kill $PID
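Each line of /proc/PID/maps has the same six fields described in the comment above, so it is easy to parse programmatically. The sketch below works on a hard-coded sample so it runs on any platform; the addresses, inode, and path in SAMPLE are made up, and Mapping and parse_maps_line are illustrative names.

```python
from typing import NamedTuple

class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    device: str
    inode: int
    path: str

def parse_maps_line(line: str) -> Mapping:
    """Parse one line in /proc/PID/maps format: range perms offset dev inode [path]."""
    fields = line.split(maxsplit=5)
    address_range, perms, offset, device, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""  # anonymous mappings have no path
    start, end = (int(part, 16) for part in address_range.split("-"))
    return Mapping(start, end, perms, int(offset, 16), device, int(inode), path)

# Hypothetical sample lines; on Linux, read open("/proc/self/maps") instead
SAMPLE = """\
55d0c8a00000-55d0c8a01000 r-xp 00001000 08:01 1234 /tmp/hello
7f2a10000000-7f2a10021000 rw-p 00000000 00:00 0
7ffc1b2e0000-7ffc1b301000 rw-p 00000000 00:00 0 [stack]
"""

mappings = [parse_maps_line(line) for line in SAMPLE.splitlines()]
for m in mappings:
    size_kib = (m.end - m.start) // 1024
    print(f"{m.path or '[anonymous]':12} {m.perms} {size_kib} KiB")
```

The permission string lines up with the layout diagram: r-xp for code, rw-p for data, heap, and stack.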
ASLR and Memory Randomisation
ASLR (Address Space Layout Randomisation) is a security feature that randomises the base addresses of the stack, heap, and shared library mappings every time a program is launched. This makes it much harder for an attacker exploiting a buffer overflow to predict the address of their shellcode or of the library code they want to jump to.
# Check if ASLR is enabled on your Linux system
# 0 = disabled, 1 = partial, 2 = full ASLR
cat /proc/sys/kernel/randomize_va_space # Should be 2
# Observe ASLR in action: the address of a heap-allocated object changes each run
for i in $(seq 1 5); do
python3 -c "import ctypes; print(hex(id(ctypes.c_int(1))))"
done
# You'll see different addresses each time
# For testing, temporarily disable ASLR for one process
# (older setarch versions require the architecture argument)
setarch "$(uname -m)" -R python3 -c "import ctypes; print(hex(id(ctypes.c_int(1))))"
setarch "$(uname -m)" -R python3 -c "import ctypes; print(hex(id(ctypes.c_int(1))))"
# Typically the same address both times
Runtime Environments
Not all programs are compiled AOT to native machine code. Many popular languages use a runtime environment — a managed execution engine that interprets or JIT-compiles the program while handling memory management, type checking, and other services.
JIT Compilation — The Best of Both Worlds
Modern runtimes like V8 (JavaScript), the JVM (Java/Kotlin), and PyPy (Python) use Just-in-Time compilation to close the performance gap with native code:
flowchart LR
JS["JavaScript Source"] --> Parse["Parser<br/>(AST generation)"]
Parse --> Ignition["Ignition Interpreter<br/>(fast startup, generates bytecode)"]
Ignition -->|"Hot function detected<br/>(called many times)"| TurboFan
TurboFan["TurboFan JIT Compiler<br/>(optimised machine code)"] -->|"Type assumption violated<br/>(deoptimisation)"| Ignition
TurboFan --> Native["Native Machine Code<br/>(near C-speed performance)"]
style JS fill:#f8f9fa,stroke:#3B9797
style Ignition fill:#3B9797,color:#fff
style TurboFan fill:#BF092F,color:#fff
style Native fill:#132440,color:#fff
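The tiering loop in the diagram (count calls, promote hot functions, fall back when a type assumption breaks) can be modelled in plain Python. This is a behavioural sketch only: nothing is actually compiled, the "specialised" path is just another Python function, and HOT_THRESHOLD and TieredFunction are invented for the example rather than taken from V8.

```python
HOT_THRESHOLD = 3  # hypothetical; real engines use far richer heuristics

class TieredFunction:
    def __init__(self, generic, specialised, guard):
        self.generic = generic          # slow path that handles any input ("interpreter")
        self.specialised = specialised  # fast path valid only while guard holds ("JIT output")
        self.guard = guard              # the type assumption the fast path relies on
        self.calls = 0
        self.optimised = False
        self.deopts = 0

    def __call__(self, *args):
        if self.optimised:
            if self.guard(*args):
                return self.specialised(*args)
            # Assumption violated: throw away the fast path and start counting again
            self.optimised = False
            self.deopts += 1
            self.calls = 0
        elif self.guard(*args):
            # Only calls that match the assumption count towards "hot"
            self.calls += 1
            self.optimised = self.calls >= HOT_THRESHOLD
        return self.generic(*args)

add = TieredFunction(
    generic=lambda a, b: a + b,
    specialised=lambda a, b: a + b,  # imagine integer-only machine code here
    guard=lambda a, b: type(a) is int and type(b) is int,
)

for _ in range(4):
    add(1, 2)
print(add.optimised)                               # True
print(add("a", "b"), add.optimised, add.deopts)    # ab False 1
```

The string call still returns the right answer because deoptimisation always falls back to the generic path, which is the property that lets a JIT speculate safely.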
CPython Is Not "Interpreted Python" — It's a Bytecode VM
Many developers think Python is "interpreted" in the sense that the source code is read and executed line-by-line like a script. This is not accurate. CPython:
- Compiles your .py file to CPython bytecode (cached as .pyc files in __pycache__/ for imported modules)
- The CPython interpreter, itself a C program compiled to a native binary, executes this bytecode in a loop
- Each bytecode instruction (LOAD_FAST, CALL, etc.; exact opcode names vary between CPython versions) dispatches to C code inside the interpreter
This is why NumPy operations are "fast Python" — NumPy functions are implemented in C and called via the bytecode VM with minimal overhead. The slow part of Python is not the bytecode loop itself but the overhead per Python object operation (type checking, reference counting, dict lookups for attributes).
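The "bytecode executed in a loop" idea fits in a few lines. The toy VM below borrows opcode names in the style of CPython's but is otherwise invented for illustration: it is a stack machine with five instructions, not a model of CPython's real instruction set or object model.

```python
def run(bytecode, local_vars):
    """Execute a list of (opcode, argument) pairs on a simple value stack."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(bytecode):
        opcode, arg = bytecode[pc]
        pc += 1
        if opcode == "LOAD_FAST":        # push a local variable
            stack.append(local_vars[arg])
        elif opcode == "LOAD_CONST":     # push a constant
            stack.append(arg)
        elif opcode == "BINARY_MULTIPLY":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "BINARY_ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "RETURN_VALUE":   # pop the result and stop
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    raise RuntimeError("bytecode ended without RETURN_VALUE")

# Toy bytecode for: return n * 2 + 1
program = [
    ("LOAD_FAST", "n"),
    ("LOAD_CONST", 2),
    ("BINARY_MULTIPLY", None),
    ("LOAD_CONST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(run(program, {"n": 20}))  # 41
```

Every iteration pays for the opcode comparison and the stack traffic, which gives a feel for why per-operation overhead, rather than the loop itself, dominates in a bytecode VM.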
# Inspect CPython bytecode for a simple function
python3 -c "
import dis
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
# Disassemble the function to CPython bytecode
dis.dis(factorial)
"
# Output shows opcodes such as LOAD_FAST (load local var), COMPARE_OP, a conditional jump, BINARY_OP, etc.
# (exact names vary between CPython versions)
# CPython caches the bytecode of imported modules as .pyc files in __pycache__/
# Find a .pyc file written by the same Python version as python3 and disassemble it
find /usr/lib/python3* -name "*.pyc" -size +10k | head -1 | xargs python3 -c "
import sys, marshal, dis, struct
path = sys.argv[1]
with open(path, 'rb') as f:
    magic = f.read(16) # magic + flags + timestamp + size
    code = marshal.loads(f.read())
print('Bytecode for:', path)
dis.dis(code)
" 2>/dev/null | head -30
Exercises
Exercise 1 — Explore Your Own Binary
# Compile a C program and explore its ELF structure
cat > /tmp/explore.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
char global_init[] = "I am in .data";
char global_uninit[100]; // will be in .bss
const char* literal = "I am in .rodata";
int main(int argc, char* argv[]) {
char* heap = malloc(100); // heap allocation
printf("Stack var address: %p\n", &argc);
printf("Heap alloc address: %p\n", heap);
printf("Text (main) address: %p\n", (void*)main);
printf(".data address: %p\n", global_init);
printf(".rodata address: %p\n", literal);
printf(".bss address: %p\n", global_uninit);
free(heap);
return 0;
}
EOF
gcc -o /tmp/explore /tmp/explore.c
/tmp/explore
# Observe how addresses match the virtual memory layout diagram
Exercise 2 — Static vs Dynamic Size
# Compare sizes and startup time of static vs dynamic binary
cat > /tmp/bench.c << 'EOF'
#include <stdio.h>
#include <time.h>
int main() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
printf("%ld.%09ld\n", ts.tv_sec, ts.tv_nsec);
return 0;
}
EOF
gcc -O2 -o /tmp/bench_dynamic /tmp/bench.c
gcc -O2 -static -o /tmp/bench_static /tmp/bench.c
echo "=== File sizes ==="
ls -lh /tmp/bench_dynamic /tmp/bench_static
echo "=== Startup times (100 runs each) ==="
# Each run prints an absolute clock reading, so discard it and time the whole loop (bash's time keyword)
time (for i in $(seq 1 100); do /tmp/bench_dynamic > /dev/null; done)
time (for i in $(seq 1 100); do /tmp/bench_static > /dev/null; done)
Exercise 3 — Library Interception with LD_PRELOAD
# LD_PRELOAD lets you inject a shared library before all others
# This is used legitimately (e.g., tcmalloc, mimalloc for better malloc)
# and maliciously (rootkit-style hooking)
# Here we safely observe how it works
# Create a shared library that wraps malloc
cat > /tmp/track_malloc.c << 'EOF'
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>
void* malloc(size_t size) {
static void* (*real_malloc)(size_t) = NULL;
if (!real_malloc) real_malloc = dlsym(RTLD_NEXT, "malloc");
void* ptr = real_malloc(size);
fprintf(stderr, "malloc(%zu) = %p\n", size, ptr);
return ptr;
}
EOF
gcc -shared -fPIC -o /tmp/track_malloc.so /tmp/track_malloc.c -ldl
# Use LD_PRELOAD to inject our tracking malloc into any program
LD_PRELOAD=/tmp/track_malloc.so python3 -c "x = [1,2,3]" 2>&1 | head -10
Conclusion & Next Steps
You now understand the complete journey from source code to a running process. The key concepts to carry forward:
- Compilation is four phases: preprocessing, compilation, assembly, linking, each producing an intermediate artefact. The -E, -S, and -c flags stop GCC at each phase.
- ELF is the universal binary format on Linux: readelf and objdump let you inspect any binary. Every Docker image is full of ELF files.
- Dynamic linking saves memory but adds complexity: shared libraries, the dynamic linker (ld-linux.so), PLT/GOT, and the RELRO security hardening all exist to manage this.
- The process memory layout is deterministic (modulo ASLR): text at the bottom, data/bss above, heap grows up, stack grows down, shared libraries somewhere in the middle.
- Runtimes like CPython and the JVM are themselves native ELF binaries that implement a virtual machine: Python bytecode runs inside a C program.