Virtual Memory
Every process believes it has exclusive access to a large, contiguous address space: up to 128 TB of user space on x86-64 Linux with 4-level page tables. This is an illusion created by virtual memory. The CPU works with virtual addresses; the memory management unit (MMU) hardware translates them to physical addresses (actual RAM locations) using page tables maintained by the kernel.
Page Tables & the TLB
Memory is divided into fixed-size units called pages (typically 4 KB). Each process has a page table — a multi-level data structure maintained by the kernel that maps virtual page numbers to physical frame numbers.
flowchart LR
VA["Virtual Address\n0x7fff0001a4c0"]
TLB["TLB — Translation Lookaside Buffer\n(CPU-internal cache of recent translations)"]
PT["Page Table Walk\n(4-level on x86-64: PGD→PUD→PMD→PTE)"]
PA["Physical Address\n0x00000004a4c0"]
RAM["Physical RAM"]
VA -->|"CPU asks: what's the physical address?"| TLB
TLB -->|"Cache hit (~1 ns)"| PA
TLB -->|"Cache miss — walk page tables (~20-100 ns)"| PT
PT --> PA
PA --> RAM
style TLB fill:#3B9797,color:#fff
style PT fill:#BF092F,color:#fff
style PA fill:#132440,color:#fff
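The walk in the diagram can be sketched in a few lines of Python. This is a minimal model, assuming 4 KB pages and the 9-bits-per-level split used by 4-level x86-64 paging; the function names, the flat dictionary standing in for the page table, and the frame number 0x4a are illustrative assumptions, not kernel structures.

```python
PAGE_SHIFT = 12          # 4 KB pages: the low 12 bits are the offset within the page
INDEX_BITS = 9           # each table level is indexed by 9 bits (512 entries)
LEVELS = ("PGD", "PUD", "PMD", "PTE")


def split_virtual_address(va: int) -> dict:
    """Split a 48-bit virtual address into per-level indices and a page offset."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    vpn = va >> PAGE_SHIFT
    indices = {}
    # The PTE index is the lowest 9 bits of the page number, the PGD index the highest.
    for i, level in enumerate(reversed(LEVELS)):
        indices[level] = (vpn >> (i * INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return {"vpn": vpn, "offset": offset,
            "indices": {lvl: indices[lvl] for lvl in LEVELS}}


def translate(va: int, page_table: dict) -> int:
    """Look up the frame for va's page and re-attach the unchanged offset."""
    parts = split_virtual_address(va)
    frame = page_table[parts["vpn"]]      # a KeyError here models a page fault
    return (frame << PAGE_SHIFT) | parts["offset"]


if __name__ == "__main__":
    va = 0x7FFF0001A4C0
    parts = split_virtual_address(va)
    print(parts["indices"], hex(parts["offset"]))
    # Hypothetical single-entry table mapping this page to frame 0x4a.
    print(hex(translate(va, {parts["vpn"]: 0x4A})))
```

Note that translation only replaces the page number: the low 12 bits of the virtual and physical address are always identical.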
The TLB (Translation Lookaside Buffer) is a small hardware cache inside the CPU that stores recent virtual-to-physical translations. A TLB hit takes ~1 ns; a TLB miss requires walking the page table (~20-100 ns). TLB efficiency depends on working set size — if a process accesses widely scattered memory, TLB thrashing kills performance.
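To see why access pattern matters, here is a small simulation of a TLB as an LRU cache of page numbers. It is a sketch under simplifying assumptions: a single fully associative TLB with 64 entries and 4 KB pages. Real TLBs are multi-level and set-associative, so only the trend carries over, not the exact hit rates.

```python
from collections import OrderedDict
import random

PAGE_SIZE = 4096


def tlb_hit_rate(addresses, tlb_entries=64):
    """Replay addresses through a fully associative LRU TLB model; return the hit rate."""
    tlb = OrderedDict()          # page number -> None, ordered by recency of use
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        page = addr // PAGE_SIZE
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)        # mark as most recently used
        else:
            tlb[page] = None             # miss: "walk the page table", then cache
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


if __name__ == "__main__":
    n = 100_000
    sequential = [i * 8 for i in range(n)]                  # 8-byte stride
    rng = random.Random(0)
    scattered = [rng.randrange(1 << 30) for _ in range(n)]  # random over 1 GB
    print(f"sequential: {tlb_hit_rate(sequential):.3f}")
    print(f"scattered:  {tlb_hit_rate(scattered):.3f}")
```

The sequential scan misses only once per page, while the scattered pattern touches far more pages than the model can cache and misses on almost every access.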
# View the memory map of a process (virtual address layout)
# (Use your shell's PID $$)
cat /proc/$$/maps | head -20
# Format: start-end perms offset dev inode path
# r=read, w=write, x=execute, p=private, s=shared
# Count total virtual address space regions
wc -l /proc/$$/maps
# See where libraries are mapped
cat /proc/$$/maps | grep "\.so\." | awk '{print $6}' | sort -u
# View memory statistics for a process
cat /proc/$$/status | grep -E "VmPeak|VmRSS|VmSize|VmSwap"
# VmSize = total virtual address space
# VmRSS = Resident Set Size (pages actually in RAM)
# VmPeak = peak virtual memory usage
Page Faults
A page fault occurs when a process accesses a virtual address that has no valid mapping to a physical page, or whose mapping forbids the access (for example, a write to a copy-on-write page). The CPU raises an exception, the kernel's fault handler runs, and the fault falls into one of three cases:
- Minor page fault: The page is in RAM but not mapped (e.g., first access to a newly allocated page, or a copy-on-write trigger). Fast to handle — just map it.
- Major page fault: The page is on disk (swapped out or from a file). Requires disk I/O — very slow (milliseconds). This is why "paging" is bad for performance.
- Invalid access: The address is genuinely invalid (null pointer dereference, buffer overflow). The kernel sends SIGSEGV to the process.
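Minor faults are easy to provoke from user space. The sketch below maps anonymous memory with Python's mmap module and writes one byte to each page, reading the process's minor-fault counter before and after via resource.getrusage. The function name is an illustrative choice, and the reported count depends on the kernel, huge page settings, and the environment (some sandboxes do not report fault counts at all).

```python
import mmap
import resource


def minor_faults_touching(num_pages: int) -> int:
    """Return the minor-fault count incurred while first writing to num_pages pages."""
    page = mmap.PAGESIZE
    region = mmap.mmap(-1, num_pages * page)   # anonymous mapping, not yet backed by RAM
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    for i in range(num_pages):
        region[i * page] = 1                   # first write to each page
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return after - before


if __name__ == "__main__":
    pages = 4096
    print(f"touched {pages} pages, minor faults: {minor_faults_touching(pages)}")
```

On a typical Linux system the count is close to the number of pages touched, because the kernel assigns a physical frame only on the first access to each page.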
# Count page faults for a process
# min_flt = minor faults, maj_flt = major faults
# (/proc/PID/status has no fault counters; ps reads them from /proc/PID/stat)
ps -o pid,min_flt,maj_flt,comm -p $$
# See page fault counts while running a program
/usr/bin/time -v ls /tmp 2>&1 | grep -E "Major|Minor|page"
# For a running process
ps -o pid,min_flt,maj_flt,comm -p "$(pgrep python3 | head -1)" 2>/dev/null
Memory Allocation
Heap: brk() and mmap()
Dynamic memory allocation (malloc/new) comes from the heap. The kernel provides two system calls for expanding the heap:
- brk()/sbrk(): Moves the "program break" (the end of the heap segment) upward, extending the heap. Simple and fast but only grows contiguously.
- mmap(MAP_ANONYMOUS): Maps new anonymous memory pages anywhere in the address space. Used for large allocations (typically >128 KB by glibc malloc). More flexible: each mapping can be unmapped individually.
# See which syscalls a Python memory allocation makes
strace -e trace=brk,mmap python3 -c "x = [0] * 1000000" 2>&1 | grep -E "brk|mmap" | head -10
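# Demonstrate the heap vs mmap threshold in glibc
python3 -c "
import ctypes
libc = ctypes.CDLL(None)  # the already-loaded C library (glibc assumed)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
small = libc.malloc(1024)             # small: normally carved from the brk heap
large = libc.malloc(4 * 1024 * 1024)  # above the mmap threshold: its own anonymous mapping
print(f'small=0x{small:x}  large=0x{large:x}')
# The small pointer should fall inside the [heap] range, the large one outside it
print(''.join(l for l in open('/proc/self/maps') if '[heap]' in l), end='')
"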
glibc malloc Internals
When you call malloc(), you're not calling the kernel — you're calling glibc's allocator, which manages a pool of memory obtained from the kernel via brk()/mmap(). The allocator maintains free lists, coalesces adjacent free blocks, and only goes to the kernel when its current pool is exhausted.
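The free-list idea can be illustrated with a toy allocator. This is a deliberately simplified sketch (first-fit over a fixed pool, with coalescing on free); the class and method names are made up for illustration, and glibc's real allocator uses bins, arenas, and in-band chunk headers instead.

```python
class FreeListAllocator:
    """Toy first-fit allocator over a fixed pool, tracking free (offset, size) blocks."""

    def __init__(self, pool_size: int):
        self.free = [(0, pool_size)]     # free blocks, kept sorted by offset
        self.used = {}                   # offset -> size of live allocations

    def malloc(self, size: int):
        for i, (off, blk) in enumerate(self.free):
            if blk >= size:              # first fit
                if blk == size:
                    self.free.pop(i)
                else:                    # split: keep the tail on the free list
                    self.free[i] = (off + size, blk - size)
                self.used[off] = size
                return off
        return None                      # a real allocator would now call brk()/mmap()

    def free_block(self, off: int):
        size = self.used.pop(off)
        self.free.append((off, size))
        self.free.sort()
        merged = []
        for o, s in self.free:           # coalesce blocks that touch
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + s)
            else:
                merged.append((o, s))
        self.free = merged


if __name__ == "__main__":
    heap = FreeListAllocator(1024)
    a, b, c = heap.malloc(100), heap.malloc(200), heap.malloc(300)
    heap.free_block(a)
    heap.free_block(b)                   # a and b are adjacent, so they merge
    print(heap.free)                     # [(0, 300), (600, 424)]
```

Only when malloc() here returns None would the allocator need to ask the kernel for more memory; every other call is served from the pool it already holds.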
Memory Fragmentation and Allocator Choice
glibc's default malloc can suffer from memory fragmentation in long-running services with many small, varying-size allocations and frees: the allocator ends up with many small holes that cannot satisfy new requests, so RSS grows even though the application "freed" memory. This is why high-performance services often switch to alternative allocators such as jemalloc (used by Firefox, Redis, Facebook) or tcmalloc (developed at Google and historically used in Chromium). They use different fragmentation-avoidance strategies, such as size-class segregation and per-thread caches, and often reduce memory footprint in real workloads. Savings of 10-30% are commonly quoted, but the gain depends heavily on the allocation pattern, so measure with your own workload before switching.
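The "many small holes" problem can be shown with a tiny model. The sketch below treats the heap as 16 equal slots of 64 bytes (an arbitrary illustrative size), frees every other one, and compares total free memory with the largest contiguous hole.

```python
def largest_free_run(slots):
    """Length of the longest run of consecutive free (False) slots."""
    best = run = 0
    for used in slots:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


if __name__ == "__main__":
    SLOT = 64                                  # bytes per slot (illustrative)
    slots = [True] * 16                        # 16 live 64-byte allocations
    for i in range(0, 16, 2):
        slots[i] = False                       # free every other allocation
    free_bytes = slots.count(False) * SLOT
    largest = largest_free_run(slots) * SLOT
    print(f"free: {free_bytes} B, largest hole: {largest} B")
    print("128-byte request fits:", largest >= 128)
```

Half of the pool is free (512 bytes), yet a 128-byte request cannot be placed because no hole is larger than 64 bytes. The allocator must grow the heap instead, which is how RSS climbs while the application's live data stays flat.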
Stack Memory
The stack holds local variables, function arguments, return addresses, and saved register values. It grows downward in virtual address space. Each function call pushes a stack frame onto the stack; each return pops it off. This is automatic and O(1) — unlike heap allocation which may require searching free lists.
The default stack size limit on Linux is typically 8 MB (ulimit -s). Exceeding it causes a stack overflow — the program receives SIGSEGV. This is why deeply recursive programs (e.g., parsing deeply nested JSON with a recursive parser) can crash with a segfault.
# View stack size limit
ulimit -s # Default: 8192 (KB = 8 MB)
# Increase stack size for a session
ulimit -s unlimited # Or a specific size
# See stack region in process maps
cat /proc/$$/maps | grep "\[stack\]"
# Python has a default recursion limit (1000) to avoid stack overflow
python3 -c "
import sys
print('Recursion limit:', sys.getrecursionlimit()) # 1000
# sys.setrecursionlimit(10000) # Can increase, but risk stack overflow
"
# Observe stack growth with a simple recursive function
python3 -c "
import sys
def depth(n):
    if n == 0: return 0
    return 1 + depth(n - 1)
print('Max safe depth:', 980) # ~20 frames of overhead
print('At depth 980:', depth(980))
"
Swap Space
When physical RAM is exhausted, the kernel can move cold (infrequently accessed) memory pages to swap space on disk, freeing RAM for active use. The pages are transparently swapped back when accessed again (a major page fault).
# View swap usage
free -h # Shows: total, used, free for RAM and swap
swapon --show # List swap devices and their usage
cat /proc/swaps # Same info from kernel
# How aggressively the kernel swaps
cat /proc/sys/vm/swappiness # 0=avoid swap, higher values swap more aggressively
# Default: 60 on most distributions, for desktops and servers alike
# Latency-sensitive servers (databases) are often tuned down to 1-10
# sudo sysctl vm.swappiness=1
The OOM Killer
When the system runs out of both RAM and swap, the kernel's OOM (Out Of Memory) killer activates. It selects a process to kill based on an oom_score: a heuristic driven mainly by how much memory the process uses (resident pages, swap, and page tables), adjusted by the user-space hint oom_score_adj. The process with the highest score is killed to free memory.
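A rough model of the selection helps build intuition. The sketch below assumes the score is approximately the process's share of total memory, scaled to 0-1000, plus its oom_score_adj value. This is a simplification for illustration and not the kernel's exact formula; the function names and the process table are hypothetical.

```python
def approx_badness(rss_kb: int, swap_kb: int, total_kb: int, oom_score_adj: int = 0) -> int:
    """Rough model: memory share in thousandths plus the user-space adjustment."""
    if oom_score_adj == -1000:
        return 0                               # -1000 exempts the process entirely
    share = (rss_kb + swap_kb) * 1000 // total_kb
    return max(0, min(1000, share + oom_score_adj))


def pick_victim(procs: dict, total_kb: int) -> str:
    """Return the name of the process with the highest modelled score."""
    def score(name):
        rss_kb, swap_kb, adj = procs[name]
        return approx_badness(rss_kb, swap_kb, total_kb, adj)
    return max(procs, key=score)


if __name__ == "__main__":
    total_kb = 16 * 1024 * 1024                # hypothetical 16 GB machine
    procs = {                                  # name -> (rss_kb, swap_kb, oom_score_adj)
        "database": (8_000_000, 0, -500),
        "worker": (4_000_000, 0, 0),
        "sshd": (10_000, 0, -1000),
    }
    print("victim:", pick_victim(procs, total_kb))
```

In this model the database uses the most memory, but its negative adjustment pushes its score below the worker's, so the worker is chosen instead.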
In Kubernetes, when a container exceeds its memory limit (resources.limits.memory in a Pod spec), the kernel's cgroup memory controller OOM-kills a process inside the container's cgroup, typically the container's main process (PID 1). This appears as an OOMKilled status in kubectl describe pod. The fix is not to increase limits blindly; first diagnose why the process is using so much memory (leak? legitimate growth? underestimated limit?).
# Check OOM killer history in kernel logs
dmesg | grep -i "oom\|kill\|out of memory" | tail -20
journalctl -k | grep -i "oom\|killed process" | tail -10
# View OOM score for current processes (higher = more likely to be killed)
for pid in $(ls /proc | grep '^[0-9]' | head -20); do
  score=$(cat /proc/$pid/oom_score 2>/dev/null)
  cmd=$(cat /proc/$pid/comm 2>/dev/null)
  [ -n "$score" ] && echo "$score $cmd"
done | sort -rn | head -10
# Protect a critical process from OOM kill (score adjustment)
# echo -1000 > /proc/$(pgrep -o sshd)/oom_score_adj # Requires root; -o picks the oldest sshd when several are running
# -1000 = never kill, +1000 = kill first
# View current oom_score_adj
cat /proc/$$/oom_score_adj # 0 = default
Huge Pages & THP
The default 4 KB page size means a process using 1 GB of memory has ~262,000 pages, requiring 262,000 TLB entries (which don't exist — the TLB has ~1,000-4,000 entries). Huge pages (2 MB or 1 GB on x86-64) reduce TLB pressure dramatically for memory-intensive workloads.
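The arithmetic is worth working through. The sketch below computes how many translations a 1 GB working set needs at each page size, and how much memory a TLB can cover ("TLB reach"). The 1,024-entry figure is an assumed round number; real CPUs often have separate, smaller entry counts for huge pages.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def pages_needed(working_set: int, page_size: int) -> int:
    """Number of pages (and so distinct translations) covering working_set bytes."""
    return -(-working_set // page_size)        # ceiling division


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory addressable without a TLB miss."""
    return entries * page_size


if __name__ == "__main__":
    for name, size in (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
        print(f"{name} pages: {pages_needed(GIB, size):>7} translations for 1 GB, "
              f"reach with 1,024 entries = {tlb_reach(1024, size) // MIB} MB")
```

With 4 KB pages, 1,024 entries cover only 4 MB of a 1 GB working set; with 2 MB pages the same number of entries covers 2 GB, so the entire working set fits.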
# View huge page configuration
cat /proc/meminfo | grep -i huge
# Transparent Huge Pages (THP) — kernel automatically promotes eligible regions
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
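# For databases (Postgres, Oracle), THP can cause latency spikes, so it is commonly disabled
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# For persistence, boot with the transparent_hugepage=never kernel parameter
# (/etc/rc.local is not run by default on many systemd-based distributions)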
# View THP usage
cat /proc/meminfo | grep AnonHugePages # Memory in 2MB huge pages
cgroups Memory Limits
cgroups v2 memory limits are how Docker, Kubernetes, and systemd enforce memory caps on processes. Two files control the behaviour: when a cgroup's usage exceeds memory.high, the kernel throttles its allocations and reclaims memory aggressively; when usage would exceed the hard limit memory.max and reclaim cannot free enough, the kernel invokes the OOM killer within that cgroup.
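A process can inspect its own limit programmatically. The sketch below assumes a cgroup v2 host with the unified hierarchy mounted at /sys/fs/cgroup; the helper names are illustrative. It maps the entry in /proc/self/cgroup to a directory and reads memory.max and memory.current from it, failing gracefully where those files are absent (for example in the root cgroup or on a cgroup v1 host).

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")   # assumed cgroup v2 mount point


def parse_limit(text: str):
    """memory.max holds a byte count, or the literal 'max' for no limit."""
    text = text.strip()
    return None if text == "max" else int(text)


def own_cgroup_dir(proc_cgroup_text: str) -> Path:
    """Map a cgroup v2 line such as '0::/user.slice/...' to its directory."""
    for line in proc_cgroup_text.splitlines():
        hierarchy, _controllers, path = line.split(":", 2)
        if hierarchy == "0":                   # cgroup v2 unified hierarchy
            return CGROUP_ROOT / path.lstrip("/")
    raise RuntimeError("no cgroup v2 entry found")


if __name__ == "__main__":
    try:
        cg = own_cgroup_dir(Path("/proc/self/cgroup").read_text())
        limit = parse_limit((cg / "memory.max").read_text())
        usage = int((cg / "memory.current").read_text())
        shown = "none" if limit is None else f"{limit} bytes"
        print(f"{cg}: {usage} bytes used, limit = {shown}")
    except (OSError, RuntimeError, ValueError) as exc:
        print(f"could not read cgroup memory files: {exc}")
```

Runtimes that size heaps or caches automatically use the same idea: read the cgroup limit rather than the machine's total RAM, so that a container limited to 128 MB does not plan for the host's full memory.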
# View cgroups memory hierarchy (cgroups v2)
ls /sys/fs/cgroup/
cat /sys/fs/cgroup/user.slice/memory.max # User slice limit ("max" = unlimited); the root cgroup has no memory.max
cat /sys/fs/cgroup/user.slice/memory.current # User slice usage
# Docker container memory limits (run docker to test)
docker run --rm --memory=128m --memory-swap=128m \
  --name mem-test alpine \
  sh -c "cat /sys/fs/cgroup/memory.max" 2>/dev/null
# Should output: 134217728 (= 128 MB in bytes)
# See memory accounting for all cgroups
find /sys/fs/cgroup -name "memory.current" 2>/dev/null | \
while read f; do
    val=$(cat "$f" 2>/dev/null)
    [ "$val" -gt 0 ] 2>/dev/null && echo "$(( val / 1024 / 1024 ))MB $f"
done | sort -rn | head -10
Exercises
# Exercise 1: Read your process's memory stats
cat /proc/$$/status | grep -E "Vm|FDSize|Threads"
# VmRSS = how much RAM is actually in use (Resident Set Size)
# Exercise 2: Observe page faults
/usr/bin/time -v cat /dev/null 2>&1 | grep -E "page fault|Maximum"
# Exercise 3: See memory breakdown of a running Python process
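python3 -c "import time; time.sleep(30)" & # Keep the interpreter alive; a bare "python3 &" stops or exits without a terminal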
PY_PID=$!
sleep 1
cat /proc/$PY_PID/smaps | grep -E "^(Private|Shared|Rss)" | awk '{sum[$1]+=$2} END{for(k in sum) print k, sum[k]/1024, "MB"}'
kill $PY_PID
# Exercise 4: Find processes consuming the most RAM
ps aux --sort=-%mem | head -10
# Or: ps -eo pid,rss,comm --sort=-rss | head -10
# Exercise 5: Check swap usage
free -h
swapon --show
cat /proc/sys/vm/swappiness
Conclusion & Next Steps
Virtual memory isolates processes and enables memory over-commitment. Page tables + the TLB translate virtual to physical addresses — TLB efficiency is the key to memory performance. The heap grows via brk/mmap; the allocator manages the pool. The OOM killer is the kernel's last resort when memory runs out. cgroups enforce hard limits — and understanding them makes "OOMKilled" containers debuggable rather than mysterious.