
Part 6: Memory Management — Virtual Memory, Paging & cgroups

May 13, 2026 · Wasil Zafar · 18 min read

Virtual memory, page tables, the OOM killer, huge pages, and how cgroups enforce memory limits — why your container gets killed and what to do about it.

Table of Contents

  1. Virtual Memory
  2. Memory Allocation
  3. Stack Memory
  4. Swap Space
  5. The OOM Killer
  6. Huge Pages & THP
  7. cgroups Memory Limits
  8. Exercises
  9. Conclusion

Virtual Memory

Every process believes it has exclusive access to a large, contiguous address space — up to 128 TB on 64-bit Linux. This is an illusion created by virtual memory. The CPU works with virtual addresses; the memory management unit (MMU) hardware translates them to physical addresses (actual RAM locations) using page tables maintained by the kernel.

Why Virtual Memory Exists: Without it, every process would need to know the physical addresses of every other process to avoid collisions, processes could corrupt each other's memory, and the total memory of all running processes couldn't exceed physical RAM. Virtual memory solves all three: isolation, protection, and the ability to over-commit (run processes whose combined virtual footprint exceeds physical RAM).

Page Tables & the TLB

Memory is divided into fixed-size units called pages (typically 4 KB). Each process has a page table — a multi-level data structure maintained by the kernel that maps virtual page numbers to physical frame numbers.

Virtual-to-Physical Address Translation
flowchart LR
    VA["Virtual Address\n0x7fff0001a4c0"]
    TLB["TLB — Translation Lookaside Buffer\n(CPU-internal cache of recent translations)"]
    PT["Page Table Walk\n(4-level on x86-64: PGD→PUD→PMD→PTE)"]
    PA["Physical Address\n0x00000004a4c0"]
    RAM["Physical RAM"]

    VA -->|"CPU asks: what's the physical address?"| TLB
    TLB -->|"Cache hit (~1 ns)"| PA
    TLB -->|"Cache miss — walk page tables (~20-100 ns)"| PT
    PT --> PA
    PA --> RAM

    style TLB fill:#3B9797,color:#fff
    style PT fill:#BF092F,color:#fff
    style PA fill:#132440,color:#fff
            

The TLB (Translation Lookaside Buffer) is a small hardware cache inside the CPU that stores recent virtual-to-physical translations. A TLB hit takes ~1 ns; a TLB miss requires walking the page table (~20-100 ns). TLB efficiency depends on working set size — if a process accesses widely scattered memory, TLB thrashing kills performance.
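
As a rough illustration of the 4-level walk in the diagram, the sketch below splits a 48-bit x86-64 virtual address into its four 9-bit table indexes and its 12-bit page offset. The function name `split_virtual_address` is made up for this example, and the layout assumes 4 KB pages with 4-level paging (not 5-level paging or huge pages).

```python
PAGE_SHIFT = 12          # 4 KB pages: the low 12 bits are the offset within the page
INDEX_BITS = 9           # each level indexes a 512-entry table
INDEX_MASK = (1 << INDEX_BITS) - 1

def split_virtual_address(va):
    """Return (pgd, pud, pmd, pte, offset) for a 48-bit virtual address."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    pte = (va >> PAGE_SHIFT) & INDEX_MASK
    pmd = (va >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    pud = (va >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK
    pgd = (va >> (PAGE_SHIFT + 3 * INDEX_BITS)) & INDEX_MASK
    return pgd, pud, pmd, pte, offset

pgd, pud, pmd, pte, offset = split_virtual_address(0x7fff0001a4c0)
print(f"PGD={pgd} PUD={pud} PMD={pmd} PTE={pte} offset={offset:#x}")
```

Translation only replaces the page number. The 12-bit offset is copied unchanged into the physical address, so a virtual address and its physical counterpart always share their low three hex digits.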

# View the memory map of a process (virtual address layout)
# (Use your shell's PID $$)
cat /proc/$$/maps | head -20
# Format: start-end perms offset dev inode path
# r=read, w=write, x=execute, p=private, s=shared

# Count total virtual address space regions
wc -l /proc/$$/maps

# See where libraries are mapped
cat /proc/$$/maps | grep "\.so\." | awk '{print $6}' | sort -u

# View memory statistics for a process
cat /proc/$$/status | grep -E "VmPeak|VmRSS|VmSize|VmSwap"
# VmSize = total virtual address space
# VmRSS  = Resident Set Size (pages actually in RAM)
# VmPeak = peak virtual memory usage

Page Faults

A page fault occurs when a process accesses a virtual address that has no valid mapping to a physical page, or whose mapping does not permit the access (for example, a write to a copy-on-write page). The CPU raises a fault, the kernel's fault handler runs, and the outcome is one of:

  • Minor page fault: The data is already in RAM, or needs no disk read, but is not yet mapped for this process (e.g., first access to a newly allocated page, or a copy-on-write trigger). Fast to handle: the kernel just updates the page table.
  • Major page fault: The page is on disk (swapped out or backed by a file that is not cached). Requires disk I/O, which is very slow compared with RAM (microseconds on fast SSDs, milliseconds on spinning disks). This is why heavy paging is bad for performance.
  • Invalid access: The address is genuinely invalid (null pointer dereference, or an out-of-bounds access that lands in unmapped memory). The kernel sends SIGSEGV to the process.
# Count page faults for a process
# min_flt = minor faults, maj_flt = major faults
# (/proc/<pid>/status has no fault counters; they live in /proc/<pid>/stat, which ps reads)
ps -o pid,min_flt,maj_flt,comm -p $$

# See page fault counts while running a program
/usr/bin/time -v ls /tmp 2>&1 | grep -E "Major|Minor|page"

# For a running process
ps -o pid,min_flt,maj_flt,comm -p "$(pgrep python3 | head -1)" 2>/dev/null
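
Minor faults can also be observed from inside a program. This sketch assumes a Unix system where `resource.getrusage` reports `ru_minflt`. It maps 16 MB of anonymous memory, writes one byte to each page, and prints how many minor faults that caused.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
N_PAGES = 4096                      # 16 MB with 4 KB pages

def minor_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

region = mmap.mmap(-1, N_PAGES * PAGE)   # anonymous mapping, nothing resident yet
before = minor_faults()
for i in range(N_PAGES):
    region[i * PAGE] = 1                 # first write to each page triggers a fault
after = minor_faults()
faults = after - before
print(f"touched {N_PAGES} pages, minor faults: {faults}")
region.close()
```

The count is usually close to the number of pages touched, but it can be lower if the kernel backs the region with huge pages.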

Memory Allocation

Heap: brk() and mmap()

Dynamic memory allocation (malloc/new) comes from the heap. The kernel provides two system calls for expanding the heap:

  • brk()/sbrk(): Moves the "program break" — the end of the heap segment — upward, extending the heap. Simple and fast but only grows contiguously.
  • mmap(MAP_ANONYMOUS): Maps new anonymous memory pages anywhere in the address space. Used for large allocations (typically >128 KB by glibc malloc). More flexible — can be unmapped individually.
# See which syscalls a Python memory allocation makes
strace -e trace=brk,mmap python3 -c "x = [0] * 1000000" 2>&1 | grep -E "brk|mmap" | head -10

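# Demonstrate the heap vs mmap threshold in glibc (assumes Linux with glibc)
python3 -c "
import ctypes
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
small = libc.malloc(1024)              # below the ~128 KB threshold: typically served from the brk heap
large = libc.malloc(64 * 1024 * 1024)  # above the threshold: served by an anonymous mmap
print('small: %#x' % small)
print('large: %#x' % large)
# Compare both addresses with the [heap] range; the large block lies outside it
print(open('/proc/self/maps').read())
" | grep -E "small|large|heap"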

glibc malloc Internals

When you call malloc(), you're not calling the kernel — you're calling glibc's allocator, which manages a pool of memory obtained from the kernel via brk()/mmap(). The allocator maintains free lists, coalesces adjacent free blocks, and only goes to the kernel when its current pool is exhausted.
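
To make the free-list idea concrete, here is a toy first-fit allocator over a fixed pool. It is a deliberately simplified sketch, not glibc's actual algorithm (which uses bins, arenas, and per-chunk headers), and the class `ToyAllocator` and its method names are invented for this example.

```python
class ToyAllocator:
    """First-fit allocator over a fixed pool; free blocks are (offset, size) pairs."""

    def __init__(self, pool_size):
        self.free = [(0, pool_size)]      # kept sorted by offset
        self.used = {}                    # offset -> size of live allocations

    def malloc(self, size):
        for i, (off, blk) in enumerate(self.free):
            if blk >= size:               # first block that fits
                if blk == size:
                    del self.free[i]
                else:                     # split: keep the tail on the free list
                    self.free[i] = (off + size, blk - size)
                self.used[off] = size
                return off
        return None                       # pool exhausted: a real malloc would call brk/mmap here

    def free_block(self, off):
        size = self.used.pop(off)
        self.free.append((off, size))
        self.free.sort()
        merged = [self.free[0]]
        for o, s in self.free[1:]:        # coalesce neighbours that touch
            last_off, last_size = merged[-1]
            if last_off + last_size == o:
                merged[-1] = (last_off, last_size + s)
            else:
                merged.append((o, s))
        self.free = merged

heap = ToyAllocator(1024)
a = heap.malloc(100)
b = heap.malloc(200)
c = heap.malloc(300)
heap.free_block(a)
heap.free_block(c)
print(heap.free)        # two holes, separated by the live block b
heap.free_block(b)
print(heap.free)        # everything coalesces back into one block
```

While `b` is still live, the pool has 824 free bytes but no single hole larger than 724, so a request for 800 bytes would fail. That is fragmentation in miniature.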

Performance

Memory Fragmentation and Allocator Choice

glibc's default malloc can suffer from memory fragmentation in long-running services with many small, varying-size allocations and frees. The allocator ends up with many holes that are too small to satisfy new requests, so RSS grows even though the application "freed" memory. This is why high-performance services often switch to alternative allocators: jemalloc (used by Firefox, Redis, and Facebook) or tcmalloc (developed at Google). They use different fragmentation-avoidance strategies, and savings in the 10-30% range are commonly reported, although the gain depends heavily on the workload, so measure before and after switching.


Stack Memory

The stack holds local variables, function arguments, return addresses, and saved register values. It grows downward in virtual address space. Each function call pushes a stack frame onto the stack; each return pops it off. This is automatic and O(1) — unlike heap allocation which may require searching free lists.

The default stack size limit on Linux is typically 8 MB (ulimit -s). Exceeding it causes a stack overflow — the program receives SIGSEGV. This is why deeply recursive programs (e.g., parsing deeply nested JSON with a recursive parser) can crash with a segfault.
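
A program can query the same limit that `ulimit -s` reports. The sketch below reads it with `resource.getrlimit` and estimates how deep a recursion could go; the 256-byte frame size is a hypothetical figure for a small recursive C function, and `frames_that_fit` is an illustrative helper.

```python
import resource

def frames_that_fit(stack_bytes, frame_bytes):
    """Rough upper bound on recursion depth for a given per-call frame size."""
    return stack_bytes // frame_bytes

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)   # soft limit in bytes
if soft == resource.RLIM_INFINITY:
    print("stack limit: unlimited")
else:
    print(f"stack limit: {soft // 1024} KB")
    # Hypothetical 256-byte frame for a small recursive C function
    print("approx. max depth:", frames_that_fit(soft, 256))
```

With the usual 8 MB limit this gives about 32,768 frames, which is why recursion tens of thousands of levels deep is risky in native code.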

# View stack size limit
ulimit -s           # Default: 8192 (KB = 8 MB)

# Increase stack size for a session
ulimit -s unlimited  # Or a specific size

# See stack region in process maps
cat /proc/$$/maps | grep "\[stack\]"

# Python has a default recursion limit (1000) to avoid stack overflow
python3 -c "
import sys
print('Recursion limit:', sys.getrecursionlimit())   # 1000
# sys.setrecursionlimit(10000)  # Can increase, but risk stack overflow
"

# Recurse to just below the interpreter's recursion limit
python3 -c "
import sys
def depth(n):
    if n == 0: return 0
    return 1 + depth(n - 1)
safe = sys.getrecursionlimit() - 20   # leave ~20 frames of headroom
print('Max safe depth:', safe)
print('At depth', safe, ':', depth(safe))
"

Swap Space

When physical RAM comes under pressure, the kernel can move cold (infrequently accessed) anonymous pages to swap space on disk, freeing RAM for active use. The pages are transparently brought back when accessed again, at the cost of a major page fault.

# View swap usage
free -h          # Shows: total, used, free for RAM and swap
swapon --show    # List swap devices and their usage

cat /proc/swaps  # Same info from kernel

# How aggressively the kernel swaps
cat /proc/sys/vm/swappiness    # Lower = prefer dropping page cache; higher = swap anonymous pages more readily
# Range: 0-100 (0-200 on kernel 5.8+). Kernel default: 60
# Latency-sensitive servers (databases) are often tuned down to 1 or 10
# sudo sysctl vm.swappiness=1
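
The same numbers that `free` prints are available in /proc/meminfo. A minimal sketch, assuming Linux and using an illustrative helper named `swap_usage`, that computes how much swap is in use:

```python
import os

def swap_usage(meminfo_text):
    """Return (total_kb, used_kb, used_percent) from /proc/meminfo contents."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])     # reported in kB
    total = values.get("SwapTotal", 0)
    used = total - values.get("SwapFree", 0)
    percent = 100.0 * used / total if total else 0.0
    return total, used, percent

if os.path.exists("/proc/meminfo"):                # Linux only
    with open("/proc/meminfo") as f:
        total, used, percent = swap_usage(f.read())
    print(f"swap: {used} kB of {total} kB used ({percent:.1f}%)")
```

A swap percentage that keeps rising alongside a growing major-fault count is a stronger sign of memory pressure than swap usage alone.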

The OOM Killer

When the system runs out of both RAM and swap and cannot reclaim enough memory, the kernel's OOM (Out Of Memory) killer activates. It selects a process to kill based on an oom_score: a heuristic driven mainly by how much memory the process uses (resident pages, swap, and page tables), shifted by the user-space hint oom_score_adj. The process with the highest score is killed to free memory.
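
The selection can be approximated in a few lines. This is a simplified model of the scoring in recent kernels, not the kernel's exact code: the function `oom_badness` mirrors the idea that memory footprint is the base score and each oom_score_adj unit shifts it by 0.1% of total memory. The process names and page counts are hypothetical.

```python
def oom_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Approximate badness in pages; None means the task is exempt."""
    if oom_score_adj == -1000:
        return None                                  # never selected
    points = rss_pages + swap_pages + pgtable_pages  # memory a kill would free
    points += oom_score_adj * total_pages // 1000    # each adj unit = 0.1% of total memory
    return points

TOTAL = 4 * 1024 * 1024          # hypothetical machine: 4M pages = 16 GB at 4 KB/page
candidates = {
    "database": oom_badness(2_000_000, 0, 4_000, -500, TOTAL),
    "web":      oom_badness(500_000, 10_000, 1_000, 0, TOTAL),
    "batch":    oom_badness(300_000, 0, 500, 300, TOTAL),
    "sshd":     oom_badness(2_000, 0, 50, -1000, TOTAL),
}
eligible = {name: pts for name, pts in candidates.items() if pts is not None}
victim = max(eligible, key=eligible.get)
print(victim, eligible)
```

In this made-up scenario the batch job is chosen even though the database uses far more memory, because the adjustment values outweigh the raw footprint.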

OOM Kill in Kubernetes: When a container exceeds its memory limit (set via resources.limits.memory in a Pod spec), the kernel's OOM killer runs inside the container's cgroup and sends SIGKILL to a process there, often (but not always) the container's main process. When the main process dies this way, the container shows an OOMKilled status in kubectl describe pod. The fix is not to increase limits blindly. First diagnose why the process is using so much memory (leak? legitimate growth? underestimated limit?).
# Check OOM killer history in kernel logs
dmesg | grep -i "oom\|kill\|out of memory" | tail -20
journalctl -k | grep -i "oom\|killed process" | tail -10

# View OOM score for all current processes (higher = more likely to be killed)
for dir in /proc/[0-9]*; do
    score=$(cat "$dir/oom_score" 2>/dev/null)
    cmd=$(cat "$dir/comm" 2>/dev/null)
    [ -n "$score" ] && echo "$score $cmd"
done | sort -rn | head -10

# Protect a critical process from OOM kill (score adjustment)
# echo -1000 > /proc/$(pgrep -o sshd)/oom_score_adj  # Requires root; -o picks the oldest (parent) sshd
# -1000 = never kill, +1000 = kill first

# View current oom_score_adj
cat /proc/$$/oom_score_adj   # 0 = default

Huge Pages & THP

The default 4 KB page size means a process using 1 GB of memory has ~262,000 pages, requiring 262,000 TLB entries (which don't exist — the TLB has ~1,000-4,000 entries). Huge pages (2 MB or 1 GB on x86-64) reduce TLB pressure dramatically for memory-intensive workloads.
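
The arithmetic behind that claim is easy to check. In this sketch the TLB size of 1,536 entries is an assumed value picked from the range quoted above, and real CPUs keep separate entry pools per page size, so treat the "reach" figures as an idealized upper bound.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def pages_needed(working_set, page_size):
    return -(-working_set // page_size)        # ceiling division

def tlb_reach(entries, page_size):
    return entries * page_size                 # bytes addressable without a TLB miss

ENTRIES = 1536      # assumed TLB size, picked from the ~1,000-4,000 range above
for name, size in (("4 KB", 4 * KB), ("2 MB", 2 * MB), ("1 GB", GB)):
    print(f"{name} pages: {pages_needed(GB, size):>7} pages per GB, "
          f"TLB reach {tlb_reach(ENTRIES, size) / MB:,.0f} MB")
```

With 4 KB pages the assumed TLB covers only 6 MB of a 1 GB working set; with 2 MB pages the same number of entries would cover 3 GB.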

# View huge page configuration
cat /proc/meminfo | grep -i huge

# Transparent Huge Pages (THP) — kernel automatically promotes eligible regions
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

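# For latency-sensitive databases (Postgres, Oracle), THP can cause latency spikes; consider disabling it
# echo never > /sys/kernel/mm/transparent_hugepage/enabled   # Requires root; lost on reboot
# To persist, run it at boot (e.g., /etc/rc.local or a systemd unit) or boot with transparent_hugepage=never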

# View THP usage
cat /proc/meminfo | grep AnonHugePages   # Memory in 2MB huge pages

cgroups Memory Limits

cgroups v2 memory limits are how Docker, Kubernetes, and systemd enforce memory caps on processes. When a cgroup's usage reaches its hard limit (memory.max), the kernel first tries to reclaim memory from that cgroup; if that fails, it invokes the OOM killer within the cgroup. A separate soft threshold (memory.high) throttles allocations and forces reclaim instead of killing.

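# View cgroups memory hierarchy (cgroups v2)
ls /sys/fs/cgroup/
cat /sys/fs/cgroup/cgroup.controllers         # Available controllers; should include "memory"
cat /sys/fs/cgroup/user.slice/memory.max      # User slice limit ("max" = unlimited); the root cgroup has no memory.max
cat /sys/fs/cgroup/user.slice/memory.current  # User slice usage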

# Docker container memory limits (requires Docker on a cgroups v2 host)
docker run --rm --memory=128m --memory-swap=128m \
    --name mem-test alpine \
    sh -c "cat /sys/fs/cgroup/memory.max"
# Should output: 134217728 (= 128 MB in bytes)

# See memory accounting for all cgroups
find /sys/fs/cgroup -name "memory.current" 2>/dev/null | \
    while read f; do
        val=$(cat "$f" 2>/dev/null)
        [ "$val" -gt 0 ] 2>/dev/null && echo "$(( val / 1024 / 1024 ))MB  $f"
    done | sort -rn | head -10
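
A process can also inspect its own limit, which is handy inside a container. The sketch below assumes a cgroups v2 host with the hierarchy mounted at /sys/fs/cgroup; the helper names `own_cgroup_path`, `parse_limit`, and `headroom` are illustrative.

```python
import os

def own_cgroup_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/self/cgroup contents, or None."""
    for line in proc_cgroup_text.splitlines():
        if line.count(":") < 2:
            continue
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":   # the unified (v2) entry is "0::<path>"
            return path
    return None

def parse_limit(raw):
    """memory.max holds either a byte count or the literal string 'max'."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)

def headroom(limit, current):
    return None if limit is None else limit - current

if os.path.exists("/proc/self/cgroup"):              # Linux only
    with open("/proc/self/cgroup") as f:
        path = own_cgroup_path(f.read())
    base = "/sys/fs/cgroup" + (path or "")
    try:
        with open(os.path.join(base, "memory.max")) as f:
            limit = parse_limit(f.read())
        with open(os.path.join(base, "memory.current")) as f:
            current = int(f.read())
        print("limit:", limit, "current:", current, "headroom:", headroom(limit, current))
    except OSError:
        print("memory controller files not readable for", base)
```

A shrinking headroom value is an early warning that the cgroup is approaching the point where the kernel will start reclaiming and, failing that, killing.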

Exercises

# Exercise 1: Read your process's memory stats
cat /proc/$$/status | grep -E "Vm|FDSize|Threads"
# VmRSS = how much RAM is actually in use (Resident Set Size)

# Exercise 2: Observe page faults
/usr/bin/time -v cat /dev/null 2>&1 | grep -E "page fault|Maximum"

# Exercise 3: See memory breakdown of a running Python process
python3 -c "import time; time.sleep(30)" &   # a bare `python3 &` exits or stops without a terminal
PY_PID=$!
sleep 1
grep -E "^(Private|Shared|Rss)" /proc/$PY_PID/smaps | awk '{sum[$1]+=$2} END{for(k in sum) print k, sum[k]/1024, "MB"}'
kill $PY_PID

# Exercise 4: Find processes consuming the most RAM
ps aux --sort=-%mem | head -10
# Or: ps -eo pid,rss,comm --sort=-rss | head -10

# Exercise 5: Check swap usage
free -h
swapon --show
cat /proc/sys/vm/swappiness

Conclusion & Next Steps

Virtual memory isolates processes and enables memory over-commitment. Page tables and the TLB translate virtual to physical addresses, and TLB efficiency is a major factor in memory performance. The heap grows via brk/mmap; the allocator manages the pool. The OOM killer is the kernel's last resort when memory runs out. cgroups enforce hard limits, and understanding them makes "OOMKilled" containers debuggable rather than mysterious.