The Runtime Stack
Container runtimes exist at two distinct levels, each with a clear responsibility boundary:
- High-level runtimes (containerd, CRI-O) — Manage the complete container lifecycle: pulling images, managing storage, creating network namespaces, supervising container processes, and exposing APIs
- Low-level runtimes (runc, crun, youki) — Execute a single operation: given an OCI bundle (rootfs + config.json), create a container process with the specified isolation. No image pulling, no networking, no supervision
The Complete Call Chain
When you run docker run nginx, here's the actual chain of process invocations:
flowchart TD
A["docker CLI<br/>(client)"] -->|"REST API"| B["dockerd<br/>(Docker daemon)"]
B -->|"gRPC"| C["containerd<br/>(high-level runtime)"]
C -->|"exec"| D["containerd-shim<br/>(per-container process)"]
D -->|"exec"| E["runc<br/>(low-level runtime)"]
E -->|"clone() + exec()"| F["container process<br/>(nginx)"]
style A fill:#f8f9fa,stroke:#132440
style B fill:#f8f9fa,stroke:#132440
style C fill:#f0f9f9,stroke:#3B9797
style D fill:#f0f9f9,stroke:#3B9797
style E fill:#fff5f5,stroke:#BF092F
style F fill:#f8f9fa,stroke:#132440
Each component has a distinct lifecycle and can be restarted independently:
| Component | Responsibility | Can Restart Without Killing Containers? |
|---|---|---|
| docker CLI | User interface, command parsing | N/A (stateless client) |
| dockerd | API server, image builds, compose, swarm | Only with live-restore enabled (off by default); otherwise stopping dockerd stops its containers |
| containerd | Image management, container metadata, task supervision | Yes (shims maintain containers) |
| containerd-shim | Holds stdio, reaps exit codes, reports status | No (one per container) |
| runc | Creates the container, then exits | N/A (short-lived process) |
containerd Architecture
containerd is a CNCF graduated project designed as an industry-standard container runtime. Unlike Docker's monolithic daemon, containerd is built around a plugin architecture where every major function is a plugin that can be swapped or extended.
flowchart TB
subgraph API["gRPC API Layer"]
A1[Images Service]
A2[Containers Service]
A3[Tasks Service]
A4[Content Service]
A5[Snapshots Service]
A6[Namespaces Service]
end
subgraph Plugins["Plugin Layer"]
P1[Runtime Plugin<br/>runc / kata / gvisor]
P2[Snapshotter Plugin<br/>overlayfs / native / btrfs]
P3[Content Store<br/>blob storage]
P4[Differ Plugin<br/>layer diffing]
P5[GC Plugin<br/>garbage collection]
P6[CRI Plugin<br/>Kubernetes interface]
end
subgraph Storage["Storage Layer"]
S1[(Content Store<br/>blobs by digest)]
S2[(Metadata Store<br/>BoltDB)]
S3[(Snapshots<br/>filesystem layers)]
end
API --> Plugins
Plugins --> Storage
style API fill:#f0f9f9,stroke:#3B9797
style Plugins fill:#f8f9fa,stroke:#132440
style Storage fill:#fff5f5,stroke:#BF092F
The gRPC API
containerd exposes all functionality through a gRPC API over a Unix socket (default: /run/containerd/containerd.sock). This API is the interface used by Docker, Kubernetes (via CRI plugin), and the ctr CLI.
# Check containerd is running
sudo systemctl status containerd
# containerd configuration file
cat /etc/containerd/config.toml
# List containerd plugins and their status
sudo ctr plugins ls
# Example output showing plugin types:
# TYPE ID PLATFORMS STATUS
# io.containerd.content.v1 content - ok
# io.containerd.snapshotter.v1 overlayfs linux/amd64 ok
# io.containerd.runtime.v2 task linux/amd64 ok
# io.containerd.grpc.v1 cri linux/amd64 ok
# io.containerd.service.v1 containers-service - ok
# io.containerd.service.v1 tasks-service - ok
containerd uses namespaces to isolate different clients. Docker's containers live in the moby namespace, Kubernetes uses the k8s.io namespace, and ctr uses the namespace named default unless you pass -n. These are containerd-level namespaces for metadata, not Linux kernel namespaces; they keep Docker and Kubernetes from interfering with each other on the same node.
# List all containerd namespaces
sudo ctr namespaces ls
# NAME LABELS
# default
# moby (Docker containers live here)
# k8s.io (Kubernetes containers live here)
# Work in a specific namespace
sudo ctr -n moby containers ls
sudo ctr -n k8s.io containers ls
containerd: Image Management
containerd's image management is built on top of its content store — a content-addressable blob storage that holds all image layers, manifests, and configurations referenced by their SHA256 digest.
# Pull an image with ctr (containerd's CLI)
sudo ctr images pull docker.io/library/nginx:alpine
# docker.io/library/nginx:alpine: resolved
# index-sha256:... done
# manifest-sha256:... done
# layer-sha256:... done
# config-sha256:... done
# elapsed: 4.2 s
# List pulled images
sudo ctr images ls
# REF TYPE DIGEST SIZE
# docker.io/library/nginx:alpine application/vnd.oci.image.index.v1 sha256:a1b2c3.. 44.2 MiB
# Inspect image details
sudo ctr images info docker.io/library/nginx:alpine
# Check image content (layers, config)
sudo ctr content ls | head -10
# DIGEST SIZE AGE
# sha256:a1b2c3d4e5f6... 7.6 kB 2 minutes
# sha256:b2c3d4e5f6a1... 3.4 MB 2 minutes
# sha256:c3d4e5f6a1b2... 28.1 MB 2 minutes
# Read a specific blob from the content store
sudo ctr content get sha256:a1b2c3d4e5f6... | jq .
On disk, blobs live under /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/. Each file is named by its SHA256 digest. This is pure content-addressable storage: identical blobs are stored exactly once regardless of how many images reference them.
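The deduplication property can be illustrated with a few lines of shell. This is only a sketch of the content-addressing idea, not containerd's code: the store is a throwaway directory from mktemp, and store_blob is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch of content-addressable storage (not containerd's implementation).
# STORE is a temporary directory, NOT the real containerd content store.
STORE=$(mktemp -d)
mkdir -p "$STORE/blobs/sha256"

# store_blob FILE: copy FILE into the store under its SHA-256 digest
# and print that digest. (Hypothetical helper for illustration.)
store_blob() (
    digest=$(sha256sum "$1" | cut -d' ' -f1)
    dest="$STORE/blobs/sha256/$digest"
    # Identical content maps to the same file name, so it is written once.
    [ -e "$dest" ] || cp "$1" "$dest"
    printf '%s\n' "$digest"
)

printf 'layer data' > "$STORE/a.tmp"
printf 'layer data' > "$STORE/b.tmp"   # same bytes, different file name

first=$(store_blob "$STORE/a.tmp")
second=$(store_blob "$STORE/b.tmp")

echo "digest 1: $first"
echo "digest 2: $second"
echo "blobs stored: $(ls "$STORE/blobs/sha256" | wc -l)"
```

Both calls return the same digest and the blobs directory ends up with a single file, which is the behaviour described for shared image layers.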
containerd: Snapshotters
Snapshotters are containerd's abstraction for preparing filesystem layers. They take the compressed tar layers from the content store and prepare them as mountable filesystem views. The snapshotter interface supports different backend technologies:
| Snapshotter | Backend | Copy-on-Write | Best For |
|---|---|---|---|
| overlayfs | OverlayFS kernel module | Yes | Default, general purpose (Linux 4.0+) |
| native | Plain directory copies | No | Filesystems without overlay support |
| btrfs | Btrfs subvolumes | Yes (filesystem-level) | Btrfs-based systems |
| zfs | ZFS clones | Yes (filesystem-level) | ZFS-based systems |
| devmapper | Device mapper thin-provisioning | Yes (block-level) | AWS Firecracker, block storage |
| stargz | Lazy-pulling eStargz images | Yes | Large images, fast startup |
| nydus | Nydus image format | Yes | Container image acceleration |
# List all snapshots
sudo ctr snapshots ls
# KEY PARENT KIND
# sha256:a1b2c3... Committed
# sha256:b2c3d4... sha256:a1b2c3... Committed
# sha256:c3d4e5... sha256:b2c3d4... Committed
# nginx-container sha256:c3d4e5... Active
# Inspect a snapshot's details (mounts, parent chain)
sudo ctr snapshots info sha256:a1b2c3...
# {
# "Kind": "Committed",
# "Name": "sha256:a1b2c3...",
# "Created": "2026-05-14T10:00:00Z",
# "Updated": "2026-05-14T10:00:00Z"
# }
# View mount instructions for a snapshot
sudo ctr snapshots mounts /tmp/mnt nginx-container
# mount -t overlay overlay -o
# lowerdir=/var/lib/containerd/.../sha256:c3d4e5...,
# upperdir=/var/lib/containerd/.../nginx-container/fs,
# workdir=/var/lib/containerd/.../nginx-container/work
# /tmp/mnt
# Prepare a new active snapshot (writable layer)
sudo ctr snapshots prepare my-new-layer sha256:c3d4e5...
The snapshot hierarchy mirrors the image layer stack. Each committed snapshot represents a read-only layer. An active snapshot is a writable layer created on top of committed parents — this is the container's writable layer where runtime changes are stored.
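To make the parent chain concrete, the sketch below assembles the OverlayFS mount options for an active snapshot from its committed parents. The /snap/N paths and the overlay_opts helper are made up for illustration; real snapshot directories are chosen by the snapshotter.

```shell
#!/bin/sh
# overlay_opts UPPER WORK LAYER...
# LAYER arguments are the committed (read-only) layers, base layer first.
# Prints the option string for "mount -t overlay". (Hypothetical helper.)
overlay_opts() (
    upper=$1; work=$2; shift 2
    lower=""
    for layer in "$@"; do
        # OverlayFS treats the leftmost lowerdir entry as the topmost layer,
        # so each newer layer is prepended to the list.
        lower="$layer${lower:+:$lower}"
    done
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
)

# Three committed layers (1 = base) plus one active, writable snapshot (4).
opts=$(overlay_opts /snap/4/fs /snap/4/work /snap/1/fs /snap/2/fs /snap/3/fs)
echo "$opts"
```

The committed layers only ever appear in lowerdir, while the active snapshot supplies upperdir and workdir, so every write lands in the container's own layer.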
containerd: Container Lifecycle
containerd distinguishes between a container (metadata object) and a task (running process). A container is a static record of what to run. A task is the running instance of that container. This separation allows you to create containers without starting them, and to re-create tasks from the same container definition.
# Pull an image first
sudo ctr images pull docker.io/library/alpine:3.19
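# Create a container (metadata only, nothing running yet)
# A long-running command is given so the detached task below stays alive;
# the image's default /bin/sh would exit at once without a terminal
sudo ctr containers create docker.io/library/alpine:3.19 my-alpine sleep 3600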
# List containers (note: no running process yet)
sudo ctr containers ls
# CONTAINER IMAGE RUNTIME
# my-alpine docker.io/library/alpine:3.19 io.containerd.runc.v2
# Inspect container metadata
sudo ctr containers info my-alpine
# Start a task (this creates the actual running process)
sudo ctr tasks start -d my-alpine
# List running tasks
sudo ctr tasks ls
# TASK PID STATUS
# my-alpine 12345 RUNNING
# Execute a command in the running task
sudo ctr tasks exec --exec-id shell1 -t my-alpine /bin/sh
# View task process metrics
sudo ctr tasks metrics my-alpine
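# Kill the task (container metadata remains)
# The task's process is PID 1 in its namespace and ignores SIGTERM unless it
# installs a handler, so send SIGKILL explicitly
sudo ctr tasks kill --signal SIGKILL my-alpine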
# Delete the task
sudo ctr tasks delete my-alpine
# Delete the container (metadata)
sudo ctr containers delete my-alpine
containerd-shim
The containerd-shim is a critical but often overlooked component. One shim process exists per container, serving as the direct parent of the container process. The shim exists for three reasons:
- Daemon-less containers: The shim allows containerd to restart without killing running containers. The shim keeps the container alive independently.
- Exit code reaping: In Linux, when a process exits, its parent must call wait() to collect the exit status. The shim is the parent that reaps the container process.
- stdio management: The shim holds the container's stdin/stdout/stderr file descriptors and forwards them to logging systems (FIFO pipes or files).
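The reaping and stdio roles can be imitated in plain shell. The following is a toy stand-in for the shim under obvious simplifications (a log file instead of FIFOs, no ttrpc); run_and_reap is a hypothetical name.

```shell
#!/bin/sh
# run_and_reap LOGFILE CMD...
# Starts CMD as a child with its stdout/stderr held in LOGFILE, then waits
# for it and prints the exit status it collected. (Hypothetical helper.)
run_and_reap() (
    log=$1; shift
    "$@" > "$log" 2>&1 &          # child runs in the background
    child=$!
    status=0
    wait "$child" || status=$?    # wait() reaps the child and yields its status
    printf '%s\n' "$status"
)

log=$(mktemp)
code=$(run_and_reap "$log" sh -c 'echo "serving"; exit 7')

echo "exit status: $code"
echo "captured output: $(cat "$log")"
```

Because the parent stays around until wait returns, the exit status is not lost and the child never lingers as a zombie, which is the job the shim does for every container.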
flowchart TB
subgraph Daemon["containerd daemon"]
CD[containerd process<br/>PID 1000]
end
subgraph Shims["Per-Container Shims"]
S1["shim (nginx)<br/>PID 2001"]
S2["shim (redis)<br/>PID 2002"]
S3["shim (postgres)<br/>PID 2003"]
end
subgraph Containers["Container Processes"]
C1["nginx<br/>PID 3001"]
C2["redis-server<br/>PID 3002"]
C3["postgres<br/>PID 3003"]
end
CD -.->|"ttrpc"| S1
CD -.->|"ttrpc"| S2
CD -.->|"ttrpc"| S3
S1 -->|"parent of"| C1
S2 -->|"parent of"| C2
S3 -->|"parent of"| C3
style Daemon fill:#f0f9f9,stroke:#3B9797
style Shims fill:#f8f9fa,stroke:#132440
style Containers fill:#fff5f5,stroke:#BF092F
# View shim processes on a running system
ps aux | grep containerd-shim
# root 2001 containerd-shim-runc-v2 -namespace moby -id abc123...
# root 2002 containerd-shim-runc-v2 -namespace moby -id def456...
# root 2003 containerd-shim-runc-v2 -namespace moby -id ghi789...
# Each shim's parent is PID 1 (init), NOT containerd
# This is by design — allows containerd restart
ps -o pid,ppid,comm -p 2001
# PID PPID COMMAND
# 2001 1 containerd-shim-runc-v2
# Shim communicates with containerd via ttrpc (lightweight gRPC)
# Socket location per container:
ls /run/containerd/io.containerd.runtime.v2.task/moby/abc123.../
# address config.json init.pid log log.json shim.pid
Two shim implementations exist: containerd-shim (v1, legacy) and containerd-shim-runc-v2 (v2, current). The v2 shim speaks ttrpc, can serve a whole Kubernetes pod from one shim process instead of one per container, and uses fewer resources. Always use v2; v1 is deprecated.
runc Deep Dive
runc is the OCI reference implementation — the canonical low-level container runtime originally extracted from Docker's codebase. Written in Go, it directly manipulates Linux kernel features (namespaces, cgroups, capabilities, seccomp) to create containers. It takes an OCI bundle as input and produces a running, isolated process.
# Check runc version and supported features
runc --version
# runc version 1.1.12
# commit: v1.1.12-0-g51d5e946
# spec: 1.1.0
# go: go1.21.6
# libseccomp: 2.5.4
# Generate a default OCI config.json
mkdir -p mycontainer/rootfs
cd mycontainer
runc spec
# Creates config.json with sensible defaults
# Create a rootfs from an Alpine image
docker export $(docker create alpine:3.19) | tar -xC rootfs/
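# Adjust the generated config for a detached create/start:
# with the default "terminal": true, runc create requires a --console-socket,
# and a long-running process keeps the container around for the steps below
jq '.process.terminal = false | .process.args = ["sleep", "3600"]' config.json > config.json.tmp && mv config.json.tmp config.json
# Create a container (does NOT start it)
sudo runc create my-container
# Container is now in "created" state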
# Check container state
sudo runc state my-container
# {
# "ociVersion": "1.1.0",
# "id": "my-container",
# "pid": 45678,
# "status": "created",
# "bundle": "/home/user/mycontainer",
# "rootfs": "/home/user/mycontainer/rootfs",
# "created": "2026-05-14T10:00:00.123456789Z"
# }
# Start the container (transitions to "running")
sudo runc start my-container
# List running containers managed by runc
sudo runc list
# ID PID STATUS BUNDLE CREATED
# my-container 45678 running /home/user/mycontainer 2026-05-14T10:00:00Z
# Execute a command in the running container
sudo runc exec my-container ls /
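# Send a signal to the container
# The container's PID 1 ignores SIGTERM unless it installs a handler for it,
# so use KILL to be sure the process stops before deleting
sudo runc kill my-container KILL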
# Delete the container
sudo runc delete my-container
runc Internals
When runc creates a container, it performs a carefully orchestrated sequence of kernel operations. Understanding this sequence reveals exactly what "creating a container" means at the system level:
sequenceDiagram
participant P as Parent (runc)
participant I as Init Process (runc init)
participant K as Kernel
P->>K: clone(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC)
K-->>I: New process in new namespaces
I->>K: mount("none", "/", MS_REC|MS_PRIVATE)
I->>K: Setup mounts (/proc, /dev, /sys, tmpfs)
I->>K: pivot_root(rootfs, old_root)
I->>K: umount(old_root, MNT_DETACH)
I->>K: Set hostname (UTS namespace)
I->>K: Configure cgroups (memory, cpu, pids)
I->>K: Apply seccomp filter (BPF program)
I->>K: Drop capabilities (keep only allowed set)
I->>K: setuid/setgid (drop root if configured)
I->>P: Signal ready (via pipe)
P-->>I: Signal start (via pipe)
I->>K: exec(container_entrypoint)
Note over I: Init process replaced by<br/>application process
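One way to see the effect of the clone() flags is to compare namespace identifiers under /proc. The sketch below assumes a Linux host with /proc mounted; ns_id and same_ns are hypothetical helper names, and reading another user's process (such as a container's PID) would need root.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]".
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes share that namespace.
same_ns() {
    [ "$(ns_id "$1" "$3" 2>/dev/null)" = "$(ns_id "$2" "$3" 2>/dev/null)" ]
}

# Namespaces of the current shell, matching the flags in the diagram above.
for ns in mnt pid net uts ipc; do
    printf '%-4s %s\n' "$ns" "$(ns_id $$ "$ns" 2>/dev/null || echo unavailable)"
done

# An ordinary child process inherits its parent's namespaces.
if same_ns $$ self pid; then
    echo "shell and its child share one PID namespace"
fi
```

Running same_ns with the host shell's PID and a container's PID (as root) should fail for each namespace type that runc was asked to create.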
Step-by-Step Breakdown
- clone() with namespace flags: Creates a new process in fresh namespaces. CLONE_NEWPID gives it PID 1 inside the container. CLONE_NEWNET gives it an isolated network stack. CLONE_NEWNS gives it private mount points.
- Mount propagation: Makes the mount namespace fully private so that mounts inside the container don't leak to the host and vice versa.
- Setup special filesystems: Mounts /proc (process information scoped to PID namespace), /dev (minimal device nodes), /sys (read-only sysfs).
- pivot_root(): Changes the root filesystem to the container's rootfs. Unlike chroot, pivot_root swaps the root mount itself; once the old root is unmounted, the host filesystem is no longer reachable from the container's mount namespace, which closes the classic chroot escape.
- Cgroup configuration: Places the container process into appropriate cgroups to enforce resource limits (memory, CPU, PIDs, I/O bandwidth).
- seccomp filter: Installs a BPF (Berkeley Packet Filter) program that the kernel evaluates on every system call the process makes, rejecting disallowed calls before they execute.
- Capability drop: Linux capabilities split root's powers into ~40 discrete privileges. The container keeps only what it needs (the default config from runc spec keeps KILL, NET_BIND_SERVICE, AUDIT_WRITE; Docker's default set is larger).
- exec(): The init process replaces itself with the actual container application. From this point, runc's code is gone and only the application code runs.
/proc for the new PID namespace from within it).
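The capability drop is visible as a bitmask in the CapEff field of /proc/<pid>/status. A small sketch for decoding such a mask follows; the bit numbers come from the kernel's capability list, while has_cap is a hypothetical helper.

```shell
#!/bin/sh
# Capability bit numbers as defined by the Linux kernel.
CAP_KILL=5
CAP_NET_BIND_SERVICE=10
CAP_SYS_ADMIN=21
CAP_AUDIT_WRITE=29

# has_cap HEXMASK BIT: succeed if capability number BIT is set in HEXMASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Build the mask for the three capabilities named in the list above.
mask=$(printf '%016x' $(( (1 << CAP_KILL) | (1 << CAP_NET_BIND_SERVICE) | (1 << CAP_AUDIT_WRITE) )))
echo "restricted mask: $mask"

for cap in KILL NET_BIND_SERVICE SYS_ADMIN AUDIT_WRITE; do
    bit=$(eval "echo \$CAP_$cap")
    if has_cap "$mask" "$bit"; then state=kept; else state=dropped; fi
    echo "CAP_$cap: $state"
done

# Compare with the effective set of the current process, if /proc is readable.
if [ -r /proc/self/status ]; then
    echo "this process: $(awk '/^CapEff:/ {print $2}' /proc/self/status)"
fi
```

Inside a container started from the default runc config, the CapEff line should match the restricted mask, whereas a root shell on the host typically shows every bit set.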
The OCI Bundle
An OCI bundle is the complete input to a low-level runtime. You can create one manually to understand exactly what runc receives from containerd:
# Create directory structure
mkdir -p my-oci-bundle/rootfs
# Option 1: Export a Docker image's filesystem
docker export $(docker create busybox:latest) | tar -xC my-oci-bundle/rootfs/
# Option 2: Build a minimal rootfs from scratch
mkdir -p my-oci-bundle/rootfs/{bin,dev,proc,sys,tmp}
cp /bin/busybox my-oci-bundle/rootfs/bin/
cd my-oci-bundle/rootfs/bin && for cmd in sh ls cat echo ps mount; do ln -s busybox $cmd; done && cd ../../..
# Generate the default config.json
cd my-oci-bundle
runc spec
# Customize config.json: change the process
# Edit config.json to set: "args": ["/bin/sh", "-c", "echo Hello from OCI bundle && ps aux"]
# Run the container directly with runc
sudo runc run my-first-oci-container
# runc run removes the container once its command exits,
# so it should no longer appear in the list
sudo runc list
The minimal viable OCI bundle needs only:
- A directory (rootfs/) with at least one executable file
- A config.json that references that executable in process.args
Everything else in config.json — namespaces, mounts, cgroups, capabilities — provides isolation and resource control. Without them, you'd have a chrooted process with no real security boundary.
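As a rough illustration of how little a bundle needs, the sketch below writes a bare config.json by hand. The field set reflects my reading of the OCI runtime spec's required fields and has not been checked against a particular runc release; make_bundle is a hypothetical helper, and the rootfs is left empty here, so this only demonstrates the layout.

```shell
#!/bin/sh
# make_bundle DIR ARG...
# Creates DIR/rootfs and a minimal DIR/config.json whose process.args is ARG...
# (Hypothetical helper; assumed-minimal field set, no namespaces or cgroups.)
make_bundle() (
    dir=$1; shift
    mkdir -p "$dir/rootfs"
    args=""
    for a in "$@"; do
        # Naive JSON quoting: adequate for simple words without quotes.
        args="${args:+$args, }\"$a\""
    done
    cat > "$dir/config.json" <<EOF
{
  "ociVersion": "1.0.2",
  "process": {
    "user": { "uid": 0, "gid": 0 },
    "args": [ $args ],
    "cwd": "/"
  },
  "root": { "path": "rootfs" }
}
EOF
)

bundle=$(mktemp -d)
make_bundle "$bundle" /bin/sh -c "echo hello"
cat "$bundle/config.json"
```

Comparing this file with the output of runc spec shows how much of the generated default is isolation policy rather than a description of what to run.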
Container Runtime Interface (CRI)
The Container Runtime Interface (CRI) is Kubernetes' abstraction layer for container runtimes. It's a gRPC protocol that defines how the kubelet (Kubernetes node agent) communicates with any container runtime, without knowing the implementation details.
flowchart LR
subgraph K8s["Kubernetes Node"]
KL[kubelet]
end
subgraph CRI["CRI Protocol (gRPC)"]
RS[RuntimeService]
IS[ImageService]
end
subgraph Runtimes["CRI Implementations"]
CD[containerd<br/>+ CRI plugin]
CO[CRI-O]
end
subgraph Low["Low-Level"]
R1[runc]
R2[kata]
end
KL --> RS
KL --> IS
RS --> CD
RS --> CO
IS --> CD
IS --> CO
CD --> R1
CD --> R2
CO --> R1
style K8s fill:#f8f9fa,stroke:#132440
style CRI fill:#f0f9f9,stroke:#3B9797
style Runtimes fill:#f8f9fa,stroke:#132440
style Low fill:#fff5f5,stroke:#BF092F
CRI Services
| Service | Key Operations | Purpose |
|---|---|---|
| RuntimeService | RunPodSandbox, CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers, ExecSync | Container lifecycle management |
| ImageService | PullImage, ListImages, RemoveImage, ImageStatus, ImageFsInfo | Image operations |
# Use crictl to interact with the CRI directly
# (crictl is the CRI equivalent of docker CLI)
# Configure crictl to use containerd's CRI socket
cat /etc/crictl.yaml
# runtime-endpoint: unix:///run/containerd/containerd.sock
# image-endpoint: unix:///run/containerd/containerd.sock
# List pods (sandboxes)
sudo crictl pods
# POD ID CREATED STATE NAME NAMESPACE ATTEMPT
# abc123 2 hours ago Ready nginx-pod default 0
# List containers
sudo crictl ps
# CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
# def456 nginx 2 hours ago Running nginx 0 abc123
# Pull an image via CRI
sudo crictl pull docker.io/library/nginx:alpine
# Inspect a container
sudo crictl inspect def456
# View container logs
sudo crictl logs def456
# Execute a command in a container
sudo crictl exec -it def456 /bin/sh
CRI-O
CRI-O is an alternative high-level runtime purpose-built for Kubernetes. Unlike containerd (which serves Docker, Kubernetes, and standalone use), CRI-O is designed only to implement the CRI protocol. This narrower scope keeps it smaller and reduces its attack surface.
| Feature | containerd | CRI-O |
|---|---|---|
| Primary Purpose | General-purpose container runtime | Kubernetes-only runtime |
| CRI Support | Via built-in CRI plugin | Native (it is a CRI implementation) |
| Docker Compatibility | Yes (Docker uses containerd) | No (Kubernetes only) |
| CLI Tool | ctr, nerdctl | crictl only |
| Image Building | Supports BuildKit | No (use Buildah externally) |
| Default Snapshotter | overlayfs | overlayfs |
| Version Sync | Independent releases | Tracks Kubernetes versions (1.28, 1.29...) |
| Maintained By | CNCF (graduated) | CNCF (graduated since 2023) |
| Used By | Docker, AWS EKS, GKE, AKS | Red Hat OpenShift, Fedora CoreOS |
# CRI-O configuration
cat /etc/crio/crio.conf
# Key settings:
# [crio.runtime]
# default_runtime = "runc"
# conmon_cgroup = "pod"
#
# [crio.runtime.runtimes.runc]
# runtime_path = "/usr/bin/runc"
# runtime_type = "oci"
#
# [crio.runtime.runtimes.kata]
# runtime_path = "/usr/bin/kata-runtime"
# runtime_type = "oci"
# privileged_without_host_devices = true
# Check CRI-O status
sudo systemctl status crio
# CRI-O also uses crictl for interaction
sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock pods
Alternative Runtimes
OCI compliance enables a rich ecosystem of alternative low-level runtimes, each optimised for different use cases:
| Runtime | Language | Key Differentiator | Startup Overhead | Use Case |
|---|---|---|---|---|
| runc | Go | Reference implementation, battle-tested | ~100ms | General purpose, default everywhere |
| crun | C | 2× faster startup, 50% less memory | ~50ms | Performance-critical, Podman default |
| youki | Rust | Memory safety, no GC pauses | ~60ms | Security-focused, emerging alternative |
| kata-containers | Go/Rust | Runs each container in a lightweight VM | ~500ms | Multi-tenant clouds, untrusted workloads |
| gVisor (runsc) | Go | User-space kernel intercepts all syscalls | ~150ms | Defence in depth, Google Cloud Run |
| Firecracker | Rust | MicroVM with minimal device model | ~125ms | AWS Lambda, AWS Fargate |
| WasmEdge | C++/Rust | WebAssembly runtime as OCI runtime | ~1ms | Lightweight serverless, edge computing |
Choosing a Runtime: Decision Matrix
The choice of runtime depends on your threat model and performance requirements:
- Single-tenant, trusted workloads → runc or crun (maximum performance, namespace isolation is sufficient)
- Multi-tenant, untrusted workloads → kata-containers or gVisor (additional VM/kernel boundary between tenants)
- Serverless/FaaS platforms → Firecracker (fast VM boot, minimal overhead, strong isolation for function execution)
- Edge computing, plugins → WasmEdge (sub-millisecond startup, sandboxed execution, language-agnostic)
- Maximum startup performance → crun (C implementation, no garbage collector, minimal overhead)
Key principle: All OCI-compliant runtimes run the same images. Your choice of runtime is independent of your image format. You can switch runtimes per workload in Kubernetes using RuntimeClass resources.
# Kubernetes RuntimeClass: run specific pods with kata-containers
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
scheduling:
  nodeSelector:
    kubernetes.io/runtime: kata
---
# Use the RuntimeClass in a Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: untrusted-image:latest
      resources:
        limits:
          memory: "256Mi"
          cpu: "500m"
Exercises
Exercise 1: Run a Container with runc Directly
Bypass Docker entirely and create a container using only runc:
# Create the OCI bundle structure
mkdir -p runc-exercise/rootfs
cd runc-exercise
# Extract Alpine's filesystem
docker export $(docker create alpine:3.19) | tar -xC rootfs/
# Generate config.json
runc spec
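# Edit config.json to run a custom command
# Change "args": ["sh"] to "args": ["/bin/sh", "-c", "echo PID: $$ && hostname && cat /proc/self/cgroup && sleep 60"]
# (the trailing sleep keeps the container alive long enough to inspect it from another terminal)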
# Run the container
sudo runc run exercise-container
# In another terminal, check the container state
sudo runc state exercise-container
sudo runc list
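The exercise prints /proc/self/cgroup from inside the container. On a cgroup v2 host that file holds a single line of the form 0::/path, and a small helper can extract the path. This is a sketch; cgroup_v2_path is a hypothetical name and the sample path is invented.

```shell
#!/bin/sh
# cgroup_v2_path FILE: print the cgroup path from a /proc/<pid>/cgroup file.
# On cgroup v2 the only entry has hierarchy ID 0 and an empty controller list.
cgroup_v2_path() {
    sed -n 's/^0:://p' "$1"
}

# Invented sample line, standing in for a container's /proc/self/cgroup.
sample=$(mktemp)
printf '0::/system.slice/example.scope\n' > "$sample"
parsed=$(cgroup_v2_path "$sample")
echo "sample cgroup: $parsed"

# The current process, if the file is readable (prints nothing on cgroup v1).
if [ -r /proc/self/cgroup ]; then
    echo "this process: $(cgroup_v2_path /proc/self/cgroup)"
fi
```

Running the same helper on the host and inside the container makes the different cgroup placement easy to compare.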
Exercise 2: Explore containerd with ctr
Use containerd's native CLI to pull images, create containers, and manage tasks:
# Pull an image directly through containerd
sudo ctr images pull docker.io/library/busybox:latest
# Create a container (metadata only) with the command it should run;
# ctr takes the command at create time, not at task start
sudo ctr containers create docker.io/library/busybox:latest my-busybox /bin/sh -c "while true; do date; sleep 5; done"
# Start a task for the container in the background
sudo ctr tasks start -d my-busybox
# List tasks and check PID
sudo ctr tasks ls
# Exec into the running task
sudo ctr tasks exec --exec-id debug -t my-busybox /bin/sh
# View the container's snapshot
sudo ctr snapshots ls | grep busybox
# Cleanup (the shell loop runs as PID 1 and ignores SIGTERM, so send SIGKILL)
sudo ctr tasks kill --signal SIGKILL my-busybox
sudo ctr tasks delete my-busybox
sudo ctr containers delete my-busybox
Exercise 3: Trace the Runtime Stack
Start a Docker container and observe the entire process hierarchy:
# Start a long-running container
docker run -d --name trace-test nginx:alpine
# Find all related processes
CONTAINER_PID=$(docker inspect trace-test --format '{{.State.Pid}}')
echo "Container PID: $CONTAINER_PID"
# Show the full process tree
pstree -p $CONTAINER_PID -s
# systemd(1)───containerd-shim(5678)───nginx(9012)───nginx(9013)
# Verify the shim is the parent
ps -o pid,ppid,comm -p $CONTAINER_PID
# PID PPID COMMAND
# 9012 5678 nginx
# Check the shim's parent (should be PID 1, not containerd)
ps -o pid,ppid,comm -p 5678
# PID PPID COMMAND
# 5678 1 containerd-shim
# Cleanup
docker rm -f trace-test
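If pstree is not installed, the same parent chain can be walked by hand through /proc. The sketch below assumes a Linux /proc; ppid_of and ancestry are hypothetical helper names, and it is demonstrated on the current shell rather than on a container PID.

```shell
#!/bin/sh
# ppid_of PID: print the parent PID recorded in /proc/PID/status.
ppid_of() {
    awk '/^PPid:/ {print $2}' "/proc/$1/status"
}

# ancestry PID: print "PID NAME" for PID and each of its ancestors.
ancestry() (
    pid=$1
    while [ "${pid:-0}" -gt 0 ] && [ -r "/proc/$pid/status" ]; do
        name=$(awk '/^Name:/ {print $2}' "/proc/$pid/status")
        printf '%s %s\n' "$pid" "$name"
        pid=$(ppid_of "$pid")
    done
)

# Walk upward from the current shell.
ancestry $$
```

Calling ancestry "$CONTAINER_PID" before the cleanup step should list nginx, then the shim, then PID 1, with no containerd entry in between.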
Conclusion & Next Steps
The container runtime stack is a masterpiece of separation of concerns. containerd handles the "what" — managing images, snapshots, and container metadata. runc handles the "how" — creating isolated processes using Linux kernel primitives. The containerd-shim bridges them, providing daemon-less container survival and exit code reaping.
Key takeaways:
- containerd is a high-level runtime managing the full container lifecycle: images (content store), filesystem preparation (snapshotters), and process supervision (tasks)
- runc is a short-lived process that creates containers from OCI bundles using clone(), pivot_root(), cgroups, seccomp, and capabilities — then exits
- containerd-shim is the per-container process that parents the container, enabling daemon restarts without killing workloads
- CRI abstracts runtime differences away from Kubernetes, enabling containerd and CRI-O to be interchangeable
- Alternative runtimes (kata, gVisor, crun) provide different trade-offs between isolation strength and performance, all using the same OCI image format
Next in the Series
In Part 15: Security Fundamentals, we'll explore the security model that makes containers safe — from Linux capabilities and seccomp profiles to AppArmor/SELinux policies, rootless containers, and image signing. You'll learn to harden containers for production and understand the threat models that inform runtime design decisions.