Part 14: containerd & runc Deep Dive

The Runtime Stack

Container runtimes exist at two distinct levels, each with a clear responsibility boundary:

High-level runtimes (containerd, CRI-O) — Manage the complete container lifecycle: pulling images, managing storage, creating network namespaces, supervising container processes, and exposing APIs
Low-level runtimes (runc, crun, youki) — Execute a single operation: given an OCI bundle (rootfs + config.json), create a container process with the specified isolation. No image pulling, no networking, no supervision

The Complete Call Chain

When you run docker run nginx, here's the actual chain of process invocations:

Docker Run Call Chain

flowchart TD
    A["docker CLI<br/>(client)"] -->|"REST API"| B["dockerd<br/>(Docker daemon)"]
    B -->|"gRPC"| C["containerd<br/>(high-level runtime)"]
    C -->|"exec"| D["containerd-shim<br/>(per-container process)"]
    D -->|"exec"| E["runc<br/>(low-level runtime)"]
    E -->|"clone() + exec()"| F["container process<br/>(nginx)"]

    style A fill:#f8f9fa,stroke:#132440
    style B fill:#f8f9fa,stroke:#132440
    style C fill:#f0f9f9,stroke:#3B9797
    style D fill:#f0f9f9,stroke:#3B9797
    style E fill:#fff5f5,stroke:#BF092F
    style F fill:#f8f9fa,stroke:#132440

Each component has a distinct lifecycle and can be restarted independently:

Component	Responsibility	Can Restart Without Killing Containers?
docker CLI	User interface, command parsing	N/A (stateless client)
dockerd	API server, image builds, compose, swarm	Only with live-restore enabled (by default, stopping dockerd stops its containers)
containerd	Image management, container metadata, task supervision	Yes (shims maintain containers)
containerd-shim	Holds stdio, reaps exit codes, reports status	No (one per container)
runc	Creates the container, then exits	N/A (short-lived process)

                            
Key Insight: runc is not a daemon. It sets up the container and exits, and the containerd-shim remains as the parent process of the container. This is why you can upgrade runc without affecting running containers: runc is invoked only for discrete operations (create, start, exec, kill, delete) and is never resident while the container runs.
                        

containerd Architecture

containerd is a CNCF graduated project designed as an industry-standard container runtime. Unlike Docker's monolithic daemon, containerd is built around a plugin architecture where every major function is a plugin that can be swapped or extended.

containerd Component Architecture

flowchart TB
    subgraph API["gRPC API Layer"]
        A1[Images Service]
        A2[Containers Service]
        A3[Tasks Service]
        A4[Content Service]
        A5[Snapshots Service]
        A6[Namespaces Service]
    end
    subgraph Plugins["Plugin Layer"]
        P1["Runtime Plugin<br/>runc / kata / gvisor"]
        P2["Snapshotter Plugin<br/>overlayfs / native / btrfs"]
        P3["Content Store<br/>blob storage"]
        P4["Differ Plugin<br/>layer diffing"]
        P5["GC Plugin<br/>garbage collection"]
        P6["CRI Plugin<br/>Kubernetes interface"]
    end
    subgraph Storage["Storage Layer"]
        S1[("Content Store<br/>blobs by digest")]
        S2[("Metadata Store<br/>BoltDB")]
        S3[("Snapshots<br/>filesystem layers")]
    end
    API --> Plugins
    Plugins --> Storage
    style API fill:#f0f9f9,stroke:#3B9797
    style Plugins fill:#f8f9fa,stroke:#132440
    style Storage fill:#fff5f5,stroke:#BF092F

The gRPC API

containerd exposes all functionality through a gRPC API over a Unix socket (default: /run/containerd/containerd.sock). This API is the interface used by Docker, Kubernetes (via CRI plugin), and the ctr CLI.

# Check containerd is running
sudo systemctl status containerd

# containerd configuration file
cat /etc/containerd/config.toml

# List containerd plugins and their status
sudo ctr plugins ls

# Example output showing plugin types:
# TYPE                                   ID                       PLATFORMS   STATUS
# io.containerd.content.v1               content                  -           ok
# io.containerd.snapshotter.v1           overlayfs                linux/amd64 ok
# io.containerd.runtime.v2               task                     linux/amd64 ok
# io.containerd.grpc.v1                  cri                      linux/amd64 ok
# io.containerd.service.v1               containers-service       -           ok
# io.containerd.service.v1               tasks-service            -           ok

containerd uses namespaces to isolate different clients. Docker's containers live in the moby namespace, Kubernetes uses the k8s.io namespace, and ctr uses default by default. This prevents Docker and Kubernetes from interfering with each other on the same node.

# List all containerd namespaces
sudo ctr namespaces ls
# NAME    LABELS
# default
# moby    (Docker containers live here)
# k8s.io  (Kubernetes containers live here)

# Work in a specific namespace
sudo ctr -n moby containers ls
sudo ctr -n k8s.io containers ls

containerd: Image Management

containerd's image management is built on top of its content store — a content-addressable blob storage that holds all image layers, manifests, and configurations referenced by their SHA256 digest.

# Pull an image with ctr (containerd's CLI)
sudo ctr images pull docker.io/library/nginx:alpine
# docker.io/library/nginx:alpine: resolved
# index-sha256:... done
# manifest-sha256:... done
# layer-sha256:... done
# config-sha256:... done
# elapsed: 4.2 s

# List pulled images
sudo ctr images ls
# REF                            TYPE                                DIGEST          SIZE
# docker.io/library/nginx:alpine application/vnd.oci.image.index.v1+json sha256:a1b2c3.. 44.2 MiB

# Inspect image details
sudo ctr images info docker.io/library/nginx:alpine

# Check image content (layers, config)
sudo ctr content ls | head -10
# DIGEST                                                                  SIZE    AGE
# sha256:a1b2c3d4e5f6...  7.6 kB  2 minutes
# sha256:b2c3d4e5f6a1...  3.4 MB  2 minutes
# sha256:c3d4e5f6a1b2...  28.1 MB 2 minutes

# Read a specific blob from the content store
sudo ctr content get sha256:a1b2c3d4e5f6... | jq .

                            
Content Store Location: By default, containerd stores all content at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/. Each file is named by its SHA256 digest. This is pure content-addressable storage: identical blobs are stored exactly once regardless of how many images reference them.
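The "stored exactly once" property can be illustrated without containerd at all. The following is a minimal sketch, assuming a throwaway directory from mktemp as a stand-in store; cas_put is a hypothetical helper written for this example, not a containerd command.

```shell
#!/bin/sh
# Hypothetical stand-in for a content store root (NOT containerd's real directory).
store=$(mktemp -d)
mkdir -p "$store/blobs/sha256"

# cas_put FILE: copy FILE into the store under its SHA-256 digest, print the digest.
cas_put() {
  digest=$(sha256sum "$1" | cut -d' ' -f1)
  dest="$store/blobs/sha256/$digest"
  # Identical content maps to the same path, so it is written only once.
  [ -e "$dest" ] || cp "$1" "$dest"
  printf 'sha256:%s\n' "$digest"
}

# Two files with different names but identical bytes.
printf 'layer-data\n' > "$store/a.tar"
printf 'layer-data\n' > "$store/b.tar"

d1=$(cas_put "$store/a.tar")
d2=$(cas_put "$store/b.tar")
blob_count=$(ls "$store/blobs/sha256" | wc -l | tr -d ' ')

echo "first:  $d1"
echo "second: $d2"
echo "blobs stored: $blob_count"

rm -rf "$store"
```

Both calls print the same digest and the store ends up holding a single blob, which is the behaviour the callout describes for layers shared between images.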
                        

containerd: Snapshotters

Snapshotters are containerd's abstraction for preparing filesystem layers. They take the compressed tar layers from the content store and prepare them as mountable filesystem views. The snapshotter interface supports different backend technologies:

Snapshotter	Backend	Copy-on-Write	Best For
overlayfs	OverlayFS kernel module	Yes	Default, general purpose (Linux 4.0+)
native	Plain directory copies	No	Filesystems without overlay support
btrfs	Btrfs subvolumes	Yes (filesystem-level)	Btrfs-based systems
zfs	ZFS clones	Yes (filesystem-level)	ZFS-based systems
devmapper	Device mapper thin-provisioning	Yes (block-level)	AWS Firecracker, block storage
stargz	Lazy-pulling eStargz images	Yes	Large images, fast startup
nydus	Nydus image format	Yes	Container image acceleration

# List all snapshots
sudo ctr snapshots ls
# KEY                                    PARENT                                 KIND
# sha256:a1b2c3...                                                             Committed
# sha256:b2c3d4...                       sha256:a1b2c3...                       Committed
# sha256:c3d4e5...                       sha256:b2c3d4...                       Committed
# nginx-container                        sha256:c3d4e5...                       Active

# Inspect a snapshot's details (mounts, parent chain)
sudo ctr snapshots info sha256:a1b2c3...
# {
#   "Kind": "Committed",
#   "Name": "sha256:a1b2c3...",
#   "Created": "2026-05-14T10:00:00Z",
#   "Updated": "2026-05-14T10:00:00Z"
# }

# View mount instructions for a snapshot
sudo ctr snapshots mounts /tmp/mnt nginx-container
# mount -t overlay overlay -o
#   lowerdir=/var/lib/containerd/.../sha256:c3d4e5...,
#   upperdir=/var/lib/containerd/.../nginx-container/fs,
#   workdir=/var/lib/containerd/.../nginx-container/work
#   /tmp/mnt

# Prepare a new active snapshot (writable layer)
sudo ctr snapshots prepare my-new-layer sha256:c3d4e5...

The snapshot hierarchy mirrors the image layer stack. Each committed snapshot represents a read-only layer. An active snapshot is a writable layer created on top of committed parents — this is the container's writable layer where runtime changes are stored.
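To make the relationship between the snapshot chain and the overlay mount concrete, here is a small sketch that assembles the mount options from a list of layer directories. The function name overlay_opts and all paths are hypothetical. The only kernel fact relied on is that OverlayFS treats the leftmost lowerdir entry as the topmost read-only layer.

```shell
#!/bin/sh
# overlay_opts UPPER WORK LOWER_TOP [LOWER_NEXT ...]
# Prints the -o option string for an overlay mount.
# Lower directories are passed topmost-first, matching OverlayFS ordering.
overlay_opts() {
  upper=$1
  work=$2
  shift 2
  lower=$(printf '%s:' "$@")   # join remaining arguments with ':'
  lower=${lower%:}             # drop the trailing ':'
  printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}

# Hypothetical paths: three committed (read-only) layers plus one active snapshot.
overlay_opts /snap/active/fs /snap/active/work /snap/3/fs /snap/2/fs /snap/1/fs
# prints: lowerdir=/snap/3/fs:/snap/2/fs:/snap/1/fs,upperdir=/snap/active/fs,workdir=/snap/active/work
```

The committed snapshots become the lowerdir chain and the active snapshot supplies upperdir and workdir, which is the same shape as the mount shown by ctr snapshots mounts.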

containerd: Container Lifecycle

containerd distinguishes between a container (metadata object) and a task (running process). A container is a static record of what to run. A task is the running instance of that container. This separation allows you to create containers without starting them, and to re-create tasks from the same container definition.

# Pull an image first
sudo ctr images pull docker.io/library/alpine:3.19

# Create a container (metadata only, nothing running yet).
# A long-running command is supplied so the detached task started below stays alive.
sudo ctr containers create docker.io/library/alpine:3.19 my-alpine sleep 3600

# List containers (note: no running process yet)
sudo ctr containers ls
# CONTAINER    IMAGE                             RUNTIME
# my-alpine    docker.io/library/alpine:3.19     io.containerd.runc.v2

# Inspect container metadata
sudo ctr containers info my-alpine

# Start a task (this creates the actual running process)
sudo ctr tasks start -d my-alpine

# List running tasks
sudo ctr tasks ls
# TASK        PID      STATUS
# my-alpine   12345    RUNNING

# Execute a command in the running task
sudo ctr tasks exec --exec-id shell1 -t my-alpine /bin/sh

# View task process metrics
sudo ctr tasks metrics my-alpine

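# Kill the task (container metadata remains).
# SIGKILL is used because PID 1 in a PID namespace ignores signals it has no handler for,
# which includes the default SIGTERM.
sudo ctr tasks kill -s SIGKILL my-alpine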

# Delete the task
sudo ctr tasks delete my-alpine

# Delete the container (metadata)
sudo ctr containers delete my-alpine

                            
Container vs Task: Think of it like a program (file on disk) vs a process (running instance). A container is the "recipe": image reference, runtime config, snapshot. A task is the "execution": PID, status, I/O streams. You can delete and recreate tasks without losing the container's configuration.
                        

containerd-shim

The containerd-shim is a critical but often overlooked component. One shim process exists per container, serving as the direct parent of the container process. The shim exists for three reasons:

Daemon-less containers — The shim allows containerd to restart without killing running containers. The shim keeps the container alive independently.
Exit code reaping — In Linux, when a process exits, its parent must call wait() to collect the exit status. The shim is the parent that reaps the container process.
stdio management — The shim holds the container's stdin/stdout/stderr file descriptors and forwards them to logging systems (FIFO pipes or files).
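The exit code reaping role can be sketched in plain shell: a parent starts a child, waits on it, and records the exit status. This is only an illustration of the wait() contract the shim fulfils; the exit value 42 is an arbitrary choice for the demo.

```shell
#!/bin/sh
# Start a child that exits with a known status (42 is arbitrary).
sh -c 'exit 42' &
child=$!

# wait collects the child's exit status, as a parent must do to reap it.
status=0
wait "$child" || status=$?

echo "child $child exited with status $status"
```

If no parent ever collected this status, the exited child would linger as a zombie. The shim plays the waiting parent for the container's init process and passes the status on to containerd.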

containerd-shim Isolation Model

flowchart TB
    subgraph Daemon["containerd daemon"]
        CD["containerd process<br/>PID 1000"]
    end
    subgraph Shims["Per-Container Shims"]
        S1["shim (nginx)<br/>PID 2001"]
        S2["shim (redis)<br/>PID 2002"]
        S3["shim (postgres)<br/>PID 2003"]
    end
    subgraph Containers["Container Processes"]
        C1["nginx<br/>PID 3001"]
        C2["redis-server<br/>PID 3002"]
        C3["postgres<br/>PID 3003"]
    end
    CD -.->|"ttrpc"| S1
    CD -.->|"ttrpc"| S2
    CD -.->|"ttrpc"| S3
    S1 -->|"parent of"| C1
    S2 -->|"parent of"| C2
    S3 -->|"parent of"| C3

    style Daemon fill:#f0f9f9,stroke:#3B9797
    style Shims fill:#f8f9fa,stroke:#132440
    style Containers fill:#fff5f5,stroke:#BF092F

# View shim processes on a running system
ps aux | grep containerd-shim
# root  2001  containerd-shim-runc-v2 -namespace moby -id abc123...
# root  2002  containerd-shim-runc-v2 -namespace moby -id def456...
# root  2003  containerd-shim-runc-v2 -namespace moby -id ghi789...

# Each shim's parent is PID 1 (init), NOT containerd
# This is by design — allows containerd restart
ps -o pid,ppid,comm -p 2001
# PID   PPID  COMMAND
# 2001  1     containerd-shim-runc-v2

# Shim communicates with containerd via ttrpc (lightweight gRPC)
# Socket location per container:
ls /run/containerd/io.containerd.runtime.v2.task/moby/abc123.../
# address  config.json  init.pid  log  log.json  shim.pid

                            
Shim Version: Older containerd releases shipped two shim versions: containerd-shim (v1, legacy) and containerd-shim-runc-v2 (v2, current). The v2 shim speaks ttrpc, can serve a whole Kubernetes pod from one shim process (rather than one per container), and uses fewer resources. Always use v2. The v1 shim was deprecated and has been removed in containerd 2.0.
                        

runc Deep Dive

runc is the OCI reference implementation — the canonical low-level container runtime originally extracted from Docker's codebase. Written in Go, it directly manipulates Linux kernel features (namespaces, cgroups, capabilities, seccomp) to create containers. It takes an OCI bundle as input and produces a running, isolated process.

# Check runc version and supported features
runc --version
# runc version 1.1.12
# commit: v1.1.12-0-g51d5e946
# spec: 1.1.0
# go: go1.21.6
# libseccomp: 2.5.4

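# Generate a default OCI config.json
mkdir -p mycontainer/rootfs
cd mycontainer
runc spec
# Creates config.json with sensible defaults

# The default spec runs an interactive "sh" with a terminal attached.
# A detached "runc create" needs the terminal disabled (or a --console-socket),
# and a long-running command keeps the container alive for the steps below.
# These edits assume the unmodified layout produced by "runc spec".
sed -i 's/"terminal": true/"terminal": false/' config.json
sed -i 's/"sh"/"sleep", "3600"/' config.json

# Create a rootfs from an Alpine image
docker export $(docker create alpine:3.19) | tar -xC rootfs/

# Create a container (does NOT start it)
sudo runc create my-container
# Container is now in "created" state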

# Check container state
sudo runc state my-container
# {
#   "ociVersion": "1.1.0",
#   "id": "my-container",
#   "pid": 45678,
#   "status": "created",
#   "bundle": "/home/user/mycontainer",
#   "rootfs": "/home/user/mycontainer/rootfs",
#   "created": "2026-05-14T10:00:00.123456789Z"
# }

# Start the container (transitions to "running")
sudo runc start my-container

# List running containers managed by runc
sudo runc list
# ID              PID     STATUS   BUNDLE                          CREATED
# my-container    45678   running  /home/user/mycontainer          2026-05-14T10:00:00Z

# Execute a command in the running container
sudo runc exec my-container ls /

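# Send a signal to the container.
# KILL is used because the container's PID 1 ignores signals it has no handler for (such as TERM),
# and "runc delete" below requires a stopped container.
sudo runc kill my-container KILL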

# Delete the container
sudo runc delete my-container

runc Internals

When runc creates a container, it performs a carefully orchestrated sequence of kernel operations. Understanding this sequence reveals exactly what "creating a container" means at the system level:

runc Container Creation Sequence

sequenceDiagram
    participant P as Parent (runc)
    participant I as Init Process (runc init)
    participant K as Kernel

    P->>K: clone(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC)
    K-->>I: New process in new namespaces
    P->>K: Place init process in cgroups (memory, cpu, pids)
    I->>K: mount("none", "/", MS_REC|MS_PRIVATE)
    I->>K: Setup mounts (/proc, /dev, /sys, tmpfs)
    I->>K: pivot_root(rootfs, old_root)
    I->>K: umount(old_root, MNT_DETACH)
    I->>K: Set hostname (UTS namespace)
    I->>K: Apply seccomp filter (BPF program)
    I->>K: Drop capabilities (keep only allowed set)
    I->>K: setuid/setgid (drop root if configured)
    I->>P: Signal ready (via pipe)
    P-->>I: Signal start (via pipe)
    I->>K: exec(container_entrypoint)
    Note over I: Init process replaced by<br/>application process

Step-by-Step Breakdown

clone() with namespace flags — Creates a new process in fresh namespaces. CLONE_NEWPID gives it PID 1 inside the container. CLONE_NEWNET gives it an isolated network stack. CLONE_NEWNS gives it private mount points.
Mount propagation — Makes the mount namespace fully private so that mounts inside the container don't leak to the host and vice versa.
Setup special filesystems — Mounts /proc (process information scoped to PID namespace), /dev (minimal device nodes), /sys (read-only sysfs).
pivot_root() — Changes the root filesystem to the container's rootfs. Unlike chroot, pivot_root swaps the root mount itself; once the old root is unmounted, the host filesystem is no longer reachable by path from inside the container, which closes the classic chroot escape.
Cgroup configuration — The parent runc process places the container's init process into the appropriate cgroups to enforce resource limits (memory, CPU, PIDs, I/O bandwidth).
seccomp filter — Installs a BPF (Berkeley Packet Filter) program that the kernel evaluates on every system call entry, rejecting disallowed calls before they execute.
Capability drop — Linux capabilities split root's powers into about 40 discrete privileges. The container keeps only what its config.json lists; the default from runc spec keeps CAP_KILL, CAP_NET_BIND_SERVICE, and CAP_AUDIT_WRITE, while Docker's default set is larger.
exec() — The init process replaces itself with the actual container application. From this point, runc's code is gone — only the application code runs.
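The result of the capability drop is visible as a hexadecimal bitmask (the CapEff field in /proc/PID/status). Here is a minimal sketch for decoding such a mask. The helper has_cap is hypothetical, and the sample mask is constructed by hand to contain only CAP_KILL (bit 5), CAP_NET_BIND_SERVICE (bit 10), and CAP_AUDIT_WRITE (bit 29); it was not captured from a real container.

```shell
#!/bin/sh
# has_cap MASK_HEX BIT: succeed if capability number BIT is set in MASK_HEX.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Hand-built example mask: bits 5, 10 and 29 set.
mask=0000000020000420

# Capability numbers as defined in linux/capability.h.
for entry in 5:CAP_KILL 10:CAP_NET_BIND_SERVICE 21:CAP_SYS_ADMIN 29:CAP_AUDIT_WRITE; do
  bit=${entry%%:*}
  name=${entry#*:}
  if has_cap "$mask" "$bit"; then
    echo "$name: kept"
  else
    echo "$name: dropped"
  fi
done
# prints:
# CAP_KILL: kept
# CAP_NET_BIND_SERVICE: kept
# CAP_SYS_ADMIN: dropped
# CAP_AUDIT_WRITE: kept
```

To inspect a real process, the mask could be read with something like awk '/^CapEff:/ {print $2}' /proc/PID/status and passed to the same function.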

                            
Why Two Processes? runc uses a "parent-child" or "double-fork" pattern. The parent process (runc) stays in the host namespaces and orchestrates. The child process (runc init) enters the new namespaces and sets up the container from the inside. They communicate via a pipe. This is necessary because some namespace operations can only be performed from inside the new namespace (e.g., you can only mount /proc for the new PID namespace from within it).
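Whether two processes share a namespace can be checked by comparing the links under /proc/PID/ns. The sketch below compares the current shell with a child it starts. Because an ordinary child inherits every namespace, each line is expected to report "shared"; a container's init process compared against a host process would differ. The helpers ns_id and same_ns are hypothetical names for this example.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]".
ns_id() {
  readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes are in the same namespace.
same_ns() {
  [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}

# An ordinary child created by fork inherits all of its parent's namespaces.
sleep 30 &
child=$!

for type in mnt pid net uts ipc; do
  if same_ns "$$" "$child" "$type"; then
    echo "$type: shared"
  else
    echo "$type: separate"
  fi
done

kill "$child"
wait "$child" 2>/dev/null || true
```

Reading another user's /proc/PID/ns entries generally requires root, so comparing against a real container's PID would need sudo.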
                        

The OCI Bundle

An OCI bundle is the complete input to a low-level runtime. You can create one manually to understand exactly what runc receives from containerd:

# Create directory structure
mkdir -p my-oci-bundle/rootfs

# Option 1: Export a Docker image's filesystem
docker export $(docker create busybox:latest) | tar -xC my-oci-bundle/rootfs/

# Option 2: Build a minimal rootfs from scratch
mkdir -p my-oci-bundle/rootfs/{bin,dev,proc,sys,tmp}
cp /bin/busybox my-oci-bundle/rootfs/bin/
cd my-oci-bundle/rootfs/bin && for cmd in sh ls cat echo ps mount; do ln -s busybox $cmd; done && cd ../../..

# Generate the default config.json
cd my-oci-bundle
runc spec

# Customize config.json: change the process
# Edit config.json to set: "args": ["/bin/sh", "-c", "echo Hello from OCI bundle && ps aux"]

# Run the container directly with runc
sudo runc run my-first-oci-container

# A foreground "runc run" removes the container once its process exits,
# so it should no longer appear in the list
sudo runc list

The minimal viable OCI bundle needs only:

A directory (rootfs/) with at least one executable file
A config.json that references that executable in process.args

Everything else in config.json — namespaces, mounts, cgroups, capabilities — provides isolation and resource control. Without them, you'd have a chrooted process with no real security boundary.
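The two layout requirements can be checked with a few lines of shell. This is a sketch only: check_bundle is a hypothetical helper, it validates the directory layout rather than the OCI schema, and the config.json written for the demo is a placeholder fragment that is not sufficient to run with runc.

```shell
#!/bin/sh
# check_bundle DIR: verify DIR has config.json and a rootfs/ with an executable file.
check_bundle() {
  bundle=$1
  [ -f "$bundle/config.json" ] || { echo "missing config.json"; return 1; }
  [ -d "$bundle/rootfs" ] || { echo "missing rootfs/"; return 1; }
  # -perm -100 matches files with the owner-execute bit set.
  if [ -z "$(find "$bundle/rootfs" -type f -perm -100 | head -n 1)" ]; then
    echo "no executable file under rootfs/"
    return 1
  fi
  echo "bundle layout looks plausible"
}

# Build a throwaway demo bundle (placeholder contents, not a runnable container).
demo=$(mktemp -d)
mkdir -p "$demo/rootfs/bin"
printf '#!/bin/sh\necho hello\n' > "$demo/rootfs/bin/hello"
chmod +x "$demo/rootfs/bin/hello"
printf '{"ociVersion":"1.0.2","process":{"cwd":"/","args":["/bin/hello"]},"root":{"path":"rootfs"}}\n' > "$demo/config.json"

check_bundle "$demo"
```

A real bundle would normally start from the output of runc spec, which supplies the namespaces, mounts, and capability settings that this placeholder omits.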

Container Runtime Interface (CRI)

The Container Runtime Interface (CRI) is Kubernetes' abstraction layer for container runtimes. It's a gRPC protocol that defines how the kubelet (Kubernetes node agent) communicates with any container runtime, without knowing the implementation details.

CRI Architecture in Kubernetes

flowchart LR
    subgraph K8s["Kubernetes Node"]
        KL[kubelet]
    end
    subgraph CRI["CRI Protocol (gRPC)"]
        RS[RuntimeService]
        IS[ImageService]
    end
    subgraph Runtimes["CRI Implementations"]
        CD["containerd<br/>+ CRI plugin"]
        CO[CRI-O]
    end
    subgraph Low["Low-Level"]
        R1[runc]
        R2[kata]
    end
    KL --> RS
    KL --> IS
    RS --> CD
    RS --> CO
    IS --> CD
    IS --> CO
    CD --> R1
    CD --> R2
    CO --> R1
    style K8s fill:#f8f9fa,stroke:#132440
    style CRI fill:#f0f9f9,stroke:#3B9797
    style Runtimes fill:#f8f9fa,stroke:#132440
    style Low fill:#fff5f5,stroke:#BF092F

CRI Services

Service	Key Operations	Purpose
RuntimeService	RunPodSandbox, CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers, ExecSync	Container lifecycle management
ImageService	PullImage, ListImages, RemoveImage, ImageStatus, ImageFsInfo	Image operations

# Use crictl to interact with the CRI directly
# (crictl is the CRI equivalent of docker CLI)

# Configure crictl to use containerd's CRI socket
cat /etc/crictl.yaml
# runtime-endpoint: unix:///run/containerd/containerd.sock
# image-endpoint: unix:///run/containerd/containerd.sock

# List pods (sandboxes)
sudo crictl pods
# POD ID    CREATED     STATE   NAME              NAMESPACE   ATTEMPT
# abc123    2 hours ago Ready   nginx-pod         default     0

# List containers
sudo crictl ps
# CONTAINER  IMAGE     CREATED     STATE     NAME    ATTEMPT  POD ID
# def456     nginx     2 hours ago Running   nginx   0        abc123

# Pull an image via CRI
sudo crictl pull docker.io/library/nginx:alpine

# Inspect a container
sudo crictl inspect def456

# View container logs
sudo crictl logs def456

# Execute a command in a container
sudo crictl exec -it def456 /bin/sh

                            
Why Kubernetes Dropped Docker: In Kubernetes 1.24, the "dockershim" was removed. Kubernetes didn't stop supporting Docker images; it stopped using Docker as a runtime. The problem: Docker doesn't implement CRI natively. The dockershim was an adapter that translated CRI calls to Docker API calls. This extra layer added complexity with no benefit, because containerd (which Docker uses internally anyway) implements CRI directly. Removing the shim simplified the stack: kubelet → containerd → runc, instead of kubelet → dockershim → dockerd → containerd → runc.
                        

CRI-O

CRI-O is an alternative high-level runtime purpose-built for Kubernetes. Unlike containerd (which serves Docker, Kubernetes, and standalone use), CRI-O is designed only to implement the CRI protocol. This narrower scope keeps it smaller and reduces its attack surface.

Feature	containerd	CRI-O
Primary Purpose	General-purpose container runtime	Kubernetes-only runtime
CRI Support	Via built-in CRI plugin	Native (it is a CRI implementation)
Docker Compatibility	Yes (Docker uses containerd)	No (Kubernetes only)
CLI Tool	`ctr`, `nerdctl`	`crictl` only
Image Building	Supports BuildKit	No (use Buildah externally)
Default Snapshotter	overlayfs	overlayfs
Version Sync	Independent releases	Tracks Kubernetes versions (1.28, 1.29...)
Maintained By	CNCF (graduated)	CNCF (graduated in 2023)
Used By	Docker, AWS EKS, GKE, AKS	Red Hat OpenShift, Fedora CoreOS

# CRI-O configuration
cat /etc/crio/crio.conf

# Key settings:
# [crio.runtime]
# default_runtime = "runc"
# conmon_cgroup = "pod"
#
# [crio.runtime.runtimes.runc]
# runtime_path = "/usr/bin/runc"
# runtime_type = "oci"
#
# [crio.runtime.runtimes.kata]
# runtime_path = "/usr/bin/kata-runtime"
# runtime_type = "oci"
# privileged_without_host_devices = true

# Check CRI-O status
sudo systemctl status crio

# CRI-O also uses crictl for interaction
sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock pods

Alternative Runtimes

OCI compliance enables a rich ecosystem of alternative low-level runtimes, each optimised for different use cases:

Runtime	Language	Key Differentiator	Startup Overhead (rough, setup-dependent)	Use Case
runc	Go	Reference implementation, battle-tested	~100ms	General purpose, default everywhere
crun	C	Typically faster startup and lower memory use than runc	~50ms	Performance-critical, Podman default
youki	Rust	Memory safety, no GC pauses	~60ms	Security-focused, emerging alternative
kata-containers	Go/Rust	Runs each container in a lightweight VM	~500ms	Multi-tenant clouds, untrusted workloads
gVisor (runsc)	Go	User-space kernel intercepts all syscalls	~150ms	Defence in depth, Google Cloud Run
Firecracker	Rust	MicroVM monitor with minimal device model (a VMM used beneath a runtime, not itself an OCI runtime)	~125ms	AWS Lambda, AWS Fargate
WasmEdge	C++/Rust	WebAssembly runtime as OCI runtime	~1ms	Lightweight serverless, edge computing


Choosing a Runtime: Decision Matrix

The choice of runtime depends on your threat model and performance requirements:

Single-tenant, trusted workloads → runc or crun (maximum performance, namespace isolation is sufficient)
Multi-tenant, untrusted workloads → kata-containers or gVisor (additional VM/kernel boundary between tenants)
Serverless/FaaS platforms → Firecracker (fast VM boot, minimal overhead, strong isolation for function execution)
Edge computing, plugins → WasmEdge (sub-millisecond startup, sandboxed execution, language-agnostic)
Maximum startup performance → crun (C implementation, no garbage collector, minimal overhead)

Key principle: All OCI-compliant runtimes run the same images. Your choice of runtime is independent of your image format. You can switch runtimes per workload in Kubernetes using RuntimeClass resources.


# Kubernetes RuntimeClass: run specific pods with kata-containers
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
scheduling:
  nodeSelector:
    kubernetes.io/runtime: kata
---
# Use the RuntimeClass in a Pod spec
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: untrusted-image:latest
    resources:
      limits:
        memory: "256Mi"
        cpu: "500m"

Exercises

Hands-On 30 minutes

Exercise 1: Run a Container with runc Directly

Bypass Docker entirely and create a container using only runc:

# Create the OCI bundle structure
mkdir -p runc-exercise/rootfs
cd runc-exercise

# Extract Alpine's filesystem
docker export $(docker create alpine:3.19) | tar -xC rootfs/

# Generate config.json
runc spec

# Edit config.json to run a custom command
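# Change "args": ["sh"] to "args": ["/bin/sh", "-c", "echo PID: $$ && hostname && cat /proc/self/cgroup && sleep 60"]
# (the trailing sleep keeps the container alive long enough to inspect it from another terminal)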

# Run the container
sudo runc run exercise-container

# In another terminal, check the container state
sudo runc state exercise-container
sudo runc list


Hands-On 25 minutes

Exercise 2: Explore containerd with ctr

Use containerd's native CLI to pull images, create containers, and manage tasks:

# Pull an image directly through containerd
sudo ctr images pull docker.io/library/busybox:latest

# Create a container (metadata only); the command to run is part of the container definition
sudo ctr containers create docker.io/library/busybox:latest my-busybox /bin/sh -c "while true; do date; sleep 5; done"

# Start a detached task for that container
sudo ctr tasks start -d my-busybox

# List tasks and check PID
sudo ctr tasks ls

# Exec into the running task
sudo ctr tasks exec --exec-id debug -t my-busybox /bin/sh

# View the container's snapshot
sudo ctr snapshots ls | grep busybox

# Cleanup (SIGKILL, because the shell running as PID 1 ignores the default SIGTERM)
sudo ctr tasks kill -s SIGKILL my-busybox
sudo ctr tasks delete my-busybox
sudo ctr containers delete my-busybox


Hands-On 20 minutes

Exercise 3: Trace the Runtime Stack

Start a Docker container and observe the entire process hierarchy:

# Start a long-running container
docker run -d --name trace-test nginx:alpine

# Find all related processes
CONTAINER_PID=$(docker inspect trace-test --format '{{.State.Pid}}')
echo "Container PID: $CONTAINER_PID"

# Show the container's ancestors and descendants
pstree -s -p $CONTAINER_PID
# systemd(1)───containerd-shim(5678)───nginx(9012)───nginx(9013)
# Note that containerd itself does not appear: the shim is parented by PID 1

# Verify the shim is the parent
ps -o pid,ppid,comm -p $CONTAINER_PID
# PID    PPID  COMMAND
# 9012   5678  nginx

# Check the shim's parent (should be PID 1, not containerd)
ps -o pid,ppid,comm -p 5678
# PID    PPID  COMMAND
# 5678   1     containerd-shim

# Cleanup
docker rm -f trace-test
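If pstree is not installed, the same ancestry can be walked by hand through /proc. The following is a sketch for Linux only; ancestry is a hypothetical helper, and here it is run against the current shell rather than a container PID so that it works on any machine.

```shell
#!/bin/sh
# ancestry PID: print "name(pid)" for PID and each of its ancestors, one per line.
# Only the first word of a process name is shown if the name contains spaces.
ancestry() {
  pid=$1
  while [ -r "/proc/$pid/status" ]; do
    name=$(awk '/^Name:/ {print $2}' "/proc/$pid/status")
    printf '%s(%s)\n' "$name" "$pid"
    pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")
  done
}

# Walk upwards from the current shell.
ancestry "$$"
```

Running ancestry "$CONTAINER_PID" during the exercise should list the nginx process first, then the shim, then PID 1.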


Conclusion & Next Steps

The container runtime stack is a clear example of separation of concerns. containerd handles the "what": managing images, snapshots, and container metadata. runc handles the "how": creating isolated processes using Linux kernel primitives. The containerd-shim bridges them, keeping containers alive across daemon restarts and reaping their exit codes.

Key takeaways:

containerd is a high-level runtime managing the full container lifecycle: images (content store), filesystem preparation (snapshotters), and process supervision (tasks)
runc is a short-lived process that creates containers from OCI bundles using clone(), pivot_root(), cgroups, seccomp, and capabilities — then exits
containerd-shim is the per-container process that parents the container, enabling daemon restarts without killing workloads
CRI abstracts runtime differences away from Kubernetes, enabling containerd and CRI-O to be interchangeable
Alternative runtimes (kata, gVisor, crun) provide different trade-offs between isolation strength and performance, all using the same OCI image format

Next in the Series

In Part 15: Security Fundamentals, we'll explore the security model that makes containers safe — from Linux capabilities and seccomp profiles to AppArmor/SELinux policies, rootless containers, and image signing. You'll learn to harden containers for production and understand the threat models that inform runtime design decisions.
