Part 16: Runtime Security & Hardening

May 14, 2026 Wasil Zafar 26 min read

Image scanning catches known vulnerabilities before deployment, but what protects containers at runtime? This article builds defence-in-depth using Linux security primitives — dropping capabilities, filtering system calls with seccomp, enforcing mandatory access control with AppArmor and SELinux, locking filesystems to read-only, and running the entire Docker daemon without root privileges.

Table of Contents

  1. Defence in Depth
  2. Linux Capabilities
  3. Seccomp Profiles
  4. AppArmor
  5. SELinux
  6. Read-Only Filesystems
  7. Rootless Containers
  8. No-New-Privileges
  9. Resource Limits as Security
  10. Docker Bench for Security
  11. Runtime Security Tools
  12. Exercises
  13. Conclusion & Next Steps

Defence in Depth

No single security control is sufficient. Runtime security follows the defence-in-depth principle: multiple independent layers so that if one fails, others still protect the system. An attacker who bypasses seccomp still faces AppArmor; one who escapes AppArmor still hits a read-only filesystem with no capabilities.

Security Layer Stack

Container Runtime Security Layers
flowchart TB
    A["Application Code"] --> B["User Namespace (non-root)"]
    B --> C["Capability Restrictions"]
    C --> D["Seccomp (syscall filtering)"]
    D --> E["AppArmor / SELinux (MAC)"]
    E --> F["Read-Only Filesystem"]
    F --> G["Resource Limits (cgroups)"]
    G --> H["Host Kernel"]

    style A fill:#f8f9fa,stroke:#132440
    style B fill:#f0f9f9,stroke:#3B9797
    style C fill:#f0f9f9,stroke:#3B9797
    style D fill:#f0f9f9,stroke:#3B9797
    style E fill:#f0f9f9,stroke:#3B9797
    style F fill:#f0f9f9,stroke:#3B9797
    style G fill:#f0f9f9,stroke:#3B9797
    style H fill:#fff5f5,stroke:#BF092F
                            

Each layer addresses a different class of attack:

Layer | Defends Against | Mechanism
Non-root user | Privilege escalation starting point | USER directive, user namespaces
Capabilities | Excessive root-like powers | --cap-drop ALL, selective --cap-add
Seccomp | Kernel exploitation via syscalls | BPF filter blocking dangerous syscalls
AppArmor/SELinux | File access, network, mount abuse | Mandatory access control policies
Read-only FS | Persistence, malware installation | --read-only flag, tmpfs for writes
Resource limits | DoS, cryptomining, fork bombs | cgroup memory, CPU, PID limits

Linux Capabilities

Traditionally, Linux has two privilege levels: root (UID 0, can do everything) and non-root (restricted). Capabilities split root's powers into ~40 distinct privileges that can be independently granted or revoked.

Docker grants containers a subset of capabilities by default — more than necessary for most applications, but less than full root:

Default Docker Capabilities

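Capability | What It Allows | Usually Needed?
CAP_CHOWN | Change file ownership | Rarely
CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Rarely
CAP_FOWNER | Bypass permission checks that normally require the file's owner (e.g. chmod) | Rarely
CAP_FSETID | Don't clear set-user-ID/set-group-ID bits on file modify | Rarely
CAP_KILL | Send signals to any process | Sometimes
CAP_SETGID | Manipulate process GIDs | Sometimes
CAP_SETUID | Manipulate process UIDs | Sometimes
CAP_SETPCAP | Change the process's capability bounding and inheritable sets | Rarely
CAP_NET_BIND_SERVICE | Bind to ports below 1024 | Often (web servers)
CAP_NET_RAW | Use RAW/PACKET sockets (ping, tcpdump) | Rarely
CAP_SYS_CHROOT | Use chroot() | Rarely
CAP_MKNOD | Create special files using mknod() | Rarely
CAP_AUDIT_WRITE | Write to kernel audit log | Rarely
CAP_SETFCAP | Set file capabilities | Rarely

# Run with ALL capabilities dropped, add back only what's needed.
# The official nginx image starts as root, so besides NET_BIND_SERVICE it
# typically also needs CHOWN, SETGID and SETUID to prepare its cache
# directories and switch its workers to the nginx user.
docker run --cap-drop ALL \
  --cap-add NET_BIND_SERVICE --cap-add CHOWN --cap-add SETGID --cap-add SETUID \
  nginx:alpine

# A server that already runs as a non-root user typically needs at most
# NET_BIND_SERVICE (for port 80/443); everything else is unnecessary attack surface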

# View capabilities of a running container
docker exec mycontainer cat /proc/1/status | grep Cap

# Decode capability hex values
capsh --decode=00000000a80425fb

# Run with completely unprivileged container (zero capabilities)
docker run --cap-drop ALL --user 1000:1000 myapp:latest

# DANGEROUS: Never do this in production
docker run --privileged myapp:latest
# --privileged gives ALL capabilities + device access + disables seccomp/AppArmor

Critical Warning: The --privileged flag is the nuclear option. It grants ALL capabilities, disables the default seccomp and AppArmor profiles, gives access to all host devices, and makes /sys and the cgroup mounts writable. A privileged container can trivially escape to the host. Never use --privileged in production; identify the specific capability or device your container needs instead.
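
The hex masks printed by /proc/1/status can be decoded without capsh. The following is a minimal sketch in POSIX sh: the function name decode_caps is made up for this illustration, and the name list follows the bit order in the kernel's linux/capability.h (bits 0 to 40).

```shell
#!/bin/sh
# Capability names indexed by bit number (0..40), in kernel header order.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid \
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw \
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct \
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease \
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm \
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one cap_* name for every bit set in the mask.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "cap_$name"
        fi
        bit=$((bit + 1))
    done
}

# The mask from the capsh example above
decode_caps 00000000a80425fb
```

Feeding it the CapEff value of a container started with --cap-drop ALL --cap-add NET_BIND_SERVICE should print a single line, cap_net_bind_service, which makes it easy to confirm that the flags took effect.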

Seccomp (Secure Computing Mode)

Seccomp filters which system calls a process can make. The Linux kernel has ~450 syscalls, but most applications need only 50-100. By blocking unused syscalls, you eliminate entire classes of kernel exploits — an attacker who gains code execution inside the container cannot use blocked syscalls to escalate privileges.

Docker applies a default seccomp profile that blocks ~44 dangerous syscalls while allowing ~300+ safe ones:

Blocked Syscall | Why It's Dangerous
mount / umount2 | Could mount host filesystems into container
reboot | Could reboot the host
clock_settime | Could alter system time affecting all containers
kexec_load | Could load a new kernel (complete host takeover)
ptrace | Could debug/control other processes (blocked by the default profile only on kernels older than 4.8)
add_key / keyctl | Access kernel keyring (secrets of other containers)
unshare | Create new namespaces (escape current isolation)
bpf | Load eBPF programs (kernel-level code execution)

Creating a Custom Seccomp Profile

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "archMap": [
        { "architecture": "SCMP_ARCH_X86_64", "subArchitectures": ["SCMP_ARCH_X86"] }
    ],
    "syscalls": [
        {
            "names": [
                "accept", "accept4", "access", "bind", "brk",
                "chdir", "chmod", "chown", "close", "connect",
                "dup", "dup2", "dup3", "epoll_create", "epoll_ctl",
                "epoll_wait", "execve", "exit", "exit_group",
                "fchmod", "fchown", "fcntl", "fstat", "futex",
                "getcwd", "getdents64", "getegid", "geteuid",
                "getgid", "getpid", "getppid", "getuid",
                "ioctl", "listen", "lseek", "madvise", "mmap",
                "mprotect", "munmap", "nanosleep", "open",
                "openat", "pipe", "poll", "read", "readlink",
                "recvfrom", "recvmsg", "rename", "rt_sigaction",
                "rt_sigprocmask", "rt_sigreturn", "select",
                "sendmsg", "sendto", "set_robust_list",
                "setsockopt", "shutdown", "socket", "stat",
                "statfs", "sysinfo", "tgkill", "uname",
                "unlink", "wait4", "write", "writev"
            ],
            "action": "SCMP_ACT_ALLOW"
        }
    ]
}
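
The allow-list above is illustrative rather than complete: a real service usually needs further calls (for example clone for threads, or newfstatat with newer C libraries), so build the list from observed behaviour instead of copying it.

# Apply a custom seccomp profile
docker run --security-opt seccomp=custom-profile.json myapp:latest

# Disable seccomp entirely (DANGEROUS; for debugging only)
docker run --security-opt seccomp=unconfined myapp:latest

# Record the syscalls the application makes (using strace).
# Tracing `docker run` itself would only capture the Docker CLI client, because
# the container process is started by the daemon, so trace the application
# directly (./myapp is a placeholder for your binary or start command)
strace -f -o /tmp/syscalls.log ./myapp
# Parse the log to build a minimal allow-list

# Use an OCI seccomp BPF generator (e.g. oci-seccomp-bpf-hook) for tighter profiles;
# this installs the libseccomp Go bindings that such tools build against
sudo apt-get install golang-github-seccomp-libseccomp-golang-dev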
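
The "parse the log" step can be sketched with standard tools. This is an illustration under stated assumptions, not a complete generator: syscall_names and seccomp_profile are hypothetical helper names, the input is assumed to be plain strace text (with or without the PID column that -f adds), and the emitted profile omits archMap for brevity.

```shell
#!/bin/sh
# syscall_names: read strace output on stdin, print the unique syscall names.
# Signal and exit lines ("--- SIGCHLD ... ---", "+++ exited with 0 +++") and
# "<... resumed>" continuations are skipped because they do not start a call.
syscall_names() {
    awk '{
        line = $0
        sub(/^[0-9]+ +/, "", line)               # drop the PID column, if any
        if (match(line, /^[a-z_][a-z0-9_]*\(/))  # syscall name followed by "("
            print substr(line, 1, RLENGTH - 1)
    }' | sort -u
}

# seccomp_profile: read syscall names on stdin, print a deny-by-default profile.
seccomp_profile() {
    names=$(sed 's/.*/"&"/' | paste -sd, -)
    printf '{\n  "defaultAction": "SCMP_ACT_ERRNO",\n'
    printf '  "syscalls": [\n    { "names": [%s], "action": "SCMP_ACT_ALLOW" }\n  ]\n}\n' "$names"
}

# Demo on a three-line sample log (hypothetical contents)
syscall_names <<'EOF' | seccomp_profile
101 openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
101 read(3, "root:x:0:0", 4096) = 10
101 +++ exited with 0 +++
EOF
```

In practice the pipeline would read the recorded log, for instance syscall_names < /tmp/syscalls.log | seccomp_profile > custom-profile.json. A trace only covers the code paths that actually ran, so exercise the application thoroughly before trusting the resulting list.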

AppArmor

AppArmor is a Linux Security Module (LSM) that confines programs to a set of listed resources: the files they can read or write, the network operations they can perform, and the capabilities they may use. On hosts where AppArmor is enabled, Docker automatically applies a default profile (docker-default) to every container.

# Check if AppArmor is enabled
cat /sys/module/apparmor/parameters/enabled
# Y

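# Docker generates docker-default at runtime and loads it straight into the
# kernel, so there is normally no file for it under /etc/apparmor.d/;
# confirm that it is loaded instead
aa-status | grep docker-default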

# List loaded profiles
aa-status

# Create a custom AppArmor profile for a web application
cat > /etc/apparmor.d/docker-webapp << 'EOF'
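#include <tunables/global>

profile docker-webapp flags=(attach_disconnected,mediate_deleted) {
  # NOTE: the include targets were stripped from this copy of the profile.
  # <tunables/global> (above) and <abstractions/base> are the conventional
  # choices and are assumed here; a second in-profile include could not be recovered.
  #include <abstractions/base>

  # Explicitly deny writes to system directories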
  deny /etc/** w,
  deny /usr/** w,
  deny /bin/** w,
  deny /sbin/** w,

  # Allow read access to application files
  /app/** r,
  /app/node_modules/** r,

  # Allow writes to specific directories only
  /tmp/** rw,
  /var/log/app/** rw,
  /run/nginx.pid rw,

  # Deny network raw access (no packet sniffing)
  deny network raw,

  # Deny mount operations
  deny mount,

  # Deny ptrace (no debugging other processes)
  deny ptrace,
}
EOF

# Load the profile
apparmor_parser -r /etc/apparmor.d/docker-webapp

# Apply to a container
docker run --security-opt apparmor=docker-webapp mywebapp:latest

# Run without AppArmor (DANGEROUS)
docker run --security-opt apparmor=unconfined myapp:latest
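
To confirm which profile a container actually runs under, the kernel exposes the current label in /proc/<pid>/attr/current, in the form "profile (mode)" or "unconfined". The helper below is a small sketch for reading that value; apparmor_state is a hypothetical name chosen for this example.

```shell
#!/bin/sh
# apparmor_state: read a /proc/<pid>/attr/current value on stdin and print
# "PROFILE MODE"; an unconfined process prints "unconfined -".
apparmor_state() {
    read -r label
    case $label in
        unconfined)
            echo "unconfined -" ;;
        *" ("*")")
            profile=${label%" ("*}     # text before the trailing " (mode)"
            mode=${label##*" ("}       # text after the last " ("
            echo "$profile ${mode%\)}" ;;
        *)
            echo "$label -" ;;
    esac
}

# Sample value as it would appear for the custom profile above
echo "docker-webapp (enforce)" | apparmor_state
```

Against a running container the input would come from something like docker exec mycontainer cat /proc/1/attr/current | apparmor_state, where mycontainer is a placeholder name. A result of "unconfined -" means no profile is being enforced.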

SELinux

SELinux (Security-Enhanced Linux) provides Mandatory Access Control through type enforcement: every process and file has a security label, and policies define which labels can interact. SELinux is the default on RHEL/CentOS/Fedora systems.

Feature | AppArmor | SELinux
Default distro | Ubuntu, Debian, SUSE | RHEL, CentOS, Fedora
Model | Path-based access control | Label-based type enforcement
Learning curve | Moderate (profile syntax) | Steep (policy language, labels)
Granularity | File paths + capabilities | Labels on all objects (files, ports, processes)
Multi-container isolation | Same profile per image | MCS labels (unique per container)
Docker support | Default profile auto-loaded | Works when host SELinux is enforcing

# Check SELinux status
getenforce
# Enforcing

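# Run Docker with SELinux enabled (RHEL/Fedora)
# Docker CE also needs SELinux support switched on in the daemon
# (--selinux-enabled, or "selinux-enabled": true in /etc/docker/daemon.json)
# Docker then automatically assigns MCS (Multi-Category Security) labels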
docker run --rm -it fedora:39 cat /proc/1/attr/current
# system_u:system_r:container_t:s0:c123,c456

# Each container gets unique MCS categories (c123,c456)
# This prevents containers from accessing each other's files

# Apply a custom SELinux label
docker run --security-opt label=type:custom_container_t myapp:latest

# Disable SELinux for a container (DANGEROUS)
docker run --security-opt label=disable myapp:latest

# Relabel host volumes for container access
docker run -v /host/data:/data:Z myapp:latest
# :Z relabels the directory with the container's MCS label (private)
# :z relabels with shared label (accessible by multiple containers)
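
An SELinux context is a colon-separated string of user, role, type and level, and the MCS categories live in the level part. As a rough sketch, the two helpers below (selinux_field and same_mcs are hypothetical names) split a label and compare the levels of two containers; the second label is invented for the demo.

```shell
#!/bin/sh
# selinux_field LABEL FIELD: print one field of an SELinux context.
# FIELD is one of: user, role, type, level (sensitivity plus MCS categories).
selinux_field() {
    case $2 in
        user)  echo "$1" | cut -d: -f1 ;;
        role)  echo "$1" | cut -d: -f2 ;;
        type)  echo "$1" | cut -d: -f3 ;;
        level) echo "$1" | cut -d: -f4- ;;
        *)     echo "unknown field: $2" >&2; return 1 ;;
    esac
}

# same_mcs LABEL_A LABEL_B: succeed when both labels carry identical levels,
# i.e. the two containers would NOT be separated by MCS.
same_mcs() {
    [ "$(selinux_field "$1" level)" = "$(selinux_field "$2" level)" ]
}

a="system_u:system_r:container_t:s0:c123,c456"
b="system_u:system_r:container_t:s0:c12,c900"    # hypothetical second container
selinux_field "$a" type
selinux_field "$a" level
same_mcs "$a" "$b" || echo "containers are MCS-separated"
```

Both containers share the container_t type, so type enforcement alone would not keep them apart; the differing category pairs are what stop one container from reading the other's files.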

Read-Only Filesystems

A read-only root filesystem prevents an attacker from modifying binaries, installing malware, or persisting backdoors. Combined with tmpfs for necessary writable paths, this creates an immutable container runtime.

# Run with read-only root filesystem
docker run --read-only nginx:alpine
# This will likely fail because nginx needs to write to /var/cache/nginx and /var/run

# Solution: Add tmpfs mounts for required writable paths
docker run --read-only \
  --tmpfs /var/cache/nginx:size=10m \
  --tmpfs /var/run:size=1m \
  --tmpfs /tmp:size=50m \
  nginx:alpine

# Full production example with all hardening combined
docker run -d \
  --name secure-nginx \
  --read-only \
  --tmpfs /var/cache/nginx:size=10m,noexec,nosuid \
  --tmpfs /var/run:size=1m,noexec,nosuid \
  --tmpfs /tmp:size=50m,noexec,nosuid \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --security-opt no-new-privileges \
  --user 101:101 \
  -p 8080:80 \
  nginx:alpine

# Verify the filesystem is read-only
docker exec secure-nginx touch /etc/test
# touch: /etc/test: Read-only file system

Key Insight: The noexec flag on tmpfs mounts prevents direct execution of any files written to those directories. Even if an attacker manages to write a binary to /tmp, they cannot run it from there. An interpreter already present in the image can still read a script from /tmp (for example sh /tmp/x.sh), so keep images minimal as well. Combined with a read-only root filesystem, this blocks the most common post-exploitation technique: downloading and running malware.
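
One way to audit the result is to scan the container's mount table for paths that are writable but still executable. The function below is a minimal sketch that assumes /proc/mounts formatting (device, mount point, type, comma-separated options); risky_mounts is a hypothetical name and the sample lines are made up.

```shell
#!/bin/sh
# risky_mounts: read /proc/mounts-style lines on stdin and print every mount
# point that is writable (rw) but not flagged noexec.
risky_mounts() {
    awk '{
        n = split($4, opt, ",")
        rw = 0; noexec = 0
        for (i = 1; i <= n; i++) {
            if (opt[i] == "rw") rw = 1
            if (opt[i] == "noexec") noexec = 1
        }
        if (rw && !noexec) print $2
    }'
}

# Sample input: a read-only root, a hardened /tmp and a forgotten scratch mount
risky_mounts <<'EOF'
overlay / overlay ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime,size=51200k 0 0
tmpfs /scratch tmpfs rw,nosuid,nodev,relatime 0 0
EOF
```

For a live container the input would be docker exec secure-nginx cat /proc/mounts. Expect some runtime-managed mounts (such as /dev) to show up as well; the interesting entries are the ones you created yourself.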

Rootless Containers

Rootless mode runs the entire Docker daemon and its containers as a non-root user. Even if an attacker breaks out of the container, they land in an unprivileged user namespace, holding only that user's privileges on the host rather than root's.

# Install rootless Docker (Ubuntu/Debian)
# Prerequisites
sudo apt-get install -y uidmap dbus-user-session

# Install Docker rootless
dockerd-rootless-setuptool.sh install

# The daemon runs as your user (no sudo required)
docker info | grep "Root Dir"
# Docker Root Dir: /home/user/.local/share/docker

# Verify rootless mode
docker info --format '{{.SecurityOptions}}'
# the printed list should include name=rootless

# Start/stop rootless Docker
systemctl --user start docker
systemctl --user stop docker

# Enable on login
systemctl --user enable docker
loginctl enable-linger $(whoami)

Rootless mode limitations:

Feature | Root Mode | Rootless Mode
Bind to port < 1024 | Yes | No (without sysctl)
overlay2 storage driver | Yes | Yes (kernel 5.11+)
Network performance | Native | Slight overhead (slirp4netns/pasta)
cgroup v2 resource limits | Full | Limited (systemd user slice)
Host device access | Yes | No
Container escape impact | Full host root access | Unprivileged user only
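
The last row follows from how rootless mode maps user IDs: root inside the container is the invoking user on the host, and every other container UID is taken from that user's subordinate range in /etc/subuid (which is why the uidmap package is a prerequisite). The arithmetic can be sketched as below; host_uid is a hypothetical helper, and the UID 1000 and the range 100000:65536 are assumed example values.

```shell
#!/bin/sh
# host_uid CONTAINER_UID USER_UID SUBUID_START SUBUID_COUNT
# Container UID 0 maps to the invoking user; container UIDs 1..COUNT map
# onto the subordinate range that starts at SUBUID_START.
host_uid() {
    if [ "$1" -eq 0 ]; then
        echo "$2"
    elif [ "$1" -le "$4" ]; then
        echo $(( $3 + $1 - 1 ))
    else
        echo "unmapped"
        return 1
    fi
}

# Assumed setup: user UID 1000 with the /etc/subuid entry "user:100000:65536"
host_uid 0   1000 100000 65536   # container root
host_uid 101 1000 100000 65536   # e.g. the nginx user in the container
```

With these values container root resolves to host UID 1000 and container UID 101 to host UID 100100, so no process in the container ever holds host UID 0.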

No-New-Privileges Flag

The no-new-privileges flag prevents processes inside the container from gaining additional privileges through setuid/setgid binaries, file capabilities, or other escalation mechanisms.

# Enable no-new-privileges
docker run --security-opt no-new-privileges myapp:latest

# What this blocks:
# - setuid binaries (like sudo, su, ping)
# - setgid binaries
# - File capabilities (getcap/setcap)
# - execve() with elevated privileges

# Example: without no-new-privileges, a setuid-root binary runs with root's
# effective UID even when a non-root user executes it
docker run --rm -it ubuntu bash -c "
  chmod u+s /usr/bin/find
  # find is now setuid-root and could be used for escalation by a non-root user
"

# With no-new-privileges, the setuid bit is ignored at exec time
docker run --rm --security-opt no-new-privileges -it ubuntu bash -c "
  chmod u+s /usr/bin/find
  ls -la /usr/bin/find
  # The setuid bit is still set, but execve() no longer grants the extra privileges
"

Best Practice: Always enable no-new-privileges in production. Combined with a non-root USER and dropped capabilities, this creates a container where privilege escalation is effectively impossible through standard Linux mechanisms.
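
Whether the flag is really in force can be read from the NoNewPrivs field that the kernel reports in /proc/<pid>/status. A minimal check might look like this; no_new_privs_set is a hypothetical helper name and the sample status text is abbreviated.

```shell
#!/bin/sh
# no_new_privs_set: read /proc/<pid>/status text on stdin; succeed only when
# the NoNewPrivs field is present and set to 1.
no_new_privs_set() {
    awk '$1 == "NoNewPrivs:" { found = ($2 == 1) } END { exit !found }'
}

# Abbreviated sample of a status file from a hardened container
printf 'Name:\tnginx\nNoNewPrivs:\t1\nSeccomp:\t2\n' | no_new_privs_set \
    && echo "no-new-privileges is active"
```

For a running container the input would be docker exec mycontainer cat /proc/1/status, with mycontainer as a placeholder. The same file also carries the Seccomp field, where 2 indicates that a filter is attached.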

Resource Limits as Security

Resource limits (cgroups) aren't just for performance — they're a security control that prevents denial-of-service attacks, cryptomining abuse, and fork bombs.

# Memory limit: Container is OOM-killed if it exceeds 256MB
docker run --memory=256m --memory-swap=256m myapp:latest

# CPU limit: Container gets at most 0.5 CPU cores
docker run --cpus=0.5 myapp:latest

# PID limit: Prevents fork bombs (maximum 100 processes)
docker run --pids-limit=100 myapp:latest

# Block I/O limits: Prevent I/O saturation attacks
docker run --device-write-bps /dev/sda:10mb myapp:latest

# Ulimits: File descriptor and process limits
docker run --ulimit nofile=1024:1024 --ulimit nproc=50:50 myapp:latest

# Complete production security profile
docker run -d \
  --name hardened-app \
  --memory=512m \
  --memory-swap=512m \
  --cpus=1.0 \
  --pids-limit=200 \
  --ulimit nofile=2048:2048 \
  --read-only \
  --tmpfs /tmp:size=100m,noexec,nosuid \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --security-opt no-new-privileges \
  --security-opt seccomp=custom-profile.json \
  --user 1000:1000 \
  myapp:latest
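
When verifying limits, note that docker inspect reports memory in bytes (for example via --format '{{.HostConfig.Memory}}'), while the flags take suffixed sizes. The conversion is simple binary-unit arithmetic, sketched here with a hypothetical to_bytes helper that handles whole numbers with an optional b, k, m or g suffix.

```shell
#!/bin/sh
# to_bytes SIZE: convert a Docker-style size (optional b/k/m/g suffix, binary
# units, whole numbers only) into bytes.
to_bytes() {
    num=${1%[bkmgBKMG]}
    case $1 in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$num" ;;
    esac
}

to_bytes 256m    # the limit from the first memory example
to_bytes 512m    # the limit used for hardened-app
```

The two calls print 268435456 and 536870912; if the inspected value for hardened-app differs from the second number, the container was started without the intended limit.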

Docker Bench for Security

Docker Bench is an automated script that checks your Docker installation against the CIS Docker Benchmark — over 100 security best practices covering the host, daemon configuration, images, containers, and networking.

# Run Docker Bench for Security
docker run --rm --net host --pid host --userns host --cap-add audit_control \
  -e DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST \
  -v /etc:/etc:ro \
  -v /usr/bin/containerd:/usr/bin/containerd:ro \
  -v /usr/bin/runc:/usr/bin/runc:ro \
  -v /usr/lib/systemd:/usr/lib/systemd:ro \
  -v /var/lib:/var/lib:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  docker/docker-bench-security

# Example output categories:
# [PASS] 1.1  - Ensure a separate partition for containers exists
# [WARN] 2.1  - Ensure network traffic is restricted between containers
# [PASS] 4.1  - Ensure that a user for the container has been created
# [WARN] 4.5  - Ensure Content trust for Docker is Enabled
# [PASS] 5.1  - Ensure that AppArmor Profile is Set
# [WARN] 5.4  - Ensure that privileged containers are not used
# [PASS] 5.12 - Ensure that the container's root filesystem is read-only
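
To compare two benchmark runs, it helps to reduce each log to a count per result tag. The sketch below assumes the bracketed tags shown above ([PASS], [WARN]) plus [INFO] and [NOTE]; bench_summary is a hypothetical helper name.

```shell
#!/bin/sh
# bench_summary: read Docker Bench output on stdin and print how many lines
# carry each result tag.
bench_summary() {
    awk '
        /\[PASS\]/ { pass++ }
        /\[WARN\]/ { warn++ }
        /\[INFO\]/ { info++ }
        /\[NOTE\]/ { note++ }
        END { printf "PASS=%d WARN=%d INFO=%d NOTE=%d\n", pass, warn, info, note }
    '
}

# Three lines taken from the example output above
bench_summary <<'EOF'
[PASS] 1.1  - Ensure a separate partition for containers exists
[WARN] 2.1  - Ensure network traffic is restricted between containers
[WARN] 4.5  - Ensure Content trust for Docker is Enabled
EOF
```

Saving each run to a file and piping it through bench_summary before and after your fixes gives a quick view of whether the WARN count went down.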

Runtime Security Tools

Static hardening (seccomp, AppArmor, capabilities) defines what containers can do. Runtime security tools detect anomalous behaviour — activity that's technically allowed but suspicious, like spawning a shell in a container that should only serve HTTP.

Real-World Tool
Falco — Cloud-Native Runtime Security

Falco (a CNCF Graduated project) monitors system calls in real time using eBPF and alerts on suspicious behaviour:

  • Shell spawned in a container (bash, sh)
  • Unexpected network connections (reverse shells, C2 traffic)
  • Sensitive file access (/etc/shadow, /proc/kcore)
  • Binary modification or file creation in non-tmp directories
  • Package manager execution (apt, yum) inside running containers
# Example Falco rule
- rule: Shell Spawned in Container
  desc: Detect shell execution in a container
  condition: >
    spawned_process and container and
    proc.name in (bash, sh, zsh, dash, csh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
     shell=%proc.name parent=%proc.pname)
  priority: WARNING
  tags: [container, shell]

Tool | Mechanism | Strengths
Falco | eBPF/kernel module syscall monitoring | CNCF project, rich rule language, low overhead
Sysdig Secure | eBPF + commercial platform | Commercial Falco + compliance + forensics
Tracee | eBPF (by Aqua Security) | Event-based, good for container forensics
Tetragon | eBPF (by Cilium/Isovalent) | Enforcement (kill processes), low overhead

Exercises

Exercise 1: Run nginx:alpine with --cap-drop ALL. It will fail at startup; read the error to see which privileged operation was denied. Add back the minimum capabilities needed to make it work (NET_BIND_SERVICE covers the port-80 bind; the image's root start-up steps may need others). Then try running on port 8080 instead: does it still need NET_BIND_SERVICE?
Exercise 2: Create a custom seccomp profile that only allows the syscalls your application actually uses. Use strace to identify the syscall set, then test your container with the restrictive profile.
Exercise 3: Run a container with --read-only and --security-opt no-new-privileges. Use docker exec to attempt: writing files, installing packages (apt-get), running setuid binaries. Document which operations fail and why.
Exercise 4: Install and run Docker Bench for Security against your Docker host. Identify the top 5 WARN items and fix them. Re-run the benchmark and compare your score.

Conclusion & Next Steps

Runtime security transforms containers from convenient isolation into genuine security boundaries through layered defences:

  • Capabilities — Drop ALL, add back only what's needed (usually just NET_BIND_SERVICE)
  • Seccomp — Filter syscalls to a minimal allow-list, blocking kernel exploit vectors
  • AppArmor/SELinux — Mandatory access control preventing unauthorized file and network access
  • Read-only FS — Immutable containers that cannot be modified post-deployment
  • Rootless mode — Even container escapes land in an unprivileged user namespace
  • Runtime detection — Falco and eBPF tools catch anomalous behaviour in real-time

With images secured (Part 15) and runtime hardened (this article), the remaining attack vector is the supply chain itself — the provenance of your dependencies and the handling of secrets.

Next in the Series

In Part 17: Supply Chain Security & Secrets, we'll secure the software supply chain with SBOMs, provenance attestations, image signing workflows, and safe secret management patterns — ensuring that every component in your container has a verified origin and that credentials never leak into image layers.