Part 21: Container Troubleshooting

Troubleshooting Mindset

Effective container debugging follows a consistent methodology. The most common mistake is jumping to random fixes before understanding the problem. Instead, follow this sequence: Observe → Hypothesize → Test → Fix → Verify.

                            
                            The Golden Rule: Always start with logs. docker logs <container> answers 80% of container issues within 10 seconds. The remaining 20% require the advanced tools in this article.
                        

Diagnostic Decision Tree

Container Troubleshooting Decision Tree

flowchart TD
    START["Container Issue"] --> Q1{"Container running?"}
    Q1 -->|No| Q2{"Ever started?"}
    Q1 -->|Yes| Q3{"Responding?"}

    Q2 -->|"Never started"| A1["Check: docker logs
Image exists? Ports free?
Volumes valid?"]
    Q2 -->|"Started then died"| A2["Check: Exit code
docker inspect State
OOM? Crash?"]

    Q3 -->|"Not responding"| Q4{"Health check?"}
    Q3 -->|"Slow response"| A5["Performance debug:
CPU throttle? I/O wait?
Memory pressure?"]

    Q4 -->|"Failing"| A3["App-level issue:
docker exec to test
Check dependencies"]
    Q4 -->|"No health check"| A4["Network issue:
Port mapping? DNS?
Firewall rules?"]

    A2 --> Q5{"Exit code?"}
    Q5 -->|"137"| OOM["SIGKILL, often OOM kill
Check OOMKilled, then memory limit"]
    Q5 -->|"139"| SEG["Segfault
Check binary/deps"]
    Q5 -->|"1"| APP["App error
Check logs"]
    Q5 -->|"143"| SIG["SIGTERM
Graceful shutdown"]
    Q5 -->|"0"| DONE["Normal exit
Check CMD/entrypoint"]

    style START fill:#f8f9fa,stroke:#132440
    style OOM fill:#fff5f5,stroke:#BF092F
    style SEG fill:#fff5f5,stroke:#BF092F

Container Won't Start

When docker run or docker start fails immediately, the container never reaches "running" state. Common causes and their diagnostics:

# Step 1: Check what happened
docker ps -a --filter "name=myapp"
# CONTAINER ID  IMAGE       COMMAND     CREATED        STATUS                    NAMES
# abc123def456  myapp:1.0   "/start"    2 minutes ago  Created                   myapp
# (Status "Created" means it was never started successfully)

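# Step 2: Read the error
docker logs myapp
# /start: no such file or directory
# (The entrypoint binary doesn't exist in the image)
# If the logs are empty because the process never ran, ask Docker for the start error:
docker inspect --format '{{.State.Error}}' myapp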

# Step 3: Alternative — check events for the error
docker events --since "5m" --filter container=myapp
# container create abc123 (image=myapp:1.0, name=myapp)
# container die abc123 (exitCode=127)

# Common "won't start" causes and fixes:

# 1. Image not found
docker run nonexistent-image:latest
# Unable to find image 'nonexistent-image:latest' locally
# Error response from daemon: pull access denied
# FIX: Check image name/tag, login to registry

# 2. Port already in use
docker run -p 80:80 nginx
# Error response from daemon: driver failed programming external connectivity:
# Bind for 0.0.0.0:80 failed: port is already allocated
# FIX: Use different port or stop conflicting container
docker ps --filter "publish=80"  # Find what's using port 80
lsof -i :80                      # Or check host processes

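# 3. Bind-mount source path doesn't exist
docker run --mount type=bind,source=/nonexistent/path,target=/data myapp
# Error response from daemon: invalid mount config for type "bind":
# bind source path does not exist: /nonexistent/path
# Note: -v /nonexistent/path:/data does not fail. Docker creates the missing
# host directory (owned by root), which can hide a typo in the path
# FIX: Create directory first, or use named volumes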

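# 4. Insufficient resources
docker run --memory=100g myapp
# Error response from daemon: cannot allocate memory
# (A limit above host RAM is normally accepted, since it is only a ceiling;
# this error usually means the host itself is short of memory at start time)
# FIX: Free host memory, and size --memory to what the app actually needs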

# 5. Invalid entrypoint/CMD
docker run myapp /bin/nonexistent
# exec: "/bin/nonexistent": stat /bin/nonexistent: no such file or directory
# FIX: Check Dockerfile CMD/ENTRYPOINT, verify binary exists in image
docker run -it --entrypoint /bin/sh myapp  # Override to debug

Crash Loops

A crash loop occurs when a container starts, runs briefly, then exits — and Docker's restart policy keeps restarting it. The container cycles between "starting" and "exited" indefinitely. Exit codes are your primary diagnostic tool:

Exit Code	Signal	Meaning	Common Cause
0	—	Normal exit (success)	CMD completed. Container isn't meant to be long-running, or foreground process ended.
1	—	Application error	Unhandled exception, missing config, dependency unavailable.
2	—	Shell misuse	Incorrect command syntax in entrypoint script.
126	—	Command not executable	Permission denied on entrypoint binary (missing +x).
127	—	Command not found	Binary doesn't exist in image (wrong PATH or missing install).
137	SIGKILL (9)	Killed by external signal	OOM kill, `docker kill`, or orchestrator termination.
139	SIGSEGV (11)	Segmentation fault	Binary crash, corrupt memory, wrong architecture (amd64 on arm64).
143	SIGTERM (15)	Graceful termination	`docker stop`, orchestrator rolling update. Normal if app handles SIGTERM.
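The pattern in the table (codes above 128 mean "killed by signal number code minus 128") is easy to encode. The following is a minimal sketch; the function name `explain_exit_code` and its wording are illustrative, not part of Docker.

```shell
# Translate a container exit code into a short diagnosis.
# Convention: a code above 128 means the process died from signal (code - 128).
explain_exit_code() {
    code=$1
    case "$code" in
        0)   echo "normal exit" ;;
        126) echo "command not executable" ;;
        127) echo "command not found" ;;
        *)
            if [ "$code" -gt 128 ]; then
                echo "killed by signal $((code - 128))"
            else
                echo "application-defined error"
            fi
            ;;
    esac
}

# Usage (assumes a container named "app"):
#   explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' app)"
```

The signal number can then be looked up with `kill -l`, for example `kill -l 9` prints `KILL`.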

# Identify a crash loop
docker ps -a --filter "name=app"
# STATUS: Restarting (1) 2 seconds ago   ← Restarting = crash loop

# Get the exit code
docker inspect --format '{{.State.ExitCode}}' app
# 137

# Check if OOM killed
docker inspect --format '{{.State.OOMKilled}}' app
# true  ← Memory limit exceeded

# View restart count
docker inspect --format '{{.RestartCount}}' app
# 47  ← Restarted 47 times

# View the last crash logs (even for a restarting container)
docker logs app --tail 50
# Last 50 lines before the crash

# Debugging strategy for crash loops:
# 1. Override entrypoint to keep container alive for inspection
docker run -it --entrypoint /bin/sh myapp:latest
# Now you're inside the container — check files, env, deps

# 2. Keep a detached copy alive, then exec into it
docker run -d --name app-debug --entrypoint /bin/sh myapp -c "sleep 3600"
docker exec -it app-debug /bin/sh
# Container stays alive for 1 hour; clean up with: docker rm -f app-debug

# 3. Check if it's a dependency issue (database not ready)
docker logs app 2>&1 | grep -i "connection\|timeout\|refused"
# Error: Connection refused to postgres:5432
# FIX: Add health check dependency, retry logic, or init container
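The "retry logic" fix for a dependency that is not ready yet can live in the entrypoint script. Below is a minimal sketch of a retry helper with exponential backoff; the name `retry`, its argument order, and the `nc -z` probe in the usage comment are assumptions for illustration.

```shell
# retry ATTEMPTS DELAY COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Sleeps DELAY seconds after the first failure and doubles the delay each time.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
            delay=$((delay * 2))
        fi
        n=$((n + 1))
    done
    return 1
}

# Usage in an entrypoint script (hypothetical host and port, requires nc in the image):
#   retry 5 1 nc -z postgres 5432 || { echo "database not reachable" >&2; exit 1; }
```

Exiting non-zero after the last attempt keeps the failure visible in `docker logs` and in the exit code, instead of hanging forever.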

OOM Kills

When a container exceeds its memory limit, the Linux kernel's OOM (Out of Memory) killer terminates the process with SIGKILL (exit code 137). This failure mode is especially dangerous because the application gets no chance to shut down gracefully: in-flight requests are dropped and unflushed data can be lost.

# Confirm OOM kill via Docker
docker inspect app --format '{{.State.OOMKilled}}'
# true

docker inspect app --format '{{json .State}}' | jq '{Status, ExitCode, OOMKilled, FinishedAt}'
# {
#   "Status": "exited",
#   "ExitCode": 137,
#   "OOMKilled": true,
#   "FinishedAt": "2026-05-14T10:30:00.123456789Z"
# }

# Confirm via kernel logs (dmesg)
dmesg | grep -i "oom\|killed" | tail -10
# [1234567.890] Memory cgroup out of memory: Killed process 12345 (node)
#    total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB, shmem-rss:0kB
# [1234567.891] oom_reaper: reaped process 12345 (node), now anon-rss:0kB

# Check current memory usage vs limit
docker stats app --no-stream --format "{{.MemUsage}}"
# 245MiB / 256MiB  ← Almost at limit, OOM imminent

# View memory limit from container config
docker inspect --format '{{.HostConfig.Memory}}' app
# 268435456  (bytes = 256 MiB)

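# Check cgroup memory events (how many OOM kills occurred)
# Path assumes cgroup v2 with the systemd cgroup driver; with the cgroupfs
# driver, look under /sys/fs/cgroup/docker/<container-id>/ instead
CONTAINER_ID=$(docker inspect --format '{{.Id}}' app)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events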
# low 0
# high 0
# max 15        ← Hit memory.max limit 15 times
# oom 3         ← Kernel OOM killed 3 times
# oom_kill 3    ← Confirmed 3 OOM kills

# Solutions:
# 1. Increase memory limit (if app genuinely needs more)
docker update --memory=512m --memory-swap=512m app

# 2. Find the memory leak (if usage grows unbounded)
# Run with relaxed limits and monitor growth over time
docker run -d --name app --memory=2g myapp
docker stats app  # Watch memory climb over hours

# 3. Set memory-swap equal to memory (disable swap, cleaner OOM)
docker run --memory=256m --memory-swap=256m myapp
# Without this, container can swap to disk, causing slowness before OOM
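For alerting or scripted checks, the `oom_kill` counter in `memory.events` is easier to consume than the raw file. A minimal sketch, assuming the cgroup v2 file format shown above; the function name `oom_kill_count` is illustrative.

```shell
# Print the oom_kill counter from a cgroup v2 memory.events file read on stdin.
# Prints 0 when the key is absent (for example, on an empty file).
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2; found = 1 }
         END { if (!found) print 0 }'
}

# Usage (path depends on the cgroup driver, as noted above):
#   oom_kill_count < /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events
```

The counter is cumulative for the lifetime of the cgroup, so compare two readings to see whether kills are still happening.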

                            
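Silent OOM Kills: If OOMKilled is false but the exit code is still 137, do not assume the container's memory limit was the cause. Exit code 137 only says the process received SIGKILL: the sender may be docker kill, docker stop after its grace period expired, an orchestrator, or the kernel OOM killer acting on a child process or on the host as a whole. Check dmesg for the full picture.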
                        

Networking Failures

Container networking issues fall into three categories: can't reach the internet, can't reach other containers, or external clients can't reach the container. Debug systematically from inside out:

# === Step 1: Can the container reach the internet? ===
docker exec app ping -c 3 8.8.8.8
# If FAILS → network connectivity issue (bridge config, iptables)
# If WORKS → DNS or application-level issue

# === Step 2: DNS resolution working? ===
docker exec app nslookup google.com
# If FAILS → DNS configuration issue
docker exec app cat /etc/resolv.conf
# nameserver 127.0.0.11  ← Docker's embedded DNS (expected on user-defined networks;
#                          the default bridge copies the host's resolvers instead)

# Check Docker DNS is working
docker exec app nslookup other-container
# If FAILS for container names → containers not on the same user-defined network
# (the default bridge network does not resolve container names)

# === Step 3: Can containers reach each other? ===
# Verify both containers are on the same user-defined Docker network
docker network inspect my-network --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
# nginx 172.18.0.2/16
# app   172.18.0.3/16

# Test connectivity between containers
docker exec app ping -c 3 172.18.0.2
docker exec app curl -s http://nginx:80/

# === Step 4: Port mapping from host? ===
# Verify port binding
docker port app
# 3000/tcp -> 0.0.0.0:3000

# Test from host
curl -v http://localhost:3000/
# If FAILS → app not listening on correct interface inside container

# Common mistake: app listens on 127.0.0.1 inside container
docker exec app ss -tlnp
# LISTEN  0  128  127.0.0.1:3000  *:*   ← WRONG: bound to localhost only
# LISTEN  0  128  0.0.0.0:3000    *:*   ← CORRECT: bound to all interfaces

# === Step 5: iptables interference? ===
# Docker manages iptables rules for port forwarding
sudo iptables -t nat -L DOCKER -n --line-numbers
# Check that DNAT rules exist for published ports

# === Step 6: Docker network driver issues ===
# Rule out a broken network by moving the container to a fresh user-defined bridge
docker network prune  # Remove unused user-defined networks (the default bridge is kept)
docker network create --driver bridge my-network
docker run -d --network my-network --name app myapp
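The loopback-bind mistake from Step 4 is easy to miss in long `ss` output. The sketch below filters for it; it assumes the column layout of `ss -tln` (state in column 1, local address in column 4), and the function name is illustrative.

```shell
# Read `ss -tln` output on stdin and print listeners bound only to loopback.
# Such sockets are unreachable through a published port.
loopback_only_listeners() {
    awk '$1 == "LISTEN" && ($4 ~ /^127\./ || $4 ~ /^\[::1\]/) { print $4 }'
}

# Usage (assumes ss exists in the image):
#   docker exec app ss -tln | loopback_only_listeners
```

Empty output means every listener is bound to a routable or wildcard address.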

Filesystem Issues

Filesystem problems manifest as permission denied errors, read-only filesystem errors, or "no space left on device" — even when the host has plenty of space:

# === Read-only filesystem ===
docker exec app touch /test
# touch: cannot touch '/test': Read-only file system

# Cause 1: Container started with --read-only flag
docker inspect --format '{{.HostConfig.ReadonlyRootfs}}' app
# true  ← Intentional security hardening
# FIX: Write to tmpfs mounts (/tmp, /var/run) or designated writable volumes

# Cause 2: OverlayFS layer issue
MERGED=$(docker inspect --format '{{.GraphDriver.Data.MergedDir}}' app)
sudo ls -la "$MERGED"
# Check if layers are intact

# === No space left on device ===
docker exec app df -h /
# Filesystem      Size  Used  Avail  Use%  Mounted on
# overlay         50G   50G   0      100%  /

# Check Docker disk usage
docker system df
# TYPE           TOTAL   ACTIVE  SIZE     RECLAIMABLE
# Images         45      12      12.5GB   8.3GB (66%)
# Containers     15      8       2.1GB    1.5GB (71%)
# Local Volumes  23      8       5.4GB    3.2GB (59%)
# Build Cache    0       0       3.8GB    3.8GB

# Clean up unused resources
docker system prune -a --volumes
# WARNING! This will remove all stopped containers, unused networks,
# unused images, and unused volumes.

# === Permission denied in volume mounts ===
docker run -d --name app -v /host/data:/app/data myapp
docker exec app ls -la /app/data/
# ls: cannot open directory '/app/data/': Permission denied

# Cause: UID mismatch between host and container
ls -la /host/data/
# drwx------ 2 root root 4096 ...  ← Owned by root (UID 0), no access for other users
docker exec app id
# uid=1000(appuser) gid=1000(appuser)  ← Container runs as UID 1000

# FIX: Match UIDs
# Option 1: Change host directory ownership
sudo chown -R 1000:1000 /host/data/

# Option 2: Run container with matching UID
docker run --user $(id -u):$(id -g) -v /host/data:/app/data myapp

# Option 3: Use named volumes (Docker manages permissions)
docker run -v app-data:/app/data myapp
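The UID comparison above can be scripted. This is a rough sketch that takes the text output of `ls -ldn` and `id` as arguments; the function name `check_volume_uid` and its messages are illustrative. A mismatch is only a hint, since group membership and mode bits also decide access.

```shell
# check_volume_uid LS_LINE ID_LINE
#   LS_LINE: output of `ls -ldn DIR` on the host (numeric owner in column 3)
#   ID_LINE: output of `id` inside the container
check_volume_uid() {
    dir_uid=$(printf '%s\n' "$1" | awk '{ print $3 }')
    proc_uid=$(printf '%s\n' "$2" | sed -n 's/^uid=\([0-9][0-9]*\).*/\1/p')
    if [ "$dir_uid" = "$proc_uid" ]; then
        echo "ok: directory and process both use UID $proc_uid"
    else
        echo "mismatch: directory owned by UID $dir_uid, process runs as UID $proc_uid"
    fi
}

# Usage (same paths and container name as in the example above):
#   check_volume_uid "$(ls -ldn /host/data)" "$(docker exec app id)"
```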

Advanced Debugging with docker exec

docker exec is your first-line debugging tool — running commands inside a container's namespaces. But many production images lack debugging tools. Here's how to work around that:

# Basic debugging inside a running container
docker exec -it app /bin/sh           # Get a shell
docker exec app cat /proc/1/status    # Check PID 1 details
docker exec app env                   # View environment variables
docker exec app cat /etc/hosts        # DNS overrides
docker exec app ls -la /proc/1/fd/    # Open file descriptors

# Problem: Distroless/minimal images have no shell or tools
docker exec -it app /bin/sh
# OCI runtime exec failed: exec failed: unable to start container process:
# exec: "/bin/sh": stat /bin/sh: no such file or directory

# Solution 1: Copy a statically linked busybox into the container
docker cp /usr/bin/busybox app:/tmp/busybox
docker exec -it app /tmp/busybox sh

# Solution 2: Use Docker's debug feature (Docker Desktop 4.27+)
docker debug app
# Attaches a debug shell with common tools pre-installed
# Works even on distroless images (injects a toolbox)

# Solution 3: Install tools temporarily (if package manager exists)
# Wrap the commands in sh -c; otherwise everything after && runs on the host
docker exec app sh -c 'apt-get update && apt-get install -y curl net-tools procps'
# WARNING: Changes are lost when the container is recreated. Only for debugging.

# Useful one-liners for debugging inside containers:
docker exec app cat /proc/net/tcp          # Active TCP connections
docker exec app cat /proc/meminfo          # Memory details
docker exec app cat /proc/1/cgroup         # Cgroup membership
docker exec app cat /proc/1/mountinfo      # Mount table
docker exec app find / -name "*.log" 2>/dev/null  # Find log files

nsenter — Entering Container Namespaces from Host

When docker exec is not an option (no shell in the image, or the Docker daemon is unresponsive), nsenter lets you enter a running container's namespaces directly from the host. It operates at the kernel level, bypassing Docker entirely:

# Get the container's PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' app)
echo $PID  # e.g., 12345

# Enter ALL namespaces of the container (equivalent to docker exec)
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh
# -t PID     Target process
# -m         Mount namespace (see container's filesystem)
# -u         UTS namespace (container's hostname)
# -i         IPC namespace
# -n         Network namespace (container's network stack)
# -p         PID namespace (see container's process tree)

# Enter ONLY the network namespace (useful for network debugging)
sudo nsenter -t $PID -n -- ip addr show
# Shows network interfaces as the container sees them
sudo nsenter -t $PID -n -- ss -tlnp
# Shows listening ports inside the container
sudo nsenter -t $PID -n -- ping 8.8.8.8
# Test connectivity from the container's perspective

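# Enter ONLY the mount namespace (inspect container's filesystem)
# Binaries are resolved inside the container here, so the image must ship ls and cat
sudo nsenter -t $PID -m -- ls /app/config/
sudo nsenter -t $PID -m -- cat /etc/resolv.conf
# For images without those tools, read the files with host tools instead
sudo ls /proc/$PID/root/app/config/

# Enter the PID and mount namespaces (see container's process tree)
# -p alone is not enough: ps reads /proc, which comes from the mount namespace
sudo nsenter -t $PID -p -m -- ps aux
# PID  USER  TIME  COMMAND
# 1    root  0:05  node server.js
# 15   root  0:00  ps aux

# Why nsenter over docker exec?
# 1. Works when Docker daemon is unresponsive (find the PID with ps or pgrep on the host)
# 2. Works without a shell in the image, as long as you skip -m and use host binaries
# 3. Can enter individual namespaces (not all-or-nothing)
# 4. Has access to host tools (strace, tcpdump, perf)
# Limitation: nsenter needs a live process, so it cannot enter a stopped container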
                            
                            Security Note: nsenter requires root on the host and effectively gives you full access to the container's isolated environment. In production Kubernetes clusters, use kubectl debug (ephemeral containers) instead of SSH-ing into nodes to run nsenter.
                        

strace — System Call Tracing

strace intercepts and records every system call a process makes. When a container process hangs, crashes without useful logs, or behaves unexpectedly, strace reveals exactly what it's doing at the kernel level:

# Get the container's main PID
PID=$(docker inspect --format '{{.State.Pid}}' app)

# Trace all system calls of the container's main process
sudo strace -p $PID -f -tt
# -p PID   Attach to running process
# -f       Follow forked child processes
# -tt      Print microsecond timestamps

# Output example:
# 10:30:01.123456 read(5, "GET / HTTP/1.1\r\n", 4096) = 16
# 10:30:01.123500 write(5, "HTTP/1.1 200 OK\r\n", 17) = 17
# 10:30:01.123550 epoll_wait(3, [{EPOLLIN, {u32=5}}], 128, 5000) = 1

# Filter for specific syscall categories:
# Network operations only
sudo strace -p $PID -f -e trace=network -tt
# connect(5, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.0.5")}, 16) = -1 ECONNREFUSED (Connection refused)
# ← Shows exactly which connection is failing and why

# File operations only
sudo strace -p $PID -f -e trace=file -tt
# openat(AT_FDCWD, "/app/config/database.yml", O_RDONLY) = -1 ENOENT (No such file or directory)
# ← Shows what files the app is trying to read

# Process operations only
sudo strace -p $PID -f -e trace=process

# Common discoveries via strace:
# 1. "Permission denied" — which file? → strace shows the exact path
# 2. "Connection refused" — to where? → strace shows IP:port
# 3. "Process hangs" — on what? → strace shows it's blocked on read/poll/futex
# 4. "Slow startup" — why? → strace shows DNS lookups or file scans taking seconds

# Save trace to file for analysis
sudo strace -p $PID -f -tt -o /tmp/app-trace.log
# Then search: grep -i "ENOENT\|EACCES\|ECONNREFUSED" /tmp/app-trace.log
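On a long trace, a count of failing calls per error code is often a faster starting point than grep. A minimal sketch, assuming strace's usual `= -1 ERRNO (message)` line ending; the function name is illustrative.

```shell
# Read strace output on stdin and print "ERRNO COUNT" pairs, most frequent first.
summarize_errnos() {
    sed -n 's/.* = -1 \(E[A-Z0-9]*\).*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $2, $1 }'
}

# Usage, with the trace file saved above:
#   summarize_errnos < /tmp/app-trace.log
```

A handful of ENOENT results is normal during startup (library and config search paths); a steadily growing ECONNREFUSED or EACCES count is the more telling signal.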

tcpdump — Network Packet Capture

tcpdump captures raw network packets, letting you see exactly what data flows in and out of a container. Essential for debugging API failures, TLS issues, DNS problems, and connection timeouts:

# Method 1: Capture from inside container's network namespace
PID=$(docker inspect --format '{{.State.Pid}}' app)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n -c 50
# Captures 50 packets from the container's perspective

# Method 2: Find container's veth pair on host and capture there
# Step 1: Get container's interface index
docker exec app cat /sys/class/net/eth0/iflink
# 15  (this is the ifindex of the host-side veth)

# Step 2: Find matching veth on host
ip link | grep "^15:"
# 15: veth1a2b3c4@if14: 

# Step 3: Capture on that interface
sudo tcpdump -i veth1a2b3c4 -n -c 100

# Useful capture filters:
# All HTTP traffic to/from the container
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n port 80 or port 443

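# DNS queries (why is name resolution failing?)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n port 53
# 10:30:01.123 IP 172.17.0.3.54321 > 8.8.8.8.53: A? database.internal
# (eth0 carries queries to upstream resolvers; lookups sent to Docker's
# embedded DNS at 127.0.0.11 stay on loopback and will not show up here)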

# TCP connection attempts (why is connection refused?)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n "tcp[tcpflags] & (tcp-syn|tcp-rst) != 0"
# Shows SYN (connection attempts) and RST (rejections)

# Save capture to file for Wireshark analysis
sudo nsenter -t $PID -n -- tcpdump -i eth0 -w /tmp/container-traffic.pcap -c 1000
# Open in Wireshark: wireshark /tmp/container-traffic.pcap

# Capture with ASCII output (see HTTP bodies)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -A port 80 -c 20
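Steps 1 and 2 of Method 2 can be combined into one lookup. The sketch below reads `ip -o link` output (one interface per line) and prints the name whose index matches; the function name `veth_for_ifindex` is illustrative.

```shell
# veth_for_ifindex INDEX
# Reads `ip -o link` output on stdin and prints the interface name with that index,
# without the "@ifN" peer suffix.
veth_for_ifindex() {
    awk -F': ' -v idx="$1" '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Usage (assumes a container named "app" with an eth0 interface):
#   ip -o link | veth_for_ifindex "$(docker exec app cat /sys/class/net/eth0/iflink)"
```

The result can be passed straight to `tcpdump -i` on the host.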

Docker Debug Container Pattern

Instead of installing tools in production containers, use an ephemeral debug container that shares the target container's namespaces. The debug container has all the tools; the target container stays lean:

# nicolaka/netshoot — the Swiss Army knife of container debugging
# Contains: curl, ping, dig, nslookup, tcpdump, ip, iptables, ss, netstat,
#           strace, ltrace, perf, drill, mtr, iperf3, and 50+ more tools

# Share the target container's NETWORK namespace
docker run -it --rm \
    --network container:app \
    nicolaka/netshoot

# Inside netshoot, you see app's network stack:
ip addr show          # app's interfaces
ss -tlnp              # app's listening ports
curl localhost:3000   # access app's ports via localhost
dig database.internal # resolve DNS from app's perspective
tcpdump -i eth0      # capture app's traffic

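# Share BOTH network and PID namespace
# (SYS_PTRACE is needed to strace processes that belong to another container)
docker run -it --rm \
    --network container:app \
    --pid container:app \
    --cap-add SYS_PTRACE \
    nicolaka/netshoot

# Now you can also see app's processes:
ps aux                # See all processes in app container
strace -p 1          # Trace app's PID 1

# For filesystem access too, add app's volumes and reach its root filesystem via /proc
docker run -it --rm \
    --network container:app \
    --pid container:app \
    --cap-add SYS_PTRACE \
    --volumes-from app \
    nicolaka/netshoot

# Access app's files:
# (--volumes-from mounts only app's volumes; the image filesystem is visible
# through /proc/1/root because the PID namespace is shared)
ls /proc/1/root/app/              # See app's application code
cat /proc/1/root/app/config.yml   # Read config files
tr '\0' '\n' < /proc/1/environ    # Read environment variables of PID 1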

                            
                            Production Best Practice: Keep a debug container image (like netshoot) pre-pulled on all production hosts. When an incident occurs, you can immediately launch it — no image pull delay during a crisis.
                        

Performance Debugging

When containers are slow but not crashing, the problem is usually CPU throttling, I/O saturation, or memory pressure. These issues are invisible without the right tools:

# === CPU Throttling Detection ===
# Check if container is being throttled by CFS scheduler
CONTAINER_ID=$(docker inspect --format '{{.Id}}' app)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
# nr_periods 100000      # Total scheduling periods
# nr_throttled 15000     # Periods where container was throttled
# throttled_usec 30000000  # Total microseconds throttled (30 seconds!)

# Throttle ratio (should be < 5% for healthy containers)
# throttled_ratio = nr_throttled / nr_periods = 15000/100000 = 15% ← TOO HIGH

# FIX: Increase CPU limits
docker update --cpus=2.0 app  # Allow 2 full CPU cores

# === Memory Pressure ===
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.pressure
# some avg10=25.00 avg60=15.50 avg300=8.20 total=567890000
# full avg10=5.00 avg60=2.00 avg300=0.80 total=123456000
# "some" > 10% means processes are waiting for memory reclaim

# === I/O Saturation ===
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/io.pressure
# some avg10=45.00 avg60=30.20 avg300=12.80 total=89012345
# ← 45% of time at least one task is waiting for I/O → disk bottleneck

# === Process-level investigation from host ===
PID=$(docker inspect --format '{{.State.Pid}}' app)

# CPU usage per thread inside the container
sudo top -H -p $PID
# Shows which threads are consuming CPU

# What is the process doing? (requires perf tools)
sudo perf top -p $PID
# Real-time view of which functions consume CPU

# I/O operations per process
sudo pidstat -d -p $PID 1 5
# Shows read/write bytes per second for the container process
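The throttle ratio above can be computed directly from `cpu.stat` rather than by hand. A minimal sketch, assuming the cgroup v2 key names shown earlier; the function name is illustrative.

```shell
# Read a cgroup v2 cpu.stat file on stdin and print the percentage of
# scheduling periods in which the cgroup was throttled.
throttle_percent() {
    awk '$1 == "nr_periods"   { periods = $2 }
         $1 == "nr_throttled" { throttled = $2 }
         END {
             if (periods > 0) printf "%.1f\n", 100 * throttled / periods
             else print "0.0"
         }'
}

# Usage (path depends on the cgroup driver):
#   throttle_percent < /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
```

The counters are cumulative since the container started, so take two readings a minute apart and compare the deltas when you want the current rate.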

Debugging Technique

The "CPU Usage is Low but App is Slow" Mystery

A container shows modest average CPU usage, yet response times are far above normal. The paradox resolves when you understand CFS scheduling:

Container has --cpus=0.5 (50% of one core)
In each 100ms CFS period, it gets 50ms of CPU time
If a request arrives at the start of a period and needs 80ms of CPU work, it completes in 130ms wall-clock time (50ms running + 50ms throttled + 30ms running)
docker stats shows at most 50% CPU, which looks "normal", but latency has grown by more than 60%

Diagnostic: Check nr_throttled in cpu.stat. If throttling is frequent, either increase CPU limits or optimize the hot path.
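The throttling arithmetic can be checked with a few lines of shell. This is a simplified model, assuming a single-threaded request that arrives exactly at the start of a CFS period and has the container's quota to itself; the function name and parameters are illustrative.

```shell
# cfs_wall_clock_ms WORK_MS QUOTA_MS PERIOD_MS
# Wall-clock time to finish WORK_MS of CPU work when only QUOTA_MS of CPU time
# is available in every PERIOD_MS window. WORK_MS must be at least 1.
cfs_wall_clock_ms() {
    work=$1
    quota=$2
    period=$3
    # Number of periods whose whole quota is used up before the final burst
    full_periods=$(( (work - 1) / quota ))
    echo $(( full_periods * period + work - full_periods * quota ))
}

# Usage:
#   cfs_wall_clock_ms 80 50 100
```

For 80ms of work under a 50ms quota per 100ms period, the model gives 130ms: 50ms running, 50ms throttled, then the remaining 30ms. A multi-threaded process burns the same quota faster, so real applications can be throttled even earlier in each period.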


Common Issues Quick Reference

Symptom	Likely Cause	Diagnostic Command	Fix
Exit code 137	OOM kill	`docker inspect --format '{{.State.OOMKilled}}'`	Increase `--memory`
Exit code 139	Segfault	`dmesg | grep segfault`	Check binary architecture, rebuild
Exit code 127	Binary not found	`docker run --entrypoint sh image -c "which cmd"`	Fix CMD/ENTRYPOINT path
"Permission denied"	UID mismatch	`docker exec app id; ls -la /path`	Match UIDs or use named volumes
"No space left"	Overlay full	`docker system df`	`docker system prune`
"Connection refused"	Wrong bind address	`docker exec app ss -tlnp`	Bind to 0.0.0.0 not 127.0.0.1
"Name resolution failed"	DNS misconfiguration	`docker exec app cat /etc/resolv.conf`	Check Docker DNS, network driver
Port not accessible	Port not published	`docker port container`	Add `-p host:container`
Container extremely slow	CPU throttling	`grep throttled cpu.stat`	Increase `--cpus`
Fork or thread creation fails under load	PID limit reached	`cat pids.current; cat pids.max`	Increase `--pids-limit`
Volume data missing	Anonymous volume	`docker inspect -f '{{.Mounts}}'`	Use named volumes
Health check failing	App not ready on expected port	`docker exec app curl localhost:PORT`	Check app startup, add start_period
Intermittent network drops	MTU mismatch	`docker exec app ip link show eth0`	Create the network with `--opt com.docker.network.driver.mtu=1400`
Container can't reach host	Bridge isolation	`docker exec app ping host.docker.internal`	Use `--add-host=host.docker.internal:host-gateway` or host network
Logs missing after container is recreated	json-file logs are deleted with the container	`docker inspect --format '{{.LogPath}}'`	Use log aggregation (Fluent Bit)

Exercises

                            
                            Exercise 1: Create a container that deliberately crashes with each exit code (0, 1, 126, 127, 137, 139). For each, use docker inspect to confirm the exit code and diagnose what happened. For exit 137, trigger a real OOM kill by running a memory-eating process inside a container with --memory=50m.
                        

                            
                            Exercise 2: Set up two containers on the same Docker network. Deliberately break connectivity (wrong network, DNS misconfiguration, firewall rule). Use the systematic debugging approach from the Networking Failures section to identify and fix each issue.
                        

                            
                            Exercise 3: Run a web server container, then use nsenter and tcpdump to capture HTTP traffic flowing through it. Identify the container's veth pair on the host and capture traffic from the host side. Compare both captures.
                        

                            
                            Exercise 4: Deploy nicolaka/netshoot as a debug container sharing the network namespace of a running application. Use it to diagnose a simulated DNS failure (override /etc/resolv.conf with invalid nameservers).
                        

Conclusion & Next Steps

Container troubleshooting is a skill that combines systematic methodology with deep Linux knowledge. The toolkit we've built in this article escalates from simple to advanced:

docker logs — 80% of issues (application errors, missing config)
docker inspect — Exit codes, OOM flags, mount points, network config
docker exec — Interactive debugging inside running containers
nsenter — Enter specific namespaces when Docker isn't cooperating
strace — See exactly what system calls a stuck process is making
tcpdump — Capture and analyze network traffic at the packet level
Debug containers — Full toolbox without polluting production images

With monitoring (Part 20) telling you what's wrong and troubleshooting (this part) telling you why, you can handle any container incident. The final piece is scaling these practices to the enterprise — the subject of our concluding article.

Next in the Series

In Part 22: Enterprise Container Platforms, we'll scale containers to the enterprise — registry replication, access control, air-gapped deployments, policy enforcement, multi-architecture builds, and choosing between Docker Enterprise, OpenShift, and Rancher.
