Troubleshooting Mindset
Effective container debugging follows a consistent methodology. The most common mistake is jumping to random fixes before understanding the problem. Instead, follow this sequence: Observe → Hypothesize → Test → Fix → Verify.
Rule of thumb: docker logs <container> answers roughly 80% of container issues within seconds. The remaining 20% require the advanced tools in this article.
Diagnostic Decision Tree
flowchart TD
START["Container Issue"] --> Q1{"Container running?"}
Q1 -->|No| Q2{"Ever started?"}
Q1 -->|Yes| Q3{"Responding?"}
Q2 -->|"Never started"| A1["Check: docker logs
Image exists? Ports free?
Volumes valid?"]
Q2 -->|"Started then died"| A2["Check: Exit code
docker inspect State
OOM? Crash?"]
Q3 -->|"Not responding"| Q4{"Health check?"}
Q3 -->|"Slow response"| A5["Performance debug:
CPU throttle? I/O wait?
Memory pressure?"]
Q4 -->|"Failing"| A3["App-level issue:
docker exec to test
Check dependencies"]
Q4 -->|"No health check"| A4["Network issue:
Port mapping? DNS?
Firewall rules?"]
A2 --> Q5{"Exit code?"}
Q5 -->|"137"| OOM["OOM Kill
Increase memory limit"]
Q5 -->|"139"| SEG["Segfault
Check binary/deps"]
Q5 -->|"1"| APP["App error
Check logs"]
Q5 -->|"143"| SIG["SIGTERM
Graceful shutdown"]
Q5 -->|"0"| DONE["Normal exit
Check CMD/entrypoint"]
style START fill:#f8f9fa,stroke:#132440
style OOM fill:#fff5f5,stroke:#BF092F
style SEG fill:#fff5f5,stroke:#BF092F
Container Won't Start
When docker run or docker start fails immediately, the container never reaches "running" state. Common causes and their diagnostics:
# Step 1: Check what happened
docker ps -a --filter "name=myapp"
# CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES
# abc123def456 myapp:1.0 "/start" 2 minutes ago Created myapp
# (Status "Created" means it was never started successfully)
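# Step 2: Read the error (logs stay empty if the process never ran,
# so also check the start error recorded by the daemon)
docker logs myapp
docker inspect --format '{{.State.Error}}' myapp
# exec: "/start": stat /start: no such file or directory
# (The entrypoint binary doesn't exist in the image)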
# Step 3: Alternative — check events for the error
docker events --since "5m" --filter container=myapp
# container create abc123 (image=myapp:1.0, name=myapp)
# container die abc123 (exitCode=127)
# Common "won't start" causes and fixes:
# 1. Image not found
docker run nonexistent-image:latest
# Unable to find image 'nonexistent-image:latest' locally
# Error response from daemon: pull access denied
# FIX: Check image name/tag, login to registry
# 2. Port already in use
docker run -p 80:80 nginx
# Error response from daemon: driver failed programming external connectivity:
# Bind for 0.0.0.0:80 failed: port is already allocated
# FIX: Use different port or stop conflicting container
docker ps --filter "publish=80" # Find what's using port 80
lsof -i :80 # Or check host processes
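# 3. Bind-mount source path doesn't exist
docker run --mount type=bind,source=/nonexistent/path,target=/data myapp
# Error response from daemon: invalid mount config for type "bind":
# bind source path does not exist: /nonexistent/path
# (With -v, Docker creates the missing host directory instead of failing)
# FIX: Create directory first, or use named volumes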
# 4. Insufficient resources
docker run --memory=100g myapp
# Error response from daemon: cannot allocate memory
# FIX: Reduce memory request to available host memory
# 5. Invalid entrypoint/CMD
docker run myapp /bin/nonexistent
# exec: "/bin/nonexistent": stat /bin/nonexistent: no such file or directory
# FIX: Check Dockerfile CMD/ENTRYPOINT, verify binary exists in image
docker run -it --entrypoint /bin/sh myapp # Override to debug
Crash Loops
A crash loop occurs when a container starts, runs briefly, then exits — and Docker's restart policy keeps restarting it. The container cycles between "starting" and "exited" indefinitely. Exit codes are your primary diagnostic tool:
| Exit Code | Signal | Meaning | Common Cause |
|---|---|---|---|
| 0 | — | Normal exit (success) | CMD completed. Container isn't meant to be long-running, or foreground process ended. |
| 1 | — | Application error | Unhandled exception, missing config, dependency unavailable. |
| 2 | — | Shell misuse | Incorrect command syntax in entrypoint script. |
| 126 | — | Command not executable | Permission denied on entrypoint binary (missing +x). |
| 127 | — | Command not found | Binary doesn't exist in image (wrong PATH or missing install). |
| 137 | SIGKILL (9) | Killed by external signal | OOM kill, docker kill, or orchestrator termination. |
| 139 | SIGSEGV (11) | Segmentation fault | Binary crash, corrupt memory, wrong architecture (amd64 on arm64). |
| 143 | SIGTERM (15) | Graceful termination | docker stop, orchestrator rolling update. Normal if app handles SIGTERM. |
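The table follows a general shell convention: exit codes above 128 usually mean the process was killed by signal number (code - 128). As a minimal sketch, the hypothetical helper below (explain_exit_code is not a Docker command) turns a numeric exit code into a first guess. It assumes the argument is a plain integer.

```shell
# Hypothetical helper: map a container exit code to a likely meaning.
# Assumes $1 is a plain integer such as the value of .State.ExitCode.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "normal exit" ;;
    1)   echo "application error" ;;
    2)   echo "shell misuse" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        # Convention: 128 + N means "terminated by signal N".
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}
```

A possible use is `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' app)"`, which for exit code 137 reports signal 9 (SIGKILL). The result is only a hint; the logs and the OOMKilled flag still decide the diagnosis.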
# Identify a crash loop
docker ps -a --filter "name=app"
# STATUS: Restarting (1) 2 seconds ago ← Restarting = crash loop
# Get the exit code
docker inspect --format '{{.State.ExitCode}}' app
# 137
# Check if OOM killed
docker inspect --format '{{.State.OOMKilled}}' app
# true ← Memory limit exceeded
# View restart count
docker inspect --format '{{.RestartCount}}' app
# 47 ← Restarted 47 times
# View the last crash logs (even for a restarting container)
docker logs app --tail 50
# Last 50 lines before the crash
# Debugging strategy for crash loops:
# 1. Override entrypoint to keep container alive for inspection
docker run -it --entrypoint /bin/sh myapp:latest
# Now you're inside the container — check files, env, deps
# 2. Add sleep to see what's happening
docker run -d --name app-debug --entrypoint /bin/sh myapp -c "sleep 3600"
docker exec -it app-debug /bin/sh # app-debug is an arbitrary name
# Container stays alive for 1 hour, so you can debug inside it
# 3. Check if it's a dependency issue (database not ready)
docker logs app 2>&1 | grep -i "connection\|timeout\|refused"
# Error: Connection refused to postgres:5432
# FIX: Add health check dependency, retry logic, or init container
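The "retry logic" fix for a dependency that is not ready yet can be sketched as a small wrapper for an entrypoint script. The function name retry and its argument order (attempts, delay in seconds, then the command) are assumptions made for this illustration, not part of Docker.

```shell
# Hypothetical helper: run a command up to N times, sleeping between failures.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded, stop retrying
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                          # every attempt failed
}
```

In an entrypoint this might look like `retry 10 2 pg_isready -h postgres -p 5432` before starting the app, assuming the PostgreSQL client tools are present in the image. A fixed delay keeps the sketch short; exponential backoff is a common refinement.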
OOM Kills
When a container exceeds its memory limit, the Linux kernel's OOM (Out of Memory) killer terminates the process with SIGKILL (exit code 137). This is the most dangerous failure mode because the application gets no chance to shut down gracefully — data can be lost.
# Confirm OOM kill via Docker
docker inspect app --format '{{.State.OOMKilled}}'
# true
docker inspect app --format '{{json .State}}' | jq '{Status, ExitCode, OOMKilled, FinishedAt}'
# {
# "Status": "exited",
# "ExitCode": 137,
# "OOMKilled": true,
# "FinishedAt": "2026-05-14T10:30:00.123456789Z"
# }
# Confirm via kernel logs (dmesg)
dmesg | grep -i "oom\|killed" | tail -10
# [1234567.890] Memory cgroup out of memory: Killed process 12345 (node)
# total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB, shmem-rss:0kB
# [1234567.891] oom_reaper: reaped process 12345 (node), now anon-rss:0kB
# Check current memory usage vs limit
docker stats app --no-stream --format "{{.MemUsage}}"
# 245MiB / 256MiB ← Almost at limit, OOM imminent
# View memory limit from container config
docker inspect --format '{{.HostConfig.Memory}}' app
# 268435456 (bytes = 256 MiB)
# Check cgroup memory events (how many OOM kills occurred)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' app)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events
# low 0
# high 0
# max 15 ← Hit memory.max limit 15 times
# oom 3 ← Cgroup hit an out-of-memory condition 3 times
# oom_kill 3 ← Processes killed by the OOM killer: 3
# Solutions:
# 1. Increase memory limit (if app genuinely needs more)
docker update --memory=512m --memory-swap=512m app
# 2. Find the memory leak (if usage grows unbounded)
# Run with relaxed limits and monitor growth over time
docker run -d --name app --memory=2g myapp
docker stats app # Watch memory climb over hours
# 3. Set memory-swap equal to memory (disable swap, cleaner OOM)
docker run --memory=256m --memory-swap=256m myapp
# Without this, container can swap to disk, causing slowness before OOM
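If OOMKilled is false but the exit code is still 137, the kernel may have OOM-killed a child process inside the container rather than PID 1, or the container may have received SIGKILL from another source such as docker kill. Depending on the Docker version, OOMKilled may only reflect the main container process, so check dmesg for the full picture.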
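When scripting these checks, two small helpers can make the raw numbers easier to read. This is a sketch under the assumption of cgroup v2 memory.events formatting as shown above; the function names oom_kill_count and bytes_to_mib are hypothetical.

```shell
# Hypothetical helper: read cgroup v2 memory.events on stdin,
# print the oom_kill counter (0 if the key is absent).
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }'
}

# Hypothetical helper: convert a byte count (e.g. .HostConfig.Memory) to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}
```

For example, `oom_kill_count < /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events` prints the kill counter, and `bytes_to_mib 268435456` prints 256. The cgroup path depends on the host's cgroup driver, so treat it as an assumption.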
Networking Failures
Container networking issues fall into three categories: can't reach the internet, can't reach other containers, or external clients can't reach the container. Debug systematically from inside out:
# === Step 1: Can the container reach the internet? ===
docker exec app ping -c 3 8.8.8.8
# If FAILS → network connectivity issue (bridge config, iptables)
# If WORKS → DNS or application-level issue
# === Step 2: DNS resolution working? ===
docker exec app nslookup google.com
# If FAILS → DNS configuration issue
docker exec app cat /etc/resolv.conf
# nameserver 127.0.0.11 ← Docker's embedded DNS (expected on user-defined networks;
# containers on the default bridge inherit the host's resolvers instead)
# Check Docker DNS is working
docker exec app nslookup other-container
# If FAILS for container names → containers not on same network
# === Step 3: Can containers reach each other? ===
# Verify both containers are on the same Docker network
docker network inspect bridge --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
# nginx 172.17.0.2/16
# app 172.17.0.3/16
# Test connectivity between containers
docker exec app ping -c 3 172.17.0.2
# Name-based access needs a user-defined network (the default bridge has no container DNS)
docker exec app curl -s http://nginx:80/
# === Step 4: Port mapping from host? ===
# Verify port binding
docker port app
# 3000/tcp -> 0.0.0.0:3000
# Test from host
curl -v http://localhost:3000/
# If FAILS → app not listening on correct interface inside container
# Common mistake: app listens on 127.0.0.1 inside container
docker exec app ss -tlnp
# LISTEN 0 128 127.0.0.1:3000 *:* ← WRONG: bound to localhost only
# LISTEN 0 128 0.0.0.0:3000 *:* ← CORRECT: bound to all interfaces
# === Step 5: iptables interference? ===
# Docker manages iptables rules for port forwarding
sudo iptables -t nat -L DOCKER -n --line-numbers
# Check that DNAT rules exist for published ports
# === Step 6: Docker network driver issues ===
# If a network seems corrupted, move the container to a fresh user-defined bridge
docker network prune # Remove unused networks
docker network create --driver bridge my-network
docker run --network my-network --name app myapp
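The loopback-binding mistake from Step 4 is easy to miss by eye. A minimal sketch of an automated check follows; it assumes the column layout of `ss -tln` (local address in the fourth column), and the function name loopback_only_listeners is hypothetical.

```shell
# Hypothetical helper: read `ss -tln` output on stdin and print every
# listening socket bound only to loopback (127.x.x.x or [::1]).
loopback_only_listeners() {
  awk '$1 == "LISTEN" && ($4 ~ /^127\./ || $4 ~ /^\[::1\]/) { print $4 }'
}
```

Running `docker exec app ss -tln | loopback_only_listeners` would list sockets that published ports cannot reach. Empty output suggests the bind address is not the problem.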
Filesystem Issues
Filesystem problems manifest as permission denied errors, read-only filesystem errors, or "no space left on device" — even when the host has plenty of space:
# === Read-only filesystem ===
docker exec app touch /test
# touch: cannot touch '/test': Read-only file system
# Cause 1: Container started with --read-only flag
docker inspect --format '{{.HostConfig.ReadonlyRootfs}}' app
# true ← Intentional security hardening
# FIX: Write to tmpfs mounts (/tmp, /var/run) or designated writable volumes
# Cause 2: OverlayFS layer issue
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' app
ls -la /var/lib/docker/overlay2/LAYER_ID/merged/
# Check if layers are intact
# === No space left on device ===
docker exec app df -h /
# Filesystem Size Used Avail Use% Mounted on
# overlay 50G 50G 0 100% /
# Check Docker disk usage
docker system df
# TYPE TOTAL ACTIVE SIZE RECLAIMABLE
# Images 45 12 12.5GB 8.3GB (66%)
# Containers 15 8 2.1GB 1.5GB (71%)
# Local Volumes 23 8 5.4GB 3.2GB (59%)
# Build Cache 0 0 3.8GB 3.8GB
# Clean up unused resources
docker system prune -a --volumes
# WARNING! This will remove all stopped containers, unused networks,
# unused images, and unused volumes.
# === Permission denied in volume mounts ===
docker run -d --name app -v /host/data:/app/data myapp
docker exec app ls -la /app/data/
# ls: cannot open directory '/app/data/': Permission denied
# Cause: UID mismatch between host and container
ls -la /host/data/
# drwxr-xr-x 2 root root 4096 ... ← Owned by root (UID 0)
docker exec app id
# uid=1000(appuser) gid=1000(appuser) ← Container runs as UID 1000
# FIX: Match UIDs
# Option 1: Change host directory ownership
sudo chown -R 1000:1000 /host/data/
# Option 2: Run container with matching UID
docker run --user $(id -u):$(id -g) -v /host/data:/app/data myapp
# Option 3: Use named volumes (Docker manages permissions)
docker run -v app-data:/app/data myapp
Advanced Debugging with docker exec
docker exec is your first-line debugging tool — running commands inside a container's namespaces. But many production images lack debugging tools. Here's how to work around that:
# Basic debugging inside a running container
docker exec -it app /bin/sh # Get a shell
docker exec app cat /proc/1/status # Check PID 1 details
docker exec app env # View environment variables
docker exec app cat /etc/hosts # DNS overrides
docker exec app ls -la /proc/1/fd/ # Open file descriptors
# Problem: Distroless/minimal images have no shell or tools
docker exec -it app /bin/sh
# OCI runtime exec failed: exec failed: unable to start container process:
# exec: "/bin/sh": stat /bin/sh: no such file or directory
# Solution 1: Copy a statically linked binary into the container
docker cp /usr/bin/busybox app:/tmp/busybox # host busybox must be a static build
docker exec -it app /tmp/busybox sh
# Solution 2: Use Docker's debug feature (Docker Desktop 4.27+)
docker debug app
# Attaches a debug shell with common tools pre-installed
# Works even on distroless images (injects a toolbox)
# Solution 3: Install tools temporarily (if package manager exists)
docker exec app sh -c "apt-get update && apt-get install -y curl net-tools procps"
# (Without sh -c, everything after && would run on the host, not in the container)
# WARNING: Changes lost on container restart. Only for debugging.
# Useful one-liners for debugging inside containers:
docker exec app cat /proc/net/tcp # Active TCP connections
docker exec app cat /proc/meminfo # Memory details
docker exec app cat /proc/1/cgroup # Cgroup membership
docker exec app cat /proc/1/mountinfo # Mount table
docker exec app find / -name "*.log" 2>/dev/null # Find log files
nsenter — Entering Container Namespaces from Host
When docker exec fails (no shell in the image, Docker daemon issues), nsenter lets you enter a running container's namespaces directly from the host. It operates at the kernel level, so once you know the container's host PID it does not depend on the Docker daemon:
# Get the container's PID on the host
PID=$(docker inspect --format '{{.State.Pid}}' app)
echo $PID # e.g., 12345
# Enter ALL namespaces of the container (similar to docker exec; /bin/sh must exist in the image)
sudo nsenter -t $PID -m -u -i -n -p -- /bin/sh
# -t PID Target process
# -m Mount namespace (see container's filesystem)
# -u UTS namespace (container's hostname)
# -i IPC namespace
# -n Network namespace (container's network stack)
# -p PID namespace (see container's process tree)
# Enter ONLY the network namespace (useful for network debugging)
sudo nsenter -t $PID -n -- ip addr show
# Shows network interfaces as the container sees them
sudo nsenter -t $PID -n -- ss -tlnp
# Shows listening ports inside the container
sudo nsenter -t $PID -n -- ping 8.8.8.8
# Test connectivity from the container's perspective
# Enter ONLY the mount namespace (inspect container's filesystem)
sudo nsenter -t $PID -m -- ls /app/config/
sudo nsenter -t $PID -m -- cat /etc/resolv.conf
# Enter the PID and mount namespaces (see container's process tree)
# ps reads /proc, so -p alone would still list the host's processes
sudo nsenter -t $PID -p -m -- ps aux
# PID USER TIME COMMAND
# 1 root 0:05 node server.js
# 15 root 0:00 ps aux
# Why nsenter over docker exec?
# 1. Works when Docker daemon is unresponsive
# 2. Works whenever the container's process still exists in /proc, even if docker exec is failing
# 3. Can enter individual namespaces (not all-or-nothing)
# 4. Has access to host tools (strace, tcpdump, perf)
In Kubernetes, prefer kubectl debug (ephemeral containers) instead of SSH-ing into nodes to run nsenter.
strace — System Call Tracing
strace intercepts and records every system call a process makes. When a container process hangs, crashes without useful logs, or behaves unexpectedly, strace reveals exactly what it's doing at the kernel level:
# Get the container's main PID
PID=$(docker inspect --format '{{.State.Pid}}' app)
# Trace all system calls of the container's main process
sudo strace -p $PID -f -tt
# -p PID Attach to running process
# -f Follow forked child processes
# -tt Print microsecond timestamps
# Output example:
# 10:30:01.123456 read(5, "GET / HTTP/1.1\r\n", 4096) = 16
# 10:30:01.123500 write(5, "HTTP/1.1 200 OK\r\n", 17) = 17
# 10:30:01.123550 epoll_wait(3, [{EPOLLIN, {u32=5}}], 128, 5000) = 1
# Filter for specific syscall categories:
# Network operations only
sudo strace -p $PID -f -e trace=network -tt
# connect(5, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.0.5")}, 16) = -1 ECONNREFUSED (Connection refused)
# ← Shows exactly which connection is failing and why
# File operations only
sudo strace -p $PID -f -e trace=file -tt
# open("/app/config/database.yml", O_RDONLY) = -1 ENOENT (No such file or directory)
# ← Shows what files the app is trying to read
# Process operations only
sudo strace -p $PID -f -e trace=process
# Common discoveries via strace:
# 1. "Permission denied" — which file? → strace shows the exact path
# 2. "Connection refused" — to where? → strace shows IP:port
# 3. "Process hangs" — on what? → strace shows it's blocked on read/poll/futex
# 4. "Slow startup" — why? → strace shows DNS lookups or file scans taking seconds
# Save trace to file for analysis
sudo strace -p $PID -f -tt -o /tmp/app-trace.log
# Then search: grep -i "ENOENT\|EACCES\|ECONNREFUSED" /tmp/app-trace.log
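A saved trace can be long, so a quick tally of failing system calls is often a useful first pass. The sketch below assumes the usual strace line ending `= -1 ERRNAME (description)`. The function name errno_summary is hypothetical.

```shell
# Hypothetical helper: read an strace log on stdin and print
# "ERRNO count" lines, most frequent error first.
errno_summary() {
  grep -oE '= -1 E[A-Z0-9]+' |   # keep only the failing return values
    awk '{ print $3 }' |         # extract the errno name
    sort | uniq -c | sort -rn |  # count and rank
    awk '{ print $2, $1 }'       # normalize to "NAME count"
}
```

For the trace saved above, `errno_summary < /tmp/app-trace.log` gives a ranked list. A large ENOENT count during startup is often harmless path probing, while EACCES or ECONNREFUSED usually deserves a closer look.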
tcpdump — Network Packet Capture
tcpdump captures raw network packets, letting you see exactly what data flows in and out of a container. Essential for debugging API failures, TLS issues, DNS problems, and connection timeouts:
# Method 1: Capture from inside container's network namespace
PID=$(docker inspect --format '{{.State.Pid}}' app)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n -c 50
# Captures 50 packets from the container's perspective
# Method 2: Find container's veth pair on host and capture there
# Step 1: Get container's interface index
docker exec app cat /sys/class/net/eth0/iflink
# 15 (this is the ifindex of the host-side veth)
# Step 2: Find matching veth on host
ip link | grep "^15:"
# 15: veth1a2b3c4@if14:
# Step 3: Capture on that interface
sudo tcpdump -i veth1a2b3c4 -n -c 100
# Useful capture filters:
# All HTTP traffic to/from the container
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n port 80 or port 443
# DNS queries (why is name resolution failing?)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n port 53
# 10:30:01.123 IP 172.17.0.3.54321 > 127.0.0.11.53: A? database.internal
# TCP connection attempts (why is connection refused?)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -n "tcp[tcpflags] & (tcp-syn|tcp-rst) != 0"
# Shows SYN (connection attempts) and RST (rejections)
# Save capture to file for Wireshark analysis
sudo nsenter -t $PID -n -- tcpdump -i eth0 -w /tmp/container-traffic.pcap -c 1000
# Open in Wireshark: wireshark /tmp/container-traffic.pcap
# Capture with ASCII output (see HTTP bodies)
sudo nsenter -t $PID -n -- tcpdump -i eth0 -A port 80 -c 20
Docker Debug Container Pattern
Instead of installing tools in production containers, use an ephemeral debug container that shares the target container's namespaces. The debug container has all the tools; the target container stays lean:
# nicolaka/netshoot — the Swiss Army knife of container debugging
# Contains: curl, ping, dig, nslookup, tcpdump, ip, iptables, ss, netstat,
# strace, ltrace, perf, drill, mtr, iperf3, and 50+ more tools
# Share the target container's NETWORK namespace
docker run -it --rm \
--network container:app \
nicolaka/netshoot
# Inside netshoot, you see app's network stack:
ip addr show # app's interfaces
ss -tlnp # app's listening ports
curl localhost:3000 # access app's ports via localhost
dig database.internal # resolve DNS from app's perspective
tcpdump -i eth0 # capture app's traffic
# Share BOTH network and PID namespace
# (--cap-add SYS_PTRACE is usually needed for strace to attach)
docker run -it --rm \
--network container:app \
--pid container:app \
--cap-add SYS_PTRACE \
nicolaka/netshoot
# Now you can also see app's processes:
ps aux # See all processes in app container
strace -p 1 # Trace app's PID 1
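# For filesystem access too: --volumes-from shares only app's volumes;
# the shared PID namespace exposes its root filesystem via /proc/1/root
# (may also require --cap-add SYS_PTRACE, depending on the app's user)
docker run -it --rm \
--network container:app \
--pid container:app \
--volumes-from app \
nicolaka/netshoot
# Access app's files:
ls /proc/1/root/app/ # See app's application code
cat /proc/1/root/app/config.yml # Read config files
cat /proc/1/environ # Read environment variables of PID 1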
Performance Debugging
When containers are slow but not crashing, the problem is usually CPU throttling, I/O saturation, or memory pressure. These issues are invisible without the right tools:
# === CPU Throttling Detection ===
# Check if container is being throttled by CFS scheduler
CONTAINER_ID=$(docker inspect --format '{{.Id}}' app)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
# nr_periods 100000 # Total scheduling periods
# nr_throttled 15000 # Periods where container was throttled
# throttled_usec 30000000 # Total microseconds throttled (30 seconds!)
# Throttle ratio (should be < 5% for healthy containers)
# throttled_ratio = nr_throttled / nr_periods = 15000/100000 = 15% ← TOO HIGH
# FIX: Increase CPU limits
docker update --cpus=2.0 app # Allow 2 full CPU cores
# === Memory Pressure ===
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.pressure
# some avg10=25.00 avg60=15.50 avg300=8.20 total=567890000
# full avg10=5.00 avg60=2.00 avg300=0.80 total=123456000
# "some" > 10% means processes are waiting for memory reclaim
# === I/O Saturation ===
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/io.pressure
# some avg10=45.00 avg60=30.20 avg300=12.80 total=89012345
# ← 45% of time at least one task is waiting for I/O → disk bottleneck
# === Process-level investigation from host ===
PID=$(docker inspect --format '{{.State.Pid}}' app)
# CPU usage per thread inside the container
sudo top -H -p $PID
# Shows which threads are consuming CPU
# What is the process doing? (requires perf tools)
sudo perf top -p $PID
# Real-time view of which functions consume CPU
# I/O operations per process
sudo pidstat -d -p $PID 1 5
# Shows read/write bytes per second for the container process
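The throttle ratio described above can be computed directly from cpu.stat instead of by hand. This is a minimal sketch assuming the cgroup v2 key names shown earlier (nr_periods, nr_throttled). The function name throttle_percent is hypothetical.

```shell
# Hypothetical helper: read cgroup v2 cpu.stat on stdin and print the
# percentage of CFS periods in which the cgroup was throttled.
throttle_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"   # no CPU limit set, or no periods recorded yet
    }'
}
```

With the sample values above, `throttle_percent < /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat` prints 15.0. The counters are cumulative since container start, so comparing two readings a minute apart shows whether throttling is happening now.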
The "CPU Usage is Low but App is Slow" Mystery
A container shows only 30% CPU usage yet response times are 10x normal. The paradox resolves when you understand CFS scheduling:
- Container has --cpus=0.5 (50% of one core)
- In each 100ms CFS period, it gets 50ms of CPU time
- If a request arrives and needs 80ms of CPU work, it completes in 130ms wall-clock time (50ms running + 50ms throttled + 30ms running)
- docker stats never shows more than 50% CPU, which looks "normal", but latency has grown by more than 60%
Diagnostic: Check nr_throttled in cpu.stat. If throttling is frequent, either increase CPU limits or optimize the hot path.
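The CFS arithmetic can be sketched under a simplified model: a single-threaded request that starts at the beginning of a period with the full quota available and no competing work in the container. The function name cfs_wall_ms and its arguments (CPU work, quota and period, all in milliseconds) are assumptions made for this illustration.

```shell
# Hypothetical helper: estimate wall-clock time (ms) for a request that
# needs <work> ms of CPU under a CFS quota of <quota> ms per <period> ms.
cfs_wall_ms() {
  local work="$1" quota="$2" period="$3" full
  # Number of periods whose quota is fully used up before the final one.
  full=$(( (work + quota - 1) / quota - 1 ))
  if [ "$full" -lt 0 ]; then
    full=0
  fi
  # Each exhausted period costs a whole period of wall time;
  # the remaining CPU work then runs unthrottled.
  echo $(( full * period + work - full * quota ))
}
```

For the scenario above, `cfs_wall_ms 80 50 100` prints 130, while a 40ms request (`cfs_wall_ms 40 50 100`) fits inside one quota and prints 40. Real workloads with several threads or concurrent requests exhaust the quota faster, so this is a lower bound rather than a prediction.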
Common Issues Quick Reference
| Symptom | Likely Cause | Diagnostic Command | Fix |
|---|---|---|---|
| Exit code 137 | OOM kill | docker inspect --format '{{.State.OOMKilled}}' | Increase --memory |
| Exit code 139 | Segfault | dmesg | grep segfault | Check binary architecture, rebuild |
| Exit code 127 | Binary not found | docker run --entrypoint sh image -c "which cmd" | Fix CMD/ENTRYPOINT path |
| "Permission denied" | UID mismatch | docker exec app id; ls -la /path | Match UIDs or use named volumes |
| "No space left" | Overlay full | docker system df | docker system prune |
| "Connection refused" | Wrong bind address | docker exec app ss -tlnp | Bind to 0.0.0.0 not 127.0.0.1 |
| "Name resolution failed" | DNS misconfiguration | docker exec app cat /etc/resolv.conf | Check Docker DNS, network driver |
| Port not accessible | Port not published | docker port container | Add -p host:container |
| Container extremely slow | CPU throttling | cat cpu.stat | grep throttled | Increase --cpus |
| Random kills under load | PID limit reached | cat pids.current; cat pids.max | Increase --pids-limit |
| Volume data missing | Anonymous volume | docker inspect -f '{{.Mounts}}' | Use named volumes |
| Health check failing | App not ready on expected port | docker exec app curl localhost:PORT | Check app startup, add start_period |
| Intermittent network drops | MTU mismatch | docker exec app ip link show eth0 | Set --opt com.docker.network.driver.mtu=1400 |
| Container can't reach host | Bridge isolation | docker exec app ping host.docker.internal | Use --add-host or host network |
| Logs missing after restart | json-file driver without volume | docker inspect --format '{{.LogPath}}' | Use log aggregation (Fluent Bit) |
Exercises
Use docker inspect to confirm the exit code and diagnose what happened. For exit 137, trigger a real OOM kill by running a memory-eating process inside a container with --memory=50m.
Conclusion & Next Steps
Container troubleshooting is a skill that combines systematic methodology with deep Linux knowledge. The toolkit we've built in this article escalates from simple to advanced:
- docker logs — 80% of issues (application errors, missing config)
- docker inspect — Exit codes, OOM flags, mount points, network config
- docker exec — Interactive debugging inside running containers
- nsenter — Enter specific namespaces when Docker isn't cooperating
- strace — See exactly what system calls a stuck process is making
- tcpdump — Capture and analyze network traffic at the packet level
- Debug containers — Full toolbox without polluting production images
With monitoring (Part 20) telling you what's wrong and troubleshooting (this part) telling you why, you can handle most container incidents. The final piece is scaling these practices to the enterprise, which is the subject of our concluding article.
Next in the Series
In Part 22: Enterprise Container Platforms, we'll scale containers to the enterprise — registry replication, access control, air-gapped deployments, policy enforcement, multi-architecture builds, and choosing between Docker Enterprise, OpenShift, and Rancher.