The Three Pillars of Observability
Observability answers one question: "Why is my system behaving this way?" Not just what is happening, but why. For containerised systems, observability is built on three complementary signal types:
- Metrics — Numeric measurements over time. CPU usage at 85%, memory at 2.1 GB, 1,247 requests/second. Metrics tell you what is happening quantitatively and enable alerting.
- Logs — Discrete events with context. "Connection refused to database at 10:42:03", "User 12345 authentication failed". Logs tell you what happened with rich detail.
- Traces — Request journeys across services. A single HTTP request traversing API gateway → auth service → user service → database. Traces show you where time is spent and how services interact.
flowchart TD
subgraph Signals["Observability Signals"]
M["Metrics<br/>Numeric time-series"]
L["Logs<br/>Discrete events"]
T["Traces<br/>Request journeys"]
end
subgraph Collection["Collection Layer"]
P["Prometheus / cAdvisor"]
F["Fluent Bit / Fluentd"]
J["Jaeger / Zipkin / OTLP"]
end
subgraph Storage["Storage & Query"]
PS["Prometheus TSDB"]
ES["Elasticsearch / Loki"]
TS["Jaeger Backend / Tempo"]
end
subgraph Viz["Visualization"]
G["Grafana Dashboards"]
K["Kibana / Grafana Explore"]
JU["Jaeger UI / Grafana Tempo"]
end
M --> P --> PS --> G
L --> F --> ES --> K
T --> J --> TS --> JU
style Signals fill:#f0f9f9,stroke:#3B9797
style Collection fill:#f8f9fa,stroke:#132440
style Storage fill:#f8f9fa,stroke:#16476A
style Viz fill:#fff5f5,stroke:#BF092F
Why Containers Need Special Observability
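Containers introduce observability challenges that traditional server monitoring doesn't face: they are short-lived, share the host kernel, and are bounded by cgroup limits rather than by the physical machine.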
Docker Stats Command
The simplest monitoring tool is built into Docker itself. docker stats provides a real-time stream of resource usage for every running container — no setup required:
# Real-time stats for all running containers (live updating)
docker stats
# CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# a1b2c3d4e5f6 nginx 0.02% 4.5MiB / 512MiB 0.88% 1.2kB / 648B 0B / 4.1kB 3
# f6e5d4c3b2a1 redis 0.15% 7.8MiB / 256MiB 3.05% 2.4kB / 1.1kB 0B / 0B 5
# 1a2b3c4d5e6f app 1.23% 145MiB / 1GiB 14.16% 54kB / 32kB 8.2MB / 512kB 12
# Stats for specific containers (useful in scripts)
docker stats nginx redis --no-stream
# Prints one snapshot and exits (no live updating)
# Custom format for machine-readable output
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# NAME CPU % MEM USAGE / LIMIT NET I/O
# nginx 0.02% 4.5MiB / 512MiB 1.2kB / 648B
# redis 0.15% 7.8MiB / 256MiB 2.4kB / 1.1kB
# app 1.23% 145MiB / 1GiB 54kB / 32kB
# JSON format for programmatic consumption
docker stats --no-stream --format '{{json .}}' | jq '.'
# {
# "BlockIO": "0B / 4.1kB",
# "CPUPerc": "0.02%",
# "Container": "a1b2c3d4e5f6",
# "ID": "a1b2c3d4e5f6",
# "MemPerc": "0.88%",
# "MemUsage": "4.5MiB / 512MiB",
# "Name": "nginx",
# "NetIO": "1.2kB / 648B",
# "PIDs": "3"
# }
# Script to alert on high memory usage (requires bc)
docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | while read -r name pct; do
  value=$(echo "$pct" | tr -d '%')
  if [ "$(echo "$value > 80" | bc)" -eq 1 ]; then
    echo "WARNING: $name memory at $pct"
  fi
done
docker stats is useful for quick debugging but has serious limitations for production: no history (real-time only), no alerting, no per-process breakdown, and it requires Docker socket access. For production, we need proper metrics collection.
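The same memory check can be done against the JSON output, which avoids string handling in the shell. The following is a minimal Python sketch rather than a production tool: the function name `find_memory_hogs` and the 80% default threshold are illustrative assumptions, and it expects one JSON object per line, as printed by `docker stats --no-stream --format '{{json .}}'`.

```python
import json


def find_memory_hogs(stats_lines, threshold_pct=80.0):
    """Return (name, percent) pairs for containers above threshold_pct.

    stats_lines: iterable of JSON strings, one per container, in the shape
    printed by `docker stats --no-stream --format '{{json .}}'`.
    """
    hogs = []
    for line in stats_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        try:
            # MemPerc is a string such as "14.16%"; strip the sign before parsing
            percent = float(record["MemPerc"].rstrip("%"))
        except ValueError:
            continue  # non-numeric placeholder: no reading available
        if percent > threshold_pct:
            hogs.append((record["Name"], percent))
    return hogs


if __name__ == "__main__":
    # Hypothetical sample records; real input would come from docker stats
    sample = [
        '{"Name": "nginx", "MemPerc": "0.88%"}',
        '{"Name": "app", "MemPerc": "91.50%"}',
    ]
    for name, percent in find_memory_hogs(sample):
        print(f"WARNING: {name} memory at {percent:.2f}%")
```

In practice the lines could be piped in on standard input, for example by passing `sys.stdin` as `stats_lines`.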
Container Metrics Sources
Every metric Docker shows ultimately comes from the Linux kernel's cgroups pseudo-filesystem. Understanding this source helps you build custom monitoring and verify what higher-level tools report:
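# Find the cgroup path for a running container
# (this path assumes cgroup v2 with the systemd cgroup driver; with the
# cgroupfs driver the directory is /sys/fs/cgroup/docker/CONTAINER_ID instead)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' nginx)
CGROUP_PATH="/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope"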
# --- CPU Metrics ---
# Total CPU time consumed (in microseconds for cgroup v2)
cat ${CGROUP_PATH}/cpu.stat
# usage_usec 1234567890 # Total CPU time used
# user_usec 1000000000 # Time in user space
# system_usec 234567890 # Time in kernel space
# nr_periods 50000 # Number of enforcement periods
# nr_throttled 150 # Times the container was throttled
# throttled_usec 3000000 # Total throttled time
# Current CPU pressure (cgroup v2 PSI)
cat ${CGROUP_PATH}/cpu.pressure
# some avg10=2.50 avg60=1.80 avg300=0.95 total=45678901
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# --- Memory Metrics ---
cat ${CGROUP_PATH}/memory.current
# 47185920 (bytes = ~45 MiB current usage)
cat ${CGROUP_PATH}/memory.max
# 536870912 (bytes = 512 MiB limit)
cat ${CGROUP_PATH}/memory.stat
# anon 30000000 # Anonymous memory (heap, stack)
# file 15000000 # File-backed memory (page cache)
# slab 2000000 # Kernel slab allocations
# pgfault 125000 # Page faults
# pgmajfault 50 # Major page faults (disk reads)
# Memory pressure
cat ${CGROUP_PATH}/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# --- I/O Metrics ---
cat ${CGROUP_PATH}/io.stat
# 8:0 rbytes=1048576 wbytes=524288 rios=100 wios=50 dbytes=0 dios=0
# --- PID Count ---
cat ${CGROUP_PATH}/pids.current
# 5
cat ${CGROUP_PATH}/pids.max
# 200
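cgroup v1 vs v2: On cgroup v1, each controller has its own hierarchy (for example /sys/fs/cgroup/memory/docker/<id>/memory.usage_in_bytes). On cgroup v2 (unified hierarchy), everything lives under a single path. Docker 20.10 and later use cgroup v2 whenever the host boots with the unified hierarchy, which is the default on most current distributions. All examples above use the v2 layout.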
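To turn the raw cgroup v2 files into numbers a script can act on, the flat "key value" files (cpu.stat, memory.stat) and the PSI files (cpu.pressure, memory.pressure) each need a small parser. The sketch below is one way to do it in Python; the function names are illustrative, and it works on text already read from the files so it does not assume any particular cgroup path.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines (cpu.stat, memory.stat) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def parse_pressure(text):
    """Parse PSI lines such as 'some avg10=2.50 avg60=1.80 avg300=0.95 total=1'."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


def throttle_ratio(cpu_stat):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = cpu_stat.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods elapsed yet
    return cpu_stat.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    # Values taken from the cpu.stat example above
    cpu_stat = parse_flat_keyed("nr_periods 50000\nnr_throttled 150\n")
    print(f"throttled in {throttle_ratio(cpu_stat):.1%} of periods")
```

With the sample values above (150 throttled periods out of 50000), the ratio is 0.3%, far below the level where throttling would be a concern.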
cAdvisor
cAdvisor (Container Advisor) is Google's open-source container metrics collector. It runs as a daemon, automatically discovers all containers on the host, and exposes their resource usage via a web UI and a Prometheus-compatible metrics endpoint. It's the standard "metrics agent" for container environments.
# Run cAdvisor as a container (the standard deployment method)
docker run -d \
--name=cadvisor \
--restart=always \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--privileged \
--device=/dev/kmsg \
gcr.io/cadvisor/cadvisor:v0.49.1
# Verify cAdvisor is running
curl -s http://localhost:8080/healthz
# ok
# Access the web UI at http://localhost:8080
# Shows: container list, per-container CPU/memory/network/filesystem graphs
# Access Prometheus metrics endpoint
curl -s http://localhost:8080/metrics | head -20
# # HELP container_cpu_usage_seconds_total Cumulative cpu time consumed
# # TYPE container_cpu_usage_seconds_total counter
# container_cpu_usage_seconds_total{container_label_com_docker_compose_service="nginx",id="/docker/a1b2c3..."} 12.345
# container_cpu_usage_seconds_total{container_label_com_docker_compose_service="redis",id="/docker/f6e5d4..."} 5.678
# Key metrics exposed by cAdvisor:
curl -s http://localhost:8080/metrics | grep -E "^container_(cpu|memory|network|fs)" | sort -u | head -20
# container_cpu_cfs_periods_total
# container_cpu_cfs_throttled_periods_total
# container_cpu_cfs_throttled_seconds_total
# container_cpu_usage_seconds_total
# container_fs_reads_bytes_total
# container_fs_writes_bytes_total
# container_memory_cache
# container_memory_rss
# container_memory_usage_bytes
# container_memory_working_set_bytes
# container_network_receive_bytes_total
# container_network_transmit_bytes_total
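container_memory_usage_bytes includes all memory, including page cache that the kernel can reclaim under pressure. container_memory_working_set_bytes is what you should alert on: it excludes inactive file cache and so approximates memory that cannot be reclaimed without impacting the container. It is the metric the Kubernetes kubelet uses for memory-based eviction, and the best available proxy for how close a container is to an OOM kill.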
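The working set can be reproduced by hand from the cgroup files, which is a useful cross-check on what cAdvisor reports. The sketch below assumes the commonly described derivation (total usage minus the `inactive_file` entry of memory.stat, clamped at zero); the function name is illustrative and the inputs are the contents of memory.current and memory.stat.

```python
def working_set_bytes(memory_current, memory_stat_text):
    """Approximate the working set: current usage minus inactive file cache.

    memory_current: integer read from memory.current
    memory_stat_text: full text of memory.stat
    """
    inactive_file = 0
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "inactive_file":
            inactive_file = int(value)
            break
    # Clamp at zero in case the two files were read at slightly different times
    return max(memory_current - inactive_file, 0)


if __name__ == "__main__":
    # Hypothetical readings: 45 MiB in use, 10 MB of it inactive page cache
    stat_text = "anon 30000000\nfile 15000000\ninactive_file 10000000\n"
    print(working_set_bytes(47185920, stat_text))
```

Dividing the result by the value of memory.max gives the same ratio that a memory alert on the working set would evaluate.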
Prometheus & Grafana Stack
The industry-standard monitoring stack for containers is Prometheus (metrics collection and storage) paired with Grafana (visualization and alerting). This combination gives you historical data, powerful queries (PromQL), and beautiful dashboards:
# docker-compose.monitoring.yml
# Complete monitoring stack: Prometheus + Grafana + cAdvisor + Node Exporter
version: "3.8"
services:
  # Prometheus - Metrics collection and storage
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    networks:
      - monitoring
  # Grafana - Visualization and dashboards
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    networks:
      - monitoring
  # cAdvisor - Container metrics collector
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
  # Node Exporter - Host-level metrics
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
    ports:
      - "9100:9100"
    networks:
      - monitoring
volumes:
  prometheus-data:
  grafana-data:
networks:
  monitoring:
    driver: bridge
# prometheus/prometheus.yml
# Prometheus configuration for container monitoring
global:
  scrape_interval: 15s # Scrape targets every 15 seconds
  evaluation_interval: 15s # Evaluate alert rules every 15 seconds
rule_files:
  - "alert-rules.yml"
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # cAdvisor - container metrics
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
    metric_relabel_configs:
      # Drop high-cardinality metrics to save storage
      - source_labels: [__name__]
        regex: "container_tasks_state"
        action: drop
  # Node Exporter - host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  # Docker daemon metrics (requires "metrics-addr" in daemon.json; on Linux,
  # host.docker.internal resolves only if the Prometheus service maps it,
  # e.g. extra_hosts: ["host.docker.internal:host-gateway"])
  - job_name: "docker-daemon"
    static_configs:
      - targets: ["host.docker.internal:9323"]
# prometheus/alert-rules.yml
# Alert rules for container monitoring
groups:
  - name: container-alerts
    rules:
      # Container using more than 90% of memory limit
      - alert: ContainerHighMemory
        expr: |
          (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
          and container_spec_memory_limit_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory above 90%"
          description: "Memory usage is {{ $value | humanizePercentage }}"
      # Container CPU throttled more than 25% of periods
      - alert: ContainerCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
          / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} CPU throttled"
      # Container restarting frequently
      # (cAdvisor exposes no restart counter, so count changes of the
      # container start timestamp instead)
      - alert: ContainerRestartLoop
        expr: |
          changes(container_start_time_seconds{name=~".+"}[1h]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarting frequently"
Essential Container PromQL
CPU usage rate (cores): rate(container_cpu_usage_seconds_total{name=~".+"}[5m])
Memory usage %: container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100
Network receive rate: rate(container_network_receive_bytes_total[5m])
Top 5 CPU consumers: topk(5, rate(container_cpu_usage_seconds_total{name=~".+"}[5m]))
Throttled containers: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0
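Most of these queries wrap a counter in rate(), so it helps to see the arithmetic behind it. The sketch below is a simplified model, not Prometheus's actual implementation: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The real rate() additionally extrapolates to the edges of the range window, which this version leaves out.

```python
def simple_rate(samples):
    """Per-second increase of a counter across the given samples.

    samples: list of (timestamp_seconds, value) tuples sorted by time.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter went backwards: the container restarted and the
            # counter began again from zero
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # Hypothetical scrapes of container_cpu_usage_seconds_total, 15 s apart
    scrapes = [(0, 10.0), (15, 13.0), (30, 16.0)]
    print(simple_rate(scrapes))  # CPU-seconds per second, i.e. cores in use
```

Because the counter is measured in CPU-seconds, the resulting rate reads directly as the number of cores in use, which is why the first query above is labelled "cores".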
Key Metrics to Monitor
Not all metrics are equally important. Focus on these signals that indicate actual problems before they become outages:
| Metric | What It Measures | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|---|
| CPU Throttling % | % of periods where CPU was capped | > 10% | > 25% | Increase CPU limits or optimize code |
| Memory Working Set | Non-reclaimable memory usage | > 80% of limit | > 90% of limit | Increase limit or fix memory leaks |
| Restart Count | Container restarts in time window | > 2/hour | > 5/hour | Check logs for crash reason |
| Network Errors | TX/RX errors and drops | > 0.1% | > 1% | Check network configuration, MTU |
| Disk I/O Wait | Time spent waiting for I/O | > 20ms avg | > 100ms avg | Move to faster storage, optimize queries |
| PID Count | Number of processes in container | > 80% of pids.max | > 95% of pids.max | Fix fork bombs or increase limit |
| Health Check Failures | Consecutive failed health probes | > 1 failure | > 3 consecutive | Check application health endpoint |
| Image Pull Time | Time to pull container image | > 30s | > 120s | Use smaller images, registry mirrors |
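Low average CPU usage can hide throttling: a container may sit well under its limit on average yet be capped during short bursts. Always check nr_throttled alongside CPU percentage.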
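The warning and critical columns of the table translate directly into a small classification step. The sketch below encodes three of the ratio-based rows as an example; the dictionary keys and the function name are illustrative assumptions, and the remaining rows would follow the same pattern with their own units.

```python
# (warning, critical) thresholds expressed as ratios, taken from the table
THRESHOLDS = {
    "cpu_throttled_ratio": (0.10, 0.25),
    "memory_working_set_ratio": (0.80, 0.90),
    "pids_ratio": (0.80, 0.95),
}


def classify(metric, value):
    """Map a measured ratio to 'ok', 'warning' or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    readings = {
        "cpu_throttled_ratio": 0.003,
        "memory_working_set_ratio": 0.85,
        "pids_ratio": 0.96,
    }
    for metric, value in readings.items():
        print(f"{metric}: {classify(metric, value)}")
```

Keeping the thresholds in one table-like structure makes it easier to review them alongside the alert rules, since both should agree.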
Docker Logging Architecture
Docker captures all stdout and stderr output from container processes and routes it through a configurable logging driver. The driver determines where logs are stored and in what format:
flowchart LR
A["Container Process<br/>(stdout/stderr)"] --> B["Docker Daemon<br/>(log router)"]
B --> C["json-file<br/>(default)"]
B --> D["syslog"]
B --> E["fluentd"]
B --> F["awslogs"]
B --> G["gcplogs"]
B --> H["journald"]
C --> C1["/var/lib/docker/containers/ID/ID-json.log"]
D --> D1["syslog daemon"]
E --> E1["Fluentd collector"]
F --> F1["CloudWatch Logs"]
G --> G1["Cloud Logging"]
H --> H1["systemd journal"]
style A fill:#f0f9f9,stroke:#3B9797
style B fill:#f8f9fa,stroke:#132440
| Driver | Storage | docker logs | Best For | Notes |
|---|---|---|---|---|
| json-file | Local JSON files | Yes | Development, single-host | Default. Configure max-size and max-file for rotation. |
| local | Custom binary format | Yes | Better performance than json-file | Compressed, faster writes. Docker 18.09+. |
| journald | systemd journal | Yes | systemd-based Linux hosts | Integrates with journalctl. |
| syslog | Remote syslog server | No | Enterprise syslog infrastructure | Supports TLS, TCP/UDP. |
| fluentd | Fluentd daemon | No | Flexible log routing/filtering | Buffered, async delivery. |
| awslogs | CloudWatch Logs | No | AWS environments | Direct to CloudWatch, no agent needed. |
| gcplogs | Google Cloud Logging | No | GCP environments | Direct to Cloud Logging. |
| splunk | Splunk HEC | No | Enterprise Splunk deployments | HTTP Event Collector integration. |
| none | Discarded | No | Performance-critical, no logs needed | Container output is thrown away entirely. |
# Check current logging driver for a container
docker inspect --format '{{.HostConfig.LogConfig.Type}}' nginx
# json-file
# Run a container with a specific logging driver
docker run -d --name app \
--log-driver=json-file \
--log-opt max-size=10m \
--log-opt max-file=5 \
--log-opt compress=true \
nginx:alpine
# Set daemon-wide default in /etc/docker/daemon.json
cat /etc/docker/daemon.json
# {
# "log-driver": "json-file",
# "log-opts": {
# "max-size": "20m",
# "max-file": "5",
# "compress": "true"
# }
# }
# View container logs (json-file, local, and journald always support this;
# Docker Engine 20.10+ also keeps a local cache so it works with other drivers)
docker logs nginx --tail 50 --follow --timestamps
# 2026-05-14T10:30:01.123Z 172.17.0.1 - - [14/May/2026:10:30:01 +0000] "GET / HTTP/1.1" 200 615
# View raw log file on host
cat /var/lib/docker/containers/CONTAINER_ID/CONTAINER_ID-json.log | jq '.'
# {
# "log": "172.17.0.1 - - [14/May/2026:10:30:01 +0000] \"GET / HTTP/1.1\" 200 615\n",
# "stream": "stdout",
# "time": "2026-05-14T10:30:01.123456789Z"
# }
Without max-size and max-file options, the json-file driver will write logs indefinitely until the disk is full. This is the #1 cause of "mystery disk full" incidents on Docker hosts. Always configure log rotation, even in development.
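Rotation settings also make disk usage predictable: each container can hold at most max-size times max-file bytes of logs. The sketch below works out that budget; it assumes the k/m/g suffixes are binary multiples (1024-based), and the function names are illustrative. With compression enabled, rotated files are smaller, so the figure is an upper bound.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '10m' or '512k' to bytes."""
    text = str(text).strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # plain number of bytes


def max_log_bytes(max_size, max_file, containers=1):
    """Upper bound on log disk usage for the given rotation settings."""
    return parse_size(max_size) * int(max_file) * containers


if __name__ == "__main__":
    # Daemon-wide defaults from the example above, for a host with 10 containers
    total = max_log_bytes("20m", "5", containers=10)
    print(f"worst case: {total / 1024 ** 2:.0f} MiB of container logs")
```

For the daemon.json example above (20m, 5 files), ten containers can never use more than 1000 MiB of log space.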
Structured Logging Best Practices
Unstructured log lines ("Error: something went wrong") are nearly useless at scale. Structured logging outputs machine-parseable records (typically JSON) that can be indexed, filtered, and correlated automatically:
// BAD: Unstructured log line
"ERROR 2026-05-14 10:30:01 Connection to database failed after 3 retries"
// GOOD: Structured JSON log
{
  "timestamp": "2026-05-14T10:30:01.456Z",
  "level": "error",
  "service": "user-api",
  "message": "Database connection failed",
  "error": "connection refused",
  "host": "db-primary.internal",
  "port": 5432,
  "retries": 3,
  "retry_interval_ms": 1000,
  "correlation_id": "req-a1b2c3d4-e5f6-7890",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "container_id": "f6e5d4c3b2a1"
}
// Node.js application with structured logging (pino)
const pino = require('pino');
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: 'user-api',
    version: process.env.APP_VERSION || 'unknown',
    environment: process.env.NODE_ENV || 'development',
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});
// Usage — outputs JSON to stdout (Docker captures it)
logger.info({ userId: 12345, action: 'login' }, 'User authenticated');
// {"level":"info","time":"2026-05-14T10:30:01.456Z","service":"user-api","userId":12345,"action":"login","msg":"User authenticated"}
logger.error({ err: error, requestId: req.id }, 'Database query failed');
// {"level":"error","time":"2026-05-14T10:30:02.789Z","service":"user-api","err":{"message":"timeout","stack":"..."},"requestId":"abc-123","msg":"Database query failed"}
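The same pattern carries over to other runtimes. As an illustration, here is a minimal Python sketch that emits one JSON object per log line using only the standard library. The class name `JsonFormatter`, the helper `build_logger`, and the convention of passing extra fields under a `fields` key are assumptions made for this example, not an established API.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the output
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["err"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service, stream=sys.stdout):
    """Create a logger that writes JSON lines to stdout for Docker to capture."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("user-api")
    log.info("User authenticated", extra={"fields": {"userId": 12345, "action": "login"}})
```

Writing to stdout rather than to a file keeps the application unaware of where logs end up, which is what lets the Docker logging driver do the routing.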
Log Aggregation with Fluent Bit
Fluent Bit is a lightweight log processor that collects container logs, parses them, and forwards them to storage backends. It comes from the same project family as Fluentd but is written in C with a much smaller memory footprint, which makes it a common choice for container environments:
# docker-compose.logging.yml
# Log aggregation stack: Fluent Bit + Loki + Grafana
version: "3.8"
services:
  # Fluent Bit - Log collector and forwarder
  fluent-bit:
    image: fluent/fluent-bit:3.0
    container_name: fluent-bit
    restart: unless-stopped
    volumes:
      - ./fluent-bit/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro
      - ./fluent-bit/parsers.conf:/fluent-bit/etc/parsers.conf:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # Writable: Fluent Bit stores its tail offset database (flb_docker.db) here
      - /var/log:/var/log
    depends_on:
      - loki
    networks:
      - logging
  # Grafana Loki - Log storage and indexing
  loki:
    image: grafana/loki:2.9.6
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - logging
  # Grafana - Log visualization (query via LogQL)
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana-logs
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - loki
    networks:
      - logging
volumes:
  loki-data:
networks:
  logging:
    driver: bridge
# fluent-bit/fluent-bit.conf
# Fluent Bit configuration for Docker container logs
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info
    Parsers_File parsers.conf
# Input: Read Docker container JSON log files
[INPUT]
    Name             tail
    Path             /var/lib/docker/containers/*/*.log
    Parser           docker
    Tag              docker.*
    Refresh_Interval 10
    Mem_Buf_Limit    5MB
    Skip_Long_Lines  On
    DB               /var/log/flb_docker.db
# Filter: Parse JSON log content from applications
[FILTER]
    Name         parser
    Match        docker.*
    Key_Name     log
    Parser       json_log
    Reserve_Data On
# Filter: Add container metadata
[FILTER]
    Name  modify
    Match docker.*
    Add   cluster local-dev
    Add   environment development
# Output: Send to Grafana Loki
[OUTPUT]
    Name        loki
    Match       docker.*
    Host        loki
    Port        3100
    Labels      job=docker,container=$container_name
    Remove_Keys stream,time
    Line_Format json
# Output: Also print to stdout for debugging
[OUTPUT]
    Name   stdout
    Match  docker.*
    Format json_lines
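It can help to trace what this pipeline does to a single log line. The Python sketch below is a model of the three processing stages for illustration, not Fluent Bit code: it decodes the json-file wrapper, tries to parse the `log` field as JSON while keeping the other keys (the effect of Reserve_Data On), and then adds static fields only when they are not already present (the behaviour of the modify filter's Add rule). The function name is hypothetical.

```python
import json


def process_docker_line(raw_line, static_fields):
    """Model the input parser, parser filter and modify filter on one line."""
    # Stage 1: the docker parser decodes {"log": ..., "stream": ..., "time": ...}
    record = json.loads(raw_line)

    # Stage 2: try to parse the application's own JSON out of the "log" key
    try:
        app_fields = json.loads(record["log"])
    except (KeyError, ValueError):
        app_fields = None  # not JSON: the record passes through unchanged
    if isinstance(app_fields, dict):
        # Reserve_Data On keeps the remaining keys next to the parsed fields
        reserved = {k: v for k, v in record.items() if k != "log"}
        record = {**reserved, **app_fields}

    # Stage 3: Add only sets a key that does not exist yet
    for key, value in static_fields.items():
        record.setdefault(key, value)
    return record


if __name__ == "__main__":
    app_log = json.dumps({"level": "error", "msg": "Database query failed"})
    raw = json.dumps({"log": app_log + "\n", "stream": "stderr",
                      "time": "2026-05-14T10:30:02.789Z"})
    print(process_docker_line(raw, {"cluster": "local-dev"}))
```

Plain-text lines survive the pipeline untouched apart from the added fields, so structured and unstructured services can share one collector.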
Docker Events
Docker emits real-time events for every lifecycle change — container creation, start, stop, die, OOM kill, network connect/disconnect. These events are the foundation of automated monitoring and self-healing systems:
# Stream all Docker events in real-time
docker events
# 2026-05-14T10:30:01.000000 container create abc123 (image=nginx:alpine, name=web)
# 2026-05-14T10:30:01.500000 container start abc123 (image=nginx:alpine, name=web)
# 2026-05-14T10:35:00.000000 container die abc123 (exitCode=137, image=nginx:alpine, name=web)
# Filter events by type and action
docker events --filter type=container --filter event=die
# Only shows container death events
# Filter by container name
docker events --filter container=nginx --filter container=redis
# JSON format for machine parsing
docker events --format '{{json .}}' --filter event=oom
# {"status":"oom","id":"abc123","from":"myapp:latest","Type":"container",
# "Action":"oom","Actor":{"ID":"abc123","Attributes":{"name":"app"}},
# "time":1715684400,"timeNano":1715684400123456789}
# Time-bounded query (historical events)
docker events --since "2026-05-14T09:00:00" --until "2026-05-14T11:00:00"
# Script: Auto-restart containers that die with non-zero exit
# (note: this also restarts containers stopped on purpose with docker stop or
# docker kill, which often exit with 137 or 143; prefer a restart policy)
docker events --filter event=die --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' | while read -r name code; do
  if [ "$code" != "0" ]; then
    echo "$(date): Container $name died with exit code $code, restarting"
    docker start "$name" 2>/dev/null || echo "Failed to restart $name"
  fi
done
# Monitor OOM kills specifically
docker events --filter event=oom --format '{{.Actor.Attributes.name}}' | while read -r name; do
  echo "CRITICAL: Container $name was OOM killed at $(date)"
  # Send alert to PagerDuty, Slack, etc.
done
| Event | Triggered When | Useful For |
|---|---|---|
| create | Container metadata created | Audit logging, deployment tracking |
| start | Container process begins | Service discovery, health check init |
| die | Container process exits | Alerting, auto-restart logic |
| oom | Kernel OOM kills the container | Critical alerts, capacity planning |
| health_status | Health check state changes | Load balancer drain, alerting |
| destroy | Container is removed | Cleanup, resource accounting |
| exec_start | docker exec command runs | Security audit, intrusion detection |
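When acting on die events, the exit code carries most of the diagnostic value: codes above 128 conventionally mean the process was terminated by signal number (code - 128), so 137 corresponds to SIGKILL. The Python sketch below decodes that convention from a JSON event line; the function names are illustrative, and the event shape assumed here is the one printed by `docker events --format '{{json .}}'`.

```python
import json
import signal


def describe_exit(exit_code):
    """Translate a container exit code into a short human-readable cause."""
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        number = exit_code - 128
        try:
            return f"killed by {signal.Signals(number).name}"
        except ValueError:
            # Signal number unknown on this platform
            return f"killed by signal {number}"
    return f"application error (exit code {exit_code})"


def summarize_die_event(event_line):
    """Summarize one JSON 'die' event as 'name: cause'."""
    event = json.loads(event_line)
    attributes = event["Actor"]["Attributes"]
    code = int(attributes["exitCode"])  # Docker reports it as a string
    return f"{attributes['name']}: {describe_exit(code)}"


if __name__ == "__main__":
    # Hypothetical event, trimmed to the fields this sketch reads
    sample = '{"Action": "die", "Actor": {"Attributes": {"name": "web", "exitCode": "137"}}}'
    print(summarize_die_event(sample))
```

An exit code of 137 alone does not prove an OOM kill, since docker kill produces the same code; pairing it with the oom event removes the ambiguity.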
Health Monitoring
Docker HEALTHCHECK provides application-level monitoring — not just "is the process running?" but "is the application actually serving traffic correctly?" Integrating health checks with monitoring creates a self-healing feedback loop:
# Dockerfile with comprehensive health check
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
# Health check: verify the app responds with 200 on /healthz
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1
EXPOSE 3000
CMD ["node", "server.js"]
# Monitor health status of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# NAMES STATUS
# nginx Up 2 hours (healthy)
# app Up 2 hours (unhealthy)
# redis Up 2 hours (healthy)
# Inspect health check history
docker inspect --format '{{json .State.Health}}' app | jq '.'
# {
# "Status": "unhealthy",
# "FailingStreak": 5,
# "Log": [
# {
# "Start": "2026-05-14T10:30:00Z",
# "End": "2026-05-14T10:30:05Z",
# "ExitCode": 1,
# "Output": "wget: server returned error: HTTP/1.1 503 Service Unavailable"
# }
# ]
# }
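# Alert on unhealthy containers using docker events
# (health changes arrive as the event "health_status: STATE", so filter on
# the event name and check the state inside the loop)
docker events --filter event=health_status \
  --format '{{.Action}} {{.Actor.Attributes.name}}' | while read -r event state name; do
  if [ "$state" = "unhealthy" ]; then
    echo "ALERT: Container $name is unhealthy at $(date)"
    docker logs "$name" --tail 20 # Capture recent logs for context
  fi
done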
# Docker Compose with health-dependent startup
# In docker-compose.yml:
# services:
#   app:
#     depends_on:
#       db:
#         condition: service_healthy
#   db:
#     healthcheck:
#       test: ["CMD", "pg_isready", "-U", "postgres"]
#       interval: 10s
#       timeout: 5s
#       retries: 5
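The interaction between --retries and --start-period is easier to follow as a small state machine. The Python sketch below models the documented behaviour under simplifying assumptions: a container starts in "starting", failures during the start period are ignored until the first success, and the configured number of consecutive failures flips the status to "unhealthy". The class name `HealthTracker` is hypothetical, and timing (interval, timeout) is left out.

```python
class HealthTracker:
    """Track health status from a sequence of probe results."""

    def __init__(self, retries=3):
        self.retries = retries
        self.failing_streak = 0
        self.status = "starting"

    def record(self, passed, in_start_period=False):
        """Feed one probe result and return the resulting status."""
        if passed:
            # Any success resets the streak and marks the container healthy
            self.failing_streak = 0
            self.status = "healthy"
        elif in_start_period and self.status == "starting":
            # Failures before the first success inside the start period
            # do not count toward the retry limit
            pass
        else:
            self.failing_streak += 1
            if self.failing_streak >= self.retries:
                self.status = "unhealthy"
        return self.status


if __name__ == "__main__":
    tracker = HealthTracker(retries=3)
    probes = [(False, True), (True, True), (False, False), (False, False), (False, False)]
    for passed, starting in probes:
        print(tracker.record(passed, in_start_period=starting))
```

This is why a single slow response does not trigger an alert with --retries=3: the status changes only after three failures in a row.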
The Observability Feedback Loop
In production, monitoring isn't passive — it drives automated responses:
- Detect: Health check fails → container marked unhealthy
- Alert: Docker event triggers notification to on-call engineer
- Automate: Orchestrator replaces the unhealthy container (Swarm acts on HEALTHCHECK directly; Kubernetes relies on its own liveness probes instead)
- Diagnose: Logs + metrics from the failed container preserved for post-mortem
- Prevent: Alert thresholds catch degradation before users notice
The goal: automation detects and resolves most issues before users notice, faster than a human could respond.
Conclusion & Next Steps
Container observability is not optional — it's the difference between confidently operating production systems and blindly hoping nothing breaks. The stack we built in this article provides:
- Metrics: cAdvisor exposes per-container resource usage; Prometheus stores and queries it; Grafana visualises trends and fires alerts
- Logs: Docker logging drivers capture output; structured JSON enables filtering; Fluent Bit aggregates and routes to storage
- Events: Docker events provide real-time lifecycle notifications for automation and audit
- Health: HEALTHCHECK integrates application-level monitoring with orchestrator automation
With observability in place, the next challenge is diagnosing problems when things go wrong. Metrics tell you what is broken; troubleshooting determines why and how to fix it.
Next in the Series
In Part 21: Container Troubleshooting, we'll build a systematic debugging toolkit — diagnosing crash loops, OOM kills, networking failures, and using advanced tools like nsenter, strace, and tcpdump to investigate container issues from the host.