
Part 20: Container Monitoring & Observability

May 14, 2026 Wasil Zafar 26 min read

A container without observability is a black box. You can't improve what you can't measure, and you can't debug what you can't see. This article builds a complete observability stack for containers — from raw docker stats to production-grade Prometheus + Grafana dashboards, from Docker's built-in logging drivers to structured log aggregation with Fluent Bit. By the end, every container in your system will be transparent.

Table of Contents

  1. Three Pillars of Observability
  2. Docker Stats Command
  3. Container Metrics Sources
  4. cAdvisor
  5. Prometheus & Grafana Stack
  6. Key Metrics to Monitor
  7. Docker Logging Architecture
  8. Structured Logging
  9. Log Aggregation
  10. Docker Events
  11. Health Monitoring
  12. Exercises
  13. Conclusion & Next Steps

The Three Pillars of Observability

Observability answers one question: "Why is my system behaving this way?" Not just what is happening, but why. For containerised systems, observability is built on three complementary signal types:

  • Metrics — Numeric measurements over time. CPU usage at 85%, memory at 2.1 GB, 1,247 requests/second. Metrics tell you what is happening quantitatively and enable alerting.
  • Logs — Discrete events with context. "Connection refused to database at 10:42:03", "User 12345 authentication failed". Logs tell you what happened with rich detail.
  • Traces — Request journeys across services. A single HTTP request traversing API gateway → auth service → user service → database. Traces show you where time is spent and how services interact.

Three Pillars of Container Observability

flowchart TD
    subgraph Signals["Observability Signals"]
        M["Metrics<br/>Numeric time-series"]
        L["Logs<br/>Discrete events"]
        T["Traces<br/>Request journeys"]
    end
    subgraph Collection["Collection Layer"]
        P["Prometheus / cAdvisor"]
        F["Fluent Bit / Fluentd"]
        J["Jaeger / Zipkin / OTLP"]
    end
    subgraph Storage["Storage & Query"]
        PS["Prometheus TSDB"]
        ES["Elasticsearch / Loki"]
        TS["Jaeger Backend / Tempo"]
    end
    subgraph Viz["Visualization"]
        G["Grafana Dashboards"]
        K["Kibana / Grafana Explore"]
        JU["Jaeger UI / Grafana Tempo"]
    end
    M --> P --> PS --> G
    L --> F --> ES --> K
    T --> J --> TS --> JU
    style Signals fill:#f0f9f9,stroke:#3B9797
    style Collection fill:#f8f9fa,stroke:#132440
    style Storage fill:#f8f9fa,stroke:#16476A
    style Viz fill:#fff5f5,stroke:#BF092F

Why Containers Need Special Observability

Containers introduce unique observability challenges that traditional server monitoring doesn't face:

Ephemeral: Containers start, stop, restart, and get replaced constantly. You can't SSH in and look around — by the time you connect, the problematic container is already gone and replaced by a fresh one.
Dynamic Density: A single host might run 50+ containers. Resource contention between containers is invisible without proper per-container metrics. Host-level CPU of 60% tells you nothing about which container is throttled.
Shared Kernel: All containers share the host kernel. An I/O-heavy container can starve neighbours. Memory pressure from one container triggers OOM kills in another. Per-container isolation metrics are essential.

Docker Stats Command

The simplest monitoring tool is built into Docker itself. docker stats provides a real-time stream of resource usage for every running container — no setup required:

# Real-time stats for all running containers (live updating)
docker stats
# CONTAINER ID   NAME      CPU %   MEM USAGE / LIMIT     MEM %   NET I/O         BLOCK I/O       PIDS
# a1b2c3d4e5f6   nginx     0.02%   4.5MiB / 512MiB       0.88%   1.2kB / 648B    0B / 4.1kB      3
# f6e5d4c3b2a1   redis     0.15%   7.8MiB / 256MiB       3.05%   2.4kB / 1.1kB   0B / 0B         5
# 1a2b3c4d5e6f   app       1.23%   145MiB / 1GiB         14.16%  54kB / 32kB     8.2MB / 512kB   12

# Stats for specific containers (useful in scripts)
docker stats nginx redis --no-stream
# Prints one snapshot and exits (no live updating)

# Custom format for machine-readable output
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# NAME    CPU %   MEM USAGE / LIMIT   NET I/O
# nginx   0.02%   4.5MiB / 512MiB     1.2kB / 648B
# redis   0.15%   7.8MiB / 256MiB     2.4kB / 1.1kB
# app     1.23%   145MiB / 1GiB       54kB / 32kB

# JSON format for programmatic consumption
docker stats --no-stream --format '{{json .}}' | jq '.'
# {
#   "BlockIO": "0B / 4.1kB",
#   "CPUPerc": "0.02%",
#   "Container": "a1b2c3d4e5f6",
#   "ID": "a1b2c3d4e5f6",
#   "MemPerc": "0.88%",
#   "MemUsage": "4.5MiB / 512MiB",
#   "Name": "nginx",
#   "NetIO": "1.2kB / 648B",
#   "PIDs": "3"
# }

# Script to alert on high memory usage
docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | while read name pct; do
    value=$(echo "$pct" | tr -d '%')
    if [ "$(echo "$value > 80" | bc)" -eq 1 ]; then
        echo "WARNING: $name memory at $pct"
    fi
done

docker stats is useful for quick debugging but has serious limitations for production: no history (real-time only), no alerting, no per-process breakdown, and it requires Docker socket access. For production, we need proper metrics collection.
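
The alert loop above shells out to bc for the comparison; the same check can be sketched in Python against the '{{json .}}' output. This is a minimal sketch, not a production monitor: the function name high_memory_containers and the 80% default are illustrative choices, and the input is assumed to be the one-JSON-object-per-line text that docker stats --no-stream --format '{{json .}}' prints.

```python
import json

def high_memory_containers(stats_output, threshold_pct=80.0):
    """Return (name, percent) pairs for containers above the threshold.

    stats_output: text with one JSON object per line, in the shape printed by
    docker stats --no-stream --format '{{json .}}'.
    """
    offenders = []
    for line in stats_output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        try:
            # MemPerc is a string such as "14.16%"
            pct = float(record["MemPerc"].rstrip("%"))
        except (KeyError, ValueError):
            continue  # skip records without a usable percentage
        if pct > threshold_pct:
            offenders.append((record.get("Name", "<unknown>"), pct))
    return offenders

# Sample input shaped like the JSON shown earlier (values are made up)
sample = "\n".join([
    '{"Name": "nginx", "MemPerc": "0.88%"}',
    '{"Name": "app", "MemPerc": "91.50%"}',
])
for name, pct in high_memory_containers(sample):
    print(f"WARNING: {name} memory at {pct:.2f}%")
```

On a real host the text would come from running the docker stats command through subprocess and passing its stdout to the function.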

Container Metrics Sources

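Nearly every metric Docker shows ultimately comes from the Linux kernel's cgroups pseudo-filesystem (network counters are the exception: they are read from the interfaces in the container's network namespace). Understanding this source helps you build custom monitoring and verify what higher-level tools report: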

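# Find the cgroup path for a running container
# (the path below assumes cgroup v2 with the systemd cgroup driver; with the
#  cgroupfs driver, look under /sys/fs/cgroup/docker/<id> instead)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' nginx)
CGROUP_PATH="/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope"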

# --- CPU Metrics ---
# Total CPU time consumed (in microseconds for cgroup v2)
cat ${CGROUP_PATH}/cpu.stat
# usage_usec 1234567890        # Total CPU time used
# user_usec 1000000000         # Time in user space
# system_usec 234567890        # Time in kernel space
# nr_periods 50000             # Number of enforcement periods
# nr_throttled 150             # Times the container was throttled
# throttled_usec 3000000       # Total throttled time

# Current CPU pressure (cgroup v2 PSI)
cat ${CGROUP_PATH}/cpu.pressure
# some avg10=2.50 avg60=1.80 avg300=0.95 total=45678901
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# --- Memory Metrics ---
cat ${CGROUP_PATH}/memory.current
# 47185920  (bytes = ~45 MiB current usage)

cat ${CGROUP_PATH}/memory.max
# 536870912 (bytes = 512 MiB limit)

cat ${CGROUP_PATH}/memory.stat
# anon 30000000               # Anonymous memory (heap, stack)
# file 15000000               # File-backed memory (page cache)
# slab 2000000                # Kernel slab allocations
# pgfault 125000              # Page faults
# pgmajfault 50               # Major page faults (disk reads)

# Memory pressure
cat ${CGROUP_PATH}/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# --- I/O Metrics ---
cat ${CGROUP_PATH}/io.stat
# 8:0 rbytes=1048576 wbytes=524288 rios=100 wios=50 dbytes=0 dios=0

# --- PID Count ---
cat ${CGROUP_PATH}/pids.current
# 5

cat ${CGROUP_PATH}/pids.max
# 200
cgroup v1 vs v2: On cgroup v1 systems, metrics are split across per-controller subdirectories (for example /sys/fs/cgroup/memory/docker/<id>/memory.usage_in_bytes). On cgroup v2 (unified hierarchy), everything lives under a single path. Docker uses whichever version the host boots with; it has supported cgroup v2 since Docker 20.10, and most current Linux distributions default to it. All examples above use the v2 layout.
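
For custom monitoring, the cgroup v2 files shown above fall into two simple text shapes: flat "key value" lines (cpu.stat, memory.stat) and PSI lines (cpu.pressure, memory.pressure). The sketch below parses both. The helper names parse_flat_keyed and parse_pressure are made up for this example, and the sample text simply repeats the illustrative values from the listing.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines (cpu.stat, memory.stat) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def parse_pressure(text):
    """Parse PSI lines (cpu.pressure, memory.pressure) into nested dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]  # kind is "some" or "full"
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result

cpu_stat = """usage_usec 1234567890
user_usec 1000000000
system_usec 234567890
nr_periods 50000
nr_throttled 150
throttled_usec 3000000"""

cpu_pressure = """some avg10=2.50 avg60=1.80 avg300=0.95 total=45678901
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"""

stats = parse_flat_keyed(cpu_stat)
psi = parse_pressure(cpu_pressure)
print(stats["usage_usec"] / 1_000_000, "CPU-seconds used")
print("10s CPU pressure (some):", psi["some"]["avg10"])
```

To read a live container, pass the contents of the files under CGROUP_PATH (for example open(path).read()) instead of the sample strings.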

cAdvisor

cAdvisor (Container Advisor) is Google's open-source container metrics collector. It runs as a daemon, automatically discovers all containers on the host, and exposes their resource usage via a web UI and a Prometheus-compatible metrics endpoint. It's the standard "metrics agent" for container environments.

# Run cAdvisor as a container (the standard deployment method)
docker run -d \
    --name=cadvisor \
    --restart=always \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=8080:8080 \
    --privileged \
    --device=/dev/kmsg \
    gcr.io/cadvisor/cadvisor:v0.49.1

# Verify cAdvisor is running
curl -s http://localhost:8080/healthz
# ok

# Access the web UI at http://localhost:8080
# Shows: container list, per-container CPU/memory/network/filesystem graphs

# Access Prometheus metrics endpoint
curl -s http://localhost:8080/metrics | head -20
# # HELP container_cpu_usage_seconds_total Cumulative cpu time consumed
# # TYPE container_cpu_usage_seconds_total counter
# container_cpu_usage_seconds_total{container_label_com_docker_compose_service="nginx",id="/docker/a1b2c3..."} 12.345
# container_cpu_usage_seconds_total{container_label_com_docker_compose_service="redis",id="/docker/f6e5d4..."} 5.678

# Key metrics exposed by cAdvisor (metric names only, labels and values stripped):
curl -s http://localhost:8080/metrics | grep -oE "^container_(cpu|memory|network|fs)[a-zA-Z0-9_]*" | sort -u | head -20
# container_cpu_cfs_periods_total
# container_cpu_cfs_throttled_periods_total
# container_cpu_cfs_throttled_seconds_total
# container_cpu_usage_seconds_total
# container_fs_reads_bytes_total
# container_fs_writes_bytes_total
# container_memory_cache
# container_memory_rss
# container_memory_usage_bytes
# container_memory_working_set_bytes
# container_network_receive_bytes_total
# container_network_transmit_bytes_total
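memory_working_set_bytes vs memory_usage_bytes: memory_usage_bytes includes all memory charged to the container, including page cache that the kernel can reclaim. memory_working_set_bytes subtracts the inactive file cache, so it approximates the memory that cannot be reclaimed without impacting the container, and it is the one you should alert on. It is also the figure the Kubernetes kubelet uses for memory-pressure eviction decisions; the kernel OOM killer itself acts when the cgroup cannot reclaim enough memory to stay under its limit.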
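
The relationship between the two metrics can be checked by hand from the cgroup files. The following is a minimal sketch of the arithmetic, assuming the working set is total usage minus inactive file-backed pages (floored at zero) and using the cgroup v2 names memory.current and inactive_file. The function name and the numbers are illustrative, not measurements.

```python
def working_set_bytes(memory_current, memory_stat_text):
    """Approximate working set: usage minus inactive file cache, floored at 0."""
    inactive_file = 0
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "inactive_file":
            inactive_file = int(value)
            break
    return max(memory_current - inactive_file, 0)

# Illustrative numbers, not taken from a real container
memory_current = 47_185_920   # bytes, as read from memory.current
memory_stat = "anon 30000000\nfile 15000000\ninactive_file 12000000\nactive_file 3000000"
limit = 536_870_912           # bytes, as read from memory.max

ws = working_set_bytes(memory_current, memory_stat)
print(f"usage:       {memory_current} bytes ({memory_current / limit:.1%} of limit)")
print(f"working set: {ws} bytes ({ws / limit:.1%} of limit)")
```

The gap between the two lines is reclaimable cache, which is why alerting on raw usage tends to fire for containers that are merely doing a lot of file I/O.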

Prometheus & Grafana Stack

The industry-standard monitoring stack for containers is Prometheus (metrics collection and storage) paired with Grafana (visualization and alerting). This combination gives you historical data, powerful queries (PromQL), and beautiful dashboards:

# docker-compose.monitoring.yml
# Complete monitoring stack: Prometheus + Grafana + cAdvisor + Node Exporter

version: "3.8"

services:
  # Prometheus - Metrics collection and storage
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    networks:
      - monitoring

  # Grafana - Visualization and dashboards
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      # Default credentials for local testing only; change before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    networks:
      - monitoring

  # cAdvisor - Container metrics collector
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  # Node Exporter - Host-level metrics
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
    ports:
      - "9100:9100"
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge
# prometheus/prometheus.yml
# Prometheus configuration for container monitoring

global:
  scrape_interval: 15s          # Scrape targets every 15 seconds
  evaluation_interval: 15s      # Evaluate alert rules every 15 seconds

rule_files:
  - "alert-rules.yml"

scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # cAdvisor - container metrics
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
    metric_relabel_configs:
      # Drop high-cardinality metrics to save storage
      - source_labels: [__name__]
        regex: "container_tasks_state"
        action: drop

  # Node Exporter - host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

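  # Docker daemon metrics (requires "metrics-addr" in daemon.json; on Linux,
  # host.docker.internal resolves only if the prometheus service sets
  # extra_hosts: ["host.docker.internal:host-gateway"])
  - job_name: "docker-daemon"
    static_configs:
      - targets: ["host.docker.internal:9323"]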
# prometheus/alert-rules.yml
# Alert rules for container monitoring

groups:
  - name: container-alerts
    rules:
      # Container using more than 90% of memory limit
      - alert: ContainerHighMemory
        expr: |
          (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
          and container_spec_memory_limit_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory above 90%"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # Container CPU throttled more than 25% of periods
      - alert: ContainerCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
          / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} CPU throttled"

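      # Container restarting frequently
      # (cAdvisor does not export a restart counter, so this counts changes in
      #  the container start timestamp as a proxy for restarts)
      - alert: ContainerRestartLoop
        expr: |
          changes(container_start_time_seconds{name=~".+"}[1h]) > 5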
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarting frequently"
Useful PromQL Queries
Essential Container PromQL

CPU usage rate (cores): rate(container_cpu_usage_seconds_total{name=~".+"}[5m])

Memory usage %: container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100

Network receive rate: rate(container_network_receive_bytes_total[5m])

Top 5 CPU consumers: topk(5, rate(container_cpu_usage_seconds_total{name=~".+"}[5m]))

Throttled containers: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0
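
These queries can also be run outside Grafana through Prometheus's HTTP API (an instant query is a GET on /api/v1/query). The sketch below is one way to do that with only the standard library. The helper names are invented for this example, the base URL assumes the port mapping from the compose file above, and the demonstration at the bottom parses a hand-written response so that it runs without a live server.

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def parse_instant_vector(payload):
    """Turn an instant-vector response into {container name: float value}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        name = series["metric"].get("name", "<unnamed>")
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        values[name] = float(raw_value)
    return values

def query_prometheus(base_url, promql, timeout=5):
    """Run an instant query against a live Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))

# Offline demonstration with a hand-written response in the API's shape
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"name": "app"}, "value": [1715684400.0, "0.42"]},
        {"metric": {"name": "nginx"}, "value": [1715684400.0, "0.01"]},
    ]},
}
ranked = sorted(parse_instant_vector(sample).items(), key=lambda kv: -kv[1])
for name, cores in ranked:
    print(f"{name}: {cores:.2f} cores")
```

Against the stack above, query_prometheus("http://localhost:9090", 'rate(container_cpu_usage_seconds_total{name=~".+"}[5m])') would return the per-container CPU rate in cores.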


Key Metrics to Monitor

Not all metrics are equally important. Focus on these signals that indicate actual problems before they become outages:

| Metric | What It Measures | Warning Threshold | Critical Threshold | Action |
| --- | --- | --- | --- | --- |
| CPU Throttling % | % of periods where CPU was capped | > 10% | > 25% | Increase CPU limits or optimize code |
| Memory Working Set | Non-reclaimable memory usage | > 80% of limit | > 90% of limit | Increase limit or fix memory leaks |
| Restart Count | Container restarts in time window | > 2/hour | > 5/hour | Check logs for crash reason |
| Network Errors | TX/RX errors and drops | > 0.1% | > 1% | Check network configuration, MTU |
| Disk I/O Wait | Time spent waiting for I/O | > 20ms avg | > 100ms avg | Move to faster storage, optimize queries |
| PID Count | Number of processes in container | > 80% of pids.max | > 95% of pids.max | Fix fork bombs or increase limit |
| Health Check Failures | Consecutive failed health probes | > 1 failure | > 3 consecutive | Check application health endpoint |
| Image Pull Time | Time to pull container image | > 30s | > 120s | Use smaller images, registry mirrors |

The CPU Throttling Trap: A container showing 50% CPU usage can still be throttled. CPU limits are enforced per 100ms period (CFS period). If your container uses its entire CPU quota in the first 50ms, it's throttled for the remaining 50ms — even though the average looks fine. Always monitor nr_throttled alongside CPU percentage.
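
The arithmetic behind the trap is easy to reproduce. Below is a small worked sketch: the first part computes how quickly a quota is burned inside one CFS period, and the second derives the throttled fraction from two cpu.stat samples. The quota, thread count, and counter values are hypothetical, and throttle_ratio is a name chosen for this example.

```python
def throttle_ratio(before, after):
    """Fraction of CFS periods throttled between two cpu.stat samples."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0

# Part 1: a quota of 50 ms per 100 ms period (a 0.5 CPU limit)
period_ms, quota_ms = 100, 50
busy_threads = 2                         # two threads running flat out
burn_time_ms = quota_ms / busy_threads   # wall-clock time until the quota is spent
stalled_ms = period_ms - burn_time_ms
print(f"quota exhausted after {burn_time_ms:.0f} ms, "
      f"stalled for {stalled_ms:.0f} ms of each period")

# Part 2: two hypothetical cpu.stat samples taken one minute apart
before = {"nr_periods": 50_000, "nr_throttled": 150}
after = {"nr_periods": 50_600, "nr_throttled": 330}
ratio = throttle_ratio(before, after)
print(f"throttled in {ratio:.0%} of periods")
```

Averaged over the minute this container still reports roughly 50% of one core, yet it spent most of every period stalled, which is exactly the case that a CPU percentage graph hides.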

Docker Logging Architecture

Docker captures all stdout and stderr output from container processes and routes it through a configurable logging driver. The driver determines where logs are stored and in what format:

Docker Logging Pipeline

flowchart LR
    A["Container Process<br/>(stdout/stderr)"] --> B["Docker Daemon<br/>(log router)"]
    B --> C["json-file<br/>(default)"]
    B --> D["syslog"]
    B --> E["fluentd"]
    B --> F["awslogs"]
    B --> G["gcplogs"]
    B --> H["journald"]
    C --> C1["/var/lib/docker/containers/ID/ID-json.log"]
    D --> D1["syslog daemon"]
    E --> E1["Fluentd collector"]
    F --> F1["CloudWatch Logs"]
    G --> G1["Cloud Logging"]
    H --> H1["systemd journal"]
    style A fill:#f0f9f9,stroke:#3B9797
    style B fill:#f8f9fa,stroke:#132440

| Driver | Storage | docker logs | Best For | Notes |
| --- | --- | --- | --- | --- |
| json-file | Local JSON files | Yes | Development, single-host | Default. Configure max-size and max-file for rotation. |
| local | Custom binary format | Yes | Better performance than json-file | Compressed, faster writes. Docker 18.09+. |
| journald | systemd journal | Yes | systemd-based Linux hosts | Integrates with journalctl. |
| syslog | Remote syslog server | No | Enterprise syslog infrastructure | Supports TLS, TCP/UDP. |
| fluentd | Fluentd daemon | No | Flexible log routing/filtering | Buffered, async delivery. |
| awslogs | CloudWatch Logs | No | AWS environments | Direct to CloudWatch, no agent needed. |
| gcplogs | Google Cloud Logging | No | GCP environments | Direct to Cloud Logging. |
| splunk | Splunk HEC | No | Enterprise Splunk deployments | HTTP Event Collector integration. |
| none | Discarded | No | Performance-critical, no logs needed | Container output is thrown away entirely. |

The docker logs column reflects native driver support. Docker Engine 20.10 and later also keeps a local cache ("dual logging"), so docker logs can still read output from containers that use the remote drivers.

# Check current logging driver for a container
docker inspect --format '{{.HostConfig.LogConfig.Type}}' nginx
# json-file

# Run a container with a specific logging driver
docker run -d --name app \
    --log-driver=json-file \
    --log-opt max-size=10m \
    --log-opt max-file=5 \
    --log-opt compress=true \
    nginx:alpine

# Set daemon-wide default in /etc/docker/daemon.json
cat /etc/docker/daemon.json
# {
#   "log-driver": "json-file",
#   "log-opts": {
#     "max-size": "20m",
#     "max-file": "5",
#     "compress": "true"
#   }
# }

# View container logs (read natively from json-file, local, and journald;
# other drivers rely on the dual-logging cache in Docker 20.10+)
docker logs nginx --tail 50 --follow --timestamps
# 2026-05-14T10:30:01.123Z 172.17.0.1 - - [14/May/2026:10:30:01 +0000] "GET / HTTP/1.1" 200 615

# View raw log file on host
cat /var/lib/docker/containers/CONTAINER_ID/CONTAINER_ID-json.log | jq '.'
# {
#   "log": "172.17.0.1 - - [14/May/2026:10:30:01 +0000] \"GET / HTTP/1.1\" 200 615\n",
#   "stream": "stdout",
#   "time": "2026-05-14T10:30:01.123456789Z"
# }
Log Disk Explosion: Without max-size and max-file options, the json-file driver will write logs indefinitely until the disk is full. This is the #1 cause of "mystery disk full" incidents on Docker hosts. Always configure log rotation — even in development.
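
A quick way to sanity-check rotation settings is to compute the worst-case disk footprint they allow. The sketch below does that arithmetic. It assumes the k/m/g suffixes are binary multiples and ignores the savings from compress=true; the helper names and the container count of 50 are illustrative.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def parse_size(size):
    """Convert a Docker-style size such as '10m' into bytes."""
    size = size.strip().lower()
    if size and size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)

def worst_case_log_bytes(max_size, max_file, containers):
    """Upper bound on uncompressed json-file log usage across containers."""
    return parse_size(max_size) * max_file * containers

# The daemon.json values shown above, applied to a host with 50 containers
budget = worst_case_log_bytes("20m", 5, containers=50)
print(f"{budget / 1024 ** 3:.2f} GiB worst case")
```

If the result is larger than the free space on the volume holding /var/lib/docker, lower max-size or max-file before the disk does it for you.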

Structured Logging Best Practices

Unstructured log lines ("Error: something went wrong") are nearly useless at scale. Structured logging outputs machine-parseable records (typically JSON) that can be indexed, filtered, and correlated automatically:

// BAD: Unstructured log line
"ERROR 2026-05-14 10:30:01 Connection to database failed after 3 retries"

// GOOD: Structured JSON log
{
    "timestamp": "2026-05-14T10:30:01.456Z",
    "level": "error",
    "service": "user-api",
    "message": "Database connection failed",
    "error": "connection refused",
    "host": "db-primary.internal",
    "port": 5432,
    "retries": 3,
    "retry_interval_ms": 1000,
    "correlation_id": "req-a1b2c3d4-e5f6-7890",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "container_id": "f6e5d4c3b2a1"
}
// Node.js application with structured logging (pino)
const pino = require('pino');

const logger = pino({
    level: process.env.LOG_LEVEL || 'info',
    formatters: {
        level: (label) => ({ level: label }),
    },
    base: {
        service: 'user-api',
        version: process.env.APP_VERSION || 'unknown',
        environment: process.env.NODE_ENV || 'development',
    },
    timestamp: pino.stdTimeFunctions.isoTime,
});

// Usage: outputs JSON to stdout (Docker captures it); every line carries the base fields
logger.info({ userId: 12345, action: 'login' }, 'User authenticated');
// {"level":"info","time":"2026-05-14T10:30:01.456Z","service":"user-api","version":"unknown","environment":"development","userId":12345,"action":"login","msg":"User authenticated"}

// `error` and `req` stand for the caught exception and the current request object
logger.error({ err: error, requestId: req.id }, 'Database query failed');
// {"level":"error","time":"2026-05-14T10:30:02.789Z","service":"user-api","version":"unknown","environment":"development","err":{"type":"Error","message":"timeout","stack":"..."},"requestId":"abc-123","msg":"Database query failed"}
Correlation IDs: Every incoming request should receive a unique correlation ID (passed via X-Request-ID header). This ID propagates through all downstream service calls, appearing in every log entry. When debugging a failure, you filter by correlation ID and see the complete request journey across all services — even across 20+ containers.
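
The correlation ID pattern can be sketched with Python's standard logging module. This is one possible arrangement, not the only one: the JsonFormatter class, the handle_request function, and the hard-coded service name are hypothetical, and the log output goes to an in-memory stream here only so the result can be inspected (a real service would log to stdout).

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled
correlation_id = contextvars.ContextVar("correlation_id", default=None)

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": "user-api",  # hypothetical service name
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        if record.exc_info:
            entry["error"] = self.formatException(record.exc_info)
        return json.dumps(entry)

def handle_request(headers, logger):
    """Reuse the caller's X-Request-ID or mint a new one, then log with it."""
    request_id = headers.get("X-Request-ID") or f"req-{uuid.uuid4()}"
    token = correlation_id.set(request_id)
    try:
        logger.info("User authenticated")
        return request_id
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request

stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-api")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

handle_request({"X-Request-ID": "req-a1b2c3d4"}, logger)
print(stream.getvalue(), end="")
```

Because the ID lives in a context variable, every log call made while the request is being handled picks it up without the ID being passed through each function signature.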

Log Aggregation with Fluent Bit

Fluent Bit is a lightweight log processor that collects container logs, parses them, and forwards them to storage backends. It is a sibling project to Fluentd under the same CNCF umbrella rather than a replacement for it: written in C, with a much smaller memory footprint, and designed for container environments:

# docker-compose.logging.yml
# Log aggregation stack: Fluent Bit + Loki + Grafana

version: "3.8"

services:
  # Fluent Bit - Log collector and forwarder
  fluent-bit:
    image: fluent/fluent-bit:3.0
    container_name: fluent-bit
    restart: unless-stopped
    volumes:
      - ./fluent-bit/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro
      - ./fluent-bit/parsers.conf:/fluent-bit/etc/parsers.conf:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log    # writable: Fluent Bit keeps its tail offset DB (flb_docker.db) here
    depends_on:
      - loki
    networks:
      - logging

  # Grafana Loki - Log storage and indexing
  loki:
    image: grafana/loki:2.9.6
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - logging

  # Grafana - Log visualization (query via LogQL)
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana-logs
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - loki
    networks:
      - logging

volumes:
  loki-data:

networks:
  logging:
    driver: bridge
# fluent-bit/fluent-bit.conf
# Fluent Bit configuration for Docker container logs

[SERVICE]
    Flush         5
    Daemon        Off
    Log_Level     info
    # parsers.conf must define the docker and json_log parsers referenced below
    Parsers_File  parsers.conf

# Input: Read Docker container JSON log files
[INPUT]
    Name              tail
    Path              /var/lib/docker/containers/*/*.log
    Parser            docker
    Tag               docker.*
    Refresh_Interval  10
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    DB                /var/log/flb_docker.db

# Filter: Parse JSON log content from applications
[FILTER]
    Name              parser
    Match             docker.*
    Key_Name          log
    Parser            json_log
    Reserve_Data      On

# Filter: Add container metadata
[FILTER]
    Name              modify
    Match             docker.*
    Add               cluster local-dev
    Add               environment development

# Output: Send to Grafana Loki
[OUTPUT]
    Name              loki
    Match             docker.*
    Host              loki
    Port              3100
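    # $container_name resolves only if the record carries a container_name key;
    # the tail input alone does not add one, so enrich records first or drop the label
    Labels            job=docker,container=$container_name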
    Remove_Keys       stream,time
    Line_Format       json

# Output: Also print to stdout for debugging
[OUTPUT]
    Name              stdout
    Match             docker.*
    Format            json_lines
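
To make the pipeline above less abstract, here is a rough Python analogue of what happens to a single record: the tail input decodes the json-file line, the parser filter expands the nested JSON in the log key while keeping the other fields (Reserve_Data On), and the modify filter adds static keys. This is a sketch of the idea, not Fluent Bit's implementation; the function names are invented, and the merge order between parsed and reserved keys is an assumption.

```python
import json

def parse_docker_line(raw_line):
    """Decode one line of a json-file log into its log/stream/time fields."""
    return json.loads(raw_line)

def expand_json_log(record, key="log"):
    """If record[key] holds a JSON object, replace that key with the parsed
    fields and keep the remaining keys of the record."""
    try:
        parsed = json.loads(record[key])
    except (KeyError, TypeError, ValueError):
        return record  # not JSON: pass the record through unchanged
    if not isinstance(parsed, dict):
        return record
    rest = {k: v for k, v in record.items() if k != key}
    return {**rest, **parsed}

# A line as the json-file driver would write it for a structured application log
raw = json.dumps({
    "log": json.dumps({"level": "error", "msg": "Database query failed"}) + "\n",
    "stream": "stdout",
    "time": "2026-05-14T10:30:02.789Z",
})

record = expand_json_log(parse_docker_line(raw))
record.update({"cluster": "local-dev", "environment": "development"})  # modify filter
print(json.dumps(record, indent=2))

# Unstructured output is left alone
plain = expand_json_log({"log": "plain text\n", "stream": "stderr"})
```

Once the application's own fields sit at the top level of the record, a backend such as Loki can filter on them without re-parsing every line at query time.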

Docker Events

Docker emits real-time events for every lifecycle change — container creation, start, stop, die, OOM kill, network connect/disconnect. These events are the foundation of automated monitoring and self-healing systems:

# Stream all Docker events in real-time
docker events
# 2026-05-14T10:30:01.000000 container create abc123 (image=nginx:alpine, name=web)
# 2026-05-14T10:30:01.500000 container start abc123 (image=nginx:alpine, name=web)
# 2026-05-14T10:35:00.000000 container die abc123 (exitCode=137, image=nginx:alpine, name=web)

# Filter events by type and action
docker events --filter type=container --filter event=die
# Only shows container death events

# Filter by container name
docker events --filter container=nginx --filter container=redis

# JSON format for machine parsing
docker events --format '{{json .}}' --filter event=oom
# {"status":"oom","id":"abc123","from":"myapp:latest","Type":"container",
#  "Action":"oom","Actor":{"ID":"abc123","Attributes":{"name":"app"}},
#  "time":1715684400,"timeNano":1715684400123456789}

# Time-bounded query (historical events)
docker events --since "2026-05-14T09:00:00" --until "2026-05-14T11:00:00"

# Script: Auto-restart containers that die with non-zero exit
docker events --filter event=die --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' | while read name code; do
    if [ "$code" != "0" ]; then
        echo "$(date): Container $name died with exit code $code — restarting"
        docker start "$name" 2>/dev/null || echo "Failed to restart $name"
    fi
done

# Monitor OOM kills specifically
docker events --filter event=oom --format '{{.Actor.Attributes.name}}' | while read name; do
    echo "CRITICAL: Container $name was OOM killed at $(date)"
    # Send alert to PagerDuty, Slack, etc.
done

| Event | Triggered When | Useful For |
| --- | --- | --- |
| create | Container metadata created | Audit logging, deployment tracking |
| start | Container process begins | Service discovery, health check init |
| die | Container process exits | Alerting, auto-restart logic |
| oom | Kernel OOM kills the container | Critical alerts, capacity planning |
| health_status | Health check state changes | Load balancer drain, alerting |
| destroy | Container is removed | Cleanup, resource accounting |
| exec_start | docker exec command runs | Security audit, intrusion detection |
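
The JSON event stream is straightforward to consume from a program. The sketch below classifies die and oom events and interprets exit codes using the common "128 + signal number" convention (137 is 128 + 9, SIGKILL). The function names and sample events are hypothetical; the field names follow the JSON shown earlier.

```python
import json

def describe_exit(exit_code):
    """Interpret an exit code using the common 128 + signal convention."""
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        return f"killed by signal {exit_code - 128}"
    return f"application error (exit code {exit_code})"

def summarize_event(line):
    """Summarize one JSON event line; returns None for events we ignore."""
    event = json.loads(line)
    if event.get("Type") != "container":
        return None
    attrs = event.get("Actor", {}).get("Attributes", {})
    name = attrs.get("name", "<unknown>")
    action = event.get("Action", "")
    if action == "oom":
        return f"CRITICAL: {name} hit its memory limit (OOM)"
    if action == "die":
        code = int(attrs.get("exitCode", "0"))  # attribute values are strings
        return f"{name} died: {describe_exit(code)}"
    return None

sample_events = [
    '{"Type":"container","Action":"die","Actor":{"ID":"abc123","Attributes":{"name":"web","exitCode":"137"}}}',
    '{"Type":"container","Action":"oom","Actor":{"ID":"abc123","Attributes":{"name":"app"}}}',
    '{"Type":"network","Action":"connect","Actor":{"ID":"n1","Attributes":{}}}',
]
for line in sample_events:
    summary = summarize_event(line)
    if summary:
        print(summary)
```

In a real monitor the lines would be read from a long-running docker events --format '{{json .}}' process (for example via subprocess.Popen) rather than from a list.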

Health Monitoring

Docker HEALTHCHECK provides application-level monitoring — not just "is the process running?" but "is the application actually serving traffic correctly?" Integrating health checks with monitoring creates a self-healing feedback loop:

# Dockerfile with comprehensive health check
FROM node:20-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Health check: verify the app responds with 200 on /healthz
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1

EXPOSE 3000
CMD ["node", "server.js"]
# Monitor health status of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# NAMES     STATUS
# nginx     Up 2 hours (healthy)
# app       Up 2 hours (unhealthy)
# redis     Up 2 hours (healthy)

# Inspect health check history
docker inspect --format '{{json .State.Health}}' app | jq '.'
# {
#   "Status": "unhealthy",
#   "FailingStreak": 5,
#   "Log": [
#     {
#       "Start": "2026-05-14T10:30:00Z",
#       "End": "2026-05-14T10:30:05Z",
#       "ExitCode": 1,
#       "Output": "wget: server returned error: HTTP/1.1 503 Service Unavailable"
#     }
#   ]
# }

# Alert on unhealthy containers using docker events
# (health_status is an event name, so it goes through the event filter)
docker events --filter event=health_status \
    --format '{{.Actor.Attributes.name}} {{.Action}}' | while read -r name action; do
    case "$action" in
        *unhealthy*)
            echo "ALERT: Container $name is unhealthy at $(date)"
            docker logs "$name" --tail 20  # Capture recent logs for context
            ;;
    esac
done

# Docker Compose with health-dependent startup
# In docker-compose.yml:
# services:
#   app:
#     depends_on:
#       db:
#         condition: service_healthy
#   db:
#     healthcheck:
#       test: ["CMD", "pg_isready", "-U", "postgres"]
#       interval: 10s
#       timeout: 5s
#       retries: 5
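
How interval, retries, and the start period combine into a health status is easier to see as a small state machine. The following is a simplified model rather than Docker's actual implementation: it assumes the container becomes unhealthy after `retries` consecutive failures, and that failures during the start period are ignored until a probe has passed. The function and parameter names are made up for this sketch.

```python
def health_states(results, retries=3, start_period_probes=0):
    """Yield the health status after each probe result (True = probe passed).

    Simplified model: the container turns unhealthy after `retries`
    consecutive failures, and failures among the first
    `start_period_probes` probes are not counted until a probe has passed.
    """
    status, streak, started = "starting", 0, False
    for index, passed in enumerate(results):
        if passed:
            status, streak, started = "healthy", 0, True
        elif started or index >= start_period_probes:
            streak += 1
            if streak >= retries:
                status = "unhealthy"
        yield status

# One grace-period failure, a success, then three consecutive failures
probes = [False, True, False, False, False]
timeline = list(health_states(probes, retries=3, start_period_probes=1))
print(timeline)
```

The model shows why a single failed probe never flips the status on its own: only an unbroken streak of `retries` failures does, and any success resets the count.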
Production Pattern
The Observability Feedback Loop

In production, monitoring isn't passive — it drives automated responses:

  1. Detect: Health check fails → container marked unhealthy
  2. Alert: Docker event triggers notification to on-call engineer
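  3. Automate: The orchestrator replaces the unhealthy container (Swarm acts on HEALTHCHECK status directly; Kubernetes relies on its own liveness and readiness probes; standalone Docker only reports the status)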
  4. Diagnose: Logs + metrics from the failed container preserved for post-mortem
  5. Prevent: Alert thresholds catch degradation before users notice

The goal: users never experience outages because automation detects and resolves issues faster than humans can respond.


Exercises

Exercise 1: Deploy the complete Prometheus + Grafana + cAdvisor stack using the docker-compose.monitoring.yml above. Run 3-4 application containers alongside it. Create a Grafana dashboard showing CPU usage, memory usage, and network I/O per container. Set up an alert that fires when any container exceeds 80% memory.
Exercise 2: Write a shell script that reads cgroup files directly (without Docker CLI) and produces a CSV with columns: container_id, cpu_usage_microseconds, memory_bytes, pids_count. Run it every 5 seconds and compare results with docker stats.
Exercise 3: Configure a Node.js or Python application to output structured JSON logs. Deploy it with the Fluent Bit + Loki stack. Query Loki via Grafana to find all error-level logs for a specific correlation ID.
Exercise 4: Set up a Docker events monitor that watches for OOM kills and container deaths, then posts alerts to a Slack webhook or writes to a file with full context (container name, exit code, last 20 log lines).

Conclusion & Next Steps

Container observability is not optional — it's the difference between confidently operating production systems and blindly hoping nothing breaks. The stack we built in this article provides:

  • Metrics: cAdvisor exposes per-container resource usage; Prometheus stores and queries it; Grafana visualises trends and fires alerts
  • Logs: Docker logging drivers capture output; structured JSON enables filtering; Fluent Bit aggregates and routes to storage
  • Events: Docker events provide real-time lifecycle notifications for automation and audit
  • Health: HEALTHCHECK integrates application-level monitoring with orchestrator automation

With observability in place, the next challenge is diagnosing problems when things go wrong. Metrics tell you what is broken; troubleshooting determines why and how to fix it.

Next in the Series

In Part 21: Container Troubleshooting, we'll build a systematic debugging toolkit — diagnosing crash loops, OOM kills, networking failures, and using advanced tools like nsenter, strace, and tcpdump to investigate container issues from the host.