Part 20: Container Monitoring & Observability

The Three Pillars of Observability

Observability answers one question: "Why is my system behaving this way?" Not just what is happening, but why. For containerised systems, observability is built on three complementary signal types:

Metrics — Numeric measurements over time. CPU usage at 85%, memory at 2.1 GB, 1,247 requests/second. Metrics tell you what is happening quantitatively and enable alerting.
Logs — Discrete events with context. "Connection refused to database at 10:42:03", "User 12345 authentication failed". Logs tell you what happened with rich detail.
Traces — Request journeys across services. A single HTTP request traversing API gateway → auth service → user service → database. Traces show you where time is spent and how services interact.

Three Pillars of Container Observability

flowchart TD
    subgraph Signals["Observability Signals"]
        M["Metrics<br/>Numeric time-series"]
        L["Logs<br/>Discrete events"]
        T["Traces<br/>Request journeys"]
    end
    subgraph Collection["Collection Layer"]
        P["Prometheus / cAdvisor"]
        F["Fluent Bit / Fluentd"]
        J["Jaeger / Zipkin / OTLP"]
    end
    subgraph Storage["Storage & Query"]
        PS["Prometheus TSDB"]
        ES["Elasticsearch / Loki"]
        TS["Jaeger Backend / Tempo"]
    end
    subgraph Viz["Visualization"]
        G["Grafana Dashboards"]
        K["Kibana / Grafana Explore"]
        JU["Jaeger UI / Grafana Tempo"]
    end
    M --> P --> PS --> G
    L --> F --> ES --> K
    T --> J --> TS --> JU

    style Signals fill:#f0f9f9,stroke:#3B9797
    style Collection fill:#f8f9fa,stroke:#132440
    style Storage fill:#f8f9fa,stroke:#16476A
    style Viz fill:#fff5f5,stroke:#BF092F

Why Containers Need Special Observability

Containers introduce unique observability challenges that traditional server monitoring doesn't face:

Ephemeral: Containers start, stop, restart, and get replaced constantly. You often can't SSH in and look around: by the time you connect, the problematic container may already be gone, replaced by a fresh one.

Dynamic Density: A single host might run 50+ containers. Resource contention between containers is invisible without proper per-container metrics. Host-level CPU of 60% tells you nothing about which container is throttled.

Shared Kernel: All containers share the host kernel. An I/O-heavy container can starve neighbours. Without memory limits, pressure from one container can trigger OOM kills in another. Per-container isolation metrics are essential.

Docker Stats Command

The simplest monitoring tool is built into Docker itself. docker stats provides a real-time stream of resource usage for every running container — no setup required:

# Real-time stats for all running containers (live updating)
docker stats
# CONTAINER ID   NAME      CPU %   MEM USAGE / LIMIT     MEM %   NET I/O         BLOCK I/O       PIDS
# a1b2c3d4e5f6   nginx     0.02%   4.5MiB / 512MiB       0.88%   1.2kB / 648B    0B / 4.1kB      3
# f6e5d4c3b2a1   redis     0.15%   7.8MiB / 256MiB       3.05%   2.4kB / 1.1kB   0B / 0B         5
# 1a2b3c4d5e6f   app       1.23%   145MiB / 1GiB         14.16%  54kB / 32kB     8.2MB / 512kB   12

# Stats for specific containers (useful in scripts)
docker stats nginx redis --no-stream
# Prints one snapshot and exits (no live updating)

# Custom table format showing only selected columns
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
# NAME    CPU %   MEM USAGE / LIMIT   NET I/O
# nginx   0.02%   4.5MiB / 512MiB     1.2kB / 648B
# redis   0.15%   7.8MiB / 256MiB     2.4kB / 1.1kB
# app     1.23%   145MiB / 1GiB       54kB / 32kB

# JSON format for programmatic consumption
docker stats --no-stream --format '{{json .}}' | jq '.'
# {
#   "BlockIO": "0B / 4.1kB",
#   "CPUPerc": "0.02%",
#   "Container": "a1b2c3d4e5f6",
#   "ID": "a1b2c3d4e5f6",
#   "MemPerc": "0.88%",
#   "MemUsage": "4.5MiB / 512MiB",
#   "Name": "nginx",
#   "NetIO": "1.2kB / 648B",
#   "PIDs": "3"
# }

# Script to alert on high memory usage (requires bc for the float comparison)
docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | while read -r name pct; do
    value=$(echo "$pct" | tr -d '%')   # "14.16%" -> "14.16"
    if [ "$(echo "$value > 80" | bc)" -eq 1 ]; then
        echo "WARNING: $name memory at $pct"
    fi
done

docker stats is useful for quick debugging but has serious limitations for production: no history (real-time only), no alerting, no per-process breakdown, and it requires Docker socket access. For production, we need proper metrics collection.
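As a bridge between the one-off shell script and a real metrics pipeline, the JSON output shown earlier can be consumed from a small program. The following is only a sketch: the function names and the SAMPLE records are illustrative, and it assumes every record carries a numeric MemPerc value (records without one would need extra handling).

```python
import json


def parse_percent(text: str) -> float:
    """Convert a docker stats percentage string such as '14.16%' to a float."""
    return float(text.strip().rstrip("%"))


def high_memory_containers(stats_lines, threshold=80.0):
    """Return (name, percent) pairs for containers above the threshold.

    stats_lines is an iterable of JSON strings, one per container, in the
    shape printed by: docker stats --no-stream --format '{{json .}}'
    """
    flagged = []
    for line in stats_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        percent = parse_percent(record["MemPerc"])
        if percent > threshold:
            flagged.append((record["Name"], percent))
    return flagged


# Illustrative records (hypothetical values, same field names as the output above)
SAMPLE = [
    '{"Name": "nginx", "MemPerc": "0.88%", "CPUPerc": "0.02%"}',
    '{"Name": "app", "MemPerc": "91.50%", "CPUPerc": "1.23%"}',
]
```

In practice the lines would come from the docker stats command's standard output rather than a hard-coded list; keeping the parsing separate from the command invocation makes the logic easy to test.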

Container Metrics Sources

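Most of the metrics Docker shows come from the Linux kernel's cgroups pseudo-filesystem: CPU, memory, block I/O, and PID counts. (Network counters are read from the container's network interfaces instead.) Understanding this source helps you build custom monitoring and verify what higher-level tools report: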

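# Find the cgroup path for a running container
# (the path below assumes cgroup v2 with the systemd cgroup driver; with the
# cgroupfs driver the directory is /sys/fs/cgroup/docker/<full-id> instead)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' nginx)
CGROUP_PATH="/sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope"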

# --- CPU Metrics ---
# Total CPU time consumed (in microseconds for cgroup v2)
cat ${CGROUP_PATH}/cpu.stat
# usage_usec 1234567890        # Total CPU time used
# user_usec 1000000000         # Time in user space
# system_usec 234567890        # Time in kernel space
# nr_periods 50000             # Number of enforcement periods
# nr_throttled 150             # Times the container was throttled
# throttled_usec 3000000       # Total throttled time

# Current CPU pressure (cgroup v2 PSI)
cat ${CGROUP_PATH}/cpu.pressure
# some avg10=2.50 avg60=1.80 avg300=0.95 total=45678901
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# --- Memory Metrics ---
cat ${CGROUP_PATH}/memory.current
# 47185920  (bytes = ~45 MiB current usage)

cat ${CGROUP_PATH}/memory.max
# 536870912 (bytes = 512 MiB limit)

cat ${CGROUP_PATH}/memory.stat
# anon 30000000               # Anonymous memory (heap, stack)
# file 15000000               # File-backed memory (page cache)
# slab 2000000                # Kernel slab allocations
# pgfault 125000              # Page faults
# pgmajfault 50               # Major page faults (disk reads)

# Memory pressure
cat ${CGROUP_PATH}/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# --- I/O Metrics ---
cat ${CGROUP_PATH}/io.stat
# 8:0 rbytes=1048576 wbytes=524288 rios=100 wios=50 dbytes=0 dios=0

# --- PID Count ---
cat ${CGROUP_PATH}/pids.current
# 5

cat ${CGROUP_PATH}/pids.max
# 200
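
The files above are plain text, so turning them into numbers takes very little code. Below is a minimal sketch of parsing the cpu.stat layout and the PSI (pressure) layout; the function names are hypothetical, and the sample strings reuse the illustrative values from the listing.

```python
def parse_flat_keyed(text: str) -> dict:
    """Parse 'key value' lines (the cpu.stat / memory.stat layout) into ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(cpu_stat: dict) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = cpu_stat.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit configured, or no periods elapsed yet
    return cpu_stat.get("nr_throttled", 0) / periods


def parse_pressure(text: str) -> dict:
    """Parse PSI lines such as 'some avg10=2.50 avg60=1.80 avg300=0.95 total=1'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


CPU_STAT = "usage_usec 1234567890\nnr_periods 50000\nnr_throttled 150\nthrottled_usec 3000000\n"
CPU_PRESSURE = (
    "some avg10=2.50 avg60=1.80 avg300=0.95 total=45678901\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
```

With the sample values, 150 throttled periods out of 50,000 is a ratio of 0.003, well below the thresholds discussed later in this article.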

                            
cgroup v1 vs v2: On cgroup v1 systems, metrics are split across per-controller subdirectories (/sys/fs/cgroup/memory/docker/<id>/memory.usage_in_bytes). On cgroup v2 (unified hierarchy), everything lives under a single path. Docker has supported cgroup v2 since 20.10 and uses whichever version the host boots with; most current distributions boot with v2. All examples above use the v2 layout.
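
A script that reads cgroup files directly needs to know which layout it is dealing with. One common check is whether the cgroup root contains a cgroup.controllers file, which exists only on a unified (v2) hierarchy. The sketch below builds on that check; the helper names are hypothetical, and the v2 path assumes the systemd cgroup driver used in the examples above.

```python
from pathlib import Path


def cgroup_version(root: Path = Path("/sys/fs/cgroup")) -> int:
    """Return 2 when root is a unified (v2) hierarchy, otherwise 1."""
    return 2 if (root / "cgroup.controllers").is_file() else 1


def memory_usage_file(root: Path, container_id: str) -> Path:
    """Pick the memory usage file for a container, following the two
    layouts described in this article (systemd driver assumed for v2)."""
    if cgroup_version(root) == 2:
        return root / "system.slice" / f"docker-{container_id}.scope" / "memory.current"
    return root / "memory" / "docker" / container_id / "memory.usage_in_bytes"
```

Passing the root directory as a parameter keeps the functions testable against a temporary directory instead of the real /sys/fs/cgroup.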
                        

cAdvisor

cAdvisor (Container Advisor) is Google's open-source container metrics collector. It runs as a daemon, automatically discovers all containers on the host, and exposes their resource usage via a web UI and a Prometheus-compatible metrics endpoint. It's the standard "metrics agent" for container environments.

# Run cAdvisor as a container (the standard deployment method)
docker run -d \
    --name=cadvisor \
    --restart=always \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=8080:8080 \
    --privileged \
    --device=/dev/kmsg \
    gcr.io/cadvisor/cadvisor:v0.49.1

# Verify cAdvisor is running
curl -s http://localhost:8080/healthz
# ok

# Access the web UI at http://localhost:8080
# Shows: container list, per-container CPU/memory/network/filesystem graphs

# Access Prometheus metrics endpoint
curl -s http://localhost:8080/metrics | head -20
# # HELP container_cpu_usage_seconds_total Cumulative cpu time consumed
# # TYPE container_cpu_usage_seconds_total counter
# container_cpu_usage_seconds_total{container_label_com_docker_compose_service="nginx",id="/docker/a1b2c3..."} 12.345
# container_cpu_usage_seconds_total{container_label_com_docker_compose_service="redis",id="/docker/f6e5d4..."} 5.678

# Key metric names exposed by cAdvisor (-o prints only the name, dropping labels and values):
curl -s http://localhost:8080/metrics | grep -oE '^container_(cpu|memory|network|fs)[a-z0-9_]*' | sort -u | head -20
# container_cpu_cfs_periods_total
# container_cpu_cfs_throttled_periods_total
# container_cpu_cfs_throttled_seconds_total
# container_cpu_usage_seconds_total
# container_fs_reads_bytes_total
# container_fs_writes_bytes_total
# container_memory_cache
# container_memory_rss
# container_memory_usage_bytes
# container_memory_working_set_bytes
# container_network_receive_bytes_total
# container_network_transmit_bytes_total

                            
memory_working_set_bytes vs memory_usage_bytes: memory_usage_bytes includes all memory charged to the container, including page cache the kernel can reclaim. memory_working_set_bytes subtracts the inactive file cache, so it approximates memory that cannot be reclaimed without impacting the container, and it is the one to alert on. Kubernetes uses the working set for its eviction decisions.
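
To make the difference concrete, the working set can be approximated from the raw cgroup v2 files as current usage minus the inactive_file entry of memory.stat, floored at zero. This is a sketch of that arithmetic rather than cAdvisor's actual code; the function name and sample figures are illustrative.

```python
def working_set_bytes(usage_bytes: int, memory_stat_text: str) -> int:
    """Approximate the working set: usage minus inactive file cache.

    usage_bytes is the value of memory.current; memory_stat_text is the
    content of memory.stat from the same cgroup (cgroup v2 field names).
    """
    inactive_file = 0
    for line in memory_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "inactive_file":
            inactive_file = int(parts[1])
    # Never report a negative value if the two files were read at different moments
    return max(usage_bytes - inactive_file, 0)


# Hypothetical sample: 45 MiB in use, of which 10 MB is inactive page cache
MEMORY_STAT = "anon 30000000\nfile 15000000\ninactive_file 10000000\nactive_file 5000000\n"
```

A container that streams large files can show high usage with a small working set; alerting on usage alone would page you for cache the kernel will happily give back.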
                        

Prometheus & Grafana Stack

A widely used open-source monitoring stack for containers is Prometheus (metrics collection and storage) paired with Grafana (visualization and alerting). This combination gives you historical data, a powerful query language (PromQL), and shareable dashboards:

# docker-compose.monitoring.yml
# Complete monitoring stack: Prometheus + Grafana + cAdvisor + Node Exporter

version: "3.8"

services:
  # Prometheus - Metrics collection and storage
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    networks:
      - monitoring

  # Grafana - Visualization and dashboards
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # Change this outside local testing
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    networks:
      - monitoring

  # cAdvisor - Container metrics collector
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  # Node Exporter - Host-level metrics
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
    ports:
      - "9100:9100"
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge

# prometheus/prometheus.yml
# Prometheus configuration for container monitoring

global:
  scrape_interval: 15s          # Scrape targets every 15 seconds
  evaluation_interval: 15s      # Evaluate alert rules every 15 seconds

rule_files:
  - "alert-rules.yml"

scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # cAdvisor - container metrics
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
    metric_relabel_configs:
      # Drop high-cardinality metrics to save storage
      - source_labels: [__name__]
        regex: "container_tasks_state"
        action: drop

  # Node Exporter - host metrics
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

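  # Docker daemon metrics (requires "metrics-addr" in daemon.json; on Linux,
  # host.docker.internal resolves only if the prometheus service adds
  # extra_hosts: ["host.docker.internal:host-gateway"])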
  - job_name: "docker-daemon"
    static_configs:
      - targets: ["host.docker.internal:9323"]

# prometheus/alert-rules.yml
# Alert rules for container monitoring

groups:
  - name: container-alerts
    rules:
      # Container using more than 90% of memory limit
      - alert: ContainerHighMemory
        expr: |
          (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
          and container_spec_memory_limit_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory above 90%"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      # Container CPU throttled more than 25% of periods
      - alert: ContainerCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total[5m])
          / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} CPU throttled"

      # Container restarting frequently
      # (cAdvisor exports no restart counter, so count changes in start time)
      - alert: ContainerRestartLoop
        expr: |
          changes(container_start_time_seconds{name=~".+"}[1h]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarting frequently"
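
The for: clause in the rules above is easy to misread. An alert whose expression becomes true is first pending, and it only fires once the expression has stayed true for the whole duration; a single evaluation below the threshold resets the clock. The following sketch models that behaviour with hypothetical names and a simplified fixed evaluation interval; it is not Prometheus code.

```python
def alert_states(values, threshold, for_seconds, interval_seconds=15):
    """Model pending/firing transitions for a rule with a 'for' duration.

    values holds the expression result at each evaluation, taken every
    interval_seconds. Returns one state string per evaluation.
    """
    states = []
    active_since = None  # time at which the condition last became true
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared: the hold timer resets
            states.append("inactive")
    return states
```

For example, with a 30 second hold and evaluations every 15 seconds, the readings 0.95, 0.95, 0.95, 0.5, 0.95 against a 0.9 threshold go pending, pending, firing, inactive, pending. This is why short spikes never reach your pager when for: is set generously.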

Useful PromQL Queries

Essential Container PromQL

CPU usage rate (cores): rate(container_cpu_usage_seconds_total{name=~".+"}[5m])

Memory usage %: container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100

Network receive rate: rate(container_network_receive_bytes_total[5m])

Top 5 CPU consumers: topk(5, rate(container_cpu_usage_seconds_total{name=~".+"}[5m]))

Throttled containers: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0
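
The same queries can be run outside Grafana through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. Here is a minimal sketch using only the standard library; the function names are illustrative, and the base URL http://localhost:9090 is an assumption taken from the port published in the Compose file above.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, expr: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload: dict):
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = []
    for item in payload["data"]["result"]:
        _timestamp, value = item["value"]  # value arrives as a string
        samples.append((item["metric"], float(value)))
    return samples


def run_query(base_url: str, expr: str):
    """Execute the query against a live Prometheus server."""
    with urllib.request.urlopen(instant_query_url(base_url, expr), timeout=10) as resp:
        return parse_vector(json.load(resp))
```

A call such as run_query("http://localhost:9090", 'topk(5, rate(container_cpu_usage_seconds_total{name=~".+"}[5m]))') would return the top CPU consumers as label/value pairs, assuming the stack above is running.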


Key Metrics to Monitor

Not all metrics are equally important. Focus on these signals that indicate actual problems before they become outages:

Metric	What It Measures	Warning Threshold	Critical Threshold	Action
CPU Throttling %	% of periods where CPU was capped	> 10%	> 25%	Increase CPU limits or optimize code
Memory Working Set	Non-reclaimable memory usage	> 80% of limit	> 90% of limit	Increase limit or fix memory leaks
Restart Count	Container restarts in time window	> 2/hour	> 5/hour	Check logs for crash reason
Network Errors	TX/RX errors and drops	> 0.1%	> 1%	Check network configuration, MTU
Disk I/O Wait	Time spent waiting for I/O	> 20ms avg	> 100ms avg	Move to faster storage, optimize queries
PID Count	Number of processes in container	> 80% of pids.max	> 95% of pids.max	Fix fork bombs or increase limit
Health Check Failures	Consecutive failed health probes	> 1 failure	> 3 consecutive	Check application health endpoint
Image Pull Time	Time to pull container image	> 30s	> 120s	Use smaller images, registry mirrors

                            
The CPU Throttling Trap: A container whose average CPU usage sits below its limit can still be throttled. CPU limits are enforced per CFS period (100ms by default). With a limit of 0.5 CPUs the quota is 50ms per period: if the container burns that quota in the first 50ms, it is throttled for the remaining 50ms, even though usage averaged over a few seconds looks fine. Always monitor nr_throttled alongside CPU percentage.
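
The trap is easiest to see with numbers. The sketch below is a deliberately simplified model of quota enforcement (one quota per period, unmet demand carried into the next period), not the kernel's scheduler; the function name and the bursty workload are hypothetical.

```python
def simulate_cfs(demand_ms, quota_ms=50, period_ms=100):
    """Return (average CPU percent of one core, number of throttled periods).

    demand_ms lists how much CPU time the container wants in each period.
    Work that does not fit in the quota waits for the next period.
    """
    backlog = 0.0
    used_total = 0.0
    throttled_periods = 0
    for want in demand_ms:
        backlog += want
        used = min(backlog, quota_ms)  # cannot exceed the quota in one period
        backlog -= used
        used_total += used
        if backlog > 0:
            throttled_periods += 1  # work was left waiting: a throttled period
    average = 100.0 * used_total / (period_ms * len(demand_ms))
    return average, throttled_periods
```

A workload that wants 80ms of CPU in every other period, simulate_cfs([80, 0, 80, 0]), averages 40% of a core, comfortably under the 50% limit, yet it is throttled in 2 of the 4 periods. A dashboard showing only average CPU would report nothing wrong while request latency suffers.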
                        

Docker Logging Architecture

Docker captures all stdout and stderr output from container processes and routes it through a configurable logging driver. The driver determines where logs are stored and in what format:

Docker Logging Pipeline

flowchart LR
    A["Container Process<br/>(stdout/stderr)"] --> B["Docker Daemon<br/>(log router)"]
    B --> C["json-file<br/>(default)"]
    B --> D["syslog"]
    B --> E["fluentd"]
    B --> F["awslogs"]
    B --> G["gcplogs"]
    B --> H["journald"]

    C --> C1["/var/lib/docker/containers/ID/ID-json.log"]
    D --> D1["syslog daemon"]
    E --> E1["Fluentd collector"]
    F --> F1["CloudWatch Logs"]
    G --> G1["Cloud Logging"]
    H --> H1["systemd journal"]

    style A fill:#f0f9f9,stroke:#3B9797
    style B fill:#f8f9fa,stroke:#132440

Driver	Storage	docker logs	Best For	Notes
json-file	Local JSON files	Yes	Development, single-host	Default. Configure max-size and max-file for rotation.
local	Custom binary format	Yes	Better performance than json-file	Compressed, faster writes. Docker 18.09+.
journald	systemd journal	Yes	systemd-based Linux hosts	Integrates with journalctl.
syslog	Remote syslog server	Via local cache (Docker 20.10+)	Enterprise syslog infrastructure	Supports TLS, TCP/UDP.
fluentd	Fluentd daemon	Via local cache (Docker 20.10+)	Flexible log routing/filtering	Buffered, async delivery.
awslogs	CloudWatch Logs	Via local cache (Docker 20.10+)	AWS environments	Direct to CloudWatch, no agent needed.
gcplogs	Google Cloud Logging	Via local cache (Docker 20.10+)	GCP environments	Direct to Cloud Logging.
splunk	Splunk HEC	Via local cache (Docker 20.10+)	Enterprise Splunk deployments	HTTP Event Collector integration.
none	Discarded	No	Performance-critical, no logs needed	Container output is thrown away entirely.

# Check current logging driver for a container
docker inspect --format '{{.HostConfig.LogConfig.Type}}' nginx
# json-file

# Run a container with a specific logging driver
docker run -d --name app \
    --log-driver=json-file \
    --log-opt max-size=10m \
    --log-opt max-file=5 \
    --log-opt compress=true \
    nginx:alpine

# Set daemon-wide default in /etc/docker/daemon.json
cat /etc/docker/daemon.json
# {
#   "log-driver": "json-file",
#   "log-opts": {
#     "max-size": "20m",
#     "max-file": "5",
#     "compress": "true"
#   }
# }

# View container logs (read natively from json-file, local, and journald; since
# Docker Engine 20.10 a local cache lets this work with remote drivers as well)
docker logs nginx --tail 50 --follow --timestamps
# 2026-05-14T10:30:01.123Z 172.17.0.1 - - [14/May/2026:10:30:01 +0000] "GET / HTTP/1.1" 200 615

# View raw log file on host (the directory is readable by root only)
sudo jq '.' /var/lib/docker/containers/CONTAINER_ID/CONTAINER_ID-json.log
# {
#   "log": "172.17.0.1 - - [14/May/2026:10:30:01 +0000] \"GET / HTTP/1.1\" 200 615\n",
#   "stream": "stdout",
#   "time": "2026-05-14T10:30:01.123456789Z"
# }

                            
Log Disk Explosion: Without max-size and max-file options, the json-file driver writes logs indefinitely until the disk is full. This is a common cause of "mystery disk full" incidents on Docker hosts. Always configure log rotation, even in development.
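
Rotation settings also let you budget disk space ahead of time: the worst case is roughly max-size times max-file times the number of containers. The sketch below does that arithmetic; the helper names are hypothetical, and it assumes the k, m, and g suffixes are 1024-based.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}  # assumed binary multipliers


def parse_size(text: str) -> int:
    """Convert a size such as '20m' into bytes; plain numbers pass through."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def max_log_bytes(max_size: str, max_file: int, containers: int) -> int:
    """Upper bound on log disk usage, ignoring compression of rotated files."""
    return parse_size(max_size) * max_file * containers
```

With the daemon-wide settings shown above (20m, 5 files) and 50 containers, max_log_bytes("20m", 5, 50) comes to 5,242,880,000 bytes, just under 5 GiB. Compression of rotated files would lower the real figure.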
                        

Structured Logging Best Practices

Unstructured log lines ("Error: something went wrong") are nearly useless at scale. Structured logging outputs machine-parseable records (typically JSON) that can be indexed, filtered, and correlated automatically:

// BAD: Unstructured log line
"ERROR 2026-05-14 10:30:01 Connection to database failed after 3 retries"

// GOOD: Structured JSON log
{
    "timestamp": "2026-05-14T10:30:01.456Z",
    "level": "error",
    "service": "user-api",
    "message": "Database connection failed",
    "error": "connection refused",
    "host": "db-primary.internal",
    "port": 5432,
    "retries": 3,
    "retry_interval_ms": 1000,
    "correlation_id": "req-a1b2c3d4-e5f6-7890",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "container_id": "f6e5d4c3b2a1"
}

// Node.js application with structured logging (pino)
const pino = require('pino');

const logger = pino({
    level: process.env.LOG_LEVEL || 'info',
    formatters: {
        level: (label) => ({ level: label }),
    },
    base: {
        service: 'user-api',
        version: process.env.APP_VERSION || 'unknown',
        environment: process.env.NODE_ENV || 'development',
    },
    timestamp: pino.stdTimeFunctions.isoTime,
});

// Usage: each call writes one JSON line to stdout, which Docker captures
// (the sample output below omits the version and environment base fields)
logger.info({ userId: 12345, action: 'login' }, 'User authenticated');
// {"level":"info","time":"2026-05-14T10:30:01.456Z","service":"user-api","userId":12345,"action":"login","msg":"User authenticated"}

logger.error({ err: error, requestId: req.id }, 'Database query failed');
// {"level":"error","time":"2026-05-14T10:30:02.789Z","service":"user-api","err":{"message":"timeout","stack":"..."},"requestId":"abc-123","msg":"Database query failed"}

                            
                            Correlation IDs: Every incoming request should receive a unique correlation ID (passed via X-Request-ID header). This ID propagates through all downstream service calls, appearing in every log entry. When debugging a failure, you filter by correlation ID and see the complete request journey across all services — even across 20+ containers.
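
For services not written in Node.js, the same pattern is available with standard tooling. The sketch below shows one possible way to attach a correlation ID to every log line in Python, using the standard logging module and a context variable. The logger name, field names, and the handle_request helper are illustrative assumptions, not part of any framework.

```python
import contextvars
import json
import logging
import sys
import uuid

# Holds the ID of the request currently being handled
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "service": "user-api",
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        entry.update(getattr(record, "fields", {}))  # per-call structured fields
        return json.dumps(entry)


def build_logger(stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (or a given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("user-api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, headers: dict) -> None:
    """Reuse the caller's X-Request-ID, or mint a new one, then log."""
    correlation_id.set(headers.get("X-Request-ID") or f"req-{uuid.uuid4()}")
    logger.info("User authenticated", extra={"fields": {"userId": 12345}})
```

Because the ID lives in a context variable, code deeper in the call stack can log without having the ID passed to it explicitly. Outgoing HTTP calls would still need to forward the X-Request-ID header themselves.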
                        

Log Aggregation with Fluent Bit

Fluent Bit is a lightweight log processor that collects container logs, parses them, and forwards them to storage backends. It is the lighter-weight sibling of Fluentd from the same project family: written in C, with a much smaller memory footprint, and designed for container environments:

# docker-compose.logging.yml
# Log aggregation stack: Fluent Bit + Loki + Grafana

version: "3.8"

services:
  # Fluent Bit - Log collector and forwarder
  fluent-bit:
    image: fluent/fluent-bit:3.0
    container_name: fluent-bit
    restart: unless-stopped
    volumes:
      - ./fluent-bit/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro
      - ./fluent-bit/parsers.conf:/fluent-bit/etc/parsers.conf:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log   # writable: Fluent Bit keeps its tail offsets in /var/log/flb_docker.db
    depends_on:
      - loki
    networks:
      - logging

  # Grafana Loki - Log storage and indexing
  loki:
    image: grafana/loki:2.9.6
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - logging

  # Grafana - Log visualization (query via LogQL)
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana-logs
    restart: unless-stopped
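    ports:
      - "3000:3000"   # Same host port as the monitoring stack's Grafana; run one or remap
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # Change this outside local testing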
    depends_on:
      - loki
    networks:
      - logging

volumes:
  loki-data:

networks:
  logging:
    driver: bridge

# fluent-bit/fluent-bit.conf
# Fluent Bit configuration for Docker container logs

[SERVICE]
    Flush         5
    Daemon        Off
    Log_Level     info
    Parsers_File  parsers.conf

# Input: Read Docker container JSON log files
[INPUT]
    Name              tail
    Path              /var/lib/docker/containers/*/*.log
    Parser            docker
    Tag               docker.*
    Refresh_Interval  10
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    DB                /var/log/flb_docker.db

# Filter: Parse JSON log content from applications
[FILTER]
    Name              parser
    Match             docker.*
    Key_Name          log
    Parser            json_log
    Reserve_Data      On

# Filter: Add container metadata
[FILTER]
    Name              modify
    Match             docker.*
    Add               cluster local-dev
    Add               environment development

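# Output: Send to Grafana Loki
# (the container label assumes an upstream filter adds a container_name key;
# the tail input alone does not provide one)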
[OUTPUT]
    Name              loki
    Match             docker.*
    Host              loki
    Port              3100
    Labels            job=docker,container=$container_name
    Remove_Keys       stream,time
    Line_Format       json

# Output: Also print to stdout for debugging
[OUTPUT]
    Name              stdout
    Match             docker.*
    Format            json_lines
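
It can help to see what the parser filter stage does to a single record. The sketch below imitates it in Python: it decodes the outer json-file record, tries to decode the log field as JSON, and merges the result while keeping the other fields (the effect of Reserve_Data On). This is an illustration of the behaviour under those assumptions, with a hypothetical function name; it is not Fluent Bit code.

```python
import json


def parse_docker_line(raw: str) -> dict:
    """Expand a json-file log line whose 'log' field contains JSON."""
    outer = json.loads(raw)  # {"log": ..., "stream": ..., "time": ...}
    try:
        inner = json.loads(outer["log"])
    except (KeyError, ValueError):
        return outer  # plain-text line: leave the record unchanged
    if not isinstance(inner, dict):
        return outer  # valid JSON but not an object, e.g. a bare number
    merged = {k: v for k, v in outer.items() if k != "log"}
    merged.update(inner)  # application fields become top-level keys
    return merged
```

After this step an application field such as level or correlation_id is a top-level key that the storage backend can filter on, while unstructured lines pass through untouched.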

Docker Events

Docker emits real-time events for every lifecycle change — container creation, start, stop, die, OOM kill, network connect/disconnect. These events are the foundation of automated monitoring and self-healing systems:

# Stream all Docker events in real-time
docker events
# 2026-05-14T10:30:01.000000 container create abc123 (image=nginx:alpine, name=web)
# 2026-05-14T10:30:01.500000 container start abc123 (image=nginx:alpine, name=web)
# 2026-05-14T10:35:00.000000 container die abc123 (exitCode=137, image=nginx:alpine, name=web)

# Filter events by type and action
docker events --filter type=container --filter event=die
# Only shows container death events

# Filter by container name
docker events --filter container=nginx --filter container=redis

# JSON format for machine parsing
docker events --format '{{json .}}' --filter event=oom
# {"status":"oom","id":"abc123","from":"myapp:latest","Type":"container",
#  "Action":"oom","Actor":{"ID":"abc123","Attributes":{"name":"app"}},
#  "time":1715684400,"timeNano":1715684400123456789}

# Time-bounded query (historical events)
docker events --since "2026-05-14T09:00:00" --until "2026-05-14T11:00:00"

# Script: Auto-restart containers that die with non-zero exit
# (a sketch: a deliberate `docker stop` can also yield non-zero codes such as
# 137 or 143, so prefer --restart policies for anything beyond experiments)
docker events --filter event=die --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' | while read -r name code; do
    if [ "$code" != "0" ]; then
        echo "$(date): Container $name died with exit code $code, restarting"
        docker start "$name" 2>/dev/null || echo "Failed to restart $name"
    fi
done

# Monitor OOM kills specifically
docker events --filter event=oom --format '{{.Actor.Attributes.name}}' | while read name; do
    echo "CRITICAL: Container $name was OOM killed at $(date)"
    # Send alert to PagerDuty, Slack, etc.
done

Event	Triggered When	Useful For
`create`	Container metadata created	Audit logging, deployment tracking
`start`	Container process begins	Service discovery, health check init
`die`	Container process exits	Alerting, auto-restart logic
`oom`	Kernel OOM kills the container	Critical alerts, capacity planning
`health_status`	Health check state changes	Load balancer drain, alerting
`destroy`	Container is removed	Cleanup, resource accounting
`exec_start`	docker exec command runs	Security audit, intrusion detection
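
Exit codes in die events carry more information than "zero or not". By shell convention, a code above 128 means the process was terminated by signal number code minus 128, which is why the example event above shows 137 for a SIGKILL. The following sketch decodes events in the JSON shape shown earlier; the function names and message wording are illustrative.

```python
import json
import signal


def describe_exit(code: int) -> str:
    """Translate a container exit code into a short explanation."""
    if code == 0:
        return "clean exit"
    if code > 128:
        number = code - 128
        try:
            return f"killed by {signal.Signals(number).name}"
        except ValueError:
            return f"killed by signal {number}"  # number unknown on this platform
    return f"application error (exit {code})"


def summarize_event(line: str):
    """Return a one-line summary for oom and die events, else None."""
    event = json.loads(line)
    attrs = event.get("Actor", {}).get("Attributes", {})
    name = attrs.get("name", "?")
    action = event.get("Action", "")
    if action == "oom":
        return f"{name}: out of memory"
    if action == "die":
        return f"{name}: {describe_exit(int(attrs.get('exitCode', 0)))}"
    return None
```

Feeding each line of docker events --format '{{json .}}' through summarize_event gives alert text that already distinguishes a crash from a kill, which saves a lookup during an incident.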

Health Monitoring

Docker HEALTHCHECK provides application-level monitoring — not just "is the process running?" but "is the application actually serving traffic correctly?" Integrating health checks with monitoring creates a self-healing feedback loop:

# Dockerfile with comprehensive health check
FROM node:20-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

# Health check: verify the app responds with 200 on /healthz
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/healthz || exit 1

EXPOSE 3000
CMD ["node", "server.js"]

# Monitor health status of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# NAMES     STATUS
# nginx     Up 2 hours (healthy)
# app       Up 2 hours (unhealthy)
# redis     Up 2 hours (healthy)

# Inspect health check history
docker inspect --format '{{json .State.Health}}' app | jq '.'
# {
#   "Status": "unhealthy",
#   "FailingStreak": 5,
#   "Log": [
#     {
#       "Start": "2026-05-14T10:30:00Z",
#       "End": "2026-05-14T10:30:05Z",
#       "ExitCode": 1,
#       "Output": "wget: server returned error: HTTP/1.1 503 Service Unavailable"
#     }
#   ]
# }

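# Alert on unhealthy containers using docker events
# (health changes arrive as event=health_status; the action text carries the new state)
docker events --filter type=container --filter event=health_status \
    --format '{{.Actor.Attributes.name}} {{.Action}}' | while read -r name action; do
    case "$action" in
        *unhealthy*)
            echo "ALERT: Container $name is unhealthy at $(date)"
            docker logs "$name" --tail 20  # Capture recent logs for context
            ;;
    esac
done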

# Docker Compose with health-dependent startup
# In docker-compose.yml:
# services:
#   app:
#     depends_on:
#       db:
#         condition: service_healthy
#   db:
#     healthcheck:
#       test: ["CMD", "pg_isready", "-U", "postgres"]
#       interval: 10s
#       timeout: 5s
#       retries: 5
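
The retries setting is what keeps a single slow response from flipping a container to unhealthy. The sketch below models the status transitions: a passing probe marks the container healthy and resets the failing streak, and the status only becomes unhealthy after the configured number of consecutive failures. It is a simplified model with a hypothetical function name, and it ignores the start period.

```python
def health_timeline(results, retries=3):
    """Return (status, failing_streak) after each probe.

    results is an ordered iterable of booleans: True for a passing probe.
    """
    status = "starting"
    streak = 0
    timeline = []
    for passed in results:
        if passed:
            streak = 0
            status = "healthy"
        else:
            streak += 1
            if streak >= retries:
                status = "unhealthy"
        timeline.append((status, streak))
    return timeline
```

With retries=3, a pass followed by three failures and another pass moves through healthy, healthy, healthy, unhealthy, healthy: the first two failures only lengthen the streak. With a 30 second interval this means roughly 90 seconds pass before the status flips, which is worth remembering when choosing interval and retries together.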

Production Pattern

The Observability Feedback Loop

In production, monitoring isn't passive — it drives automated responses:

Detect: Health check fails → container marked unhealthy
Alert: Docker event triggers notification to on-call engineer
Automate: Orchestrator replaces the unhealthy container (Swarm acts on HEALTHCHECK; Kubernetes relies on its own liveness and readiness probes instead)
Diagnose: Logs + metrics from the failed container preserved for post-mortem
Prevent: Alert thresholds catch degradation before users notice

The goal: automation detects and resolves most issues before users notice, faster than humans could respond.


Exercises

                            
                            Exercise 1: Deploy the complete Prometheus + Grafana + cAdvisor stack using the docker-compose.monitoring.yml above. Run 3-4 application containers alongside it. Create a Grafana dashboard showing CPU usage, memory usage, and network I/O per container. Set up an alert that fires when any container exceeds 80% memory.
                        

                            
                            Exercise 2: Write a shell script that reads cgroup files directly (without Docker CLI) and produces a CSV with columns: container_id, cpu_usage_microseconds, memory_bytes, pids_count. Run it every 5 seconds and compare results with docker stats.
                        

                            
                            Exercise 3: Configure a Node.js or Python application to output structured JSON logs. Deploy it with the Fluent Bit + Loki stack. Query Loki via Grafana to find all error-level logs for a specific correlation ID.
                        

                            
                            Exercise 4: Set up a Docker events monitor that watches for OOM kills and container deaths, then posts alerts to a Slack webhook or writes to a file with full context (container name, exit code, last 20 log lines).
                        

Conclusion & Next Steps

Container observability is not optional — it's the difference between confidently operating production systems and blindly hoping nothing breaks. The stack we built in this article provides:

Metrics: cAdvisor exposes per-container resource usage; Prometheus stores and queries it; Grafana visualises trends and fires alerts
Logs: Docker logging drivers capture output; structured JSON enables filtering; Fluent Bit aggregates and routes to storage
Events: Docker events provide real-time lifecycle notifications for automation and audit
Health: HEALTHCHECK exposes application-level health that Swarm, Compose dependencies, and event-driven scripts can act on

With observability in place, the next challenge is diagnosing problems when things go wrong. Metrics tell you what is broken; troubleshooting determines why and how to fix it.

Next in the Series

In Part 21: Container Troubleshooting, we'll build a systematic debugging toolkit — diagnosing crash loops, OOM kills, networking failures, and using advanced tools like nsenter, strace, and tcpdump to investigate container issues from the host.
