Part 13: Monitoring & Observability

Why Observability Matters

Modern distributed systems are inherently complex. Microservices communicate across networks, containers spin up and down dynamically, and cloud resources scale automatically. When something goes wrong — and it will — you need to understand what happened, where it happened, and why it happened. This is the domain of observability.

Traditional monitoring answers known questions: "Is the server up? Is CPU above 80%?" Observability goes further — it enables you to ask new questions you never anticipated. It shifts you from reactive firefighting to proactive understanding of system behavior.

                            
Key Insight: Monitoring tells you when something is broken. Observability tells you why it is broken, even for failure modes you have never seen before. In distributed systems, you cannot predict every possible failure; you need the ability to explore system state from the outside using telemetry data.
                        

The Three Pillars of Observability

Observability rests on three complementary data types, each providing a different lens into system behavior:

Metrics — Numeric time-series data aggregated over time (e.g., request rate, error percentage, latency percentiles). Cheap to store, fast to query, ideal for dashboards and alerts.
Logs — Timestamped text records of discrete events (e.g., "User X failed authentication at 14:32:05"). Rich context but expensive at scale. Essential for debugging specific incidents.
Traces — End-to-end records of request flow across services. Each trace contains spans showing timing, dependencies, and errors through the entire call chain.

Three Pillars of Observability

flowchart TD
    A[Observability] --> B[Metrics]
    A --> C[Logs]
    A --> D[Traces]
    B --> B1[Counters & Gauges]
    B --> B2[Dashboards & Alerts]
    B --> B3[Trend Analysis]
    C --> C1[Event Records]
    C --> C2[Error Details]
    C --> C3[Audit Trail]
    D --> D1[Request Flow]
    D --> D2[Latency Breakdown]
    D --> D3[Dependency Map]
    B1 --> E[Detect]
    C1 --> F[Diagnose]
    D1 --> G[Understand]
    E --> H[Resolve Faster]
    F --> H
    G --> H

The three pillars work together: metrics detect anomalies, logs help diagnose root causes, and traces help understand the full picture of how a request flowed through the system. No single pillar is sufficient alone.

Pillar	Data Type	Best For	Storage Cost	Query Speed
Metrics	Numeric time-series	Alerting, dashboards, trends	Low	Fast
Logs	Text events	Debugging, audit, context	High	Medium
Traces	Span trees	Request flow, latency	Medium	Medium

Metrics

Metrics are numeric measurements collected at regular intervals. They are the foundation of monitoring — cheap to store, fast to query, and ideal for detecting when something deviates from normal behavior.

Metric Types

Type	Description	Example	Use Case
Counter	Monotonically increasing value	`http_requests_total`	Request rates, error counts
Gauge	Value that can go up or down	`node_memory_MemAvailable_bytes`	Memory usage, queue depth
Histogram	Samples in configurable buckets	`http_request_duration_seconds`	Latency percentiles (p50, p95, p99)
Summary	Client-calculated quantiles	`go_gc_duration_seconds`	Pre-computed percentiles
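To make the difference between the types concrete, here is a minimal in-memory sketch in Python. The class names (`Counter`, `Gauge`, `Histogram`) and their methods are illustrative assumptions, not the API of any particular client library; the point is only to show the semantics each type enforces.

```python
import bisect

class Counter:
    """Monotonically increasing value; only resets when the process restarts."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Point-in-time value that may go up or down."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

class Histogram:
    """Counts observations into cumulative buckets keyed by upper bound (le)."""
    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # An observation increments every bucket whose upper bound is >= value.
        first = bisect.bisect_left(self.bounds, value)
        for i in range(first, len(self.bounds)):
            self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1

# Hypothetical usage: four requests with different latencies (seconds).
requests_total = Counter()
queue_depth = Gauge()
latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    requests_total.inc()
    latency.observe(seconds)
queue_depth.set(7)
queue_depth.set(3)  # a gauge simply takes the latest value
```

Because the buckets are cumulative, the last (`+Inf`) bucket always equals the total observation count, which is what makes quantile estimation from bucket counts possible later on.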

Golden Signals, RED, and USE Methods

Three frameworks help determine what to measure. Each targets different layers of the stack:

                            
Google SRE Golden Signals: The four signals that matter most for user-facing systems: Latency (how long requests take), Traffic (how much demand), Errors (how many failures), and Saturation (how full the system is).
                        

Framework	Focus	Signals	Best For
Golden Signals	User-facing services	Latency, Traffic, Errors, Saturation	APIs, web apps, microservices
RED Method	Request-driven services	Rate, Errors, Duration	HTTP services, gRPC endpoints
USE Method	Infrastructure resources	Utilization, Saturation, Errors	CPU, memory, disk, network
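As a rough illustration of the RED method, the sketch below derives Rate, Errors, and Duration from a list of request records for one time window. The `Request` type, the `red_signals` function, and the choice of a nearest-rank p95 are assumptions made for this example; a real service would normally export counters and histograms and let the metrics backend do this arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request

def red_signals(requests, window_s):
    """Return (rate per second, error ratio, p95 duration) for one window."""
    if not requests:
        return 0.0, 0.0, 0.0
    rate = len(requests) / window_s
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = max(1, -(-95 * len(durations) // 100))  # ceil(0.95 * n) using integer math
    return rate, errors / len(requests), durations[rank - 1]

# Hypothetical window: 19 fast successes and 1 slow server error in 10 seconds.
sample = [Request(200, 0.1)] * 19 + [Request(500, 2.0)]
rate, error_ratio, p95 = red_signals(sample, window_s=10)
```

Adding a saturation measurement (for example queue depth or connection pool usage) to these three values gives the four golden signals.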

Key Infrastructure Metrics

# Key infrastructure metrics to monitor

# CPU
# - node_cpu_seconds_total (counter) → rate() for utilization
# - system.cpu.utilization (gauge, 0-1)

# Memory
# - node_memory_MemAvailable_bytes (gauge)
# - node_memory_MemTotal_bytes (gauge)
# → utilization = 1 - (Available / Total)

# Disk
# - node_filesystem_avail_bytes (gauge)
# - node_disk_io_time_seconds_total (counter) → rate() for IO utilization

# Network
# - node_network_receive_bytes_total (counter) → rate() for throughput
# - node_network_transmit_bytes_total (counter)

# Containers (cAdvisor / kubelet)
# - container_cpu_usage_seconds_total
# - container_memory_working_set_bytes
# - container_network_receive_bytes_total
# - kube_pod_container_status_restarts_total
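Several of the metrics above are counters that only become useful once converted to a per-second rate. The following Python sketch shows the basic idea, including handling of counter resets. It is a simplification under stated assumptions: `counter_rate` is a hypothetical helper, and PromQL's `rate()` additionally extrapolates to the window boundaries, which is not modelled here.

```python
def counter_rate(samples):
    """Per-second increase of a counter over (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), in which case the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Hypothetical idle-CPU-seconds counter scraped every 15s, with a reset at t=45.
idle_seconds = [(0, 100.0), (15, 112.0), (30, 124.0), (45, 3.0), (60, 15.0)]
idle_rate = counter_rate(idle_seconds)  # fraction of each second spent idle
cpu_utilization = 1 - idle_rate
```

The same pattern applies to the disk and network counters: the raw value is rarely interesting, the rate of change is.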

Prometheus

Prometheus is the de facto standard for cloud-native metrics collection. Originally built at SoundCloud and donated to the CNCF, it provides a pull-based scraping model, powerful query language (PromQL), and built-in alerting — all designed for dynamic, container-based environments.

Prometheus Architecture

flowchart LR
    subgraph Targets
        A[Application /metrics]
        B[Node Exporter]
        C[cAdvisor]
        D[Custom Exporter]
    end
    subgraph Prometheus
        E[Prometheus Server]
        F[TSDB Storage]
        G[Rule Engine]
    end
    subgraph Alerting
        H[Alertmanager]
        I[PagerDuty]
        J[Slack]
        K[Email]
    end
    subgraph Visualization
        L[Grafana]
        M[Prometheus UI]
    end
    A -->|scrape| E
    B -->|scrape| E
    C -->|scrape| E
    D -->|scrape| E
    E --> F
    E --> G
    G -->|fire alerts| H
    H --> I
    H --> J
    H --> K
    E -->|query| L
    E -->|query| M

Prometheus Configuration

# prometheus.yml - Main configuration file
global:
  scrape_interval: 15s          # How often to scrape targets
  evaluation_interval: 15s      # How often to evaluate rules
  scrape_timeout: 10s           # Timeout for scrape requests

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files for recording and alerting
rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter for system metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.+):.*"
        target_label: instance
        replacement: "$1"

  # Kubernetes pods with prometheus.io annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
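The `replace` relabel rules in this configuration can be hard to read at first. The Python sketch below imitates what such a rule does, assuming the documented behaviour that source label values are joined with `;` and that the regex must match the whole joined string. `relabel_replace` is a hypothetical helper written for this illustration and is not part of Prometheus.

```python
import re

def relabel_replace(source_values, regex, replacement, separator=";"):
    """Imitate a Prometheus-style `replace` relabel action.

    Returns the new label value, or None when the regex does not match
    (in which case the target label is left unchanged).
    """
    joined = separator.join(source_values)
    match = re.fullmatch(regex, joined)
    if match is None:
        return None
    # Translate $N references into Python's \g<N> template syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)

# Pod address plus the prometheus.io/port annotation -> address with the annotated port.
pod_rule = r"([^:]+)(?::\d+)?;(\d+)"
with_port = relabel_replace(["10.0.0.5:3000", "8080"], pod_rule, "$1:$2")
no_annotation = relabel_replace(["10.0.0.5:3000", ""], pod_rule, "$1:$2")

# Node Exporter rule: strip the port to get a clean instance label.
instance = relabel_replace(["node1:9100"], r"(.+):.*", "$1")
```

When the port annotation is missing the regex fails to match, so the original `__address__` is kept rather than being rewritten to something invalid.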

PromQL Fundamentals

PromQL is a functional query language purpose-built for time-series data. It allows instant vector queries (single point in time) and range vector queries (data over time windows).

# PromQL Query Examples

# --- Instant Vectors ---
# Current HTTP request rate (per second, averaged over 5 minutes)
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# 95th percentile latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# --- Aggregations ---
# Total requests per service
sum by (service) (rate(http_requests_total[5m]))

# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

# Memory utilization percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# --- Range Vectors & Functions ---
# Average request rate over 1 hour
avg_over_time(rate(http_requests_total[5m])[1h:5m])

# Predict disk full in 4 hours (linear extrapolation)
predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0

# Rate of change (derivative) of a gauge; deriv() is meant for gauges, use rate() for counters
deriv(node_memory_MemAvailable_bytes[15m])
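The `histogram_quantile` query above estimates a percentile from bucket counts rather than from raw samples. A simplified Python sketch of that estimation follows, assuming cumulative buckets and linear interpolation inside the bucket that contains the target rank. The function below is an illustration of the idea, not Prometheus's implementation, and it ignores edge cases such as non-monotonic buckets.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into +Inf; report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# Hypothetical latency buckets (seconds): 100 requests in total.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p95 = histogram_quantile(0.95, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
```

The result is only as precise as the bucket layout: here the p99 falls in the `+Inf` bucket, so the estimate is capped at the highest finite bound. Choosing bucket boundaries near your latency targets keeps the interpolation error small.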

Recording and Alerting Rules

# recording_rules.yml - Pre-compute expensive queries
groups:
  - name: http_recording_rules
    interval: 30s
    rules:
      # Record request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Record error ratio per service
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Record p99 latency per service
      - record: job:http_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

# alerting_rules.yml - Define alert conditions
groups:
  - name: infrastructure_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # Disk space running low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} available on {{ $labels.mountpoint }}"

      # Pod crash looping
      - alert: PodCrashLooping
        # increase() gives the (extrapolated) number of restarts in the window; rate() would be a per-second value
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted roughly {{ $value | humanize }} times in 15m"
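The `for:` clause in the rules above means an alert does not fire the moment its expression becomes true. The Python sketch below models that behaviour as a small state machine, assuming evaluations happen at a fixed interval. `alert_states` and its parameters are hypothetical names used only for this illustration.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    An alert is 'pending' while its expression has been true for less than
    `for_s` seconds, 'firing' once it has held for at least that long, and
    'inactive' as soon as the expression is false again.
    """
    active_since = None
    states = []
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states

# Evaluated every 60s with `for: 2m`; a brief recovery resets the pending timer.
timeline = alert_states([True, True, False, True, True, True],
                        eval_interval_s=60, for_s=120)
```

A short `for:` pages faster but is more sensitive to brief spikes; a long one filters noise at the cost of detection time.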

Service Discovery

Prometheus dynamically discovers targets in cloud-native environments using service discovery mechanisms:

Discovery Type	Use Case	Configuration
kubernetes_sd	Pods, services, endpoints in K8s	`role: pod/service/endpoints`
ec2_sd	AWS EC2 instances	IAM role, region, filters
azure_sd	Azure VMs and scale sets	Subscription, tenant, tags
consul_sd	Consul-registered services	Consul server address
file_sd	Static file-based discovery	JSON/YAML target files
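File-based discovery is the easiest of these mechanisms to script, because Prometheus only needs a JSON (or YAML) file containing a list of target groups. The Python sketch below generates such a document. The host names, port, labels, and helper names are assumptions made for the example.

```python
import json

def file_sd_targets(hosts, port, labels):
    """Build the list-of-groups structure that file-based discovery reads."""
    return [{"targets": [f"{host}:{port}" for host in hosts],
             "labels": dict(labels)}]

def render_file_sd(groups):
    """Serialize target groups as JSON, ready to be written to a target file."""
    return json.dumps(groups, indent=2, sort_keys=True)

# Hypothetical inventory: two Node Exporter hosts in production.
groups = file_sd_targets(["node1", "node2"], 9100, {"env": "production"})
document = render_file_sd(groups)
```

In practice a script like this would be driven by an inventory system or CMDB and would write the file atomically (write to a temporary file, then rename) so that Prometheus never reads a half-written document.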

Grafana

Grafana is the industry-standard visualization platform for observability data. It connects to dozens of data sources (Prometheus, Loki, Elasticsearch, CloudWatch, Azure Monitor) and provides rich, interactive dashboards for metrics, logs, and traces.

Dashboard Design Principles

Layer dashboards: High-level overview → service-specific → detailed debug
Use variables: Allow filtering by environment, region, service, and instance
Golden signals first: Every service dashboard should show latency, traffic, errors, and saturation prominently
Correlate panels: Place related metrics side-by-side (e.g., latency + error rate)
Include annotations: Overlay deployment markers and incidents on time-series graphs

                            
Common Mistake: Creating dashboards with 50+ panels that nobody reads. Start with 4-6 panels showing the golden signals, then link to detailed dashboards for drill-down. The most effective dashboards answer a single question clearly.
                        

Grafana as Code

{
  "dashboard": {
    "title": "Service Overview - Payment API",
    "tags": ["production", "payment", "golden-signals"],
    "timezone": "utc",
    "refresh": "30s",
    "templating": {
      "list": [
        {
          "name": "environment",
          "type": "query",
          "query": "label_values(http_requests_total, environment)",
          "current": { "text": "production", "value": "production" }
        },
        {
          "name": "instance",
          "type": "query",
          "query": "label_values(http_requests_total{environment=\"$environment\"}, instance)"
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{environment=\"$environment\"}[5m]))",
            "legendFormat": "Total Requests/sec"
          }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\",environment=\"$environment\"}[5m])) / sum(rate(http_requests_total{environment=\"$environment\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P95 Latency (ms)",
        "type": "timeseries",
        "gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{environment=\"$environment\"}[5m])) by (le)) * 1000",
            "legendFormat": "P95 Latency"
          }
        ]
      }
    ]
  }
}

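# Terraform: Deploy Grafana dashboard from JSON file
# Note: config_json expects the dashboard model itself (title, panels, ...),
# without the outer "dashboard" wrapper used by the HTTP API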
resource "grafana_dashboard" "payment_service" {
  config_json = file("${path.module}/dashboards/payment-service.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_folder" "services" {
  title = "Service Dashboards"
}

# Grafana data source configuration
resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus-server.monitoring.svc:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
    httpMethod   = "POST"
  })
}

# Grafana alerting via Terraform
resource "grafana_contact_point" "pagerduty" {
  name = "PagerDuty - Critical"

  pagerduty {
    integration_key = var.pagerduty_integration_key
    severity        = "critical"
  }
}

resource "grafana_notification_policy" "default" {
  contact_point = grafana_contact_point.pagerduty.name
  group_by      = ["alertname", "service"]

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.pagerduty.name
    group_wait      = "30s"
    group_interval  = "5m"
    repeat_interval = "4h"
  }
}

Centralized Logging

In distributed systems, logs are scattered across dozens or hundreds of containers, VMs, and services. Centralized logging collects all logs into a single queryable system, making it possible to correlate events across services and debug complex issues.

Centralized Logging Architecture

flowchart LR
    subgraph Sources
        A[Application Pods]
        B[System Logs]
        C[Load Balancer]
        D[Database]
    end
    subgraph Collection
        E[Fluentd / Fluent Bit]
        F[Filebeat]
    end
    subgraph Processing
        G[Logstash / Fluentd]
    end
    subgraph Storage & Query
        H[Elasticsearch]
        I[Loki]
        J[CloudWatch Logs]
    end
    subgraph Visualization
        K[Kibana]
        L[Grafana]
        M[CloudWatch Insights]
    end
    A --> E
    B --> E
    C --> F
    D --> F
    E --> G
    F --> G
    G --> H
    G --> I
    G --> J
    H --> K
    I --> L
    J --> M

Structured Logging Best Practices

Structured logs (JSON format) are machine-parseable, enabling efficient querying and indexing. Always emit structured logs in production:

{
  "timestamp": "2026-05-14T10:30:45.123Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "method": "POST",
  "path": "/api/v1/payments",
  "status_code": 500,
  "duration_ms": 2340,
  "user_id": "usr_12345",
  "error": "connection timeout to payment gateway",
  "error_type": "TimeoutError",
  "retry_count": 3,
  "environment": "production",
  "region": "us-east-1",
  "pod": "payment-api-7f8b9c6d4-x2k9p"
}
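One way to emit records shaped like the example above is a custom formatter on top of Python's standard `logging` module. This is a minimal sketch under stated assumptions: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields under a `fields` key are choices made for this example, not an established library API.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

def build_logger(name, service, stream=None):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate plain-text output via the root logger
    return logger

# Hypothetical usage, writing to an in-memory buffer instead of stdout.
buffer = io.StringIO()
log = build_logger("payment", "payment-api", buffer)
log.error("connection timeout to payment gateway",
          extra={"fields": {"trace_id": "abc123def456", "status_code": 500}})
parsed = json.loads(buffer.getvalue())
```

Including `trace_id` in every record is what later allows a log line to be joined with the corresponding distributed trace.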

Fluentd Configuration

# fluentd.conf - Kubernetes log collection
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
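    # Assumes Docker json-file formatted logs; containerd and CRI-O write a plain-text CRI format that needs a different parser
    @type json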
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

# Enrich with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
</filter>

# Parse JSON application logs
<filter kubernetes.**>
  @type parser
  key_name log
  reserve_data true
  remove_key_name_field true
  <parse>
    @type json
  </parse>
</filter>

# Output to Elasticsearch
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  logstash_format true
  logstash_prefix k8s-logs
  include_tag_key true
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    flush_mode interval
    flush_interval 5s
    retry_max_interval 30
    chunk_limit_size 2M
    total_limit_size 500M
  </buffer>
</match>

Logging Solution Comparison

Solution	Architecture	Query Language	Best For	Cost Model
ELK Stack	Full-text indexing	KQL (Kibana Query Language) / Lucene	Complex searches, analytics	Self-hosted (resource-heavy)
Loki	Label-based, log chunks	LogQL	Kubernetes, cost-effective	Self-hosted (lightweight)
CloudWatch Logs	Managed (AWS)	CloudWatch Insights	AWS-native workloads	Per GB ingested + stored
Azure Monitor Logs	Managed (Azure)	KQL (Kusto)	Azure-native workloads	Per GB ingested + retained
Datadog Logs	SaaS	Datadog query syntax	Multi-cloud, unified platform	Per GB ingested (expensive)

Distributed Tracing

When a single user request passes through 5, 10, or 20 microservices, understanding where time is spent and where errors occur becomes nearly impossible without tracing. Distributed tracing gives you the end-to-end story of every request.

Spans and Context Propagation

A trace represents the complete journey of a request. Each trace is composed of spans — individual units of work within a service. Spans form a tree structure showing parent-child relationships and timing.

Distributed Trace Across Microservices

sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant AS as Auth Service
    participant PS as Payment Service
    participant DB as Database
    participant MQ as Message Queue
    participant NS as Notification Service

    U->>GW: POST /checkout (trace_id: abc123)
    GW->>AS: Validate Token (span: auth-check)
    AS-->>GW: Token Valid (12ms)
    GW->>PS: Process Payment (span: payment)
    PS->>DB: Insert Transaction (span: db-write)
    DB-->>PS: OK (45ms)
    PS->>MQ: Publish Event (span: queue-publish)
    MQ-->>PS: ACK (8ms)
    PS-->>GW: Payment Success (120ms)
    MQ->>NS: Send Confirmation (span: notify)
    NS-->>MQ: Sent (200ms)
    GW-->>U: 200 OK (150ms total)
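For the trace ID in the diagram to survive each hop, every service has to forward it on outgoing calls. A common carrier is the W3C Trace Context `traceparent` HTTP header, which has the form `version-traceid-spanid-flags`. The Python sketch below shows the propagation logic in a simplified form; the function names are hypothetical, and checks required by the specification (such as rejecting all-zero IDs) are omitted.

```python
import re
import secrets

# version 00, 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a new trace with a random 16-byte trace id and 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def child_traceparent(incoming):
    """Continue an incoming trace: keep trace id and flags, mint a new span id.

    Returns (header for outgoing calls, parent span id). Falls back to a new
    trace when the incoming header is missing or malformed.
    """
    match = TRACEPARENT.match(incoming or "")
    if match is None:
        return new_traceparent(), None
    trace_id, parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}", parent_span_id

# Gateway starts the trace; the payment service continues it.
gateway_header = new_traceparent()
payment_header, parent_span = child_traceparent(gateway_header)
```

In real services this bookkeeping is handled by the tracing SDK's propagators; the sketch only shows why every span in a request ends up sharing one trace ID while recording its own parent.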

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project that has become the de facto standard for telemetry collection. It provides vendor-neutral APIs, SDKs, and the Collector for metrics, logs, and traces, so applications can be instrumented once and export to any compatible backend.

# otel-collector-config.yaml - OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: service.namespace
        value: checkout
        action: upsert

  # Tail-based sampling (keep errors, slow requests, and 10% of all other traces)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  # Send traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

  # Send metrics to Prometheus (the server must run with --web.enable-remote-write-receiver)
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

  # Send logs to Loki
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample first, batch last: batching should come after any sampling or filtering
      processors: [resource, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
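The tail-sampling policies in this configuration amount to a decision made once per trace, after its spans have been buffered. The Python sketch below approximates that decision. It is a simplified model and not the Collector's implementation: the `Span` type and `keep_trace` function are hypothetical, the longest single span stands in for overall trace duration, and the hash-based probabilistic step is only one possible way to make the choice deterministic.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

def keep_trace(spans, latency_threshold_ms=2000, sampling_percentage=10):
    """Decide whether to keep a complete trace; any matching policy keeps it."""
    if any(span.is_error for span in spans):
        return True  # policy: errors
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True  # policy: slow-requests
    # policy: probabilistic, hashing the trace id so the decision is repeatable
    bucket = zlib.crc32(spans[0].trace_id.encode()) % 100
    return bucket < sampling_percentage

failed = keep_trace([Span("t1", 10, True)])
slow = keep_trace([Span("t2", 2500, False)])
```

Because the decision needs the whole trace, the Collector has to hold spans in memory for the `decision_wait` period, which is the main operational cost of tail sampling compared with head sampling.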

// OpenTelemetry instrumentation - Node.js example
// tracing.js - Initialize before any other imports
// Assumes the 1.x OpenTelemetry JS SDK packages; later major versions rename some of these exports.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

// Both exporters talk to the same Collector endpoint
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
    environment: process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({ url: otlpEndpoint }),
  // NodeSDK expects a metric reader that wraps the exporter, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: otlpEndpoint }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry SDK initialized');

// Graceful shutdown: flush pending telemetry, then exit even if the flush fails
process.on('SIGTERM', () => {
  sdk.shutdown()
    .catch((err) => console.error('Error shutting down OpenTelemetry SDK', err))
    .finally(() => process.exit(0));
});

Tracing Backend	Type	Storage	Best For
Jaeger	Open Source (CNCF)	Elasticsearch/OpenSearch, Cassandra (Kafka as an ingestion buffer)	Kubernetes-native tracing
Zipkin	Open Source	Elasticsearch, MySQL, Cassandra	Simple setup, Java ecosystem
Tempo (Grafana)	Open Source	Object storage (S3, GCS)	Cost-effective, Grafana integration
AWS X-Ray	Managed (AWS)	Managed	AWS Lambda, ECS, EKS
Azure App Insights	Managed (Azure)	Log Analytics workspace	Azure-native applications

Alerting Strategies

Alerts are the bridge between automated monitoring and human action. Poorly designed alerting, however, creates alert fatigue, one of the most common failure modes in operations. When teams receive hundreds of non-actionable alerts daily, they learn to ignore all of them, including the critical ones.

                            
Warning: Every alert must be actionable. If there is no action an engineer should take when an alert fires, the alert should not exist. Alerts that say "this metric is high" without context about impact or remediation steps actively harm reliability by training teams to ignore notifications.
                        

Alert Severity Levels

Severity	Impact	Response Time	Notification	Example
P1 - Critical	Service down, data loss risk	Immediate (wake up)	PagerDuty + phone call	Production database unreachable
P2 - High	Degraded performance, partial outage	15 minutes	PagerDuty + Slack	Error rate > 5% for 5 minutes
P3 - Medium	Non-critical issue, workaround exists	Business hours	Slack channel	Disk usage > 80%
P4 - Low	Informational, trend-based	Next sprint	Email / ticket	Certificate expiring in 30 days

Alertmanager Configuration

# alertmanager.yml - Routing and notification configuration
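global:
  resolve_timeout: 5m
  smtp_from: "alerts@company.com"
  smtp_smarthost: "smtp.company.com:587"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
  # Required by the slack_configs receivers below (example path; keep the webhook URL in a secret)
  slack_api_url_file: "/etc/alertmanager/secrets/slack-webhook-url"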

# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Routing tree - match alerts to receivers
route:
  receiver: "slack-default"
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts → PagerDuty (wake people up)
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 10s
      repeat_interval: 1h
      continue: true

    # High alerts → PagerDuty + Slack
    - match:
        severity: high
      receiver: "pagerduty-high"
      group_wait: 30s
      repeat_interval: 2h

    # Infrastructure team alerts
    - match:
        team: infrastructure
      receiver: "slack-infrastructure"
      routes:
        - match:
            severity: critical
          receiver: "pagerduty-infra"

    # Silence during maintenance windows
    - match_re:
        alertname: "^Maintenance.*"
      receiver: "null"

# Receivers define notification channels
receivers:
  - name: "null"

  - name: "slack-default"
    slack_configs:
      - channel: "#alerts-general"
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.severity | toUpper }}* - {{ .Annotations.summary }}
          {{ .Annotations.description }}
          {{ end }}

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key_file: "/etc/alertmanager/secrets/pagerduty-critical-key"
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'

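  - name: "pagerduty-high"
    pagerduty_configs:
      - service_key_file: "/etc/alertmanager/secrets/pagerduty-high-key"
        severity: error
    # Also post to Slack so high-severity alerts reach both channels
    slack_configs:
      - channel: "#alerts-general"
        send_resolved: true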

  - name: "slack-infrastructure"
    slack_configs:
      - channel: "#alerts-infrastructure"
        send_resolved: true

# Inhibition rules - suppress less severe alerts when critical fires
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "cluster", "service"]

  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ["cluster"]
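Inhibition rules are easier to reason about once the matching logic is spelled out. The Python sketch below is a simplified model of the two rules above, treating alerts as plain label dictionaries. `is_inhibited` and the sample alerts are hypothetical and do not reflect Alertmanager's internal code.

```python
def is_inhibited(alert, firing_alerts, rules):
    """Return True if a firing source alert suppresses `alert`.

    Each rule is (source_match, target_match, equal): the alert must match
    target_match, another firing alert must match source_match, and both
    must carry identical values for every label listed in `equal`.
    """
    def matches(labels, wanted):
        return all(labels.get(key) == value for key, value in wanted.items())

    for source_match, target_match, equal in rules:
        if not matches(alert, target_match):
            continue
        for other in firing_alerts:
            if other is alert or not matches(other, source_match):
                continue
            if all(alert.get(key) == other.get(key) for key in equal):
                return True
    return False

rules = [
    ({"severity": "critical"}, {"severity": "warning"}, ["alertname", "cluster", "service"]),
    ({"severity": "critical"}, {"severity": "high"}, ["cluster"]),
]
critical = {"alertname": "HighErrorRate", "severity": "critical", "cluster": "prod-1"}
high_same_cluster = {"alertname": "HighLatency", "severity": "high", "cluster": "prod-1"}
high_other_cluster = {"alertname": "HighLatency", "severity": "high", "cluster": "prod-2"}
firing = [critical, high_same_cluster, high_other_cluster]
```

With these inputs only the high-severity alert in `prod-1` is suppressed, because it shares the `cluster` label with a firing critical alert. Broad `equal` lists such as `["cluster"]` are powerful but can hide unrelated problems, so they deserve review.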

Runbooks and Automated Remediation

Every alert should link to a runbook — a document describing the alert, its impact, diagnostic steps, and remediation actions. For common issues, automate the remediation entirely:

# Example: Automated remediation with KEDA (Kubernetes Event-Driven Autoscaling)
# When free disk space drops below the threshold, trigger a cleanup job
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: disk-cleanup
  namespace: maintenance
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: cleanup
            image: alpine:3.18
            # NOTE: the volume backing /data must be mounted into this container
            # (volumes/volumeMounts omitted here); otherwise the job only sees its own filesystem
            command:
              - /bin/sh
              - -c
              - |
                echo "Running disk cleanup..."
                find /data/logs -mtime +7 -delete
                find /data/tmp -mtime +1 -delete
                echo "Cleanup complete"
        restartPolicy: Never
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: node_filesystem_avail_bytes
        # "< bool" yields 1 or 0 instead of filtering; max() collapses the result to a single series
        query: |
          max((node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) < bool 0.15)
        threshold: "1"
  pollingInterval: 60
  maxReplicaCount: 1

SLOs, SLIs, and SLAs

Service Level Objectives bring mathematical rigor to reliability. Instead of vague goals like "the system should be fast," SLOs define exactly what "reliable" means and provide a framework for making trade-offs between reliability and velocity.

Concept	Definition	Audience	Example
SLA	Business contract with consequences for violations	Customers, legal	"99.9% uptime or credits issued"
SLO	Internal reliability target (stricter than SLA)	Engineering teams	"99.95% of requests succeed within 200ms"
SLI	Measured metric used to assess SLO compliance	Monitoring systems	"Ratio of successful requests under 200ms"

Error Budgets and Burn Rate

The error budget is the complement of your SLO (100% minus the target): the amount of unreliability you can tolerate. A 99.9% SLO gives you a 0.1% error budget, approximately 43 minutes of downtime per 30-day month. When the budget is exhausted, you stop deploying new features and focus on reliability.

                            
Error Budget Math: SLO = 99.9% → Error Budget = 0.1% → 43.2 minutes/month. If your burn rate is 2x (consuming budget twice as fast as expected), the budget will be exhausted in ~15 days instead of 30. Alert when burn rate exceeds thresholds over multiple windows.
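The arithmetic behind these numbers is short enough to write down. The Python sketch below reproduces it, assuming a 30-day window and a constant burn rate; the function names are chosen for this example.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime allowed by an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on target' the budget is being spent."""
    return observed_error_ratio / (1 - slo)

def days_until_exhausted(rate, window_days=30):
    """How long a full budget lasts at a constant burn rate."""
    return window_days / rate

budget = error_budget_minutes(0.999)      # about 43.2 minutes
double_burn = days_until_exhausted(2)     # 15 days
current = burn_rate(0.0144, 0.999)        # about 14.4 when 1.44% of requests fail
```

A burn rate of 1 means the budget runs out exactly at the end of the window; anything above 1 means it runs out early.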
                        

SLO Error Budget Workflow

flowchart TD
    A[Define SLI] --> B[Set SLO Target]
    B --> C[Calculate Error Budget]
    C --> D{Budget Remaining?}
    D -->|Yes - Budget Healthy| E[Ship Features]
    D -->|No - Budget Exhausted| F[Freeze Deployments]
    E --> G[Monitor Burn Rate]
    F --> H[Focus on Reliability]
    G --> I{Burn Rate High?}
    I -->|Yes| J[Alert + Investigate]
    I -->|No| E
    J --> K{Incident?}
    K -->|Yes| L[Incident Response]
    K -->|No| M[Tune Alert Threshold]
    L --> N[Post-Mortem]
    N --> O[Reduce Future Burn]
    O --> G
    H --> G
    M --> G

# PromQL: SLO-based alerting with multi-window burn rate
# Reference: Google SRE Workbook Chapter 5

# SLI: Ratio of successful (non-5xx) requests; the queries below do not filter on latency
# SLO: 99.9% over 30 days
# Error budget: 0.1% = 43.2 minutes/month

# Fast burn alert: 14.4x burn rate over 1 hour (2% budget in 1h)
# → Pages on-call immediately
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
  )
) > (14.4 * 0.001)
and
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  )
) > (14.4 * 0.001)

# Slow burn alert: 6x burn rate over 6 hours (5% budget in 6h)
# → Tickets for investigation
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[6h]))
    /
    sum(rate(http_requests_total[6h]))
  )
) > (6 * 0.001)
and
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[30m]))
    /
    sum(rate(http_requests_total[30m]))
  )
) > (6 * 0.001)
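The structure of these queries can be summarised in a few lines of Python. This is a sketch of the multi-window idea only, under the assumption of a 99.9% SLO over 30 days; the function names are hypothetical, and the error ratios would come from the PromQL expressions in a real setup.

```python
def should_alert(error_ratio_long, error_ratio_short, burn_rate_threshold, slo=0.999):
    """Multi-window rule: fire only if BOTH windows exceed the burn-rate threshold."""
    limit = burn_rate_threshold * (1 - slo)
    return error_ratio_long > limit and error_ratio_short > limit

def budget_consumed(burn_rate_value, window_hours, slo_period_hours=30 * 24):
    """Fraction of the period's error budget spent at this burn rate for this long."""
    return burn_rate_value * window_hours / slo_period_hours

fast_burn_budget = budget_consumed(14.4, 1)           # about 0.02, i.e. 2% in one hour
ongoing_incident = should_alert(0.02, 0.02, 14.4)     # both windows are bad
already_recovered = should_alert(0.02, 0.001, 14.4)   # short window is healthy again
```

The long window establishes that a meaningful share of budget has been spent, while the short window confirms the problem is still happening, which keeps the alert from lingering after recovery.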

Cloud-Native Monitoring

Every major cloud provider offers integrated monitoring services. These provide deep integration with cloud resources, managed infrastructure, and pay-per-use pricing — but at the cost of vendor lock-in.

Capability	AWS	Azure	GCP
Metrics	CloudWatch Metrics	Azure Monitor Metrics	Cloud Monitoring
Logs	CloudWatch Logs	Log Analytics (KQL)	Cloud Logging
Tracing	X-Ray	Application Insights	Cloud Trace
Dashboards	CloudWatch Dashboards	Azure Dashboards / Workbooks	Cloud Monitoring Dashboards
Alerting	CloudWatch Alarms + SNS	Azure Monitor Alerts	Alerting Policies
APM	X-Ray + CloudWatch RUM	Application Insights	Cloud Profiler
Audit	CloudTrail	Activity Log	Cloud Audit Logs

Cloud-Native vs Open-Source: When to Use Which

                            
                            Decision Framework: Use cloud-native monitoring when you are single-cloud, want zero operational overhead, and need deep integration with managed services. Use open-source (Prometheus + Grafana + Loki) when you are multi-cloud, want portability, need advanced PromQL capabilities, or want to avoid per-metric/per-GB pricing at scale.
                        

# AWS CloudWatch - Query metrics with CLI
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2026-05-14T00:00:00Z \
  --end-time 2026-05-14T12:00:00Z \
  --period 300 \
  --statistics Average Maximum

# Azure Monitor - Query logs with KQL
az monitor log-analytics query \
  --workspace "my-workspace-id" \
  --analytics-query "
    AppRequests
    | where TimeGenerated > ago(1h)
    | where toint(ResultCode) >= 500
    | summarize ErrorCount=count() by bin(TimeGenerated, 5m), AppRoleName
    | order by TimeGenerated desc
  " \
  --output table

# GCP Cloud Monitoring - List metrics
gcloud monitoring metrics-descriptors list \
  --filter='metric.type = starts_with("compute.googleapis.com/instance/cpu")'
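When choosing `--period` for the CloudWatch query above, it helps to estimate how many datapoints the time range will produce, since a single GetMetricStatistics call returns at most 1,440 datapoints. A small sketch, with a hypothetical helper name, using the same timestamps as the CLI example:

```python
from datetime import datetime

MAX_DATAPOINTS_PER_CALL = 1440  # CloudWatch GetMetricStatistics limit


def expected_datapoints(start: str, end: str, period_seconds: int) -> int:
    """Number of periods that fit between two UTC timestamps (ISO 8601 with a Z suffix)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    span = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(span.total_seconds() // period_seconds)


count = expected_datapoints("2026-05-14T00:00:00Z", "2026-05-14T12:00:00Z", 300)
print(count)                             # 144
print(count <= MAX_DATAPOINTS_PER_CALL)  # True
```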

Infrastructure Monitoring with Terraform

Monitoring should be treated as code — versioned, reviewed, and deployed through the same CI/CD pipelines as your infrastructure. Terraform can provision monitoring stacks, configure alerts, and manage dashboards declaratively.

Deploying Prometheus + Grafana with Helm

# Deploy kube-prometheus-stack via Helm
resource "helm_release" "kube_prometheus" {
  name       = "kube-prometheus"
  namespace  = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  version    = "56.6.2"

  create_namespace = true

  # Prometheus configuration
  set {
    name  = "prometheus.prometheusSpec.retention"
    value = "30d"
  }
  set {
    name  = "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage"
    value = "100Gi"
  }
  set {
    name  = "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName"
    value = "gp3"
  }

  # Grafana configuration
  # set_sensitive keeps the password out of plan and apply output
  set_sensitive {
    name  = "grafana.adminPassword"
    value = var.grafana_admin_password
  }
  set {
    name  = "grafana.persistence.enabled"
    value = "true"
  }
  set {
    name  = "grafana.persistence.size"
    value = "10Gi"
  }

  # Alertmanager configuration
  set {
    name  = "alertmanager.alertmanagerSpec.retention"
    value = "120h"
  }

  # Let Prometheus discover all ServiceMonitors, not only those labelled with this Helm release
  set {
    name  = "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues"
    value = "false"
  }
}
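Whether 100Gi is enough for 30 days of retention depends on ingestion volume. The Prometheus documentation gives a rough sizing formula (retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The sketch below applies it; the 10,000 samples/second figure is a made-up example, and WAL and compaction overhead are not modelled.

```python
GIB = 2 ** 30


def needed_disk_bytes(retention_days: float, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Rough capacity estimate: retention seconds * ingest rate * bytes per sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Hypothetical cluster ingesting 10,000 samples per second
estimate = needed_disk_bytes(retention_days=30, samples_per_second=10_000)
print(f"{estimate / GIB:.1f} GiB")   # 48.3 GiB
print(estimate <= 100 * GIB)         # True: fits the 100Gi volume with headroom
```

You can read the actual ingest rate from a running Prometheus with `rate(prometheus_tsdb_head_samples_appended_total[5m])`.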

Cloud Alert Resources via Terraform

# AWS CloudWatch Alarm via Terraform
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.service_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
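  # 60-second periods assume detailed monitoring is enabled; basic EC2 monitoring publishes at 5-minute intervals
  period              = 60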
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80% for 3 minutes"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = var.common_tags
}

# AWS CloudWatch composite alarm
# Assumes a second metric alarm, aws_cloudwatch_metric_alarm.high_error_rate, is defined elsewhere
resource "aws_cloudwatch_composite_alarm" "service_health" {
  alarm_name = "${var.service_name}-health-composite"
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name})"

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
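To make the alarm semantics concrete, here is a simplified simulation of how `evaluation_periods = 3` and the composite `AND` rule combine. It is only a sketch: real CloudWatch evaluation also supports "M out of N" datapoints and configurable missing-data handling, neither of which is modelled here, and the function names are illustrative.

```python
from typing import Sequence


def metric_alarm_state(datapoints: Sequence[float], threshold: float,
                       evaluation_periods: int) -> str:
    """ALARM when each of the most recent `evaluation_periods` datapoints exceeds the threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


def composite_and(*states: str) -> str:
    """Composite rule ALARM(a) AND ALARM(b): every child alarm must be in ALARM."""
    return "ALARM" if all(state == "ALARM" for state in states) else "OK"


cpu = metric_alarm_state([70, 85, 90, 95], threshold=80, evaluation_periods=3)
errors = metric_alarm_state([0.2, 0.4, 0.3], threshold=1.0, evaluation_periods=3)
print(cpu)                         # ALARM
print(errors)                      # OK
print(composite_and(cpu, errors))  # OK: high CPU alone does not page
```

The composite alarm is what keeps a CPU spike without user-facing errors from reaching the critical channel.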

# Azure Monitor Alert via Terraform
resource "azurerm_monitor_metric_alert" "response_time" {
  name                = "${var.service_name}-response-time"
  resource_group_name = azurerm_resource_group.main.name
  scopes              = [azurerm_linux_web_app.main.id]
  description         = "Alert when average response time exceeds 2 seconds"
  severity            = 2
  frequency           = "PT1M"
  window_size         = "PT5M"

  criteria {
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "HttpResponseTime"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 2
  }

  action {
    action_group_id = azurerm_monitor_action_group.platform.id
  }

  tags = var.common_tags
}

resource "azurerm_monitor_action_group" "platform" {
  name                = "platform-alerts"
  resource_group_name = azurerm_resource_group.main.name
  short_name          = "platform"

  email_receiver {
    name          = "oncall"
    email_address = "oncall@company.com"
  }

  webhook_receiver {
    name        = "pagerduty"
    service_uri = "https://events.pagerduty.com/integration/${var.pd_key}/enqueue"
  }
}

Hands-On Exercises

Exercise 1 (45 minutes)

Deploy Prometheus and Write PromQL Queries

Deploy Prometheus using Docker Compose with Node Exporter. Configure scraping, then write PromQL queries to answer operational questions about your system.

Create a docker-compose.yml with Prometheus + Node Exporter + a sample app
Configure prometheus.yml with scrape targets
Write PromQL queries for: CPU utilization rate, memory usage percentage, top 3 endpoints by request rate, 95th percentile latency, and error rate per service
Create recording rules for expensive queries you would use in dashboards

Prometheus PromQL Docker
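For the 95th percentile query, it helps to know what `histogram_quantile()` does with bucket counts: it finds the bucket containing the target rank and interpolates linearly inside it. The following is a simplified re-implementation for intuition only, not Prometheus source code; the bucket values are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Cumulative counts for le="0.1", "0.2", "0.5", "+Inf" (seconds)
latency = [(0.1, 50), (0.2, 90), (0.5, 98), (math.inf, 100)]
print(round(histogram_quantile(0.95, latency), 4))  # 0.3875
print(histogram_quantile(0.99, latency))            # 0.5
```

The second result shows why bucket boundaries matter: once the rank lands in the +Inf bucket, the estimate is capped at the largest finite bound.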

Exercise 2 (40 minutes)

Build a Grafana Dashboard for Infrastructure Metrics

Create a golden-signals dashboard in Grafana that provides at-a-glance infrastructure health and supports drill-down into problem areas.

Connect Grafana to your Prometheus instance from Exercise 1
Create a dashboard with 6 panels: request rate, error rate, P95 latency, CPU utilization, memory usage, and disk available
Add template variables for environment and instance filtering
Configure threshold colors (green/yellow/red) on stat panels
Export the dashboard as JSON and commit it to version control

Grafana Dashboards Visualization

Exercise 3 (35 minutes)

Configure Alerting with Severity Routing

Set up Alertmanager with a multi-tier routing configuration that directs alerts to different channels based on severity and team ownership.

Deploy Alertmanager alongside Prometheus
Create alerting rules for: high error rate (P1), disk space low (P3), and pod crash looping (P2)
Configure routing: P1 → PagerDuty simulation, P2 → Slack webhook, P3 → email
Add inhibition rules to suppress warnings when a related critical alert fires
Test by triggering alerts and verifying correct routing

Alertmanager Routing On-Call
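Before writing the Alertmanager configuration, it can help to model the intended behavior. The sketch below is plain Python, not Alertmanager code: it routes by a `severity` label and suppresses lower-severity alerts when a P1 alert is firing for the same `service`. The label names, alert names, and receiver names are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Alert = Dict[str, str]

# Severity to receiver mapping from the exercise
ROUTES = {"P1": "pagerduty", "P2": "slack", "P3": "email"}


def is_inhibited(alert: Alert, firing: List[Alert],
                 source_severity: str = "P1",
                 target_severities: Tuple[str, ...] = ("P2", "P3"),
                 equal: Tuple[str, ...] = ("service",)) -> bool:
    """True if a firing source alert shares all `equal` label values with this target alert."""
    if alert.get("severity") not in target_severities:
        return False
    return any(
        source.get("severity") == source_severity
        and all(source.get(label) == alert.get(label) for label in equal)
        for source in firing
    )


def dispatch(firing: List[Alert]) -> List[Tuple[str, str]]:
    """Return (alertname, receiver) pairs for alerts that are not inhibited."""
    return [
        (alert["alertname"], ROUTES.get(alert.get("severity", ""), "email"))
        for alert in firing
        if not is_inhibited(alert, firing)
    ]


alerts = [
    {"alertname": "HighErrorRate", "severity": "P1", "service": "api"},
    {"alertname": "DiskSpaceLow", "severity": "P3", "service": "api"},
    {"alertname": "PodCrashLooping", "severity": "P2", "service": "worker"},
]
print(dispatch(alerts))
# [('HighErrorRate', 'pagerduty'), ('PodCrashLooping', 'slack')]
```

The disk alert for `api` is suppressed because the P1 alert for the same service is already paging someone; the `worker` alert is unaffected.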

Exercise 4 (30 minutes)

Define SLOs and Calculate Error Budgets

Define meaningful SLOs for a web service, implement SLI measurement in PromQL, calculate error budgets, and create burn-rate alerts.

Define SLOs: 99.9% availability, 99% of requests under 200ms latency
Write PromQL SLI queries that measure compliance over 30-day windows
Calculate the error budget in minutes for each SLO
Create multi-window burn rate alerts (fast: 1h/5m, slow: 6h/30m)
Simulate an incident and observe budget consumption

SLO Error Budget Burn Rate
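When you simulate the incident, budget consumption can be computed from request counts: the number of bad events divided by the number the SLO allows. A minimal sketch with made-up counts and an illustrative function name; for the latency SLO, "good" means requests that completed in under 200ms.

```python
def budget_consumed(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget used: bad events / allowed bad events."""
    if total == 0:
        return 0.0
    bad = total - good
    allowed_bad = (1.0 - slo) * total
    return bad / allowed_bad


# Availability SLO 99.9%: 50 failed requests out of 100,000
print(round(budget_consumed(99_950, 100_000, slo=0.999), 2))  # 0.5

# Latency SLO 99% under 200ms: 600 slow requests out of 100,000
print(round(budget_consumed(99_400, 100_000, slo=0.99), 2))   # 0.6
```

A value above 1.0 means the SLO has been violated for the period being measured.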

Conclusion & Next Steps

Observability is not a product you buy — it is a property of your system. By instrumenting your infrastructure with comprehensive metrics, structured logs, and distributed traces, you gain the ability to understand system behavior, detect anomalies before users notice, and diagnose root causes in minutes instead of hours.

The key principles to carry forward:

Three pillars together — metrics detect, logs diagnose, traces explain the full picture
Prometheus + Grafana — the industry standard for cloud-native metrics and visualization
OpenTelemetry — the universal standard for instrumentation; invest in it now
Actionable alerts only — every alert must have a clear action and runbook
SLOs drive decisions — error budgets provide the framework for balancing reliability and velocity
Monitoring as Code — dashboards, alerts, and configurations belong in version control

Next in the Series

In Part 14: Platform Engineering, we will explore Internal Developer Platforms, Backstage, developer experience, self-service infrastructure, and golden paths — building the abstractions that let development teams move fast without sacrificing operational quality.
