Part 13: Monitoring & Observability

May 14, 2026 · Wasil Zafar · 55 min read

Build complete observability into your infrastructure — from metrics and logs to distributed traces and intelligent alerting — so you can detect, diagnose, and resolve issues before they impact users.

Table of Contents

  1. Why Observability Matters
  2. Metrics
  3. Prometheus
  4. Grafana
  5. Centralized Logging
  6. Distributed Tracing
  7. Alerting Strategies
  8. SLOs, SLIs, and SLAs
  9. Cloud-Native Monitoring
  10. Infrastructure Monitoring with Terraform
  11. Hands-On Exercises
  12. Conclusion & Next Steps

Why Observability Matters

Modern distributed systems are inherently complex. Microservices communicate across networks, containers spin up and down dynamically, and cloud resources scale automatically. When something goes wrong — and it will — you need to understand what happened, where it happened, and why it happened. This is the domain of observability.

Traditional monitoring answers known questions: "Is the server up? Is CPU above 80%?" Observability goes further — it enables you to ask new questions you never anticipated. It shifts you from reactive firefighting to proactive understanding of system behavior.

Key Insight: Monitoring tells you when something is broken. Observability tells you why it is broken — even for failure modes you have never seen before. In distributed systems, you cannot predict every possible failure; you need the ability to explore system state from the outside using telemetry data.

The Three Pillars of Observability

Observability rests on three complementary data types, each providing a different lens into system behavior:

  • Metrics — Numeric time-series data aggregated over time (e.g., request rate, error percentage, latency percentiles). Cheap to store, fast to query, ideal for dashboards and alerts.
  • Logs — Timestamped text records of discrete events (e.g., "User X failed authentication at 14:32:05"). Rich context but expensive at scale. Essential for debugging specific incidents.
  • Traces — End-to-end records of request flow across services. Each trace contains spans showing timing, dependencies, and errors through the entire call chain.
Three Pillars of Observability
flowchart TD
    A[Observability] --> B[Metrics]
    A --> C[Logs]
    A --> D[Traces]
    B --> B1[Counters & Gauges]
    B --> B2[Dashboards & Alerts]
    B --> B3[Trend Analysis]
    C --> C1[Event Records]
    C --> C2[Error Details]
    C --> C3[Audit Trail]
    D --> D1[Request Flow]
    D --> D2[Latency Breakdown]
    D --> D3[Dependency Map]
    B1 --> E[Detect]
    C1 --> F[Diagnose]
    D1 --> G[Understand]
    E --> H[Resolve Faster]
    F --> H
    G --> H
                            

The three pillars work together: metrics detect anomalies, logs help diagnose root causes, and traces help understand the full picture of how a request flowed through the system. No single pillar is sufficient alone.

| Pillar | Data Type | Best For | Storage Cost | Query Speed |
| --- | --- | --- | --- | --- |
| Metrics | Numeric time-series | Alerting, dashboards, trends | Low | Fast |
| Logs | Text events | Debugging, audit, context | High | Medium |
| Traces | Span trees | Request flow, latency | Medium | Medium |

Metrics

Metrics are numeric measurements collected at regular intervals. They are the foundation of monitoring — cheap to store, fast to query, and ideal for detecting when something deviates from normal behavior.

Metric Types

| Type | Description | Example | Use Case |
| --- | --- | --- | --- |
| Counter | Monotonically increasing value (resets to zero on restart) | http_requests_total | Request rates, error counts |
| Gauge | Value that can go up or down | node_memory_MemAvailable_bytes | Memory usage, queue depth |
| Histogram | Samples counted in configurable buckets | http_request_duration_seconds | Latency percentiles (p50, p95, p99) |
| Summary | Client-calculated quantiles | go_gc_duration_seconds | Pre-computed percentiles |
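
To make the Histogram row concrete, here is a minimal Python sketch of how cumulative bucket counts can yield a percentile estimate through linear interpolation. It is similar in spirit to what `histogram_quantile()` does, but the bucket bounds, the sample latencies, and the function names are made up for illustration, and the real Prometheus function handles more edge cases.

```python
import bisect
import math

# Hypothetical bucket upper bounds in seconds; the last bucket is +Inf.
BOUNDS = [0.1, 0.25, 0.5, 1.0, math.inf]


def observe_all(samples):
    """Return cumulative bucket counts (samples <= each upper bound)."""
    counts = [0] * len(BOUNDS)
    for value in samples:
        # Index of the first bucket whose upper bound is >= value.
        counts[bisect.bisect_left(BOUNDS, value)] += 1
    cumulative, running = [], 0
    for count in counts:
        running += count
        cumulative.append(running)
    return cumulative


def estimate_quantile(q, cumulative):
    """Estimate quantile q (0..1) by interpolating inside one bucket."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            upper = BOUNDS[i]
            if math.isinf(upper):
                # Cannot interpolate into +Inf; use the highest finite bound.
                return BOUNDS[i - 1]
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            in_bucket = count - below
            if in_bucket == 0:
                return lower
            return lower + (upper - lower) * (rank - below) / in_bucket
    return math.nan


latencies = [0.05, 0.08, 0.12, 0.2, 0.3, 0.4, 0.45, 0.6, 0.9, 2.0]
buckets = observe_all(latencies)
p50 = estimate_quantile(0.5, buckets)
p95 = estimate_quantile(0.95, buckets)
print(buckets, round(p50, 3), p95)
```

The estimate is only as precise as the bucket layout: the p95 here lands in the +Inf bucket, so the sketch can report no more than the highest finite bound. That is why bucket boundaries are usually chosen around the latency targets you care about.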

Golden Signals, RED, and USE Methods

Three frameworks help determine what to measure. Each targets different layers of the stack:

Google SRE Golden Signals: The four signals that matter most for user-facing systems: Latency (how long requests take), Traffic (how much demand), Errors (how many failures), and Saturation (how full the system is).

| Framework | Focus | Signals | Best For |
| --- | --- | --- | --- |
| Golden Signals | User-facing services | Latency, Traffic, Errors, Saturation | APIs, web apps, microservices |
| RED Method | Request-driven services | Rate, Errors, Duration | HTTP services, gRPC endpoints |
| USE Method | Infrastructure resources | Utilization, Saturation, Errors | CPU, memory, disk, network |
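
As a rough illustration of the RED method, the sketch below derives Rate, Errors, and Duration from a batch of request records for one observation window. The `Request` record, the `red_summary` helper, and the sample data are hypothetical; in practice these numbers come from your metrics backend rather than from raw request lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration (p95) for one window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    if durations:
        # Nearest-rank p95: ceil(0.95 * total), computed with integers.
        rank = max(1, -(-95 * total // 100))
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_s": p95,
    }


# 18 successful requests plus 2 server errors observed over 10 seconds.
sample = [Request(200, 0.05 * i) for i in range(1, 19)]
sample += [Request(500, 1.5), Request(503, 2.0)]
summary = red_summary(sample, window_s=10)
print(summary)
```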

Key Infrastructure Metrics

# Key infrastructure metrics to monitor

# CPU
# - node_cpu_seconds_total (counter) → rate() for utilization
# - system.cpu.utilization (gauge, 0-1)

# Memory
# - node_memory_MemAvailable_bytes (gauge)
# - node_memory_MemTotal_bytes (gauge)
# → 1 - (Available / Total) = memory utilization (multiply by 100 for a percentage)

# Disk
# - node_filesystem_avail_bytes (gauge)
# - node_disk_io_time_seconds_total (counter) → rate() for IO utilization

# Network
# - node_network_receive_bytes_total (counter) → rate() for throughput
# - node_network_transmit_bytes_total (counter)

# Containers (cAdvisor / kubelet)
# - container_cpu_usage_seconds_total
# - container_memory_working_set_bytes
# - container_network_receive_bytes_total

# Kubernetes object state (kube-state-metrics)
# - kube_pod_container_status_restarts_total
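
The memory metrics above map onto fields in `/proc/meminfo`, which Node Exporter reads on Linux. As a small sketch of the utilization arithmetic, the code below parses meminfo-style text and computes utilization as one minus the available fraction. The sample text and function names are illustrative only.

```python
# Illustrative /proc/meminfo excerpt; real files contain many more fields.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:    4096000 kB
"""


def parse_meminfo(text):
    """Parse meminfo-style text into {field_name: bytes}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        amount = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            amount *= 1024  # the kernel reports these fields in KiB
        values[name.strip()] = amount
    return values


def memory_utilization(meminfo):
    """Fraction of memory in use: 1 - MemAvailable / MemTotal."""
    return 1 - meminfo["MemAvailable"] / meminfo["MemTotal"]


meminfo = parse_meminfo(SAMPLE_MEMINFO)
utilization = memory_utilization(meminfo)
print(f"{utilization:.0%}")
```

Note that `MemAvailable` rather than `MemFree` is used: free memory excludes reclaimable page cache, so it tends to overstate memory pressure.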

Prometheus

Prometheus is the de facto standard for cloud-native metrics collection. Originally built at SoundCloud and donated to the CNCF, it provides a pull-based scraping model, powerful query language (PromQL), and built-in alerting — all designed for dynamic, container-based environments.

Prometheus Architecture
flowchart LR
    subgraph Targets
        A[Application /metrics]
        B[Node Exporter]
        C[cAdvisor]
        D[Custom Exporter]
    end
    subgraph Prometheus
        E[Prometheus Server]
        F[TSDB Storage]
        G[Rule Engine]
    end
    subgraph Alerting
        H[Alertmanager]
        I[PagerDuty]
        J[Slack]
        K[Email]
    end
    subgraph Visualization
        L[Grafana]
        M[Prometheus UI]
    end
    A -->|scrape| E
    B -->|scrape| E
    C -->|scrape| E
    D -->|scrape| E
    E --> F
    E --> G
    G -->|fire alerts| H
    H --> I
    H --> J
    H --> K
    E -->|query| L
    E -->|query| M
                            

Prometheus Configuration

# prometheus.yml - Main configuration file
global:
  scrape_interval: 15s          # How often to scrape targets
  evaluation_interval: 15s      # How often to evaluate rules
  scrape_timeout: 10s           # Timeout for scrape requests

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files for recording and alerting
rule_files:
  - "recording_rules.yml"
  - "alerting_rules.yml"

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter for system metrics
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.+):.*"
        target_label: instance
        replacement: "$1"

  # Kubernetes pods with prometheus.io annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

PromQL Fundamentals

PromQL is a functional query language purpose-built for time-series data. It allows instant vector queries (single point in time) and range vector queries (data over time windows).

# PromQL Query Examples

# --- Instant Vectors ---
# Current HTTP request rate (per second, averaged over 5 minutes)
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# 95th percentile latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# --- Aggregations ---
# Total requests per service
sum by (service) (rate(http_requests_total[5m]))

# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))

# Memory utilization percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# --- Range Vectors & Functions ---
# Average request rate over 1 hour
avg_over_time(rate(http_requests_total[5m])[1h:5m])

# Predict disk full in 4 hours (linear extrapolation)
predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0

# Per-second rate of change of a gauge
# (deriv() is meant for gauges; use rate() for counters such as http_errors_total)
deriv(node_memory_MemAvailable_bytes[15m])
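
Since most of the queries above build on `rate()`, it may help to see roughly what it computes. The following Python sketch calculates the per-second increase of a counter over a window and treats any drop in value as a counter reset. It is a simplification under stated assumptions: the real function also extrapolates to the window boundaries, and `simple_rate` and the sample data are invented for this example.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (e.g. process restart).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # After a reset the counter restarted from zero,
            # so the entire current value counts as new increase.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Scraped every 15s; the process restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
per_second = simple_rate(scrapes)
print(round(per_second, 3))
```

This reset handling is the reason `rate()` is safe on counters but misleading on gauges, where a decrease is a legitimate value change rather than a restart.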

Recording and Alerting Rules

# recording_rules.yml - Pre-compute expensive queries
groups:
  - name: http_recording_rules
    interval: 30s
    rules:
      # Record request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Record error ratio per service
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Record p99 latency per service
      - record: job:http_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

# alerting_rules.yml - Define alert conditions
groups:
  - name: infrastructure_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # Disk space running low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} available on {{ $labels.mountpoint }}"

      # Pod crash looping
      - alert: PodCrashLooping
        # increase() yields the restart count over the window, which matches the description below
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted {{ $value }} times in 15m"
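
The `for:` clause in the rules above means an alert has to stay true for that long before it fires. Here is a minimal Python sketch of that pending-to-firing behaviour, assuming one boolean result per evaluation cycle. The function name and inputs are hypothetical, and real Prometheus tracks this state per label set.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation cycle.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Condition is true for 3 cycles, clears once, then stays true.
results = [True] * 3 + [False] + [True] * 7
states = alert_states(results, eval_interval_s=60, for_s=300)
print(states)
```

The single false evaluation resets the timer, so a brief recovery delays the page by a full `for:` period. That trade-off is worth keeping in mind when choosing the duration.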

Service Discovery

Prometheus dynamically discovers targets in cloud-native environments using service discovery mechanisms:

| Discovery Type | Use Case | Configuration |
| --- | --- | --- |
| kubernetes_sd | Pods, services, endpoints in K8s | role: pod/service/endpoints |
| ec2_sd | AWS EC2 instances | IAM role, region, filters |
| azure_sd | Azure VMs and scale sets | Subscription, tenant, tags |
| consul_sd | Consul-registered services | Consul server address |
| file_sd | Static file-based discovery | JSON/YAML target files |
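
File-based discovery is the simplest mechanism to script yourself. As a sketch, the Python below builds the list of target groups (each with `targets` and `labels`) that a `file_sd` JSON file contains. The host names, port, and labels are placeholders, and how the file is delivered to Prometheus is left out.

```python
import json


def build_file_sd(hosts, port, labels):
    """Build the list-of-target-groups structure that file_sd reads."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": dict(labels),
        }
    ]


# Placeholder inventory; in practice this might come from a CMDB or cloud API.
target_groups = build_file_sd(
    hosts=["node1", "node2", "node3"],
    port=9100,
    labels={"env": "production", "role": "node-exporter"},
)
document = json.dumps(target_groups, indent=2)
print(document)
```

When writing such a file to disk, writing to a temporary file and renaming it into place is a common way to avoid Prometheus reading a half-written file.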

Grafana

Grafana is the industry-standard visualization platform for observability data. It connects to dozens of data sources (Prometheus, Loki, Elasticsearch, CloudWatch, Azure Monitor) and provides rich, interactive dashboards for metrics, logs, and traces.

Dashboard Design Principles

  • Layer dashboards: High-level overview → service-specific → detailed debug
  • Use variables: Allow filtering by environment, region, service, and instance
  • Golden signals first: Every service dashboard should show latency, traffic, errors, and saturation prominently
  • Correlate panels: Place related metrics side-by-side (e.g., latency + error rate)
  • Include annotations: Overlay deployment markers and incidents on time-series graphs
Common Mistake: Creating dashboards with 50+ panels that nobody reads. Start with 4-6 panels showing the golden signals, then link to detailed dashboards for drill-down. The most effective dashboards answer a single question clearly.

Grafana as Code

{
  "dashboard": {
    "title": "Service Overview - Payment API",
    "tags": ["production", "payment", "golden-signals"],
    "timezone": "utc",
    "refresh": "30s",
    "templating": {
      "list": [
        {
          "name": "environment",
          "type": "query",
          "query": "label_values(http_requests_total, environment)",
          "current": { "text": "production", "value": "production" }
        },
        {
          "name": "instance",
          "type": "query",
          "query": "label_values(http_requests_total{environment=\"$environment\"}, instance)"
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{environment=\"$environment\"}[5m]))",
            "legendFormat": "Total Requests/sec"
          }
        ]
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\",environment=\"$environment\"}[5m])) / sum(rate(http_requests_total{environment=\"$environment\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P95 Latency (ms)",
        "type": "timeseries",
        "gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{environment=\"$environment\"}[5m])) by (le)) * 1000",
            "legendFormat": "P95 Latency"
          }
        ]
      }
    ]
  }
}

# Terraform: Deploy Grafana dashboard from JSON file
resource "grafana_dashboard" "payment_service" {
  config_json = file("${path.module}/dashboards/payment-service.json")
  folder      = grafana_folder.services.id
  overwrite   = true
}

resource "grafana_folder" "services" {
  title = "Service Dashboards"
}

# Grafana data source configuration
resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus-server.monitoring.svc:9090"

  json_data_encoded = jsonencode({
    timeInterval = "15s"
    httpMethod   = "POST"
  })
}

# Grafana alerting via Terraform
resource "grafana_contact_point" "pagerduty" {
  name = "PagerDuty - Critical"

  pagerduty {
    integration_key = var.pagerduty_integration_key
    severity        = "critical"
  }
}

resource "grafana_notification_policy" "default" {
  contact_point = grafana_contact_point.pagerduty.name
  group_by      = ["alertname", "service"]

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point = grafana_contact_point.pagerduty.name
    group_wait    = "30s"
    group_interval = "5m"
    repeat_interval = "4h"
  }
}

Centralized Logging

In distributed systems, logs are scattered across dozens or hundreds of containers, VMs, and services. Centralized logging collects all logs into a single queryable system, making it possible to correlate events across services and debug complex issues.

Centralized Logging Architecture
flowchart LR
    subgraph Sources
        A[Application Pods]
        B[System Logs]
        C[Load Balancer]
        D[Database]
    end
    subgraph Collection
        E[Fluentd / Fluent Bit]
        F[Filebeat]
    end
    subgraph Processing
        G[Logstash / Fluentd]
    end
    subgraph Storage & Query
        H[Elasticsearch]
        I[Loki]
        J[CloudWatch Logs]
    end
    subgraph Visualization
        K[Kibana]
        L[Grafana]
        M[CloudWatch Insights]
    end
    A --> E
    B --> E
    C --> F
    D --> F
    E --> G
    F --> G
    G --> H
    G --> I
    G --> J
    H --> K
    I --> L
    J --> M
                            

Structured Logging Best Practices

Structured logs (JSON format) are machine-parseable, enabling efficient querying and indexing. Always emit structured logs in production:

{
  "timestamp": "2026-05-14T10:30:45.123Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "method": "POST",
  "path": "/api/v1/payments",
  "status_code": 500,
  "duration_ms": 2340,
  "user_id": "usr_12345",
  "error": "connection timeout to payment gateway",
  "error_type": "TimeoutError",
  "retry_count": 3,
  "environment": "production",
  "region": "us-east-1",
  "pod": "payment-api-7f8b9c6d4-x2k9p"
}
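
One way to emit logs in this shape is a custom formatter on top of Python's standard `logging` module. The sketch below is an assumption-laden example rather than a recommended library: the `JsonFormatter` class, the hard-coded service name, and the chosen fields are illustrative.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "payment-api",  # illustrative; usually injected from config
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record.
        for key in ("trace_id", "span_id", "status_code", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def make_logger(stream):
    logger = logging.getLogger("payment-api")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = make_logger(buffer)
log.error(
    "connection timeout to payment gateway",
    extra={"trace_id": "abc123def456", "status_code": 500, "duration_ms": 2340},
)
entry = json.loads(buffer.getvalue())
print(entry["level"], entry["trace_id"], entry["status_code"])
```

Carrying `trace_id` on every line is what later lets you jump from a log entry to the matching trace.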

Fluentd Configuration

# fluentd.conf - Kubernetes log collection
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

# Enrich with Kubernetes metadata
<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
</filter>

# Parse JSON application logs
<filter kubernetes.**>
  @type parser
  key_name log
  reserve_data true
  remove_key_name_field true
  <parse>
    @type json
  </parse>
</filter>

# Output to Elasticsearch
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  logstash_format true
  logstash_prefix k8s-logs
  include_tag_key true
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.buffer
    flush_mode interval
    flush_interval 5s
    retry_max_interval 30
    chunk_limit_size 2M
    total_limit_size 500M
  </buffer>
</match>

Logging Solution Comparison

| Solution | Architecture | Query Language | Best For | Cost Model |
| --- | --- | --- | --- | --- |
| ELK Stack | Full-text indexing | KQL / Lucene | Complex searches, analytics | Self-hosted (resource-heavy) |
| Loki | Label-based, log chunks | LogQL | Kubernetes, cost-effective | Self-hosted (lightweight) |
| CloudWatch Logs | Managed (AWS) | CloudWatch Insights | AWS-native workloads | Per GB ingested + stored |
| Azure Monitor Logs | Managed (Azure) | KQL (Kusto) | Azure-native workloads | Per GB ingested + retained |
| Datadog Logs | SaaS | Datadog query syntax | Multi-cloud, unified platform | Per GB ingested (expensive) |

Distributed Tracing

When a single user request passes through 5, 10, or 20 microservices, understanding where time is spent and where errors occur becomes nearly impossible without tracing. Distributed tracing gives you the end-to-end story of every request.

Spans and Context Propagation

A trace represents the complete journey of a request. Each trace is composed of spans — individual units of work within a service. Spans form a tree structure showing parent-child relationships and timing.
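
Context propagation is what ties spans from different services into one trace. A widely used wire format is the W3C Trace Context `traceparent` header, which carries a version, a trace ID, the parent span ID, and flags. The Python sketch below shows the idea of keeping the trace ID while minting a new span ID for each downstream call. The helper names are hypothetical, and a real service would rely on an OpenTelemetry SDK rather than hand-rolled code.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent_header):
    """Keep the trace ID, mint a new span ID for the downstream call."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        # Unparseable header: start a fresh trace instead of propagating garbage.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Every service that forwards the header this way contributes spans that share one trace ID, which is what allows a backend to assemble the tree described above.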

Distributed Trace Across Microservices
sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant AS as Auth Service
    participant PS as Payment Service
    participant DB as Database
    participant MQ as Message Queue
    participant NS as Notification Service

    U->>GW: POST /checkout (trace_id: abc123)
    GW->>AS: Validate Token (span: auth-check)
    AS-->>GW: Token Valid (12ms)
    GW->>PS: Process Payment (span: payment)
    PS->>DB: Insert Transaction (span: db-write)
    DB-->>PS: OK (45ms)
    PS->>MQ: Publish Event (span: queue-publish)
    MQ-->>PS: ACK (8ms)
    PS-->>GW: Payment Success (120ms)
    MQ->>NS: Send Confirmation (span: notify)
    NS-->>MQ: Sent (200ms)
    GW-->>U: 200 OK (150ms total)
                            

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project that has become the de facto standard for telemetry collection. It provides vendor-neutral APIs, SDKs, and the Collector for metrics, logs, and traces, which makes it a common instrumentation layer across languages and backends.

# otel-collector-config.yaml - OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape Prometheus metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: service.namespace
        value: checkout
        action: upsert

  # Tail-based sampling (keep errors + 10% of success)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  # Send traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

  # Send metrics to Prometheus (the server must be started with
  # --web.enable-remote-write-receiver to accept remote writes)
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

  # Send logs to Loki
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      # tail_sampling needs to see whole traces, so it runs before batch
      processors: [resource, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
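
The `tail_sampling` policies in the Collector configuration can be read as a simple decision: keep a trace if it contains an error, if it is slow, or if it wins a 10% lottery. The Python sketch below models only that decision. The span dictionaries and function names are assumptions for illustration, and the real processor additionally buffers spans for `decision_wait` before deciding.

```python
import random


def trace_duration_ms(spans):
    """Duration from the earliest span start to the latest span end."""
    return max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)


def keep_trace(spans, latency_threshold_ms=2000, sample_pct=10, rng=random.random):
    """Return True if any sampling policy matches the completed trace."""
    if any(s["status"] == "ERROR" for s in spans):
        return True  # policy: errors
    if trace_duration_ms(spans) > latency_threshold_ms:
        return True  # policy: slow-requests
    # policy: probabilistic (rng returns a float in [0, 1))
    return rng() * 100 < sample_pct


error_trace = [{"status": "ERROR", "start_ms": 0, "end_ms": 50}]
slow_trace = [{"status": "OK", "start_ms": 0, "end_ms": 2500}]
fast_trace = [
    {"status": "OK", "start_ms": 0, "end_ms": 120},
    {"status": "OK", "start_ms": 10, "end_ms": 60},
]

print(keep_trace(error_trace), keep_trace(slow_trace))
print(keep_trace(fast_trace, rng=lambda: 0.05), keep_trace(fast_trace, rng=lambda: 0.5))
```

Because the decision depends on the finished trace, all spans of a trace have to reach the same Collector instance, which matters once you run more than one replica.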

// OpenTelemetry instrumentation - Node.js example
// tracing.js - Initialize before any other imports
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
    environment: process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
  }),
  // NodeSDK expects a metric reader; the reader wraps the exporter and pushes periodically
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
console.log('OpenTelemetry SDK initialized');

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().then(() => process.exit(0));
});

| Tracing Backend | Type | Storage | Best For |
| --- | --- | --- | --- |
| Jaeger | Open Source (CNCF) | Elasticsearch, Cassandra (Kafka as an ingestion buffer) | Kubernetes-native tracing |
| Zipkin | Open Source | Elasticsearch, MySQL, Cassandra | Simple setup, Java ecosystem |
| Tempo (Grafana) | Open Source | Object storage (S3, GCS) | Cost-effective, Grafana integration |
| AWS X-Ray | Managed (AWS) | Managed | AWS Lambda, ECS, EKS |
| Azure App Insights | Managed (Azure) | Log Analytics workspace | Azure-native applications |

Alerting Strategies

Alerts are the bridge between automated monitoring and human action. But poorly designed alerting creates alert fatigue, one of the most common problems in operations. When teams receive hundreds of non-actionable alerts daily, they learn to ignore them all, including the critical ones.

Warning: Every alert must be actionable. If there is no action an engineer should take when an alert fires, the alert should not exist. Alerts that say "this metric is high" without context about impact or remediation steps actively harm reliability by training teams to ignore notifications.

Alert Severity Levels

| Severity | Impact | Response Time | Notification | Example |
| --- | --- | --- | --- | --- |
| P1 - Critical | Service down, data loss risk | Immediate (wake up) | PagerDuty + phone call | Production database unreachable |
| P2 - High | Degraded performance, partial outage | 15 minutes | PagerDuty + Slack | Error rate > 5% for 5 minutes |
| P3 - Medium | Non-critical issue, workaround exists | Business hours | Slack channel | Disk usage > 80% |
| P4 - Low | Informational, trend-based | Next sprint | Email / ticket | Certificate expiring in 30 days |

Alertmanager Configuration

# alertmanager.yml - Routing and notification configuration
global:
  resolve_timeout: 5m
  smtp_from: "alerts@company.com"
  smtp_smarthost: "smtp.company.com:587"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Routing tree - match alerts to receivers
route:
  receiver: "slack-default"
  group_by: ["alertname", "cluster", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts → PagerDuty (wake people up)
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 10s
      repeat_interval: 1h
      continue: true

    # High alerts → PagerDuty (no "continue", so routing stops here)
    - match:
        severity: high
      receiver: "pagerduty-high"
      group_wait: 30s
      repeat_interval: 2h

    # Infrastructure team alerts
    - match:
        team: infrastructure
      receiver: "slack-infrastructure"
      routes:
        - match:
            severity: critical
          receiver: "pagerduty-infra"

    # Drop alerts whose name starts with "Maintenance" by routing them to a no-op receiver
    - match_re:
        alertname: "^Maintenance.*"
      receiver: "null"

# Receivers define notification channels
receivers:
  - name: "null"

  - name: "slack-default"
    slack_configs:
      - channel: "#alerts-general"
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.severity | toUpper }}* - {{ .Annotations.summary }}
          {{ .Annotations.description }}
          {{ end }}

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key_file: "/etc/alertmanager/secrets/pagerduty-critical-key"
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'

  - name: "pagerduty-high"
    pagerduty_configs:
      - service_key_file: "/etc/alertmanager/secrets/pagerduty-high-key"
        severity: error

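  - name: "slack-infrastructure"
    slack_configs:
      - channel: "#alerts-infrastructure"
        send_resolved: true

  # Referenced by the nested infrastructure route; every receiver named in a
  # route must be defined here. The key file path is a placeholder.
  - name: "pagerduty-infra"
    pagerduty_configs:
      - service_key_file: "/etc/alertmanager/secrets/pagerduty-infra-key"
        severity: critical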

# Inhibition rules - suppress less severe alerts when critical fires
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "cluster", "service"]

  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ["cluster"]
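
The `group_by` setting decides which alerts are bundled into one notification. As a rough Python sketch of that grouping step, alerts with identical values for the `group_by` labels end up in the same bucket. The alert dictionaries and `group_alerts` are invented for this example, and Alertmanager's timing behaviour (`group_wait`, `group_interval`) is not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighErrorRate", "cluster": "eu-1", "service": "payment", "instance": "a"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "eu-1", "service": "payment", "instance": "b"}},
    {"labels": {"alertname": "DiskSpaceLow", "cluster": "eu-1", "service": "storage", "instance": "c"}},
]

grouped = group_alerts(alerts, ["alertname", "cluster", "service"])
for key, members in grouped.items():
    print(key, len(members))
```

Here the two `HighErrorRate` alerts differ only by `instance`, so they collapse into one notification instead of two pages.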

Runbooks and Automated Remediation

Every alert should link to a runbook — a document describing the alert, its impact, diagnostic steps, and remediation actions. For common issues, automate the remediation entirely:

# Example: Automated remediation with Kubernetes Event-Driven Autoscaling
# When disk usage exceeds threshold, trigger cleanup job
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: disk-cleanup
  namespace: maintenance
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: cleanup
            image: alpine:3.18
            command:
              - /bin/sh
              - -c
              - |
                echo "Running disk cleanup..."
                find /data/logs -mtime +7 -delete
                find /data/tmp -mtime +1 -delete
                echo "Cleanup complete"
        restartPolicy: Never
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: node_filesystem_avail_bytes
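        # "bool" makes the comparison return 1 (below 15% free) or 0, so the
        # result can be compared against the threshold of 1
        query: |
          max((node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) < bool 0.15)
        threshold: "1"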
  pollingInterval: 60
  maxReplicaCount: 1

SLOs, SLIs, and SLAs

Service Level Objectives bring mathematical rigor to reliability. Instead of vague goals like "the system should be fast," SLOs define exactly what "reliable" means and provide a framework for making trade-offs between reliability and velocity.

| Concept | Definition | Audience | Example |
| --- | --- | --- | --- |
| SLA | Business contract with consequences for violations | Customers, legal | "99.9% uptime or credits issued" |
| SLO | Internal reliability target (stricter than SLA) | Engineering teams | "99.95% of requests succeed within 200ms" |
| SLI | Measured metric used to assess SLO compliance | Monitoring systems | "Ratio of successful requests under 200ms" |

Error Budgets and Burn Rate

The error budget is the complement of your SLO (1 minus the SLO): the amount of unreliability you can tolerate. A 99.9% SLO gives you a 0.1% error budget (approximately 43 minutes of downtime in a 30-day month). When the budget is exhausted, you stop deploying new features and focus on reliability.

Error Budget Math: SLO = 99.9% → Error Budget = 0.1% → 43.2 minutes/month. If your burn rate is 2x (consuming budget twice as fast as expected), the budget will be exhausted in ~15 days instead of 30. Alert when burn rate exceeds thresholds over multiple windows.
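
The arithmetic in the callout is easy to check in code. This short Python sketch reproduces the numbers, assuming a 30-day window; the function names are made up for the example.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an SLO given as a fraction (0.999 = 99.9%)."""
    return (1 - slo) * window_days * 24 * 60


def days_to_exhaustion(burn_rate, window_days=30):
    """Days until the budget is gone if the burn rate stays constant."""
    return window_days / burn_rate


def budget_consumed(burn_rate, hours, window_days=30):
    """Fraction of the total budget consumed after `hours` at a given burn rate."""
    return burn_rate * hours / (window_days * 24)


print(round(error_budget_minutes(0.999), 1))  # 99.9% SLO
print(days_to_exhaustion(2))                  # 2x burn rate
print(round(budget_consumed(14.4, 1), 3))     # 14.4x for one hour
```

A burn rate of 1 means the budget lasts exactly the length of the window; anything above 1 spends it early.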
SLO Error Budget Workflow
flowchart TD
    A[Define SLI] --> B[Set SLO Target]
    B --> C[Calculate Error Budget]
    C --> D{Budget Remaining?}
    D -->|Yes - Budget Healthy| E[Ship Features]
    D -->|No - Budget Exhausted| F[Freeze Deployments]
    E --> G[Monitor Burn Rate]
    F --> H[Focus on Reliability]
    G --> I{Burn Rate High?}
    I -->|Yes| J[Alert + Investigate]
    I -->|No| E
    J --> K{Incident?}
    K -->|Yes| L[Incident Response]
    K -->|No| M[Tune Alert Threshold]
    L --> N[Post-Mortem]
    N --> O[Reduce Future Burn]
    O --> G
    H --> G
    M --> G
                            
# PromQL: SLO-based alerting with multi-window burn rate
# Reference: Google SRE Workbook Chapter 5

# SLI: Ratio of successful (non-5xx) requests; latency is not part of these queries
# SLO: 99.9% over 30 days
# Error budget: 0.1% = 43.2 minutes/month

# Fast burn alert: 14.4x burn rate over 1 hour (2% budget in 1h)
# → Pages on-call immediately
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
  )
) > (14.4 * 0.001)
and
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  )
) > (14.4 * 0.001)

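# Slow burn alert: 3x burn rate over 6 hours (2.5% of budget per 6h, 30% if sustained for 3 days)
# (the SRE Workbook pairs the 6h/30m windows with a 6x burn rate, i.e. 5% of budget in 6h)
# → Tickets for investigation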
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[6h]))
    /
    sum(rate(http_requests_total[6h]))
  )
) > (3 * 0.001)
and
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[30m]))
    /
    sum(rate(http_requests_total[30m]))
  )
) > (3 * 0.001)
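The burn-rate multipliers in these rules follow from one relationship: burn rate = (fraction of budget spent) × (SLO period) / (alert window). The sketch below, assuming a 30-day SLO period and using hypothetical helper names, derives the 14.4x factor and mirrors the `and` between the long and short windows.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period, matching the rules above


def burn_rate_for(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def budget_spent(burn_rate: float, window_hours: float) -> float:
    """Fraction of the budget spent after `window_hours` at `burn_rate`."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS


def should_alert(long_ratio: float, short_ratio: float,
                 burn_rate: float, slo: float = 0.999) -> bool:
    """Fire only when both windows exceed burn_rate * (1 - slo)."""
    threshold = burn_rate * (1.0 - slo)
    return long_ratio > threshold and short_ratio > threshold


if __name__ == "__main__":
    print(round(burn_rate_for(0.02, 1), 1))   # 14.4
    print(round(budget_spent(3, 6), 3))       # 0.025
    print(should_alert(0.02, 0.02, 14.4))     # True
    print(should_alert(0.02, 0.001, 14.4))    # False
```

The short window is what lets the alert clear quickly: once the error ratio over the last 5 or 30 minutes drops below the threshold, the condition stops firing even though the long window still contains the incident.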

Cloud-Native Monitoring

Every major cloud provider offers integrated monitoring services. These provide deep integration with cloud resources, managed infrastructure, and pay-per-use pricing — but at the cost of vendor lock-in.

| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Metrics | CloudWatch Metrics | Azure Monitor Metrics | Cloud Monitoring |
| Logs | CloudWatch Logs | Log Analytics (KQL) | Cloud Logging |
| Tracing | X-Ray | Application Insights | Cloud Trace |
| Dashboards | CloudWatch Dashboards | Azure Dashboards / Workbooks | Cloud Monitoring Dashboards |
| Alerting | CloudWatch Alarms + SNS | Azure Monitor Alerts | Alerting Policies |
| APM | X-Ray + CloudWatch RUM | Application Insights | Cloud Profiler |
| Audit | CloudTrail | Activity Log | Cloud Audit Logs |

Cloud-Native vs Open-Source: When to Use Which

Decision Framework: Use cloud-native monitoring when you are single-cloud, want zero operational overhead, and need deep integration with managed services. Use open-source (Prometheus + Grafana + Loki) when you are multi-cloud, want portability, need advanced PromQL capabilities, or want to avoid per-metric/per-GB pricing at scale.
# AWS CloudWatch - Query metrics with CLI
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2026-05-14T00:00:00Z \
  --end-time 2026-05-14T12:00:00Z \
  --period 300 \
  --statistics Average Maximum

# Azure Monitor - Query logs with KQL
az monitor log-analytics query \
  --workspace "my-workspace-id" \
  --analytics-query "
    AppRequests
    | where TimeGenerated > ago(1h)
    | where toint(ResultCode) >= 500
    | summarize ErrorCount=count() by bin(TimeGenerated, 5m), AppRoleName
    | order by TimeGenerated desc
  " \
  --output table

# GCP Cloud Monitoring - List metrics
gcloud monitoring metrics-descriptors list \
  --filter='metric.type = starts_with("compute.googleapis.com/instance/cpu")'

Infrastructure Monitoring with Terraform

Monitoring should be treated as code — versioned, reviewed, and deployed through the same CI/CD pipelines as your infrastructure. Terraform can provision monitoring stacks, configure alerts, and manage dashboards declaratively.

Deploying Prometheus + Grafana with Helm

# Deploy kube-prometheus-stack via Helm
resource "helm_release" "kube_prometheus" {
  name       = "kube-prometheus"
  namespace  = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  version    = "56.6.2"

  create_namespace = true

  # Prometheus configuration
  set {
    name  = "prometheus.prometheusSpec.retention"
    value = "30d"
  }
  set {
    name  = "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage"
    value = "100Gi"
  }
  set {
    name  = "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName"
    value = "gp3"
  }

  # Grafana configuration
  # set_sensitive keeps the password value out of plan output
  set_sensitive {
    name  = "grafana.adminPassword"
    value = var.grafana_admin_password
  }
  set {
    name  = "grafana.persistence.enabled"
    value = "true"
  }
  set {
    name  = "grafana.persistence.size"
    value = "10Gi"
  }

  # Alertmanager configuration
  set {
    name  = "alertmanager.alertmanagerSpec.retention"
    value = "120h"
  }

  # Discover ServiceMonitors from any release, not only those labeled by this chart
  set {
    name  = "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues"
    value = "false"
  }
}

Cloud Alert Resources via Terraform

# AWS CloudWatch Alarm via Terraform
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.service_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU utilization exceeds 80% for 3 minutes"

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]

  tags = var.common_tags
}

# AWS CloudWatch composite alarm
# (aws_cloudwatch_metric_alarm.high_error_rate is assumed to be defined elsewhere)
resource "aws_cloudwatch_composite_alarm" "service_health" {
  alarm_name = "${var.service_name}-health-composite"
  alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name})"

  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
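To make the alarm semantics above concrete, here is a simplified Python model of how `evaluation_periods = 3` with a 60-second `period` and an `AND` composite rule behave. It is only a sketch under stated assumptions: it ignores CloudWatch features such as `datapoints_to_alarm` and missing-data handling, and the function names are hypothetical.

```python
from typing import Sequence


def metric_alarm_state(datapoints: Sequence[float], threshold: float,
                       evaluation_periods: int) -> str:
    """ALARM when the most recent N datapoints all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


def composite_and(*states: str) -> str:
    """Model of an 'ALARM(a) AND ALARM(b)' rule: every child must be in ALARM."""
    return "ALARM" if all(state == "ALARM" for state in states) else "OK"


if __name__ == "__main__":
    # One datapoint per 60-second period (illustrative values).
    cpu = [72.0, 85.0, 91.0, 88.0]
    error_rate = [0.2, 0.3, 0.1, 0.2]
    cpu_state = metric_alarm_state(cpu, threshold=80, evaluation_periods=3)
    err_state = metric_alarm_state(error_rate, threshold=1.0, evaluation_periods=3)
    print(cpu_state, err_state, composite_and(cpu_state, err_state))
```

In this model, high CPU alone does not trigger the composite alarm; both child alarms must be in ALARM, which is the noise-reduction effect the composite resource is used for.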
# Azure Monitor Alert via Terraform
resource "azurerm_monitor_metric_alert" "response_time" {
  name                = "${var.service_name}-response-time"
  resource_group_name = azurerm_resource_group.main.name
  scopes              = [azurerm_linux_web_app.main.id]
  description         = "Alert when average response time exceeds 2 seconds"
  severity            = 2
  frequency           = "PT1M"
  window_size         = "PT5M"

  criteria {
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "HttpResponseTime"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 2
  }

  action {
    action_group_id = azurerm_monitor_action_group.platform.id
  }

  tags = var.common_tags
}

resource "azurerm_monitor_action_group" "platform" {
  name                = "platform-alerts"
  resource_group_name = azurerm_resource_group.main.name
  short_name          = "platform"

  email_receiver {
    name          = "oncall"
    email_address = "oncall@company.com"
  }

  webhook_receiver {
    name        = "pagerduty"
    service_uri = "https://events.pagerduty.com/integration/${var.pd_key}/enqueue"
  }
}

Hands-On Exercises

Exercise 1 45 minutes

Deploy Prometheus and Write PromQL Queries

Deploy Prometheus using Docker Compose with Node Exporter. Configure scraping, then write PromQL queries to answer operational questions about your system.

  1. Create a docker-compose.yml with Prometheus + Node Exporter + a sample app
  2. Configure prometheus.yml with scrape targets
  3. Write PromQL queries for: CPU utilization rate, memory usage percentage, top 3 endpoints by request rate, 95th percentile latency, and error rate per service
  4. Create recording rules for expensive queries you would use in dashboards
Prometheus PromQL Docker

Exercise 2 40 minutes

Build a Grafana Dashboard for Infrastructure Metrics

Create a golden-signals dashboard in Grafana that provides at-a-glance infrastructure health and supports drill-down into problem areas.

  1. Connect Grafana to your Prometheus instance from Exercise 1
  2. Create a dashboard with 6 panels: request rate, error rate, P95 latency, CPU utilization, memory usage, and disk available
  3. Add template variables for environment and instance filtering
  4. Configure threshold colors (green/yellow/red) on stat panels
  5. Export the dashboard as JSON and commit it to version control
Grafana Dashboards Visualization

Exercise 3 35 minutes

Configure Alerting with Severity Routing

Set up Alertmanager with a multi-tier routing configuration that directs alerts to different channels based on severity and team ownership.

  1. Deploy Alertmanager alongside Prometheus
  2. Create alerting rules for: high error rate (P1), disk space low (P3), and pod crash looping (P2)
  3. Configure routing: P1 → PagerDuty simulation, P2 → Slack webhook, P3 → email
  4. Add inhibition rules to suppress warnings when a related critical alert fires
  5. Test by triggering alerts and verifying correct routing
Alertmanager Routing On-Call

Exercise 4 30 minutes

Define SLOs and Calculate Error Budgets

Define meaningful SLOs for a web service, implement SLI measurement in PromQL, calculate error budgets, and create burn-rate alerts.

  1. Define SLOs: 99.9% availability, 99% of requests under 200ms latency
  2. Write PromQL SLI queries that measure compliance over 30-day windows
  3. Calculate the error budget in minutes for each SLO
  4. Create multi-window burn rate alerts (fast: 1h/5m, slow: 6h/30m)
  5. Simulate an incident and observe budget consumption
SLO Error Budget Burn Rate
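For step 5 of Exercise 4, a request-based view of the budget is often easier to reason about than minutes of downtime. The following sketch assumes you already have total and failed request counts for the window (for example from your PromQL SLI queries); the function names and the sample numbers are illustrative only.

```python
def sli(good_requests: int, total_requests: int) -> float:
    """Measured SLI: fraction of requests that met the success criterion."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def budget_consumed_fraction(total_requests: int, failed_requests: int,
                             slo: float) -> float:
    """Share of the error budget used, where budget = total * (1 - slo)."""
    allowed_failures = total_requests * (1.0 - slo)
    return failed_requests / allowed_failures


if __name__ == "__main__":
    total = 10_000_000   # requests in the 30-day window (made-up figure)
    failed = 2_500       # requests that failed during a simulated incident
    print(round(sli(total - failed, total), 5))                      # 0.99975
    print(round(budget_consumed_fraction(total, failed, 0.999), 2))  # 0.25
```

A result above 1.0 means the SLO has been missed for the window, which is the point at which the workflow earlier in this part calls for freezing deployments.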

Conclusion & Next Steps

Observability is not a product you buy — it is a property of your system. By instrumenting your infrastructure with comprehensive metrics, structured logs, and distributed traces, you gain the ability to understand system behavior, detect anomalies before users notice, and diagnose root causes in minutes instead of hours.

The key principles to carry forward:

  • Three pillars together — metrics detect, logs diagnose, traces explain the full picture
  • Prometheus + Grafana — the industry standard for cloud-native metrics and visualization
  • OpenTelemetry — the universal standard for instrumentation; invest in it now
  • Actionable alerts only — every alert must have a clear action and runbook
  • SLOs drive decisions — error budgets provide the framework for balancing reliability and velocity
  • Monitoring as Code — dashboards, alerts, and configurations belong in version control

Next in the Series

In Part 14: Platform Engineering, we will explore Internal Developer Platforms, Backstage, developer experience, self-service infrastructure, and golden paths — building the abstractions that let development teams move fast without sacrificing operational quality.