Why Observability Matters
Modern distributed systems are inherently complex. Microservices communicate across networks, containers spin up and down dynamically, and cloud resources scale automatically. When something goes wrong — and it will — you need to understand what happened, where it happened, and why it happened. This is the domain of observability.
Traditional monitoring answers known questions: "Is the server up? Is CPU above 80%?" Observability goes further — it enables you to ask new questions you never anticipated. It shifts you from reactive firefighting to proactive understanding of system behavior.
The Three Pillars of Observability
Observability rests on three complementary data types, each providing a different lens into system behavior:
- Metrics — Numeric time-series data aggregated over time (e.g., request rate, error percentage, latency percentiles). Cheap to store, fast to query, ideal for dashboards and alerts.
- Logs — Timestamped text records of discrete events (e.g., "User X failed authentication at 14:32:05"). Rich context but expensive at scale. Essential for debugging specific incidents.
- Traces — End-to-end records of request flow across services. Each trace contains spans showing timing, dependencies, and errors through the entire call chain.
flowchart TD
A[Observability] --> B[Metrics]
A --> C[Logs]
A --> D[Traces]
B --> B1[Counters & Gauges]
B --> B2[Dashboards & Alerts]
B --> B3[Trend Analysis]
C --> C1[Event Records]
C --> C2[Error Details]
C --> C3[Audit Trail]
D --> D1[Request Flow]
D --> D2[Latency Breakdown]
D --> D3[Dependency Map]
B1 --> E[Detect]
C1 --> F[Diagnose]
D1 --> G[Understand]
E --> H[Resolve Faster]
F --> H
G --> H
The three pillars work together: metrics detect anomalies, logs help diagnose root causes, and traces help understand the full picture of how a request flowed through the system. No single pillar is sufficient alone.
| Pillar | Data Type | Best For | Storage Cost | Query Speed |
|---|---|---|---|---|
| Metrics | Numeric time-series | Alerting, dashboards, trends | Low | Fast |
| Logs | Text events | Debugging, audit, context | High | Medium |
| Traces | Span trees | Request flow, latency | Medium | Medium |
Metrics
Metrics are numeric measurements collected at regular intervals. They are the foundation of monitoring — cheap to store, fast to query, and ideal for detecting when something deviates from normal behavior.
Metric Types
| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Monotonically increasing value | http_requests_total | Request rates, error counts |
| Gauge | Value that can go up or down | node_memory_MemAvailable_bytes | Available memory, queue depth |
| Histogram | Samples in configurable buckets | http_request_duration_seconds | Latency percentiles (p50, p95, p99) |
| Summary | Client-calculated quantiles | go_gc_duration_seconds | Pre-computed percentiles |
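To make the difference between the types concrete, here is a minimal in-memory sketch. It is not a real client library: the class names and the bucket bounds are assumptions chosen for illustration. The one detail it tries to mirror is that Prometheus histogram buckets are cumulative, so each observation increments every bucket whose upper bound is at or above the value.

```javascript
// Illustrative in-memory metric types (hypothetical classes, not a real client library).
class Counter {
  constructor() { this.value = 0; }
  inc(amount = 1) {
    if (amount < 0) throw new RangeError('counters only go up');
    this.value += amount;
  }
}

class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }
}

class Histogram {
  // upperBounds must be sorted ascending; a +Inf bucket is appended automatically.
  constructor(upperBounds) {
    this.bounds = [...upperBounds, Infinity];
    this.bucketCounts = this.bounds.map(() => 0); // cumulative, like "le" buckets
    this.sum = 0;
    this.count = 0;
  }
  observe(v) {
    this.bounds.forEach((le, i) => { if (v <= le) this.bucketCounts[i] += 1; });
    this.sum += v;
    this.count += 1;
  }
}

const requests = new Counter();
const queueDepth = new Gauge();
const latency = new Histogram([0.1, 0.5, 1]); // seconds, assumed bounds

requests.inc();
requests.inc();
queueDepth.set(7);
queueDepth.set(3); // a gauge simply reports the latest value
[0.05, 0.3, 0.3, 2].forEach((s) => latency.observe(s));

console.log(requests.value, queueDepth.value, latency.bucketCounts);
```

A summary would instead compute quantiles inside the process, which is why summaries cannot be aggregated across instances the way histogram buckets can.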
Golden Signals, RED, and USE Methods
Three frameworks help determine what to measure. Each targets different layers of the stack:
| Framework | Focus | Signals | Best For |
|---|---|---|---|
| Golden Signals | User-facing services | Latency, Traffic, Errors, Saturation | APIs, web apps, microservices |
| RED Method | Request-driven services | Rate, Errors, Duration | HTTP services, gRPC endpoints |
| USE Method | Infrastructure resources | Utilization, Saturation, Errors | CPU, memory, disk, network |
Key Infrastructure Metrics
# Key infrastructure metrics to monitor
# CPU
# - node_cpu_seconds_total (counter) → rate() for utilization
# - system.cpu.utilization (gauge, 0-1)
# Memory
# - node_memory_MemAvailable_bytes (gauge)
# - node_memory_MemTotal_bytes (gauge)
# → 1 - (Available / Total) = memory utilization (as a ratio)
# Disk
# - node_filesystem_avail_bytes (gauge)
# - node_disk_io_time_seconds_total (counter) → rate() for IO utilization
# Network
# - node_network_receive_bytes_total (counter) → rate() for throughput
# - node_network_transmit_bytes_total (counter)
# Containers (cAdvisor / kubelet)
# - container_cpu_usage_seconds_total
# - container_memory_working_set_bytes
# - container_network_receive_bytes_total
# - kube_pod_container_status_restarts_total (exposed by kube-state-metrics)
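The "counter → rate()" notes above boil down to simple arithmetic on two samples. The sketch below shows the idea under stated assumptions: samples are hypothetical `{ t, v }` objects (time in seconds, counter value), and a drop in value is treated as a counter reset. The real `rate()` function also extrapolates to the window edges, which is omitted here.

```javascript
// Per-second rate between two counter samples; a decrease is treated as a reset.
function counterRate(prev, curr) {
  const elapsed = curr.t - prev.t; // seconds
  if (elapsed <= 0) throw new RangeError('samples must be in time order');
  const delta = curr.v >= prev.v ? curr.v - prev.v : curr.v; // after a reset, count from zero
  return delta / elapsed;
}

// CPU utilization of one core from two idle-mode node_cpu_seconds_total samples.
function cpuUtilization(prevIdle, currIdle) {
  return 1 - counterRate(prevIdle, currIdle);
}

// Memory utilization from the two gauges listed above.
function memoryUtilization(availableBytes, totalBytes) {
  return 1 - availableBytes / totalBytes;
}

// 45 idle seconds accumulated over a 60 second window: the core was 25% busy.
console.log(cpuUtilization({ t: 0, v: 1000 }, { t: 60, v: 1045 })); // 0.25
console.log(memoryUtilization(4 * 2 ** 30, 16 * 2 ** 30)); // 0.75
```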
Prometheus
Prometheus is the de facto standard for cloud-native metrics collection. Originally built at SoundCloud and donated to the CNCF, it provides a pull-based scraping model, powerful query language (PromQL), and built-in alerting — all designed for dynamic, container-based environments.
flowchart LR
subgraph Targets
A[Application /metrics]
B[Node Exporter]
C[cAdvisor]
D[Custom Exporter]
end
subgraph Prometheus
E[Prometheus Server]
F[TSDB Storage]
G[Rule Engine]
end
subgraph Alerting
H[Alertmanager]
I[PagerDuty]
J[Slack]
K[Email]
end
subgraph Visualization
L[Grafana]
M[Prometheus UI]
end
A -->|scrape| E
B -->|scrape| E
C -->|scrape| E
D -->|scrape| E
E --> F
E --> G
G -->|fire alerts| H
H --> I
H --> J
H --> K
E -->|query| L
E -->|query| M
Prometheus Configuration
# prometheus.yml - Main configuration file
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate rules
scrape_timeout: 10s # Timeout for scrape requests
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Rule files for recording and alerting
rule_files:
- "recording_rules.yml"
- "alerting_rules.yml"
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
# Node Exporter for system metrics
- job_name: "node-exporter"
static_configs:
- targets:
- "node1:9100"
- "node2:9100"
- "node3:9100"
relabel_configs:
- source_labels: [__address__]
regex: "(.+):.*"
target_label: instance
replacement: "$1"
# Kubernetes pods with prometheus.io annotations
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
PromQL Fundamentals
PromQL is a functional query language purpose-built for time-series data. It allows instant vector queries (single point in time) and range vector queries (data over time windows).
# PromQL Query Examples
# --- Instant Vectors ---
# Current HTTP request rate (per second, averaged over 5 minutes)
rate(http_requests_total[5m])
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# 95th percentile latency from histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# --- Aggregations ---
# Total requests per service
sum by (service) (rate(http_requests_total[5m]))
# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total[5m])))
# Memory utilization percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# --- Range Vectors & Functions ---
# Average request rate over 1 hour
avg_over_time(rate(http_requests_total[5m])[1h:5m])
# Predict disk full in 4 hours (linear extrapolation)
predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
# Rate of change (derivative) of a gauge; deriv() is meant for gauges, use rate() for counters
deriv(node_memory_MemAvailable_bytes[15m])
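It can help to see what `histogram_quantile` does with the bucket counts it receives. The following is a simplified sketch of the linear interpolation, assuming classic cumulative buckets sorted by upper bound with a final `+Inf` bucket and `0 < q <= 1`. The function name and input shape are assumptions for illustration, and edge cases handled by Prometheus itself (for example negative bounds or native histograms) are left out.

```javascript
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le, last le = Infinity.
function histogramQuantile(q, buckets) {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total; // how many observations lie at or below the quantile
  const i = buckets.findIndex((b) => b.count >= rank);

  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const fraction = (rank - countBelow) / (buckets[i].count - countBelow);
  return lowerBound + (buckets[i].le - lowerBound) * fraction;
}

const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.95, latencyBuckets).toFixed(4));
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket bounds bracket the latencies you care about.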
Recording and Alerting Rules
# recording_rules.yml - Pre-compute expensive queries
groups:
- name: http_recording_rules
interval: 30s
rules:
# Record request rate per service
- record: job:http_requests:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
# Record error ratio per service
- record: job:http_errors:ratio5m
expr: |
sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (job) (rate(http_requests_total[5m]))
# Record p99 latency per service
- record: job:http_duration:p99_5m
expr: |
histogram_quantile(0.99,
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# alerting_rules.yml - Define alert conditions
groups:
- name: infrastructure_alerts
rules:
# High error rate
- alert: HighErrorRate
expr: job:http_errors:ratio5m > 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "High error rate on {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
# Disk space running low
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
for: 10m
labels:
severity: warning
team: infrastructure
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Only {{ $value | humanizePercentage }} available on {{ $labels.mountpoint }}"
# Pod crash looping
- alert: PodCrashLooping
expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
description: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted {{ $value }} times in 15m"
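The `for:` clause in these rules is what keeps a brief spike from paging anyone: the condition has to hold on every evaluation for the whole duration before the alert fires. A rough sketch of that state machine follows; the class name and method are hypothetical, and details of the real rule engine are not modeled.

```javascript
// Tracks one alert series through inactive -> pending -> firing.
class AlertState {
  constructor(forSeconds) {
    this.forSeconds = forSeconds;
    this.activeSince = null;
    this.state = 'inactive';
  }

  // Call once per rule evaluation with the current time (seconds)
  // and whether the alert expression returned a sample.
  evaluate(now, conditionTrue) {
    if (!conditionTrue) {
      this.activeSince = null; // any gap resets the timer
      this.state = 'inactive';
    } else {
      if (this.activeSince === null) this.activeSince = now;
      this.state = now - this.activeSince >= this.forSeconds ? 'firing' : 'pending';
    }
    return this.state;
  }
}

const highErrorRate = new AlertState(300); // for: 5m
const timeline = [
  highErrorRate.evaluate(0, true),
  highErrorRate.evaluate(150, true),
  highErrorRate.evaluate(300, true),
  highErrorRate.evaluate(315, false),
  highErrorRate.evaluate(330, true),
];
console.log(timeline.join(' -> '));
```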
Service Discovery
Prometheus dynamically discovers targets in cloud-native environments using service discovery mechanisms:
| Discovery Type | Use Case | Configuration |
|---|---|---|
| kubernetes_sd | Pods, services, endpoints in K8s | role: pod/service/endpoints |
| ec2_sd | AWS EC2 instances | IAM role, region, filters |
| azure_sd | Azure VMs and scale sets | Subscription, tenant, tags |
| consul_sd | Consul-registered services | Consul server address |
| file_sd | Static file-based discovery | JSON/YAML target files |
Grafana
Grafana is the industry-standard visualization platform for observability data. It connects to dozens of data sources (Prometheus, Loki, Elasticsearch, CloudWatch, Azure Monitor) and provides rich, interactive dashboards for metrics, logs, and traces.
Dashboard Design Principles
- Layer dashboards: High-level overview → service-specific → detailed debug
- Use variables: Allow filtering by environment, region, service, and instance
- Golden signals first: Every service dashboard should show latency, traffic, errors, and saturation prominently
- Correlate panels: Place related metrics side-by-side (e.g., latency + error rate)
- Include annotations: Overlay deployment markers and incidents on time-series graphs
Grafana as Code
{
"dashboard": {
"title": "Service Overview - Payment API",
"tags": ["production", "payment", "golden-signals"],
"timezone": "utc",
"refresh": "30s",
"templating": {
"list": [
{
"name": "environment",
"type": "query",
"query": "label_values(http_requests_total, environment)",
"current": { "text": "production", "value": "production" }
},
{
"name": "instance",
"type": "query",
"query": "label_values(http_requests_total{environment=\"$environment\"}, instance)"
}
]
},
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
"targets": [
{
"expr": "sum(rate(http_requests_total{environment=\"$environment\"}[5m]))",
"legendFormat": "Total Requests/sec"
}
]
},
{
"title": "Error Rate (%)",
"type": "stat",
"gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\",environment=\"$environment\"}[5m])) / sum(rate(http_requests_total{environment=\"$environment\"}[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 1, "color": "yellow" },
{ "value": 5, "color": "red" }
]
}
}
}
},
{
"title": "P95 Latency (ms)",
"type": "timeseries",
"gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 },
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{environment=\"$environment\"}[5m])) by (le)) * 1000",
"legendFormat": "P95 Latency"
}
]
}
]
}
}
# Terraform: Deploy Grafana dashboard from JSON file
resource "grafana_dashboard" "payment_service" {
config_json = file("${path.module}/dashboards/payment-service.json")
folder = grafana_folder.services.id
overwrite = true
}
resource "grafana_folder" "services" {
title = "Service Dashboards"
}
# Grafana data source configuration
resource "grafana_data_source" "prometheus" {
type = "prometheus"
name = "Prometheus"
url = "http://prometheus-server.monitoring.svc:9090"
json_data_encoded = jsonencode({
timeInterval = "15s"
httpMethod = "POST"
})
}
# Grafana alerting via Terraform
resource "grafana_contact_point" "pagerduty" {
name = "PagerDuty - Critical"
pagerduty {
integration_key = var.pagerduty_integration_key
severity = "critical"
}
}
resource "grafana_notification_policy" "default" {
contact_point = grafana_contact_point.pagerduty.name
group_by = ["alertname", "service"]
policy {
matcher {
label = "severity"
match = "="
value = "critical"
}
contact_point = grafana_contact_point.pagerduty.name
group_wait = "30s"
group_interval = "5m"
repeat_interval = "4h"
}
}
Centralized Logging
In distributed systems, logs are scattered across dozens or hundreds of containers, VMs, and services. Centralized logging collects all logs into a single queryable system, making it possible to correlate events across services and debug complex issues.
flowchart LR
subgraph Sources
A[Application Pods]
B[System Logs]
C[Load Balancer]
D[Database]
end
subgraph Collection
E[Fluentd / Fluent Bit]
F[Filebeat]
end
subgraph Processing
G[Logstash / Fluentd]
end
subgraph Storage & Query
H[Elasticsearch]
I[Loki]
J[CloudWatch Logs]
end
subgraph Visualization
K[Kibana]
L[Grafana]
M[CloudWatch Insights]
end
A --> E
B --> E
C --> F
D --> F
E --> G
F --> G
G --> H
G --> I
G --> J
H --> K
I --> L
J --> M
Structured Logging Best Practices
Structured logs (JSON format) are machine-parseable, enabling efficient querying and indexing. Always emit structured logs in production:
{
"timestamp": "2026-05-14T10:30:45.123Z",
"level": "error",
"service": "payment-api",
"trace_id": "abc123def456",
"span_id": "span789",
"method": "POST",
"path": "/api/v1/payments",
"status_code": 500,
"duration_ms": 2340,
"user_id": "usr_12345",
"error": "connection timeout to payment gateway",
"error_type": "TimeoutError",
"retry_count": 3,
"environment": "production",
"region": "us-east-1",
"pod": "payment-api-7f8b9c6d4-x2k9p"
}
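A small wrapper is usually enough to emit records shaped like the one above. The sketch below is one way to do it in Node.js, not a recommendation of a specific library: `createLogger`, the `message` field, and the injectable `write` function are assumptions made for illustration.

```javascript
// Minimal structured logger: one JSON object per line.
// `write` is injectable so output can be redirected (assumption for illustration).
function createLogger(baseFields, write = (line) => process.stdout.write(line + '\n')) {
  const log = (level, message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      ...baseFields, // static context: service, environment, region
      ...fields,     // per-event context: trace_id, status_code, duration_ms
      message,
    };
    write(JSON.stringify(entry));
    return entry;
  };
  return {
    info: (message, fields) => log('info', message, fields),
    error: (message, fields) => log('error', message, fields),
  };
}

const logger = createLogger({ service: 'payment-api', environment: 'production' });
logger.error('connection timeout to payment gateway', {
  trace_id: 'abc123def456',
  status_code: 500,
  duration_ms: 2340,
});
```

Keeping static context in the base fields means every line carries the service and environment without each call site having to remember them.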
Fluentd Configuration
# fluentd.conf - Kubernetes log collection
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
# Enrich with Kubernetes metadata
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
</filter>
# Parse JSON application logs
<filter kubernetes.**>
@type parser
key_name log
reserve_data true
remove_key_name_field true
<parse>
@type json
</parse>
</filter>
# Output to Elasticsearch
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc
port 9200
logstash_format true
logstash_prefix k8s-logs
include_tag_key true
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.buffer
flush_mode interval
flush_interval 5s
retry_max_interval 30
chunk_limit_size 2M
total_limit_size 500M
</buffer>
</match>
Logging Solution Comparison
| Solution | Architecture | Query Language | Best For | Cost Model |
|---|---|---|---|---|
| ELK Stack | Full-text indexing | KQL (Kibana Query Language) / Lucene | Complex searches, analytics | Self-hosted (resource-heavy) |
| Loki | Label-based, log chunks | LogQL | Kubernetes, cost-effective | Self-hosted (lightweight) |
| CloudWatch Logs | Managed (AWS) | CloudWatch Insights | AWS-native workloads | Per GB ingested + stored |
| Azure Monitor Logs | Managed (Azure) | KQL (Kusto) | Azure-native workloads | Per GB ingested + retained |
| Datadog Logs | SaaS | Datadog query syntax | Multi-cloud, unified platform | Per GB ingested (expensive) |
Distributed Tracing
When a single user request passes through 5, 10, or 20 microservices, understanding where time is spent and where errors occur becomes nearly impossible without tracing. Distributed tracing gives you the end-to-end story of every request.
Spans and Context Propagation
A trace represents the complete journey of a request. Each trace is composed of spans — individual units of work within a service. Spans form a tree structure showing parent-child relationships and timing.
sequenceDiagram
participant U as User
participant GW as API Gateway
participant AS as Auth Service
participant PS as Payment Service
participant DB as Database
participant MQ as Message Queue
participant NS as Notification Service
U->>GW: POST /checkout (trace_id: abc123)
GW->>AS: Validate Token (span: auth-check)
AS-->>GW: Token Valid (12ms)
GW->>PS: Process Payment (span: payment)
PS->>DB: Insert Transaction (span: db-write)
DB-->>PS: OK (45ms)
PS->>MQ: Publish Event (span: queue-publish)
MQ-->>PS: ACK (8ms)
PS-->>GW: Payment Success (120ms)
MQ->>NS: Send Confirmation (span: notify)
NS-->>MQ: Sent (200ms)
GW-->>U: 200 OK (150ms total)
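The trace ID in the diagram travels between services in a request header. With the W3C Trace Context format, that header is `traceparent`, laid out as `version-traceid-parentid-flags`. The sketch below parses an incoming header and builds the one a service would send downstream. The helper names are hypothetical, and sampling every new trace is an assumption; in practice an OpenTelemetry SDK does this for you.

```javascript
const { randomBytes } = require('node:crypto');

// version 00, 32 hex chars of trace id, 16 hex chars of parent span id, 2 hex chars of flags
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const m = TRACEPARENT.exec(header || '');
  if (!m) return null;
  const [, traceId, parentSpanId, flags] = m;
  // All-zero ids are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Continue the caller's trace if there is one, otherwise start a new trace.
function childTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex'); // this service's own span
  const flags = parent && !parent.sampled ? '00' : '01'; // assumption: sample new traces
  return {
    header: `00-${traceId}-${spanId}-${flags}`,
    traceId,
    spanId,
    parentSpanId: parent ? parent.parentSpanId : null,
  };
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const outgoing = childTraceparent(incoming);
console.log(outgoing.header);
```

Every hop keeps the trace ID and replaces the span ID, which is what lets the backend reassemble the spans into a tree.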
OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for telemetry collection. It provides vendor-neutral APIs, SDKs, and the Collector for metrics, logs, and traces. It has become the universal instrumentation layer.
# otel-collector-config.yaml - OpenTelemetry Collector configuration
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
# Scrape Prometheus metrics
prometheus:
config:
scrape_configs:
- job_name: "otel-collector"
scrape_interval: 10s
static_configs:
- targets: ["0.0.0.0:8888"]
processors:
batch:
timeout: 5s
send_batch_size: 1024
# Add resource attributes
resource:
attributes:
- key: environment
value: production
action: upsert
- key: service.namespace
value: checkout
action: upsert
# Tail-based sampling (keep errors and slow requests, plus 10% of all traces)
tail_sampling:
decision_wait: 10s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow-requests
type: latency
latency: { threshold_ms: 2000 }
- name: probabilistic
type: probabilistic
probabilistic: { sampling_percentage: 10 }
exporters:
# Send traces to Jaeger
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: true
# Send metrics to Prometheus (the server must run with --web.enable-remote-write-receiver)
prometheusremotewrite:
endpoint: "http://prometheus:9090/api/v1/write"
# Send logs to Loki
loki:
endpoint: "http://loki:3100/loki/api/v1/push"
service:
pipelines:
traces:
receivers: [otlp]
processors: [resource, tail_sampling, batch]
exporters: [otlp/jaeger]
metrics:
receivers: [otlp, prometheus]
processors: [batch, resource]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [batch, resource]
exporters: [loki]
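The three tail-sampling policies in this configuration combine as a logical OR: a trace is kept if any policy votes to keep it. The sketch below illustrates that decision for one buffered trace. The function name and span shape are assumptions, and `Math.random` stands in for the probabilistic policy, so treat this as an illustration rather than the Collector's actual implementation.

```javascript
// Decide whether to keep a finished trace. `spans` holds every span buffered for one trace:
// [{ status: 'OK' | 'ERROR', startMs: number, endMs: number }]
function keepTrace(spans, { latencyThresholdMs = 2000, samplePercent = 10, random = Math.random } = {}) {
  // Policy "errors": any span with an error status.
  if (spans.some((s) => s.status === 'ERROR')) return { keep: true, policy: 'errors' };

  // Policy "slow-requests": end-to-end duration above the threshold.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start > latencyThresholdMs) return { keep: true, policy: 'slow-requests' };

  // Policy "probabilistic": a fixed percentage of everything else.
  if (random() * 100 < samplePercent) return { keep: true, policy: 'probabilistic' };

  return { keep: false, policy: null };
}

const failed = [{ status: 'ERROR', startMs: 0, endMs: 120 }];
const slow = [{ status: 'OK', startMs: 0, endMs: 2500 }];
console.log(keepTrace(failed).policy, keepTrace(slow).policy);
```

The decision can only be made once the trace is complete, which is why the processor waits (`decision_wait`) and why all spans of a trace must reach the same Collector instance.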
// OpenTelemetry instrumentation - Node.js example
// tracing.js - Initialize before any other imports
// Written against the 1.x JS SDK; 2.x replaces `new Resource(...)` with `resourceFromAttributes(...)`
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
environment: process.env.NODE_ENV || 'development',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
}),
// NodeSDK takes a metric reader rather than a bare exporter; the reader exports on an interval
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4317',
}),
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true },
}),
],
});
sdk.start();
console.log('OpenTelemetry SDK initialized');
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown().then(() => process.exit(0));
});
| Tracing Backend | Type | Storage | Best For |
|---|---|---|---|
| Jaeger | Open Source (CNCF) | Elasticsearch, Cassandra (Kafka as an ingestion buffer) | Kubernetes-native tracing |
| Zipkin | Open Source | Elasticsearch, MySQL, Cassandra | Simple setup, Java ecosystem |
| Tempo (Grafana) | Open Source | Object storage (S3, GCS) | Cost-effective, Grafana integration |
| AWS X-Ray | Managed (AWS) | Managed | AWS Lambda, ECS, EKS |
| Azure App Insights | Managed (Azure) | Log Analytics workspace | Azure-native applications |
Alerting Strategies
Alerts are the bridge between automated monitoring and human action. But poorly designed alerting creates alert fatigue, one of the most common failure modes in operations. When teams receive hundreds of non-actionable alerts daily, they learn to ignore them all, including the critical ones.
Alert Severity Levels
| Severity | Impact | Response Time | Notification | Example |
|---|---|---|---|---|
| P1 - Critical | Service down, data loss risk | Immediate (wake up) | PagerDuty + phone call | Production database unreachable |
| P2 - High | Degraded performance, partial outage | 15 minutes | PagerDuty + Slack | Error rate > 5% for 5 minutes |
| P3 - Medium | Non-critical issue, workaround exists | Business hours | Slack channel | Disk usage > 80% |
| P4 - Low | Informational, trend-based | Next sprint | Email / ticket | Certificate expiring in 30 days |
Alertmanager Configuration
# alertmanager.yml - Routing and notification configuration
global:
resolve_timeout: 5m
smtp_from: "alerts@company.com"
smtp_smarthost: "smtp.company.com:587"
pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
# Notification templates
templates:
- "/etc/alertmanager/templates/*.tmpl"
# Routing tree - match alerts to receivers
route:
receiver: "slack-default"
group_by: ["alertname", "cluster", "service"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# Critical alerts → PagerDuty (wake people up)
- match:
severity: critical
receiver: "pagerduty-critical"
group_wait: 10s
repeat_interval: 1h
continue: true
# High alerts → PagerDuty + Slack
- match:
severity: high
receiver: "pagerduty-high"
group_wait: 30s
repeat_interval: 2h
# Infrastructure team alerts
- match:
team: infrastructure
receiver: "slack-infrastructure"
routes:
- match:
severity: critical
receiver: "pagerduty-infra"
# Silence during maintenance windows
- match_re:
alertname: "^Maintenance.*"
receiver: "null"
# Receivers define notification channels
receivers:
- name: "null"
- name: "slack-default"
slack_configs:
- channel: "#alerts-general"
send_resolved: true
title: '{{ .GroupLabels.alertname }}'
text: >-
{{ range .Alerts }}
*{{ .Labels.severity | toUpper }}* - {{ .Annotations.summary }}
{{ .Annotations.description }}
{{ end }}
- name: "pagerduty-critical"
pagerduty_configs:
- service_key_file: "/etc/alertmanager/secrets/pagerduty-critical-key"
severity: critical
description: '{{ .CommonAnnotations.summary }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
- name: "pagerduty-high"
pagerduty_configs:
- service_key_file: "/etc/alertmanager/secrets/pagerduty-high-key"
severity: error
- name: "slack-infrastructure"
slack_configs:
- channel: "#alerts-infrastructure"
send_resolved: true
# Inhibition rules - suppress less severe alerts when critical fires
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ["alertname", "cluster", "service"]
- source_match:
severity: critical
target_match:
severity: high
equal: ["cluster"]
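Routing trees are easy to misread, so it can be useful to trace by hand which receivers an alert ends up at. The sketch below models the matching rules as I understand them: children are tried in order, the first match wins unless it sets `continue: true`, and a node's own receiver is used only when none of its children match. The function names are hypothetical and the tree is a simplified copy of the configuration above.

```javascript
// True when every `match` label is equal and every `match_re` pattern matches fully.
function matches(route, labels) {
  const exact = Object.entries(route.match || {}).every(([k, v]) => labels[k] === v);
  const regex = Object.entries(route.match_re || {}).every(
    ([k, re]) => new RegExp(`^(?:${re})$`).test(labels[k] || ''),
  );
  return exact && regex;
}

function resolveReceivers(route, labels) {
  const receivers = [];
  const walk = (node) => {
    let matchedChild = false;
    for (const child of node.routes || []) {
      if (!matches(child, labels)) continue;
      matchedChild = true;
      walk(child);
      if (!child.continue) break; // first match wins unless continue is set
    }
    if (!matchedChild) receivers.push(node.receiver);
  };
  walk(route);
  return receivers;
}

const routingTree = {
  receiver: 'slack-default',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pagerduty-critical', continue: true },
    { match: { severity: 'high' }, receiver: 'pagerduty-high' },
    {
      match: { team: 'infrastructure' },
      receiver: 'slack-infrastructure',
      routes: [{ match: { severity: 'critical' }, receiver: 'pagerduty-infra' }],
    },
    { match_re: { alertname: '^Maintenance.*' }, receiver: 'null' },
  ],
};

console.log(resolveReceivers(routingTree, { severity: 'critical', team: 'infrastructure' }));
console.log(resolveReceivers(routingTree, { alertname: 'MaintenanceDB', severity: 'critical' }));
```

The second call is worth noting: because the maintenance route comes last, a critical alert named `Maintenance...` still reaches `pagerduty-critical` before it reaches `null`. If maintenance alerts should never page, that route would need to come first.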
Runbooks and Automated Remediation
Every alert should link to a runbook — a document describing the alert, its impact, diagnostic steps, and remediation actions. For common issues, automate the remediation entirely:
# Example: Automated remediation with Kubernetes Event-Driven Autoscaling
# When disk usage exceeds threshold, trigger cleanup job
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
name: disk-cleanup
namespace: maintenance
spec:
jobTargetRef:
template:
spec:
containers:
- name: cleanup
image: alpine:3.18
command:
- /bin/sh
- -c
- |
echo "Running disk cleanup..."
find /data/logs -mtime +7 -delete
find /data/tmp -mtime +1 -delete
echo "Cleanup complete"
restartPolicy: Never
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: node_filesystem_avail_bytes
query: |
(node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"}) < 0.15
threshold: "1"
pollingInterval: 60
maxReplicaCount: 1
SLOs, SLIs, and SLAs
Service Level Objectives bring mathematical rigor to reliability. Instead of vague goals like "the system should be fast," SLOs define exactly what "reliable" means and provide a framework for making trade-offs between reliability and velocity.
| Concept | Definition | Audience | Example |
|---|---|---|---|
| SLA | Business contract with consequences for violations | Customers, legal | "99.9% uptime or credits issued" |
| SLO | Internal reliability target (stricter than SLA) | Engineering teams | "99.95% of requests succeed within 200ms" |
| SLI | Measured metric used to assess SLO compliance | Monitoring systems | "Ratio of successful requests under 200ms" |
Error Budgets and Burn Rate
The error budget is the complement of your SLO (100% minus the target): the amount of unreliability you can tolerate. A 99.9% SLO gives you a 0.1% error budget, which is about 43 minutes of downtime over a 30-day month. When the budget is exhausted, you stop deploying new features and focus on reliability.
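The 43-minute figure is plain arithmetic, and the same calculation works for any target. A small sketch, assuming a 30-day window and a hypothetical helper name:

```javascript
// Allowed downtime, in minutes, for an availability SLO over a rolling window.
function errorBudgetMinutes(sloPercent, windowDays = 30) {
  const budgetFraction = 1 - sloPercent / 100; // e.g. 99.9% -> 0.001
  return budgetFraction * windowDays * 24 * 60;
}

for (const slo of [99, 99.9, 99.95, 99.99]) {
  console.log(`${slo}% -> ${errorBudgetMinutes(slo).toFixed(2)} minutes per 30 days`);
}
```

Each extra nine divides the budget by ten, which is why moving from 99.9% to 99.99% is an engineering decision rather than a configuration change.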
flowchart TD
A[Define SLI] --> B[Set SLO Target]
B --> C[Calculate Error Budget]
C --> D{Budget Remaining?}
D -->|Yes - Budget Healthy| E[Ship Features]
D -->|No - Budget Exhausted| F[Freeze Deployments]
E --> G[Monitor Burn Rate]
F --> H[Focus on Reliability]
G --> I{Burn Rate High?}
I -->|Yes| J[Alert + Investigate]
I -->|No| E
J --> K{Incident?}
K -->|Yes| L[Incident Response]
K -->|No| M[Tune Alert Threshold]
L --> N[Post-Mortem]
N --> O[Reduce Future Burn]
O --> G
H --> G
M --> G
# PromQL: SLO-based alerting with multi-window burn rate
# Reference: Google SRE Workbook Chapter 5
# SLI: Ratio of successful (non-5xx) requests; the queries below do not check the 200ms latency target
# SLO: 99.9% over 30 days
# Error budget: 0.1% = 43.2 minutes/month
# Fast burn alert: 14.4x burn rate over 1 hour (2% budget in 1h)
# → Pages on-call immediately
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
) > (14.4 * 0.001)
and
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
)
) > (14.4 * 0.001)
# Slow burn alert: 6x burn rate over 6 hours (5% budget in 6h)
# → Tickets for investigation
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
)
) > (6 * 0.001)
and
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
)
) > (6 * 0.001)
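The multipliers in these queries come from a simple relationship: burn rate is the observed error ratio divided by the error budget, and the share of budget consumed is the burn rate multiplied by the window length over the SLO period. The sketch below works through that arithmetic, assuming a 30-day (720-hour) period. The function names are hypothetical.

```javascript
// How many times faster than "exactly on budget" errors are occurring.
function burnRate(errorRatio, sloTarget) {
  return errorRatio / (1 - sloTarget);
}

// Fraction of the whole error budget used if `rate` is sustained for `windowHours`.
function budgetConsumed(rate, windowHours, periodHours = 720) {
  return (rate * windowHours) / periodHours;
}

// Multi-window check: the long window shows the burn is significant,
// the short window shows it is still happening.
function shouldAlert(longWindowErrorRatio, shortWindowErrorRatio, factor, sloTarget) {
  const threshold = factor * (1 - sloTarget);
  return longWindowErrorRatio > threshold && shortWindowErrorRatio > threshold;
}

const slo = 0.999;
console.log(budgetConsumed(14.4, 1).toFixed(3)); // share of budget burned in one hour at 14.4x
console.log(shouldAlert(0.02, 0.02, 14.4, slo)); // 2% errors in both windows
console.log(shouldAlert(0.02, 0.001, 14.4, slo)); // long window still high, short window recovered
```

The short window is what lets the alert resolve quickly once the incident is over, instead of staying red until the long window drains.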
Cloud-Native Monitoring
Every major cloud provider offers integrated monitoring services. These provide deep integration with cloud resources, managed infrastructure, and pay-per-use pricing — but at the cost of vendor lock-in.
| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Metrics | CloudWatch Metrics | Azure Monitor Metrics | Cloud Monitoring |
| Logs | CloudWatch Logs | Log Analytics (KQL) | Cloud Logging |
| Tracing | X-Ray | Application Insights | Cloud Trace |
| Dashboards | CloudWatch Dashboards | Azure Dashboards / Workbooks | Cloud Monitoring Dashboards |
| Alerting | CloudWatch Alarms + SNS | Azure Monitor Alerts | Alerting Policies |
| APM | X-Ray + CloudWatch RUM | Application Insights | Cloud Profiler |
| Audit | CloudTrail | Activity Log | Cloud Audit Logs |
Cloud-Native vs Open-Source: When to Use Which
# AWS CloudWatch - Query metrics with CLI
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2026-05-14T00:00:00Z \
--end-time 2026-05-14T12:00:00Z \
--period 300 \
--statistics Average Maximum
# Azure Monitor - Query logs with KQL
az monitor log-analytics query \
--workspace "my-workspace-id" \
--analytics-query "
AppRequests
| where TimeGenerated > ago(1h)
| where toint(ResultCode) >= 500
| summarize ErrorCount=count() by bin(TimeGenerated, 5m), AppRoleName
| order by TimeGenerated desc
" \
--output table
# GCP Cloud Monitoring - List metrics
gcloud monitoring metrics-descriptors list \
--filter='metric.type = starts_with("compute.googleapis.com/instance/cpu")'
Infrastructure Monitoring with Terraform
Monitoring should be treated as code — versioned, reviewed, and deployed through the same CI/CD pipelines as your infrastructure. Terraform can provision monitoring stacks, configure alerts, and manage dashboards declaratively.
Deploying Prometheus + Grafana with Helm
# Deploy kube-prometheus-stack via Helm
resource "helm_release" "kube_prometheus" {
name = "kube-prometheus"
namespace = "monitoring"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "kube-prometheus-stack"
version = "56.6.2"
create_namespace = true
# Prometheus configuration
set {
name = "prometheus.prometheusSpec.retention"
value = "30d"
}
set {
name = "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage"
value = "100Gi"
}
set {
name = "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName"
value = "gp3"
}
# Grafana configuration
set {
name = "grafana.adminPassword"
value = var.grafana_admin_password
}
set {
name = "grafana.persistence.enabled"
value = "true"
}
set {
name = "grafana.persistence.size"
value = "10Gi"
}
# Alertmanager configuration
set {
name = "alertmanager.alertmanagerSpec.retention"
value = "120h"
}
# Enable ServiceMonitor for auto-discovery
set {
name = "prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues"
value = "false"
}
}
Cloud Alert Resources via Terraform
# AWS CloudWatch Alarm via Terraform
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
alarm_name = "${var.service_name}-high-cpu"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
period = 60
statistic = "Average"
threshold = 80
alarm_description = "CPU utilization exceeds 80% for 3 minutes"
dimensions = {
AutoScalingGroupName = aws_autoscaling_group.app.name
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
tags = var.common_tags
}
# AWS CloudWatch composite alarm
resource "aws_cloudwatch_composite_alarm" "service_health" {
alarm_name = "${var.service_name}-health-composite"
alarm_rule = "ALARM(${aws_cloudwatch_metric_alarm.high_cpu.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_error_rate.alarm_name})"
alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
# Azure Monitor Alert via Terraform
resource "azurerm_monitor_metric_alert" "response_time" {
name = "${var.service_name}-response-time"
resource_group_name = azurerm_resource_group.main.name
scopes = [azurerm_linux_web_app.main.id]
description = "Alert when average response time exceeds 2 seconds"
severity = 2
frequency = "PT1M"
window_size = "PT5M"
criteria {
metric_namespace = "Microsoft.Web/sites"
metric_name = "HttpResponseTime"
aggregation = "Average"
operator = "GreaterThan"
threshold = 2
}
action {
action_group_id = azurerm_monitor_action_group.platform.id
}
tags = var.common_tags
}
resource "azurerm_monitor_action_group" "platform" {
name = "platform-alerts"
resource_group_name = azurerm_resource_group.main.name
short_name = "platform"
email_receiver {
name = "oncall"
email_address = "oncall@company.com"
}
webhook_receiver {
name = "pagerduty"
uri = "https://events.pagerduty.com/integration/${var.pd_key}/enqueue"
}
}
Hands-On Exercises
Deploy Prometheus and Write PromQL Queries
Deploy Prometheus using Docker Compose with Node Exporter. Configure scraping, then write PromQL queries to answer operational questions about your system.
- Create a docker-compose.yml with Prometheus + Node Exporter + a sample app
- Configure prometheus.yml with scrape targets
- Write PromQL queries for: CPU utilization rate, memory usage percentage, top 3 endpoints by request rate, 95th percentile latency, and error rate per service
- Create recording rules for expensive queries you would use in dashboards
Build a Grafana Dashboard for Infrastructure Metrics
Create a golden-signals dashboard in Grafana that provides at-a-glance infrastructure health and supports drill-down into problem areas.
- Connect Grafana to your Prometheus instance from Exercise 1
- Create a dashboard with 6 panels: request rate, error rate, P95 latency, CPU utilization, memory usage, and disk available
- Add template variables for environment and instance filtering
- Configure threshold colors (green/yellow/red) on stat panels
- Export the dashboard as JSON and commit it to version control
Configure Alerting with Severity Routing
Set up Alertmanager with a multi-tier routing configuration that directs alerts to different channels based on severity and team ownership.
- Deploy Alertmanager alongside Prometheus
- Create alerting rules for: high error rate (P1), disk space low (P3), and pod crash looping (P2)
- Configure routing: P1 → PagerDuty simulation, P2 → Slack webhook, P3 → email
- Add inhibition rules to suppress warnings when a related critical alert fires
- Test by triggering alerts and verifying correct routing
Define SLOs and Calculate Error Budgets
Define meaningful SLOs for a web service, implement SLI measurement in PromQL, calculate error budgets, and create burn-rate alerts.
- Define SLOs: 99.9% availability, 99% of requests under 200ms latency
- Write PromQL SLI queries that measure compliance over 30-day windows
- Calculate the error budget in minutes for each SLO
- Create multi-window burn rate alerts (fast: 1h/5m, slow: 6h/30m)
- Simulate an incident and observe budget consumption
Conclusion & Next Steps
Observability is not a product you buy — it is a property of your system. By instrumenting your infrastructure with comprehensive metrics, structured logs, and distributed traces, you gain the ability to understand system behavior, detect anomalies before users notice, and diagnose root causes in minutes instead of hours.
The key principles to carry forward:
- Three pillars together — metrics detect, logs diagnose, traces explain the full picture
- Prometheus + Grafana — the industry standard for cloud-native metrics and visualization
- OpenTelemetry — the universal standard for instrumentation; invest in it now
- Actionable alerts only — every alert must have a clear action and runbook
- SLOs drive decisions — error budgets provide the framework for balancing reliability and velocity
- Monitoring as Code — dashboards, alerts, and configurations belong in version control
Next in the Series
In Part 14: Platform Engineering, we will explore Internal Developer Platforms, Backstage, developer experience, self-service infrastructure, and golden paths — building the abstractions that let development teams move fast without sacrificing operational quality.