Why Observability as Code?
Manually configured dashboards and alerts have the same problems as manually configured servers: they drift, they are not reproducible, they are not reviewable, and they do not survive an environment rebuild.
| Manual Configuration | Observability as Code |
|---|---|
| Click-ops in Grafana UI | JSON/Jsonnet templates in Git |
| Alert rules edited by hand | PrometheusRule CRDs in version control |
| No review process for changes | Pull requests with peer review |
| Cannot reproduce after disaster | Full rebuild from code in minutes |
| Knowledge lives in one person's head | Knowledge lives in the repository |
| Inconsistent across environments | Same code deploys to dev, staging, prod |
Terraform for Observability Infrastructure
Terraform manages the infrastructure layer: Prometheus instances, Grafana servers, Loki storage backends, alerting channels, and data sources.
# main.tf — Observability infrastructure with Terraform
terraform {
  required_providers {
    helm       = { source = "hashicorp/helm", version = "~> 2.12" }
    kubernetes = { source = "hashicorp/kubernetes", version = "~> 2.25" }
    grafana    = { source = "grafana/grafana", version = "~> 2.9" }
  }
}

# Deploy kube-prometheus-stack via Helm
resource "helm_release" "kube_prometheus" {
  name             = "monitoring"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true # Assumes the namespace is not created elsewhere; needed for a rebuild on an empty cluster
  version          = "56.6.2"
  values           = [file("values/kube-prometheus-values.yaml")]
}

# Deploy Loki for log aggregation
resource "helm_release" "loki" {
  name       = "loki"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"
  namespace  = "monitoring"
  version    = "5.42.0"
  values     = [file("values/loki-values.yaml")]
  depends_on = [helm_release.kube_prometheus] # The namespace is created by the release above
}

# Deploy Tempo for distributed tracing
resource "helm_release" "tempo" {
  name       = "tempo"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "tempo"
  namespace  = "monitoring"
  version    = "1.7.1"
  values     = [file("values/tempo-values.yaml")]
  depends_on = [helm_release.kube_prometheus] # The namespace is created by the release above
}

# Configure Grafana data sources
resource "grafana_data_source" "prometheus" {
  type       = "prometheus"
  name       = "Prometheus"
  url        = "http://monitoring-kube-prometheus-prometheus.monitoring:9090"
  is_default = true
}

resource "grafana_data_source" "loki" {
  type = "loki"
  name = "Loki"
  url  = "http://loki.monitoring:3100"
}

resource "grafana_data_source" "tempo" {
  type = "tempo"
  name = "Tempo"
  url  = "http://tempo.monitoring:3200"
}
Dashboards as Code
Grafonnet — Grafana Dashboards with Jsonnet
Grafonnet is a Jsonnet library for generating Grafana dashboard JSON. Instead of manually creating 20 similar service dashboards, you write a template once and generate dashboards for every service.
// service-dashboard.jsonnet — Reusable service dashboard template
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local panel = grafana.panel;

// Function: generate a Golden Signals dashboard for any service
// Note: sloTarget is accepted as a parameter but is not used by the panels below
local serviceDashboard(serviceName, sloTarget=0.999) =
  dashboard.new('Service: %s' % serviceName)
  + dashboard.withUid('service-%s' % serviceName)
  + dashboard.withTags(['auto-generated', 'golden-signals'])
  + dashboard.withPanels([
    // Request Rate
    panel.timeSeries.new('Request Rate')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{service="%s"}[5m]))' % serviceName
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('reqps'),

    // Error Rate
    panel.timeSeries.new('Error Rate')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{service="%s",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="%s"}[5m])) * 100' % [serviceName, serviceName]
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('percent'),

    // p99 Latency
    panel.timeSeries.new('p99 Latency')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="%s"}[5m])) by (le))' % serviceName
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('s'),
  ]);

// Generate dashboards for all services
{
  'order-service-dashboard.json': serviceDashboard('order-service'),
  'payment-service-dashboard.json': serviceDashboard('payment-service', 0.9999),
  'user-service-dashboard.json': serviceDashboard('user-service'),
  'inventory-service-dashboard.json': serviceDashboard('inventory-service'),
}
Deploying Dashboards with Terraform
# Generate JSON from Jsonnet
jsonnet -J vendor -m output/ service-dashboard.jsonnet
# Deploy dashboards via Terraform
resource "grafana_dashboard" "services" {
  for_each    = fileset("${path.module}/output", "*.json")
  config_json = file("${path.module}/output/${each.value}")
  overwrite   = true
}
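Because `overwrite = true` replaces whatever is already in Grafana, it can help to check the generated files before Terraform applies them. The following is a minimal sketch of such a check, assuming the `output/` directory produced by the `jsonnet` command above. `validate_dashboards` is a hypothetical helper, not part of Grafana, Grafonnet, or Terraform; it only verifies that each file parses as JSON, has a title, and does not reuse another dashboard's `uid`.

```python
import json
from pathlib import Path


def validate_dashboards(output_dir):
    """Return a list of human-readable problems found in generated dashboards."""
    problems = []
    seen_uids = {}  # uid -> name of the first file that used it

    # Sort for deterministic reporting order across runs.
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            dashboard = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue

        if not isinstance(dashboard, dict):
            problems.append(f"{path.name}: top-level value is not an object")
            continue

        if not dashboard.get("title"):
            problems.append(f"{path.name}: missing title")

        uid = dashboard.get("uid")
        if not uid:
            problems.append(f"{path.name}: missing uid")
        elif uid in seen_uids:
            problems.append(f"{path.name}: uid '{uid}' already used by {seen_uids[uid]}")
        else:
            seen_uids[uid] = path.name

    return problems
```

A CI job could call `validate_dashboards("output")` after the Jsonnet step and fail the build when the returned list is not empty.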
Alert Rules via GitOps
With the Prometheus Operator, alert rules are Kubernetes custom resources of kind PrometheusRule. Store them in Git, review changes via pull requests, and let ArgoCD or Flux sync them to the cluster.
# alerts/slo-burn-rate.yaml — SLO alerts as Kubernetes CRDs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate-alerts
  namespace: monitoring
  labels:
    release: monitoring # Must match Prometheus Operator's ruleSelector
spec:
  groups:
    - name: slo.burn-rate
      rules:
        - alert: SLOBurnRateCritical
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
              / sum(rate(http_requests_total[1h])) by (service)
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service)
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} SLO burn rate critical"
            runbook_url: "https://runbooks.internal/slo-burn-critical"
        - alert: SLOBurnRateHigh
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
              / sum(rate(http_requests_total[6h])) by (service)
            ) > (6 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[30m])) by (service)
              / sum(rate(http_requests_total[30m])) by (service)
            ) > (6 * 0.001)
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "{{ $labels.service }} SLO burn rate elevated"
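The thresholds in these rules come from simple arithmetic. A burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so the alert threshold is the burn rate multiplied by the budget. The sketch below works this out for the 99.9% target implied by the `0.001` factor. It assumes a 30-day SLO window, which the rule file itself does not state, in order to show how much of the budget each alert represents. The function names are illustrative.

```python
def error_ratio_threshold(slo_target, burn_rate):
    """Error ratio above which the budget burns at `burn_rate` times the sustainable pace."""
    return burn_rate * (1 - slo_target)


def budget_consumed(burn_rate, alert_window_hours, slo_window_days=30):
    """Fraction of the whole error budget spent if `burn_rate` holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


# (severity, burn rate, long alert window in hours) as used in the rules above
for name, rate, hours in [("critical", 14.4, 1), ("high", 6, 6)]:
    threshold = error_ratio_threshold(0.999, rate)
    consumed = budget_consumed(rate, hours)
    print(f"{name}: error ratio > {threshold:.4f}, "
          f"{consumed:.0%} of a 30-day budget in {hours}h")
```

Under that 30-day assumption, the critical alert corresponds to spending 2% of the monthly budget in one hour and the high alert to spending 5% in six hours. The shorter 5m and 30m windows in each rule use the same threshold and only confirm that the burn is still happening.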
flowchart LR
A[Developer edits\nalert rule YAML] --> B[Pull Request\nPeer review + CI validation]
B --> C[Merge to main]
C --> D[ArgoCD / Flux\nDetects change in Git]
D --> E[Apply to Kubernetes\nPrometheusRule CRD updated]
E --> F[Prometheus reloads\nNew alert rules active]
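The "CI validation" step in this flow can start as a small lint script. The following is a minimal sketch, assuming the PrometheusRule manifest has already been parsed into a Python dict (for example with a YAML library). `lint_prometheus_rule` and the policy it enforces are illustrative team conventions, not Prometheus Operator requirements: every alert needs a `severity` label and a `summary` annotation, and critical alerts need a `runbook_url`.

```python
def lint_prometheus_rule(manifest):
    """Return policy violations for a parsed PrometheusRule manifest (a dict)."""
    violations = []
    for group in manifest.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # Recording rules use `record` instead of `alert`

            labels = rule.get("labels", {})
            annotations = rule.get("annotations", {})

            if "severity" not in labels:
                violations.append(f"{name}: missing severity label")
            if "summary" not in annotations:
                violations.append(f"{name}: missing summary annotation")
            if labels.get("severity") == "critical" and "runbook_url" not in annotations:
                violations.append(f"{name}: critical alert has no runbook_url")
    return violations


# Hypothetical manifest fragment: a critical alert without a runbook link
example = {
    "kind": "PrometheusRule",
    "spec": {
        "groups": [
            {
                "name": "slo.burn-rate",
                "rules": [
                    {
                        "alert": "SLOBurnRateCritical",
                        "expr": "vector(1)",
                        "labels": {"severity": "critical"},
                        "annotations": {"summary": "burn rate critical"},
                    }
                ],
            }
        ]
    },
}

for violation in lint_prometheus_rule(example):
    print(violation)
```

This covers team policy only. Expression syntax is better checked with `promtool check rules`, which expects the plain `groups:` content rather than the Kubernetes wrapper.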
Platform Engineering Patterns
Platform engineering creates self-service abstractions so application teams get observability automatically without configuring anything. The platform team builds golden paths; product teams walk them.
| Pattern | Implementation | What Teams Get |
|---|---|---|
| Auto-dashboard | Jsonnet template + CI generates dashboard on service registration | Dashboard appears automatically in Grafana |
| Auto-alerting | Default PrometheusRules applied via namespace labels | SLO alerts with sensible defaults, overridable via annotations |
| Auto-instrumentation | OTel Operator + namespace annotation | All pods in namespace get traces/metrics without code changes |
| Standardised logging | Fluent Bit DaemonSet with structured JSON format | Logs queryable in Loki with consistent schema |
| SLO registry | YAML file per service defining SLO targets | Burn rate dashboards and alerts generated from registry |
# slo-registry/order-service.yaml — SLO definition in the platform registry
apiVersion: slo.platform.internal/v1
kind: ServiceSLO
metadata:
  name: order-service
  namespace: production
spec:
  service: order-service
  team: payments
  tier: critical # critical | standard | internal
  slos:
    - name: availability
      description: "Proportion of successful requests"
      sli:
        type: ratio
        good: 'http_requests_total{service="order-service",status!~"5.."}'
        total: 'http_requests_total{service="order-service"}'
      target: 0.999
      window: 30d
    - name: latency
      description: "Proportion of requests under 300ms"
      sli:
        type: ratio
        good: 'http_request_duration_seconds_bucket{service="order-service",le="0.3"}'
        total: 'http_request_duration_seconds_count{service="order-service"}'
      target: 0.99
      window: 30d
  alerting:
    burnRate:
      - window: 1h
        rate: 14.4
        severity: critical
      - window: 6h
        rate: 6
        severity: high
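A registry entry only pays off once something turns it into rules. The following is a minimal sketch of such a generator, assuming the `spec` section has been parsed into a dict with the same field names as the YAML above, and that `alerting` sits at the `spec` level next to `slos`. `burn_rate_expr` and `alerts_for` are hypothetical helpers, and the single-window expression they build is a simplification: a real generator would probably pair each long window with a short confirmation window.

```python
def burn_rate_expr(good, total, target, window, rate):
    """PromQL that holds when the error ratio over `window` exceeds `rate` times the budget."""
    budget = 1 - target
    # For a ratio SLI, the error ratio is 1 minus good events over total events.
    error_ratio = f"1 - (sum(rate({good}[{window}])) / sum(rate({total}[{window}])))"
    return f"({error_ratio}) > ({rate} * {budget:.6g})"


def alerts_for(spec):
    """Build one alert rule per (SLO, burn-rate policy) pair in a registry entry."""
    alerts = []
    for slo in spec["slos"]:
        for policy in spec["alerting"]["burnRate"]:
            alerts.append({
                "alert": "SLOBurnRate",
                "expr": burn_rate_expr(
                    slo["sli"]["good"],
                    slo["sli"]["total"],
                    slo["target"],
                    policy["window"],
                    policy["rate"],
                ),
                "labels": {
                    "severity": policy["severity"],
                    "service": spec["service"],
                    "slo": slo["name"],
                    "team": spec["team"],
                },
            })
    return alerts


# The availability SLO from the registry entry above, as a parsed dict
order_service = {
    "service": "order-service",
    "team": "payments",
    "slos": [{
        "name": "availability",
        "sli": {
            "good": 'http_requests_total{service="order-service",status!~"5.."}',
            "total": 'http_requests_total{service="order-service"}',
        },
        "target": 0.999,
    }],
    "alerting": {"burnRate": [
        {"window": "1h", "rate": 14.4, "severity": "critical"},
        {"window": "6h", "rate": 6, "severity": "high"},
    ]},
}

for alert in alerts_for(order_service):
    print(alert["labels"]["severity"], "->", alert["expr"])
```

The returned dicts have the shape of entries under `rules:` in a PrometheusRule, so a platform pipeline could serialise them into a manifest and commit it alongside the hand-written rules.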
Complete Observability Platform Stack
- Infrastructure Layer (Terraform): Prometheus, Grafana, Loki, Tempo deployed via Helm charts
- Collection Layer (OTel): OTel Operator + DaemonSet Collectors, Fluent Bit for logs
- Configuration Layer (GitOps): Alert rules, recording rules, and dashboards in Git, synced by ArgoCD
- Automation Layer (Platform): SLO registry generates dashboards + alerts; auto-instrumentation via annotations
- Consumption Layer (Self-Service): Teams view pre-built dashboards, customise SLO targets, access traces/logs
Conclusion & Series Summary
This completes the 12-part core series on Monitoring, Observability & Reliability. Here is the full arc:
| Part | Topic | Key Concept |
|---|---|---|
| 1 | Foundations | Monitoring vs observability, three pillars |
| 2 | Metrics | Four Golden Signals, Prometheus data model |
| 3 | Logging | Structured logging, ELK vs Loki |
| 4 | Prometheus Deep Dive | PromQL, recording rules, federation |
| 5 | Distributed Tracing | Spans, context propagation, sampling |
| 6 | OpenTelemetry | Unified telemetry, Collector pipelines |
| 7 | Visualization & Alerting | Grafana dashboards, Alertmanager routing |
| 8 | Kubernetes Observability | Control plane, KSM, OTel Operator |
| 9 | SLOs & Error Budgets | Burn rate alerting, error budget policies |
| 10 | Incident Management | Structured response, blameless post-mortems |
| 11 | Chaos Engineering | Hypothesis-driven fault injection |
| 12 | Observability as Code | Terraform, Jsonnet, GitOps, platform engineering |
Key takeaways from Part 12 and the series as a whole:
- Codify everything — dashboards, alerts, infrastructure, SLOs — in version-controlled repositories
- GitOps for alert rules gives you review, rollback, and audit trail
- Grafonnet/Jsonnet templates eliminate dashboard duplication across services
- Platform engineering makes observability self-service — the golden path gives teams observability by default
- The complete stack: OTel for collection → Prometheus/Loki/Tempo for storage → Grafana for visualization → Alertmanager for notification → SLOs for decision-making