Part 12: Observability as Code & Platform Engineering

May 14, 2026 Wasil Zafar 18 min read

The final part of the core series brings everything together: codifying your observability stack with Terraform, generating dashboards from templates with Jsonnet, managing alert rules via GitOps, and building an internal developer platform that makes observability self-service.

Table of Contents

  1. Why Observability as Code?
  2. Terraform for Observability Infrastructure
  3. Dashboards as Code
  4. Alert Rules via GitOps
  5. Platform Engineering Patterns
  6. Conclusion & Series Summary

Why Observability as Code?

Manually configured dashboards and alerts have the same problems as manually configured servers: they drift, they are not reproducible, they are not reviewable, and they do not survive an environment rebuild.

| Manual Configuration | Observability as Code |
| --- | --- |
| Click-ops in Grafana UI | JSON/Jsonnet templates in Git |
| Alert rules edited by hand | PrometheusRule CRDs in version control |
| No review process for changes | Pull requests with peer review |
| Cannot reproduce after disaster | Full rebuild from code in minutes |
| Knowledge lives in one person's head | Knowledge lives in the repository |
| Inconsistent across environments | Same code deploys to dev, staging, prod |
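
Drift is the easiest of these problems to check mechanically. The sketch below is one possible approach, not part of any tool's API: it compares a dashboard definition stored in Git with the JSON exported from a live Grafana instance, ignoring the `id` and `version` fields that Grafana rewrites on save. The function names and the sample dictionaries are hypothetical.

```python
import json

# Fields Grafana changes on every save; assumed safe to ignore when diffing.
VOLATILE_FIELDS = {"id", "version"}


def normalise(dashboard: dict) -> str:
    """Return a canonical JSON string with the volatile fields removed."""
    stable = {k: v for k, v in dashboard.items() if k not in VOLATILE_FIELDS}
    return json.dumps(stable, sort_keys=True)


def has_drifted(in_git: dict, live: dict) -> bool:
    """True when the live dashboard no longer matches the version in Git."""
    return normalise(in_git) != normalise(live)


# Hypothetical sample data: someone edited the title in the Grafana UI.
in_git = {"uid": "service-order-service", "title": "Service: order-service", "panels": []}
live = {
    "id": 42,
    "version": 7,
    "uid": "service-order-service",
    "title": "Service: order-service (edited)",
    "panels": [],
}

print(has_drifted(in_git, live))  # True
```

A check like this could run on a schedule and open an issue when it reports drift, which turns "someone clicked in the UI" into a visible event rather than a surprise.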

Terraform for Observability Infrastructure

Terraform manages the infrastructure layer: Prometheus instances, Grafana servers, Loki storage backends, alerting channels, and data sources.

# main.tf — Observability infrastructure with Terraform
terraform {
  required_providers {
    helm = { source = "hashicorp/helm", version = "~> 2.12" }
    kubernetes = { source = "hashicorp/kubernetes", version = "~> 2.25" }
    grafana = { source = "grafana/grafana", version = "~> 2.9" }
  }
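}

# Assumes provider blocks (cluster credentials for helm/kubernetes, url and
# auth for grafana) are configured elsewhere, and that the "monitoring"
# namespace already exists.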

# Deploy kube-prometheus-stack via Helm
resource "helm_release" "kube_prometheus" {
  name       = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"
  version    = "56.6.2"

  values = [file("values/kube-prometheus-values.yaml")]
}

# Deploy Loki for log aggregation
resource "helm_release" "loki" {
  name       = "loki"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"
  namespace  = "monitoring"
  version    = "5.42.0"

  values = [file("values/loki-values.yaml")]
}

# Deploy Tempo for distributed tracing
resource "helm_release" "tempo" {
  name       = "tempo"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "tempo"
  namespace  = "monitoring"
  version    = "1.7.1"

  values = [file("values/tempo-values.yaml")]
}

# Configure Grafana data sources
resource "grafana_data_source" "prometheus" {
  type       = "prometheus"
  name       = "Prometheus"
  url        = "http://monitoring-kube-prometheus-prometheus.monitoring:9090"
  is_default = true
}

resource "grafana_data_source" "loki" {
  type = "loki"
  name = "Loki"
  url  = "http://loki.monitoring:3100"
}

resource "grafana_data_source" "tempo" {
  type = "tempo"
  name = "Tempo"
  url  = "http://tempo.monitoring:3200"
}

Dashboards as Code

Grafonnet — Grafana Dashboards with Jsonnet

Grafonnet is a Jsonnet library for generating Grafana dashboard JSON. Instead of manually creating 20 similar service dashboards, you write a template once and generate dashboards for every service.

// service-dashboard.jsonnet — Reusable service dashboard template
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local panel = grafana.panel;

// Function: generate a Golden Signals dashboard for any service.
// Note: sloTarget is accepted here but not yet referenced by any panel below.
local serviceDashboard(serviceName, sloTarget=0.999) =
  dashboard.new('Service: %s' % serviceName)
  + dashboard.withUid('service-%s' % serviceName)
  + dashboard.withTags(['auto-generated', 'golden-signals'])
  + dashboard.withPanels([
    // Request Rate
    panel.timeSeries.new('Request Rate')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{service="%s"}[5m]))' % serviceName
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('reqps'),

    // Error Rate
    panel.timeSeries.new('Error Rate')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{service="%s",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="%s"}[5m])) * 100' % [serviceName, serviceName]
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('percent'),

    // p99 Latency
    panel.timeSeries.new('p99 Latency')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="%s"}[5m])) by (le))' % serviceName
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('s'),
  ]);

// Generate dashboards for all services
{
  'order-service-dashboard.json': serviceDashboard('order-service'),
  'payment-service-dashboard.json': serviceDashboard('payment-service', 0.9999),
  'user-service-dashboard.json': serviceDashboard('user-service'),
  'inventory-service-dashboard.json': serviceDashboard('inventory-service'),
}

Deploying Dashboards with Terraform

# Step 1 (shell): render one JSON file per dashboard into output/
jsonnet -J vendor -m output/ service-dashboard.jsonnet

# Step 2 (Terraform): deploy every generated dashboard
resource "grafana_dashboard" "services" {
  for_each = fileset("${path.module}/output", "*.json")

  config_json = file("${path.module}/output/${each.value}")
  overwrite   = true
}
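
Because every file in output/ becomes a Grafana dashboard, it can help to sanity-check the generated JSON in CI before Terraform runs. The following is a minimal sketch under stated assumptions: `check_dashboards` is a hypothetical helper, the 40-character UID limit reflects Grafana's documented maximum as far as I know, and the sample files exist only for the demonstration.

```python
import json
import tempfile
from pathlib import Path

MAX_UID_LENGTH = 40  # assumed Grafana limit for dashboard UIDs


def check_dashboards(output_dir: Path) -> list[str]:
    """Return human-readable problems found in the generated dashboards."""
    problems = []
    seen_uids = {}  # uid -> first file that used it
    for path in sorted(output_dir.glob("*.json")):
        dashboard = json.loads(path.read_text())
        uid = dashboard.get("uid")
        if not uid:
            problems.append(f"{path.name}: missing uid")
            continue
        if len(uid) > MAX_UID_LENGTH:
            problems.append(f"{path.name}: uid longer than {MAX_UID_LENGTH} characters")
        if uid in seen_uids:
            problems.append(f"{path.name}: uid {uid!r} already used by {seen_uids[uid]}")
        seen_uids.setdefault(uid, path.name)
    return problems


# Demonstration with two hypothetical files that accidentally share a uid.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "a.json").write_text(json.dumps({"uid": "service-order-service"}))
    (out / "b.json").write_text(json.dumps({"uid": "service-order-service"}))
    problems = check_dashboards(out)

print(problems)
```

A duplicate UID matters here because `overwrite = true` would let one generated dashboard silently replace another.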

Alert Rules via GitOps

With the Prometheus Operator, alert rules are Kubernetes custom resources of kind PrometheusRule. Store them in Git, review changes via pull requests, and deploy via ArgoCD or Flux.

# alerts/slo-burn-rate.yaml — SLO alerts as Kubernetes CRDs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate-alerts
  namespace: monitoring
  labels:
    release: monitoring  # Must match Prometheus Operator's ruleSelector
spec:
  groups:
    - name: slo.burn-rate
      rules:
        - alert: SLOBurnRateCritical
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
              / sum(rate(http_requests_total[1h])) by (service)
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service)
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} SLO burn rate critical"
            runbook_url: "https://runbooks.internal/slo-burn-critical"

        - alert: SLOBurnRateHigh
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
              / sum(rate(http_requests_total[6h])) by (service)
            ) > (6 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[30m])) by (service)
              / sum(rate(http_requests_total[30m])) by (service)
            ) > (6 * 0.001)
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "{{ $labels.service }} SLO burn rate elevated"
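
The multipliers 14.4 and 6 in the rules above are burn rates, and 0.001 is the error budget of a 99.9% SLO. The sketch below reproduces the arithmetic. It assumes the convention popularised by the Google SRE Workbook (page when 2% of a 30-day budget is spent in 1 hour, or 5% in 6 hours); the source rules do not state those percentages, so treat them as an assumption. The function names are illustrative.

```python
def burn_rate(budget_fraction: float, window_hours: float,
              slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the error budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def error_ratio_threshold(rate: float, slo_target: float) -> float:
    """Error ratio above which the budget is burning at `rate` times the sustainable pace."""
    return rate * (1 - slo_target)


critical = burn_rate(0.02, 1)  # 2% of the 30-day budget in 1 hour
high = burn_rate(0.05, 6)      # 5% of the 30-day budget in 6 hours

print(round(critical, 6), round(high, 6))                # 14.4 6.0
print(round(error_ratio_threshold(critical, 0.999), 6))  # 0.0144
```

So `14.4 * 0.001` fires when more than 1.44% of requests fail, and the paired short window (5m or 30m) makes the alert clear quickly once the error rate recovers.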

GitOps Pipeline for Observability Configuration

flowchart LR
    A[Developer edits\nalert rule YAML] --> B[Pull Request\nPeer review + CI validation]
    B --> C[Merge to main]
    C --> D[ArgoCD / Flux\nDetects change in Git]
    D --> E[Apply to Kubernetes\nPrometheusRule CRD updated]
    E --> F[Prometheus reloads\nNew alert rules active]
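
The "CI validation" step in the pipeline can include policy checks on the rule files themselves. Here is a minimal sketch of such a lint, operating on an already-parsed PrometheusRule document so that it needs no YAML library. The policy (every alert needs a `severity` label plus `summary` and `runbook_url` annotations) and the function name are assumptions chosen for illustration, not something the Prometheus Operator enforces.

```python
REQUIRED_LABELS = ("severity",)
REQUIRED_ANNOTATIONS = ("summary", "runbook_url")  # hypothetical team policy


def lint_prometheus_rule(doc: dict) -> list[str]:
    """Return policy violations for every alerting rule in a PrometheusRule."""
    problems = []
    for group in doc.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules have no alert name; skip them
            for key in REQUIRED_LABELS:
                if key not in rule.get("labels", {}):
                    problems.append(f"{name}: missing label {key!r}")
            for key in REQUIRED_ANNOTATIONS:
                if key not in rule.get("annotations", {}):
                    problems.append(f"{name}: missing annotation {key!r}")
    return problems


# Parsed form of the two alerts from slo-burn-rate.yaml (expressions omitted).
rule_doc = {
    "spec": {"groups": [{"name": "slo.burn-rate", "rules": [
        {"alert": "SLOBurnRateCritical",
         "labels": {"severity": "critical"},
         "annotations": {"summary": "...", "runbook_url": "https://runbooks.internal/slo-burn-critical"}},
        {"alert": "SLOBurnRateHigh",
         "labels": {"severity": "high"},
         "annotations": {"summary": "..."}},
    ]}]}
}

print(lint_prometheus_rule(rule_doc))
```

Under this policy the lint flags SLOBurnRateHigh, which has no `runbook_url` in the example above. Syntax validation of the PromQL itself is a separate concern, usually handled by `promtool check rules`.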
                            

Platform Engineering Patterns

Platform engineering creates self-service abstractions so application teams get observability automatically without configuring anything. The platform team builds golden paths; product teams walk them.

The Golden Path: When a team creates a new service, the platform automatically provisions: (1) a Golden Signals dashboard, (2) SLO burn rate alerts at the default target, (3) an OTel Collector sidecar for telemetry, (4) log forwarding to Loki, (5) trace collection to Tempo. Zero configuration required from the application team.

| Pattern | Implementation | What Teams Get |
| --- | --- | --- |
| Auto-dashboard | Jsonnet template + CI generates dashboard on service registration | Dashboard appears automatically in Grafana |
| Auto-alerting | Default PrometheusRules applied via namespace labels | SLO alerts with sensible defaults, overridable via annotations |
| Auto-instrumentation | OTel Operator + namespace annotation | All pods in namespace get traces/metrics without code changes |
| Standardised logging | Fluent Bit DaemonSet with structured JSON format | Logs queryable in Loki with consistent schema |
| SLO registry | YAML file per service defining SLO targets | Burn rate dashboards and alerts generated from registry |

# slo-registry/order-service.yaml — SLO definition in the platform registry
apiVersion: slo.platform.internal/v1
kind: ServiceSLO
metadata:
  name: order-service
  namespace: production
spec:
  service: order-service
  team: payments
  tier: critical  # critical | standard | internal

  slos:
    - name: availability
      description: "Proportion of successful requests"
      sli:
        type: ratio
        good: 'http_requests_total{service="order-service",status!~"5.."}'
        total: 'http_requests_total{service="order-service"}'
      target: 0.999
      window: 30d

    - name: latency
      description: "Proportion of requests under 300ms"
      sli:
        type: ratio
        good: 'http_request_duration_seconds_bucket{service="order-service",le="0.3"}'
        total: 'http_request_duration_seconds_count{service="order-service"}'
      target: 0.99
      window: 30d

  alerting:
    burnRate:
      - window: 1h
        rate: 14.4
        severity: critical
      - window: 6h
        rate: 6
        severity: high
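
To make "alerts generated from registry" concrete, the sketch below turns one SLO entry and its burn-rate settings into alert rule dictionaries that could be serialised into a PrometheusRule. It is an illustration under assumptions, not the platform's actual generator: the function names and alert naming scheme are hypothetical, windows are assumed to be whole hours, and the short confirmation window is taken as one twelfth of the long window, which reproduces the 1h/5m and 6h/30m pairs used in the alert rules earlier.

```python
def short_window(window: str) -> str:
    """Return one twelfth of an hour-based window, e.g. '1h' -> '5m'."""
    if not window.endswith("h"):
        raise ValueError(f"only hour windows are supported in this sketch: {window}")
    hours = int(window[:-1])
    return f"{hours * 60 // 12}m"


def error_ratio(sli: dict, window: str) -> str:
    """PromQL for the error ratio of a ratio-type SLI over `window`."""
    return (f"(1 - sum(rate({sli['good']}[{window}]))"
            f" / sum(rate({sli['total']}[{window}])))")


def burn_rate_alerts(service: str, slo: dict, burn_rates: list[dict]) -> list[dict]:
    """Build one multi-window alert rule per burn-rate entry."""
    alerts = []
    for br in burn_rates:
        threshold = f"{br['rate'] * (1 - slo['target']):.6g}"
        long_w, short_w = br["window"], short_window(br["window"])
        alerts.append({
            "alert": f"{slo['name'].capitalize()}BurnRate{long_w}",
            "expr": (f"{error_ratio(slo['sli'], long_w)} > {threshold} and "
                     f"{error_ratio(slo['sli'], short_w)} > {threshold}"),
            "labels": {"severity": br["severity"], "service": service},
        })
    return alerts


# Values copied from slo-registry/order-service.yaml.
availability = {
    "name": "availability",
    "sli": {
        "good": 'http_requests_total{service="order-service",status!~"5.."}',
        "total": 'http_requests_total{service="order-service"}',
    },
    "target": 0.999,
}
burn_rates = [
    {"window": "1h", "rate": 14.4, "severity": "critical"},
    {"window": "6h", "rate": 6, "severity": "high"},
]

alerts = burn_rate_alerts("order-service", availability, burn_rates)
for alert in alerts:
    print(alert["alert"], "->", alert["labels"]["severity"])
```

Keeping the generator this small means the registry file stays the single place where a team changes a target, and the thresholds in the alert expressions follow automatically.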

Reference Architecture

Complete Observability Platform Stack

  1. Infrastructure Layer (Terraform): Prometheus, Grafana, Loki, Tempo deployed via Helm charts
  2. Collection Layer (OTel): OTel Operator + DaemonSet Collectors, Fluent Bit for logs
  3. Configuration Layer (GitOps): Alert rules, recording rules, and dashboards in Git, synced by ArgoCD
  4. Automation Layer (Platform): SLO registry generates dashboards + alerts; auto-instrumentation via annotations
  5. Consumption Layer (Self-Service): Teams view pre-built dashboards, customise SLO targets, access traces/logs

Conclusion & Series Summary

This completes the 12-part core series on Monitoring, Observability & Reliability. Here is the full arc:

| Part | Topic | Key Concept |
| --- | --- | --- |
| 1 | Foundations | Monitoring vs observability, three pillars |
| 2 | Metrics | Four Golden Signals, Prometheus data model |
| 3 | Logging | Structured logging, ELK vs Loki |
| 4 | Prometheus Deep Dive | PromQL, recording rules, federation |
| 5 | Distributed Tracing | Spans, context propagation, sampling |
| 6 | OpenTelemetry | Unified telemetry, Collector pipelines |
| 7 | Visualization & Alerting | Grafana dashboards, Alertmanager routing |
| 8 | Kubernetes Observability | Control plane, KSM, OTel Operator |
| 9 | SLOs & Error Budgets | Burn rate alerting, error budget policies |
| 10 | Incident Management | Structured response, blameless post-mortems |
| 11 | Chaos Engineering | Hypothesis-driven fault injection |
| 12 | Observability as Code | Terraform, Jsonnet, GitOps, platform engineering |

Key takeaways from Part 12 and the series as a whole:

  • Codify everything — dashboards, alerts, infrastructure, SLOs — in version-controlled repositories
  • GitOps for alert rules gives you review, rollback, and audit trail
  • Grafonnet/Jsonnet templates eliminate dashboard duplication across services
  • Platform engineering makes observability self-service — the golden path gives teams observability by default
  • The complete stack: OTel for collection → Prometheus/Loki/Tempo for storage → Grafana for visualization → Alertmanager for notification → SLOs for decision-making