Part 12: Observability as Code & Platform Engineering

Why Observability as Code?

Manually configured dashboards and alerts have the same problems as manually configured servers: they drift, they are not reproducible, they are not reviewable, and they do not survive an environment rebuild.

Manual Configuration	Observability as Code
Click-ops in Grafana UI	JSON/Jsonnet templates in Git
Alert rules edited by hand	PrometheusRule CRDs in version control
No review process for changes	Pull requests with peer review
Cannot reproduce after disaster	Full rebuild from code in minutes
Knowledge lives in one person's head	Knowledge lives in the repository
Inconsistent across environments	Same code deploys to dev, staging, prod

Terraform for Observability Infrastructure

Terraform manages the infrastructure layer: Prometheus instances, Grafana servers, Loki storage backends, alerting channels, and data sources.

# main.tf — Observability infrastructure with Terraform
terraform {
  required_providers {
    helm = { source = "hashicorp/helm", version = "~> 2.12" }
    kubernetes = { source = "hashicorp/kubernetes", version = "~> 2.25" }
    grafana = { source = "grafana/grafana", version = "~> 2.9" }
  }
}

# Deploy kube-prometheus-stack via Helm
resource "helm_release" "kube_prometheus" {
  name       = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"
  version    = "56.6.2"

  values = [file("values/kube-prometheus-values.yaml")]
}

# Deploy Loki for log aggregation
resource "helm_release" "loki" {
  name       = "loki"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "loki"
  namespace  = "monitoring"
  version    = "5.42.0"

  values = [file("values/loki-values.yaml")]
}

# Deploy Tempo for distributed tracing
resource "helm_release" "tempo" {
  name       = "tempo"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "tempo"
  namespace  = "monitoring"
  version    = "1.7.1"

  values = [file("values/tempo-values.yaml")]
}

# Configure Grafana data sources
resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://monitoring-kube-prometheus-prometheus.monitoring:9090"
  is_default = true
}

resource "grafana_data_source" "loki" {
  type = "loki"
  name = "Loki"
  url  = "http://loki.monitoring:3100"
}

resource "grafana_data_source" "tempo" {
  type = "tempo"
  name = "Tempo"
  url  = "http://tempo.monitoring:3200"
}

Dashboards as Code

Grafonnet — Grafana Dashboards with Jsonnet

Grafonnet is a Jsonnet library for generating Grafana dashboard JSON. Instead of manually creating 20 similar service dashboards, you write a template once and generate dashboards for every service.

// service-dashboard.jsonnet — Reusable service dashboard template
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local panel = grafana.panel;

// Function: generate a Golden Signals dashboard for any service
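local serviceDashboard(serviceName, sloTarget=0.999) =
  dashboard.new('Service: %s' % serviceName)
  + dashboard.withUid('service-%s' % serviceName)
  // Surface the SLO target so the sloTarget parameter is not silently ignored
  + dashboard.withDescription('Golden Signals for %s (SLO target: %s)' % [serviceName, sloTarget])
  + dashboard.withTags(['auto-generated', 'golden-signals'])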
  + dashboard.withPanels([
    // Request Rate
    panel.timeSeries.new('Request Rate')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{service="%s"}[5m]))' % serviceName
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('reqps'),

    // Error Rate
    panel.timeSeries.new('Error Rate')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{service="%s",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="%s"}[5m])) * 100' % [serviceName, serviceName]
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('percent'),

    // p99 Latency
    panel.timeSeries.new('p99 Latency')
    + panel.timeSeries.queryOptions.withTargets([
      grafana.query.prometheus.new(
        'Prometheus',
        'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="%s"}[5m])) by (le))' % serviceName
      ),
    ])
    + panel.timeSeries.standardOptions.withUnit('s'),
  ]);

// Generate dashboards for all services
{
  'order-service-dashboard.json': serviceDashboard('order-service'),
  'payment-service-dashboard.json': serviceDashboard('payment-service', 0.9999),
  'user-service-dashboard.json': serviceDashboard('user-service'),
  'inventory-service-dashboard.json': serviceDashboard('inventory-service'),
}

Deploying Dashboards with Terraform

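# Step 1 (shell): render one JSON file per dashboard from the Jsonnet template
mkdir -p output
jsonnet -J vendor -m output/ service-dashboard.jsonnet

# Step 2 (Terraform): create one grafana_dashboard resource per generated file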
resource "grafana_dashboard" "services" {
  for_each = fileset("${path.module}/output", "*.json")

  config_json = file("${path.module}/output/${each.value}")
  overwrite   = true
}
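
Because the resource above sets overwrite = true, two generated files that share a uid could silently replace one another in Grafana. A minimal pre-deploy check, sketched here in Python, lists any uid that is missing or used by more than one file. The function name problem_uids and the output/ directory layout are assumptions for illustration, not part of the original pipeline.

```python
import json
from collections import defaultdict
from pathlib import Path


def problem_uids(output_dir: str) -> dict[str, list[str]]:
    """Map each missing ('') or duplicated dashboard uid to the files using it."""
    files_by_uid = defaultdict(list)
    for path in sorted(Path(output_dir).glob("*.json")):
        # A dashboard without a uid is grouped under the empty string.
        uid = json.loads(path.read_text()).get("uid") or ""
        files_by_uid[uid].append(path.name)
    return {
        uid: files
        for uid, files in files_by_uid.items()
        if uid == "" or len(files) > 1
    }
```

A CI job could call problem_uids("output") between the jsonnet step and terraform apply, and fail the build when the returned dict is not empty.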

Alert Rules via GitOps

With the Prometheus Operator, alert rules are Kubernetes custom resources of kind PrometheusRule. Store them in Git, review changes via pull requests, and deploy them with ArgoCD or Flux.

# alerts/slo-burn-rate.yaml — SLO alerts as Kubernetes CRDs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate-alerts
  namespace: monitoring
  labels:
    release: monitoring  # Must match the ruleSelector on the Prometheus custom resource
spec:
  groups:
    - name: slo.burn-rate
      rules:
        - alert: SLOBurnRateCritical
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
              / sum(rate(http_requests_total[1h])) by (service)
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service)
            ) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.service }} SLO burn rate critical"
            runbook_url: "https://runbooks.internal/slo-burn-critical"

        - alert: SLOBurnRateHigh
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
              / sum(rate(http_requests_total[6h])) by (service)
            ) > (6 * 0.001)
            and
            (
              sum(rate(http_requests_total{status=~"5.."}[30m])) by (service)
              / sum(rate(http_requests_total[30m])) by (service)
            ) > (6 * 0.001)
          for: 5m
          labels:
            severity: high
          annotations:
            summary: "{{ $labels.service }} SLO burn rate elevated"
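
The multipliers 14.4 and 6 in the expressions above are burn rates, and 0.001 is the error budget of a 99.9% SLO. The sketch below shows the arithmetic, assuming the widely used convention of alerting when 2% of a 30-day budget is spent within 1 hour and 5% within 6 hours. Those budget fractions are an assumption here; the rules themselves do not state them.

```python
# Sketch: derive burn-rate multipliers and alert thresholds for an SLO.
# The budget fractions (2% in 1h, 5% in 6h) are assumed conventions.

def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the error budget
    within `alert_window_hours`, for an SLO measured over `slo_window_hours`."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(rate: float, slo_target: float) -> float:
    """Error ratio above which the alert fires: burn rate times error budget."""
    return rate * (1.0 - slo_target)


for fraction, window in [(0.02, 1), (0.05, 6)]:
    rate = burn_rate(fraction, window)
    threshold = error_ratio_threshold(rate, 0.999)
    print(f"{window}h window: burn rate {rate:.1f}, threshold {threshold:.4f}")
```

For a service with a different target, such as the 0.9999 used for payment-service earlier, the same burn rates apply but the threshold shrinks with the error budget.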

GitOps Pipeline for Observability Configuration

flowchart LR
    A[Developer edits\nalert rule YAML] --> B[Pull Request\nPeer review + CI validation]
    B --> C[Merge to main]
    C --> D[ArgoCD / Flux\nDetects change in Git]
    D --> E[Apply to Kubernetes\nPrometheusRule CRD updated]
    E --> F[Prometheus reloads\nNew alert rules active]
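
The "CI validation" step in the pipeline can include a policy lint alongside syntax checks. The sketch below is one possible shape for it, operating on a manifest that has already been parsed into a dict (for example with a YAML library, which is assumed and not shown). The required labels and annotations are a hypothetical team policy, not something Prometheus enforces.

```python
# Sketch of a CI lint for alert rules. The policy below (required labels
# and annotations) is a hypothetical team convention.

REQUIRED_LABELS = {"severity"}
REQUIRED_ANNOTATIONS = {"summary", "runbook_url"}


def lint_prometheus_rule(manifest: dict) -> list[str]:
    """Return human-readable problems found in a parsed PrometheusRule manifest."""
    if manifest.get("kind") != "PrometheusRule":
        return [f"unexpected kind: {manifest.get('kind')!r}"]
    problems = []
    for group in manifest.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no alert metadata
            name = rule["alert"]
            if not str(rule.get("expr", "")).strip():
                problems.append(f"{name}: empty expr")
            for label in sorted(REQUIRED_LABELS - set(rule.get("labels", {}))):
                problems.append(f"{name}: missing label {label!r}")
            for ann in sorted(REQUIRED_ANNOTATIONS - set(rule.get("annotations", {}))):
                problems.append(f"{name}: missing annotation {ann!r}")
    return problems
```

Applied to the slo-burn-rate-alerts manifest shown earlier, this policy would flag SLOBurnRateHigh, which has a summary but no runbook_url.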

Platform Engineering Patterns

Platform engineering creates self-service abstractions so application teams get observability automatically without configuring anything. The platform team builds golden paths; product teams walk them.

The Golden Path: When a team creates a new service, the platform automatically provisions: (1) a Golden Signals dashboard, (2) SLO burn rate alerts at the default target, (3) an OTel Collector sidecar for telemetry, (4) log forwarding to Loki, (5) trace collection to Tempo. Zero configuration required from the application team.

Pattern	Implementation	What Teams Get
Auto-dashboard	Jsonnet template + CI generates dashboard on service registration	Dashboard appears automatically in Grafana
Auto-alerting	Default PrometheusRules applied via namespace labels	SLO alerts with sensible defaults, overridable via annotations
Auto-instrumentation	OTel Operator + namespace annotation	All pods in namespace get traces/metrics without code changes
Standardised logging	Fluent Bit DaemonSet with structured JSON format	Logs queryable in Loki with consistent schema
SLO registry	YAML file per service defining SLO targets	Burn rate dashboards and alerts generated from registry

# slo-registry/order-service.yaml — SLO definition in the platform registry
apiVersion: slo.platform.internal/v1
kind: ServiceSLO
metadata:
  name: order-service
  namespace: production
spec:
  service: order-service
  team: payments
  tier: critical  # critical | standard | internal

  slos:
    - name: availability
      description: "Proportion of successful requests"
      sli:
        type: ratio
        good: 'http_requests_total{service="order-service",status!~"5.."}'
        total: 'http_requests_total{service="order-service"}'
      target: 0.999
      window: 30d

    - name: latency
      description: "Proportion of requests under 300ms"
      sli:
        type: ratio
        good: 'http_request_duration_seconds_bucket{service="order-service",le="0.3"}'
        total: 'http_request_duration_seconds_count{service="order-service"}'
      target: 0.99
      window: 30d

  alerting:
    burnRate:
      - window: 1h
        rate: 14.4
        severity: critical
      - window: 6h
        rate: 6
        severity: high
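
To make the "alerts generated from registry" idea concrete, here is a rough sketch of a generator that turns one registry SLO into multi-window burn-rate rules. It takes the entry as a plain dict mirroring the YAML above. The helper names, the alert naming scheme, and the choice of a short window at 1/12 of the long one (matching the 1h/5m and 6h/30m pairs used earlier) are assumptions for illustration, not a description of any particular platform's controller.

```python
# Sketch: turn one SLO registry entry into burn-rate alert rules.

def short_window(long_window: str) -> str:
    """Pair a long window with a short one of 1/12 its length (1h -> 5m, 6h -> 30m)."""
    minutes_per_unit = {"m": 1, "h": 60, "d": 1440}
    value, unit = int(long_window[:-1]), long_window[-1]
    return f"{value * minutes_per_unit[unit] // 12}m"


def error_ratio(good: str, total: str, window: str) -> str:
    """PromQL for the error ratio (1 - good/total) over `window`."""
    return f"(1 - sum(rate({good}[{window}])) / sum(rate({total}[{window}])))"


def burn_rate_rules(service: str, slo: dict, burn_rates: list[dict]) -> list[dict]:
    """Build one alert rule per burn-rate entry for a single SLO."""
    budget = 1 - slo["target"]
    good, total = slo["sli"]["good"], slo["sli"]["total"]
    rules = []
    for br in burn_rates:
        threshold = round(br["rate"] * budget, 6)
        long_w, short_w = br["window"], short_window(br["window"])
        rules.append({
            "alert": f"SLOBurnRate{br['severity'].capitalize()}",
            "expr": (f"{error_ratio(good, total, long_w)} > {threshold} and "
                     f"{error_ratio(good, total, short_w)} > {threshold}"),
            "labels": {"severity": br["severity"], "service": service,
                       "slo": slo["name"]},
        })
    return rules
```

Feeding it the availability SLO and the two burnRate entries from the registry file yields two rules with thresholds 0.0144 and 0.006, the same values as the hand-written PrometheusRule.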

Reference Architecture

Complete Observability Platform Stack

Infrastructure Layer (Terraform): Prometheus, Grafana, Loki, Tempo deployed via Helm charts
Collection Layer (OTel): OTel Operator + DaemonSet Collectors, Fluent Bit for logs
Configuration Layer (GitOps): Alert rules, recording rules, and dashboards in Git, synced by ArgoCD
Automation Layer (Platform): SLO registry generates dashboards + alerts; auto-instrumentation via annotations
Consumption Layer (Self-Service): Teams view pre-built dashboards, customise SLO targets, access traces/logs


Conclusion & Series Summary

This completes the 12-part core series on Monitoring, Observability & Reliability. Here is the full arc:

Part	Topic	Key Concept
1	Foundations	Monitoring vs observability, three pillars
2	Metrics	Four Golden Signals, Prometheus data model
3	Logging	Structured logging, ELK vs Loki
4	Prometheus Deep Dive	PromQL, recording rules, federation
5	Distributed Tracing	Spans, context propagation, sampling
6	OpenTelemetry	Unified telemetry, Collector pipelines
7	Visualization & Alerting	Grafana dashboards, Alertmanager routing
8	Kubernetes Observability	Control plane, KSM, OTel Operator
9	SLOs & Error Budgets	Burn rate alerting, error budget policies
10	Incident Management	Structured response, blameless post-mortems
11	Chaos Engineering	Hypothesis-driven fault injection
12	Observability as Code	Terraform, Jsonnet, GitOps, platform engineering

Key takeaways from Part 12 and the series as a whole:

Codify everything — dashboards, alerts, infrastructure, SLOs — in version-controlled repositories
GitOps for alert rules gives you review, rollback, and audit trail
Grafonnet/Jsonnet templates eliminate dashboard duplication across services
Platform engineering makes observability self-service — the golden path gives teams observability by default
The complete stack: OTel for collection → Prometheus/Loki/Tempo for storage → Grafana for visualization → Alertmanager for notification → SLOs for decision-making
