Grafana Deep Dive Part 10: Automation with Infrastructure as Code

Why Automate Grafana?

Managing observability infrastructure through manual UI interactions (what some call “click-ops”) is a pattern that scales poorly. When your Grafana stack exists only as configuration stored in a database, backed by memory and tribal knowledge, you inherit every risk of unversioned, unreproducible infrastructure. A single misclick can delete weeks of dashboard work. A disgruntled team member can silently modify alert thresholds. A cloud region failure can take your entire monitoring configuration with it. Infrastructure as Code (IaC) mitigates these risks by applying the same engineering discipline to observability that we apply to application code: every change is versioned, reviewable, and reproducible.

Reproducibility

Reproducibility means that given the same code inputs, you can reliably produce the same infrastructure state across any environment. For observability, this translates to spinning up a complete monitoring stack — dashboards, alert rules, notification policies, data sources, folders, and team permissions — in minutes rather than days. When a new microservice team is onboarded, they receive a fully configured monitoring experience from a template, not a manual setup guide that takes three sprints to complete.

                            
The Golden Signal Dashboard Problem: Without IaC, teams independently create their own “golden signals” dashboards with inconsistent panel layouts, different PromQL patterns, and varying alert thresholds. With IaC, you define a canonical golden signals template once and instantiate it per-service with variables, guaranteeing consistency while allowing service-specific customization.

Version Control & Audit Trail

When observability configuration lives in Git, you gain the full power of version control: blame history shows who changed what and when, pull requests enable peer review of alert threshold changes, branches allow experimentation without affecting production monitoring, and tags provide rollback points. This audit trail is increasingly important for compliance frameworks like SOC 2 and ISO 27001, which expect evidence of change management for security-critical systems, and your monitoring system is one of them.

Consider the scenario where a latency alert suddenly stops firing. Without version control, you’re left questioning whether the threshold was changed, by whom, and whether it was intentional. With Git history, a simple git log --follow alerts/api-latency.yaml reveals the complete change history, including the pull request discussion that justified the modification.

Disaster Recovery

Disaster recovery for observability is often overlooked until it’s needed. If your Grafana Cloud stack becomes unavailable, or if you need to migrate between cloud providers, having your entire configuration in code means recovery is a terraform apply away. Without IaC, recreating hundreds of dashboards, dozens of alert rules, complex notification routing trees, and team permission structures from memory is effectively impossible under the time pressure of a real disaster.

Case Study: Multi-Region Failover

A financial services company maintained their entire Grafana configuration in Terraform. When their primary cloud region experienced a 4-hour outage, they executed their DR plan:

Activated secondary Grafana Cloud stack (pre-provisioned via Terraform)
Ran terraform apply -var="environment=dr" to configure all dashboards and alerts
Updated DNS to point to the DR instance
Full monitoring restored in under 12 minutes

Without IaC, their estimated recovery time was 2–3 days of manual recreation, during which they would have no visibility into their production systems.


Environment Promotion

In mature organizations, observability configuration follows the same promotion path as application code: development → staging → production. A new alert rule is tested against staging traffic before being promoted to production. Dashboard changes are validated against realistic data before reaching on-call engineers. IaC makes this workflow natural — the same Terraform modules or Ansible playbooks are applied to each environment with environment-specific variables (different data source URLs, different alert thresholds, different notification channels).
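The promotion idea can be sketched in a few lines of Python: one shared definition plus a small set of per-environment overrides. This is only an illustration of the pattern, not how Terraform or Ansible implement variables; the setting names, URLs, and thresholds below are hypothetical.

```python
from copy import deepcopy

# Hypothetical base settings shared by every environment.
BASE = {
    "datasource_url": "https://mimir-staging.internal/prometheus",
    "latency_threshold_seconds": 0.5,
    "contact_point": "Platform Team Slack",
}

# Hypothetical per-environment overrides; only the differences are listed.
OVERRIDES = {
    "staging": {"latency_threshold_seconds": 1.0},
    "production": {
        "datasource_url": "https://mimir-prod.internal/prometheus",
        "contact_point": "Platform Team PagerDuty",
    },
}


def render_environment(name: str) -> dict:
    """Return the base settings with one environment's overrides applied."""
    if name not in OVERRIDES:
        raise KeyError(f"unknown environment: {name}")
    settings = deepcopy(BASE)
    settings.update(OVERRIDES[name])
    return settings


if __name__ == "__main__":
    for env in ("staging", "production"):
        print(env, render_environment(env))
```

Keeping the overrides small makes a pull request that changes only production thresholds easy to spot in review.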

Components of Observability Systems

What Needs to Be Automated

A complete observability platform consists of multiple interconnected layers, each requiring automation. Understanding these layers helps you prioritize what to automate first and choose the right tools for each component.

Observability Stack Automation Layers

flowchart TD
    A[Collection Layer] --> B[Transport Layer]
    B --> C[Storage Layer]
    C --> D[Visualization Layer]
    D --> E[Alerting Layer]
    E --> F[Incident Layer]

    A1["OTel Collector<br/>Grafana Alloy<br/>Prometheus Agent"] --> A
    B1["Kafka<br/>Load Balancers<br/>mTLS Certs"] --> B
    C1["Mimir<br/>Loki<br/>Tempo<br/>Pyroscope"] --> C
    D1["Dashboards<br/>Folders<br/>Data Sources<br/>Variables"] --> D
    E1["Alert Rules<br/>Contact Points<br/>Notification Policies<br/>Silences"] --> E
    F1["OnCall Schedules<br/>Escalation Chains<br/>Incident Workflows"] --> F

Automation Layers

Each layer maps to specific automation tools:

Layer	Components	Primary Tools	Priority
Collection	OTel Collector, Alloy, Prometheus	Helm, Ansible, Kubernetes Operators	High
Storage	Mimir, Loki, Tempo clusters	Helm, Terraform (cloud-managed)	High
Visualization	Dashboards, folders, data sources	Terraform, Grafonnet, Grafana API	Critical
Alerting	Rules, contacts, policies, silences	Terraform, file-based provisioning	Critical
Access Control	Users, teams, RBAC, service accounts	Terraform, SCIM, Grafana API	Medium
Incident	OnCall schedules, escalation chains	Terraform, Grafana API	Medium

                            
Start with Dashboards and Alerts: These are the components most frequently changed, most prone to configuration drift, and most impactful when lost. Collection infrastructure typically changes less frequently and is often already managed by platform teams through Helm charts.

Automating Collection Infrastructure

OpenTelemetry Collector with Helm

The OpenTelemetry Collector is the vendor-neutral telemetry pipeline that receives, processes, and exports metrics, logs, and traces. Deploying it via Helm charts provides repeatable installation with environment-specific customization through values.yaml overrides.

First, add the OpenTelemetry Helm repository:

# Add the OpenTelemetry Helm chart repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Install the collector in DaemonSet mode (one per node)
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability \
  --create-namespace \
  --values values-otel-collector.yaml

The values-otel-collector.yaml file passed to helm install customizes the collector’s pipeline configuration, resource limits, and export destinations:

# values-otel-collector.yaml
mode: daemonset

presets:
  logsCollection:
    enabled: true
  kubernetesAttributes:
    enabled: true
  kubeletMetrics:
    enabled: true

config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: true

  processors:
    batch:
      timeout: 5s
      send_batch_size: 1000
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
      spike_limit_mib: 128
    resourcedetection:
      detectors: [env, system, gcp, aws, azure]
      timeout: 5s

  exporters:
    otlphttp/grafana:
      endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
      headers:
        Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"
    prometheusremotewrite:
      endpoint: https://prometheus-prod-us-central-0.grafana.net/api/prom/push
      headers:
        Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"

  service:
    pipelines:
      metrics:
        receivers: [otlp, prometheus]
        processors: [memory_limiter, resourcedetection, batch]
        exporters: [prometheusremotewrite]
      traces:
        receivers: [otlp]
        processors: [memory_limiter, resourcedetection, batch]
        exporters: [otlphttp/grafana]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, resourcedetection, batch]
        exporters: [otlphttp/grafana]

resources:
  limits:
    cpu: 500m
    memory: 768Mi
  requests:
    cpu: 100m
    memory: 256Mi
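A common mistake in a hand-written collector config is a pipeline that references a receiver, processor, or exporter that was never defined. The sketch below shows one way to check for that in CI. It assumes the config has already been loaded into a Python dict (here a trimmed, hard-coded copy of the example above), and the function name is hypothetical. Components injected by chart presets or chart defaults are not visible to a check like this.

```python
# A trimmed copy of the collector config above, expressed as a Python dict.
CONFIG = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}, "memory_limiter": {}, "resourcedetection": {}},
    "exporters": {"otlphttp/grafana": {}, "prometheusremotewrite": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp", "prometheus"],
                "processors": ["memory_limiter", "resourcedetection", "batch"],
                "exporters": ["prometheusremotewrite"],
            },
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "resourcedetection", "batch"],
                "exporters": ["otlphttp/grafana"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "resourcedetection", "batch"],
                "exporters": ["otlphttp/grafana"],
            },
        }
    },
}


def undefined_components(config: dict) -> list[str]:
    """List every pipeline reference that has no matching top-level definition."""
    problems = []
    for pipeline, spec in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for name in spec.get(kind, []):
                if name not in defined:
                    problems.append(f"{pipeline}: {kind[:-1]} '{name}' is not defined")
    return problems


if __name__ == "__main__":
    print(undefined_components(CONFIG) or "all pipeline references resolve")
```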

For production deployments, use a Gateway pattern combining DaemonSet collectors (lightweight, per-node) with a centralized Gateway deployment (handles authentication, batching, and retry logic):

# Deploy per-node agents (lightweight, no auth credentials)
helm install otel-agent open-telemetry/opentelemetry-collector \
  --namespace observability \
  --values values-agent.yaml

# Deploy centralized gateway (handles auth, export to Grafana Cloud)
helm install otel-gateway open-telemetry/opentelemetry-collector \
  --namespace observability \
  --values values-gateway.yaml \
  --set mode=deployment \
  --set replicaCount=3

Grafana Alloy with Helm

Grafana Alloy (the successor to Grafana Agent) is Grafana’s distribution of the OpenTelemetry Collector with additional components for Prometheus scraping, Loki log collection, and native Grafana Cloud integration. Its Helm chart can run Alloy as a DaemonSet, Deployment, or StatefulSet, selected through the controller.type value.

# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Alloy
helm install alloy grafana/alloy \
  --namespace observability \
  --create-namespace \
  --values values-alloy.yaml

Alloy uses a River-based configuration language. The Helm chart manages this configuration through values.yaml:

# values-alloy.yaml
alloy:
  configMap:
    create: true
    content: |
      // Kubernetes Discovery
      discovery.kubernetes "pods" {
        role = "pod"
      }

      // Prometheus Scraping
      prometheus.scrape "kubernetes" {
        targets    = discovery.kubernetes.pods.targets
        forward_to = [prometheus.remote_write.grafana_cloud.receiver]

        scrape_interval = "30s"
      }

      // Remote Write to Grafana Cloud
      prometheus.remote_write "grafana_cloud" {
        endpoint {
          url = env("METRICS_ENDPOINT")
          basic_auth {
            username = env("METRICS_USERNAME")
            password = env("METRICS_PASSWORD")
          }
        }
      }

      // Loki Log Collection
      loki.source.kubernetes "pods" {
        targets    = discovery.kubernetes.pods.targets
        forward_to = [loki.write.grafana_cloud.receiver]
      }

      loki.write "grafana_cloud" {
        endpoint {
          url = env("LOGS_ENDPOINT")
          basic_auth {
            username = env("LOGS_USERNAME")
            password = env("LOGS_PASSWORD")
          }
        }
      }

  # In the grafana/alloy chart, container resources and envFrom are set under the alloy key
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi

  envFrom:
    - secretRef:
        name: grafana-cloud-credentials

controller:
  type: daemonset

                            
Operator Mode: The operator model comes from Alloy’s predecessor: the Grafana Agent Operator uses a Kubernetes CRD (GrafanaAgent) to define the desired collection configuration and reconciles the collector fleet to match it. With Alloy, the Helm chart manages the deployment, and the prometheus.operator.* components can discover Prometheus Operator resources such as ServiceMonitor and PodMonitor, so scrape targets can change without a Helm upgrade.

Getting to Grips with the Grafana API

Every piece of Grafana configuration is accessible through REST APIs. Understanding these APIs is foundational — Terraform providers, Ansible modules, and custom automation scripts all ultimately interact with these endpoints.

Grafana Cloud API

The Grafana Cloud API manages cloud-level resources: stacks, API keys, plugins, and billing. It operates at a higher level than the individual Grafana instance API and uses a Cloud API key for authentication.

# Create a Cloud API key via the Grafana Cloud Portal
# Then use it to manage stacks programmatically

# List all stacks in your organization
curl -s -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
  https://grafana.com/api/orgs/$ORG_SLUG/instances | jq '.items[].name'

# Create a new Grafana Cloud stack
curl -X POST -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
  -H "Content-Type: application/json" \
  https://grafana.com/api/instances \
  -d '{
    "name": "prod-us-east",
    "slug": "prod-us-east",
    "region": "us",
    "description": "Production stack for US East region"
  }'

# Create a service account for Terraform (its token is issued in a separate call)
curl -X POST -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
  "https://grafana.com/api/instances/$STACK_SLUG/api/serviceaccounts" \
  -H "Content-Type: application/json" \
  -d '{"name": "terraform-sa", "role": "Admin"}'

Grafana Instance API

The instance API manages resources within a specific Grafana deployment: dashboards, folders, data sources, alert rules, annotations, users, and teams. Authentication uses either API keys or service account tokens.

# Dashboard CRUD Operations
# Create or update a dashboard
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": {
      "title": "API Gateway Health",
      "uid": "api-gw-health",
      "panels": [],
      "schemaVersion": 39
    },
    "folderUid": "platform-team",
    "overwrite": true,
    "message": "Updated via CI/CD pipeline"
  }'

# List all folders
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  "$GRAFANA_URL/api/folders" | jq '.[].title'

# Create a data source
curl -X POST "$GRAFANA_URL/api/datasources" \
  -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Mimir Production",
    "type": "prometheus",
    "url": "https://mimir-prod.internal:9009/prometheus",
    "access": "proxy",
    "basicAuth": true,
    "basicAuthUser": "mimir",
    "secureJsonData": {"basicAuthPassword": "'"$MIMIR_PASSWORD"'"},
    "jsonData": {"httpMethod": "POST", "timeInterval": "15s"}
  }'

# Export alert rules for backup
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  "$GRAFANA_URL/api/v1/provisioning/alert-rules" | jq '.' > alert-rules-backup.json

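The same dashboard upsert can be driven from a script instead of curl. The sketch below uses only the Python standard library and builds the request without sending it, so the payload can be inspected or unit-tested first. The helper name, the example URL, and the token are placeholders, not part of any Grafana client library.

```python
import json
import urllib.request


def build_dashboard_request(grafana_url, token, dashboard, folder_uid, message):
    """Build (but do not send) the POST request for /api/dashboards/db."""
    payload = {
        "dashboard": dashboard,
        "folderUid": folder_uid,
        "overwrite": True,
        "message": message,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_dashboard_request(
    "https://grafana.example.com/",
    "example-token",
    {"title": "API Gateway Health", "uid": "api-gw-health", "panels": [], "schemaVersion": 39},
    "platform-team",
    "Updated via CI/CD pipeline",
)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Building the body with json.dumps also avoids the quoting problems of assembling JSON inside a shell string.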
Reference: Key API Endpoints

Resource	Endpoint	Methods
Dashboards	`/api/dashboards/db` (create/update), `/api/dashboards/uid/:uid` (read/delete)	POST; GET, DELETE
Folders	`/api/folders`	GET, POST, PUT, DELETE
Data Sources	`/api/datasources`	GET, POST, PUT, DELETE
Alert Rules	`/api/v1/provisioning/alert-rules`	GET, POST, PUT, DELETE
Contact Points	`/api/v1/provisioning/contact-points`	GET, POST, PUT, DELETE
Notification Policies	`/api/v1/provisioning/policies`	GET, PUT
Service Accounts	`/api/serviceaccounts`	GET, POST, PATCH, DELETE
Teams	`/api/teams`	GET, POST, PUT, DELETE
Annotations	`/api/annotations`	GET, POST, PUT, DELETE


Terraform & Ansible for Grafana

Grafana Terraform Provider

The official grafana/grafana Terraform provider wraps the Grafana APIs into declarative HCL resources. It supports both Grafana Cloud management (stacks, API keys, plugins) and instance-level configuration (dashboards, alerts, data sources). The provider is maintained by Grafana Labs and sees frequent releases aligned with new Grafana features.

// providers.tf - Configure the Grafana Terraform provider
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.0"
    }
  }
}

// Cloud provider for stack management
provider "grafana" {
  alias                     = "cloud"
  cloud_access_policy_token = var.grafana_cloud_api_key // argument was named cloud_api_key before provider 3.0
}

// Instance provider for dashboard/alert management
provider "grafana" {
  alias = "stack"
  url   = var.grafana_url
  auth  = var.grafana_service_account_token
}

The provider exposes resources for every major Grafana component:

// folders.tf - Organize dashboards into team folders
resource "grafana_folder" "platform" {
  provider = grafana.stack
  title    = "Platform Team"
  uid      = "platform-team"
}

resource "grafana_folder" "services" {
  provider = grafana.stack
  title    = "Microservices"
  uid      = "microservices"
}

// datasources.tf - Configure Prometheus and Loki data sources
resource "grafana_data_source" "mimir" {
  provider = grafana.stack
  type     = "prometheus"
  name     = "Mimir (Metrics)"
  uid      = "mimir-prod"
  url      = var.mimir_endpoint

  json_data_encoded = jsonencode({
    httpMethod   = "POST"
    timeInterval = "15s"
  })

  secure_json_data_encoded = jsonencode({
    basicAuthPassword = var.mimir_password
  })

  basic_auth_enabled  = true
  basic_auth_username = var.mimir_username
}

resource "grafana_data_source" "loki" {
  provider = grafana.stack
  type     = "loki"
  name     = "Loki (Logs)"
  uid      = "loki-prod"
  url      = var.loki_endpoint

  json_data_encoded = jsonencode({
    maxLines    = 5000
    derivedFields = [{
      name          = "TraceID"
      matcherRegex  = "traceID=(\\w+)"
      url           = "$${__value.raw}"
      datasourceUid = "tempo-prod"
    }]
  })
}
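The derivedFields entry relies on a regular expression to pull a trace ID out of each log line. After HCL unescaping, the pattern is traceID=(\w+). The sketch below runs that pattern against a made-up log line in Python to show what the capture group yields. Grafana evaluates the expression with its own regex engine, so treat this as a quick sanity check of the pattern rather than a reproduction of Grafana's behavior.

```python
import re

# Same pattern as matcherRegex above, after HCL unescaping.
TRACE_ID_PATTERN = re.compile(r"traceID=(\w+)")


def extract_trace_id(log_line: str):
    """Return the first trace ID found in the line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical log line for illustration.
    line = 'level=info msg="request done" traceID=4bf92f3577b34da6 status=200'
    print(extract_trace_id(line))
```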

Alert configuration with Terraform enables version-controlled, peer-reviewed alert rule management:

// alerts.tf - Define alert rules as code
resource "grafana_rule_group" "api_slos" {
  provider         = grafana.stack
  name             = "API SLO Alerts"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60

  rule {
    name      = "API Availability SLO Burn Rate"
    condition = "C"
    for       = "5m"

    labels = {
      severity = "critical"
      team     = "platform"
      slo      = "api-availability"
    }

    annotations = {
      summary     = "API availability SLO burn rate is too high"
      description = "Error budget consumption rate exceeds threshold. Current burn rate: {{ $values.B }}x"
      runbook_url = "https://runbooks.internal/api-availability-slo"
    }

    data {
      ref_id         = "A"
      datasource_uid = grafana_data_source.mimir.uid
      relative_time_range {
        from = 3600
        to   = 0
      }
      model = jsonencode({
        expr = "1 - (sum(rate(http_requests_total{code!~\"5..\"}[1h])) / sum(rate(http_requests_total[1h])))"
      })
    }

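    data {
      ref_id         = "B"
      datasource_uid = "__expr__"
      // Expression nodes still need a relative_time_range block; the range itself is unused
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type       = "math"
        expression = "$A / (1 - 0.999)"
      })
    }

    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type       = "threshold"
        expression = "B"
        conditions = [{
          evaluator = { type = "gt", params = [14.4] }
        }]
      })
    }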
  }
}

// contact_points.tf
resource "grafana_contact_point" "platform_pagerduty" {
  provider = grafana.stack
  name     = "Platform Team PagerDuty"

  pagerduty {
    integration_key = var.pagerduty_integration_key
    severity        = "critical"
    summary         = "{{ template \"default.title\" . }}"
    details = jsonencode({
      firing       = "{{ .Alerts.Firing | len }}"
      resolved     = "{{ .Alerts.Resolved | len }}"
      alertname    = "{{ .CommonLabels.alertname }}"
    })
  }
}

// notification_policies.tf
resource "grafana_notification_policy" "root" {
  provider      = grafana.stack
  contact_point = "Platform Team Slack"
  group_by      = ["alertname", "team"]
  group_wait    = "30s"
  group_interval = "5m"
  repeat_interval = "4h"

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.platform_pagerduty.name
    group_wait      = "10s"
    repeat_interval = "1h"
  }

  policy {
    matcher {
      label = "team"
      match = "="
      value = "payments"
    }
    contact_point = "Payments Team Slack"
  }
}

// slo.tf - Define SLOs as code (Grafana Cloud)
resource "grafana_slo" "api_availability" {
  provider    = grafana.stack
  name        = "API Availability"
  description = "99.9% of API requests succeed within 500ms"

  objectives {
    value  = 0.999
    window = "30d"
  }

  query {
    type = "ratio"
    ratio {
      // Ratio queries take raw counter selectors; the SLO applies the rate and aggregation itself
      success_metric  = "http_requests_total{code!~\"5..\"}"
      total_metric    = "http_requests_total"
    }
  }

  alerting {
    fastburn {
      annotation {
        key   = "runbook_url"
        value = "https://runbooks.internal/api-availability"
      }
    }
  }
}

Ansible Collection for Grafana

The grafana.grafana Ansible collection provides modules and roles for both installing Grafana software and managing its configuration. It’s particularly useful for organizations already using Ansible for configuration management, and excels at tasks that combine system-level operations (package installation, service management) with API-level configuration.

# Install the Grafana Ansible collection
# ansible-galaxy collection install grafana.grafana

# playbook-grafana-config.yaml
---
- name: Configure Grafana Observability Stack
  hosts: localhost
  connection: local
  vars:
    grafana_url: "{{ lookup('env', 'GRAFANA_URL') }}"
    grafana_api_key: "{{ lookup('env', 'GRAFANA_SA_TOKEN') }}"

  tasks:
    - name: Create team folders
      grafana.grafana.folder:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        title: "{{ item.title }}"
        uid: "{{ item.uid }}"
        state: present
      loop:
        - { title: "Platform Team", uid: "platform-team" }
        - { title: "Payments Team", uid: "payments-team" }
        - { title: "Shared Dashboards", uid: "shared" }

    - name: Configure Prometheus data source
      grafana.grafana.datasource:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        name: "Mimir Production"
        ds_type: prometheus
        ds_url: "{{ mimir_endpoint }}"
        access: proxy
        basic_auth_user: "{{ mimir_username }}"
        basic_auth_password: "{{ mimir_password }}"
        json_data:
          httpMethod: POST
          timeInterval: "15s"
        state: present

    - name: Deploy dashboard from JSON file
      grafana.grafana.dashboard:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        dashboard_id: null
        dashboard_revision: null
        state: present
        overwrite: true
        commit_message: "Deployed via Ansible"
        folder: "Platform Team"
        path: "dashboards/{{ item }}.json"
      loop:
        - api-gateway-health
        - kubernetes-cluster-overview
        - slo-overview

    - name: Import community dashboard from Grafana.com
      grafana.grafana.dashboard:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        state: present
        overwrite: true
        folder: "Shared Dashboards"
        dashboard_id: 15760
        dashboard_revision: 1

                            
Terraform vs Ansible: When to Use Each: Use Terraform when you need declarative state management with drift detection, complex dependency graphs, and plan/apply workflows. Use Ansible when you need to combine system-level tasks (installing packages, managing files) with API configuration, or when your organization already standardizes on Ansible for configuration management. Many teams use both: Terraform for cloud resource provisioning and Ansible for software configuration.

Dashboards & Alerts as Code

Dashboard as Code with Grafonnet

Grafonnet is a Jsonnet library that provides a type-safe, composable way to generate Grafana dashboard JSON. Instead of manually crafting 2000-line JSON files (which are nearly impossible to review in pull requests), you write concise Jsonnet code that compiles into valid dashboard JSON. This approach enables:

Reusable templates — define a panel factory once, instantiate per-service
Type safety — catch invalid configurations at compile time
Composability — build complex dashboards from small, testable building blocks
Readable diffs — PR reviews show meaningful changes, not JSON position shifts

// service-dashboard.jsonnet - Generate a golden signals dashboard
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local panel = grafana.panel;
local prometheus = grafana.query.prometheus;
local variable = grafana.dashboard.variable;

// Reusable panel factory for golden signals
local goldenSignalPanel(title, expr, unit='short') =
  panel.timeSeries.new(title)
  + panel.timeSeries.queryOptions.withTargets([
    prometheus.new('mimir-prod', expr)
    + prometheus.withLegendFormat('{{instance}}'),
  ])
  + panel.timeSeries.standardOptions.withUnit(unit)
  + panel.timeSeries.gridPos.withW(12)
  + panel.timeSeries.gridPos.withH(8);

// Dashboard definition
dashboard.new('Service Golden Signals - ${service}')
+ dashboard.withUid('golden-signals')  // UIDs are static identifiers; dashboard variables are not interpolated here
+ dashboard.withTags(['golden-signals', 'generated', 'platform'])
+ dashboard.withRefresh('30s')
+ dashboard.withVariables([
  variable.query.new('service')
  + variable.query.withDatasource('prometheus', 'mimir-prod')
  + variable.query.queryTypes.withLabelValues('service_name', 'up'),
])
+ dashboard.withPanels([
  // Traffic
  goldenSignalPanel(
    'Request Rate',
    'sum(rate(http_requests_total{service_name="$service"}[5m])) by (method)',
    'reqps'
  ) + panel.timeSeries.gridPos.withX(0) + panel.timeSeries.gridPos.withY(0),

  // Errors
  goldenSignalPanel(
    'Error Rate',
    'sum(rate(http_requests_total{service_name="$service",code=~"5.."}[5m])) / sum(rate(http_requests_total{service_name="$service"}[5m]))',
    'percentunit'
  ) + panel.timeSeries.gridPos.withX(12) + panel.timeSeries.gridPos.withY(0),

  // Latency
  goldenSignalPanel(
    'P99 Latency',
    'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service_name="$service"}[5m])) by (le))',
    's'
  ) + panel.timeSeries.gridPos.withX(0) + panel.timeSeries.gridPos.withY(8),

  // Saturation
  goldenSignalPanel(
    'CPU Saturation',
    'sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) / sum(kube_pod_container_resource_limits{pod=~"$service.*",resource="cpu"})',
    'percentunit'
  ) + panel.timeSeries.gridPos.withX(12) + panel.timeSeries.gridPos.withY(8),
])
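The x and y offsets in the dashboard above follow a simple rule: Grafana lays panels out on a 24-column grid, so two 12-wide panels fit per row and each new row starts 8 units lower. As a rough illustration of that arithmetic (written in Python rather than Jsonnet, with a hypothetical helper name), the positions can be derived from a panel's index instead of being typed by hand:

```python
GRID_COLUMNS = 24  # width of Grafana's dashboard grid


def grid_pos(index: int, per_row: int = 2, height: int = 8) -> dict:
    """Return the gridPos for the panel at a zero-based index, filling rows left to right."""
    width = GRID_COLUMNS // per_row
    row, col = divmod(index, per_row)
    return {"x": col * width, "y": row * height, "w": width, "h": height}


titles = ["Request Rate", "Error Rate", "P99 Latency", "CPU Saturation"]
layout = {title: grid_pos(i) for i, title in enumerate(titles)}

if __name__ == "__main__":
    for title, pos in layout.items():
        print(f"{title}: {pos}")
```

The same calculation could live in a Jsonnet helper so that adding a fifth panel does not require renumbering the others.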

Compile the Jsonnet to JSON and deploy:

# Install jsonnet tooling
go install github.com/google/go-jsonnet/cmd/jsonnet@latest
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest

# Initialize jsonnet-bundler and fetch Grafonnet
jb init
jb install github.com/grafana/grafonnet/gen/grafonnet-latest

# Compile dashboard to JSON
jsonnet -J vendor/ service-dashboard.jsonnet > dashboards/golden-signals.json

# Deploy via Grafana API or Terraform
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat dashboards/golden-signals.json), \"folderUid\": \"platform-team\", \"overwrite\": true}"

Alert Rules Provisioning

Grafana supports file-based provisioning for alert rules, contact points, and notification policies. This approach works with any deployment model (Docker, Kubernetes, bare metal) and doesn’t require Terraform or external API access: Grafana reads YAML files from the provisioning directory at startup. Later edits to those files take effect after a restart or a call to the Admin API reload endpoint (POST /api/admin/provisioning/alerting/reload).

# /etc/grafana/provisioning/alerting/platform-alerts.yaml
apiVersion: 1

groups:
  - orgId: 1
    name: Platform SLO Alerts
    folder: Platform Team
    interval: 1m
    rules:
      - uid: api-availability-burn-rate
        title: "API Availability - Fast Burn"
        condition: C
        for: 2m
        labels:
          severity: critical
          team: platform
          slo: api-availability
        annotations:
          summary: "API availability burn rate exceeds 14.4x threshold"
          description: |
            The 1-hour error burn rate is {{ $values.B }}x the budget rate.
            At this rate, the monthly error budget will be exhausted in {{ printf "%.1f" (div 720.0 $values.B) }} hours.
          runbook_url: https://runbooks.internal/api-availability-slo
        data:
          - refId: A
            datasourceUid: mimir-prod
            relativeTimeRange:
              from: 3600
              to: 0
            model:
              expr: |
                1 - (
                  sum(rate(http_requests_total{code!~"5.."}[1h]))
                  /
                  sum(rate(http_requests_total[1h]))
                )
              intervalMs: 15000
              maxDataPoints: 43200
          - refId: B
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A / (1 - 0.999)"
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [14.4]

      - uid: api-latency-p99
        title: "API P99 Latency > 500ms"
        condition: B
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "API P99 latency exceeds 500ms SLO target"
        data:
          - refId: A
            datasourceUid: mimir-prod
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.5]
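The numbers in the fast-burn rule come from straightforward error-budget arithmetic. With a 99.9% target, the budget is 0.1% of requests over a 30-day (720-hour) window. A burn rate of 14.4 means errors arrive 14.4 times faster than the budget allows, which exhausts the budget in 720 / 14.4 = 50 hours and consumes 2% of it in a single hour. A minimal Python sketch of those relationships, with hypothetical function names:

```python
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24  # 30-day SLO window = 720 hours


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return window_hours / rate


def budget_consumed(rate: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the total budget spent after burning at this rate for the given hours."""
    return rate * hours / window_hours


if __name__ == "__main__":
    rate = burn_rate(0.0144)  # 1.44% of requests failing
    print(f"burn rate: {rate:.1f}x")
    print(f"budget exhausted in: {hours_to_exhaustion(rate):.1f} hours")
    print(f"budget consumed in 1 hour: {budget_consumed(rate, 1):.1%}")
```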

# /etc/grafana/provisioning/alerting/contact-points.yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: Platform Team PagerDuty
    receivers:
      - uid: platform-pd
        type: pagerduty
        settings:
          integrationKey: "$__env{PAGERDUTY_PLATFORM_KEY}"
          severity: critical
        disableResolveMessage: false

  - orgId: 1
    name: Platform Team Slack
    receivers:
      - uid: platform-slack
        type: slack
        settings:
          recipient: "#platform-alerts"
          token: "$__env{SLACK_BOT_TOKEN}"
          mentionChannel: here

# /etc/grafana/provisioning/alerting/notification-policies.yaml
apiVersion: 1

policies:
  - orgId: 1
    receiver: Platform Team Slack
    group_by: ['alertname', 'team']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: Platform Team PagerDuty
        matchers:
          - severity = critical
        group_wait: 10s
        repeat_interval: 1h
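To reason about where an alert will land, it helps to picture the policy tree as a recursive lookup: an alert starts at the root, moves into the first child route whose matchers all apply, and falls back to the current policy's receiver when no child matches. The Python sketch below models only that core idea with equality matchers. It deliberately ignores regex matchers, the continue option, grouping, and timing, and the Policy class is a hypothetical stand-in rather than a Grafana data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    receiver: str
    matchers: dict = field(default_factory=dict)  # label name -> required value
    routes: list = field(default_factory=list)    # child policies, in order


def route(policy: Policy, labels: dict) -> str:
    """Return the receiver for an alert: first matching child wins, else this policy."""
    for child in policy.routes:
        if all(labels.get(name) == value for name, value in child.matchers.items()):
            return route(child, labels)
    return policy.receiver


# Mirrors the notification-policies.yaml example above.
ROOT = Policy(
    receiver="Platform Team Slack",
    routes=[
        Policy(receiver="Platform Team PagerDuty", matchers={"severity": "critical"}),
    ],
)

if __name__ == "__main__":
    print(route(ROOT, {"alertname": "FastBurn", "severity": "critical"}))
    print(route(ROOT, {"alertname": "P99Latency", "severity": "warning"}))
```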

GitOps Workflow

A GitOps workflow for observability treats a Git repository as the single source of truth for all monitoring configuration. Changes flow through pull requests with automated validation, peer review, and controlled deployment — the same workflow used for application code.

GitOps Workflow for Observability Configuration

flowchart LR
    A["Developer<br/>Creates Branch"] --> B["Edit Config<br/>Terraform/Jsonnet/YAML"]
    B --> C["Push &<br/>Open PR"]
    C --> D{CI Pipeline}
    D --> |"terraform plan<br/>jsonnet lint<br/>yamllint"| E["Review &<br/>Approve"]
    E --> F["Merge to<br/>Main"]
    F --> G{CD Pipeline}
    G --> |"terraform apply<br/>or API deploy"| H["Staging<br/>Validated"]
    H --> I["Promote to<br/>Production"]
    I --> J["Drift Detection<br/>Scheduled"]
    J --> |Drift found| A

A typical repository structure for GitOps-managed observability:

observability-config/
├── terraform/
│   ├── environments/
│   │   ├── staging/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   └── production/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       └── terraform.tfvars
│   ├── modules/
│   │   ├── grafana-stack/
│   │   │   ├── main.tf          # Data sources, folders, teams
│   │   │   ├── dashboards.tf    # Dashboard resources
│   │   │   ├── alerts.tf        # Alert rules and groups
│   │   │   ├── notifications.tf # Contact points, policies
│   │   │   └── variables.tf
│   │   └── golden-signals/
│   │       ├── main.tf          # Per-service golden signals template
│   │       └── variables.tf
│   └── shared/
│       └── provider.tf
├── dashboards/
│   ├── jsonnet/
│   │   ├── lib/                  # Reusable Grafonnet libraries
│   │   ├── golden-signals.jsonnet
│   │   ├── slo-overview.jsonnet
│   │   └── kubernetes-cluster.jsonnet
│   └── compiled/                 # Git-tracked compiled JSON
│       ├── golden-signals.json
│       ├── slo-overview.json
│       └── kubernetes-cluster.json
├── alerts/
│   ├── platform/
│   │   ├── slo-burn-rates.yaml
│   │   └── infrastructure.yaml
│   └── teams/
│       ├── payments.yaml
│       └── search.yaml
├── helm/
│   ├── otel-collector/
│   │   └── values.yaml
│   └── alloy/
│       └── values.yaml
├── .github/
│   └── workflows/
│       ├── validate.yaml         # PR validation
│       ├── deploy-staging.yaml   # Deploy to staging on merge
│       ├── deploy-prod.yaml      # Deploy to prod (manual approval)
│       └── drift-check.yaml      # Scheduled drift detection
├── Makefile
└── README.md

The CI/CD pipeline validates changes before deployment:

# .github/workflows/validate.yaml
name: Validate Observability Config
on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'dashboards/**'
      - 'alerts/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8"

      - name: Terraform Format Check
        run: terraform fmt -check -recursive terraform/

      - name: Terraform Validate (Staging)
        working-directory: terraform/environments/staging
        run: |
          terraform init -backend=false
          terraform validate

      - name: Terraform Plan (Staging)
        working-directory: terraform/environments/staging
        env:
          GRAFANA_AUTH: ${{ secrets.GRAFANA_SA_TOKEN_STAGING }}
        run: |
          terraform init
          terraform plan -no-color -out=tfplan
          terraform show -no-color tfplan > plan.txt

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/environments/staging/plan.txt', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan (Staging)\n\`\`\`\n${plan.substring(0, 60000)}\n\`\`\``
            });

      - name: Validate Jsonnet
        run: |
          # Compile every top-level dashboard so newly added files are not skipped
          for f in dashboards/jsonnet/*.jsonnet; do
            jsonnet -J vendor/ "$f" > /dev/null
          done

      - name: Lint Alert YAML
        run: yamllint -d relaxed alerts/
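
The repository layout above tracks compiled JSON in Git next to the Jsonnet sources, so one more check worth considering is that every source has a compiled counterpart. A rough Python sketch, assuming the `dashboards/jsonnet/` and `dashboards/compiled/` layout shown earlier; `uncompiled_sources` is a hypothetical helper name:

```python
from pathlib import Path


def uncompiled_sources(jsonnet_dir: Path, compiled_dir: Path) -> list[str]:
    """Return names of top-level .jsonnet files with no matching compiled .json.

    Only the top level is scanned, so library code under lib/ is ignored.
    """
    missing = []
    for source in sorted(jsonnet_dir.glob("*.jsonnet")):
        if not (compiled_dir / f"{source.stem}.json").is_file():
            missing.append(source.name)
    return missing
```

In CI this could be called with `Path("dashboards/jsonnet")` and `Path("dashboards/compiled")`, failing the job when the returned list is non-empty. It checks presence only, not whether the compiled file is up to date with its source.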

Best Practices

State Management

Terraform state for Grafana resources requires careful handling. The state file contains sensitive information (data source credentials, API tokens) and represents the authoritative record of what Terraform manages versus what was manually created.

                            
Remote State with Locking: Always use a remote state backend (S3 + DynamoDB, GCS, Terraform Cloud) with state locking enabled. Without locking, team members running terraform apply at the same time against the same Grafana instance can race each other and corrupt the state.
                        

// backend.tf - Remote state with locking
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "observability/grafana/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Key state management practices:

Import existing resources before managing them: terraform import grafana_dashboard.existing abc123
Use lifecycle { prevent_destroy = true } on critical resources (production alert rules, notification policies)
Separate state files per environment — staging and production should never share state
Use terraform state list regularly to audit what’s managed
Handle UID conflicts — set explicit UIDs in Terraform to avoid conflicts with manually-created resources
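
The UID practice can be checked mechanically before Terraform ever runs. The sketch below, in Python, assumes dashboards are available as parsed JSON keyed by file name (for instance the files under `dashboards/compiled/`). `uid_problems` is a hypothetical helper, and it inspects only the top-level `uid` field.

```python
from collections import Counter
from typing import Any


def uid_problems(dashboards: dict[str, dict[str, Any]]) -> dict[str, list[str]]:
    """Report dashboards without an explicit uid and uids used more than once.

    `dashboards` maps a file name to its parsed dashboard JSON.
    """
    missing = sorted(name for name, d in dashboards.items() if not d.get("uid"))
    counts = Counter(d["uid"] for d in dashboards.values() if d.get("uid"))
    duplicates = sorted(uid for uid, n in counts.items() if n > 1)
    return {"missing_uid": missing, "duplicate_uid": duplicates}
```

Both lists being empty means every dashboard carries its own explicit UID, which keeps imports and re-applies pointed at the same object.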

Secrets Handling

Observability configuration inevitably involves secrets: API tokens, data source passwords, PagerDuty integration keys, and Slack webhook URLs. These must never appear in version control.

// variables.tf - Declare sensitive variables
variable "grafana_service_account_token" {
  description = "Grafana service account token for API access"
  type        = string
  sensitive   = true
}

variable "pagerduty_integration_key" {
  description = "PagerDuty integration key for critical alerts"
  type        = string
  sensitive   = true
}

variable "mimir_password" {
  description = "Password for Mimir basic auth"
  type        = string
  sensitive   = true
}

Secrets injection strategies:

Method	Best For	Example
Environment Variables	CI/CD pipelines	`TF_VAR_grafana_service_account_token=$SECRET`
Vault Provider	Enterprise teams	`data "vault_generic_secret" "grafana" {}`
SOPS	Git-encrypted secrets	`sops -d secrets.enc.yaml`
1Password/AWS SSM	Secret store integration	`data "aws_ssm_parameter" "token" {}`
.tfvars (gitignored)	Local development only	`terraform.tfvars` in `.gitignore`
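
Whichever injection method is used, Terraform picks up any environment variable named `TF_VAR_<name>` as the value of variable `<name>`. A small Python sketch of a pre-flight check that the three sensitive variables declared above are present, without printing their values; `missing_tf_vars` is a hypothetical helper name:

```python
import os
from typing import Iterable, Mapping

# Variable names taken from the variables.tf example above.
REQUIRED = (
    "grafana_service_account_token",
    "pagerduty_integration_key",
    "mimir_password",
)


def missing_tf_vars(names: Iterable[str], environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variable names whose TF_VAR_<name> entry is unset or empty."""
    return [name for name in names if not environ.get(f"TF_VAR_{name}")]
```

Running this before `terraform plan` in CI gives a clear failure message up front and avoids echoing any secret into the job log.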

CI/CD Pipelines

A well-designed CI/CD pipeline for observability configuration enforces quality gates while maintaining deployment velocity. The pipeline should validate syntax, check for breaking changes, deploy to staging for verification, and require approval before production promotion.

Pipeline Design: Multi-Stage Deployment

Stage 1 — Validate (on every PR):

terraform fmt -check — enforce consistent formatting
terraform validate — catch syntax errors
jsonnet lint — validate Grafonnet templates
yamllint — validate alert YAML files
terraform plan — show what will change (posted as PR comment)

Stage 2 — Deploy Staging (on merge to main):

terraform apply -auto-approve against staging Grafana
Run smoke tests (verify dashboards load, alerts evaluate)
Notify team of staging deployment

Stage 3 — Deploy Production (manual approval gate):

Require approval from on-call engineer or team lead
terraform apply against production Grafana
Create annotation in Grafana marking the deployment
Monitor for alert rule evaluation errors for 15 minutes
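
For the deployment annotation step, Grafana's HTTP API accepts a `POST` to `/api/annotations` with a JSON body containing `time` (epoch milliseconds), `tags`, and `text`. A minimal Python sketch using only the standard library follows. The base URL, token source, and tag names are assumptions for illustration, and the request is built separately from sending so it can be inspected in tests.

```python
import json
import time
import urllib.request
from typing import Optional


def build_deploy_annotation(
    base_url: str,
    token: str,
    text: str,
    tags: list[str],
    now_ms: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) a request that creates an organization-wide annotation."""
    payload = {
        "time": now_ms if now_ms is not None else int(time.time() * 1000),
        "tags": tags,
        "text": text,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def post_annotation(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In a pipeline, the token would come from the same secret store as the Terraform credentials, and the text could carry the commit SHA so the annotation links a dashboard change back to its pull request.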


Drift Detection

Configuration drift occurs when manual changes are made through the Grafana UI that diverge from the code-defined state. This is inevitable in practice — engineers will edit dashboards during incidents, adjust alert thresholds for immediate relief, or experiment with new visualizations. Drift detection ensures these changes are eventually captured in code or reverted.

# .github/workflows/drift-check.yaml
name: Drift Detection
on:
  schedule:
    - cron: '0 8 * * 1-5'  # Weekdays at 8 AM UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

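      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          # Wrapper disabled so the shell sees terraform's own exit code
          terraform_wrapper: false

      - name: Detect Drift (Production)
        working-directory: terraform/environments/production
        env:
          GRAFANA_AUTH: ${{ secrets.GRAFANA_SA_TOKEN_PROD }}
        run: |
          terraform init
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
          set +e
          terraform plan -detailed-exitcode -no-color > drift-report.txt 2>&1
          plan_exit=$?
          set -e

          case "$plan_exit" in
            0) echo "DRIFT_DETECTED=false" >> "$GITHUB_ENV" ;;
            2) echo "DRIFT_DETECTED=true" >> "$GITHUB_ENV" ;;
            *) cat drift-report.txt; exit 1 ;;  # plan failed: fail the job, not drift
          esac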

      - name: Create Issue for Drift
        if: env.DRIFT_DETECTED == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const drift = fs.readFileSync('terraform/environments/production/drift-report.txt', 'utf8');
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `[Drift] Grafana configuration drift detected - ${new Date().toISOString().split('T')[0]}`,
              body: `## Configuration Drift Detected\n\nManual changes were detected in production Grafana.\n\n\`\`\`\n${drift.substring(0, 60000)}\n\`\`\`\n\n### Action Required\n1. Review the changes above\n2. If intentional, import them into Terraform code\n3. If unintentional, run \`terraform apply\` to revert`,
              labels: ['drift', 'observability']
            });

                            
Handling Intentional Drift: Not all drift is bad. During an incident, an engineer might lower an alert threshold to reduce noise while investigating. The drift detection system should create a ticket, not automatically revert changes. The team then decides whether to codify the change (update Terraform) or revert it (run terraform apply).
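
When drift is found, a per-resource summary is often easier to triage than the full plan text. The Python sketch below assumes the plan was saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan > plan.json` (the workflow above does not do this as written). It reads the `resource_changes` array from that JSON. `drifted_resources` and `load_plan` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any


def load_plan(path: Path) -> dict[str, Any]:
    """Load the JSON produced by `terraform show -json <planfile>`."""
    return json.loads(path.read_text(encoding="utf-8"))


def drifted_resources(plan: dict[str, Any]) -> dict[str, list[str]]:
    """Group resource addresses by planned action, skipping no-op and read entries."""
    grouped: dict[str, list[str]] = {}
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue
        # Replacements appear as two actions, e.g. ["delete", "create"].
        grouped.setdefault("+".join(actions), []).append(change["address"])
    return grouped
```

The grouped result could be rendered into the issue body so reviewers see at a glance which dashboards or rule groups were edited by hand and which would be recreated.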
                        

Summary & Next Steps

Infrastructure as Code transforms observability from a fragile, manually-maintained system into a robust, auditable, and reproducible engineering practice. The key concepts covered in this article:

Why Automate — Reproducibility, version control with audit trails, disaster recovery measured in minutes not days, and safe environment promotion from staging to production
Automation Scope — Every layer needs automation: collection infrastructure (Helm charts), storage (Terraform-managed cloud resources), dashboards, alert rules, notification policies, and access control
Collection Infrastructure — OpenTelemetry Collector and Grafana Alloy deployed via Helm charts with customized values.yaml for environment-specific configuration
Grafana APIs — Cloud API for stack management, instance API for dashboards/alerts/data sources — the foundation that all automation tools build upon
Terraform & Ansible — The Grafana Terraform provider for declarative state management; the Ansible collection for combined system and API configuration
Dashboards as Code — Grafonnet (Jsonnet) for type-safe, composable, reviewable dashboard definitions; file-based provisioning for alerts and notification policies
Best Practices — Remote state with locking, secrets management through Vault/SOPS/environment variables, multi-stage CI/CD pipelines with approval gates, and scheduled drift detection

The principle that underpins all of this: if it’s not in Git, it doesn’t exist. Treat your observability configuration with the same rigor as your application code — version controlled, peer reviewed, tested in staging, and deployed through automated pipelines.

Next in the Series

In Part 11: Platform Architecture & Scaling, we’ll explore designing observability platforms at scale — multi-tenant architectures, horizontal scaling patterns for Mimir/Loki/Tempo, cost optimization strategies, and building an internal observability platform team.
