Why Automate Grafana?
Managing observability infrastructure through manual UI interactions (what some call “click-ops”) is a pattern that scales poorly. When your Grafana stack exists only as configuration stored in a database, backed by memory and tribal knowledge, you inherit every risk of unversioned, unreproducible infrastructure. A single misclick can delete weeks of dashboard work. A disgruntled team member can silently modify alert thresholds. A cloud region failure can take your entire monitoring configuration with it. Infrastructure as Code (IaC) mitigates these risks by applying the same engineering discipline to observability that we apply to application code.
Reproducibility
Reproducibility means that given the same code inputs, you can reliably produce the same infrastructure state across any environment. For observability, this translates to spinning up a complete monitoring stack — dashboards, alert rules, notification policies, data sources, folders, and team permissions — in minutes rather than days. When a new microservice team is onboarded, they receive a fully configured monitoring experience from a template, not a manual setup guide that takes three sprints to complete.
Version Control & Audit Trail
When observability configuration lives in Git, you gain the full power of version control: blame history shows who changed what and when, pull requests enable peer review of alert threshold changes, branches allow experimentation without affecting production monitoring, and tags provide rollback points. This audit trail is increasingly important for compliance frameworks like SOC 2 and ISO 27001, which require evidence of change management for security-critical systems, and your monitoring system is certainly one of them.
Consider the scenario where a latency alert suddenly stops firing. Without version control, you’re left questioning whether the threshold was changed, by whom, and whether it was intentional. With Git history, a simple git log --follow alerts/api-latency.yaml reveals the complete change history, including the pull request discussion that justified the modification.
Disaster Recovery
Disaster recovery for observability is often overlooked until it’s needed. If your Grafana Cloud stack becomes unavailable, or if you need to migrate between cloud providers, having your entire configuration in code means recovery is a terraform apply away. Without IaC, recreating hundreds of dashboards, dozens of alert rules, complex notification routing trees, and team permission structures from memory is effectively impossible under the time pressure of a real disaster.
A financial services company maintained their entire Grafana configuration in Terraform. When their primary cloud region experienced a 4-hour outage, they executed their DR plan:
- Activated secondary Grafana Cloud stack (pre-provisioned via Terraform)
- Ran terraform apply -var="environment=dr" to configure all dashboards and alerts
- Updated DNS to point to the DR instance
- Full monitoring restored in under 12 minutes
Without IaC, their estimated recovery time was 2–3 days of manual recreation, during which they would have no visibility into their production systems.
Environment Promotion
In mature organizations, observability configuration follows the same promotion path as application code: development → staging → production. A new alert rule is tested against staging traffic before being promoted to production. Dashboard changes are validated against realistic data before reaching on-call engineers. IaC makes this workflow natural — the same Terraform modules or Ansible playbooks are applied to each environment with environment-specific variables (different data source URLs, different alert thresholds, different notification channels).
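The promotion model above can be sketched as shared defaults plus per-environment overrides. This is a minimal illustration in Python rather than Terraform or Ansible; every key, URL, and value below is hypothetical.

```python
from copy import deepcopy

# Hypothetical shared defaults applied to every environment.
BASE_SETTINGS = {
    "datasource_url": "https://mimir.example.internal",
    "latency_threshold_seconds": 0.5,
    "contact_point": "Platform Team Slack",
}

# Hypothetical environment-specific overrides (only the values that differ).
ENVIRONMENT_OVERRIDES = {
    "staging": {
        "datasource_url": "https://mimir-staging.example.internal",
        "latency_threshold_seconds": 1.0,
    },
    "production": {
        "contact_point": "Platform Team PagerDuty",
    },
}


def settings_for(environment: str) -> dict:
    """Return the base settings with the overrides for one environment applied."""
    if environment not in ENVIRONMENT_OVERRIDES:
        raise KeyError(f"unknown environment: {environment}")
    merged = deepcopy(BASE_SETTINGS)
    merged.update(ENVIRONMENT_OVERRIDES[environment])
    return merged
```

Keeping the overrides small makes a promotion diff easy to review: anything not listed for an environment is, by construction, identical to the shared defaults.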
Components of Observability Systems
What Needs to Be Automated
A complete observability platform consists of multiple interconnected layers, each requiring automation. Understanding these layers helps you prioritize what to automate first and choose the right tools for each component.
flowchart TD
    A[Collection Layer] --> B[Transport Layer]
    B --> C[Storage Layer]
    C --> D[Visualization Layer]
    D --> E[Alerting Layer]
    E --> F[Incident Layer]
    A1[OTel Collector<br/>Grafana Alloy<br/>Prometheus Agent] --> A
    B1[Kafka<br/>Load Balancers<br/>mTLS Certs] --> B
    C1[Mimir<br/>Loki<br/>Tempo<br/>Pyroscope] --> C
    D1[Dashboards<br/>Folders<br/>Data Sources<br/>Variables] --> D
    E1[Alert Rules<br/>Contact Points<br/>Notification Policies<br/>Silences] --> E
    F1[OnCall Schedules<br/>Escalation Chains<br/>Incident Workflows] --> F
Automation Layers
Each layer maps to specific automation tools:
| Layer | Components | Primary Tools | Priority |
|---|---|---|---|
| Collection | OTel Collector, Alloy, Prometheus | Helm, Ansible, Kubernetes Operators | High |
| Storage | Mimir, Loki, Tempo clusters | Helm, Terraform (cloud-managed) | High |
| Visualization | Dashboards, folders, data sources | Terraform, Grafonnet, Grafana API | Critical |
| Alerting | Rules, contacts, policies, silences | Terraform, file-based provisioning | Critical |
| Access Control | Users, teams, RBAC, service accounts | Terraform, SCIM, Grafana API | Medium |
| Incident | OnCall schedules, escalation chains | Terraform, Grafana API | Medium |
Automating Collection Infrastructure
OpenTelemetry Collector with Helm
The OpenTelemetry Collector is the vendor-neutral telemetry pipeline that receives, processes, and exports metrics, logs, and traces. Deploying it via Helm charts provides repeatable installation with environment-specific customization through values.yaml overrides.
First, add the OpenTelemetry Helm repository:
# Add the OpenTelemetry Helm chart repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
# Install the collector in DaemonSet mode (one per node)
helm install otel-collector open-telemetry/opentelemetry-collector \
--namespace observability \
--create-namespace \
--values values-otel-collector.yaml
The values.yaml file customizes the collector’s pipeline configuration, resource limits, and export destinations:
# values-otel-collector.yaml
mode: daemonset
presets:
  logsCollection:
    enabled: true
  kubernetesAttributes:
    enabled: true
  kubeletMetrics:
    enabled: true
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: 'kubernetes-pods'
            kubernetes_sd_configs:
              - role: pod
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                action: keep
                regex: true
  processors:
    batch:
      timeout: 5s
      send_batch_size: 1000
    memory_limiter:
      check_interval: 1s
      limit_mib: 512
      spike_limit_mib: 128
    resourcedetection:
      detectors: [env, system, gcp, aws, azure]
      timeout: 5s
  exporters:
    otlphttp/grafana:
      endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
      headers:
        Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"
    prometheusremotewrite:
      endpoint: https://prometheus-prod-us-central-0.grafana.net/api/prom/push
      headers:
        Authorization: "Basic ${GRAFANA_CLOUD_TOKEN}"
  service:
    pipelines:
      metrics:
        receivers: [otlp, prometheus]
        processors: [memory_limiter, resourcedetection, batch]
        exporters: [prometheusremotewrite]
      traces:
        receivers: [otlp]
        processors: [memory_limiter, resourcedetection, batch]
        exporters: [otlphttp/grafana]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, resourcedetection, batch]
        exporters: [otlphttp/grafana]
resources:
  limits:
    cpu: 500m
    memory: 768Mi
  requests:
    cpu: 100m
    memory: 256Mi
For production deployments, use a Gateway pattern combining DaemonSet collectors (lightweight, per-node) with a centralized Gateway deployment (handles authentication, batching, and retry logic):
# Deploy per-node agents (lightweight, no auth credentials)
helm install otel-agent open-telemetry/opentelemetry-collector \
--namespace observability \
--values values-agent.yaml
# Deploy centralized gateway (handles auth, export to Grafana Cloud)
helm install otel-gateway open-telemetry/opentelemetry-collector \
--namespace observability \
--values values-gateway.yaml \
--set mode=deployment \
--set replicaCount=3
Grafana Alloy with Helm
Grafana Alloy (the successor to Grafana Agent) is Grafana’s distribution of the OpenTelemetry Collector with additional components for Prometheus scraping, Loki log collection, and native Grafana Cloud integration. Its Helm chart supports both standalone and operator-managed deployment modes.
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Alloy
helm install alloy grafana/alloy \
--namespace observability \
--create-namespace \
--values values-alloy.yaml
Alloy uses its own configuration syntax (formerly known as River). The Helm chart manages this configuration through values.yaml:
# values-alloy.yaml
alloy:
  configMap:
    create: true
    content: |
      // Kubernetes Discovery
      discovery.kubernetes "pods" {
        role = "pod"
      }
      // Prometheus Scraping
      prometheus.scrape "kubernetes" {
        targets         = discovery.kubernetes.pods.targets
        forward_to      = [prometheus.remote_write.grafana_cloud.receiver]
        scrape_interval = "30s"
      }
      // Remote Write to Grafana Cloud
      prometheus.remote_write "grafana_cloud" {
        endpoint {
          url = env("METRICS_ENDPOINT")
          basic_auth {
            username = env("METRICS_USERNAME")
            password = env("METRICS_PASSWORD")
          }
        }
      }
      // Loki Log Collection
      loki.source.kubernetes "pods" {
        targets    = discovery.kubernetes.pods.targets
        forward_to = [loki.write.grafana_cloud.receiver]
      }
      loki.write "grafana_cloud" {
        endpoint {
          url = env("LOGS_ENDPOINT")
          basic_auth {
            username = env("LOGS_USERNAME")
            password = env("LOGS_PASSWORD")
          }
        }
      }
  # Container resources and environment are set on the alloy container
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
  envFrom:
    - secretRef:
        name: grafana-cloud-credentials
controller:
  type: daemonset
In operator-managed mode, a custom resource (for example, GrafanaAgent) defines the desired collection configuration. The operator reconciles the actual collector fleet to match the desired state, enabling dynamic scaling and configuration updates without Helm upgrades.
Getting to Grips with the Grafana API
Every piece of Grafana configuration is accessible through REST APIs. Understanding these APIs is foundational — Terraform providers, Ansible modules, and custom automation scripts all ultimately interact with these endpoints.
Grafana Cloud API
The Grafana Cloud API manages cloud-level resources: stacks, API keys, plugins, and billing. It operates at a higher level than the individual Grafana instance API and uses a Cloud API key for authentication.
# Create a Cloud API key via the Grafana Cloud Portal
# Then use it to manage stacks programmatically
# List all stacks in your organization
curl -s -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
https://grafana.com/api/orgs/$ORG_SLUG/instances | jq '.items[].name'
# Create a new Grafana Cloud stack
curl -X POST -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
-H "Content-Type: application/json" \
https://grafana.com/api/instances \
-d '{
"name": "prod-us-east",
"slug": "prod-us-east",
"region": "us",
"description": "Production stack for US East region"
}'
# Create a service account for Terraform (its token is issued in a separate call)
curl -X POST -H "Authorization: Bearer $GRAFANA_CLOUD_API_KEY" \
"https://grafana.com/api/instances/$STACK_SLUG/api/serviceaccounts" \
-H "Content-Type: application/json" \
-d '{"name": "terraform-sa", "role": "Admin"}'
Grafana Instance API
The instance API manages resources within a specific Grafana deployment: dashboards, folders, data sources, alert rules, annotations, users, and teams. Authentication uses service account tokens; legacy API keys are deprecated in recent Grafana releases.
# Dashboard CRUD Operations
# Create or update a dashboard
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
-H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"dashboard": {
"title": "API Gateway Health",
"uid": "api-gw-health",
"panels": [],
"schemaVersion": 39
},
"folderUid": "platform-team",
"overwrite": true,
"message": "Updated via CI/CD pipeline"
}'
# List all folders
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
"$GRAFANA_URL/api/folders" | jq '.[].title'
# Create a data source
curl -X POST "$GRAFANA_URL/api/datasources" \
-H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Mimir Production",
"type": "prometheus",
"url": "https://mimir-prod.internal:9009/prometheus",
"access": "proxy",
"basicAuth": true,
"basicAuthUser": "mimir",
"secureJsonData": {"basicAuthPassword": "'"$MIMIR_PASSWORD"'"},
"jsonData": {"httpMethod": "POST", "timeInterval": "15s"}
}'
# Export alert rules for backup
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
"$GRAFANA_URL/api/v1/provisioning/alert-rules" | jq '.' > alert-rules-backup.json
| Resource | Endpoint | Methods |
|---|---|---|
| Dashboards | /api/dashboards/db (create/update), /api/dashboards/uid/:uid (read/delete) | POST; GET, DELETE |
| Folders | /api/folders | GET, POST, PUT, DELETE |
| Data Sources | /api/datasources | GET, POST, PUT, DELETE |
| Alert Rules | /api/v1/provisioning/alert-rules | GET, POST, PUT, DELETE |
| Contact Points | /api/v1/provisioning/contact-points | GET, POST, PUT, DELETE |
| Notification Policies | /api/v1/provisioning/policies | GET, PUT |
| Service Accounts | /api/serviceaccounts | GET, POST, PATCH, DELETE |
| Teams | /api/teams | GET, POST, PUT, DELETE |
| Annotations | /api/annotations | GET, POST, PUT, DELETE |
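As a rough sketch of how a script might call the dashboard endpoint listed in the table, the snippet below builds (but does not send) the same create-or-update request as the curl example, using only the Python standard library. The function name, the example URL, and the placeholder token are assumptions for illustration.

```python
import json
import urllib.request


def build_dashboard_request(grafana_url, token, dashboard, folder_uid, message):
    """Build a POST /api/dashboards/db request without sending it."""
    payload = {
        "dashboard": dashboard,
        "folderUid": folder_uid,
        "overwrite": True,
        "message": message,
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; a real script would read these from the environment.
request = build_dashboard_request(
    grafana_url="https://grafana.example.internal/",
    token="placeholder-token",
    dashboard={"title": "API Gateway Health", "uid": "api-gw-health", "panels": []},
    folder_uid="platform-team",
    message="Updated via CI/CD pipeline",
)
# Sending is left out on purpose: urllib.request.urlopen(request) would perform the call.
```

Separating request construction from sending keeps the payload logic testable in CI without a live Grafana instance.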
Terraform & Ansible for Grafana
Grafana Terraform Provider
The official grafana/grafana Terraform provider wraps the Grafana APIs into declarative HCL resources. It supports both Grafana Cloud management (stacks, API keys, plugins) and instance-level configuration (dashboards, alerts, data sources). The provider is maintained by Grafana Labs and sees frequent releases aligned with new Grafana features.
// providers.tf - Configure the Grafana Terraform provider
terraform {
required_providers {
grafana = {
source = "grafana/grafana"
version = "~> 3.0"
}
}
}
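// Cloud provider for stack management
// (provider 3.x takes a Cloud Access Policy token; the older cloud_api_key
// argument applies to 2.x and earlier)
provider "grafana" {
alias = "cloud"
cloud_access_policy_token = var.grafana_cloud_access_policy_token
}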
// Instance provider for dashboard/alert management
provider "grafana" {
alias = "stack"
url = var.grafana_url
auth = var.grafana_service_account_token
}
The provider exposes resources for every major Grafana component:
// folders.tf - Organize dashboards into team folders
resource "grafana_folder" "platform" {
provider = grafana.stack
title = "Platform Team"
uid = "platform-team"
}
resource "grafana_folder" "services" {
provider = grafana.stack
title = "Microservices"
uid = "microservices"
}
// datasources.tf - Configure Prometheus and Loki data sources
resource "grafana_data_source" "mimir" {
provider = grafana.stack
type = "prometheus"
name = "Mimir (Metrics)"
uid = "mimir-prod"
url = var.mimir_endpoint
json_data_encoded = jsonencode({
httpMethod = "POST"
timeInterval = "15s"
})
secure_json_data_encoded = jsonencode({
basicAuthPassword = var.mimir_password
})
basic_auth_enabled = true
basic_auth_username = var.mimir_username
}
resource "grafana_data_source" "loki" {
provider = grafana.stack
type = "loki"
name = "Loki (Logs)"
uid = "loki-prod"
url = var.loki_endpoint
json_data_encoded = jsonencode({
maxLines = 5000
derivedFields = [{
name = "TraceID"
matcherRegex = "traceID=(\\w+)"
url = "$${__value.raw}"
datasourceUid = "tempo-prod"
}]
})
}
Alert configuration with Terraform enables version-controlled, peer-reviewed alert rule management:
// alerts.tf - Define alert rules as code
resource "grafana_rule_group" "api_slos" {
  provider         = grafana.stack
  name             = "API SLO Alerts"
  folder_uid       = grafana_folder.platform.uid
  interval_seconds = 60

  rule {
    name      = "API Availability SLO Burn Rate"
    condition = "C"
    for       = "5m"
    labels = {
      severity = "critical"
      team     = "platform"
      slo      = "api-availability"
    }
    annotations = {
      summary     = "API availability SLO burn rate is too high"
      description = "Error budget consumption rate exceeds threshold. Current burn rate: {{ $values.B }}x"
      runbook_url = "https://runbooks.internal/api-availability-slo"
    }

    // A: error ratio over the last hour. An instant query returns a single
    // value per series, which the math and threshold expressions can consume.
    data {
      ref_id         = "A"
      datasource_uid = grafana_data_source.mimir.uid
      relative_time_range {
        from = 3600
        to   = 0
      }
      model = jsonencode({
        expr    = "1 - (sum(rate(http_requests_total{code!~\"5..\"}[1h])) / sum(rate(http_requests_total[1h])))"
        instant = true
      })
    }

    // B: burn rate = error ratio / error budget (1 - SLO target)
    data {
      ref_id         = "B"
      datasource_uid = "__expr__"
      // Expression nodes still need a relative_time_range block
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type       = "math"
        expression = "$A / (1 - 0.999)"
      })
    }

    // C: fire when the burn rate exceeds 14.4x
    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        type       = "threshold"
        expression = "B"
        conditions = [{
          evaluator = { type = "gt", params = [14.4] }
        }]
      })
    }
  }
}
// contact_points.tf
resource "grafana_contact_point" "platform_pagerduty" {
  provider = grafana.stack
  name     = "Platform Team PagerDuty"

  pagerduty {
    integration_key = var.pagerduty_integration_key
    severity        = "critical"
    summary         = "{{ template \"default.title\" . }}"
    // details is a map of strings, so no jsonencode() is needed here
    details = {
      firing    = "{{ .Alerts.Firing | len }}"
      resolved  = "{{ .Alerts.Resolved | len }}"
      alertname = "{{ .CommonLabels.alertname }}"
    }
  }
}

// notification_policies.tf
resource "grafana_notification_policy" "root" {
  provider        = grafana.stack
  contact_point   = "Platform Team Slack"
  group_by        = ["alertname", "team"]
  group_wait      = "30s"
  group_interval  = "5m"
  repeat_interval = "4h"

  policy {
    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
    contact_point   = grafana_contact_point.platform_pagerduty.name
    group_wait      = "10s"
    repeat_interval = "1h"
  }

  policy {
    matcher {
      label = "team"
      match = "="
      value = "payments"
    }
    contact_point = "Payments Team Slack"
  }
}
// slo.tf - Define SLOs as code (Grafana Cloud)
resource "grafana_slo" "api_availability" {
  provider    = grafana.stack
  name        = "API Availability"
  description = "99.9% of API requests succeed within 500ms"

  objectives {
    value  = 0.999
    window = "30d"
  }

  query {
    type = "ratio"
    ratio {
      // Ratio queries take counter metrics (with optional label selectors);
      // the rate and aggregation are applied by the SLO itself.
      success_metric = "http_requests_total{code!~\"5..\"}"
      total_metric   = "http_requests_total"
    }
  }

  alerting {
    fastburn {
      annotation {
        key   = "runbook_url"
        value = "https://runbooks.internal/api-availability"
      }
    }
  }
}
Ansible Collection for Grafana
The grafana.grafana Ansible collection provides modules and roles for both installing Grafana software and managing its configuration. It’s particularly useful for organizations already using Ansible for configuration management, and excels at tasks that combine system-level operations (package installation, service management) with API-level configuration.
# Install the Grafana Ansible collection
# ansible-galaxy collection install grafana.grafana
# NOTE: module and parameter names vary between collections and releases
# (for example grafana.grafana versus community.grafana); check them with
# ansible-doc against the version you have installed.
# playbook-grafana-config.yaml
---
- name: Configure Grafana Observability Stack
  hosts: localhost
  connection: local
  vars:
    grafana_url: "{{ lookup('env', 'GRAFANA_URL') }}"
    grafana_api_key: "{{ lookup('env', 'GRAFANA_SA_TOKEN') }}"
  tasks:
    - name: Create team folders
      grafana.grafana.folder:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        title: "{{ item.title }}"
        uid: "{{ item.uid }}"
        state: present
      loop:
        - { title: "Platform Team", uid: "platform-team" }
        - { title: "Payments Team", uid: "payments-team" }
        - { title: "Shared Dashboards", uid: "shared" }

    - name: Configure Prometheus data source
      grafana.grafana.datasource:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        name: "Mimir Production"
        ds_type: prometheus
        ds_url: "{{ mimir_endpoint }}"
        access: proxy
        basic_auth_user: "{{ mimir_username }}"
        basic_auth_password: "{{ mimir_password }}"
        json_data:
          httpMethod: POST
          timeInterval: "15s"
        state: present

    - name: Deploy dashboard from JSON file
      grafana.grafana.dashboard:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        dashboard_id: null
        dashboard_revision: null
        state: present
        overwrite: true
        commit_message: "Deployed via Ansible"
        folder: "Platform Team"
        path: "dashboards/{{ item }}.json"
      loop:
        - api-gateway-health
        - kubernetes-cluster-overview
        - slo-overview

    - name: Import community dashboard from Grafana.com
      grafana.grafana.dashboard:
        url: "{{ grafana_url }}"
        url_username: ""
        url_password: "{{ grafana_api_key }}"
        state: present
        overwrite: true
        folder: "Shared Dashboards"
        dashboard_id: 15760
        dashboard_revision: 1
Dashboards & Alerts as Code
Dashboard as Code with Grafonnet
Grafonnet is a Jsonnet library that provides a type-safe, composable way to generate Grafana dashboard JSON. Instead of manually crafting 2000-line JSON files (which are nearly impossible to review in pull requests), you write concise Jsonnet code that compiles into valid dashboard JSON. This approach enables:
- Reusable templates — define a panel factory once, instantiate per-service
- Type safety — catch invalid configurations at compile time
- Composability — build complex dashboards from small, testable building blocks
- Readable diffs — PR reviews show meaningful changes, not JSON position shifts
// service-dashboard.jsonnet - Generate a golden signals dashboard
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local panel = grafana.panel;
local prometheus = grafana.query.prometheus;
local variable = grafana.dashboard.variable;
// Reusable panel factory for golden signals
local goldenSignalPanel(title, expr, unit='short') =
panel.timeSeries.new(title)
+ panel.timeSeries.queryOptions.withTargets([
prometheus.new('mimir-prod', expr)
+ prometheus.withLegendFormat('{{instance}}'),
])
+ panel.timeSeries.standardOptions.withUnit(unit)
+ panel.timeSeries.gridPos.withW(12)
+ panel.timeSeries.gridPos.withH(8);
// Dashboard definition. The $service template variable is resolved by Grafana
// at render time, so it must not appear in the UID (UIDs allow only letters,
// digits, '-' and '_').
dashboard.new('Service Golden Signals')
+ dashboard.withUid('golden-signals')
+ dashboard.withTags(['golden-signals', 'generated', 'platform'])
+ dashboard.withRefresh('30s')
+ dashboard.withVariables([
variable.query.new('service')
+ variable.query.withDatasource('prometheus', 'mimir-prod')
+ variable.query.queryTypes.withLabelValues('service_name', 'up'),
])
+ dashboard.withPanels([
// Traffic
goldenSignalPanel(
'Request Rate',
'sum(rate(http_requests_total{service_name="$service"}[5m])) by (method)',
'reqps'
) + panel.timeSeries.gridPos.withX(0) + panel.timeSeries.gridPos.withY(0),
// Errors
goldenSignalPanel(
'Error Rate',
'sum(rate(http_requests_total{service_name="$service",code=~"5.."}[5m])) / sum(rate(http_requests_total{service_name="$service"}[5m]))',
'percentunit'
) + panel.timeSeries.gridPos.withX(12) + panel.timeSeries.gridPos.withY(0),
// Latency
goldenSignalPanel(
'P99 Latency',
'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service_name="$service"}[5m])) by (le))',
's'
) + panel.timeSeries.gridPos.withX(0) + panel.timeSeries.gridPos.withY(8),
// Saturation
goldenSignalPanel(
'CPU Saturation',
'sum(rate(container_cpu_usage_seconds_total{pod=~"$service.*"}[5m])) / sum(kube_pod_container_resource_limits{pod=~"$service.*",resource="cpu"})',
'percentunit'
) + panel.timeSeries.gridPos.withX(12) + panel.timeSeries.gridPos.withY(8),
])
Compile the Jsonnet to JSON and deploy:
# Install jsonnet tooling
go install github.com/google/go-jsonnet/cmd/jsonnet@latest
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
# Initialize jsonnet-bundler and fetch Grafonnet
jb init
jb install github.com/grafana/grafonnet/gen/grafonnet-latest
# Compile dashboard to JSON
jsonnet -J vendor/ service-dashboard.jsonnet > dashboards/golden-signals.json
# Deploy via Grafana API or Terraform
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "{\"dashboard\": $(cat dashboards/golden-signals.json), \"folderUid\": \"platform-team\", \"overwrite\": true}"
Alert Rules Provisioning
Grafana supports file-based provisioning for alert rules, contact points, and notification policies. This approach works with any deployment model (Docker, Kubernetes, bare metal) and doesn’t require Terraform or external API access: Grafana reads YAML files from a configured directory at startup and applies them, and reads them again when provisioning is reloaded.
# /etc/grafana/provisioning/alerting/platform-alerts.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: Platform SLO Alerts
    folder: Platform Team
    interval: 1m
    rules:
      - uid: api-availability-burn-rate
        title: "API Availability - Fast Burn"
        condition: C
        for: 2m
        labels:
          severity: critical
          team: platform
          slo: api-availability
        annotations:
          summary: "API availability burn rate exceeds 14.4x threshold"
          description: |
            The 1-hour error burn rate is {{ $values.B }}x the budget rate.
            At a sustained 14.4x burn rate, a 30-day error budget is exhausted in about 50 hours.
          runbook_url: https://runbooks.internal/api-availability-slo
        data:
          - refId: A
            datasourceUid: mimir-prod
            relativeTimeRange:
              from: 3600
              to: 0
            model:
              expr: |
                1 - (
                  sum(rate(http_requests_total{code!~"5.."}[1h]))
                  /
                  sum(rate(http_requests_total[1h]))
                )
              # Instant query: one value per series, usable by math/threshold
              instant: true
              intervalMs: 15000
              maxDataPoints: 43200
          - refId: B
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A / (1 - 0.999)"
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [14.4]
      - uid: api-latency-p99
        title: "API P99 Latency > 500ms"
        condition: B
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "API P99 latency exceeds 500ms SLO target"
        data:
          - refId: A
            datasourceUid: mimir-prod
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
              instant: true
          - refId: B
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.5]
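The numbers in the burn-rate rule (a 99.9% target, a 14.4x threshold, and a 30-day window of 720 hours) are related by simple arithmetic. The following sketch works through it in Python; the function names are illustrative and not part of Grafana.

```python
SLO_TARGET = 0.999        # from the rule's math expression: 1 - 0.999 is the error budget
WINDOW_HOURS = 30 * 24    # a 30-day window is 720 hours


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the whole window's budget is gone at a sustained burn rate."""
    return window_hours / rate


def budget_consumed(rate: float, lookback_hours: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the window's budget spent during the lookback period."""
    return rate * lookback_hours / window_hours


# A 1.44% error ratio against a 0.1% budget is a burn rate of about 14.4.
rate_at_threshold = burn_rate(0.0144)
# At 14.4x, the 720-hour budget lasts about 50 hours ...
remaining_hours = hours_to_exhaustion(14.4)
# ... and one hour at that rate spends about 2% of the monthly budget.
spent_in_one_hour = budget_consumed(14.4, lookback_hours=1)
```

This is why 14.4 is a common fast-burn threshold for a one-hour lookback: it corresponds to spending roughly 2% of a 30-day budget in a single hour.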
# /etc/grafana/provisioning/alerting/contact-points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Platform Team PagerDuty
    receivers:
      - uid: platform-pd
        type: pagerduty
        settings:
          integrationKey: "$__env{PAGERDUTY_PLATFORM_KEY}"
          severity: critical
        disableResolveMessage: false
  - orgId: 1
    name: Platform Team Slack
    receivers:
      - uid: platform-slack
        type: slack
        settings:
          recipient: "#platform-alerts"
          token: "$__env{SLACK_BOT_TOKEN}"
          mentionChannel: here
# /etc/grafana/provisioning/alerting/notification-policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: Platform Team Slack
    group_by: ['alertname', 'team']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: Platform Team PagerDuty
        matchers:
          - severity = critical
        group_wait: 10s
        repeat_interval: 1h
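To make the routing behaviour of the policy tree above easier to reason about, here is a deliberately simplified model in Python. It assumes equality matchers only and first-match-wins semantics, and it ignores grouping timers, regex matchers, and the continue option; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    receiver: str
    matchers: dict = field(default_factory=dict)  # label name -> required value
    routes: list = field(default_factory=list)    # child policies, in order


def route(policy: Policy, labels: dict) -> str:
    """Return the receiver for an alert: first matching child wins, else this policy."""
    for child in policy.routes:
        if all(labels.get(name) == value for name, value in child.matchers.items()):
            return route(child, labels)
    return policy.receiver


# Mirrors the provisioned tree: Slack by default, PagerDuty for critical alerts.
ROOT = Policy(
    receiver="Platform Team Slack",
    routes=[
        Policy(receiver="Platform Team PagerDuty", matchers={"severity": "critical"}),
    ],
)

critical_target = route(ROOT, {"alertname": "API Availability - Fast Burn", "severity": "critical"})
warning_target = route(ROOT, {"alertname": "API P99 Latency > 500ms", "severity": "warning"})
```

A model like this can back a unit test that pins down which team gets paged for a given label set before a policy change is merged.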
GitOps Workflow
A GitOps workflow for observability treats a Git repository as the single source of truth for all monitoring configuration. Changes flow through pull requests with automated validation, peer review, and controlled deployment — the same workflow used for application code.
flowchart LR
    A[Developer<br/>Creates Branch] --> B[Edit Config<br/>Terraform/Jsonnet/YAML]
    B --> C[Push &<br/>Open PR]
    C --> D{CI Pipeline}
    D --> |terraform plan<br/>jsonnet lint<br/>yamllint| E[Review &<br/>Approve]
    E --> F[Merge to<br/>Main]
    F --> G{CD Pipeline}
    G --> |terraform apply<br/>or API deploy| H[Staging<br/>Validated]
    H --> I[Promote to<br/>Production]
    I --> J[Drift Detection<br/>Scheduled]
    J --> |Drift found| A
A typical repository structure for GitOps-managed observability:
observability-config/
├── terraform/
│   ├── environments/
│   │   ├── staging/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── terraform.tfvars
│   │   └── production/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       └── terraform.tfvars
│   ├── modules/
│   │   ├── grafana-stack/
│   │   │   ├── main.tf           # Data sources, folders, teams
│   │   │   ├── dashboards.tf     # Dashboard resources
│   │   │   ├── alerts.tf         # Alert rules and groups
│   │   │   ├── notifications.tf  # Contact points, policies
│   │   │   └── variables.tf
│   │   └── golden-signals/
│   │       ├── main.tf           # Per-service golden signals template
│   │       └── variables.tf
│   └── shared/
│       └── provider.tf
├── dashboards/
│   ├── jsonnet/
│   │   ├── lib/                  # Reusable Grafonnet libraries
│   │   ├── golden-signals.jsonnet
│   │   ├── slo-overview.jsonnet
│   │   └── kubernetes-cluster.jsonnet
│   └── compiled/                 # Git-tracked compiled JSON
│       ├── golden-signals.json
│       ├── slo-overview.json
│       └── kubernetes-cluster.json
├── alerts/
│   ├── platform/
│   │   ├── slo-burn-rates.yaml
│   │   └── infrastructure.yaml
│   └── teams/
│       ├── payments.yaml
│       └── search.yaml
├── helm/
│   ├── otel-collector/
│   │   └── values.yaml
│   └── alloy/
│       └── values.yaml
├── .github/
│   └── workflows/
│       ├── validate.yaml         # PR validation
│       ├── deploy-staging.yaml   # Deploy to staging on merge
│       ├── deploy-prod.yaml      # Deploy to prod (manual approval)
│       └── drift-check.yaml      # Scheduled drift detection
├── Makefile
└── README.md
The CI/CD pipeline validates changes before deployment:
# .github/workflows/validate.yaml
name: Validate Observability Config
on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'dashboards/**'
      - 'alerts/**'
permissions:
  contents: read
  pull-requests: write  # needed by the "Post Plan to PR" step
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.8"  # quoted so YAML does not parse it as a float
      - name: Terraform Format Check
        run: terraform fmt -check -recursive terraform/
      - name: Terraform Validate (Staging)
        working-directory: terraform/environments/staging
        run: |
          terraform init -backend=false
          terraform validate
      - name: Terraform Plan (Staging)
        working-directory: terraform/environments/staging
        env:
          GRAFANA_AUTH: ${{ secrets.GRAFANA_SA_TOKEN_STAGING }}
        run: |
          terraform init
          terraform plan -no-color -out=tfplan
          terraform show -no-color tfplan > plan.txt
      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/environments/staging/plan.txt', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan (Staging)\n\`\`\`\n${plan.substring(0, 60000)}\n\`\`\``
            });
      # Assumes jsonnet is on PATH and vendor/ has been populated (jb install)
      - name: Validate Jsonnet
        run: |
          jsonnet -J vendor/ dashboards/jsonnet/golden-signals.jsonnet > /dev/null
          jsonnet -J vendor/ dashboards/jsonnet/slo-overview.jsonnet > /dev/null
      - name: Lint Alert YAML
        run: yamllint -d relaxed alerts/
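Beyond checking that the Jsonnet compiles, a pipeline can also sanity-check the compiled dashboard JSON before it is deployed. The sketch below shows one possible set of checks in Python. The specific rules are project-level assumptions rather than an official Grafana validator: the UID pattern and 40-character limit reflect Grafana's usual UID constraints, and 24 is the width of the dashboard grid.

```python
import re

UID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,40}")  # assumed UID constraints
GRID_COLUMNS = 24                                  # Grafana's grid is 24 columns wide


def dashboard_problems(dashboard: dict) -> list:
    """Return human-readable problems found in a compiled dashboard dict."""
    problems = []
    if not dashboard.get("title"):
        problems.append("missing title")
    uid = dashboard.get("uid") or ""
    if not uid:
        problems.append("missing uid")
    elif not UID_PATTERN.fullmatch(uid):
        problems.append(f"invalid uid {uid!r}")
    seen_ids = set()
    for panel in dashboard.get("panels", []):
        panel_id = panel.get("id")
        if panel_id is not None:
            if panel_id in seen_ids:
                problems.append(f"duplicate panel id {panel_id}")
            seen_ids.add(panel_id)
        pos = panel.get("gridPos", {})
        if pos.get("x", 0) + pos.get("w", 0) > GRID_COLUMNS:
            problems.append(f"panel {panel.get('title', '?')!r} overflows the grid")
    return problems


# Hypothetical compiled output; a CI job would json.load() the file instead.
good = {"title": "Service Golden Signals", "uid": "golden-signals",
        "panels": [{"id": 1, "title": "Request Rate", "gridPos": {"x": 0, "w": 12}},
                   {"id": 2, "title": "Error Rate", "gridPos": {"x": 12, "w": 12}}]}
bad = {"title": "Broken", "uid": "golden-signals-${service}",
       "panels": [{"id": 1, "title": "Wide", "gridPos": {"x": 12, "w": 13}},
                  {"id": 1, "title": "Dup", "gridPos": {"x": 0, "w": 12}}]}
```

A CI step could exit non-zero whenever the returned list is non-empty, so a malformed dashboard never reaches the deploy stage.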
Best Practices
State Management
Terraform state for Grafana resources requires careful handling. The state file contains sensitive information (data source credentials, API tokens) and represents the authoritative record of what Terraform manages versus what was manually created.
Use a remote backend with state locking: running terraform apply simultaneously against the same Grafana instance can cause race conditions and state corruption without locking.
// backend.tf - Remote state with locking
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "observability/grafana/production/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
encrypt = true
}
}
Key state management practices:
- Import existing resources before managing them: terraform import grafana_dashboard.existing abc123
- Use lifecycle { prevent_destroy = true } on critical resources (production alert rules, notification policies)
- Separate state files per environment: staging and production should never share state
- Use terraform state list regularly to audit what’s managed
- Handle UID conflicts: set explicit UIDs in Terraform to avoid conflicts with manually-created resources
Secrets Handling
Observability configuration inevitably involves secrets: API tokens, data source passwords, PagerDuty integration keys, and Slack webhook URLs. These must never appear in version control.
// variables.tf - Declare sensitive variables
variable "grafana_service_account_token" {
description = "Grafana service account token for API access"
type = string
sensitive = true
}
variable "pagerduty_integration_key" {
description = "PagerDuty integration key for critical alerts"
type = string
sensitive = true
}
variable "mimir_password" {
description = "Password for Mimir basic auth"
type = string
sensitive = true
}
Secrets injection strategies:
| Method | Best For | Example |
|---|---|---|
| Environment Variables | CI/CD pipelines | TF_VAR_grafana_service_account_token=$SECRET |
| Vault Provider | Enterprise teams | data "vault_generic_secret" "grafana" {} |
| SOPS | Git-encrypted secrets | sops -d secrets.enc.yaml |
| 1Password/AWS SSM | Secret store integration | data "aws_ssm_parameter" "token" {} |
| .tfvars (gitignored) | Local development only | terraform.tfvars in .gitignore |
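The environment-variable strategy relies on Terraform's convention of reading a variable named X from TF_VAR_X. As a minimal sketch, a wrapper script could assemble that environment from whatever secret store is in use. The function name and the example values are hypothetical, and fetching from the store itself is out of scope here.

```python
def terraform_env(secrets: dict, base_env: dict) -> dict:
    """Return a copy of base_env with each secret exposed as TF_VAR_<name>."""
    env = dict(base_env)
    for name, value in secrets.items():
        if not value:
            # Fail early instead of letting Terraform prompt or plan with an empty value.
            raise ValueError(f"secret {name!r} is empty")
        env[f"TF_VAR_{name}"] = value
    return env


# Variable names match the sensitive variables declared in variables.tf;
# the values are placeholders that would come from a secret store.
env = terraform_env(
    secrets={
        "grafana_service_account_token": "placeholder-token",
        "pagerduty_integration_key": "placeholder-key",
        "mimir_password": "placeholder-password",
    },
    base_env={"PATH": "/usr/bin"},
)
# A caller could then run: subprocess.run(["terraform", "plan"], env=env, check=True)
```

Passing the values only through the process environment keeps them out of the repository, although they still end up in Terraform state, which is why the state backend needs encryption and access control.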
CI/CD Pipelines
A well-designed CI/CD pipeline for observability configuration enforces quality gates while maintaining deployment velocity. The pipeline should validate syntax, check for breaking changes, deploy to staging for verification, and require approval before production promotion.
Stage 1 — Validate (on every PR):
- terraform fmt -check: enforce consistent formatting
- terraform validate: catch syntax and internal consistency errors
- jsonnet-lint: validate Grafonnet templates
- yamllint: validate alert YAML files
- terraform plan: show what will change (posted as PR comment)
Stage 2 — Deploy Staging (on merge to main):
- terraform apply -auto-approve against staging Grafana
- Run smoke tests (verify dashboards load, alerts evaluate)
- Notify team of staging deployment
Stage 3 — Deploy Production (manual approval gate):
- Require approval from on-call engineer or team lead
- terraform apply against production Grafana
- Create annotation in Grafana marking the deployment
- Monitor for alert rule evaluation errors for 15 minutes
Drift Detection
Configuration drift occurs when manual changes are made through the Grafana UI that diverge from the code-defined state. This is inevitable in practice — engineers will edit dashboards during incidents, adjust alert thresholds for immediate relief, or experiment with new visualizations. Drift detection ensures these changes are eventually captured in code or reverted.
# .github/workflows/drift-check.yaml
name: Drift Detection
on:
  schedule:
    - cron: '0 8 * * 1-5' # Weekdays at 8 AM UTC
permissions:
  contents: read
  issues: write # required to open the drift issue
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # keep the raw exit code of terraform plan
      - name: Detect Drift (Production)
        working-directory: terraform/environments/production
        env:
          GRAFANA_AUTH: ${{ secrets.GRAFANA_SA_TOKEN_PROD }}
        run: |
          terraform init -input=false
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
          set +e
          terraform plan -detailed-exitcode -no-color -input=false > drift-report.txt 2>&1
          status=$?
          set -e
          case "$status" in
            0) echo "DRIFT_DETECTED=false" >> "$GITHUB_ENV" ;;
            2) echo "DRIFT_DETECTED=true" >> "$GITHUB_ENV" ;;
            *)
              # The plan itself failed: fail the job instead of reporting drift
              cat drift-report.txt
              exit "$status"
              ;;
          esac
      - name: Create Issue for Drift
        if: env.DRIFT_DETECTED == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const drift = fs.readFileSync('terraform/environments/production/drift-report.txt', 'utf8');
            // Truncate the report to stay under GitHub's issue body size limit
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `[Drift] Grafana configuration drift detected - ${new Date().toISOString().split('T')[0]}`,
              body: `## Configuration Drift Detected\n\nManual changes were detected in production Grafana.\n\n\`\`\`\n${drift.substring(0, 60000)}\n\`\`\`\n\n### Action Required\n1. Review the changes above\n2. If intentional, import them into Terraform code\n3. If unintentional, run \`terraform apply\` to revert`,
              labels: ['drift', 'observability']
            });
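When the job reports drift, either capture the intentional change in code or revert the unintentional one (terraform apply).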
Summary & Next Steps
Infrastructure as Code transforms observability from a fragile, manually-maintained system into a robust, auditable, and reproducible engineering practice. The key concepts covered in this article:
- Why Automate — Reproducibility, version control with audit trails, disaster recovery measured in minutes not days, and safe environment promotion from staging to production
- Automation Scope — Every layer needs automation: collection infrastructure (Helm charts), storage (Terraform-managed cloud resources), dashboards, alert rules, notification policies, and access control
- Collection Infrastructure: OpenTelemetry Collector and Grafana Alloy deployed via Helm charts with customized values.yaml for environment-specific configuration
- Grafana APIs: Cloud API for stack management and instance API for dashboards/alerts/data sources, the foundation that all automation tools build upon
- Terraform & Ansible — The Grafana Terraform provider for declarative state management; the Ansible collection for combined system and API configuration
- Dashboards as Code — Grafonnet (Jsonnet) for type-safe, composable, reviewable dashboard definitions; file-based provisioning for alerts and notification policies
- Best Practices — Remote state with locking, secrets management through Vault/SOPS/environment variables, multi-stage CI/CD pipelines with approval gates, and scheduled drift detection
The principle that underpins all of this: if it’s not in Git, it doesn’t exist. Treat your observability configuration with the same rigor as your application code — version controlled, peer reviewed, tested in staging, and deployed through automated pipelines.
Next in the Series
In Part 11: Platform Architecture & Scaling, we’ll explore designing observability platforms at scale — multi-tenant architectures, horizontal scaling patterns for Mimir/Loki/Tempo, cost optimization strategies, and building an internal observability platform team.