
Prometheus Deep Dive Part 13: CI/CD Pipelines for Prometheus

June 15, 2026 Wasil Zafar 26 min read

Treat Prometheus configuration as production code — validate with promtool, unit test alert rules with synthetic time series, lint for best practices, and deploy through GitOps pipelines. Catch broken alerts before they reach production.

Table of Contents

  1. Config & Rule Validation
  2. Unit Testing Alert Rules
  3. Linting & Best Practices
  4. GitHub Actions Pipeline
  5. GitOps Deployment
  6. Advanced Testing Patterns
  7. Conclusion

Config & Rule Validation

promtool is Prometheus’s built-in CLI for validating configuration and rules. It catches syntax errors, invalid PromQL, label conflicts, and structural issues before deployment:

# Validate prometheus.yml configuration
promtool check config prometheus.yml
# Output: Checking prometheus.yml
#   SUCCESS: 4 rule files found
#   SUCCESS: prometheus.yml is valid prometheus config file

# Validate individual rule files
promtool check rules rules/*.yaml
# Output: Checking rules/node-alerts.yaml
#   SUCCESS: 12 rules found

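# Check all rules in a directory recursively
# ('+' batches the files and, unlike '\;', makes find exit non-zero
# when promtool reports an error, so CI can fail on it)
find rules/ -name '*.yaml' -exec promtool check rules {} +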

# Common validation errors caught:
# - Invalid PromQL expressions
# - Duplicate rule names within a group
# - Missing required fields (alert name, expr)
# - Invalid label names (must match [a-zA-Z_][a-zA-Z0-9_]*)
# - Invalid template syntax in annotations
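
A quick way to confirm that the validator really gates bad input is to feed it a deliberately broken rule. The sketch below assumes promtool is on PATH and skips the check when it is not; the file name broken.yaml and the BrokenExpr alert are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: hand promtool a rule with invalid PromQL and report how it reacts.
# broken.yaml and BrokenExpr are illustrative names, not part of a real setup.
workdir="$(mktemp -d)"

cat > "$workdir/broken.yaml" <<'EOF'
groups:
  - name: demo
    rules:
      - alert: BrokenExpr
        # Invalid PromQL: the closing parenthesis is missing
        expr: 'rate(http_requests_total[5m]'
        for: 5m
EOF

if command -v promtool >/dev/null 2>&1; then
  if promtool check rules "$workdir/broken.yaml"; then
    echo "unexpected: promtool accepted the broken rule"
  else
    echo "promtool rejected the broken rule, as intended"
  fi
else
  echo "promtool not found on PATH; skipping the check"
fi
```

The non-zero exit status on rejection is what a CI step relies on to fail the build.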

Unit Testing Alert Rules

promtool’s test rules command evaluates rules against synthetic time series data and verifies expected alert states at specific timestamps:

# tests/cpu_alerts_test.yaml
rule_files:
  - ../rules/node-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate CPU at 90% utilization (only 10% idle)
      - series: 'node_cpu_seconds_total{instance="node-1:9100",cpu="0",mode="idle"}'
        values: '0+6x30'     # 6 idle CPU-seconds per 60s interval = 10% idle
      - series: 'node_cpu_seconds_total{instance="node-1:9100",cpu="0",mode="user"}'
        values: '0+54x30'    # 54 user CPU-seconds per 60s interval = 90% busy

    alert_rule_test:
      # At 20m, the alert should be firing (the 15m 'for' duration has elapsed)
      - eval_time: 20m
        alertname: HighCpuUsage
        exp_alerts:
          - exp_labels:
              instance: node-1:9100
              severity: warning
            exp_annotations:
              summary: "High CPU on node-1:9100"

      # At 5m, alert should NOT be firing yet (for: 15m not met)
      - eval_time: 5m
        alertname: HighCpuUsage
        exp_alerts: []    # No alerts expected

  - interval: 1m
    input_series:
      # Simulate healthy CPU (70% idle)
      - series: 'node_cpu_seconds_total{instance="node-2:9100",cpu="0",mode="idle"}'
        values: '0+42x30'    # 42 idle CPU-seconds per 60s interval = 70% idle
      - series: 'node_cpu_seconds_total{instance="node-2:9100",cpu="0",mode="user"}'
        values: '0+18x30'    # 18 user CPU-seconds per 60s interval = 30% busy

    alert_rule_test:
      # Should never fire for healthy node
      - eval_time: 30m
        alertname: HighCpuUsage
        exp_alerts: []

# Run unit tests
promtool test rules tests/cpu_alerts_test.yaml
# Output: Unit Testing:  tests/cpu_alerts_test.yaml
#   SUCCESS

# Run all test files
promtool test rules tests/*_test.yaml

# Verbose output for debugging
promtool test rules --debug tests/cpu_alerts_test.yaml
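
The test file references ../rules/node-alerts.yaml, which is not shown here. A minimal rule consistent with what the test expects might look like the following sketch. The 80% threshold, the group name, and the exact expression are assumptions, chosen only to match the labels, the summary annotation, and the 15m 'for' duration that the test checks.

```shell
#!/usr/bin/env bash
# Sketch: write a hypothetical HighCpuUsage rule that matches the unit test.
# RULES_DIR is an illustrative variable; it defaults to a temp dir so the
# example never overwrites real rule files.
RULES_DIR="${RULES_DIR:-$(mktemp -d)}"
mkdir -p "$RULES_DIR"

# The quoted 'EOF' keeps the shell from expanding $labels in the template.
cat > "$RULES_DIR/node-alerts.yaml" <<'EOF'
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        # Busy percentage = 100 * (1 - idle fraction), averaged per instance
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
EOF

echo "wrote $RULES_DIR/node-alerts.yaml"
```

Because rate() works in per-second units, a counter that gains 6 idle CPU-seconds per 60s interval yields an idle fraction of 0.1, which this expression turns into 90% busy.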

Linting & Best Practices

# pint — Prometheus rule linter (beyond syntax checking)
# Catches logical errors and best practice violations

# Install
go install github.com/cloudflare/pint/cmd/pint@latest

# Lint rules
pint lint rules/

# Checks performed by pint:
# - Alerts without 'for' duration (likely to be noisy)
# - Alerts without runbook_url annotation
# - Recording rules with incorrect naming convention
# - PromQL using absent() without proper labels
# - rate() applied to a gauge metric (logical error)
# - Aggregations that lose important labels
# - Template syntax issues in annotations
# - Comparison to bool without clear semantic meaning

# .pint.hcl: pint configuration
prometheus "prod" {
  uri = "http://prometheus:9090"
  timeout = "30s"
}

rule {
  # Require runbook annotation on all critical alerts
  match {
    kind = "alerting"
    label "severity" {
      value = "critical"
    }
  }
  annotation "runbook_url" {
    severity = "bug"
    required = true
  }
}

rule {
  # Recording rules must carry a team label
  match {
    kind = "recording"
  }
  label "team" {
    severity = "warning"
    required = true
  }
}
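
To see what this configuration asks of rule authors, here is a sketch of a rules file shaped to satisfy both rule blocks: a critical alert that carries runbook_url, and a recording rule that carries a team label. The alert, the recording rule name, the team value, and the example.com URL are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: a rules file shaped to satisfy the two pint rule blocks.
# All names, values, and the runbook URL are illustrative.
rules_dir="$(mktemp -d)"

cat > "$rules_dir/pint-demo.yaml" <<'EOF'
groups:
  - name: pint-demo
    rules:
      - alert: ApiDown
        expr: up{job="api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API target {{ $labels.instance }} is down"
          runbook_url: "https://example.com/runbooks/api-down"
      - record: job:up:avg
        expr: avg by (job) (up)
        labels:
          team: platform
EOF

# pint reads .pint.hcl from the current directory when that file exists.
if command -v pint >/dev/null 2>&1; then
  pint lint "$rules_dir" || echo "pint reported problems (see output above)"
else
  echo "pint not found on PATH; skipping lint"
fi
```

Removing the runbook_url line from the alert is an easy way to check that the "bug" severity is enforced in your setup.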

GitHub Actions Pipeline

# .github/workflows/prometheus-ci.yaml
name: Prometheus Config CI

on:
  pull_request:
    paths:
      - 'prometheus/**'
      - 'rules/**'
      - 'alertmanager/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install promtool
        run: |
          VERSION="2.53.0"
          wget -q "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"
          tar xzf "prometheus-${VERSION}.linux-amd64.tar.gz"
          sudo mv "prometheus-${VERSION}.linux-amd64/promtool" /usr/local/bin/

      - name: Validate Prometheus config
        run: promtool check config prometheus/prometheus.yml

      - name: Validate alert rules
        run: promtool check rules rules/*.yaml

      - name: Run unit tests
        run: promtool test rules tests/*_test.yaml

      - name: Validate Alertmanager config
        run: |
          wget -q "https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz"
          tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
          ./alertmanager-0.27.0.linux-amd64/amtool check-config alertmanager/alertmanager.yml

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install pint
        run: |
          go install github.com/cloudflare/pint/cmd/pint@latest

      - name: Lint rules
        run: pint lint rules/

  jsonnet-build:
    runs-on: ubuntu-latest
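    # No job-level 'if' here: github.event.pull_request.changed_files is an
    # integer count of changed files, not a list of paths, so contains() on it
    # can never match 'jsonnet'. Add a path-filter step if this job must be conditional.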
    steps:
      - uses: actions/checkout@v4

      - name: Install Jsonnet + jb
        run: |
          go install github.com/google/go-jsonnet/cmd/jsonnet@latest
          go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest

      - name: Install dependencies
        run: jb install

      - name: Build rules from Jsonnet
        run: |
          jsonnet -J vendor -m generated/ main.jsonnet
          promtool check rules generated/*.yaml
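
The same checks can be run locally before pushing. The function below is a sketch that mirrors the validate job; run_checks is a hypothetical name, and the paths are the ones used in the workflow.

```shell
#!/usr/bin/env bash
# Sketch: local pre-push check mirroring the CI "validate" job.
# run_checks is an illustrative helper, not part of promtool.
run_checks() {
  # Each step must succeed before the next one runs.
  promtool check config prometheus/prometheus.yml || return 1
  promtool check rules rules/*.yaml || return 1
  promtool test rules tests/*_test.yaml || return 1
  echo "all checks passed"
}
```

Call run_checks from the repository root, for example from a Git pre-push hook, and treat a non-zero return as a reason to abort the push.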

GitOps Deployment

GitOps Prometheus Deployment Flow
flowchart LR
    subgraph PR["Pull Request"]
        V[Validate]
        T[Unit Test]
        L[Lint]
    end

    subgraph Merge["On Merge to Main"]
        B[Build Jsonnet]
        G[Generate YAML]
    end

    subgraph Deploy["GitOps"]
        AR[ArgoCD / Flux]
        K8S[Kubernetes]
    end

    PR --> Merge --> Deploy
    V & T & L -->|"all pass"| B --> G --> AR --> K8S
                            
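# ArgoCD Application for Prometheus rules
# ArgoCD applies Kubernetes manifests, so generated/rules must contain resources
# such as PrometheusRule objects or ConfigMaps, not bare rule files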
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-rules
  namespace: argocd
spec:
  project: monitoring
  source:
    repoURL: https://github.com/myorg/monitoring-config.git
    targetRevision: main
    path: generated/rules
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
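
For ArgoCD to sync them, the files under generated/rules need to be resources the cluster understands. One common arrangement, assumed here and not stated in the pipeline, is the Prometheus Operator's PrometheusRule resource. The sketch below wraps a single alert in that shape; the resource name, the role label, the alert expression, and OUT_DIR are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wrap a rule group in a PrometheusRule manifest (Prometheus Operator).
# OUT_DIR defaults to a temp dir so the example never overwrites real files.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
mkdir -p "$OUT_DIR"

cat > "$OUT_DIR/node-alerts.yaml" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
  labels:
    # Hypothetical label; it has to match your Prometheus ruleSelector
    role: alert-rules
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: HighCpuUsage
          expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
          for: 15m
          labels:
            severity: warning
EOF

echo "wrote $OUT_DIR/node-alerts.yaml"
```

The spec.groups content uses the same schema as a plain rule file, so a build step can generate these manifests from the rules that promtool has already validated and tested.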

Advanced Testing Patterns

# Test recording rules produce expected values
rule_files:
  - ../rules/recording-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api",status="200"}'
        values: '0+100x10'
      - series: 'http_requests_total{job="api",status="500"}'
        values: '0+5x10'

    # Test that recording rule produces correct value
    promql_expr_test:
      - expr: 'job:http_error_ratio:rate5m{job="api"}'
        eval_time: 10m
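        exp_samples:
          # 5/105 ≈ 0.0476. promtool compares sample values exactly by default,
          # so replace this rounded figure with the full-precision value it reports.
          - labels: 'job:http_error_ratio:rate5m{job="api"}'
            value: 0.0476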
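
The recording rule under test is not shown. A definition consistent with the 5/105 expectation could look like this sketch; the group name and the 5xx status matcher are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: a hypothetical recording rule that the promql_expr_test could target.
# rec_dir defaults to a temp dir so the example never overwrites real files.
rec_dir="${REC_DIR:-$(mktemp -d)}"
mkdir -p "$rec_dir"

cat > "$rec_dir/recording-rules.yaml" <<'EOF'
groups:
  - name: http-recording
    rules:
      - record: job:http_error_ratio:rate5m
        # Error ratio = 5xx request rate divided by total request rate, per job
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
EOF

echo "wrote $rec_dir/recording-rules.yaml"
```

With the input series in the test, errors grow by 5 per minute and all requests together by 105 per minute, so the recorded ratio works out to 5/105.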

Conclusion

Key Takeaways:
  • Always validate: promtool check config and promtool check rules catch syntax errors before deployment
  • Unit test alert rules: synthetic time series prove alerts fire (and don’t fire) correctly
  • Lint beyond syntax: pint catches logical errors and missing best practices
  • CI blocks broken configs: no alert rule reaches production without passing tests
  • GitOps for deployment: ArgoCD/Flux reconciles generated YAML from the main branch
  • Test both firing and non-firing: verify alerts stay silent when metrics are healthy