Prometheus Deep Dive Part 13: CI/CD Pipelines for Prometheus

Config & Rule Validation

promtool is Prometheus’s built-in CLI for validating configuration and rules. It catches syntax errors, invalid PromQL, label conflicts, and structural issues before deployment:

# Validate prometheus.yml configuration
promtool check config prometheus.yml
# Output: Checking prometheus.yml
#   SUCCESS: 4 rule files found
#   SUCCESS: prometheus.yml is valid prometheus config file syntax

# Validate individual rule files
promtool check rules rules/*.yaml
# Output: Checking rules/node-alerts.yaml
#   SUCCESS: 12 rules found

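# Check all rules in a directory recursively
# (-exec ... {} + batches the files and makes find exit non-zero if promtool fails,
#  which a CI step needs; with \; the failure would be swallowed)
find rules/ -name '*.yaml' -exec promtool check rules {} +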

# Common validation errors caught:
# - Invalid PromQL expressions
# - Duplicate rule names within a group
# - Missing required fields (alert name, expr)
# - Invalid label names (must match [a-zA-Z_][a-zA-Z0-9_]*)
# - Invalid template syntax in annotations
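
To make the error list above concrete, here is a deliberately broken rule file, a minimal sketch assuming the hypothetical path rules/broken-example.yaml. Running promtool check rules against it should fail on both rules; the exact error text depends on the Prometheus version, so none is reproduced here.

```yaml
# rules/broken-example.yaml (hypothetical file, intentionally invalid)
groups:
  - name: broken-examples
    rules:
      # Invalid PromQL: the closing parenthesis of rate() is missing
      - alert: BadPromQL
        expr: rate(http_requests_total[5m] > 0
        for: 5m

      # Missing required field: an alerting rule without expr
      - alert: MissingExpr
        for: 5m
        labels:
          severity: warning
```

Keeping a file like this outside the rules/ directory lets you confirm that the CI step really fails on bad input without blocking normal builds.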

Unit Testing Alert Rules

promtool’s test rules command evaluates rules against synthetic time series data and verifies expected alert states at specific timestamps:

# tests/cpu_alerts_test.yaml
rule_files:
  - ../rules/node-alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate CPU at 90% utilization (only 10% idle)
      - series: 'node_cpu_seconds_total{instance="node-1:9100",cpu="0",mode="idle"}'
        values: '0+6x30'     # 6 CPU-seconds idle per 60s sample = 10% idle
      - series: 'node_cpu_seconds_total{instance="node-1:9100",cpu="0",mode="user"}'
        values: '0+54x30'    # 54 CPU-seconds user per 60s sample = 90% busy

    alert_rule_test:
      # At 20m, the alert should be firing (the 15m 'for' duration has elapsed)
      - eval_time: 20m
        alertname: HighCpuUsage
        exp_alerts:
          - exp_labels:
              instance: node-1:9100
              severity: warning
            exp_annotations:
              summary: "High CPU on node-1:9100"

      # At 5m, alert should NOT be firing yet (for: 15m not met)
      - eval_time: 5m
        alertname: HighCpuUsage
        exp_alerts: []    # No alerts expected

  - interval: 1m
    input_series:
      # Simulate healthy CPU (70% idle)
      - series: 'node_cpu_seconds_total{instance="node-2:9100",cpu="0",mode="idle"}'
        values: '0+42x30'    # 42 CPU-seconds idle per 60s sample = 70% idle
      - series: 'node_cpu_seconds_total{instance="node-2:9100",cpu="0",mode="user"}'
        values: '0+18x30'    # 18 CPU-seconds user per 60s sample = 30% busy

    alert_rule_test:
      # Should never fire for healthy node
      - eval_time: 30m
        alertname: HighCpuUsage
        exp_alerts: []

# Run unit tests
promtool test rules tests/cpu_alerts_test.yaml
# Output: Unit Testing:  tests/cpu_alerts_test.yaml
#   SUCCESS

# Run all test files
promtool test rules tests/*_test.yaml

# Verbose output for debugging
promtool test rules --debug tests/cpu_alerts_test.yaml
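
The test file refers to rules/node-alerts.yaml, which is not shown in this part. The following is a sketch of what a matching HighCpuUsage rule could look like; the expression and the 80% threshold are assumptions, chosen only so that the labels, the annotation, and the 15m 'for' duration line up with the expectations in the test. With an expression of this shape, rate() yields CPU-seconds per second, so a series meant to represent 10% idle has to grow by 6 CPU-seconds per one-minute sample (6 / 60 = 0.1).

```yaml
# rules/node-alerts.yaml (hypothetical contents, not taken from the article)
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        # Busy percentage = 100 - idle percentage, averaged over all CPUs
        # of an instance. The 80 threshold is an assumed value.
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
```

Because the expression aggregates by instance, the firing alert carries only the instance and severity labels (plus alertname), which is why the test lists exactly those under exp_labels.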

Linting & Best Practices

# pint — Prometheus rule linter (beyond syntax checking)
# Catches logical errors and best practice violations

# Install
go install github.com/cloudflare/pint/cmd/pint@latest

# Lint rules
pint lint rules/

# Checks performed by pint:
# - Alerts without 'for' duration (likely to be noisy)
# - Alerts without runbook_url annotation
# - Recording rules with incorrect naming convention
# - PromQL using absent() without proper labels
# - Rate() on a gauge metric (logical error)
# - Aggregations that lose important labels
# - Template syntax issues in annotations
# - Comparison to bool without clear semantic meaning

# .pint.hcl — pint configuration
prometheus "prod" {
  uri = "http://prometheus:9090"
  timeout = "30s"
}

rule {
  # Require runbook annotation on all critical alerts
  match {
    kind = "alerting"
    label "severity" {
      value = "critical"
    }
  }
  annotation "runbook_url" {
    severity = "bug"
    required = true
  }
}

rule {
  # Recording rules must carry a team label
  match {
    kind = "recording"
  }
  label "team" {
    severity = "warning"
    required = true
  }
}
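
As an illustration of what the two pint rules above ask for, here is a small rule file that should satisfy them. The alert, the recording rule, the runbook URL, and the team value are all hypothetical names made up for this sketch.

```yaml
# rules/example-compliant.yaml (hypothetical file)
groups:
  - name: example-compliant
    rules:
      # severity=critical, so the runbook_url annotation is required
      - alert: ApiDown
        expr: up{job="api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API target {{ $labels.instance }} is down"
          runbook_url: "https://runbooks.example.com/api-down"

      # Recording rule, so the team label is required
      - record: job:up:avg
        expr: avg by (job) (up)
        labels:
          team: platform
```

Removing the runbook_url annotation or the team label should make pint report a problem at the severity configured in .pint.hcl.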

GitHub Actions Pipeline

# .github/workflows/prometheus-ci.yaml
name: Prometheus Config CI

on:
  pull_request:
    paths:
      - 'prometheus/**'
      - 'rules/**'
      - 'alertmanager/**'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install promtool
        run: |
          VERSION="2.53.0"
          wget -q "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"
          tar xzf "prometheus-${VERSION}.linux-amd64.tar.gz"
          sudo mv "prometheus-${VERSION}.linux-amd64/promtool" /usr/local/bin/

      - name: Validate Prometheus config
        run: promtool check config prometheus/prometheus.yml

      - name: Validate alert rules
        run: promtool check rules rules/*.yaml

      - name: Run unit tests
        run: promtool test rules tests/*_test.yaml

      - name: Validate Alertmanager config
        run: |
          wget -q "https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz"
          tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
          ./alertmanager-0.27.0.linux-amd64/amtool check-config alertmanager/alertmanager.yml

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install pint
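        run: |
          go install github.com/cloudflare/pint/cmd/pint@latest
          # Make binaries installed by `go install` visible to later steps
          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"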

      - name: Lint rules
        run: pint lint rules/

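  jsonnet-build:
    runs-on: ubuntu-latest
    # github.event.pull_request.changed_files is an integer count, not a list of
    # paths, so it cannot be matched with contains(). This job therefore runs on
    # every PR; gate it with a path filter on the workflow trigger if needed.
    steps:
      - uses: actions/checkout@v4

      - name: Install Jsonnet + jb
        run: |
          go install github.com/google/go-jsonnet/cmd/jsonnet@latest
          go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
          # Make binaries installed by `go install` visible to later steps
          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"

      - name: Install promtool
        run: |
          VERSION="2.53.0"
          wget -q "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"
          tar xzf "prometheus-${VERSION}.linux-amd64.tar.gz"
          sudo mv "prometheus-${VERSION}.linux-amd64/promtool" /usr/local/bin/

      - name: Install dependencies
        run: jb install

      - name: Build rules from Jsonnet
        run: |
          mkdir -p generated
          jsonnet -J vendor -m generated/ main.jsonnet
          promtool check rules generated/*.yaml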

GitOps Deployment

GitOps Prometheus Deployment Flow

flowchart LR
    subgraph PR["Pull Request"]
        V[Validate]
        T[Unit Test]
        L[Lint]
    end

    subgraph Merge["On Merge to Main"]
        B[Build Jsonnet]
        G[Generate YAML]
    end

    subgraph Deploy["GitOps"]
        AR[ArgoCD / Flux]
        K8S[Kubernetes]
    end

    PR --> Merge --> Deploy
    V & T & L -->|"all pass"| B --> G --> AR --> K8S

# ArgoCD Application for Prometheus rules
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-rules
  namespace: argocd
spec:
  project: monitoring
  source:
    repoURL: https://github.com/myorg/monitoring-config.git
    targetRevision: main
    path: generated/rules
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
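
ArgoCD applies Kubernetes manifests, so the files under generated/rules need to be Kubernetes objects rather than bare Prometheus rule files. One common option, assumed here and not stated in the article, is the Prometheus Operator's PrometheusRule resource. The sketch below wraps a HighCpuUsage rule in that resource; the release label, the expression, and the threshold are hypothetical.

```yaml
# generated/rules/node-alerts.yaml (hypothetical; assumes the Prometheus Operator CRDs are installed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
  labels:
    # Assumed label; it must match the ruleSelector of your Prometheus resource
    release: prometheus
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: HighCpuUsage
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on {{ $labels.instance }}"
```

The content under spec.groups uses the same schema as a plain rule file, so the rules validated and unit tested in CI can be embedded unchanged.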

Advanced Testing Patterns

# Test recording rules produce expected values
rule_files:
  - ../rules/recording-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api",status="200"}'
        values: '0+100x10'
      - series: 'http_requests_total{job="api",status="500"}'
        values: '0+5x10'

    # Test that recording rule produces correct value
    promql_expr_test:
      - expr: 'job:http_error_ratio:rate5m{job="api"}'
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_error_ratio:rate5m{job="api"}'
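            value: 0.047619047619047616    # 5/105; promtool matches sample values exactly by default, so a rounded 0.0476 fails (if the last digits differ, copy the value promtool reports)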
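
The test above queries job:http_error_ratio:rate5m, but the recording rule itself is not shown. A minimal sketch of a rule that would produce the 5/105 ratio from these input series follows; the group name and the exact expression are assumptions.

```yaml
# rules/recording-rules.yaml (hypothetical contents, not taken from the article)
groups:
  - name: http-recording
    rules:
      - record: job:http_error_ratio:rate5m
        # Share of 5xx responses among all responses, per job
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

With the inputs in the test, the 500 series grows by 5 per minute and both series together grow by 105 per minute, so the ratio of the two rates is 5/105 regardless of the window length.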

Conclusion

Key Takeaways:
- Always validate: promtool check config and promtool check rules catch syntax errors before deployment
- Unit test alert rules: synthetic time series prove alerts fire (and don’t fire) correctly
- Lint beyond syntax: pint catches logical errors and missing best practices
- CI blocks broken configs: no alert rule reaches production without passing tests
- GitOps for deployment: ArgoCD/Flux reconciles generated YAML from the main branch
- Test both firing and non-firing: verify alerts stay silent when metrics are healthy
