Config & Rule Validation
promtool is Prometheus’s built-in CLI for validating configuration and rules. It catches syntax errors, invalid PromQL, label conflicts, and structural issues before deployment:
# Validate prometheus.yml configuration
promtool check config prometheus.yml
# Output: Checking prometheus.yml
# SUCCESS: 4 rule files found
# SUCCESS: prometheus.yml is valid prometheus config file syntax
# Validate individual rule files
promtool check rules rules/*.yaml
# Output: Checking rules/node-alerts.yaml
# SUCCESS: 12 rules found
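# Check all rules in a directory recursively
# ('-exec ... +' makes find exit non-zero when promtool fails; with '\;' a failure is not reflected in find's exit status)
find rules/ -name '*.yaml' -exec promtool check rules {} +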
# Common validation errors caught:
# - Invalid PromQL expressions
# - Duplicate rule names within a group
# - Missing required fields (alert name, expr)
# - Invalid label names (must match [a-zA-Z_][a-zA-Z0-9_]*)
# - Invalid template syntax in annotations
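The rule file being validated is not shown in this article. As a minimal sketch, the script below writes a hypothetical rules/node-alerts.yaml with a HighCpuUsage alert into a scratch directory and validates it if promtool is on the PATH. The expression, the 80% threshold, the group name, and the scratch directory are assumptions made for illustration; only the alert name, the 15m 'for' duration, the severity label, and the summary text are taken from the unit test later in the article.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory so the sketch does not touch a real repository.
workdir="$(mktemp -d)"
mkdir -p "$workdir/rules"

# Hypothetical alert rule: the expression and threshold are illustrative assumptions.
cat > "$workdir/rules/node-alerts.yaml" <<'EOF'
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
EOF

# Validate only when promtool is installed; otherwise just report where the file is.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$workdir/rules/node-alerts.yaml"
else
  echo "promtool not found; wrote $workdir/rules/node-alerts.yaml"
fi
```

The heredoc delimiter is quoted ('EOF') so the shell leaves `{{ $labels.instance }}` untouched for Prometheus to expand.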
Unit Testing Alert Rules
promtool’s test rules command evaluates rules against synthetic time series data and verifies expected alert states at specific timestamps:
# tests/cpu_alerts_test.yaml
rule_files:
  - ../rules/node-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Simulate CPU at 90% utilization (only 10% idle).
      # The counter is in seconds, so with a 1m step, 6s of idle time per step = 10% idle.
      - series: 'node_cpu_seconds_total{instance="node-1:9100",cpu="0",mode="idle"}'
        values: '0+6x30'   # +6 seconds idle per minute = 10% idle
      - series: 'node_cpu_seconds_total{instance="node-1:9100",cpu="0",mode="user"}'
        values: '0+54x30'  # +54 seconds user per minute = 90% busy
    alert_rule_test:
      # At 20m, the alert should be firing (the 15m 'for' duration has elapsed)
      - eval_time: 20m
        alertname: HighCpuUsage
        exp_alerts:
          - exp_labels:
              instance: node-1:9100
              severity: warning
            exp_annotations:
              summary: "High CPU on node-1:9100"
      # At 5m, alert should NOT be firing yet (for: 15m not met)
      - eval_time: 5m
        alertname: HighCpuUsage
        exp_alerts: []  # No alerts expected
  - interval: 1m
    input_series:
      # Simulate healthy CPU (70% idle): 42s idle and 18s user per 1m step
      - series: 'node_cpu_seconds_total{instance="node-2:9100",cpu="0",mode="idle"}'
        values: '0+42x30'
      - series: 'node_cpu_seconds_total{instance="node-2:9100",cpu="0",mode="user"}'
        values: '0+18x30'
    alert_rule_test:
      # Should never fire for healthy node
      - eval_time: 30m
        alertname: HighCpuUsage
        exp_alerts: []
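The values field uses promtool's expanding notation: 'a+bxn' starts at a and adds b a further n times, giving n+1 samples. Because node_cpu_seconds_total counts seconds, the idle share a synthetic series implies is its per-step increment divided by the step length in seconds. The helpers below are a rough sketch of that arithmetic; the function names are hypothetical, and the expander handles only the simple 'a+bxn' form with a non-negative step.

```shell
# Expand promtool's 'a+bxn' shorthand into the n+1 sample values it stands for.
expand_series() {
  awk -v spec="$1" 'BEGIN {
    split(spec, parts, /[+x]/)   # parts[1]=start, parts[2]=step, parts[3]=repeat count
    out = ""
    for (i = 0; i <= parts[3]; i++) {
      out = out (i ? " " : "") (parts[1] + i * parts[2])
    }
    print out
  }'
}

# CPU utilization (percent) implied by an idle counter that grows by idle_step
# seconds every interval_s seconds on a single core.
cpu_usage_pct() {
  awk -v idle="$1" -v interval="$2" \
    'BEGIN { printf "%.1f\n", 100 - idle / interval * 100 }'
}

expand_series '0+6x3'   # → 0 6 12 18
cpu_usage_pct 6 60      # → 90.0  (6s idle per 60s step)
cpu_usage_pct 0.1 60    # → 99.8  (0.1s idle per 60s step is almost fully busy)
```

The last call shows why small increments are a trap: an idle counter that gains 0.1 per one-minute step describes a core that is about 99.8% busy, not 90%.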
# Run unit tests
promtool test rules tests/cpu_alerts_test.yaml
# Output: Unit Testing: tests/cpu_alerts_test.yaml
# SUCCESS
# Run all test files
promtool test rules tests/*_test.yaml
# Verbose output for debugging
promtool test rules --debug tests/cpu_alerts_test.yaml
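When many test files exist, it can help to see which file failed. The function below is one possible wrapper, not part of promtool: it runs each *_test.yaml separately and lists the failures. The function name and the PROMTOOL override variable are hypothetical; the override exists only so the binary can be swapped out.

```shell
# Run each *_test.yaml in a directory separately and list the files that failed.
# PROMTOOL is a hypothetical override for the promtool binary (defaults to promtool).
run_rule_tests() {
  local dir="$1" failed=0 file
  for file in "$dir"/*_test.yaml; do
    [ -e "$file" ] || continue   # the glob matched nothing
    if ! "${PROMTOOL:-promtool}" test rules "$file"; then
      echo "FAILED: $file" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]            # non-zero exit status if any file failed
}

# Usage: run_rule_tests tests
```

Running the files one at a time trades a little speed for a clear list of failing files at the end of the log.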
Linting & Best Practices
# pint — Prometheus rule linter (beyond syntax checking)
# Catches logical errors and best practice violations
# Install
go install github.com/cloudflare/pint/cmd/pint@latest
# Lint rules
pint lint rules/
# Checks pint can perform (some depend on configuration or a reachable Prometheus server):
# - Alerts without 'for' duration (likely to be noisy)
# - Alerts without runbook_url annotation
# - Recording rules with incorrect naming convention
# - PromQL using absent() without proper labels
# - Rate() on a gauge metric (logical error)
# - Aggregations that lose important labels
# - Template syntax issues in annotations
# - Comparison to bool without clear semantic meaning
# .pint.hcl: pint configuration
prometheus "prod" {
  uri     = "http://prometheus:9090"
  timeout = "30s"
}

rule {
  # Require runbook annotation on all critical alerts
  match {
    kind = "alerting"
    label "severity" {
      value = "critical"
    }
  }
  annotation "runbook_url" {
    severity = "bug"
    required = true
  }
}

rule {
  # Recording rules must carry a team label
  match {
    kind = "recording"
  }
  label "team" {
    severity = "warning"
    required = true
  }
}
GitHub Actions Pipeline
# .github/workflows/prometheus-ci.yaml
name: Prometheus Config CI
on:
  pull_request:
    paths:
      - 'prometheus/**'
      - 'rules/**'
      - 'alertmanager/**'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install promtool
        run: |
          VERSION="2.53.0"
          wget -q "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"
          tar xzf "prometheus-${VERSION}.linux-amd64.tar.gz"
          sudo mv "prometheus-${VERSION}.linux-amd64/promtool" /usr/local/bin/
      - name: Validate Prometheus config
        run: promtool check config prometheus/prometheus.yml
      - name: Validate alert rules
        run: promtool check rules rules/*.yaml
      - name: Run unit tests
        run: promtool test rules tests/*_test.yaml
      - name: Validate Alertmanager config
        run: |
          wget -q "https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz"
          tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
          ./alertmanager-0.27.0.linux-amd64/amtool check-config alertmanager/alertmanager.yml
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install pint
        run: |
          go install github.com/cloudflare/pint/cmd/pint@latest
          # Make sure go-installed binaries are on PATH for later steps
          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"
      - name: Lint rules
        run: pint lint rules/
  jsonnet-build:
    runs-on: ubuntu-latest
    # github.event.pull_request.changed_files is a count of changed files, not a list
    # of paths, so contains(..., 'jsonnet') never matches. The job is left unconditional.
    steps:
      - uses: actions/checkout@v4
      - name: Install promtool
        run: |
          VERSION="2.53.0"
          wget -q "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz"
          tar xzf "prometheus-${VERSION}.linux-amd64.tar.gz"
          sudo mv "prometheus-${VERSION}.linux-amd64/promtool" /usr/local/bin/
      - name: Install Jsonnet + jb
        run: |
          go install github.com/google/go-jsonnet/cmd/jsonnet@latest
          go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"
      - name: Install dependencies
        run: jb install
      - name: Build rules from Jsonnet
        run: |
          mkdir -p generated
          jsonnet -J vendor -m generated/ main.jsonnet
          promtool check rules generated/*.yaml
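The pull request event payload exposes only a count of changed files, so a job that should run only when Jsonnet sources change needs the actual list of changed paths. One way to get it is to filter the output of git diff inside a step. The helper below is a small sketch of that filter; the function name and the BASE_REF variable are hypothetical, and the commented usage assumes the checkout fetched enough history for the diff to work.

```shell
# Succeeds when the newline-separated path list on stdin contains a Jsonnet source.
# grep reads all of its input here, so the upstream command never sees a closed pipe.
has_jsonnet_changes() {
  grep -E '\.(jsonnet|libsonnet)$' >/dev/null
}

# Hypothetical use inside a workflow step (BASE_REF is an assumed variable holding
# the pull request's target branch):
#
#   if git diff --name-only "origin/${BASE_REF}...HEAD" | has_jsonnet_changes; then
#     echo "jsonnet=true" >> "$GITHUB_OUTPUT"
#   fi
```

Matching on the file extension rather than on the substring "jsonnet" avoids false positives from paths such as docs/jsonnet-notes.md.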
GitOps Deployment
GitOps Prometheus Deployment Flow
flowchart LR
    subgraph PR["Pull Request"]
        V[Validate]
        T[Unit Test]
        L[Lint]
    end
    subgraph Merge["On Merge to Main"]
        B[Build Jsonnet]
        G[Generate YAML]
    end
    subgraph Deploy["GitOps"]
        AR[ArgoCD / Flux]
        K8S[Kubernetes]
    end
    PR --> Merge --> Deploy
    V & T & L -->|"all pass"| B --> G --> AR --> K8S
# ArgoCD Application for Prometheus rules
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-rules
  namespace: argocd
spec:
  project: monitoring
  source:
    repoURL: https://github.com/myorg/monitoring-config.git
    targetRevision: main
    path: generated/rules
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
Advanced Testing Patterns
# Test recording rules produce expected values
rule_files:
  - ../rules/recording-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api",status="200"}'
        values: '0+100x10'
      - series: 'http_requests_total{job="api",status="500"}'
        values: '0+5x10'
    # Test that recording rule produces correct value.
    # promtool compares sample values strictly by default, so the result is rounded
    # to four decimal places before it is compared.
    promql_expr_test:
      - expr: 'round(job:http_error_ratio:rate5m{job="api"}, 0.0001)'
        eval_time: 10m
        exp_samples:
          - labels: '{job="api"}'  # round() drops the metric name
            value: 0.0476  # 5/105 ≈ 0.0476
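The recording rule under test is not shown in the article. The script below sketches what rules/recording-rules.yaml might contain and reproduces the expected value from the synthetic input. The rule expression, the group name, and the scratch directory are assumptions; only the recorded metric name and the input increments (100 and 5 per step) come from the test.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory so the sketch does not touch a real repository.
workdir="$(mktemp -d)"
mkdir -p "$workdir/rules"

# Hypothetical recording rule: share of 5xx responses in all requests, per job.
cat > "$workdir/rules/recording-rules.yaml" <<'EOF'
groups:
  - name: http-recording-rules
    rules:
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
EOF

# The 500 series grows by 5 per step and the 200 series by 100 per step,
# so errors make up 5 of every 105 requests.
expected="$(awk 'BEGIN { printf "%.4f", 5 / (100 + 5) }')"
echo "expected job:http_error_ratio:rate5m is about $expected"
```

Both rates are taken over the same window of evenly spaced samples, so the window length cancels out and the ratio depends only on the per-step increments.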
Conclusion
Key Takeaways:
- Always validate: promtool check config and promtool check rules catch syntax errors instantly
- Unit test alert rules: synthetic time series prove alerts fire (and don’t fire) correctly
- Lint beyond syntax: pint catches logical errors and missing best practices
- CI blocks broken configs: no alert rule reaches production without passing tests
- GitOps for deployment: ArgoCD/Flux reconciles generated YAML from main branch
- Test both firing and non-firing: verify alerts stay silent when metrics are healthy