Alert Rule Fundamentals
Prometheus alerting separates two concerns: alert evaluation (done by Prometheus itself) and alert notification (handled by Alertmanager). Prometheus periodically evaluates alert rules, fires alerts when conditions are met, and pushes them to Alertmanager for routing, deduplication, and delivery.
Each alert moves through up to three states: inactive → pending → firing.
Rule Syntax & Structure
Alert rules live in rule files referenced by rule_files in prometheus.yml. Each rule group has a name and an optional evaluation interval that overrides the global evaluation_interval:
# /etc/prometheus/rules/application.yml
groups:
  - name: application.rules
    interval: 30s # Override global evaluation_interval for this group
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate (threshold: 5%)."
          runbook_url: "https://runbooks.internal/alerts/high-error-rate"
          dashboard_url: "https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}"
The for Duration
The for clause prevents transient spikes from triggering alerts. An alert stays in pending state until the condition holds continuously for the specified duration:
stateDiagram-v2
[*] --> Inactive
Inactive --> Pending: Expression returns results
Pending --> Firing: Condition holds for 'for' duration
Pending --> Inactive: Condition no longer true
Firing --> Inactive: Condition no longer true
Firing --> Firing: Condition still true
If the condition stops holding for even one evaluation while the alert is pending, the for timer resets. For critical alerts, keep for relatively short (2-5m) or use recording rules to pre-compute the condition.
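The state transitions above can be illustrated with a small Python simulation. This is a simplified sketch, not Prometheus code: the names `AlertState`, `step`, and `simulate` are hypothetical, and it assumes one evaluation per fixed interval with a boolean result per evaluation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertState:
    state: str = "inactive"
    active_since: Optional[int] = None  # seconds since start of simulation


def step(alert: AlertState, condition_true: bool, now: int, for_seconds: int) -> str:
    """Advance the alert by one evaluation and return its new state."""
    if not condition_true:
        # Any evaluation where the expression returns nothing resets the timer.
        alert.state = "inactive"
        alert.active_since = None
        return alert.state
    if alert.active_since is None:
        alert.active_since = now
    held_for = now - alert.active_since
    alert.state = "firing" if held_for >= for_seconds else "pending"
    return alert.state


def simulate(conditions: List[bool], interval: int = 60, for_seconds: int = 300) -> List[str]:
    """Run one evaluation per entry in `conditions`, spaced `interval` seconds apart."""
    alert = AlertState()
    return [step(alert, c, i * interval, for_seconds) for i, c in enumerate(conditions)]


if __name__ == "__main__":
    # Three true evaluations, one false (timer resets), then six true.
    for i, s in enumerate(simulate([True] * 3 + [False] + [True] * 6)):
        print(f"t={i}m {s}")
```

With a one-minute interval and `for: 5m`, the single false evaluation at minute 3 pushes the first firing evaluation out to minute 9, which is why a flapping condition may never page.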
Annotations & Templating
Annotations support Go templating with access to $labels and $value. Common formatting functions:
# Annotation templating examples
annotations:
  # Access label values
  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is CrashLooping"
  # Format the expression value
  description: |
    Current value: {{ $value | humanize }}
    Percentage: {{ $value | humanizePercentage }}
    Duration: {{ $value | humanizeDuration }}
    Timestamp: {{ $value | humanizeTimestamp }}
  # Conditional text
  impact: >-
    {{ if gt $value 0.1 }}MAJOR impact on user traffic{{ else }}Minor degradation{{ end }}
  # Printf formatting
  detail: "Error rate is {{ printf \"%.2f\" $value }}%"
Labels on alert rules are used for routing decisions. Keep them consistent across your organization:
# Standard label taxonomy for alerts
labels:
  severity: critical | warning | info
  team: platform | backend | frontend | data
  environment: production | staging
  service: "{{ $labels.service }}" # Propagate from metric
  tier: "1" # Business criticality (1=highest)
Alertmanager Architecture
Alertmanager receives alerts from one or more Prometheus servers and handles deduplication, grouping, routing, silencing, inhibition, and notification dispatch.
flowchart LR
P1[Prometheus 1] --> AM[Alertmanager]
P2[Prometheus 2] --> AM
AM --> D[Deduplication]
D --> G[Grouping]
G --> I[Inhibition]
I --> S[Silence Check]
S --> R[Routing]
R --> PD[PagerDuty]
R --> SL[Slack]
R --> WH[Webhook]
Routing Tree
The routing configuration forms a tree. Each node can match on labels and either handle the alert or delegate to child routes. The first matching child route wins, unless that route sets continue: true, in which case later siblings are evaluated as well.
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.internal:587'
  smtp_from: 'alerts@company.com'
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s # Wait before sending first notification for a new group
  group_interval: 5m # Wait before sending updates to an existing group
  repeat_interval: 4h # Resend if alert still firing after this
  routes:
    # Critical alerts go to PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts go to Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 12h
    # Team-specific routing. Order matters: critical and warning alerts
    # already matched a route above, so they never reach this one. Move it
    # to the top (or set continue: true above) if team routing should win
    - match_re:
        team: 'backend|platform'
      receiver: 'backend-slack'
      routes:
        - match:
            severity: critical
          receiver: 'backend-pagerduty'
    # Catch-all for info alerts
    - match:
        severity: info
      receiver: 'slack-info'
      group_wait: 5m
      repeat_interval: 24h
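To see how first-match routing plays out on the tree above, here is a minimal Python model. It is a sketch under stated assumptions: the `Route` class and `resolve` function are hypothetical, every route is treated as `continue: false`, and only `match` and `match_re` are modelled.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)
    match_re: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        # A missing label is treated as an empty string.
        exact = all(labels.get(k, "") == v for k, v in self.match.items())
        regex = all(re.fullmatch(p, labels.get(k, "")) for k, p in self.match_re.items())
        return exact and regex


def resolve(route: Route, labels: Dict[str, str]) -> str:
    """Return the receiver chosen by a depth-first, first-match walk."""
    for child in route.routes:
        if child.matches(labels):
            return resolve(child, labels)
    return route.receiver  # no child matched: this node handles the alert


# Mirror of the routing tree in alertmanager.yml above.
tree = Route("default-receiver", routes=[
    Route("pagerduty-critical", match={"severity": "critical"}),
    Route("slack-warnings", match={"severity": "warning"}),
    Route("backend-slack", match_re={"team": "backend|platform"}, routes=[
        Route("backend-pagerduty", match={"severity": "critical"}),
    ]),
    Route("slack-info", match={"severity": "info"}),
])

if __name__ == "__main__":
    print(resolve(tree, {"severity": "critical", "team": "backend"}))  # pagerduty-critical
    print(resolve(tree, {"severity": "info", "team": "backend"}))      # backend-slack
    print(resolve(tree, {"severity": "info"}))                         # slack-info
```

In this model a critical backend alert stops at the first severity route and never reaches backend-pagerduty, so sibling order decides the outcome. `amtool config routes test` answers the same question against a real configuration.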
Grouping
Grouping batches related alerts into a single notification. When many targets go down simultaneously, you get one notification listing all affected instances instead of N separate pages.
A good default is to group by ['alertname', 'cluster'] for infrastructure alerts and ['alertname', 'service'] for application alerts. Never group by high-cardinality labels like pod or instance; doing so defeats the purpose.
# Grouping examples
route:
  # Group all alerts of the same name in the same cluster
  group_by: ['alertname', 'cluster']
  routes:
    # For network alerts, group by datacenter
    - match:
        category: network
      group_by: ['alertname', 'datacenter']
    # For capacity alerts, group by namespace
    - match:
        category: capacity
      group_by: ['alertname', 'namespace']
    # Special: group_by: ['...'] groups by ALL labels, which disables
    # aggregation (each distinct alert is sent on its own). For a digest,
    # group by one coarse label and use a long group_wait instead
    - match:
        severity: info
      group_by: ['severity']
      group_wait: 30m
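The effect of the group_by choice can be shown with a few lines of Python. This is an illustrative sketch only: `group_alerts` is a hypothetical helper, and alerts are represented as plain label dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def group_alerts(alerts: List[Labels], group_by: List[str]) -> Dict[Tuple, List[Labels]]:
    """Bucket alerts by the values of the group_by labels (missing label = empty string)."""
    groups: Dict[Tuple, List[Labels]] = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-1"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-2"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-3"},
    {"alertname": "InstanceDown", "cluster": "staging", "instance": "node-9"},
]

if __name__ == "__main__":
    print(len(group_alerts(alerts, ["alertname", "cluster"])))              # 2
    print(len(group_alerts(alerts, ["alertname", "cluster", "instance"])))  # 4
```

Each group becomes one notification, so adding the instance label turns two notifications into four here, and into hundreds during a large outage.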
Inhibition Rules
Inhibition suppresses notifications for alerts when a more severe related alert is already firing. If the entire cluster is down, you don't need separate alerts for each service in that cluster.
# Inhibition rules
inhibit_rules:
  # If a critical alert fires, suppress matching warnings
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'cluster', 'service']
  # If node is down, suppress all pod alerts on that node
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - alertname =~ "Pod.*"
    equal: ['node']
  # If cluster is unreachable, suppress all alerts from that cluster
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - severity =~ "warning|critical"
    equal: ['cluster']
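The logic of an inhibition rule (source matchers, target matchers, and the equal list) can be sketched in Python as follows. This is a simplified model, not Alertmanager's implementation: `matches` and `is_inhibited` are hypothetical names, and every matcher is expressed as an anchored regular expression.

```python
import re
from typing import Dict, List

Labels = Dict[str, str]


def matches(labels: Labels, matchers: Dict[str, str]) -> bool:
    """True if every matcher pattern fully matches the label value (missing = '')."""
    return all(re.fullmatch(p, labels.get(k, "")) for k, p in matchers.items())


def is_inhibited(target: Labels, firing: List[Labels], rules: List[dict]) -> bool:
    """True if some firing source alert inhibits `target` under any rule."""
    for rule in rules:
        if not matches(target, rule["target"]):
            continue
        for source in firing:
            if source is target:
                continue  # an alert does not inhibit itself in this model
            same = all(source.get(l, "") == target.get(l, "") for l in rule["equal"])
            if same and matches(source, rule["source"]):
                return True
    return False


rules = [
    {"source": {"alertname": "NodeDown"}, "target": {"alertname": "Pod.*"}, "equal": ["node"]},
]
node_down = {"alertname": "NodeDown", "node": "n1"}
pod_on_n1 = {"alertname": "PodCrashLooping", "node": "n1"}
pod_on_n2 = {"alertname": "PodCrashLooping", "node": "n2"}
firing = [node_down, pod_on_n1, pod_on_n2]

if __name__ == "__main__":
    print(is_inhibited(pod_on_n1, firing, rules))  # True
    print(is_inhibited(pod_on_n2, firing, rules))  # False
```

The equal list is what scopes the suppression: the pod alert on n2 still notifies because its node label differs from the NodeDown alert's.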
Silences
Silences temporarily mute alerts matching specific label matchers—useful during maintenance windows or known issues. Create them via the Alertmanager UI or amtool:
# Create a silence for 2 hours during maintenance
amtool silence add \
--alertmanager.url=http://alertmanager:9093 \
--author="wasil.zafar" \
--comment="Planned maintenance window for redis cluster upgrade" \
--duration=2h \
alertname="RedisDown" cluster="production"
# Create silence with specific start/end times
amtool silence add \
--alertmanager.url=http://alertmanager:9093 \
--author="wasil.zafar" \
--comment="Deploy window" \
--start="2026-06-15T22:00:00Z" \
--end="2026-06-15T23:00:00Z" \
service=~"checkout|payment"
# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093
# Expire (remove) a silence by ID
amtool silence expire --alertmanager.url=http://alertmanager:9093 "silence-id-here"
Notification Channels
PagerDuty
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      # Set exactly one key: routing_key for Events API v2, or
      # service_key for the legacy "Prometheus" integration type
      - routing_key: ''
        # service_key: ''
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          cluster: '{{ .CommonLabels.cluster }}'
        # Custom links shown in PagerDuty incident
        links:
          - href: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
            text: 'Grafana Dashboard'
          - href: '{{ (index .Alerts 0).Annotations.runbook_url }}'
            text: 'Runbook'
Slack
receivers:
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00/B00/XXXX'
        channel: '#alerts-warning'
        username: 'Alertmanager'
        icon_emoji: ':warning:'
        send_resolved: true
        title: '{{ .Status | toUpper }}{{ if eq .Status "firing" }} ({{ .Alerts.Firing | len }}){{ end }}'
        # Literal block (|-) keeps the line breaks; a folded block (>-)
        # would join these lines into one
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Severity:* `{{ .Labels.severity }}`
          *Service:* {{ .Labels.service }}
          *Description:* {{ .Annotations.description }}
          *Dashboard:* {{ .Annotations.dashboard_url }}
          {{ end }}
        # Color coding: red for firing, green for resolved
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'Runbook :book:'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence :mute:'
            url: '{{ .ExternalURL }}/#/silences/new?filter=%7Balertname%3D%22{{ .CommonLabels.alertname }}%22%7D'
OpsGenie
receivers:
  - name: 'opsgenie-oncall'
    opsgenie_configs:
      - api_key: ''
        message: '{{ .CommonAnnotations.summary }}'
        description: |
          {{ .CommonAnnotations.description }}
          Firing alerts: {{ .Alerts.Firing | len }}
          {{ range .Alerts.Firing }}
          - {{ .Labels.alertname }}: {{ .Annotations.summary }}
          {{ end }}
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else if eq .CommonLabels.severity "warning" }}P3{{ else }}P5{{ end }}'
        tags: 'prometheus,{{ .CommonLabels.team }},{{ .CommonLabels.environment }}'
        responders:
          - name: '{{ .CommonLabels.team }}-oncall'
            type: 'schedule'
Webhooks
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://alert-handler.internal/api/v1/alerts'
        send_resolved: true
        http_config:
          bearer_token_file: '/etc/alertmanager/secrets/webhook-token'
          tls_config:
            cert_file: '/etc/alertmanager/tls/client.crt'
            key_file: '/etc/alertmanager/tls/client.key'
        # Max alerts per webhook call (default: 0 = unlimited)
        max_alerts: 100
The webhook payload format (sent as POST with JSON body):
{
"version": "4",
"groupKey": "{}:{alertname=\"HighErrorRate\"}",
"status": "firing",
"receiver": "custom-webhook",
"groupLabels": {
"alertname": "HighErrorRate"
},
"commonLabels": {
"alertname": "HighErrorRate",
"severity": "critical",
"service": "checkout"
},
"commonAnnotations": {
"summary": "High error rate on checkout"
},
"alerts": [
{
"status": "firing",
"labels": {
"alertname": "HighErrorRate",
"service": "checkout",
"instance": "checkout-7d8f9:8080"
},
"annotations": {
"summary": "High error rate on checkout",
"description": "checkout has 8.2% error rate"
},
"startsAt": "2026-06-15T10:00:00.000Z",
"endsAt": "0001-01-01T00:00:00Z",
"generatorURL": "http://prometheus:9090/graph?g0.expr=...",
"fingerprint": "abc123def456"
}
]
}
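A receiving service only needs to parse this JSON body. The following Python sketch shows one way to turn a payload into a short summary; `summarize` is a hypothetical helper, the sample payload is a trimmed copy of the one above, and no HTTP server or authentication is included.

```python
import json
from typing import List

# Trimmed copy of the payload shown above, for demonstration only.
SAMPLE = """
{
  "version": "4",
  "status": "firing",
  "receiver": "custom-webhook",
  "groupLabels": {"alertname": "HighErrorRate"},
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighErrorRate", "service": "checkout"},
      "annotations": {"summary": "High error rate on checkout"},
      "startsAt": "2026-06-15T10:00:00.000Z"
    }
  ]
}
"""


def summarize(payload_json: str) -> List[str]:
    """Return one header line plus one line per firing alert in the payload."""
    payload = json.loads(payload_json)
    firing = [a for a in payload.get("alerts", []) if a.get("status") == "firing"]
    name = payload.get("groupLabels", {}).get("alertname", "unknown")
    lines = [f"[{payload['status'].upper()}] {name}: {len(firing)} firing"]
    for alert in firing:
        service = alert.get("labels", {}).get("service", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"- {service}: {summary} (since {alert.get('startsAt', '?')})")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize(SAMPLE)))
```

Using `.get` with defaults keeps the handler tolerant of alerts that lack a service label or summary annotation, which is common when several teams share one webhook.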
Alert Routing Examples
Team-Based Routing
# Route alerts to team-specific channels based on labels
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  routes:
    - match:
        team: platform
      receiver: 'platform-slack'
      routes:
        - match:
            severity: critical
          receiver: 'platform-pagerduty'
    - match:
        team: backend
      receiver: 'backend-slack'
      routes:
        - match:
            severity: critical
          receiver: 'backend-pagerduty'
    - match:
        team: data
      receiver: 'data-slack'
      routes:
        - match:
            severity: critical
          receiver: 'data-opsgenie'
Severity Escalation
# Escalation sketch: Alertmanager does not track acknowledgements, so
# "page the manager after 30 min" has to be modelled with labels
route:
  receiver: 'team-oncall'
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: 'team-pagerduty'
      repeat_interval: 30m # Re-notify every 30 min until resolved
      routes:
        # Alerts carrying escalation_level=manager (set by a separate
        # alert rule, e.g. one with a longer 'for') go to management
        - match:
            escalation_level: manager
          receiver: 'manager-pagerduty'
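Note that repeat_interval simply re-sends the same notification; it does not escalate. Alertmanager has no notion of acknowledgement, so time-based escalation belongs either in the paging tool's escalation policy or in a second alert rule that carries a distinct label.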
Time-Based Routing
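# Time-based routing using time_intervals (Alertmanager 0.24+)
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
  - name: out-of-hours
    time_intervals:
      # A time range cannot wrap past midnight (start must precede end),
      # so the morning and evening are listed as two ranges
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
      - weekdays: ['saturday', 'sunday']
route:
  receiver: 'default'
  routes:
    # Warnings notify Slack only during business hours; outside them the
    # route still matches, but its notifications are suppressed
    - match:
        severity: warning
      receiver: 'slack-warnings'
      active_time_intervals:
        - business-hours
    # Critical alerts page at any time (no time restriction on this route)
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'
    # Alternative to the first route: the same policy expressed with
    # mute_time_intervals. Keep only one of the two; as written, this
    # route is never reached because the first warning route matches first
    - match:
        severity: warning
      receiver: 'slack-warnings'
      mute_time_intervals:
        - out-of-hours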
Testing Alerts
promtool test rules
You can unit-test alert rules with promtool test rules. Define synthetic input series and expected alert outputs:
# tests/alert_test.yml
rule_files:
  - ../rules/application.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Simulate a 10% error rate for 10 minutes
      - series: 'http_requests_total{service="checkout", status="500"}'
        values: '0+10x10' # Starts at 0, increments by 10 each minute
      - series: 'http_requests_total{service="checkout", status="200"}'
        values: '0+90x10' # 90 successful per minute
    alert_rule_test:
      - eval_time: 6m # Check at 6 minutes
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              service: checkout
            # promtool compares the full annotation set, so every
            # annotation defined on the rule must be listed here
            exp_annotations:
              summary: "High error rate on checkout"
              description: "Service checkout has 10% error rate (threshold: 5%)."
              runbook_url: "https://runbooks.internal/alerts/high-error-rate"
              dashboard_url: "https://grafana.internal/d/svc-overview?var-service=checkout"
      - eval_time: 3m # At 3 minutes, should be pending (for: 5m)
        alertname: HighErrorRate
        exp_alerts: [] # No firing alerts yet
# Run alert rule tests
promtool test rules tests/alert_test.yml
# Validate rule file syntax
promtool check rules rules/application.yml
# Example output
# Unit Testing: tests/alert_test.yml
# SUCCESS
Unit Testing Definitions
# Test that recording rules produce expected values
# (rule_files must list the file that defines job:http_request_duration_seconds:p99)
tests:
  - interval: 1m
    input_series:
      - series: 'http_request_duration_seconds_bucket{le="0.1", service="api"}'
        values: '0+100x5'
      - series: 'http_request_duration_seconds_bucket{le="0.5", service="api"}'
        values: '0+900x5'
      - series: 'http_request_duration_seconds_bucket{le="+Inf", service="api"}'
        values: '0+1000x5'
    # Test recording rule output
    promql_expr_test:
      - expr: 'job:http_request_duration_seconds:p99'
        eval_time: 5m
        exp_samples:
          - labels: 'job:http_request_duration_seconds:p99{service="api"}'
            value: 0.5
Alert Fatigue Prevention
Symptom vs Cause-Based Alerts
| Approach | Example | When to Use |
|---|---|---|
| Symptom-based | "Error rate > 5% for users" | Page-worthy: directly impacts users |
| Cause-based | "Disk 90% full" | Warning only: may never cause symptoms |
| Multi-window | "Burning error budget too fast" | SLO-based alerting (preferred for SRE) |
Actionable Alerts Principles
Every alert that reaches a human should answer four questions:
1. What is broken? (summary annotation)
2. What is the impact? (description with user-facing effect)
3. What should I do? (runbook_url annotation)
4. Where do I look? (dashboard_url annotation)
# Anti-pattern: Alert that cannot be acted upon
- alert: CPUHigh
  # node_cpu_seconds_total is a counter, so compare its rate, not the raw value
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
  for: 5m
  # Missing: what service? what impact? what to do?
# Better: Actionable alert tied to user impact
- alert: CheckoutLatencyBudgetBurn
  expr: |
    (
      sum(rate(http_request_duration_seconds_count{service="checkout",code!~"5.."}[1h]))
      -
      sum(rate(http_request_duration_seconds_bucket{service="checkout",le="0.5",code!~"5.."}[1h]))
    )
    /
    sum(rate(http_request_duration_seconds_count{service="checkout",code!~"5.."}[1h]))
    > 0.01
  for: 5m
  labels:
    severity: critical
    team: backend
    slo: checkout-latency
  annotations:
    summary: "Checkout latency SLO burn rate too high"
    # Prometheus templates have no arithmetic functions (no divf), so the
    # time-to-exhaustion estimate is left to the dashboard or runbook
    description: |
      More than 1% of checkout requests exceeded the 500ms latency SLO
      in the last hour. Current violation rate: {{ $value | humanizePercentage }}.
      At this rate the error budget is being consumed faster than planned.
    runbook_url: "https://runbooks.internal/slos/checkout-latency"
    dashboard_url: "https://grafana.internal/d/checkout-slo"
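The arithmetic behind a budget-burn alert can be worked through in a few lines of Python. This is a sketch under assumptions that do not come from the rule itself: a 99% SLO target and a 30-day budget window, with hypothetical function names `burn_rate` and `hours_to_exhaustion`.

```python
def burn_rate(bad_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    bad_ratio:  fraction of requests currently violating the SLO (e.g. 0.02)
    slo_target: fraction of requests that must be good (e.g. 0.99)
    """
    error_budget = 1.0 - slo_target
    return bad_ratio / error_budget


def hours_to_exhaustion(bad_ratio: float, slo_target: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the current bad_ratio persists."""
    rate = burn_rate(bad_ratio, slo_target)
    if rate <= 0:
        return float("inf")  # nothing is failing, the budget is never exhausted
    return window_days * 24 / rate


if __name__ == "__main__":
    # 2% of requests too slow against a 99% target (1% budget):
    print(round(burn_rate(0.02, 0.99), 1))            # 2.0
    print(round(hours_to_exhaustion(0.02, 0.99), 1))  # 360.0
```

A burn rate of 1 spends the budget exactly over the window; a rate of 2 exhausts a 30-day budget in 15 days (360 hours). Multi-window SLO alerts typically page on much higher burn rates sustained over short windows.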
High Availability
Alertmanager Clustering
Run multiple Alertmanager instances to avoid single points of failure. They form a cluster using a gossip protocol (based on HashiCorp's memberlist) to deduplicate notifications:
# docker-compose.yml - Alertmanager HA cluster (3 nodes)
services:
  alertmanager-1:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
      - '--cluster.settle-timeout=60s'
    ports:
      - "9093:9093"
      - "9094:9094"
  alertmanager-2:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'
  alertmanager-3:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'
Gossip Protocol
The gossip protocol ensures that notification deduplication works across all cluster members. Each Alertmanager instance knows about silences and notification states from peers:
# Check cluster status
curl -s http://alertmanager:9093/api/v2/status | jq '.cluster'
# Expected output for healthy 3-node cluster:
# {
# "name": "alertmanager-1",
# "status": "ready",
# "peers": [
# {"name": "alertmanager-2", "address": "10.0.0.2:9094"},
# {"name": "alertmanager-3", "address": "10.0.0.3:9094"}
# ]
# }
# Kubernetes: Use a headless service for peer discovery
# alertmanager-cluster.monitoring.svc.cluster.local resolves to all pod IPs
List every Alertmanager instance in the Prometheus alertmanagers config rather than placing them behind a load balancer. Prometheus will send alerts to all of them, and the cluster handles deduplication internally.
# prometheus.yml - Send alerts to all Alertmanager instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1:9093'
            - 'alertmanager-2:9093'
            - 'alertmanager-3:9093'
    # Or use DNS service discovery in Kubernetes
    - dns_sd_configs:
        - names:
            - 'alertmanager-cluster.monitoring.svc.cluster.local'
          type: A
          port: 9093
amtool CLI
amtool is the official CLI for interacting with Alertmanager. Essential commands for daily operations:
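# Set the default Alertmanager URL once in amtool's config file
# (~/.config/amtool/config.yml) instead of passing --alertmanager.url each time:
#   alertmanager.url: http://alertmanager:9093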
# Query current alerts
amtool alert query
amtool alert query alertname="HighErrorRate"
amtool alert query severity="critical" service=~"checkout|payment"
# Check which receiver an alert would route to
amtool config routes test --tree \
severity=critical team=backend alertname=HighErrorRate
# Show the full routing tree
amtool config routes show
# Manage silences
amtool silence add alertname="DeployInProgress" --duration=30m \
--author="deploy-bot" --comment="Rolling deploy in progress"
amtool silence query
amtool silence expire "silence-id-here"
amtool silence expire $(amtool silence query -q) # Remove ALL silences (careful!)
# Check Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml
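# Verify template rendering (--template.text is required; the template
# name below is an example, substitute one defined in your .tmpl files)
amtool template render --template.glob='/etc/alertmanager/templates/*.tmpl' \
--template.text='{{ template "slack.default.title" . }}'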
Conclusion
Effective alerting is the difference between a team that responds to real incidents and one drowning in noise. Focus on symptom-based alerts with clear runbooks, use Alertmanager's grouping and inhibition to reduce noise, and always test your rules before deploying them to production.