Alertmanager Architecture
Alertmanager sits between Prometheus (which evaluates alert rules) and your notification channels (PagerDuty, Slack, email, webhooks). It targets the main causes of alert fatigue: deduplication keeps the same alert from producing repeated notifications, grouping batches related alerts into a single notification, and inhibition suppresses lower-priority alerts when a root cause is already firing.
flowchart TD
    A[Prometheus<br/>Alert Rules] -->|HTTP POST /api/v2/alerts| B[Alertmanager API<br/>Receives Alerts]
    B --> C[Dispatcher<br/>Route Matching]
    C --> D[Routing Tree<br/>Label Matchers]
    D --> E[Grouping<br/>group_by labels]
    E --> F[Notification Pipeline<br/>Dedup + Throttle]
    F --> G[Receivers]
    G --> H[PagerDuty]
    G --> I[Slack]
    G --> J[Email]
    G --> K[Webhook]
    L[Inhibition Rules] -->|Suppress| F
    M[Silences] -->|Mute| F
Alert Lifecycle
Firing: Prometheus evaluates an alert rule expression and it returns results. After the optional for duration (the pending period), the alert transitions to firing and is sent to Alertmanager via HTTP POST.
Routing: The alert traverses the routing tree. Each route has label matchers; among sibling routes the first match wins, and the deepest matching route determines the receiver. Child routes inherit parent settings unless explicitly overridden.
Grouping: Within the matched route, Alertmanager groups the alert with other alerts that share the same group_by label values. A new group waits for group_wait before sending its first notification.
Notification: The notification pipeline checks inhibitions and silences. If not suppressed, the alert is formatted using the configured template and delivered to the receiver.
Resolved: When the expression no longer matches, Prometheus sends the alert with an end time and Alertmanager marks it resolved. A resolution notification follows for receivers configured with send_resolved: true.
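As a rough illustration of what travels over that HTTP POST, the sketch below builds the JSON body for the v2 alerts endpoint. The alert name, label values, and the build_alert helper are hypothetical; only the field names (labels, annotations, startsAt, endsAt) come from the Alertmanager API.

```python
import json
from datetime import datetime, timedelta, timezone


def build_alert(labels, annotations, starts_at, ends_at=None):
    """Build one alert object for the JSON array sent to /api/v2/alerts."""
    alert = {
        "labels": labels,
        "annotations": annotations,
        "startsAt": starts_at.isoformat(),
    }
    if ends_at is not None:
        # Once this end time has passed, Alertmanager treats the alert as resolved.
        alert["endsAt"] = ends_at.isoformat()
    return alert


now = datetime(2026, 5, 14, 22, 0, tzinfo=timezone.utc)
firing = build_alert(
    {"alertname": "HighLatency", "severity": "critical", "service": "payment-service"},
    {"summary": "p99 latency above threshold"},
    starts_at=now,
)
resolved = build_alert(
    firing["labels"], firing["annotations"], now, ends_at=now + timedelta(minutes=10)
)
body = json.dumps([firing, resolved])  # the endpoint expects a JSON array
```

The labels identify the alert across both states; only the end time distinguishes the resolved copy from the firing one.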
Routing Tree
The routing tree is Alertmanager's core configuration. Routes form a tree in which alerts are matched against label matchers starting from the root. Among siblings, the first matching route wins (unless continue: true is set), and the deepest matching route determines the receiver. If no child matches, the parent route's receiver handles the alert.
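The traversal can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's code: it supports equality matchers only, and the Route class and resolve function are names invented for illustration. The tree mirrors the routes in the configuration shown in the next section.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label name -> required value
    cont: bool = False                          # stands in for `continue: true`
    routes: list = field(default_factory=list)  # child routes, in config order


def matches(route, labels):
    return all(labels.get(k) == v for k, v in route.match.items())


def resolve(route, labels):
    """Return the receivers of the deepest matching routes, depth-first."""
    if not matches(route, labels):
        return []
    found = []
    for child in route.routes:
        hit = resolve(child, labels)
        found.extend(hit)
        if hit and not child.cont:
            break  # first matching sibling wins unless it sets continue
    return found or [route.receiver]  # no child matched: the parent handles it


root = Route("email-ops-team", routes=[
    Route("pagerduty-oncall", {"severity": "critical"}),
    Route("slack-alerts-urgent", {"severity": "high"}),
    Route("slack-platform", {"team": "platform"},
          routes=[Route("pagerduty-platform", {"severity": "critical"})]),
    Route("null", {"alertname": "Watchdog"}),
])

print(resolve(root, {"team": "platform", "severity": "warning"}))
print(resolve(root, {"team": "platform", "severity": "critical"}))
print(resolve(root, {"severity": "info"}))
```

In this model a critical platform alert resolves to pagerduty-oncall, because the broader severity route is listed before the team route. That is the ordering effect to watch for when arranging sibling routes.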
Complete Routing Configuration
# alertmanager.yml - Complete routing configuration
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  # Default receiver if no child routes match
  receiver: 'email-ops-team'
  # Group alerts by these labels
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  # Child routes (evaluated in order, first match wins)
  routes:
    # Critical alerts → PagerDuty (immediate escalation)
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'
      group_wait: 10s
      repeat_interval: 1h
      continue: false
    # High severity → Slack urgent channel
    - match:
        severity: high
      receiver: 'slack-alerts-urgent'
      group_wait: 30s
      repeat_interval: 2h
    # Platform team routing (nested)
    - match:
        team: platform
      receiver: 'slack-platform'
      routes:
        # Platform critical → Platform PagerDuty
        - match:
            severity: critical
          receiver: 'pagerduty-platform'
          group_wait: 10s
    # Watchdog / DeadMansSwitch (should always be firing)
    - match:
        alertname: Watchdog
      receiver: 'null'
      repeat_interval: 24h

receivers:
  - name: 'email-ops-team'
    email_configs:
      - to: 'ops-team@company.com'
        send_resolved: true
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - service_key_file: '/etc/alertmanager/secrets/pagerduty-key'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
  - name: 'pagerduty-platform'
    pagerduty_configs:
      - service_key_file: '/etc/alertmanager/secrets/pagerduty-platform-key'
        severity: '{{ .CommonLabels.severity }}'
  - name: 'slack-alerts-urgent'
    slack_configs:
      - channel: '#alerts-urgent'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'
        send_resolved: true
  - name: 'null'
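Place more specific routes (e.g., team=platform + severity=critical) before broader matches. In the configuration above, the severity: critical route is listed first, so a critical platform alert goes to pagerduty-oncall and never reaches pagerduty-platform. Use continue: true on a route to allow the alert to also match subsequent sibling routes, which is useful for sending to both PagerDuty and Slack simultaneously.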
Grouping & Deduplication
Grouping is Alertmanager's primary mechanism for reducing notification volume. When multiple alerts share the same group_by label values, they are batched into a single notification. This prevents receiving 100 individual pod alerts when a node goes down — you get one grouped notification instead.
| Parameter | Purpose | Default | Recommended |
|---|---|---|---|
| group_by | Labels used to group alerts into batches | none (all alerts on a route share one group) | ['alertname', 'cluster', 'service'] |
| group_wait | How long to buffer alerts before sending the first notification for a new group | 30s | 30s (critical: 10s) |
| group_interval | How long to wait before sending notifications about new alerts added to an existing group | 5m | 5m |
| repeat_interval | How long to wait before re-sending a notification for an alert that is still firing (and unchanged) | 4h | Critical: 1h, Warning: 4h, Info: 12h |
Grouping Behavior Explained
group_wait (30s) is the initial delay for a brand new group. When the first alert arrives for a group that doesn't exist yet, Alertmanager waits this duration to collect other alerts that might belong to the same group before sending the first notification.
group_interval (5m) applies to existing groups. If new alerts are added to an already-notified group, Alertmanager waits this duration before sending an updated notification with the new alerts included. This prevents notification spam when alerts trickle in one by one.
Practical example: A node failure triggers 50 pod alerts over 2 minutes. With group_wait: 30s, the first notification goes out after 30s with the ~15 alerts received so far. With group_interval: 5m, an updated notification goes out 5 minutes later; it lists every alert still firing in the group, including the 35 that arrived after the first notification.
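The timing in this example can be checked with a small simulation. It is a simplified model that assumes the 50 alerts arrive evenly spaced over the 2 minutes and that a notification is sent only when the group has grown since the last one; flush_times is a hypothetical helper, and repeat_interval is ignored.

```python
def flush_times(arrivals, group_wait, group_interval):
    """Return (time, alerts_in_group) for each notification of one group.

    arrivals: sorted arrival times in seconds for alerts in the same group.
    """
    flushes = []
    t = arrivals[0] + group_wait  # the first notification waits group_wait
    notified = 0
    while notified < len(arrivals):
        present = sum(1 for a in arrivals if a <= t)
        if present > notified:    # only notify when the group changed
            flushes.append((t, present))
            notified = present
        t += group_interval       # later notifications wait group_interval
    return flushes


# 50 alerts spread evenly across 120 seconds
arrivals = [i * 120 / 50 for i in range(50)]
print(flush_times(arrivals, group_wait=30, group_interval=300))
# → [(30.0, 13), (330.0, 50)]
```

With even spacing the first notification carries 13 alerts, close to the rough figure quoted above; a burstier arrival pattern would put more alerts into that first batch. The second count is the size of the whole group at that point, not just the newcomers.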
Deduplication is automatic — if the same alert (identical labels) is received multiple times within the same group period, only one notification is sent. Alertmanager uses the alert's fingerprint (hash of all labels) for deduplication.
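The difference between an alert's identity and its group can be sketched as follows. This is an illustration only: SHA-256 is swapped in for the hash Alertmanager actually uses, and fingerprint and group_key are hypothetical helper names.

```python
import hashlib


def fingerprint(labels):
    """Identity of one alert: a hash over all label pairs, order-independent."""
    canonical = "\x00".join(f"{k}\x00{v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def group_key(labels, group_by):
    """Identity of the group: only the group_by label values matter."""
    return tuple((name, labels.get(name, "")) for name in group_by)


group_by = ["alertname", "cluster", "service"]
pod_a = {"alertname": "PodCrashLooping", "cluster": "production",
         "service": "payment-service", "pod": "payment-7f9c-a"}
pod_b = dict(pod_a, pod="payment-7f9c-b")

# Different pods are distinct alerts...
print(fingerprint(pod_a) == fingerprint(pod_b))                  # → False
# ...but they land in the same group, hence the same notification.
print(group_key(pod_a, group_by) == group_key(pod_b, group_by))  # → True
```

Annotations are deliberately absent from the fingerprint input: changing a description does not create a new alert, whereas changing any label does.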
Inhibition Rules
Inhibition rules suppress notifications for alerts when other "source" alerts are already firing. The classic use case: if a cluster is completely down (critical alert), suppress all individual service alerts for that cluster — the engineer only needs to see the root cause.
# alertmanager.yml - Inhibition rules
inhibit_rules:
  # Critical alerts inhibit warnings for the same service
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    # Only inhibit if these labels match between source and target
    equal: ['alertname', 'cluster', 'service']
  # Cluster-level alerts inhibit pod-level alerts
  - source_matchers:
      - alertname = KubeNodeNotReady
    target_matchers:
      - severity =~ "warning|high"
    equal: ['cluster', 'node']
  # Infrastructure down inhibits application alerts
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - severity =~ ".*"
      - alertname != ClusterDown
    equal: ['cluster']
  # Database down inhibits query timeout alerts
  - source_matchers:
      - alertname = DatabaseDown
    target_matchers:
      - alertname =~ "QueryTimeout|ConnectionPoolExhausted"
    equal: ['cluster', 'database']
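How a single rule is evaluated can be sketched as below. It is a simplified model under stated assumptions: matchers are equality-only (no regex or negation), and is_inhibited plus the sample alerts are hypothetical. A label missing on both sides is treated as equal, which matches how Alertmanager compares equal labels.

```python
def side_matches(alert, matchers):
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if a firing source alert suppresses `target` under `rule`."""
    if not side_matches(target, rule["target"]):
        return False
    for source in firing:
        if source is target or not side_matches(source, rule["source"]):
            continue
        # Every `equal` label must carry the same value on both alerts.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source": {"severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["alertname", "cluster", "service"],
}
critical = {"alertname": "HighLatency", "severity": "critical",
            "cluster": "production", "service": "payment-service"}
same_scope = dict(critical, severity="warning")
other_cluster = dict(same_scope, cluster="staging")

print(is_inhibited(same_scope, [critical], rule))     # → True
print(is_inhibited(other_cluster, [critical], rule))  # → False
```

Emptying the equal list in this sketch makes the second call return True as well, which is the cross-cluster suppression problem that scoping with equal avoids.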
Common Pitfalls
Circular inhibition: If alert A inhibits B, and B inhibits A, both can end up suppressed. Always ensure inhibition flows in one direction (higher severity → lower severity).
Missing equal labels: Without equal constraints, a critical alert in cluster-A would suppress warnings in cluster-B. Always scope inhibitions with the appropriate equal labels.
Over-broad target_matchers: Using severity =~ ".*" as a target matcher will suppress ALL alerts including other critical alerts. Be specific about what should be inhibited.
Testing tip: Use amtool config routes test to check routing, and amtool alert query --inhibited to list the alerts that are currently inhibited, before relying on a rule in production.
Silences
Silences mute notifications for alerts matching specific label matchers during a defined time window. Unlike inhibition (which is automatic and rule-based), silences are manually created for known situations like planned maintenance or acknowledged issues being actively worked.
Creating Silences via API
# Create a silence via Alertmanager API
# Mute all alerts for payment-service during maintenance window
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "service",
        "value": "payment-service",
        "isRegex": false
      },
      {
        "name": "cluster",
        "value": "production",
        "isRegex": false
      }
    ],
    "startsAt": "2026-05-14T22:00:00Z",
    "endsAt": "2026-05-15T02:00:00Z",
    "createdBy": "wasil.zafar",
    "comment": "Planned maintenance: payment-service DB migration"
  }'
# List active silences
curl -s http://alertmanager:9093/api/v2/silences | jq '.[] | select(.status.state=="active")'
# Delete (expire) a silence by ID
curl -X DELETE http://alertmanager:9093/api/v2/silence/silence-uuid-here
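When silences are created from scripts, computing the window in code avoids hand-edited timestamps. The sketch below builds the same request body as the curl example; silence_payload, the created_by value, and the four-hour window are illustrative assumptions, and the actual POST is left out.

```python
import json
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def silence_payload(matchers, start, hours, created_by, comment):
    """Build the JSON body for POST /api/v2/silences with exact-match matchers."""
    start = start.astimezone(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.strftime(TIME_FORMAT),
        "endsAt": end.strftime(TIME_FORMAT),  # always set an explicit expiry
        "createdBy": created_by,
        "comment": comment,
    }


payload = silence_payload(
    {"service": "payment-service", "cluster": "production"},
    start=datetime(2026, 5, 14, 22, 0, tzinfo=timezone.utc),
    hours=4,
    created_by="oncall-engineer",  # hypothetical user name
    comment="Planned maintenance: payment-service DB migration",
)
print(json.dumps(payload, indent=2))
```

Requiring the duration as an argument means an expiry is always present, which fits the guidance on silences in the table below.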
| Scenario | Use Silence? | Rationale |
|---|---|---|
| Planned maintenance window | Yes | Expected downtime — silence with clear comment and expiry time |
| Known issue, fix in progress | Yes | Prevents alert fatigue while actively working the incident (set short expiry) |
| Noisy alert needing tuning | Temporary | Silence briefly, but create a ticket to fix the alert threshold — don't leave silences indefinitely |
| Hiding persistent production issues | No | Silences mask real problems — fix the root cause or adjust the alert rule instead |
| Broad regex matching many services | No | Over-broad silences hide real issues — be as specific as possible with label matchers |
| Indefinite / no-expiry silence | No | Every silence must have an expiry. "Permanent" silences indicate broken alert rules that should be deleted or fixed |
Notification Templates
Alertmanager uses Go's text/template language for formatting notifications. Custom templates turn raw alert data into actionable messages with context (severity badges, runbook links, dashboard URLs, and relevant labels) so on-call engineers can triage without opening the Alertmanager UI.
Slack Notification Template
# /etc/alertmanager/templates/slack.tmpl
{{ define "slack.custom.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{- end }}
{{ define "slack.custom.text" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ if eq .Labels.severity "critical" }}🔴{{ else if eq .Labels.severity "high" }}🟠{{ else }}🟡{{ end }} {{ .Labels.severity | toUpper }}
*Cluster:* {{ .Labels.cluster }}
*Service:* {{ .Labels.service }}
*Description:* {{ .Annotations.description }}
{{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View Runbook>{{ end }}
{{ if .Annotations.dashboard_url }}*Dashboard:* <{{ .Annotations.dashboard_url }}|View Dashboard>{{ end }}
*Source:* <{{ .GeneratorURL }}|Prometheus Query>
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
---
{{ end }}
{{- end }}
{{ define "slack.custom.color" -}}
{{ if eq .Status "firing" -}}
{{ if eq .CommonLabels.severity "critical" -}}danger{{ else -}}warning{{ end -}}
{{ else -}}good{{ end -}}
{{- end }}
# Reference the template in alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
        color: '{{ template "slack.custom.color" . }}'
        actions:
          - type: button
            text: 'Silence 🔇'
            # .ExternalURL is the Alertmanager base URL; label values are URL-escaped
            url: '{{ .ExternalURL }}/#/silences/new?filter=%7B{{ range .CommonLabels.SortedPairs }}{{ .Name }}%3D%22{{ .Value | urlquery }}%22%2C{{ end }}%7D'
          - type: button
            text: 'Dashboard 📊'
            url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
Use amtool template render to test templates locally without sending real notifications. Feed it a sample alert JSON and verify the rendered output before deploying.
High Availability
A single Alertmanager instance is a single point of failure: if it goes down, no notifications are sent. Alertmanager supports native clustering using a gossip protocol (based on HashiCorp's memberlist library) to replicate notification state and silences across instances. All instances receive alerts from Prometheus and coordinate so that, in normal operation, only one instance sends each notification.
Cluster Configuration
# docker-compose.yml - 3-node Alertmanager cluster
services:
  alertmanager-1:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
      - '--cluster.settle-timeout=60s'
    ports:
      - "9093:9093"
      - "9094:9094"
  alertmanager-2:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'
  alertmanager-3:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'

# Prometheus configuration (send to all instances)
# prometheus.yml:
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets:
#             - alertmanager-1:9093
#             - alertmanager-2:9093
#             - alertmanager-3:9093
| Aspect | Single Instance | 3-Node Cluster |
|---|---|---|
| Availability | SPOF — no notifications during downtime or restarts | Tolerates 1 node failure without notification loss |
| Deduplication | Local only | Cluster-wide via gossip (prevents duplicate notifications) |
| Silences | Local state, lost on restart without persistence | Replicated across all peers via gossip |
| Notification state | Single state machine | Shared via CRDT-based state merge |
| Operational complexity | Minimal — single process | Moderate — config sync, network partitions, settle timeouts |
| Resource usage | ~50MB RAM | ~50MB RAM × 3 + gossip traffic (~1KB/s per peer) |
| Recommended for | Dev/staging, non-critical workloads | Production, SLO-bound services, regulated environments |
Point Prometheus at every Alertmanager instance in alerting.alertmanagers rather than at a load balancer. The cluster gossip handles deduplication: each instance receives the alert, but only one sends the notification. This ensures no alert is lost if one instance is unreachable.
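One way to picture the coordination is as a staggered wait: each peer delays its send in proportion to its position in the cluster and skips the send if an earlier peer has already logged it. The sketch below is a deliberately simplified model of that idea. The 15-second step is assumed to correspond to the default peer timeout, positions are assumed to stay fixed during the outage, and deliver is a hypothetical helper.

```python
def deliver(peers, down, peer_timeout=15.0):
    """Return (sender, delay_seconds) for one notification.

    peers: instance names ordered by cluster position.
    down:  set of instances that are currently unreachable.
    """
    for position, name in enumerate(peers):
        if name not in down:
            # No earlier peer logged the send, so this one proceeds
            # once its own staggered wait has elapsed.
            return name, position * peer_timeout
    return None, None  # every instance is down: nothing is delivered


peers = ["alertmanager-1", "alertmanager-2", "alertmanager-3"]
print(deliver(peers, down=set()))               # → ('alertmanager-1', 0.0)
print(deliver(peers, down={"alertmanager-1"}))  # → ('alertmanager-2', 15.0)
```

The cost of a failed first peer in this model is a short delay rather than a lost notification, which is why every instance needs to receive every alert.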
Production Checklist
Alertmanager Production Deployment Checklist
- Deploy in HA cluster: run 3 Alertmanager instances with --cluster.peer flags and configure Prometheus to send to all instances directly; keep any load balancer for UI and API users only, not for Prometheus-to-Alertmanager traffic
- Configure resolve_timeout: set global resolve_timeout: 5m so that alerts arriving without an end time resolve once updates stop; Prometheus sets an end time on every alert it sends, so this mainly affects other alert sources
- Validate routing with amtool: run amtool config routes test --config.file=alertmanager.yml in CI to verify alert routing before deployment
- Set up a DeadMansSwitch: configure a Watchdog alert that always fires and route it to a heartbeat receiver (PagerDuty, Healthchecks.io) to detect Alertmanager/Prometheus failures
- Monitor Alertmanager metrics: alert on alertmanager_notifications_failed_total increasing, alertmanager_cluster_members dropping, and alertmanager_alerts_invalid_total increasing
- Store secrets securely: use service_key_file and api_key_file for receiver credentials rather than inline plaintext; mount from Kubernetes secrets or Vault
- Implement notification templates: customize Slack/PagerDuty templates to include runbook links, dashboard URLs, and severity badges for faster triage
- Audit silences regularly: schedule weekly reviews of active silences; set a maximum silence duration policy (e.g., 7 days) and require Jira ticket references in silence comments
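The silence audit in the last item lends itself to automation. The sketch below checks silences, in the JSON shape returned by GET /api/v2/silences, against a maximum duration and a ticket reference in the comment. The 7-day limit, the ticket pattern, and the audit function are assumptions for illustration; fetching the list from the API is left out.

```python
import re
from datetime import datetime, timedelta


def parse_time(value):
    """Parse an RFC 3339 timestamp such as 2026-05-14T22:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def audit(silences, max_days=7, ticket_pattern=r"\b[A-Z]+-\d+\b"):
    """Return (silence_id, problem) pairs for active silences that break policy."""
    findings = []
    for silence in silences:
        if silence["status"]["state"] != "active":
            continue
        duration = parse_time(silence["endsAt"]) - parse_time(silence["startsAt"])
        if duration > timedelta(days=max_days):
            findings.append((silence["id"], "exceeds maximum duration"))
        if not re.search(ticket_pattern, silence["comment"]):
            findings.append((silence["id"], "no ticket reference"))
    return findings


sample = [
    {"id": "a1", "status": {"state": "active"},
     "startsAt": "2026-05-14T22:00:00Z", "endsAt": "2026-05-15T02:00:00Z",
     "comment": "OPS-123 payment-service DB migration"},
    {"id": "b2", "status": {"state": "active"},
     "startsAt": "2026-05-01T00:00:00Z", "endsAt": "2026-06-01T00:00:00Z",
     "comment": "noisy alert"},
]
print(audit(sample))
# → [('b2', 'exceeds maximum duration'), ('b2', 'no ticket reference')]
```

Run on a schedule, a check like this turns the weekly review into a list of exceptions to follow up on.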