Tool Deep Dive: Alertmanager Complete Guide

Alertmanager Architecture

Alertmanager sits between Prometheus (which evaluates alert rules) and your notification channels (PagerDuty, Slack, email, webhooks). It addresses the main causes of alert fatigue: deduplication prevents the same alert from producing repeated notifications, grouping batches related alerts into a single notification, and inhibition suppresses lower-priority alerts when a root cause is already firing.

Alertmanager Architecture — Alert Processing Pipeline

flowchart TD
    A[Prometheus<br/>Alert Rules] -->|HTTP POST /api/v2/alerts| B[Alertmanager API<br/>Receives Alerts]
    B --> C[Dispatcher<br/>Route Matching]
    C --> D[Routing Tree<br/>Label Matchers]
    D --> E[Grouping<br/>group_by labels]
    E --> F[Notification Pipeline<br/>Dedup + Throttle]
    F --> G[Receivers]
    G --> H[PagerDuty]
    G --> I[Slack]
    G --> J[Email]
    G --> K[Webhook]

    L[Inhibition Rules] -->|Suppress| F
    M[Silences] -->|Mute| F

Alert Lifecycle

Firing: Prometheus evaluates an alert rule expression and it returns results. After the optional for duration (the pending period), the alert transitions to firing and Prometheus sends it to Alertmanager via HTTP POST.

Routing: The alert traverses the routing tree. Each route has label matchers; at each level the alert descends into the first matching child route, and the deepest match determines the receiver. Child routes inherit parent settings unless explicitly overridden.

Grouping: Within the matched route, Alertmanager groups the alert with other alerts that share the same values for the route's group_by labels. A new group waits for group_wait before sending its first notification.

Notification: The notification pipeline checks inhibitions and silences. If the alert is not suppressed, it is formatted using the configured template and delivered to the receiver.

Resolved: When the rule expression stops matching, Prometheus sends the alert with an endsAt time that has already passed, and Alertmanager marks it resolved. Receivers configured with send_resolved: true then get a resolution notification.
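To make the lifecycle concrete, here is a minimal Python sketch of the alert objects a client would POST to /api/v2/alerts, and of how an endsAt timestamp separates a firing alert from a resolved one. The helper names (build_alert, is_resolved) and the HighErrorRate alert are illustrative assumptions, not part of Alertmanager.

```python
import json
from datetime import datetime, timedelta, timezone


def build_alert(labels, annotations, starts_at, ends_at=None, generator_url=None):
    """Build one alert object in the shape posted to /api/v2/alerts (hypothetical helper)."""
    alert = {
        "labels": dict(labels),
        "annotations": dict(annotations),
        "startsAt": starts_at.isoformat(),  # RFC 3339 timestamp
    }
    if ends_at is not None:
        alert["endsAt"] = ends_at.isoformat()
    if generator_url:
        alert["generatorURL"] = generator_url
    return alert


def is_resolved(alert, now):
    """An alert counts as resolved once its endsAt lies in the past."""
    ends_at = alert.get("endsAt")
    return ends_at is not None and datetime.fromisoformat(ends_at) <= now


started = datetime(2026, 5, 14, 22, 0, tzinfo=timezone.utc)
labels = {"alertname": "HighErrorRate", "severity": "critical"}
firing = build_alert(labels, {"summary": "5xx rate above threshold"}, started)
resolved = build_alert(labels, {"summary": "5xx rate above threshold"}, started,
                       ends_at=started + timedelta(minutes=12))

# The request body is a JSON array of alert objects.
body = json.dumps([firing, resolved])
```

The firing alert carries no endsAt here; alerts sent that way are the case resolve_timeout exists for.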

Routing Tree

The routing tree is Alertmanager's core configuration. Routes form a tree: an alert enters at the root and descends into the first child whose label matchers fit, repeating until no deeper child matches. The deepest matching route determines the receiver. If no child matches, the parent route's receiver handles the alert.

Complete Routing Configuration

# alertmanager.yml - Complete routing configuration
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  # Default receiver if no child routes match
  receiver: 'email-ops-team'
  # Group alerts by these labels
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  # Child routes are evaluated in order; the first match wins unless
  # the matching route sets continue: true
  routes:
    # Watchdog / DeadMansSwitch (should always be firing).
    # Routed to 'null' here; point it at a heartbeat receiver to use it
    # as a dead man's switch.
    - matchers:
        - alertname = Watchdog
      receiver: 'null'
      repeat_interval: 24h

    # Platform team routing (nested). Listed before the generic severity
    # routes so platform alerts are not captured by them first.
    - matchers:
        - team = platform
      receiver: 'slack-platform'
      routes:
        # Platform critical → Platform PagerDuty
        - matchers:
            - severity = critical
          receiver: 'pagerduty-platform'
          group_wait: 10s

    # Critical alerts → PagerDuty (immediate escalation)
    - matchers:
        - severity = critical
      receiver: 'pagerduty-oncall'
      group_wait: 10s
      repeat_interval: 1h
      continue: false

    # High severity → Slack urgent channel
    - matchers:
        - severity = high
      receiver: 'slack-alerts-urgent'
      group_wait: 30s
      repeat_interval: 2h

receivers:
  - name: 'email-ops-team'
    email_configs:
      - to: 'ops-team@company.com'
        send_resolved: true

  - name: 'pagerduty-oncall'
    pagerduty_configs:
      # routing_key_file targets the Events API v2 endpoint set in pagerduty_url;
      # keep service_key_file instead if the PagerDuty integration type is "Prometheus"
      - routing_key_file: '/etc/alertmanager/secrets/pagerduty-key'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'

  - name: 'pagerduty-platform'
    pagerduty_configs:
      - routing_key_file: '/etc/alertmanager/secrets/pagerduty-platform-key'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-alerts-urgent'
    slack_configs:
      - channel: '#alerts-urgent'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'
        send_resolved: true

  - name: 'null'

                            
Route matching is top-down, first-match-wins. Place more specific routes (e.g., team=platform + severity=critical) before broader matches. Use continue: true on a route to allow the alert to also match subsequent sibling routes, which is useful for sending to both PagerDuty and Slack simultaneously.
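The matching rules can be sketched in a few lines of Python. This is a simplified model that assumes equality matchers only; the dictionary layout and the resolve_receivers name are illustrative, not Alertmanager's own code.

```python
def route_matches(route, labels):
    """True if every equality matcher on the route fits the alert's labels."""
    return all(labels.get(name) == value for name, value in route.get("match", {}).items())


def resolve_receivers(route, labels):
    """Walk the tree depth-first and return the receivers that would be notified."""
    receivers = []
    for child in route.get("routes", []):
        if not route_matches(child, labels):
            continue
        # A child without its own receiver inherits the parent's.
        effective = {"receiver": route["receiver"], **child}
        receivers.extend(resolve_receivers(effective, labels))
        if not child.get("continue", False):
            break  # first match wins unless continue is set
    if not receivers:
        receivers.append(route["receiver"])  # no child matched: this route handles it
    return receivers


tree = {
    "receiver": "email-ops-team",
    "routes": [
        {"match": {"team": "platform"}, "receiver": "slack-platform",
         "routes": [{"match": {"severity": "critical"}, "receiver": "pagerduty-platform"}]},
        {"match": {"severity": "critical"}, "receiver": "pagerduty-oncall", "continue": True},
        {"match": {"severity": "critical"}, "receiver": "slack-alerts-urgent"},
    ],
}
```

With this tree, a critical alert without a team label reaches both pagerduty-oncall and slack-alerts-urgent because of continue: true, while a critical platform alert stops at pagerduty-platform.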
                        

Grouping & Deduplication

Grouping is Alertmanager's primary mechanism for reducing notification volume. When multiple alerts share the same group_by label values, they are batched into a single notification. This prevents receiving 100 individual pod alerts when a node goes down — you get one grouped notification instead.

Parameter	Purpose	Default	Recommended
`group_by`	Labels used to group alerts into batches	None (all alerts matching a route form one group)	`['alertname', 'cluster', 'service']`
`group_wait`	How long to buffer alerts before sending the first notification for a new group	`30s`	`30s` (critical: `10s`)
`group_interval`	How long to wait before sending notifications about new alerts added to an existing group	`5m`	`5m`
`repeat_interval`	How long to wait before re-sending a notification for an alert that is still firing (and unchanged)	`4h`	Critical: `1h`, Warning: `4h`, Info: `12h`

Grouping Behavior Explained

                            
group_wait vs group_interval: the key difference

group_wait (30s) is the initial delay for a brand-new group. When the first alert arrives for a group that doesn't exist yet, Alertmanager waits this duration to collect other alerts that might belong to the same group before sending the first notification.

group_interval (5m) applies to existing groups. If new alerts are added to an already-notified group, Alertmanager waits this duration before sending an updated notification with the new alerts included. This prevents notification spam when alerts trickle in one by one.

Practical example: A node failure triggers 50 pod alerts over 2 minutes. With group_wait: 30s, the first notification goes out 30s after the first alert and lists whatever has arrived by then (about a quarter of the alerts if they arrive evenly). With group_interval: 5m, an updated notification that includes the remaining alerts goes out 5 minutes after the first one.
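The timing in the practical example can be reproduced with a small model. This is a sketch that assumes alerts arrive at an even rate and that a flush only happens when something new is pending; notification_times is a made-up helper, and the counts are alerts that are new since the previous flush.

```python
def notification_times(arrivals, group_wait=30.0, group_interval=300.0):
    """Return [(flush_time_seconds, new_alerts_in_that_flush)] for a single group."""
    arrivals = sorted(arrivals)
    if not arrivals:
        return []
    flushes = []
    next_flush = arrivals[0] + group_wait  # first notification waits group_wait
    i = 0
    while i < len(arrivals):
        # Skip interval ticks during which nothing new has arrived.
        while arrivals[i] > next_flush:
            next_flush += group_interval
        batch = 0
        while i < len(arrivals) and arrivals[i] <= next_flush:
            batch += 1
            i += 1
        flushes.append((next_flush, batch))
        next_flush += group_interval  # later notifications wait group_interval
    return flushes


# 50 alerts spread evenly over 2 minutes (one every 2.4 seconds).
arrivals = [i * 2.4 for i in range(50)]
schedule = notification_times(arrivals)
# schedule == [(30.0, 13), (330.0, 37)]
```

Under these assumptions the first notification carries 13 alerts at t=30s and the update at t=330s adds the other 37.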

Deduplication is automatic. Alertmanager identifies each alert by its fingerprint (a hash of its full label set), so when Prometheus re-sends an alert that is still firing, the existing alert is updated rather than treated as a new one, and no extra notification is triggered.
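A fingerprint-keyed store is enough to illustrate the idea. The sketch below uses SHA-256 over the sorted label pairs purely for illustration; Alertmanager's actual hash function differs, and AlertStore is a hypothetical class.

```python
import hashlib


def fingerprint(labels):
    """Hash the sorted label pairs; annotations are deliberately excluded."""
    canonical = "\x00".join(f"{name}\x00{value}" for name, value in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class AlertStore:
    """Keeps one entry per fingerprint, so re-sent alerts overwrite instead of piling up."""

    def __init__(self):
        self.alerts = {}

    def put(self, labels, annotations):
        """Insert or update an alert; return True only if it was not known before."""
        key = fingerprint(labels)
        is_new = key not in self.alerts
        self.alerts[key] = {"labels": dict(labels), "annotations": dict(annotations)}
        return is_new
```

Label order does not matter and annotations are ignored, so a re-sent alert with an updated summary still maps to the same entry.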

Inhibition Rules

Inhibition rules suppress notifications for alerts when other "source" alerts are already firing. The classic use case: if a cluster is completely down (critical alert), suppress all individual service alerts for that cluster — the engineer only needs to see the root cause.

# alertmanager.yml - Inhibition rules
inhibit_rules:
  # Critical alerts inhibit warnings for the same service
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    # Only inhibit if these labels match between source and target
    equal: ['alertname', 'cluster', 'service']

  # Cluster-level alerts inhibit pod-level alerts
  - source_matchers:
      - alertname = KubeNodeNotReady
    target_matchers:
      - severity =~ "warning|high"
    equal: ['cluster', 'node']

  # Infrastructure down inhibits application alerts
  - source_matchers:
      - alertname = ClusterDown
    target_matchers:
      - severity =~ ".*"
      - alertname != ClusterDown
    equal: ['cluster']

  # Database down inhibits query timeout alerts
  - source_matchers:
      - alertname = DatabaseDown
    target_matchers:
      - alertname =~ "QueryTimeout|ConnectionPoolExhausted"
    equal: ['cluster', 'database']
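How a single rule is evaluated can be modelled in Python. This is a simplified sketch, not Alertmanager's implementation: matchers are written as (name, operator, value) tuples, regexes are treated as fully anchored, a missing label counts as an empty string, and self-inhibition is avoided with a plain identity check.

```python
import re


def matches(matcher, labels):
    """Evaluate one (name, operator, value) matcher against a label set."""
    name, op, value = matcher
    actual = labels.get(name, "")  # a missing label is treated as empty
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def is_inhibited(target, firing, rule):
    """True if some other firing alert satisfies the rule's source side for this target."""
    if not all(matches(m, target) for m in rule["target_matchers"]):
        return False
    for source in firing:
        if source is target:
            continue
        if not all(matches(m, source) for m in rule["source_matchers"]):
            continue
        if all(source.get(label, "") == target.get(label, "") for label in rule["equal"]):
            return True
    return False


cluster_down_rule = {
    "source_matchers": [("alertname", "=", "ClusterDown")],
    "target_matchers": [("severity", "=~", ".*"), ("alertname", "!=", "ClusterDown")],
    "equal": ["cluster"],
}
down_a = {"alertname": "ClusterDown", "cluster": "a", "severity": "critical"}
latency_a = {"alertname": "HighLatency", "cluster": "a", "severity": "warning"}
latency_b = {"alertname": "HighLatency", "cluster": "b", "severity": "warning"}
firing = [down_a, latency_a, latency_b]
```

Here latency_a is suppressed and latency_b is not; dropping cluster from equal would suppress both, which shows why the equal list matters.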

Common Pitfalls

                            
Warning: Inhibition loops and over-suppression

Circular inhibition: If alert A inhibits B, and B inhibits A, both can end up suppressed. Always ensure inhibition flows in one direction (higher severity → lower severity).

Missing equal labels: Without equal constraints, a critical alert in cluster-A would suppress warnings in cluster-B. Always scope inhibitions with the appropriate equal labels.

Over-broad target_matchers: Using severity =~ ".*" as the only target matcher suppresses every alert that shares the equal labels, including other critical alerts. Combine it with narrower matchers (the ClusterDown rule above adds alertname != ClusterDown and scopes by cluster) or list the severities to inhibit explicitly.

Testing tip: amtool config routes test checks routing only, so verify inhibition on a staging instance: fire the source and target alerts, then run amtool alert query --inhibited to confirm which alerts were suppressed before deploying to production.

Silences

Silences mute notifications for alerts matching specific label matchers during a defined time window. Unlike inhibition (which is automatic and rule-based), silences are manually created for known situations like planned maintenance or acknowledged issues being actively worked.

Creating Silences via API

# Create a silence via Alertmanager API
# Mute all alerts for payment-service during maintenance window
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "service",
        "value": "payment-service",
        "isRegex": false
      },
      {
        "name": "cluster",
        "value": "production",
        "isRegex": false
      }
    ],
    "startsAt": "2026-05-14T22:00:00Z",
    "endsAt": "2026-05-15T02:00:00Z",
    "createdBy": "wasil.zafar",
    "comment": "Planned maintenance: payment-service DB migration"
  }'

# List active silences
curl -s http://alertmanager:9093/api/v2/silences | jq '.[] | select(.status.state=="active")'

# Delete (expire) a silence by ID
curl -X DELETE http://alertmanager:9093/api/v2/silence/silence-uuid-here

Scenario	Use Silence?	Rationale
Planned maintenance window	Yes	Expected downtime — silence with clear comment and expiry time
Known issue, fix in progress	Yes	Prevents alert fatigue while actively working the incident (set short expiry)
Noisy alert needing tuning	Temporary	Silence briefly, but create a ticket to fix the alert threshold — don't leave silences indefinitely
Hiding persistent production issues	No	Silences mask real problems — fix the root cause or adjust the alert rule instead
Broad regex matching many services	No	Over-broad silences hide real issues — be as specific as possible with label matchers
Indefinite / no-expiry silence	No	Every silence must have an expiry. "Permanent" silences indicate broken alert rules that should be deleted or fixed
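The rules in this table lend themselves to an automated check. The following sketch audits silence objects in the shape returned by GET /api/v2/silences; the 7-day limit, the ticket-key pattern, and the audit_silences name are assumptions chosen for illustration.

```python
import re
from datetime import datetime, timedelta

MAX_DURATION = timedelta(days=7)                        # hypothetical policy limit
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # e.g. OPS-1234 (assumed format)


def parse_ts(value):
    """Parse an RFC 3339 timestamp; older Python versions reject a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def audit_silences(silences):
    """Return [(silence_id, [problems])] for active silences that break the policy."""
    findings = []
    for silence in silences:
        if silence.get("status", {}).get("state") != "active":
            continue
        problems = []
        if parse_ts(silence["endsAt"]) - parse_ts(silence["startsAt"]) > MAX_DURATION:
            problems.append("exceeds maximum duration")
        if not TICKET_PATTERN.search(silence.get("comment", "")):
            problems.append("no ticket reference in comment")
        if problems:
            findings.append((silence.get("id", "?"), problems))
    return findings
```

The input could come from the curl listing shown earlier, saved as JSON and loaded with json.load.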

Notification Templates

Alertmanager uses Go's text/template language for formatting notifications. Custom templates turn raw alert data into actionable messages with context (severity badges, runbook links, dashboard URLs, and relevant labels) so on-call engineers can triage without opening the Alertmanager UI.

Slack Notification Template

# /etc/alertmanager/templates/slack.tmpl
{{ define "slack.custom.title" -}}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{- end }}

{{ define "slack.custom.text" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ if eq .Labels.severity "critical" }}🔴{{ else if eq .Labels.severity "high" }}🟠{{ else }}🟡{{ end }} {{ .Labels.severity | toUpper }}
*Cluster:* {{ .Labels.cluster }}
*Service:* {{ .Labels.service }}
*Description:* {{ .Annotations.description }}
{{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View Runbook>{{ end }}
{{ if .Annotations.dashboard_url }}*Dashboard:* <{{ .Annotations.dashboard_url }}|View Dashboard>{{ end }}
*Source:* <{{ .GeneratorURL }}|Prometheus Query>
*Started:* {{ .StartsAt.UTC.Format "2006-01-02 15:04:05 MST" }}
---
{{ end }}
{{- end }}

{{ define "slack.custom.color" -}}
{{ if eq .Status "firing" -}}
  {{ if eq .CommonLabels.severity "critical" -}}danger{{ else -}}warning{{ end -}}
{{ else -}}good{{ end -}}
{{- end }}

# Reference the template in alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
        color: '{{ template "slack.custom.color" . }}'
        actions:
          - type: button
            text: 'Silence 🔇'
            url: '{{ .ExternalURL }}/#/silences/new?filter=%7B{{ range .CommonLabels.SortedPairs }}{{ .Name }}%3D%22{{ .Value }}%22%2C{{ end }}%7D'
          - type: button
            text: 'Dashboard 📊'
            url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
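The percent-encoded Silence link is easier to follow once the encoding is spelled out: %7B and %7D are the braces, %3D is =, %22 is a double quote, and %2C is a comma. A small Python sketch builds the same kind of URL; silence_filter_url is a hypothetical helper, and label values containing double quotes would need extra escaping that is omitted here.

```python
from urllib.parse import quote


def silence_filter_url(external_url, labels):
    """Build a pre-filled 'new silence' link for the given label set."""
    pairs = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    matcher = "{" + pairs + "}"
    # safe="" forces every reserved character, including '=' and ',', to be encoded.
    return f"{external_url.rstrip('/')}/#/silences/new?filter={quote(matcher, safe='')}"


url = silence_filter_url(
    "http://alertmanager:9093",
    {"alertname": "HighLatency", "cluster": "production"},
)
# url == "http://alertmanager:9093/#/silences/new?filter=%7Balertname%3D%22HighLatency%22%2Ccluster%3D%22production%22%7D"
```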

                            
Template debugging tip: Use amtool template render to test templates locally without sending real notifications. Feed it a sample alert JSON and verify the rendered output before deploying.
                        

High Availability

A single Alertmanager instance is a single point of failure: if it goes down, no notifications are sent. Alertmanager supports native clustering, using a gossip protocol (built on HashiCorp's memberlist library) to replicate silences and notification state across instances. All instances receive alerts from Prometheus and coordinate so that, in normal operation, only one of them sends each notification; the design favours an occasional duplicate over a missed notification.

Cluster Configuration

# docker-compose.yml - 3-node Alertmanager cluster
services:
  alertmanager-1:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
      - '--cluster.settle-timeout=60s'
    ports:
      - "9093:9093"
      - "9094:9094"

  alertmanager-2:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'

  alertmanager-3:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'

# Prometheus configuration (send to all instances)
# prometheus.yml:
#   alerting:
#     alertmanagers:
#       - static_configs:
#           - targets:
#               - alertmanager-1:9093
#               - alertmanager-2:9093
#               - alertmanager-3:9093

Aspect	Single Instance	3-Node Cluster
Availability	SPOF: no notifications during downtime or restarts	No quorum required: notifications continue while at least one instance is up and receiving alerts
Deduplication	Local only	Cluster-wide via gossip (prevents duplicate notifications)
Silences	Local state, lost on restart without persistence	Replicated across all peers via gossip
Notification state	Single state machine	Shared via CRDT-based state merge
Operational complexity	Minimal — single process	Moderate — config sync, network partitions, settle timeouts
Resource usage	~50MB RAM (rough figure, workload-dependent)	~50MB RAM × 3 + gossip traffic (~1KB/s per peer; rough figures, workload-dependent)
Recommended for	Dev/staging, non-critical workloads	Production, SLO-bound services, regulated environments

                            
Prometheus must send alerts to ALL Alertmanager instances. Configure every instance as a target in alerting.alertmanagers rather than a single load-balanced address. Each instance receives the alert, and the gossiped notification log lets the cluster avoid sending it more than once in normal operation. This ensures no alert is lost if one instance is unreachable.
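One way to picture the coordination is a staggered wait: each instance delays its send by its position in the cluster multiplied by the peer timeout (--cluster.peer-timeout, 15s by default), then skips the send if the shared notification log already has an entry. The Python below is a rough model under that assumption; it treats gossip as instantaneous, and simulate_cluster is a made-up name.

```python
def simulate_cluster(can_deliver, peer_timeout=15.0):
    """Return [(position, wait_seconds)] for the instances that actually send.

    can_deliver lists, in cluster position order, whether each instance is
    able to deliver the notification.
    """
    notified = False  # stands in for the gossiped notification log
    sends = []
    for position, healthy in enumerate(can_deliver):
        wait = position * peer_timeout  # later positions wait longer
        if notified:
            continue  # log entry found after waiting: nothing to send
        if healthy:
            sends.append((position, wait))
            notified = True
    return sends
```

In this model a healthy cluster notifies immediately from position 0, and each failed instance ahead of the sender adds one peer timeout of delay rather than losing the notification.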
                        

Production Checklist

Alertmanager Production Deployment Checklist

Deploy in HA cluster: run 3 Alertmanager instances with --cluster.peer flags and configure Prometheus to send to every instance directly. A load balancer is fine for UI and API access, but do not put one between Prometheus and Alertmanager
Configure resolve_timeout: keep global resolve_timeout: 5m so that alerts arriving without an endsAt timestamp are resolved once they stop being refreshed. Prometheus sets endsAt on every alert it sends, so this mainly affects other alert sources
Validate routing with amtool: run amtool check-config alertmanager.yml and amtool config routes test --config.file=alertmanager.yml --verify.receivers=pagerduty-oncall severity=critical in CI to verify alert routing before deployment
Set up a DeadMansSwitch: configure a Watchdog alert that always fires and route it to a heartbeat receiver (PagerDuty, Healthchecks.io) to detect Alertmanager/Prometheus failures
Monitor Alertmanager metrics: alert on alertmanager_notifications_failed_total, alertmanager_cluster_members dropping, and alertmanager_alerts_invalid_total increasing
Store secrets securely: use service_key_file and api_key_file for receiver credentials rather than inline plaintext; mount from Kubernetes secrets or Vault
Implement notification templates: customize Slack/PagerDuty templates to include runbook links, dashboard URLs, and severity badges for faster triage
Audit silences regularly: schedule weekly reviews of active silences; set a maximum silence duration policy (e.g., 7 days) and require JIRA ticket references in silence comments
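The DeadMansSwitch item above depends on something outside the monitoring stack noticing when the Watchdog notification stops arriving. A minimal sketch of that receiving side, assuming a hypothetical DeadMansSwitch class and a 5-minute tolerance, might look like this:

```python
from datetime import datetime, timedelta, timezone


class DeadMansSwitch:
    """Trips when no heartbeat has been recorded within max_silence."""

    def __init__(self, max_silence=timedelta(minutes=5)):
        self.max_silence = max_silence
        self.last_seen = None

    def heartbeat(self, at):
        """Record that a Watchdog notification arrived at the given time."""
        self.last_seen = at

    def is_tripped(self, now):
        """True if no heartbeat was ever seen or the last one is too old."""
        return self.last_seen is None or now - self.last_seen > self.max_silence


switch = DeadMansSwitch()
t0 = datetime(2026, 5, 14, 22, 0, tzinfo=timezone.utc)
switch.heartbeat(t0)
```

For this to work, the Watchdog route's repeat_interval has to be shorter than max_silence; otherwise the switch trips between perfectly healthy notifications.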
