Prometheus Deep Dive Part 6: Effective Alerting & Alertmanager

Alert Rule Fundamentals

Prometheus alerting separates two concerns: alert evaluation (done by Prometheus itself) and alert notification (handled by Alertmanager). Prometheus periodically evaluates alert rules, fires alerts when conditions are met, and pushes them to Alertmanager for routing, deduplication, and delivery.

                            
Key Concept: An alert rule is just a PromQL expression paired with a duration. If the expression returns results for the specified duration, the alert transitions from inactive → pending → firing.
                        

Rule Syntax & Structure

Alert rules live in rule files referenced by rule_files in prometheus.yml. Each rule group has a name and evaluation interval:

# /etc/prometheus/rules/application.yml
groups:
  - name: application.rules
    interval: 30s  # Override global evaluation_interval for this group
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate (threshold: 5%)."
          runbook_url: "https://runbooks.internal/alerts/high-error-rate"
          dashboard_url: "https://grafana.internal/d/svc-overview?var-service={{ $labels.service }}"

The `for` Duration

The for clause prevents transient spikes from triggering alerts. An alert stays in pending state until the condition holds continuously for the specified duration:

Alert State Machine

stateDiagram-v2
    [*] --> Inactive
    Inactive --> Pending: Expression returns results
    Pending --> Firing: Condition holds for 'for' duration
    Pending --> Inactive: Condition no longer true
    Firing --> Inactive: Condition no longer true
    Firing --> Firing: Condition still true
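The transitions in the diagram can be sketched in a few lines of Python. This is only an illustration of the state logic, not Prometheus code: the AlertState class, its field names, and the timestamps are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert instance across rule evaluations (illustrative only)."""
    for_seconds: float
    state: str = "inactive"
    active_since: Optional[float] = None

    def evaluate(self, now: float, condition_true: bool) -> str:
        if not condition_true:
            # One evaluation without a result resets the alert and its timer.
            self.state = "inactive"
            self.active_since = None
            return self.state
        if self.active_since is None:
            # First evaluation where the expression returned a result.
            self.active_since = now
        held = now - self.active_since
        self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


if __name__ == "__main__":
    alert = AlertState(for_seconds=300)  # corresponds to `for: 5m`
    for t, cond in [(0, True), (60, True), (120, False), (180, True), (480, True)]:
        print(t, alert.evaluate(t, cond))
```

Note how the single false evaluation at t=120 discards the two minutes already accumulated: the condition has to hold continuously for the whole duration.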

                            
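Warning: A Prometheus restart can reset the for timer. By default, Prometheus only restores pending state for rules whose for is at least --rules.alert.for-grace-period (10m by default), and only when the outage was shorter than --rules.alert.for-outage-tolerance (1h by default). Rules with a shorter for start counting from zero again. For critical alerts, keep for relatively short (2-5m) or use recording rules to pre-compute the condition.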
                        

Annotations & Templating

Annotations support Go templating with access to $labels and $value. Common formatting functions:

# Annotation templating examples
annotations:
  # Access label values
  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is CrashLooping"

  # Format the expression value
  description: |
    Current value: {{ $value | humanize }}
    Percentage: {{ $value | humanizePercentage }}
    Duration: {{ $value | humanizeDuration }}
    Timestamp: {{ $value | humanizeTimestamp }}

  # Conditional text
  impact: >-
    {{ if gt $value 0.1 }}MAJOR impact on user traffic{{ else }}Minor degradation{{ end }}

  # Printf formatting
  detail: "Error rate is {{ printf \"%.2f\" $value }}%"

Labels on alert rules are used for routing decisions. Keep them consistent across your organization:

# Standard label taxonomy for alerts ("a | b" lists the allowed values; set exactly one)
labels:
  severity: critical | warning | info
  team: platform | backend | frontend | data
  environment: production | staging
  service: "{{ $labels.service }}"  # Propagate from metric
  tier: "1"  # Business criticality (1=highest)

Alertmanager Architecture

Alertmanager receives alerts from one or more Prometheus servers and handles deduplication, grouping, routing, silencing, inhibition, and notification dispatch.

Alertmanager Processing Pipeline

flowchart LR
    P1[Prometheus 1] --> AM[Alertmanager]
    P2[Prometheus 2] --> AM
    AM --> R[Routing]
    R --> G[Grouping]
    G --> I[Inhibition]
    I --> S[Silence Check]
    S --> D[Deduplication]
    D --> PD[PagerDuty]
    D --> SL[Slack]
    D --> WH[Webhook]

Routing Tree

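The routing configuration forms a tree. Each node can match on labels and either handle the alert or delegate to child routes. Sibling routes are checked in order and the first matching child wins, unless that route sets continue: true, in which case evaluation carries on to the next sibling. Put specific routes before general ones. The match and match_re keys used in these examples still work but have been deprecated in favor of matchers since Alertmanager 0.22.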

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.internal:587'
  smtp_from: 'alerts@company.com'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # Wait before sending first notification for a new group
  group_interval: 5m    # Wait before sending updates to an existing group
  repeat_interval: 4h   # Resend if alert still firing after this

  routes:
    # Team-specific routing comes first. Routes are evaluated in order, so
    # placing it after the severity routes would make it unreachable for
    # critical and warning alerts.
    - match_re:
        team: 'backend|platform'
      receiver: 'backend-slack'
      routes:
        - match:
            severity: critical
          receiver: 'backend-pagerduty'

    # Remaining critical alerts go to PagerDuty with a short group_wait
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    # Remaining warning alerts go to Slack
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 12h

    # Catch-all for info alerts
    - match:
        severity: info
      receiver: 'slack-info'
      group_wait: 5m
      repeat_interval: 24h
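To make the first-match behavior concrete, here is a small Python sketch of the depth-first walk over a routing tree. It is a simplified model written for this article, not Alertmanager's implementation: the Route class, the resolve function, and the exact-match-only label comparison are all assumptions of the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Route:
    receiver: Optional[str] = None            # None means "inherit from parent"
    match: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)
    continue_: bool = False                    # mirrors `continue: true`

    def matches(self, labels: Dict[str, str]) -> bool:
        # Exact-match only; regex matchers are left out of this sketch.
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve(route: Route, labels: Dict[str, str],
            inherited: Optional[str] = None) -> List[Optional[str]]:
    """Return the receivers an alert with these labels ends up at."""
    if not route.matches(labels):
        return []
    receiver = route.receiver or inherited
    found: List[Optional[str]] = []
    for child in route.routes:
        hits = resolve(child, labels, receiver)
        found.extend(hits)
        if hits and not child.continue_:
            break                              # first matching child wins
    # If no child matched, this node handles the alert itself.
    return found or [receiver]


tree = Route(receiver="default-receiver", routes=[
    Route(receiver="backend-slack", match={"team": "backend"}, routes=[
        Route(receiver="backend-pagerduty", match={"severity": "critical"}),
    ]),
    Route(receiver="pagerduty-critical", match={"severity": "critical"}),
])

if __name__ == "__main__":
    print(resolve(tree, {"team": "backend", "severity": "critical"}))
    print(resolve(tree, {"team": "backend", "severity": "warning"}))
    print(resolve(tree, {"severity": "critical"}))
    print(resolve(tree, {"severity": "info"}))
```

Swapping the two top-level children in this tree sends backend critical alerts to pagerduty-critical instead, which is why route order matters.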

Grouping

Grouping batches related alerts into a single notification. When many targets go down simultaneously, you get one notification listing all affected instances instead of N separate pages.
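The effect of group_by is easy to see in a short Python sketch. The group_alerts function and the sample alerts below are hypothetical and only model how the choice of labels determines the number of notifications.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_alerts(alerts: List[Dict[str, str]],
                 group_by: Sequence[str]) -> Dict[Tuple, List[Dict[str, str]]]:
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-a"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-b"},
    {"alertname": "InstanceDown", "cluster": "prod", "instance": "node-c"},
    {"alertname": "InstanceDown", "cluster": "staging", "instance": "node-x"},
]

if __name__ == "__main__":
    print(len(group_alerts(alerts, ["alertname", "cluster"])))              # 2
    print(len(group_alerts(alerts, ["alertname", "cluster", "instance"])))  # 4
```

Adding instance to group_by turns two notifications into four here, one per target, which is the paging storm grouping is meant to prevent.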

                            
Best Practice: Group by ['alertname', 'cluster'] for infrastructure alerts and ['alertname', 'service'] for application alerts. Never group by high-cardinality labels like pod or instance, because that defeats the purpose.
                        

# Grouping examples
route:
  # Group all alerts of the same name in the same cluster
  group_by: ['alertname', 'cluster']

  routes:
    # For network alerts, group by datacenter
    - match:
        category: network
      group_by: ['alertname', 'datacenter']

    # For capacity alerts, group by namespace
    - match:
        category: capacity
      group_by: ['alertname', 'namespace']

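    # Note: group_by: ['...'] groups by ALL labels, which effectively disables
    # grouping (every distinct alert becomes its own notification).
    # To batch info alerts into one digest, group by a label they all share:
    - match:
        severity: info
      group_by: ['severity']
      group_wait: 30m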

Inhibition Rules

Inhibition suppresses notifications for alerts when a more severe related alert is already firing. If the entire cluster is down, you don't need separate alerts for each service in that cluster.

# Inhibition rules
inhibit_rules:
  # If a critical alert fires, suppress matching warnings
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'cluster', 'service']

  # If node is down, suppress all pod alerts on that node
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - alertname =~ "Pod.*"
    equal: ['node']

  # If cluster is unreachable, suppress all alerts from that cluster
  - source_matchers:
      - alertname = ClusterUnreachable
    target_matchers:
      - severity =~ "warning|critical"
    equal: ['cluster']
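The interplay of source matchers, target matchers, and the equal list can be sketched as follows. This is a simplified model with hypothetical names (is_inhibited, _matches) that supports only exact-match matchers. It treats a missing label and an empty label as equal, and it applies the rule that an alert matching both sides of a rule cannot be inhibited by another alert that also matches both sides.

```python
from typing import Dict, List, Sequence

Labels = Dict[str, str]


def _matches(labels: Labels, matchers: Labels) -> bool:
    return all(labels.get(k, "") == v for k, v in matchers.items())


def is_inhibited(target: Labels, firing: List[Labels], source_match: Labels,
                 target_match: Labels, equal: Sequence[str]) -> bool:
    """Return True if some firing source alert suppresses the target alert."""
    if not _matches(target, target_match):
        return False
    target_is_also_source = _matches(target, source_match)
    for source in firing:
        if not _matches(source, source_match):
            continue
        # Guard against self-inhibition when an alert matches both sides.
        if target_is_also_source and _matches(source, target_match):
            continue
        # Every label in `equal` must carry the same value on both alerts.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


critical = {"alertname": "HighErrorRate", "cluster": "prod",
            "service": "checkout", "severity": "critical"}
warning_same = dict(critical, severity="warning")
warning_other = dict(warning_same, service="search")
firing = [critical, warning_same, warning_other]
rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=["alertname", "cluster", "service"])

if __name__ == "__main__":
    print(is_inhibited(warning_same, firing, **rule))   # True
    print(is_inhibited(warning_other, firing, **rule))  # False
```

The warning for search is not suppressed because its service label differs from the critical alert's, so the equal condition fails.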

Silences

Silences temporarily mute alerts matching specific label matchers—useful during maintenance windows or known issues. Create them via the Alertmanager UI or amtool:

# Create a silence for 2 hours during maintenance
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="wasil.zafar" \
  --comment="Planned maintenance window for redis cluster upgrade" \
  --duration=2h \
  alertname="RedisDown" cluster="production"

# Create silence with specific start/end times
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="wasil.zafar" \
  --comment="Deploy window" \
  --start="2026-06-15T22:00:00Z" \
  --end="2026-06-15T23:00:00Z" \
  service=~"checkout|payment"

# List active silences
amtool silence query --alertmanager.url=http://alertmanager:9093

# Expire (remove) a silence by ID
amtool silence expire --alertmanager.url=http://alertmanager:9093 "silence-id-here"

Notification Channels

PagerDuty

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
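      # Set exactly one key; a config with both routing_key and service_key is rejected.
      # routing_key is the Events API v2 integration key.
      - routing_key: ''
        # service_key: ''  # Only for the legacy "Prometheus" integration type
        # severity is not in group_by, so read it from CommonLabels rather than GroupLabels
        severity: '{{ if eq .CommonLabels.severity "critical" }}critical{{ else }}warning{{ end }}'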
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          cluster: '{{ .CommonLabels.cluster }}'
        # Custom links shown in PagerDuty incident
        links:
          - href: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
            text: 'Grafana Dashboard'
          - href: '{{ (index .Alerts 0).Annotations.runbook_url }}'
            text: 'Runbook'

Slack

receivers:
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00/B00/XXXX'
        channel: '#alerts-warning'
        username: 'Alertmanager'
        icon_emoji: ':warning:'
        send_resolved: true
        title: '{{ .Status | toUpper }}{{ if eq .Status "firing" }} ({{ .Alerts.Firing | len }}){{ end }}'
        # Literal block (|-) keeps the line breaks; a folded block (>-) would join the lines
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Severity:* `{{ .Labels.severity }}`
          *Service:* {{ .Labels.service }}
          *Description:* {{ .Annotations.description }}
          *Dashboard:* {{ .Annotations.dashboard_url }}
          {{ end }}
        # Color coding: red for firing, green for resolved
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        actions:
          - type: button
            text: 'Runbook :book:'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'Silence :mute:'
            url: '{{ .ExternalURL }}/#/silences/new?filter=%7Balertname%3D%22{{ .CommonLabels.alertname }}%22%7D'

OpsGenie

receivers:
  - name: 'opsgenie-oncall'
    opsgenie_configs:
      - api_key: ''
        message: '{{ .CommonAnnotations.summary }}'
        description: |
          {{ .CommonAnnotations.description }}

          Firing alerts: {{ .Alerts.Firing | len }}
          {{ range .Alerts.Firing }}
          - {{ .Labels.alertname }}: {{ .Annotations.summary }}
          {{ end }}
        priority: '{{ if eq .CommonLabels.severity "critical" }}P1{{ else if eq .CommonLabels.severity "warning" }}P3{{ else }}P5{{ end }}'
        tags: 'prometheus,{{ .CommonLabels.team }},{{ .CommonLabels.environment }}'
        responders:
          - name: '{{ .CommonLabels.team }}-oncall'
            type: 'schedule'

Webhooks

receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://alert-handler.internal/api/v1/alerts'
        send_resolved: true
        http_config:
          bearer_token_file: '/etc/alertmanager/secrets/webhook-token'
          tls_config:
            cert_file: '/etc/alertmanager/tls/client.crt'
            key_file: '/etc/alertmanager/tls/client.key'
        # Max alerts per webhook call (default: 0 = unlimited)
        max_alerts: 100

The webhook payload format (sent as POST with JSON body):

{
  "version": "4",
  "groupKey": "{}:{alertname=\"HighErrorRate\"}",
  "status": "firing",
  "receiver": "custom-webhook",
  "groupLabels": {
    "alertname": "HighErrorRate"
  },
  "commonLabels": {
    "alertname": "HighErrorRate",
    "severity": "critical",
    "service": "checkout"
  },
  "commonAnnotations": {
    "summary": "High error rate on checkout"
  },
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "HighErrorRate",
        "severity": "critical",
        "service": "checkout",
        "instance": "checkout-7d8f9:8080"
      },
      "annotations": {
        "summary": "High error rate on checkout",
        "description": "checkout has 8.2% error rate"
      },
      "startsAt": "2026-06-15T10:00:00.000Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph?g0.expr=...",
      "fingerprint": "abc123def456"
    }
  ]
}
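A receiver for this payload can be very small. The following Python sketch uses only the standard library. The summarize helper, the AlertHandler class, and port 8080 are assumptions made for the example, and a production handler would also need authentication and error handling beyond what is shown.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List


def summarize(payload: dict) -> List[str]:
    """Turn a webhook payload into one human-readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            annotations.get("summary", ""),
        ))
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        for line in summarize(payload):
            print(line)
        # Any 2xx response tells Alertmanager the delivery succeeded.
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    # Hypothetical listen address; point webhook_configs.url at it.
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```

Because send_resolved is enabled in the receiver config above, the same endpoint also receives payloads whose status is resolved, so the handler should branch on the per-alert status field rather than assume every alert is firing.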

Alert Routing Examples

Team-Based Routing

# Route alerts to team-specific channels based on labels
route:
  receiver: 'default'
  group_by: ['alertname', 'service']
  routes:
    - match:
        team: platform
      receiver: 'platform-slack'
      routes:
        - match:
            severity: critical
          receiver: 'platform-pagerduty'

    - match:
        team: backend
      receiver: 'backend-slack'
      routes:
        - match:
            severity: critical
          receiver: 'backend-pagerduty'

    - match:
        team: data
      receiver: 'data-slack'
      routes:
        - match:
            severity: critical
          receiver: 'data-opsgenie'

Severity Escalation

# Re-notification for critical alerts. Alertmanager has no concept of
# acknowledgement, so it cannot escalate on its own.
route:
  receiver: 'team-oncall'
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: 'team-pagerduty'
      repeat_interval: 30m  # Re-notify every 30 min while the alert is still firing
      routes:
        # Reached only by alerts that already carry escalation_level=manager,
        # e.g. a separate alert rule with a longer `for`; nothing sets it automatically
        - match:
            escalation_level: manager
          receiver: 'manager-pagerduty'

                            
Tip: For true escalation policies with timeout-based level changes, use PagerDuty or OpsGenie's native escalation features. Alertmanager's repeat_interval simply re-sends the same alert; it doesn't inherently escalate.
                        

Time-Based Routing

# Time-based routing using time_intervals (Alertmanager 0.24+)
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

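  - name: out-of-hours
    time_intervals:
      # A time range cannot wrap past midnight (end_time must be later than
      # start_time), so the night is split into two ranges.
      - weekdays: ['monday:friday']
        times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '17:00'
            end_time: '24:00'
      - weekdays: ['saturday', 'sunday']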

route:
  receiver: 'default'
  routes:
    # Critical alerts page at any hour
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'

    # Warnings go to Slack during business hours only. A route that matches
    # but is muted or inactive still consumes the alert, so use a single
    # warning route. Equivalent form: active_time_intervals: [business-hours]
    - match:
        severity: warning
      receiver: 'slack-warnings'
      mute_time_intervals:
        - out-of-hours
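When defining complementary intervals like business-hours and out-of-hours, it helps to check that they cover the whole week without overlap. The Python sketch below models the two intervals with the start time inclusive and the end time exclusive. The data layout and the in_interval function are invented for this example, and times are taken as UTC.

```python
from datetime import datetime, timedelta

# Weekdays use Python's numbering (Monday = 0). Times are minutes since
# midnight, so 24 * 60 stands for '24:00'.
BUSINESS_HOURS = [
    {"weekdays": {0, 1, 2, 3, 4}, "times": [(9 * 60, 17 * 60)]},
]
OUT_OF_HOURS = [
    {"weekdays": {0, 1, 2, 3, 4}, "times": [(0, 9 * 60), (17 * 60, 24 * 60)]},
    {"weekdays": {5, 6}, "times": [(0, 24 * 60)]},
]


def in_interval(ts: datetime, interval: list) -> bool:
    """True if ts falls inside any spec of the named interval."""
    minute = ts.hour * 60 + ts.minute
    for spec in interval:
        if ts.weekday() not in spec["weekdays"]:
            continue
        # Start is inclusive, end is exclusive.
        if any(start <= minute < end for start, end in spec["times"]):
            return True
    return False


def covers_week_exactly_once(start: datetime) -> bool:
    """Check that every minute of a week is in exactly one of the two intervals."""
    for offset in range(7 * 24 * 60):
        ts = start + timedelta(minutes=offset)
        if in_interval(ts, BUSINESS_HOURS) == in_interval(ts, OUT_OF_HOURS):
            return False
    return True


if __name__ == "__main__":
    monday_morning = datetime(2026, 6, 15, 10, 0)
    print(in_interval(monday_morning, BUSINESS_HOURS))
    print(covers_week_exactly_once(datetime(2026, 6, 15)))
```

A wrapped range such as 17:00 to 09:00 cannot be expressed in this model either, since no minute satisfies start <= minute < end when start is greater than end.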

Testing Alerts

promtool test rules

You can unit-test alert rules with promtool test rules. Define synthetic input series and expected alert outputs:

# tests/alert_test.yml
rule_files:
  - ../rules/application.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulate 10% error rate for 10 minutes
      - series: 'http_requests_total{service="checkout", status="500"}'
        values: '0+10x10'  # Starts at 0, increments by 10 each minute
      - series: 'http_requests_total{service="checkout", status="200"}'
        values: '0+90x10'  # 90 successful per minute

    alert_rule_test:
      - eval_time: 6m  # Check at 6 minutes
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              service: checkout
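            exp_annotations:
              # promtool compares the full annotation set, so list every annotation
              summary: "High error rate on checkout"
              description: "Service checkout has 10% error rate (threshold: 5%)."
              runbook_url: "https://runbooks.internal/alerts/high-error-rate"
              dashboard_url: "https://grafana.internal/d/svc-overview?var-service=checkout"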

      - eval_time: 3m  # At 3 minutes, should be pending (for: 5m)
        alertname: HighErrorRate
        exp_alerts: []  # No firing alerts yet

# Run alert rule tests
promtool test rules tests/alert_test.yml

# Validate rule file syntax
promtool check rules rules/application.yml

# Example output
# Unit Testing: tests/alert_test.yml
#   SUCCESS
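The compact values notation in the test file is worth unpacking. The Python sketch below expands the 'a+bxn' form used above into the samples it stands for. It is a hypothetical helper, not part of promtool, and it only handles plain numbers and the a+bxn / a-bxn forms; promtool's notation also accepts _ for a missing sample and stale, which are left out here.

```python
import re
from typing import List

# start, sign, step, repeat count, e.g. "0+10x10"
_EXPANDING = re.compile(r"^(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)$")


def expand(values: str) -> List[float]:
    """Expand a space-separated values string into individual samples."""
    samples: List[float] = []
    for token in values.split():
        m = _EXPANDING.match(token)
        if m is None:
            samples.append(float(token))
            continue
        start, sign, step, count = float(m[1]), m[2], float(m[3]), int(m[4])
        delta = step if sign == "+" else -step
        # 'a+bxn' yields the start value plus n further steps: n + 1 samples.
        samples.extend(start + i * delta for i in range(count + 1))
    return samples


if __name__ == "__main__":
    errors = expand("0+10x10")
    successes = expand("0+90x10")
    print(len(errors), errors[:3], errors[-1])
    # Per-interval increase: 10 errors out of 100 requests.
    ratio = (errors[1] - errors[0]) / (
        (errors[1] - errors[0]) + (successes[1] - successes[0]))
    print(ratio)
```

With a 1m interval, '0+10x10' therefore produces 11 samples covering minutes 0 through 10, and the two series together give the 10% error ratio that the alert threshold of 5% is tested against.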

Unit Testing Definitions

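# Test that recording rules produce expected values.
# Fragment: assumes rule_files loads a recording rule named
# job:http_request_duration_seconds:p99 that computes the 99th percentile
# per service with histogram_quantile over these buckets.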
tests:
  - interval: 1m
    input_series:
      - series: 'http_request_duration_seconds_bucket{le="0.1", service="api"}'
        values: '0+100x5'
      - series: 'http_request_duration_seconds_bucket{le="0.5", service="api"}'
        values: '0+900x5'
      - series: 'http_request_duration_seconds_bucket{le="+Inf", service="api"}'
        values: '0+1000x5'

    # Test recording rule output
    promql_expr_test:
      - expr: 'job:http_request_duration_seconds:p99'
        eval_time: 5m
        exp_samples:
          - labels: 'job:http_request_duration_seconds:p99{service="api"}'
            value: 0.5
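Why the expected value is 0.5 becomes clear when the quantile estimate is worked through by hand. The Python sketch below is a simplified re-implementation of the bucket interpolation for classic histograms, written for this article; it is not Prometheus code and assumes at least two buckets with cumulative counts, the last one being +Inf.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies above the highest finite bound, so that bound
        # is the best available answer.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    if count == previous:
        return upper
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Per-minute increases from the input series above.
buckets = [(0.1, 100.0), (0.5, 900.0), (math.inf, 1000.0)]

if __name__ == "__main__":
    print(histogram_quantile(0.99, buckets))
    print(histogram_quantile(0.50, buckets))
```

The p99 rank is 990 of 1000 observations, but only 900 fall at or below 0.5s, so the quantile lands in the +Inf bucket and the estimate is capped at 0.5. That is a useful reminder that the highest finite bucket bounds what a histogram can report.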

Alert Fatigue Prevention

Symptom vs Cause-Based Alerts

Alerting Philosophy

Approach	Example	When to Use
Symptom-based	"Error rate > 5% for users"	Page-worthy: directly impacts users
Cause-based	"Disk 90% full"	Warning only: may never cause symptoms
Multi-window	"Burning error budget too fast"	SLO-based alerting (preferred for SRE)

SRE Alerting Philosophy

Actionable Alerts Principles

                            
Every alert must answer these questions:

1. What is broken? (summary annotation)
2. What is the impact? (description with user-facing effect)
3. What should I do? (runbook_url annotation)
4. Where do I look? (dashboard_url annotation)

# Anti-pattern: Alert that cannot be acted upon
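- alert: CPUHigh
  # node_cpu_seconds_total is a counter, so compare its per-second rate, not the raw value
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
  for: 5m
  # Missing: what service? what impact? what to do?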

# Better: Actionable alert tied to user impact
- alert: CheckoutLatencyBudgetBurn
  expr: |
    (
      sum(rate(http_request_duration_seconds_count{service="checkout",code!~"5.."}[1h]))
      -
      sum(rate(http_request_duration_seconds_bucket{service="checkout",le="0.5",code!~"5.."}[1h]))
    )
    /
    sum(rate(http_request_duration_seconds_count{service="checkout",code!~"5.."}[1h]))
    > 0.01
  for: 5m
  labels:
    severity: critical
    team: backend
    slo: checkout-latency
  annotations:
    summary: "Checkout latency SLO burn rate too high"
    description: |
      More than 1% of checkout requests exceeded the 500ms latency SLO
      over the last hour. Current violation rate: {{ $value | humanizePercentage }}.
    runbook_url: "https://runbooks.internal/slos/checkout-latency"
    dashboard_url: "https://grafana.internal/d/checkout-slo"
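The relationship between a violation rate and error budget consumption is simple arithmetic, sketched below in Python. The function names, the 1% error budget (that is, a 99% latency SLO), and the 30-day budget window are assumptions chosen for the example; the rule above does not state them.

```python
def burn_rate(bad_ratio: float, error_budget: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return bad_ratio / error_budget


def hours_to_exhaustion(bad_ratio: float, error_budget: float,
                        window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone if the current bad ratio persists."""
    return window_hours / burn_rate(bad_ratio, error_budget)


if __name__ == "__main__":
    # Assumed SLO: 99% of requests under 500ms, so the error budget is 1%.
    budget = 0.01
    for bad in (0.01, 0.02, 0.144):
        print(bad, burn_rate(bad, budget), hours_to_exhaustion(bad, budget))
```

Under these assumptions, the threshold of 0.01 in the rule corresponds to a burn rate of 1: the budget would last exactly the 30-day window. A 2% violation rate halves that to 15 days, which is why multi-window burn-rate alerts usually page only at much higher multiples.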

High Availability

Alertmanager Clustering

Run multiple Alertmanager instances to avoid a single point of failure. They form a cluster using a gossip protocol (based on HashiCorp's memberlist library) and share silences and notification state so that each notification is sent only once:

# docker-compose.yml - Alertmanager HA cluster (3 nodes)
services:
  alertmanager-1:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-2:9094'
      - '--cluster.peer=alertmanager-3:9094'
      - '--cluster.settle-timeout=60s'
    ports:
      - "9093:9093"
      - "9094:9094"

  alertmanager-2:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-3:9094'

  alertmanager-3:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--cluster.listen-address=0.0.0.0:9094'
      - '--cluster.peer=alertmanager-1:9094'
      - '--cluster.peer=alertmanager-2:9094'

Gossip Protocol

The gossip protocol ensures that notification deduplication works across all cluster members. Each Alertmanager instance knows about silences and notification states from peers:

# Check cluster status
curl -s http://alertmanager:9093/api/v2/status | jq '.cluster'

# Expected output for healthy 3-node cluster:
# {
#   "name": "alertmanager-1",
#   "status": "ready",
#   "peers": [
#     {"name": "alertmanager-2", "address": "10.0.0.2:9094"},
#     {"name": "alertmanager-3", "address": "10.0.0.3:9094"}
#   ]
# }

# Kubernetes: Use a headless service for peer discovery
# alertmanager-cluster.monitoring.svc.cluster.local resolves to all pod IPs

                            
Production Tip: Point all Prometheus instances at all Alertmanager instances using the alertmanagers config. Prometheus will send alerts to all of them, and the cluster handles deduplication internally.
                        

# prometheus.yml - Send alerts to all Alertmanager instances
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1:9093'
            - 'alertmanager-2:9093'
            - 'alertmanager-3:9093'
    # Alternative for Kubernetes: replace the static_configs entry above with
    # DNS service discovery against the headless service
    # - dns_sd_configs:
    #     - names:
    #         - 'alertmanager-cluster.monitoring.svc.cluster.local'
    #       type: A
    #       port: 9093

amtool CLI

amtool is the official CLI for interacting with Alertmanager. Essential commands for daily operations:

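# Set a default Alertmanager URL in amtool's config file
# (~/.config/amtool/config.yml or /etc/amtool/config.yml):
#   alertmanager.url: "http://alertmanager:9093"
# Otherwise, pass --alertmanager.url=http://alertmanager:9093 to each command below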

# Query current alerts
amtool alert query
amtool alert query alertname="HighErrorRate"
amtool alert query severity="critical" service=~"checkout|payment"

# Check which receiver an alert would route to
amtool config routes test --tree \
  severity=critical team=backend alertname=HighErrorRate

# Show the full routing tree
amtool config routes show

# Manage silences
amtool silence add alertname="DeployInProgress" --duration=30m \
  --author="deploy-bot" --comment="Rolling deploy in progress"
amtool silence query
amtool silence expire "silence-id-here"
amtool silence expire $(amtool silence query -q)  # Expire ALL silences (careful!)

# Check Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml

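# Verify template rendering (--template.text is required; slack.default.title
# is one of Alertmanager's built-in templates)
amtool template render --template.glob='/etc/alertmanager/templates/*.tmpl' \
  --template.text='{{ template "slack.default.title" . }}'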

Conclusion

Effective alerting is the difference between a team that responds to real incidents and one drowning in noise. Focus on symptom-based alerts with clear runbooks, use Alertmanager's grouping and inhibition to reduce noise, and always test your rules before deploying them to production.

                            
Next Up: In Part 7, we tackle scaling Prometheus beyond a single instance with sharding strategies, hierarchical federation, and HA pair configurations.
                        
