Grafana Deep Dive Part 9: Managing Incidents Using Alerts

Being Alerted vs Being Alarmed

The distinction between being alerted and being alarmed is fundamental to building sustainable on-call practices. An alert should inform you that something meaningful requires attention. An alarm — the visceral “wake me up at 3 AM” page — should be reserved for situations where immediate human intervention is the only path to preserving user-facing reliability. When teams conflate these two concepts, they create a culture of fatigue, desensitization, and ultimately, missed critical incidents hidden in a flood of noise.

Alert Fatigue

Alert fatigue is the gradual erosion of an engineer’s ability to respond to alerts because they receive too many, too often, with too little actionable signal. Research from PagerDuty’s State of Digital Operations reports consistently shows that teams receiving more than 40 alerts per on-call shift experience significantly degraded response times and higher incident duration.

                            
The Alert Fatigue Spiral: Too many alerts → engineers start ignoring alerts → critical alerts get missed → more alerts are added to “catch” missed issues → even more noise. The only escape is ruthlessly pruning non-actionable alerts and investing in SLO-based alerting that ties directly to user impact.
                        

Common causes of alert fatigue include:

Cause-based alerts where symptom-based ones belong — alerting on CPU usage rather than on user-visible latency degradation
Static thresholds — hardcoded values that don’t account for natural traffic patterns (weekday vs weekend, seasonal peaks)
Redundant alerts — multiple alerts firing for the same root cause (disk full triggers filesystem alert, database alert, and application error alert simultaneously)
Missing deduplication — the same alert firing once per instance rather than once per incident
Alerts without owners — alerts created “just in case” that no one knows how to remediate

Signal vs Noise

A high signal-to-noise ratio means that when an engineer receives a notification, they can trust that it represents a genuine problem requiring action. Google’s SRE guidance is that every page should be actionable; a practical floor is a 50% or higher actionable rate, meaning at least half of all pages require immediate human intervention. The best teams achieve 80%+ actionable rates.

Alert Classification Matrix

Classify every existing alert into one of four categories to identify what to keep, modify, or delete:

Category	Action Required?	Time-Sensitive?	Recommendation
Page	Yes	Immediate	Keep as paging alert (P1/P2)
Ticket	Yes	Hours/Days	Route to issue tracker, not pager
Log	Maybe	No	Record as metric, review in aggregate
Delete	No	No	Remove entirely — it’s pure noise
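
The matrix can be applied mechanically during an alert review. The following is a minimal sketch, assuming each alert has already been tagged by a reviewer; the `AlertReview` type, its field values, and the sample alert names are hypothetical and only serve to illustrate the four categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertReview:
    name: str
    action_required: str  # "yes", "maybe", or "no"
    time_sensitive: str   # "immediate", "hours_days", or "no"


def classify(alert: AlertReview) -> str:
    """Return the matrix category for one reviewed alert."""
    if alert.action_required == "no":
        return "Delete"
    if alert.action_required == "maybe":
        return "Log"
    # Action is definitely required: urgency decides pager vs. issue tracker.
    if alert.time_sensitive == "immediate":
        return "Page"
    return "Ticket"


reviews = [
    AlertReview("CheckoutErrorBudgetBurn", "yes", "immediate"),
    AlertReview("CertExpiresIn14Days", "yes", "hours_days"),
    AlertReview("HighCPU", "maybe", "no"),
    AlertReview("PodRestartedOnce", "no", "no"),
]

for review in reviews:
    print(f"{review.name}: {classify(review)}")
```

Running this over an exported rule list gives a first-pass keep/modify/delete queue; the tagging itself still needs human judgment.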


Actionability Criteria

Every alert that can page a human must pass the actionability test. If an engineer cannot take a meaningful action upon receiving the alert, it should not be a paging alert. The three pillars of an actionable alert are:

What is broken? — The alert title and annotations must clearly state the user-visible impact
Why does it matter? — Include the SLO context (e.g., “Error budget consumed 40% in last hour, on track to exhaust within 6 hours”)
What should I do? — Link to a runbook with specific remediation steps
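
The three pillars can be checked automatically before a rule is allowed to page. This is a hedged sketch, not a Grafana feature: it assumes rules are available as plain dictionaries with `labels` and `annotations`, and it maps "what is broken" to `summary`, "why it matters" to `description`, and "what should I do" to `runbook_url`. The `team` label as an ownership marker is also an assumption.

```python
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url")


def actionability_gaps(rule: dict) -> list[str]:
    """List what a paging rule is missing; an empty list means it passes."""
    annotations = rule.get("annotations", {})
    gaps = [
        f"missing annotation: {key}"
        for key in REQUIRED_ANNOTATIONS
        if not annotations.get(key, "").strip()
    ]
    # An alert without an owner cannot be remediated by anyone in particular.
    if not rule.get("labels", {}).get("team"):
        gaps.append("missing label: team (no owner)")
    return gaps


good_rule = {
    "labels": {"severity": "critical", "team": "platform"},
    "annotations": {
        "summary": "Checkout requests are failing for users",
        "description": "Error budget consumed 40% in the last hour",
        "runbook_url": "https://wiki.internal/runbooks/api-availability",
    },
}
bad_rule = {"labels": {"severity": "critical"}, "annotations": {"summary": "High CPU"}}

print(actionability_gaps(good_rule))
print(actionability_gaps(bad_rule))
```

A check like this fits naturally in CI for a rules-as-code repository, so non-actionable pages are rejected at review time.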

                            
The “Wake Up” Test: Before adding any paging alert, ask: “If this fires at 3 AM, would I be upset if it turned out to not need immediate human action?” If the answer is yes, it belongs as a ticket or log, not a page.
                        

Before an Incident

Incident response is 80% preparation and 20% execution. The teams that resolve incidents fastest are not the ones with the best engineers — they’re the ones with the best systems: clear runbooks, practiced communication patterns, defined roles, and tested escalation paths. Preparation transforms chaotic firefighting into methodical problem-solving.

Runbooks & Playbooks

A runbook is a documented procedure for diagnosing and remediating a known failure mode. Every paging alert should link to a runbook. The runbook should be written for the least experienced on-call engineer — clear, step-by-step, with explicit decision trees.

# Example runbook structure (stored alongside alert definitions)
# runbooks/high-error-rate-api-gateway.md

title: "API Gateway Error Rate > 5%"
severity: P2
escalation: platform-team
last_reviewed: 2026-05-01

symptoms:
  - HTTP 5xx responses exceed 5% of total traffic
  - Downstream service health checks failing
  - User-visible errors on checkout flow

diagnosis_steps:
  - step: "Check which endpoints are affected"
    query: |
      sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (route) (rate(http_requests_total[5m])) > 0.05
  - step: "Check if a recent deployment correlates"
    action: "Review deployment timeline in Grafana annotations"
  - step: "Check downstream dependency health"
    dashboard: "https://grafana.internal/d/deps-health"

mitigation:
  - If single endpoint: Route traffic away with feature flag
  - If recent deploy: Rollback via `kubectl rollout undo`
  - If downstream: Activate circuit breaker configuration
  - If capacity: Scale horizontally via HPA override

communication:
  template: "incident-api-degradation"
  channels: ["#incidents", "#platform-oncall"]

Communication Templates

During an incident, communication must be fast and structured. Pre-defined templates eliminate the cognitive overhead of composing messages under pressure. Define templates for initial notification, status updates, and resolution announcements:

# communication-templates/incident-declaration.yaml
templates:
  initial_notification:
    title: "[{{ severity }}] {{ title }}"
    body: |
      🚨 Incident Declared: {{ title }}
      Severity: {{ severity }}
      Impact: {{ impact_description }}
      Incident Commander: {{ commander }}
      War Room: {{ war_room_link }}
      Status Page: {{ status_page_link }}

      Next update in 15 minutes.

  status_update:
    body: |
      📋 Update #{{ update_number }} - {{ title }}
      Status: {{ status }}
      Current Actions: {{ current_actions }}
      ETA to Resolution: {{ eta }}
      Next update in {{ next_update_minutes }} minutes.

  resolution:
    body: |
      ✅ Resolved: {{ title }}
      Duration: {{ duration }}
      Root Cause: {{ root_cause_summary }}
      User Impact: {{ impact_summary }}
      Postmortem scheduled: {{ postmortem_date }}
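
Templates like these only help if a missing field is caught before the message goes out. The sketch below is a deliberately tiny stand-in for a real template engine: the `render` helper is hypothetical, handles only plain `{{ name }}` placeholders, and fails loudly when a field is not supplied.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, fields: dict) -> str:
    """Fill {{ name }} placeholders; raise KeyError if a field is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in fields:
            raise KeyError(f"template field not provided: {key}")
        return str(fields[key])

    return PLACEHOLDER.sub(substitute, template)


STATUS_UPDATE = (
    "Update #{{ update_number }} - {{ title }}\n"
    "Status: {{ status }}\n"
    "Next update in {{ next_update_minutes }} minutes."
)

message = render(
    STATUS_UPDATE,
    {
        "update_number": 2,
        "title": "API error rate elevated",
        "status": "Mitigating",
        "next_update_minutes": 15,
    },
)
print(message)
```

Failing on a missing field is a design choice: a half-filled status update sent during an incident tends to cause more confusion than a short delay.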

Escalation Paths

Escalation paths define who to contact when the primary on-call cannot resolve an issue within a defined timeframe. Clear escalation prevents the “hero culture” where one engineer struggles alone while users suffer. Define escalation based on time and severity:

Escalation Path Flow

flowchart TD
    A[Alert Fires] --> B{Acknowledged<br/>within 5 min?}
    B -->|Yes| C[Primary On-Call<br/>Investigates]
    B -->|No| D[Escalate to<br/>Secondary On-Call]
    C --> E{Resolved within<br/>30 min?}
    E -->|Yes| F[Close Alert]
    E -->|No| G{Severity >= P2?}
    G -->|Yes| H[Declare Incident<br/>Engage Team Lead]
    G -->|No| I[Create Ticket<br/>Continue Investigation]
    D --> J{Acknowledged<br/>within 5 min?}
    J -->|Yes| C
    J -->|No| K[Escalate to<br/>Engineering Manager]
    H --> L[Incident Commander<br/>Assigned]
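
The same flow can be written as two small decision functions, which is handy for testing an escalation policy before encoding it in a paging tool. This is a sketch under stated assumptions: the function names are hypothetical, and "Severity >= P2" is read as "P1 or P2" (P1 being the most severe).

```python
def who_to_page(minutes_unacknowledged: float) -> str:
    """Who should currently be paged for an unacknowledged alert."""
    if minutes_unacknowledged < 5:
        return "primary on-call"
    if minutes_unacknowledged < 10:  # secondary also gets 5 minutes
        return "secondary on-call"
    return "engineering manager"


def after_acknowledgement(minutes_investigating: float, resolved: bool, severity: str) -> str:
    """Next action once someone has acknowledged and started investigating."""
    if resolved:
        return "close alert"
    if minutes_investigating < 30:
        return "keep investigating"
    if severity in ("P1", "P2"):
        return "declare incident, engage team lead"
    return "create ticket, continue investigation"


print(who_to_page(7))
print(after_acknowledgement(45, resolved=False, severity="P2"))
print(after_acknowledgement(45, resolved=False, severity="P3"))
```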

During an Incident

When an incident is declared, the priority shifts from individual problem-solving to coordinated team response. The goal is not to find the root cause immediately — it’s to mitigate user impact as quickly as possible. Root cause analysis comes later, in the postmortem. During the incident, speed of mitigation trumps elegance of solution.

Triage & Assessment

The first 5 minutes of an incident determine its trajectory. Effective triage requires answering four questions rapidly:

What is the user impact? — Which users, which features, what percentage of traffic?
What changed? — Deployments, config changes, traffic spikes, upstream/downstream changes?
Is it getting worse? — Is error rate climbing, stable, or recovering?
What is the blast radius? — Single service, single region, or global?

Incident Roles

Clearly defined roles prevent chaos. The minimum viable incident team has three roles:

Incident Response Roles

Incident Commander (IC) — Coordinates response, makes decisions about escalation and communication. Does NOT debug. Keeps the team focused and prevents tunnel vision.
Technical Lead — Drives diagnosis and mitigation. The most senior engineer available for the affected system. Delegates investigation tasks.
Communications Lead — Posts status updates to stakeholders, manages the status page, handles customer communication. Frees the IC and Tech Lead from interruptions.

For major incidents (P1/SEV1), add:

Scribe — Records all actions, decisions, and timestamps in the incident timeline
Subject Matter Experts (SMEs) — Pulled in as needed based on affected systems


Communication & Coordination

Effective incident communication follows the OODA loop: Observe, Orient, Decide, Act. The Incident Commander drives this loop with regular check-ins every 10–15 minutes, asking: “What do we know now? What are we trying? What should we try next?”

Key communication principles during incidents:

Use a dedicated channel — Never mix incident coordination with general team chat
Timestamp everything — “14:32 UTC: Rolled back deployment v2.4.1 to v2.4.0”
State hypotheses explicitly — “Hypothesis: The new Redis connection pool config is causing timeouts”
Announce actions before taking them — Prevents duplicate or conflicting remediation attempts
Regular external updates — Even “We’re still investigating” is better than silence

Mitigation Strategies

Mitigation is about restoring service, not fixing the underlying bug. Common mitigation patterns ordered by speed:

Rollback — Revert the most recent change (fastest if a deployment caused the issue)
Feature flag — Disable the affected feature path
Traffic shifting — Route traffic away from the affected region/instance
Scale out — Add capacity if the issue is load-related
Circuit breaker — Isolate the failing dependency with fallback behavior
Restart — Clear corrupted state (last resort, masks underlying issues)

After an Incident

The postmortem is where incidents become organizational improvements. Without effective post-incident review, teams are doomed to repeat failures. The goal is not to assign blame — it’s to understand the systemic conditions that allowed the failure and to build defenses against recurrence.

Blameless Postmortems

A blameless postmortem recognizes that in complex systems, failures are emergent properties of the system design, not individual negligence. The engineer who deployed the breaking change was acting rationally given the information available to them. The question is: why did the system allow a harmful change to reach production?

                            
Postmortem Document Structure:
Summary — One paragraph: what happened, duration, impact
Timeline — Minute-by-minute chronology from first signal to resolution
Root Cause — Technical explanation using the “5 Whys” technique
Contributing Factors — What made detection/response slower than ideal
Impact — Quantified user impact (requests failed, revenue lost, SLO budget consumed)
What Went Well — What worked in the response (celebrate good practice)
Action Items — Specific, assigned, time-bound improvements
Lessons Learned — Insights applicable beyond this specific incident

                        

Action Items & Follow-Up

Action items from postmortems must be specific, assigned, prioritized, and tracked. Vague action items like “improve monitoring” never get completed. Effective action items look like:

“Add circuit breaker to payment service Redis connection (assigned: @alice, P2, due: 2026-07-01)”
“Create alert for connection pool exhaustion at 80% capacity (assigned: @bob, P2, due: 2026-06-25)”
“Add deployment canary analysis gate requiring 5-minute error rate check (assigned: @charlie, P1, due: 2026-06-20)”

Track postmortem action item completion rate as a team metric. If fewer than 80% of action items are completed within their deadline, the postmortem process needs strengthening (dedicated time, management backing, or fewer but more impactful items).

Organizational Learning

Individual postmortems create local learning. To achieve organizational learning, share postmortems broadly, run periodic “failure review” meetings where teams present their most interesting incidents, and look for systemic patterns across incidents. Common patterns include: inadequate testing for failure modes, missing observability in new services, and deployment pipelines that lack safety gates.

Writing Great Alerts Using SLIs & SLOs

Traditional threshold-based alerting (“alert if CPU > 80%”) is fundamentally flawed because it measures causes rather than symptoms. Users don’t care about your CPU utilization — they care about latency, availability, and correctness. SLI/SLO-based alerting inverts this model: alert when the user experience degrades, regardless of the underlying cause.

SLI-Based Alerts

A Service Level Indicator (SLI) is a quantitative measure of user-perceived service quality. The most common SLIs are:

Availability — Proportion of requests that succeed: successful_requests / total_requests
Latency — Proportion of requests faster than a threshold: requests_under_300ms / total_requests
Correctness — Proportion of requests returning correct results

A Service Level Objective (SLO) sets a target for the SLI over a rolling window (typically 30 days). For example: “99.9% of requests will succeed over a 30-day rolling window.” This gives you an error budget of 0.1%: approximately 43 minutes of complete downtime per 30 days, or a sustained 1% error rate for about 72 hours (0.1% / 1% of the 720-hour window), assuming steady traffic.
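
The error-budget arithmetic is easy to get wrong by a factor of ten, so it can help to compute it. The sketch below assumes uniform traffic across the window; the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def hours_to_exhaust(slo_target: float, error_rate: float, window_days: int = 30) -> float:
    """Hours until the budget is gone at a constant error rate."""
    budget_fraction = 1 - slo_target
    # The budget lasts the whole window at error_rate == budget_fraction,
    # and proportionally less time at higher error rates.
    return window_days * 24 * budget_fraction / error_rate


print(f"{error_budget_minutes(0.999):.1f} minutes of full downtime")   # 43.2
print(f"{hours_to_exhaust(0.999, 0.01):.1f} hours at a 1% error rate")  # 72.0
print(f"{hours_to_exhaust(0.999, 1.0) * 60:.1f} minutes at 100% errors")  # 43.2
```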

# SLO definition for API availability
slo:
  name: api-availability
  description: "API requests returning non-5xx responses"
  sli:
    events:
      good: sum(rate(http_requests_total{status!~"5.."}[5m]))
      total: sum(rate(http_requests_total[5m]))
  objective: 0.999  # 99.9%
  window: 30d
  alerting:
    burn_rate:
      - severity: critical
        short_window: 5m
        long_window: 1h
        factor: 14.4
      - severity: warning
        short_window: 30m
        long_window: 6h
        factor: 6

Burn-Rate Alerting

Burn rate measures how quickly you’re consuming your error budget relative to a sustainable rate. A burn rate of 1 means you’ll exactly exhaust your budget by the end of the window. A burn rate of 14.4 means you’ll exhaust your entire 30-day budget in about 2 days (30 days / 14.4 ≈ 2.08 days, or 50 hours). Put another way, one hour at that rate consumes 2% of the 30-day budget (14.4 × 1 h / 720 h).

                            
Burn Rate Formula: burn_rate = (1 - SLI_current) / (1 - SLO_target). For a 99.9% SLO with current availability of 99.0%: burn_rate = (1 - 0.990) / (1 - 0.999) = 0.010 / 0.001 = 10. This means you’re consuming error budget 10× faster than sustainable.
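
As a quick check of the formula, the sketch below computes a burn rate and translates it into two more intuitive numbers: the share of the 30-day budget consumed over an alerting window, and the days until exhaustion. Function names are illustrative and a 30-day SLO window is assumed.

```python
def burn_rate(sli_current: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (1 - sli_current) / (1 - slo_target)


def budget_fraction_consumed(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Share of the total error budget burned during window_hours."""
    return rate * window_hours / (slo_window_days * 24)


def days_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    return slo_window_days / rate


print(f"{burn_rate(0.990, 0.999):.1f}x")                 # 10.0x
print(f"{budget_fraction_consumed(14.4, 1):.0%} in 1h")  # 2% in 1h
print(f"{days_to_exhaustion(14.4):.1f} days")            # 2.1 days
```

The 14.4 threshold is therefore just "2% of the monthly budget in one hour" expressed as a multiplier.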
                        

# Prometheus recording rules for burn-rate calculation
groups:
  - name: slo-burn-rates
    interval: 30s
    rules:
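      # Error ratio per service over each alerting window.
      # Assumes http_requests_total carries a "service" label.
      - record: slo:error_ratio:rate5m
        expr: |
          1 - (
            sum by (service) (rate(http_requests_total{status!~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
          )

      - record: slo:error_ratio:rate30m
        expr: |
          1 - (
            sum by (service) (rate(http_requests_total{status!~"5.."}[30m]))
            / sum by (service) (rate(http_requests_total[30m]))
          )

      - record: slo:error_ratio:rate1h
        expr: |
          1 - (
            sum by (service) (rate(http_requests_total{status!~"5.."}[1h]))
            / sum by (service) (rate(http_requests_total[1h]))
          )

      - record: slo:error_ratio:rate6h
        expr: |
          1 - (
            sum by (service) (rate(http_requests_total{status!~"5.."}[6h]))
            / sum by (service) (rate(http_requests_total[6h]))
          )

      # Burn rates (target SLO: 99.9% = 0.001 error budget)
      - record: slo:burn_rate:5m
        expr: slo:error_ratio:rate5m / 0.001

      # 30m is the short window used by the warning alert
      - record: slo:burn_rate:30m
        expr: slo:error_ratio:rate30m / 0.001

      - record: slo:burn_rate:1h
        expr: slo:error_ratio:rate1h / 0.001

      - record: slo:burn_rate:6h
        expr: slo:error_ratio:rate6h / 0.001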

Multi-Window Multi-Burn-Rate

Single-window burn-rate alerts suffer from either being too slow (long window) or too noisy (short window). The multi-window multi-burn-rate approach combines a short window for freshness with a long window for significance. An alert fires only when both windows exceed their respective thresholds, dramatically reducing false positives while maintaining fast detection.

Multi-Window Multi-Burn-Rate Alert Logic

flowchart TD
    A[Evaluate Short Window<br/>e.g., 5 min burn rate] --> B{Short Window<br/>> Threshold?}
    B -->|No| C[No Alert<br/>Brief spike, self-healing]
    B -->|Yes| D[Evaluate Long Window<br/>e.g., 1 hour burn rate]
    D --> E{Long Window<br/>> Threshold?}
    E -->|No| F[No Alert<br/>Already recovering]
    E -->|Yes| G[FIRE ALERT<br/>Sustained budget burn]

# Multi-window multi-burn-rate alert rules
groups:
  - name: slo-alerts
    rules:
      # Critical: 14.4x burn rate over 1h AND 5m (exhausts budget in ~2 days)
      - alert: SLOBurnRateCritical
        expr: |
          slo:burn_rate:1h > 14.4
          and
          slo:burn_rate:5m > 14.4
        for: 2m
        labels:
          severity: critical
          slo: api-availability
        annotations:
          summary: "API availability SLO burn rate critical"
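          # Prometheus rule templates have no arithmetic helpers (such as divf),
          # so the exhaustion time is stated for the threshold, not computed.
          description: |
            Error budget burn rate over the last hour is {{ $value | printf "%.1f" }}x.
            A sustained 14.4x burn exhausts the 30-day error budget in about 50 hours.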
          runbook_url: "https://wiki.internal/runbooks/api-availability"
          dashboard_url: "https://grafana.internal/d/slo-overview"

      # Warning: 6x burn rate over 6h AND 30m (exhausts budget in ~5 days)
      - alert: SLOBurnRateWarning
        expr: |
          slo:burn_rate:6h > 6
          and
          slo:burn_rate:30m > 6
        for: 5m
        labels:
          severity: warning
          slo: api-availability
        annotations:
          summary: "API availability SLO burn rate elevated"
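          # Prometheus rule templates have no arithmetic helpers (such as divf),
          # so the exhaustion time is stated for the threshold, not computed.
          description: |
            Error budget burn rate over the last 6 hours is {{ $value | printf "%.1f" }}x.
            A sustained 6x burn exhausts the 30-day error budget in about 5 days.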

Recommended Multi-Window Parameters (Google SRE Workbook)

Severity	Long Window	Short Window	Burn Rate	Budget Exhaustion
Critical (Page)	1 hour	5 minutes	14.4×	~2 days
Warning (Page)	6 hours	30 minutes	6×	~5 days
Info (Ticket)	3 days	6 hours	1×	30 days
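
The table can be read as a small decision procedure: a severity fires only when both its long and its short window exceed the same burn-rate threshold. The sketch below encodes the three rows; the `BurnPolicy` type and the dictionary of measured burn rates are illustrative, not part of any Grafana or Prometheus API.

```python
from typing import NamedTuple, Optional


class BurnPolicy(NamedTuple):
    severity: str
    long_window: str
    short_window: str
    threshold: float


# Ordered most severe first, mirroring the table rows.
POLICIES = [
    BurnPolicy("critical", "1h", "5m", 14.4),
    BurnPolicy("warning", "6h", "30m", 6.0),
    BurnPolicy("info", "3d", "6h", 1.0),
]


def evaluate(burn_rates: dict) -> Optional[str]:
    """Return the most severe policy whose long AND short windows both exceed its threshold."""
    for policy in POLICIES:
        long_hot = burn_rates[policy.long_window] > policy.threshold
        short_hot = burn_rates[policy.short_window] > policy.threshold
        if long_hot and short_hot:
            return policy.severity
    return None


sustained = {"5m": 20, "30m": 18, "1h": 16, "6h": 7, "3d": 1.5}
brief_spike = {"5m": 30, "30m": 8, "1h": 3, "6h": 2, "3d": 0.5}
recovering = {"5m": 0.5, "30m": 2, "1h": 15, "6h": 7, "3d": 1.2}

print(evaluate(sustained))    # critical
print(evaluate(brief_spike))  # None
print(evaluate(recovering))   # info
```

The `recovering` case shows why the short window matters: the one-hour average is still above 14.4, but the last five minutes are healthy, so nobody is paged.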


Grafana Alerting

Grafana’s unified alerting system provides a single pane of glass for managing alert rules across all data sources. Whether your rules are evaluated by Grafana itself, by Mimir Ruler, or by Loki Ruler, they all appear in the same UI with consistent notification routing, silencing, and grouping.

Alert Rules

Grafana supports three types of alert rules, each evaluated differently but managed through the same interface:

Grafana-managed rules — Evaluated by the Grafana server itself. Support any configured data source and multi-dimensional alerting with reduce/math expressions. Best for multi-source correlation alerts.
Mimir/Cortex-managed rules — PromQL rules pushed to and evaluated by Mimir Ruler. Benefit from Mimir’s HA evaluation and scale. Best for high-volume metric alerting.
Loki-managed rules — LogQL rules evaluated by Loki Ruler. Alert on log patterns, error rates from logs, and log-derived metrics.

# Grafana-managed alert rule (exported as YAML)
apiVersion: 1
groups:
  - orgId: 1
    name: slo-alerts
    folder: SLO Monitoring
    interval: 1m
    rules:
      - uid: slo-api-availability
        title: "API Availability SLO Burn Rate Critical"
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 3600  # 1 hour
              to: 0
            datasourceUid: mimir-prod
            model:
              expr: slo:burn_rate:1h{service="api-gateway"}
              instant: true
          - refId: B
            relativeTimeRange:
              from: 300   # 5 minutes
              to: 0
            datasourceUid: mimir-prod
            model:
              expr: slo:burn_rate:5m{service="api-gateway"}
              instant: true
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: "$A > 14.4 && $B > 14.4"
        noDataState: OK
        execErrState: Alerting
        for: 2m
        labels:
          severity: critical
          team: platform
          slo: api-availability
        annotations:
          summary: "API availability burning error budget at {{ $values.A }}x rate"
          runbook_url: "https://wiki.internal/runbooks/slo-api-availability"

                            
Recording Rules for Performance: For frequently evaluated expressions, use recording rules to pre-compute results. This reduces query load on your data sources and ensures consistent evaluation. Define recording rules in Mimir/Cortex ruler and reference the recorded metric in your alert rules.
                        

Contact Points

Contact points define where notifications are delivered. Each contact point wraps one or more notification integrations. Grafana supports dozens of integrations natively:

Email — SMTP-based delivery with HTML templates
Slack — Channel messages with rich formatting, buttons, and images
PagerDuty — Events API v2 integration with severity mapping
Microsoft Teams — Adaptive cards via incoming webhooks
Webhooks — Generic HTTP POST for custom integrations
Grafana OnCall — Native integration for escalation workflows
OpsGenie, VictorOps, Telegram, Discord — Additional options

# Contact points provisioning
apiVersion: 1
contactPoints:
  - orgId: 1
    name: platform-team-critical
    receivers:
      - uid: pagerduty-platform
        type: pagerduty
        settings:
          integrationKey: "$PAGERDUTY_ROUTING_KEY"
          severity: critical
          class: "slo-violation"
          component: "{{ .CommonLabels.service }}"
          group: "{{ .CommonLabels.slo }}"
      - uid: slack-incidents
        type: slack
        settings:
          recipient: "#platform-incidents"
          token: "$SLACK_BOT_TOKEN"
          title: |
            {{ if eq .Status "firing" }}🔴{{ else }}✅{{ end }}
            [{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Runbook:* {{ .Annotations.runbook_url }}
            {{ end }}

  - orgId: 1
    name: platform-team-warning
    receivers:
      - uid: slack-warnings
        type: slack
        settings:
          recipient: "#platform-alerts"
          token: "$SLACK_BOT_TOKEN"

Notification Policies

Notification policies form a routing tree that determines which contact point receives which alerts. Alerts enter at the root policy and traverse the tree until they match a child policy’s label matchers. The tree supports grouping (combining related alerts into single notifications), timing controls, and muting.

Notification Policy Routing Tree

flowchart TD
    A[Root Policy<br/>Default: email-admin<br/>group_by: alertname, cluster] --> B{severity=critical?}
    A --> C{team=platform?}
    A --> D{team=backend?}
    B -->|Match| E[Contact: PagerDuty<br/>group_wait: 30s<br/>repeat_interval: 4h]
    C -->|Match| F{severity=warning?}
    C -->|Match| G[Contact: platform-slack<br/>group_wait: 5m<br/>repeat_interval: 12h]
    F -->|Match| H[Contact: platform-tickets<br/>group_wait: 10m<br/>repeat_interval: 24h]
    D -->|Match| I[Contact: backend-oncall<br/>group_wait: 1m<br/>repeat_interval: 4h]

# Notification policies provisioning
apiVersion: 1
policies:
  - orgId: 1
    receiver: email-admin          # Default fallback
    group_by: ['alertname', 'cluster']
    group_wait: 30s                # Wait before first notification
    group_interval: 5m             # Wait between grouped notifications
    repeat_interval: 4h            # Resend if still firing
    routes:
      # Critical alerts → PagerDuty immediately
      - receiver: platform-team-critical
        matchers:
          - severity = critical
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        continue: false

      # Platform team warnings → Slack channel
      - receiver: platform-team-warning
        matchers:
          - team = platform
          - severity = warning
        group_by: ['alertname', 'service']
        group_wait: 5m
        group_interval: 10m
        repeat_interval: 12h

      # Backend team → OnCall integration
      - receiver: backend-oncall
        matchers:
          - team = backend
        group_by: ['alertname', 'namespace']
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 4h
        mute_time_intervals:
          - maintenance-window
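
To see how an alert travels through this tree, the sketch below re-implements the matching in a simplified form. It is not Grafana's routing code: it supports only equality matchers, represents policies as plain dictionaries, and ignores grouping, timing, and mute intervals. Child routes are tried in order, the first match wins unless `continue` is set, and an alert that matches no child falls back to the parent's receiver.

```python
def route(policy: dict, labels: dict) -> list[str]:
    """Return the receivers that would be notified for an alert's labels."""
    receivers: list[str] = []
    for child in policy.get("routes", []):
        matchers = child.get("matchers", {})
        if all(labels.get(name) == value for name, value in matchers.items()):
            receivers.extend(route(child, labels))
            if not child.get("continue", False):
                break  # first match wins; later siblings are not evaluated
    return receivers or [policy["receiver"]]


TREE = {
    "receiver": "email-admin",
    "routes": [
        {"receiver": "platform-team-critical", "matchers": {"severity": "critical"}},
        {"receiver": "platform-team-warning",
         "matchers": {"team": "platform", "severity": "warning"}},
        {"receiver": "backend-oncall", "matchers": {"team": "backend"}},
    ],
}

print(route(TREE, {"team": "platform", "severity": "warning"}))
print(route(TREE, {"team": "backend", "severity": "critical"}))
print(route(TREE, {"team": "data", "severity": "warning"}))
```

The second call is worth noticing: because the critical route comes first and does not continue, a critical alert from the backend team is delivered to `platform-team-critical`, not `backend-oncall`. Route order is part of the policy.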

Key timing parameters explained:

group_wait — How long to buffer alerts before sending the first notification for a new group. Allows related alerts to be batched together (e.g., 30s allows a cascading failure to be reported as one notification rather than 10).
group_interval — Minimum time between notifications for an existing group when new alerts are added. Prevents notification flooding during developing incidents.
repeat_interval — How often to resend a notification for an alert that remains firing. Set this long enough to avoid fatigue but short enough to remind that action is still needed.
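
The interaction of the three timers is easier to follow with a small simulation. The model below is a simplification, assuming a single group whose alerts never resolve: the group is first flushed `group_wait` after its first alert, then re-evaluated every `group_interval`, and a notification is sent on a flush only if new alerts joined the group or `repeat_interval` has passed since the last notification. All times are in seconds and the function name is illustrative.

```python
def notification_times(alert_arrivals, group_wait, group_interval, repeat_interval, horizon):
    """Return the times at which one alert group would send notifications."""
    arrivals = sorted(alert_arrivals)
    flush = arrivals[0] + group_wait  # first flush waits for related alerts
    sent = []
    notified_count = 0
    while flush <= horizon:
        present = sum(1 for t in arrivals if t <= flush)
        changed = present > notified_count
        repeat_due = bool(sent) and flush - sent[-1] >= repeat_interval
        if changed or repeat_due:
            sent.append(flush)
            notified_count = present
        flush += group_interval
    return sent


# Three alerts within 20s, a fourth at 400s; 30s / 5m / 4h timers.
times = notification_times(
    alert_arrivals=[0, 10, 20, 400],
    group_wait=30,
    group_interval=300,
    repeat_interval=4 * 3600,
    horizon=16000,
)
print(times)  # [30, 630, 15030]
```

The first three alerts produce one notification at 30s, the late arrival is reported at the next group flush that sees it (630s), and the reminder arrives four hours after that.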

Silences

Silences suppress notifications for alerts matching specific label criteria during a defined time window. Use silences for planned maintenance, known issues with accepted risk, or noisy alerts pending a fix. Silences do NOT prevent alert evaluation — alerts still fire and appear in the Grafana UI, but notifications are suppressed.

# Create a silence via the Grafana API
curl -X POST "https://grafana.internal/api/alertmanager/grafana/api/v2/silences" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighCPU", "isRegex": false},
      {"name": "cluster", "value": "staging", "isRegex": false}
    ],
    "startsAt": "2026-06-15T22:00:00Z",
    "endsAt": "2026-06-16T06:00:00Z",
    "createdBy": "wasil",
    "comment": "Planned staging cluster maintenance - capacity reduction expected"
  }'
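
When silences are created from scripts, it helps to build the request body in code so that an expiry is always present. The sketch below only constructs the JSON payload shown above and does not send it; the `build_silence` helper and the 48-hour cap are illustrative policy choices, not Grafana requirements.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(hours=48)  # assumed team policy, adjust as needed
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_silence(matchers: dict, starts_at: datetime, duration: timedelta,
                  created_by: str, comment: str) -> dict:
    """Build a silence payload with a mandatory, bounded expiry."""
    if starts_at.tzinfo is None:
        raise ValueError("starts_at must be timezone-aware")
    if not timedelta(0) < duration <= MAX_SILENCE:
        raise ValueError("duration must be positive and at most 48 hours")
    start_utc = starts_at.astimezone(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start_utc.strftime(TIME_FORMAT),
        "endsAt": (start_utc + duration).strftime(TIME_FORMAT),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence(
    {"alertname": "HighCPU", "cluster": "staging"},
    starts_at=datetime(2026, 6, 15, 22, 0, tzinfo=timezone.utc),
    duration=timedelta(hours=8),
    created_by="wasil",
    comment="Planned staging cluster maintenance",
)
print(json.dumps(payload, indent=2))
```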

                            
Silence Hygiene: Never create indefinite silences. Always set an expiration. Review active silences weekly. A common anti-pattern is silencing a noisy alert permanently rather than fixing it — this hides real problems. If an alert needs to be silenced for more than 48 hours, the alert itself should be fixed or deleted.
                        

Groups & Administration

Alert groups display the current state of all alert rule instances grouped by their notification policy labels. The alert state timeline shows historical transitions between Normal, Pending, Firing, and Resolved states, making it easy to identify flapping alerts (rapidly alternating between firing and resolved) and understand alert patterns over time.

Administrative best practices:

Organize rules into folders matching team ownership (e.g., “Platform / SLOs”, “Backend / Service Health”)
Use consistent labeling — team, severity, service, slo on all rules
Set evaluation intervals based on alert urgency (critical: 1m, warning: 5m, info: 15m)
Monitor alerting health — Track grafana_alerting_rule_evaluations_total and grafana_alerting_rule_evaluation_failures_total
Export as code — Use the provisioning API or Terraform to manage rules in Git

Grafana OnCall

Grafana OnCall extends Grafana Alerting with full incident response automation: on-call scheduling, escalation chains, alert grouping, and multi-channel notification. While Grafana Alerting decides when to notify and where, OnCall decides who gets notified, how urgently, and what happens if they don’t respond.

Alert Groups & Routing

OnCall groups related alerts into Alert Groups to prevent notification storms. When multiple alerts fire from the same source, they’re grouped into a single incident-like object that can be acknowledged, silenced, or resolved as a unit. Routing determines which escalation chain handles each incoming alert based on integration and labels.

# OnCall routing rules (configured via UI or Terraform)
# Route based on labels from incoming alerts
routes:
  - integration: grafana-alerting
    routing_rules:
      - condition: "{{ payload.commonLabels.severity == 'critical' }}"
        escalation_chain: critical-page
      - condition: "{{ payload.commonLabels.severity == 'warning' }}"
        escalation_chain: warning-notify
      - condition: "{{ payload.commonLabels.team == 'platform' }}"
        escalation_chain: platform-oncall
    default_escalation_chain: default-notify

    # Grouping: combine alerts with same alertname + service
    grouping:
      type: label
      labels:
        - alertname
        - service
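
Label-based grouping amounts to bucketing alerts by a key built from the chosen labels. The following sketch illustrates the idea with plain dictionaries; it is not OnCall's implementation, and the sample alerts are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_labels=("alertname", "service")):
    """Bucket alerts by the values of the grouping labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Missing labels group together under an empty value.
        key = tuple(alert["labels"].get(label, "") for label in group_labels)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"labels": {"alertname": "HighErrorRate", "service": "api-gateway", "instance": "pod-1"}},
    {"labels": {"alertname": "HighErrorRate", "service": "api-gateway", "instance": "pod-2"}},
    {"labels": {"alertname": "HighErrorRate", "service": "checkout", "instance": "pod-7"}},
]

for key, members in group_alerts(incoming).items():
    print(key, len(members))
```

Two pods of the same service collapse into one alert group, while the same alert name on a different service stays separate, which is the behavior the `grouping` block above describes.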

Inbound Integrations

Inbound integrations define how alerts enter OnCall. Each integration type understands a specific payload format and extracts relevant metadata (title, message, severity, labels) for routing and templating:

Grafana Alerting — Native integration, zero configuration. Alerts flow automatically from Grafana Alerting when OnCall is set as a contact point.
Alertmanager — Compatible with Prometheus Alertmanager webhook format. Use for alerts from external Prometheus/Mimir instances.
Webhook (Generic) — Accept any JSON payload. Define custom templates to extract title, message, and grouping keys.
Inbound Email — Monitors a dedicated email address and parses each message’s subject and body into alert fields. Useful for legacy systems that can only send email alerts.

Notification Templating

OnCall uses Jinja2 templates to format notifications sent through escalation chains. Templates have access to the full alert payload, allowing rich, context-specific messages for each delivery channel (Slack, SMS, phone call, email):

{# OnCall Jinja2 template for Slack notifications #}
{# Title template #}
{% if payload.status == "firing" %}
🔴 {{ payload.commonLabels.alertname }}
{% else %}
✅ [RESOLVED] {{ payload.commonLabels.alertname }}
{% endif %}

{# Message template #}
*Severity:* {{ payload.commonLabels.severity | upper }}
*Service:* {{ payload.commonLabels.service | default("unknown") }}
*Cluster:* {{ payload.commonLabels.cluster | default("N/A") }}

{% if payload.commonAnnotations.summary %}
*Summary:* {{ payload.commonAnnotations.summary }}
{% endif %}

{% if payload.commonAnnotations.runbook_url %}
📖 *Runbook:* {{ payload.commonAnnotations.runbook_url }}
{% endif %}

{% if payload.commonAnnotations.dashboard_url %}
📊 *Dashboard:* {{ payload.commonAnnotations.dashboard_url }}
{% endif %}

*Firing Alerts:* {{ payload.alerts | length }}
{% for alert in payload.alerts[:3] %}
  • {{ alert.labels.instance }}: {{ alert.annotations.description }}
{% endfor %}
{% if payload.alerts | length > 3 %}
  ... and {{ payload.alerts | length - 3 }} more
{% endif %}

Escalation Chains

Escalation chains define the sequence of notification steps, wait times, and conditions for escalating unacknowledged alerts. Each step can notify specific users, schedules, or user groups through configured channels (SMS, phone, Slack, push notification).

Escalation Chain Example: Critical Page

flowchart TD
    A[Alert Received] --> B[Step 1: Notify current on-call<br/>via Slack + Push + SMS]
    B --> C{Acknowledged<br/>within 5 min?}
    C -->|Yes| D[On-call investigates]
    C -->|No| E[Step 2: Notify on-call<br/>via Phone Call]
    E --> F{Acknowledged<br/>within 5 min?}
    F -->|Yes| D
    F -->|No| G[Step 3: Notify secondary<br/>on-call + Team Lead]
    G --> H{Acknowledged<br/>within 10 min?}
    H -->|Yes| D
    H -->|No| I[Step 4: Notify<br/>Engineering Manager]
    I --> J[Declare Incident<br/>Auto-create in Grafana Incident]

# Escalation chain configuration
escalation_chains:
  - name: critical-page
    steps:
      - type: notify_on_call_from_schedule
        schedule: primary-oncall
        notify_via:
          - slack
          - push
          - sms
        important: true    # Uses each user's "important" notification rules

      - type: wait
        duration: 5m

      - type: notify_on_call_from_schedule
        schedule: primary-oncall
        notify_via:
          - phone_call
        important: true

      - type: wait
        duration: 5m

      - type: notify_on_call_from_schedule
        schedule: secondary-oncall
        notify_via:
          - slack
          - push
          - phone_call
      - type: notify_user_group
        group: team-leads
        notify_via:
          - slack

      - type: wait
        duration: 10m

      - type: notify_user_group
        group: engineering-managers
        notify_via:
          - phone_call
      - type: declare_incident
        severity: critical
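It helps to know how long an unacknowledged page takes to reach each tier. The sketch below walks the same notify/wait sequence and reports the minute at which each notification would fire. The `Step` class and `notification_offsets` function are hypothetical helpers for reasoning about the chain, not part of OnCall, and the sketch assumes nobody acknowledges the alert.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str            # "notify" or "wait"
    target: str = ""     # who is notified (labels copied from the chain above)
    minutes: int = 0     # wait duration, only used by "wait" steps


def notification_offsets(steps: list[Step]) -> list[tuple[int, str]]:
    """Return (minutes since alert, target) for every notify step."""
    elapsed = 0
    fired = []
    for step in steps:
        if step.kind == "wait":
            elapsed += step.minutes
        else:
            fired.append((elapsed, step.target))
    return fired


critical_page = [
    Step("notify", "primary-oncall (slack, push, sms)"),
    Step("wait", minutes=5),
    Step("notify", "primary-oncall (phone_call)"),
    Step("wait", minutes=5),
    Step("notify", "secondary-oncall"),
    Step("notify", "team-leads"),
    Step("wait", minutes=10),
    Step("notify", "engineering-managers"),
]

for minute, target in notification_offsets(critical_page):
    print(f"t+{minute:>2} min -> {target}")
```

Under these assumptions the engineering managers are called 20 minutes after the alert arrives. If that is too slow for a critical page, shortening the waits is usually easier than adding more steps.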

Outbound Integrations

Outbound integrations define the channels through which OnCall delivers notifications to responders. Each user configures their personal notification preferences, and the escalation chain’s notify_via field determines which channels are used at each step:

Slack — Direct messages and channel posts with action buttons (Acknowledge, Resolve, Silence)
Microsoft Teams — Adaptive cards with interactive buttons via Bot Framework or incoming webhooks
Telegram — Bot messages with inline keyboard buttons for acknowledgment
SMS — Text messages via Twilio or built-in provider (Grafana Cloud)
Phone Call — Automated voice calls with text-to-speech alert summary and keypad acknowledgment
Push Notifications — Mobile app notifications via Grafana OnCall mobile app
Email — Rich HTML email with full alert context

Schedules & Rotations

OnCall schedules define who is on-call at any given time. Schedules support multiple layers (primary, secondary, shadow), rotations with configurable handoff times, overrides for holidays or swaps, and timezone-aware shifts:

# OnCall schedule configuration
schedules:
  - name: primary-oncall
    type: web
    timezone: America/New_York
    shifts:
      - rotation:
          name: weekly-rotation
          type: rolling_users
          start: "2026-06-01T09:00:00"
          duration: 604800   # 7 days in seconds
          users:
            - alice
            - bob
            - charlie
            - diana
          direction: forward
          frequency: weekly
          handoff_time: "09:00"

      - rotation:
          name: weekend-override
          type: rolling_users
          start: "2026-06-05T18:00:00"  # Friday 6 PM
          duration: 223200   # 62 hours (Fri 6 PM → Mon 8 AM)
          users:
            - alice
            - bob
          frequency: bi-weekly

    overrides:
      - start: "2026-07-04T00:00:00"
        end: "2026-07-05T00:00:00"
        user: bob   # Covering for alice on holiday

  - name: secondary-oncall
    type: web
    timezone: America/New_York
    shifts:
      - rotation:
          name: secondary-weekly
          type: rolling_users
          start: "2026-06-01T09:00:00"
          duration: 604800
          users:
            - bob       # Secondary is next week's primary
            - charlie
            - diana
            - alice
          direction: forward
          frequency: weekly
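A quick way to check a rotation before publishing it is to compute who would be on call at a given moment. The following is a minimal sketch for the primary weekly rotation and its holiday override only. It ignores the weekend layer and time zone conversion, treats all times as schedule-local, and `on_call` is a hypothetical helper rather than an OnCall API.

```python
from datetime import datetime, timedelta

ROTATION_START = datetime(2026, 6, 1, 9, 0)   # Monday 09:00, schedule-local time
SHIFT_LENGTH = timedelta(seconds=604800)       # 7 days, as in the config above
PRIMARY_USERS = ["alice", "bob", "charlie", "diana"]

# (start, end, user) tuples; the end is exclusive
OVERRIDES = [(datetime(2026, 7, 4), datetime(2026, 7, 5), "bob")]


def on_call(at: datetime) -> str:
    """Return the primary on-call user at `at` (naive, schedule-local time)."""
    # Overrides win over the regular rotation
    for start, end, user in OVERRIDES:
        if start <= at < end:
            return user
    if at < ROTATION_START:
        raise ValueError("rotation has not started yet")
    # Whole shifts elapsed since the rotation began
    shift_index = (at - ROTATION_START) // SHIFT_LENGTH
    return PRIMARY_USERS[shift_index % len(PRIMARY_USERS)]


print(on_call(datetime(2026, 6, 8, 9, 0)))    # first handoff
print(on_call(datetime(2026, 7, 3, 12, 0)))   # day before the override
print(on_call(datetime(2026, 7, 4, 12, 0)))   # inside the override window
```

The rotation wraps back to alice in the week of June 29, which is why the July 4 override hands that day to bob.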

                            
Schedule Best Practices: Limit on-call shifts to 7 days maximum. Provide at least 12 hours between shifts. Use a “shadow” schedule to onboard new team members (they observe but don’t get paged). Allow shift swaps through the UI without manager approval. Track on-call burden metrics (pages per shift, sleep interruptions) and redistribute if uneven.
                        

Grafana Incident

Grafana Incident provides structured incident management integrated with Grafana’s observability stack. It bridges the gap between alert notification (OnCall) and post-incident learning (postmortems) by providing a collaborative workspace for incident resolution with automated timeline tracking, role assignment, and artifact collection.

Declaring Incidents

Incidents can be declared manually by engineers, automatically from OnCall escalation chains, or via API integration from external tools. Each incident has a severity level, title, and initial status:

# Declare an incident via Grafana Incident API
curl -X POST "https://grafana.internal/api/plugins/grafana-incident-app/resources/api/v1/IncidentsService.CreateIncident" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "API Gateway returning 503 errors for EU region",
    "severity": "critical",
    "status": "active",
    "labels": [
      {"key": "service", "value": "api-gateway"},
      {"key": "region", "value": "eu-west-1"}
    ],
    "attachCaption": "Initial alert dashboard",
    "attachURL": "https://grafana.internal/d/api-overview?orgId=1&var-region=eu-west-1"
  }'
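For automation that already runs in Python, the same call can be assembled with the standard library. This sketch only builds the request and leaves sending to the caller. The URL, field names, and label shape are copied from the curl example rather than verified against a specific Grafana Incident release, and `build_create_incident_request` is a hypothetical helper name.

```python
import json
import os
import urllib.request

GRAFANA_URL = "https://grafana.internal"  # host taken from the curl example
CREATE_PATH = (
    "/api/plugins/grafana-incident-app/resources/api/v1/"
    "IncidentsService.CreateIncident"
)


def build_create_incident_request(
    title: str, severity: str, labels: dict[str, str], token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request that declares an incident."""
    body = {
        "title": title,
        "severity": severity,
        "status": "active",
        # Label shape mirrors the curl example; check it against your version
        "labels": [{"key": k, "value": v} for k, v in labels.items()],
    }
    return urllib.request.Request(
        GRAFANA_URL + CREATE_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_incident_request(
    title="API Gateway returning 503 errors for EU region",
    severity="critical",
    labels={"service": "api-gateway", "region": "eu-west-1"},
    token=os.environ.get("GRAFANA_TOKEN", "example-token"),
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=10)
print(request.get_method(), request.full_url)
```

Keeping construction separate from sending makes the payload easy to unit test without a live Grafana instance.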

Incident severity levels typically follow this classification:

Reference: Incident Severity Definitions

Severity	Definition	Response	Communication
SEV1 / Critical	Complete service outage or data loss affecting all users	All-hands, IC assigned immediately	Every 15 minutes, status page updated
SEV2 / Major	Significant degradation affecting majority of users	On-call team, IC assigned within 15 min	Every 30 minutes
SEV3 / Minor	Limited impact, workaround available	On-call investigates during business hours	Hourly updates
SEV4 / Low	Cosmetic or minimal impact	Tracked as ticket, no immediate response	As needed
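The communication column translates naturally into a lookup that a bot or reminder job could use. The sketch below assumes the cadences from the table and treats SEV4 ("as needed") as having no fixed interval. `UPDATE_INTERVAL` and `next_update_due` are hypothetical names for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Update cadence per severity, taken from the table above.
# SEV4 is "as needed", so it has no fixed interval.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=1),
    "SEV4": None,
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next stakeholder update is due, or None if there is no fixed cadence."""
    interval = UPDATE_INTERVAL[severity]
    return None if interval is None else last_update + interval


last = datetime(2026, 6, 15, 14, 25)
for sev in UPDATE_INTERVAL:
    print(sev, next_update_due(sev, last))
```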


Workflow & Timelines

Grafana Incident automatically tracks the incident timeline, recording every action, role assignment, status change, and communication. The workflow progresses through defined states:

Incident Lifecycle States

stateDiagram-v2
    [*] --> Declared: Alert triggers / Manual declaration
    Declared --> Investigating: IC assigned, triage begins
    Investigating --> Mitigating: Root cause identified
    Mitigating --> Resolved: User impact eliminated
    Resolved --> Closed: Postmortem complete, actions tracked
    Investigating --> Resolved: False alarm / auto-recovery
    Closed --> [*]
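The lifecycle above is a small state machine, and encoding the allowed transitions makes it easy to reject invalid status changes in tooling built around incidents. This is a sketch of the diagram only, not of Grafana Incident's internal model. `ALLOWED_TRANSITIONS` and `advance` are hypothetical names.

```python
# Allowed transitions, one entry per arrow in the state diagram
ALLOWED_TRANSITIONS = {
    "Declared": {"Investigating"},
    "Investigating": {"Mitigating", "Resolved"},  # Resolved covers false alarms
    "Mitigating": {"Resolved"},
    "Resolved": {"Closed"},
    "Closed": set(),
}


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the diagram has no such arrow."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


state = "Declared"
for nxt in ["Investigating", "Mitigating", "Resolved", "Closed"]:
    state = advance(state, nxt)
    print(state)
```

Trying `advance("Declared", "Resolved")` raises a `ValueError`, because the diagram requires an incident to pass through Investigating first.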

Key workflow features in Grafana Incident:

Role assignment — IC, Technical Lead, Communications Lead assigned from incident UI
Task management — Create and assign tasks within the incident (e.g., “Check EU region load balancer logs”)
Activity feed — Automated timeline of all events, status changes, and manual notes
Artifact attachment — Link dashboards, runbooks, Slack threads, and external URLs
Severity changes — Escalate or de-escalate as understanding evolves
Auto-linking — Connects to the triggering alert group in OnCall
Stakeholder updates — Publish status updates to configured channels

Postmortem Generation

When an incident is resolved, Grafana Incident can auto-generate a postmortem document from the incident timeline. This document includes the chronological sequence of events, roles involved, duration, and severity — pre-populated so the team can focus on analysis rather than reconstruction:

# Auto-generated postmortem structure from Grafana Incident
postmortem:
  incident_id: INC-2026-0142
  title: "API Gateway 503 errors in EU region"
  severity: critical
  duration: 50m
  detected_at: "2026-06-15T14:20:00Z"
  resolved_at: "2026-06-15T15:10:00Z"

  impact:
    users_affected: "~12000"
    error_rate_peak: "23%"
    slo_budget_consumed: "8.2%"

  timeline:
    - time: "14:20:00"
      event: "SLO burn rate alert fires (14.4x)"
      actor: system
    - time: "14:23:00"
      event: "On-call alice acknowledges"
      actor: alice
    - time: "14:25:00"
      event: "Incident declared as SEV1"
      actor: alice
    - time: "14:28:00"
      event: "IC role assigned to bob"
      actor: alice
    - time: "14:35:00"
      event: "Root cause identified: bad config push to EU LB"
      actor: alice
    - time: "14:42:00"
      event: "Config rollback initiated"
      actor: alice
    - time: "15:10:00"
      event: "Error rate returned to baseline, incident resolved"
      actor: bob

  roles:
    incident_commander: bob
    technical_lead: alice
    communications: charlie

  # These sections are filled in during the postmortem meeting
  root_cause: "[To be completed]"
  contributing_factors: "[To be completed]"
  what_went_well: "[To be completed]"
  action_items: "[To be completed]"
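Because the timeline is structured, response metrics such as time to acknowledge can be derived from it instead of being reconstructed by hand. The sketch below copies the timestamps from the example timeline. The event keys and the `minutes_between` helper are hypothetical, and it assumes every event falls on the same day.

```python
from datetime import datetime

# (time, event) pairs copied from the example timeline above
TIMELINE = [
    ("14:20:00", "alert_fired"),
    ("14:23:00", "acknowledged"),
    ("14:25:00", "incident_declared"),
    ("14:35:00", "root_cause_identified"),
    ("14:42:00", "rollback_initiated"),
    ("15:10:00", "resolved"),
]


def minutes_between(timeline, start_event: str, end_event: str) -> int:
    """Whole minutes elapsed between two named events (same-day timestamps assumed)."""
    times = {event: datetime.strptime(t, "%H:%M:%S") for t, event in timeline}
    return int((times[end_event] - times[start_event]).total_seconds() // 60)


print("time to acknowledge:", minutes_between(TIMELINE, "alert_fired", "acknowledged"))
print("time to root cause:", minutes_between(TIMELINE, "alert_fired", "root_cause_identified"))
print("time to resolve:", minutes_between(TIMELINE, "alert_fired", "resolved"))
```

Tracking these numbers across postmortems shows whether detection, diagnosis, or mitigation is the slowest phase, which is where action items tend to pay off most.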

                            
Integrated Incident Workflow: The full Grafana incident management pipeline runs as follows: Alert Rule (detects problem) → Notification Policy (routes to correct team) → Contact Point: OnCall (enters escalation) → Escalation Chain (ensures human response) → Grafana Incident (structured response) → Postmortem (organizational learning). Each component is optional, so teams can adopt the pipeline incrementally.
                        

Summary & Next Steps

Effective incident management is a discipline that spans the entire lifecycle from alert design through post-incident learning. In this guide, we covered:

Alert Philosophy — The critical distinction between being alerted and alarmed, combating alert fatigue through actionability criteria and signal-to-noise optimization
Incident Lifecycle — Preparation (runbooks, templates, escalation paths), execution (triage, roles, communication, mitigation), and learning (blameless postmortems, action items)
SLI/SLO-Based Alerting — Moving beyond threshold alerts to burn-rate alerting with multi-window multi-burn-rate patterns that alert on user impact rather than infrastructure metrics
Grafana Alerting — Alert rules (Grafana-managed, Mimir, Loki), contact points, notification policies with routing trees, silences, and alert administration
Grafana OnCall — Alert groups, inbound integrations, Jinja2 notification templating, escalation chains with multi-step notification, outbound integrations, and schedule rotations
Grafana Incident — Declaring incidents, structured workflow with timelines and roles, auto-generated postmortems, and severity classification

The key principle: alerts exist to protect users, not to report metrics. Every page should represent a genuine threat to user experience that requires immediate human intervention. Everything else belongs in dashboards, tickets, or logs.

Next in the Series

In Part 10: Infrastructure as Code for Observability, we’ll explore managing your entire Grafana stack as code — Terraform providers, Jsonnet/Grafonnet for dashboards, Kubernetes operators, GitOps workflows, and CI/CD pipelines for observability configuration.
