Being Alerted vs Being Alarmed
The distinction between being alerted and being alarmed is fundamental to building sustainable on-call practices. An alert should inform you that something meaningful requires attention. An alarm — the visceral “wake me up at 3 AM” page — should be reserved for situations where immediate human intervention is the only path to preserving user-facing reliability. When teams conflate these two concepts, they create a culture of fatigue, desensitization, and ultimately, missed critical incidents hidden in a flood of noise.
Alert Fatigue
Alert fatigue is the gradual erosion of an engineer’s ability to respond to alerts because they receive too many, too often, with too little actionable signal. Research from PagerDuty’s State of Digital Operations reports consistently shows that teams receiving more than 40 alerts per on-call shift experience significantly degraded response times and higher incident duration.
Common causes of alert fatigue include:
- Cause-based alerts instead of symptom-based ones: alerting on CPU usage rather than on user-visible latency degradation
- Static thresholds — hardcoded values that don’t account for natural traffic patterns (weekday vs weekend, seasonal peaks)
- Redundant alerts — multiple alerts firing for the same root cause (disk full triggers filesystem alert, database alert, and application error alert simultaneously)
- Missing deduplication — the same alert firing once per instance rather than once per incident
- Alerts without owners — alerts created “just in case” that no one knows how to remediate
Signal vs Noise
A high signal-to-noise ratio means that when an engineer receives a notification, they can trust that it represents a genuine problem requiring action. Google’s SRE book recommends targeting a 50% or higher actionable rate — meaning at least half of all pages should require immediate human intervention. The best teams achieve 80%+ actionable rates.
Classify every existing alert into one of four categories to identify what to keep, modify, or delete:
| Category | Action Required? | Time-Sensitive? | Recommendation |
|---|---|---|---|
| Page | Yes | Immediate | Keep as paging alert (P1/P2) |
| Ticket | Yes | Hours/Days | Route to issue tracker, not pager |
| Log | Maybe | No | Record as metric, review in aggregate |
| Delete | No | No | Remove entirely — it’s pure noise |
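The table reduces to two questions per alert. The following is a minimal Python sketch of that decision, assuming a hypothetical string vocabulary for the two columns; the function name and the accepted values are illustrative and not part of any alerting tool.

```python
def classify_alert(action_required: str, time_sensitive: str) -> str:
    """Map the table's two columns to a routing category.

    action_required: "yes", "maybe", or "no"           (hypothetical vocabulary)
    time_sensitive:  "immediate", "hours_days", or "no" (hypothetical vocabulary)
    """
    if action_required == "yes" and time_sensitive == "immediate":
        return "page"    # keep as a paging alert
    if action_required == "yes":
        return "ticket"  # route to the issue tracker, not the pager
    if action_required == "maybe":
        return "log"     # record as a metric, review in aggregate
    return "delete"      # no action possible: pure noise
```

Running every existing alert through a function like this gives a quick inventory of how many pages could be demoted to tickets or removed.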
Actionability Criteria
Every alert that can page a human must pass the actionability test. If an engineer cannot take a meaningful action upon receiving the alert, it should not be a paging alert. The three pillars of an actionable alert are:
- What is broken? — The alert title and annotations must clearly state the user-visible impact
- Why does it matter? — Include the SLO context (e.g., “Error budget consumed 40% in last hour, on track to exhaust within 6 hours”)
- What should I do? — Link to a runbook with specific remediation steps
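The three pillars can be enforced mechanically, for example as a lint step over alert definitions. This is a sketch only: the annotation keys `summary`, `slo_context`, and `runbook_url` are assumed names, so adjust them to whatever your rules actually use.

```python
# Hypothetical annotation keys, one per pillar:
# what is broken / why it matters / what to do.
REQUIRED_ANNOTATIONS = ("summary", "slo_context", "runbook_url")


def missing_actionability_fields(alert: dict) -> list:
    """Return the required annotation keys that are absent or blank."""
    annotations = alert.get("annotations", {})
    return [
        key
        for key in REQUIRED_ANNOTATIONS
        if not str(annotations.get(key, "")).strip()
    ]
```

An alert that returns a non-empty list fails the actionability test and should not be allowed to page.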
Before an Incident
Incident response is 80% preparation and 20% execution. The teams that resolve incidents fastest are not the ones with the best engineers — they’re the ones with the best systems: clear runbooks, practiced communication patterns, defined roles, and tested escalation paths. Preparation transforms chaotic firefighting into methodical problem-solving.
Runbooks & Playbooks
A runbook is a documented procedure for diagnosing and remediating a known failure mode. Every paging alert should link to a runbook. The runbook should be written for the least experienced on-call engineer — clear, step-by-step, with explicit decision trees.
# Example runbook structure (stored alongside alert definitions)
# runbooks/high-error-rate-api-gateway.md
title: "API Gateway Error Rate > 5%"
severity: P2
escalation: platform-team
last_reviewed: 2026-05-01
symptoms:
  - HTTP 5xx responses exceed 5% of total traffic
  - Downstream service health checks failing
  - User-visible errors on checkout flow
diagnosis_steps:
  1. Check which endpoints are affected:
    query: |
      sum by (route)(rate(http_requests_total{status=~"5.."}[5m]))
      / sum by (route)(rate(http_requests_total[5m])) > 0.05
  2. Check if a recent deployment correlates:
    action: "Review deployment timeline in Grafana annotations"
  3. Check downstream dependency health:
    dashboard: "https://grafana.internal/d/deps-health"
mitigation:
  - If single endpoint: Route traffic away with feature flag
  - If recent deploy: Rollback via `kubectl rollout undo`
  - If downstream: Activate circuit breaker configuration
  - If capacity: Scale horizontally via HPA override
communication:
  template: "incident-api-degradation"
  channels: ["#incidents", "#platform-oncall"]
Communication Templates
During an incident, communication must be fast and structured. Pre-defined templates eliminate the cognitive overhead of composing messages under pressure. Define templates for initial notification, status updates, and resolution announcements:
# communication-templates/incident-declaration.yaml
templates:
initial_notification:
title: "[{{ severity }}] {{ title }}"
body: |
🚨 Incident Declared: {{ title }}
Severity: {{ severity }}
Impact: {{ impact_description }}
Incident Commander: {{ commander }}
War Room: {{ war_room_link }}
Status Page: {{ status_page_link }}
Next update in 15 minutes.
status_update:
body: |
📋 Update #{{ update_number }} - {{ title }}
Status: {{ status }}
Current Actions: {{ current_actions }}
ETA to Resolution: {{ eta }}
Next update in {{ next_update_minutes }} minutes.
resolution:
body: |
✅ Resolved: {{ title }}
Duration: {{ duration }}
Root Cause: {{ root_cause_summary }}
User Impact: {{ impact_summary }}
Postmortem scheduled: {{ postmortem_date }}
Escalation Paths
Escalation paths define who to contact when the primary on-call cannot resolve an issue within a defined timeframe. Clear escalation prevents the “hero culture” where one engineer struggles alone while users suffer. Define escalation based on time and severity:
flowchart TD
    A[Alert Fires] --> B{Acknowledged<br/>within 5 min?}
    B -->|Yes| C[Primary On-Call<br/>Investigates]
    B -->|No| D[Escalate to<br/>Secondary On-Call]
    C --> E{Resolved within<br/>30 min?}
    E -->|Yes| F[Close Alert]
    E -->|No| G{Severity >= P2?}
    G -->|Yes| H[Declare Incident<br/>Engage Team Lead]
    G -->|No| I[Create Ticket<br/>Continue Investigation]
    D --> J{Acknowledged<br/>within 5 min?}
    J -->|Yes| C
    J -->|No| K[Escalate to<br/>Engineering Manager]
    H --> L[Incident Commander<br/>Assigned]
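The escalation flow above can also be written as two small decision functions. This Python sketch assumes that priorities are integers (1 for P1, 2 for P2, and so on) and reads "Severity >= P2" as "P1 or P2"; the function names and return strings are hypothetical.

```python
from typing import Optional

ACK_TIMEOUT_MIN = 5       # acknowledgement deadline per on-call tier
RESOLVE_TIMEOUT_MIN = 30  # investigation deadline before escalating further


def who_handles(primary_ack_min: Optional[float],
                secondary_ack_min: Optional[float]) -> str:
    """Pick the responder; None means that tier never acknowledged."""
    if primary_ack_min is not None and primary_ack_min <= ACK_TIMEOUT_MIN:
        return "primary_on_call"
    if secondary_ack_min is not None and secondary_ack_min <= ACK_TIMEOUT_MIN:
        return "secondary_on_call"
    return "engineering_manager"


def after_investigation(minutes_to_resolve: Optional[float],
                        priority: int) -> str:
    """Decide the next step once the 30-minute window is evaluated."""
    if minutes_to_resolve is not None and minutes_to_resolve <= RESOLVE_TIMEOUT_MIN:
        return "close_alert"
    if priority <= 2:  # P1 or P2: lower number means more severe
        return "declare_incident"
    return "create_ticket"
```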
During an Incident
When an incident is declared, the priority shifts from individual problem-solving to coordinated team response. The goal is not to find the root cause immediately — it’s to mitigate user impact as quickly as possible. Root cause analysis comes later, in the postmortem. During the incident, speed of mitigation trumps elegance of solution.
Triage & Assessment
The first 5 minutes of an incident determine its trajectory. Effective triage requires answering four questions rapidly:
- What is the user impact? — Which users, which features, what percentage of traffic?
- What changed? — Deployments, config changes, traffic spikes, upstream/downstream changes?
- Is it getting worse? — Is error rate climbing, stable, or recovering?
- What is the blast radius? — Single service, single region, or global?
Incident Roles
Clearly defined roles prevent chaos. The minimum viable incident team has three roles:
- Incident Commander (IC) — Coordinates response, makes decisions about escalation and communication. Does NOT debug. Keeps the team focused and prevents tunnel vision.
- Technical Lead — Drives diagnosis and mitigation. The most senior engineer available for the affected system. Delegates investigation tasks.
- Communications Lead — Posts status updates to stakeholders, manages the status page, handles customer communication. Frees the IC and Tech Lead from interruptions.
For major incidents (P1/SEV1), add:
- Scribe — Records all actions, decisions, and timestamps in the incident timeline
- Subject Matter Experts (SMEs) — Pulled in as needed based on affected systems
Communication & Coordination
Effective incident communication follows the OODA loop: Observe, Orient, Decide, Act. The Incident Commander drives this loop with regular check-ins every 10–15 minutes, asking: “What do we know now? What are we trying? What should we try next?”
Key communication principles during incidents:
- Use a dedicated channel — Never mix incident coordination with general team chat
- Timestamp everything — “14:32 UTC: Rolled back deployment v2.4.1 to v2.4.0”
- State hypotheses explicitly — “Hypothesis: The new Redis connection pool config is causing timeouts”
- Announce actions before taking them — Prevents duplicate or conflicting remediation attempts
- Regular external updates — Even “We’re still investigating” is better than silence
Mitigation Strategies
Mitigation is about restoring service, not fixing the underlying bug. Common mitigation patterns ordered by speed:
- Rollback — Revert the most recent change (fastest if a deployment caused the issue)
- Feature flag — Disable the affected feature path
- Traffic shifting — Route traffic away from the affected region/instance
- Scale out — Add capacity if the issue is load-related
- Circuit breaker — Isolate the failing dependency with fallback behavior
- Restart — Clear corrupted state (last resort, masks underlying issues)
After an Incident
The postmortem is where incidents become organizational improvements. Without effective post-incident review, teams are doomed to repeat failures. The goal is not to assign blame — it’s to understand the systemic conditions that allowed the failure and to build defenses against recurrence.
Blameless Postmortems
A blameless postmortem recognizes that in complex systems, failures are emergent properties of the system design, not individual negligence. The engineer who deployed the breaking change was acting rationally given the information available to them. The question is: why did the system allow a harmful change to reach production? A postmortem document typically contains the following sections:
- Summary — One paragraph: what happened, duration, impact
- Timeline — Minute-by-minute chronology from first signal to resolution
- Root Cause — Technical explanation using the “5 Whys” technique
- Contributing Factors — What made detection/response slower than ideal
- Impact — Quantified user impact (requests failed, revenue lost, SLO budget consumed)
- What Went Well — What worked in the response (celebrate good practice)
- Action Items — Specific, assigned, time-bound improvements
- Lessons Learned — Insights applicable beyond this specific incident
Action Items & Follow-Up
Action items from postmortems must be specific, assigned, prioritized, and tracked. Vague action items like “improve monitoring” never get completed. Effective action items look like:
- “Add circuit breaker to payment service Redis connection (assigned: @alice, P2, due: 2026-07-01)”
- “Create alert for connection pool exhaustion at 80% capacity (assigned: @bob, P2, due: 2026-06-25)”
- “Add deployment canary analysis gate requiring 5-minute error rate check (assigned: @charlie, P1, due: 2026-06-20)”
Track postmortem action item completion rate as a team metric. If fewer than 80% of action items are completed within their deadline, the postmortem process needs strengthening (dedicated time, management backing, or fewer but more impactful items).
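As an illustration of the metric, here is a small Python sketch that computes the on-time completion rate from a list of due and completion dates. The tuple layout is an assumption made for this example; in practice the data would come from your issue tracker.

```python
from datetime import date
from typing import List, Optional, Tuple

# Each item is (due_date, completed_date); completed_date is None if still open.
ActionItem = Tuple[date, Optional[date]]


def on_time_completion_rate(items: List[ActionItem]) -> float:
    """Fraction of action items completed on or before their due date."""
    if not items:
        raise ValueError("no action items to evaluate")
    on_time = sum(1 for due, done in items if done is not None and done <= due)
    return on_time / len(items)


def needs_strengthening(items: List[ActionItem], target: float = 0.8) -> bool:
    """True when the completion rate falls below the 80% target."""
    return on_time_completion_rate(items) < target
```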
Organizational Learning
Individual postmortems create local learning. To achieve organizational learning, share postmortems broadly, run periodic “failure review” meetings where teams present their most interesting incidents, and look for systemic patterns across incidents. Common patterns include: inadequate testing for failure modes, missing observability in new services, and deployment pipelines that lack safety gates.
Writing Great Alerts Using SLIs & SLOs
Traditional threshold-based alerting (“alert if CPU > 80%”) is fundamentally flawed because it measures causes rather than symptoms. Users don’t care about your CPU utilization — they care about latency, availability, and correctness. SLI/SLO-based alerting inverts this model: alert when the user experience degrades, regardless of the underlying cause.
SLI-Based Alerts
A Service Level Indicator (SLI) is a quantitative measure of user-perceived service quality. The most common SLIs are:
- Availability: proportion of requests that succeed, `successful_requests / total_requests`
- Latency: proportion of requests faster than a threshold, `requests_under_300ms / total_requests`
- Correctness: proportion of requests returning correct results
A Service Level Objective (SLO) sets a target for the SLI over a rolling window (typically 30 days). For example: “99.9% of requests will succeed over a 30-day rolling window.” This gives you an error budget of 0.1%: approximately 43 minutes of complete downtime per month, or a sustained 1% error rate for ~72 hours (a 1% error rate spends the budget 10× faster than the SLO allows, and 720 hours / 10 = 72 hours).
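The error-budget arithmetic is easy to check in code. This Python sketch uses hypothetical helper names and assumes traffic is spread evenly across the window.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of complete downtime the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def allowed_failures(total_requests: int, slo_target: float) -> int:
    """Failed requests the SLO tolerates for a given request volume."""
    return round(total_requests * (1 - slo_target))


print(f"{error_budget_minutes(0.999):.1f} minutes")     # 99.9% over 30 days
print(allowed_failures(1_000_000, 0.999), "failed requests per million")
```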
# SLO definition for API availability
slo:
  name: api-availability
  description: "API requests returning non-5xx responses"
  sli:
    events:
      good: sum(rate(http_requests_total{status!~"5.."}[5m]))
      total: sum(rate(http_requests_total[5m]))
  objective: 0.999 # 99.9%
  window: 30d
  alerting:
    burn_rate:
      - severity: critical
        short_window: 5m
        long_window: 1h
        factor: 14.4
      - severity: warning
        short_window: 30m
        long_window: 6h
        factor: 6
Burn-Rate Alerting
Burn rate measures how quickly you’re consuming your error budget relative to a sustainable rate. A burn rate of 1 means you’ll exactly exhaust your budget by the end of the window. A burn rate of 14.4 means you’ll exhaust your entire 30-day budget in about 50 hours, roughly 2 days (30 days × 24 hours / 14.4 = 50 hours). Put differently, one hour at a 14.4× burn rate consumes 2% of the 30-day budget.
burn_rate = (1 - SLI_current) / (1 - SLO_target). For a 99.9% SLO with current availability of 99.0%: burn_rate = (1 - 0.990) / (1 - 0.999) = 0.010 / 0.001 = 10. This means you’re consuming error budget 10× faster than sustainable.
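The formula translates directly into code. This is a minimal Python sketch with hypothetical function names; it assumes a 30-day window unless told otherwise.

```python
def burn_rate(sli_current: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (1 - sli_current) / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # no errors: the budget is never exhausted
    return window_days * 24 / rate


rate = burn_rate(sli_current=0.990, slo_target=0.999)
print(f"burn rate {rate:.1f}x, budget lasts {hours_to_exhaustion(rate):.0f} hours")
```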
# Prometheus recording rules for burn-rate calculation
groups:
  - name: slo-burn-rates
    interval: 30s
    rules:
      # Error ratio over different windows
      - record: slo:error_ratio:rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          )
      # 30m window: short window for the 6x warning alert
      - record: slo:error_ratio:rate30m
        expr: |
          1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          )
      - record: slo:error_ratio:rate1h
        expr: |
          1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )
      - record: slo:error_ratio:rate6h
        expr: |
          1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          )
      # Burn rates (target SLO: 99.9% = 0.001 error budget)
      - record: slo:burn_rate:5m
        expr: slo:error_ratio:rate5m / 0.001
      - record: slo:burn_rate:30m
        expr: slo:error_ratio:rate30m / 0.001
      - record: slo:burn_rate:1h
        expr: slo:error_ratio:rate1h / 0.001
      - record: slo:burn_rate:6h
        expr: slo:error_ratio:rate6h / 0.001
Multi-Window Multi-Burn-Rate
Single-window burn-rate alerts suffer from either being too slow (long window) or too noisy (short window). The multi-window multi-burn-rate approach combines a short window for freshness with a long window for significance. An alert fires only when both windows exceed their respective thresholds, dramatically reducing false positives while maintaining fast detection.
flowchart TD
    A[Evaluate Short Window<br/>e.g., 5 min burn rate] --> B{"Short Window<br/>> Threshold?"}
    B -->|No| C[No Alert<br/>No current burn, or already recovered]
    B -->|Yes| D[Evaluate Long Window<br/>e.g., 1 hour burn rate]
    D --> E{"Long Window<br/>> Threshold?"}
    E -->|No| F[No Alert<br/>Brief spike, not yet significant]
    E -->|Yes| G[FIRE ALERT<br/>Sustained budget burn]
# Multi-window multi-burn-rate alert rules
# Note: Prometheus alert templates have no arithmetic functions, so the
# time to exhaustion is stated for the threshold rate instead of computed.
groups:
  - name: slo-alerts
    rules:
      # Critical: 14.4x burn rate over 1h AND 5m (exhausts budget in ~2 days)
      - alert: SLOBurnRateCritical
        expr: |
          slo:burn_rate:1h > 14.4
          and
          slo:burn_rate:5m > 14.4
        for: 2m
        labels:
          severity: critical
          slo: api-availability
        annotations:
          summary: "API availability SLO burn rate critical"
          description: |
            1h error budget burn rate is {{ $value | printf "%.1f" }}x.
            A sustained 14.4x burn exhausts the 30-day error budget in about 50 hours.
          runbook_url: "https://wiki.internal/runbooks/api-availability"
          dashboard_url: "https://grafana.internal/d/slo-overview"
      # Warning: 6x burn rate over 6h AND 30m (exhausts budget in ~5 days)
      - alert: SLOBurnRateWarning
        expr: |
          slo:burn_rate:6h > 6
          and
          slo:burn_rate:30m > 6
        for: 5m
        labels:
          severity: warning
          slo: api-availability
        annotations:
          summary: "API availability SLO burn rate elevated"
          description: |
            6h error budget burn rate is {{ $value | printf "%.1f" }}x.
            A sustained 6x burn exhausts the 30-day error budget in about 5 days.
| Severity | Long Window | Short Window | Burn Rate | Budget Exhaustion |
|---|---|---|---|---|
| Critical (Page) | 1 hour | 5 minutes | 14.4× | ~2 days |
| Warning (Page) | 6 hours | 30 minutes | 6× | ~5 days |
| Info (Ticket) | 3 days | 6 hours | 1× | 30 days |
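To see where the table's numbers come from, the following Python sketch computes how much of the 30-day budget is already spent when each tier fires, and evaluates the two-window condition. The function names are hypothetical, and the thresholds are taken from the table.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def budget_consumed(burn_rate: float, long_window_hours: float) -> float:
    """Fraction of the total budget spent over the long window at this rate."""
    return burn_rate * long_window_hours / WINDOW_HOURS


def should_fire(short_burn: float, long_burn: float, threshold: float) -> bool:
    """Fire only when BOTH windows exceed the tier's burn-rate threshold."""
    return short_burn > threshold and long_burn > threshold


# (severity, long window in hours, burn-rate threshold) from the table
for severity, hours, rate in [("critical", 1, 14.4), ("warning", 6, 6), ("info", 72, 1)]:
    print(f"{severity}: {budget_consumed(rate, hours):.0%} of budget spent at detection")
```

Requiring both windows means a short spike cannot page on its own, and an incident that has already recovered stops alerting as soon as the short window drops below the threshold.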
Grafana Alerting
Grafana’s unified alerting system provides a single pane of glass for managing alert rules across all data sources. Whether your rules are evaluated by Grafana itself, by Mimir Ruler, or by Loki Ruler, they all appear in the same UI with consistent notification routing, silencing, and grouping.
Alert Rules
Grafana supports three types of alert rules, each evaluated differently but managed through the same interface:
- Grafana-managed rules — Evaluated by the Grafana server itself. Support any configured data source and multi-dimensional alerting with reduce/math expressions. Best for multi-source correlation alerts.
- Mimir/Cortex-managed rules — PromQL rules pushed to and evaluated by Mimir Ruler. Benefit from Mimir’s HA evaluation and scale. Best for high-volume metric alerting.
- Loki-managed rules — LogQL rules evaluated by Loki Ruler. Alert on log patterns, error rates from logs, and log-derived metrics.
# Grafana-managed alert rule (exported as YAML)
apiVersion: 1
groups:
- orgId: 1
name: slo-alerts
folder: SLO Monitoring
interval: 1m
rules:
- uid: slo-api-availability
title: "API Availability SLO Burn Rate Critical"
condition: C
data:
- refId: A
relativeTimeRange:
from: 3600 # 1 hour
to: 0
datasourceUid: mimir-prod
model:
expr: slo:burn_rate:1h{service="api-gateway"}
instant: true
- refId: B
relativeTimeRange:
from: 300 # 5 minutes
to: 0
datasourceUid: mimir-prod
model:
expr: slo:burn_rate:5m{service="api-gateway"}
instant: true
- refId: C
datasourceUid: __expr__
model:
type: math
expression: "$A > 14.4 && $B > 14.4"
noDataState: OK
execErrState: Alerting
for: 2m
labels:
severity: critical
team: platform
slo: api-availability
annotations:
summary: "API availability burning error budget at {{ $values.A }}x rate"
runbook_url: "https://wiki.internal/runbooks/slo-api-availability"
Contact Points
Contact points define where notifications are delivered. Each contact point wraps one or more notification integrations. Grafana supports dozens of integrations natively:
- Email — SMTP-based delivery with HTML templates
- Slack — Channel messages with rich formatting, buttons, and images
- PagerDuty — Events API v2 integration with severity mapping
- Microsoft Teams — Adaptive cards via incoming webhooks
- Webhooks — Generic HTTP POST for custom integrations
- Grafana OnCall — Native integration for escalation workflows
- OpsGenie, VictorOps, Telegram, Discord — Additional options
# Contact points provisioning
apiVersion: 1
contactPoints:
- orgId: 1
name: platform-team-critical
receivers:
- uid: pagerduty-platform
type: pagerduty
settings:
integrationKey: "$PAGERDUTY_ROUTING_KEY"
severity: critical
class: "slo-violation"
component: "{{ .CommonLabels.service }}"
group: "{{ .CommonLabels.slo }}"
- uid: slack-incidents
type: slack
settings:
recipient: "#platform-incidents"
token: "$SLACK_BOT_TOKEN"
title: |
{{ if eq .Status "firing" }}🔴{{ else }}✅{{ end }}
[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
- orgId: 1
name: platform-team-warning
receivers:
- uid: slack-warnings
type: slack
settings:
recipient: "#platform-alerts"
token: "$SLACK_BOT_TOKEN"
Notification Policies
Notification policies form a routing tree that determines which contact point receives which alerts. Alerts enter at the root policy and traverse the tree until they match a child policy’s label matchers. The tree supports grouping (combining related alerts into single notifications), timing controls, and muting.
flowchart TD
    A[Root Policy<br/>Default: email-admin<br/>group_by: alertname, cluster] --> B{severity=critical?}
    A --> C{team=platform?}
    A --> D{team=backend?}
    B -->|Match| E[Contact: PagerDuty<br/>group_wait: 30s<br/>repeat_interval: 4h]
    C -->|Match| F{severity=warning?}
    C -->|Match| G[Contact: platform-slack<br/>group_wait: 5m<br/>repeat_interval: 12h]
    F -->|Match| H[Contact: platform-tickets<br/>group_wait: 10m<br/>repeat_interval: 24h]
    D -->|Match| I[Contact: backend-oncall<br/>group_wait: 1m<br/>repeat_interval: 4h]
# Notification policies provisioning
apiVersion: 1
policies:
- orgId: 1
receiver: email-admin # Default fallback
group_by: ['alertname', 'cluster']
group_wait: 30s # Wait before first notification
group_interval: 5m # Wait between grouped notifications
repeat_interval: 4h # Resend if still firing
routes:
# Critical alerts → PagerDuty immediately
- receiver: platform-team-critical
matchers:
- severity = critical
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
continue: false
# Platform team warnings → Slack channel
- receiver: platform-team-warning
matchers:
- team = platform
- severity = warning
group_by: ['alertname', 'service']
group_wait: 5m
group_interval: 10m
repeat_interval: 12h
# Backend team → OnCall integration
- receiver: backend-oncall
matchers:
- team = backend
group_by: ['alertname', 'namespace']
group_wait: 1m
group_interval: 5m
repeat_interval: 4h
mute_time_intervals:
- maintenance-window
Key timing parameters explained:
- group_wait — How long to buffer alerts before sending the first notification for a new group. Allows related alerts to be batched together (e.g., 30s allows a cascading failure to be reported as one notification rather than 10).
- group_interval — Minimum time between notifications for an existing group when new alerts are added. Prevents notification flooding during developing incidents.
- repeat_interval — How often to resend a notification for an alert that remains firing. Set this long enough to avoid fatigue but short enough to remind that action is still needed.
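To make the routing tree concrete, here is a simplified Python model of how an alert's labels select a receiver from the provisioned policy. It assumes equality matchers only and first-match-wins ordering (every route behaving as `continue: false`). It is a sketch of the matching logic, not Grafana's implementation.

```python
# Routes mirror the provisioning example, in the same order.
ROUTES = [
    {"receiver": "platform-team-critical", "matchers": {"severity": "critical"}},
    {"receiver": "platform-team-warning",
     "matchers": {"team": "platform", "severity": "warning"}},
    {"receiver": "backend-oncall", "matchers": {"team": "backend"}},
]


def pick_receiver(labels: dict, routes=ROUTES, default: str = "email-admin") -> str:
    """Return the receiver of the first route whose matchers all match."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route["matchers"].items()):
            return route["receiver"]
    return default  # root policy fallback


print(pick_receiver({"team": "backend", "severity": "critical"}))
```

Route order matters: a critical backend alert matches the severity route first and never reaches the backend route.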
Silences
Silences suppress notifications for alerts matching specific label criteria during a defined time window. Use silences for planned maintenance, known issues with accepted risk, or noisy alerts pending a fix. Silences do NOT prevent alert evaluation — alerts still fire and appear in the Grafana UI, but notifications are suppressed.
# Create a silence via the Grafana API
curl -X POST "https://grafana.internal/api/alertmanager/grafana/api/v2/silences" \
-H "Authorization: Bearer $GRAFANA_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{"name": "alertname", "value": "HighCPU", "isRegex": false},
{"name": "cluster", "value": "staging", "isRegex": false}
],
"startsAt": "2026-06-15T22:00:00Z",
"endsAt": "2026-06-16T06:00:00Z",
"createdBy": "wasil",
"comment": "Planned staging cluster maintenance - capacity reduction expected"
}'
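The matching semantics of a silence can be sketched in a few lines of Python: every matcher must match, and the current time must fall inside the window. This is an illustrative model with a hypothetical function name, using the same field names as the JSON payload above; regex matchers are treated as fully anchored.

```python
import re
from datetime import datetime, timezone


def is_silenced(labels: dict, silence: dict, now: datetime) -> bool:
    """True if the silence is active at `now` and all matchers match."""
    if not (silence["startsAt"] <= now < silence["endsAt"]):
        return False
    for matcher in silence["matchers"]:
        actual = labels.get(matcher["name"], "")
        if matcher["isRegex"]:
            if re.fullmatch(matcher["value"], actual) is None:
                return False
        elif actual != matcher["value"]:
            return False
    return True


SILENCE = {
    "matchers": [
        {"name": "alertname", "value": "HighCPU", "isRegex": False},
        {"name": "cluster", "value": "staging", "isRegex": False},
    ],
    "startsAt": datetime(2026, 6, 15, 22, 0, tzinfo=timezone.utc),
    "endsAt": datetime(2026, 6, 16, 6, 0, tzinfo=timezone.utc),
}
```

Because both matchers must match, the same `HighCPU` alert from a production cluster still notifies during the maintenance window.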
Groups & Administration
Alert groups display the current state of all alert rule instances grouped by their notification policy labels. The alert state timeline shows historical transitions between Normal, Pending, Firing, and Resolved states, making it easy to identify flapping alerts (rapidly alternating between firing and resolved) and understand alert patterns over time.
Administrative best practices:
- Organize rules into folders matching team ownership (e.g., “Platform / SLOs”, “Backend / Service Health”)
- Use consistent labeling: `team`, `severity`, `service`, `slo` on all rules
- Set evaluation intervals based on alert urgency (critical: 1m, warning: 5m, info: 15m)
- Monitor alerting health: track `grafana_alerting_rule_evaluations_total` and `grafana_alerting_rule_evaluation_failures_total`
- Export as code: use the provisioning API or Terraform to manage rules in Git
Grafana OnCall
Grafana OnCall extends Grafana Alerting with full incident response automation: on-call scheduling, escalation chains, alert grouping, and multi-channel notification. While Grafana Alerting decides when to notify and where, OnCall decides who gets notified, how urgently, and what happens if they don’t respond.
Alert Groups & Routing
OnCall groups related alerts into Alert Groups to prevent notification storms. When multiple alerts fire from the same source, they’re grouped into a single incident-like object that can be acknowledged, silenced, or resolved as a unit. Routing determines which escalation chain handles each incoming alert based on integration and labels.
# OnCall routing rules (configured via UI or Terraform)
# Route based on labels from incoming alerts
routes:
- integration: grafana-alerting
routing_rules:
- condition: "{{ payload.labels.severity == 'critical' }}"
escalation_chain: critical-page
- condition: "{{ payload.labels.severity == 'warning' }}"
escalation_chain: warning-notify
- condition: "{{ payload.labels.team == 'platform' }}"
escalation_chain: platform-oncall
default_escalation_chain: default-notify
# Grouping: combine alerts with same alertname + service
grouping:
type: label
labels:
- alertname
- service
Inbound Integrations
Inbound integrations define how alerts enter OnCall. Each integration type understands a specific payload format and extracts relevant metadata (title, message, severity, labels) for routing and templating:
- Grafana Alerting — Native integration, zero configuration. Alerts flow automatically from Grafana Alerting when OnCall is set as a contact point.
- Alertmanager — Compatible with Prometheus Alertmanager webhook format. Use for alerts from external Prometheus/Mimir instances.
- Webhook (Generic) — Accept any JSON payload. Define custom templates to extract title, message, and grouping keys.
- Inbound Email: monitors a dedicated email address and parses each message's subject and body into alert fields. Useful for legacy systems that can only send email alerts.
Notification Templating
OnCall uses Jinja2 templates to format notifications sent through escalation chains. Templates have access to the full alert payload, allowing rich, context-specific messages for each delivery channel (Slack, SMS, phone call, email):
// OnCall Jinja2 template for Slack notifications
// Title template
{% if payload.status == "firing" %}
🔴 {{ payload.commonLabels.alertname }}
{% else %}
✅ [RESOLVED] {{ payload.commonLabels.alertname }}
{% endif %}
// Message template
*Severity:* {{ payload.commonLabels.severity | upper }}
*Service:* {{ payload.commonLabels.service | default("unknown") }}
*Cluster:* {{ payload.commonLabels.cluster | default("N/A") }}
{% if payload.commonAnnotations.summary %}
*Summary:* {{ payload.commonAnnotations.summary }}
{% endif %}
{% if payload.commonAnnotations.runbook_url %}
📖 *Runbook:* {{ payload.commonAnnotations.runbook_url }}
{% endif %}
{% if payload.commonAnnotations.dashboard_url %}
📊 *Dashboard:* {{ payload.commonAnnotations.dashboard_url }}
{% endif %}
*Firing Alerts:* {{ payload.alerts | length }}
{% for alert in payload.alerts[:3] %}
• {{ alert.labels.instance }}: {{ alert.annotations.description }}
{% endfor %}
{% if payload.alerts | length > 3 %}
... and {{ payload.alerts | length - 3 }} more
{% endif %}
Escalation Chains
Escalation chains define the sequence of notification steps, wait times, and conditions for escalating unacknowledged alerts. Each step can notify specific users, schedules, or user groups through configured channels (SMS, phone, Slack, push notification).
flowchart TD
    A[Alert Received] --> B[Step 1: Notify current on-call<br/>via Slack + Push + SMS]
    B --> C{Acknowledged<br/>within 5 min?}
    C -->|Yes| D[On-call investigates]
    C -->|No| E[Step 2: Notify on-call<br/>via Phone Call]
    E --> F{Acknowledged<br/>within 5 min?}
    F -->|Yes| D
    F -->|No| G[Step 3: Notify secondary<br/>on-call + Team Lead]
    G --> H{Acknowledged<br/>within 10 min?}
    H -->|Yes| D
    H -->|No| I[Step 4: Notify<br/>Engineering Manager]
    I --> J[Declare Incident<br/>Auto-create in Grafana Incident]
# Escalation chain configuration
escalation_chains:
- name: critical-page
steps:
- type: notify_on_call_from_schedule
schedule: primary-oncall
notify_via:
- slack
- push
- sms
important: true # Bypasses user DND settings
- type: wait
duration: 5m
- type: notify_on_call_from_schedule
schedule: primary-oncall
notify_via:
- phone_call
important: true
- type: wait
duration: 5m
- type: notify_on_call_from_schedule
schedule: secondary-oncall
notify_via:
- slack
- push
- phone_call
- type: notify_user_group
group: team-leads
notify_via:
- slack
- type: wait
duration: 10m
- type: notify_user_group
group: engineering-managers
notify_via:
- phone_call
- type: declare_incident
severity: critical
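A quick way to sanity-check a chain is to compute when each step would happen if nobody acknowledges. The Python sketch below models the `critical-page` chain as a list of hypothetical `("notify", target)` and `("wait", minutes)` tuples; it is an illustration of the timing only, not the OnCall data model.

```python
CRITICAL_PAGE = [
    ("notify", "primary-oncall (slack, push, sms)"),
    ("wait", 5),
    ("notify", "primary-oncall (phone_call)"),
    ("wait", 5),
    ("notify", "secondary-oncall (slack, push, phone_call)"),
    ("notify", "team-leads (slack)"),
    ("wait", 10),
    ("notify", "engineering-managers (phone_call)"),
    ("notify", "declare_incident"),
]


def notification_timeline(steps):
    """Return (minutes since alert, target) for each notify step, unacknowledged."""
    elapsed = 0
    timeline = []
    for kind, value in steps:
        if kind == "wait":
            elapsed += value
        else:
            timeline.append((elapsed, value))
    return timeline


for minute, target in notification_timeline(CRITICAL_PAGE):
    print(f"T+{minute:>2} min: {target}")
```

For this chain, an unacknowledged critical alert reaches engineering managers and becomes an incident 20 minutes after it fires.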
Outbound Integrations
Outbound integrations define the channels through which OnCall delivers notifications to responders. Each user configures their personal notification preferences, and the escalation chain’s notify_via field determines which channels are used at each step:
- Slack — Direct messages and channel posts with action buttons (Acknowledge, Resolve, Silence)
- Microsoft Teams — Adaptive cards with interactive buttons via Bot Framework or incoming webhooks
- Telegram — Bot messages with inline keyboard buttons for acknowledgment
- SMS — Text messages via Twilio or built-in provider (Grafana Cloud)
- Phone Call — Automated voice calls with text-to-speech alert summary and keypad acknowledgment
- Push Notifications — Mobile app notifications via Grafana OnCall mobile app
- Email — Rich HTML email with full alert context
Schedules & Rotations
OnCall schedules define who is on-call at any given time. Schedules support multiple layers (primary, secondary, shadow), rotations with configurable handoff times, overrides for holidays or swaps, and timezone-aware shifts:
# OnCall schedule configuration
schedules:
- name: primary-oncall
type: web
timezone: America/New_York
shifts:
- rotation:
name: weekly-rotation
type: rolling_users
start: "2026-06-01T09:00:00"
duration: 604800 # 7 days in seconds
users:
- alice
- bob
- charlie
- diana
direction: forward
frequency: weekly
handoff_time: "09:00"
- rotation:
name: weekend-override
type: rolling_users
start: "2026-06-05T18:00:00" # Friday 6 PM
duration: 223200 # 62 hours (Fri 6PM → Mon 8AM)
users:
- alice
- bob
frequency: bi-weekly
overrides:
- start: "2026-07-04T00:00:00"
end: "2026-07-05T00:00:00"
user: bob # Covering for alice on holiday
- name: secondary-oncall
type: web
timezone: America/New_York
shifts:
- rotation:
name: secondary-weekly
type: rolling_users
start: "2026-06-01T09:00:00"
duration: 604800
users:
- bob # Secondary is next week's primary
- charlie
- diana
- alice
direction: forward
frequency: weekly
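The core of a rolling rotation is modular arithmetic over the user list. This Python sketch resolves who is on call at a given moment for a simple weekly rotation; it deliberately ignores overrides, layers, and timezone conversion, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def on_call(users, rotation_start: datetime, shift_length: timedelta,
            at: datetime) -> str:
    """Return the user whose shift covers `at` in a forward rolling rotation."""
    if at < rotation_start:
        raise ValueError("requested time is before the rotation starts")
    shifts_elapsed = (at - rotation_start) // shift_length  # whole shifts so far
    return users[shifts_elapsed % len(users)]


start = datetime(2026, 6, 1, 9, 0)
week = timedelta(seconds=604800)  # 7 days, as in the schedule above
primary = ["alice", "bob", "charlie", "diana"]
secondary = ["bob", "charlie", "diana", "alice"]

moment = datetime(2026, 6, 10, 12, 0)
print(on_call(primary, start, week, moment), on_call(secondary, start, week, moment))
```

Offsetting the secondary list by one position keeps the secondary distinct from the primary in every shift, and makes each week's secondary the following week's primary.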
Grafana Incident
Grafana Incident provides structured incident management integrated with Grafana’s observability stack. It bridges the gap between alert notification (OnCall) and post-incident learning (postmortems) by providing a collaborative workspace for incident resolution with automated timeline tracking, role assignment, and artifact collection.
Declaring Incidents
Incidents can be declared manually by engineers, automatically from OnCall escalation chains, or via API integration from external tools. Each incident has a severity level, title, and initial status:
# Declare an incident via Grafana Incident API
curl -X POST "https://grafana.internal/api/plugins/grafana-incident-app/resources/api/v1/IncidentsService.CreateIncident" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "API Gateway returning 503 errors for EU region",
    "severity": "critical",
    "status": "active",
    "labels": [
      {"key": "service", "value": "api-gateway"},
      {"key": "region", "value": "eu-west-1"}
    ],
    "attachCaption": "Initial alert dashboard",
    "attachURL": "https://grafana.internal/d/api-overview?orgId=1&var-region=eu-west-1"
  }'
Incident severity levels typically follow this classification:
| Severity | Definition | Response | Communication |
|---|---|---|---|
| SEV1 / Critical | Complete service outage or data loss affecting all users | All-hands, IC assigned immediately | Every 15 minutes, status page updated |
| SEV2 / Major | Significant degradation affecting a majority of users | On-call team, IC assigned within 15 min | Every 30 minutes |
| SEV3 / Minor | Limited impact, workaround available | On-call investigates during business hours | Hourly updates |
| SEV4 / Low | Cosmetic or minimal impact | Tracked as ticket, no immediate response | As needed |
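The communication cadence in the table can be encoded so that tooling reminds the communications lead when the next update is due. The sketch below is illustrative only: the `UPDATE_INTERVAL` mapping and `next_update_due` helper are hypothetical names, not part of Grafana Incident, and the intervals are taken directly from the table.

```python
from datetime import datetime, timedelta

# Update cadence per severity, copied from the table above.
# None represents "as needed" (no fixed schedule).
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=1),
    "SEV4": None,
}


def next_update_due(severity, last_update):
    """Return when the next stakeholder update is due, or None if unscheduled."""
    interval = UPDATE_INTERVAL[severity]
    if interval is None:
        return None
    return last_update + interval


# A SEV1 update posted at 14:25 means the next one is due at 14:40.
due = next_update_due("SEV1", datetime(2026, 6, 15, 14, 25))
```

Keeping the cadence in one mapping means a severity change mid-incident only requires looking up a different key.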
Workflow & Timelines
Grafana Incident automatically tracks the incident timeline, recording every action, role assignment, status change, and communication. The workflow progresses through defined states:
stateDiagram-v2
[*] --> Declared: Alert triggers / Manual declaration
Declared --> Investigating: IC assigned, triage begins
Investigating --> Mitigating: Root cause identified
Mitigating --> Resolved: User impact eliminated
Resolved --> Closed: Postmortem complete, actions tracked
Investigating --> Resolved: False alarm / auto-recovery
Closed --> [*]
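The state diagram above can also be expressed as a small transition table, which is useful when validating status changes in custom tooling or tests. This is a minimal sketch of the diagram as drawn, not Grafana Incident's implementation; the `State` enum, `ALLOWED` table, and `transition` function are hypothetical names.

```python
from enum import Enum


class State(Enum):
    DECLARED = "Declared"
    INVESTIGATING = "Investigating"
    MITIGATING = "Mitigating"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Each state maps to the set of states it may move to, per the diagram.
ALLOWED = {
    State.DECLARED: {State.INVESTIGATING},
    State.INVESTIGATING: {State.MITIGATING, State.RESOLVED},  # Resolved = false alarm / auto-recovery
    State.MITIGATING: {State.RESOLVED},
    State.RESOLVED: {State.CLOSED},
    State.CLOSED: set(),
}


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


# Walk the normal path from declaration to closure.
state = State.DECLARED
for step in (State.INVESTIGATING, State.MITIGATING, State.RESOLVED, State.CLOSED):
    state = transition(state, step)
```

Rejecting moves that are not in the table (for example, reopening a closed incident) keeps the recorded timeline consistent with the workflow.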
Key workflow features in Grafana Incident:
- Role assignment — IC, Technical Lead, Communications Lead assigned from incident UI
- Task management — Create and assign tasks within the incident (e.g., “Check EU region load balancer logs”)
- Activity feed — Automated timeline of all events, status changes, and manual notes
- Artifact attachment — Link dashboards, runbooks, Slack threads, and external URLs
- Severity changes — Escalate or de-escalate as understanding evolves
- Auto-linking — Connects to the triggering alert group in OnCall
- Stakeholder updates — Publish status updates to configured channels
Postmortem Generation
When an incident is resolved, Grafana Incident can auto-generate a postmortem document from the incident timeline. This document includes the chronological sequence of events, roles involved, duration, and severity — pre-populated so the team can focus on analysis rather than reconstruction:
# Auto-generated postmortem structure from Grafana Incident
postmortem:
  incident_id: INC-2026-0142
  title: "API Gateway 503 errors in EU region"
  severity: critical
  duration: 47m
  detected_at: "2026-06-15T14:23:00Z"
  resolved_at: "2026-06-15T15:10:00Z"
  impact:
    users_affected: ~12000
    error_rate_peak: "23%"
    slo_budget_consumed: "8.2%"
  timeline:
    - time: "14:20:00"
      event: "SLO burn rate alert fires (14.4x)"
      actor: system
    - time: "14:23:00"
      event: "On-call alice acknowledges"
      actor: alice
    - time: "14:25:00"
      event: "Incident declared as SEV1"
      actor: alice
    - time: "14:28:00"
      event: "IC role assigned to bob"
      actor: alice
    - time: "14:35:00"
      event: "Root cause identified: bad config push to EU LB"
      actor: alice
    - time: "14:42:00"
      event: "Config rollback initiated"
      actor: alice
    - time: "15:10:00"
      event: "Error rate returned to baseline, incident resolved"
      actor: bob
  roles:
    incident_commander: bob
    technical_lead: alice
    communications: charlie
  # These sections are filled in during the postmortem meeting
  root_cause: "[To be completed]"
  contributing_factors: "[To be completed]"
  what_went_well: "[To be completed]"
  action_items: "[To be completed]"
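Because the timeline is pre-populated, response metrics such as time to acknowledge can be derived from it rather than reconstructed from memory. The following Python sketch assumes the timeline has been reduced to `(time, label)` pairs on a single day; the short labels and the `minutes_between` helper are hypothetical and are not fields emitted by Grafana Incident.

```python
from datetime import datetime

# Times copied from the example timeline; labels are illustrative shorthand.
TIMELINE = [
    ("14:20:00", "alert_fired"),
    ("14:23:00", "acknowledged"),
    ("14:25:00", "declared"),
    ("14:35:00", "root_cause_identified"),
    ("14:42:00", "rollback_initiated"),
    ("15:10:00", "resolved"),
]


def minutes_between(timeline, first, second):
    """Whole minutes from event `first` to event `second` (same-day timestamps)."""
    times = {label: datetime.strptime(t, "%H:%M:%S") for t, label in timeline}
    return int((times[second] - times[first]).total_seconds() // 60)


time_to_ack = minutes_between(TIMELINE, "alert_fired", "acknowledged")      # 3
ack_to_resolved = minutes_between(TIMELINE, "acknowledged", "resolved")     # 47
alert_to_resolved = minutes_between(TIMELINE, "alert_fired", "resolved")    # 50
```

Note that the 47-minute duration in the example corresponds to acknowledgement-to-resolution; measured from the moment the alert fired, the incident lasted 50 minutes. Deciding which interval counts as "duration" is worth settling before comparing incidents.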
Summary & Next Steps
Effective incident management is a discipline that spans the entire lifecycle from alert design through post-incident learning. In this guide, we covered:
- Alert Philosophy — The critical distinction between being alerted and alarmed, combating alert fatigue through actionability criteria and signal-to-noise optimization
- Incident Lifecycle — Preparation (runbooks, templates, escalation paths), execution (triage, roles, communication, mitigation), and learning (blameless postmortems, action items)
- SLI/SLO-Based Alerting — Moving beyond threshold alerts to burn-rate alerting with multi-window multi-burn-rate patterns that alert on user impact rather than infrastructure metrics
- Grafana Alerting — Alert rules (Grafana-managed, Mimir, Loki), contact points, notification policies with routing trees, silences, and alert administration
- Grafana OnCall — Alert groups, inbound integrations, Jinja2 notification templating, escalation chains with multi-step notification, outbound integrations, and schedule rotations
- Grafana Incident — Declaring incidents, structured workflow with timelines and roles, auto-generated postmortems, and severity classification
The key principle: alerts exist to protect users, not to report metrics. Every page should represent a genuine threat to user experience that requires immediate human intervention. Everything else belongs in dashboards, tickets, or logs.
Next in the Series
In Part 10: Infrastructure as Code for Observability, we’ll explore managing your entire Grafana stack as code — Terraform providers, Jsonnet/Grafonnet for dashboards, Kubernetes operators, GitOps workflows, and CI/CD pipelines for observability configuration.