The Incident Lifecycle
flowchart LR
A[Detection\nAlert fires or user reports] --> B[Triage\nAssess severity + impact]
B --> C[Response\nAssemble team, start investigating]
C --> D[Mitigation\nRestore service, stop bleeding]
D --> E[Resolution\nFix root cause]
E --> F[Post-Mortem\nAnalyse, learn, improve]
F --> G[Action Items\nPrevent recurrence]
Each phase has distinct goals and different people involved:
- Detection: Goal is speed — reduce MTTD (Mean Time to Detect) with SLO burn rate alerts and user-facing synthetic monitoring
- Triage: Classify severity, determine impact scope (how many users, which regions, which services)
- Response: Assemble the right people, establish communication channels, start structured investigation
- Mitigation: Restore service first — rollback, scale up, failover. Root cause comes later
- Resolution: Fix the underlying issue after mitigation stabilises the system
- Post-Mortem: Document what happened, why, and what will prevent recurrence
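The ordering of the phases above can be sketched as a small ordered enum. This is only an illustration: the names `Phase`, `next_phase`, and `user_impact_over` are hypothetical, and the rule that user impact ends once mitigation completes is an assumption drawn from the "restore service first" principle, not a formal definition.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """Incident lifecycle phases, numbered in the order listed above."""
    DETECTION = 1
    TRIAGE = 2
    RESPONSE = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_MORTEM = 6
    ACTION_ITEMS = 7


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.ACTION_ITEMS:
        return None
    return Phase(current + 1)


def user_impact_over(current: Phase) -> bool:
    """Assumption: user impact has ended once the incident is past MITIGATION."""
    return current > Phase.MITIGATION
```

Keeping mitigation and resolution as separate phases makes it explicit that service can be restored well before the root cause is fixed.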
Severity Classification
| Severity | Impact | Response Time | Who Is Involved |
|---|---|---|---|
| SEV1 — Critical | Complete outage, data loss, security breach affecting all users | Immediate (15 min response) | Incident Commander, on-call engineers, management notified |
| SEV2 — Major | Significant degradation affecting many users, critical feature broken | 30 minutes | On-call engineer, service owners, IC if needed |
| SEV3 — Minor | Partial degradation, non-critical feature affected, workaround exists | 4 hours (business hours) | Service owner, handled during normal work |
| SEV4 — Low | Cosmetic issue, minor bug, no user impact | Next sprint | Ticket created, handled in backlog |
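As a rough sketch, the table above could be encoded as a classification function. The `Impact` fields and the `classify` function are hypothetical names chosen for illustration, and the boolean flags are a simplification of the judgement a real triage involves.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"
    SEV4 = "Low"


@dataclass(frozen=True)
class Impact:
    """Hypothetical triage answers; real triage needs more nuance."""
    complete_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    significant_degradation: bool = False
    critical_feature_broken: bool = False
    user_facing: bool = False  # any user-visible effect at all


# Response targets in minutes, taken from the table; SEV4 has no clock
# because it is handled in the backlog.
RESPONSE_TARGET_MINUTES: Dict[Severity, Optional[int]] = {
    Severity.SEV1: 15,
    Severity.SEV2: 30,
    Severity.SEV3: 4 * 60,  # business hours only
    Severity.SEV4: None,
}


def classify(impact: Impact) -> Severity:
    """Check the most severe conditions first so the highest match wins."""
    if impact.complete_outage or impact.data_loss or impact.security_breach:
        return Severity.SEV1
    if impact.significant_degradation or impact.critical_feature_broken:
        return Severity.SEV2
    if impact.user_facing:
        return Severity.SEV3
    return Severity.SEV4
```

For example, `classify(Impact(critical_feature_broken=True, user_facing=True))` yields `Severity.SEV2`, because the checks run from most to least severe.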
Incident Roles
Clear roles prevent chaos during incidents. A simplified model, drawing on the incident response practices published by Google and PagerDuty, uses three primary roles:
| Role | Responsibilities | NOT Responsible For |
|---|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, manages communication, tracks timeline | Debugging code, writing fixes — the IC does not touch the keyboard |
| Technical Lead | Leads investigation, proposes mitigation, directs debugging efforts | Communication with stakeholders, which the IC and the Communications Lead handle |
| Communications Lead | Updates status page, notifies affected customers, writes internal updates | Technical investigation |
Communication Protocols
During the Incident
- Dedicated Slack channel: Create #incident-YYYY-MM-DD-description immediately. All discussion happens here, not in DMs
- Regular updates: The IC posts a structured update every 30 minutes (for SEV1) or 60 minutes (for SEV2)
- Status page: Update the public status page within 15 minutes of SEV1 declaration
- Bridge call: For SEV1, open a video/voice bridge for real-time coordination
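The naming convention and timing rules above are easy to automate. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the lowercase hyphenated slug is an assumption about how the description should be normalised, and no Slack or status page API is called.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional

# Update cadence from the protocol above; other severities have no fixed cadence.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(minutes=60),
}

# The public status page should be updated within 15 minutes of a SEV1 declaration.
STATUS_PAGE_DEADLINE = timedelta(minutes=15)


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name of the form #incident-YYYY-MM-DD-description."""
    # Assumption: collapse anything that is not a letter or digit into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the IC's next structured update is due, if a cadence applies."""
    interval = UPDATE_INTERVAL.get(severity)
    return last_update + interval if interval else None


def status_page_due(declared_at: datetime) -> datetime:
    """Return the latest time the status page should reflect a SEV1."""
    return declared_at + STATUS_PAGE_DEADLINE
```

For instance, `incident_channel_name(date(2024, 3, 5), "Checkout API 5xx errors")` returns `#incident-2024-03-05-checkout-api-5xx-errors`.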
Update Template
Status: Investigating / Identified / Mitigating / Resolved
Impact: [Who is affected and how]
Current Theory: [What we think is happening]
Actions in Progress: [What we are doing right now]
Next Update: [When the next update will be posted]
IC: [@name]
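To keep updates consistent under pressure, the template can be rendered from structured fields. This is one possible sketch: the `IncidentUpdate` class is hypothetical, and it assumes the next-update time is supplied in UTC.

```python
from dataclasses import dataclass
from datetime import datetime

# The four status values allowed by the template above.
STATUSES = ("Investigating", "Identified", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    status: str
    impact: str
    current_theory: str
    actions_in_progress: str
    next_update: datetime  # assumed to be in UTC
    ic: str  # chat handle without the leading "@"

    def render(self) -> str:
        """Return the update as text in the template's field order."""
        if self.status not in STATUSES:
            raise ValueError(f"status must be one of {STATUSES}")
        return "\n".join([
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Current Theory: {self.current_theory}",
            f"Actions in Progress: {self.actions_in_progress}",
            f"Next Update: {self.next_update.strftime('%H:%M')} UTC",
            f"IC: @{self.ic}",
        ])
```

Rejecting unknown status values keeps every update on the same four-state vocabulary, which makes the channel history easier to scan afterwards.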
Blameless Post-Mortems
A post-mortem is a structured review after an incident that identifies what happened, why, and what will prevent recurrence. The key word is blameless — the goal is to understand the system, not to punish individuals.
Post-Mortem Document Template
Post-Mortem: [Incident Title]
Date: YYYY-MM-DD | Severity: SEV-X | Duration: Xh Ym | Author: [Name]
1. Summary — 2-3 sentences describing the incident and its impact.
2. Impact — Number of affected users, error budget consumed, revenue impact, duration of user-facing impact.
3. Timeline — Chronological list of events from detection to resolution, with timestamps.
4. Root Cause — Technical explanation of what caused the incident. Focus on systemic causes, not individual actions.
5. Contributing Factors — What conditions allowed the root cause to have this impact? (Missing monitoring, no circuit breaker, insufficient testing, etc.)
6. What Went Well — Detection speed, team coordination, mitigation effectiveness.
7. What Went Poorly — Delayed detection, confusing communication, missing runbooks.
8. Action Items — Specific, assigned, time-bound improvements with owners and due dates.
Building Blameless Culture
Practices that build blameless culture:
- Language matters: Replace "Engineer X caused the outage" with "The deployment pipeline lacked a canary stage, allowing a bad config to reach production"
- Celebrate post-mortems: Share them widely, present at team meetings, treat them as learning opportunities rather than failure reports
- Follow up on action items: Track completion rates — if action items are never done, the post-mortem process loses credibility
- Leadership participation: Managers and directors should attend post-mortem reviews to show that learning is valued
- Never use post-mortems for blame: If post-mortems are used to assign blame or justify disciplinary action, people will stop reporting honestly
Incident Metrics & Tracking
Track incident metrics over time to measure whether your reliability processes are improving:
| Metric | Definition | Target Direction |
|---|---|---|
| MTTD (Mean Time to Detect) | Time from incident start to detection (alert firing or user report) | ↓ Decrease |
| MTTA (Mean Time to Acknowledge) | Time from alert to human acknowledgement | ↓ Decrease |
| MTTR (Mean Time to Resolve) | Time from detection to full resolution | ↓ Decrease |
| MTTM (Mean Time to Mitigate) | Time from detection to user-impact mitigated | ↓ Decrease (most important) |
| Incident Count | Number of SEV1/SEV2 incidents per month | ↓ Decrease |
| Action Item Completion Rate | Post-mortem action items completed within deadline | ↑ Increase (target: > 80%) |
| Repeat Incidents | Incidents with the same root cause as a previous incident | ↓ Decrease (target: 0) |
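The four time-based metrics in the table can be derived from five timestamps per incident. The sketch below follows the definitions given above (MTTM and MTTR are both measured from detection); the `IncidentTimes` record and its field names are hypothetical, and results are reported in minutes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass
class IncidentTimes:
    started: datetime       # when the problem actually began
    detected: datetime      # alert fired or user reported
    acknowledged: datetime  # a human acknowledged the alert
    mitigated: datetime     # user impact stopped
    resolved: datetime      # root cause fixed


def _mean_minutes(deltas: List[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


def incident_metrics(incidents: List[IncidentTimes]) -> Dict[str, float]:
    """Compute MTTD, MTTA, MTTM, and MTTR (minutes) over a non-empty list."""
    return {
        "MTTD": _mean_minutes([i.detected - i.started for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "MTTM": _mean_minutes([i.mitigated - i.detected for i in incidents]),
        "MTTR": _mean_minutes([i.resolved - i.detected for i in incidents]),
    }
```

Because the incident start time is usually only known in hindsight, it typically has to be filled in during the post-mortem, which is one reason the timeline section matters.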
Conclusion & Next Steps
Incident management is a skill that improves with practice and structure. Key takeaways from Part 10:
- The incident lifecycle has clear phases: detect → triage → respond → mitigate → resolve → post-mortem
- When severity is uncertain, declare the higher level: downgrading is cheap, slow response is expensive
- Incident Commander coordinates but does not debug — separation of concerns prevents tunnel vision
- Regular structured updates every 30-60 minutes prevent information vacuums
- Blameless post-mortems focus on systemic factors, not individual mistakes — but still produce concrete action items
- Track MTTM (mitigation time) as the primary incident metric — users care about service restoration, not root cause