Part 10: Incident Management & Post-Mortems

The Incident Lifecycle

Incident Lifecycle — From Detection to Learning

flowchart LR
    A[Detection\nAlert fires or user reports] --> B[Triage\nAssess severity + impact]
    B --> C[Response\nAssemble team, start investigating]
    C --> D[Mitigation\nRestore service, stop bleeding]
    D --> E[Resolution\nFix root cause]
    E --> F[Post-Mortem\nAnalyse, learn, improve]
    F --> G[Action Items\nPrevent recurrence]

Each phase has distinct goals and different people involved:

Detection: Goal is speed — reduce MTTD (Mean Time to Detect) with SLO burn rate alerts and user-facing synthetic monitoring
Triage: Classify severity, determine impact scope (how many users, which regions, which services)
Response: Assemble the right people, establish communication channels, start structured investigation
Mitigation: Restore service first — rollback, scale up, failover. Root cause comes later
Resolution: Fix the underlying issue after mitigation stabilises the system
Post-Mortem: Document what happened, why, and what will prevent recurrence
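
The phase ordering above can be sketched as a small state machine. This is only an illustration: the names `Phase`, `next_phase`, and `advance` are hypothetical, and the strictly linear ordering is a simplifying assumption (real incidents often loop back, for example from mitigation to triage when the impact turns out to be wider than first thought).

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    DETECTION = "detection"
    TRIAGE = "triage"
    RESPONSE = "response"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POST_MORTEM = "post-mortem"
    ACTION_ITEMS = "action items"


# Order taken from the lifecycle flowchart; Enum members iterate in definition order.
ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    idx = ORDER.index(current)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` only if it is the immediate next phase (assumed linear flow)."""
    if next_phase(current) is not target:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Rejecting a jump such as detection straight to resolution mirrors the point that mitigation comes before root-cause work.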

Severity Classification

Severity	Impact	Response Time	Who Is Involved
SEV1 — Critical	Complete outage, data loss, security breach affecting all users	Immediate (15 min response)	Incident Commander, on-call engineers, management notified
SEV2 — Major	Significant degradation affecting many users, critical feature broken	30 minutes	On-call engineer, service owners, IC if needed
SEV3 — Minor	Partial degradation, non-critical feature affected, workaround exists	4 hours (business hours)	Service owner, handled during normal work
SEV4 — Low	Cosmetic issue, minor bug, no user impact	Next sprint	Ticket created, handled in backlog

                            
Escalation Rule: If the severity is uncertain, always escalate up. It is better to over-classify a SEV3 as SEV2 and stand down quickly than to under-classify a SEV1 as SEV3 and lose critical response time. Downgrading is cheap; the cost of a slow response to a real SEV1 is enormous.
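
As a rough sketch of the escalation rule, the function below takes every severity the triager considers plausible and picks the most severe one. `Severity`, `classify`, and `RESPONSE_MINUTES` are illustrative names, not part of any particular tool; the response targets are copied from the table above, with "4 hours" written as 240 minutes.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical
    SEV2 = 2  # major
    SEV3 = 3  # minor
    SEV4 = 4  # low


def classify(candidates: list[Severity]) -> Severity:
    """When unsure, escalate up: choose the most severe plausible level."""
    if not candidates:
        raise ValueError("at least one candidate severity is required")
    return min(candidates)  # a lower number means a more severe incident


# Response-time targets from the severity table, in minutes.
# None stands for "next sprint" (no fixed response time).
RESPONSE_MINUTES = {
    Severity.SEV1: 15,
    Severity.SEV2: 30,
    Severity.SEV3: 240,
    Severity.SEV4: None,
}
```

For example, if a triager cannot decide between SEV3 and SEV2, `classify([Severity.SEV3, Severity.SEV2])` returns `Severity.SEV2`, and the incident can be downgraded later if the impact proves smaller.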
                        

Incident Roles

Clear roles prevent chaos during incidents. A common model, adapted from the incident response practices published by Google and PagerDuty, uses three primary roles:

Role	Responsibilities	NOT Responsible For
Incident Commander (IC)	Coordinates response, makes decisions, manages communication, tracks timeline	Debugging code, writing fixes — the IC does not touch the keyboard
Technical Lead	Leads investigation, proposes mitigation, directs debugging efforts	Communication with stakeholders, which belongs to the IC and the Communications Lead
Communications Lead	Updates status page, notifies affected customers, writes internal updates	Technical investigation

                            
The IC Does Not Debug: The most common mistake in incident response is having the most senior engineer both coordinate the response and debug the problem. This leads to tunnel vision, poor communication, and missed escalation windows. The IC's job is to orchestrate: deciding who investigates what and when to escalate, and keeping everyone informed.
                        

Communication Protocols

During the Incident

Dedicated Slack channel: Create #incident-YYYY-MM-DD-description immediately. All discussion happens here, not in DMs
Regular updates: The IC posts a structured update every 30 minutes (for SEV1) or 60 minutes (for SEV2)
Status page: Update the public status page within 15 minutes of SEV1 declaration
Bridge call: For SEV1, open a video/voice bridge for real-time coordination
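
The channel naming convention and update cadence above are easy to automate. The sketch below is one possible shape, assuming hypothetical helper names (`channel_name`, `next_update_due`) and a simple slug rule; it does not call any chat or paging API. Only SEV1 and SEV2 have a cadence in the list above, so other severities are deliberately left out.

```python
import re
from datetime import date, datetime, timedelta

# Update cadence from the list above; SEV3/SEV4 have no stated cadence.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(minutes=60),
}


def channel_name(day: date, description: str) -> str:
    """Build a name of the form #incident-YYYY-MM-DD-description."""
    # Lowercase the description and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the IC should post the next structured update."""
    return last_update + UPDATE_INTERVAL[severity]  # KeyError for SEV3/SEV4
```

With these helpers, `channel_name(date(2024, 3, 5), "Checkout API 500s")` gives `#incident-2024-03-05-checkout-api-500s`.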

Update Template

                            
Incident Update Template:

Status: Investigating / Identified / Mitigating / Resolved
Impact: [Who is affected and how]
Current Theory: [What we think is happening]
Actions in Progress: [What we are doing right now]
Next Update: [When the next update will be posted]
IC: [@name]
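
To keep updates consistent under pressure, the template can be filled in by code rather than by hand. This is a minimal sketch: the `IncidentUpdate` class and its field names are assumptions that mirror the template fields, and the output is plain text that could be pasted into any channel.

```python
from dataclasses import dataclass

# Allowed status values, taken from the template.
STATUSES = ("Investigating", "Identified", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    status: str
    impact: str
    current_theory: str
    actions_in_progress: str
    next_update: str
    ic: str

    def render(self) -> str:
        """Render the update in the same field order as the template."""
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        return "\n".join([
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Current Theory: {self.current_theory}",
            f"Actions in Progress: {self.actions_in_progress}",
            f"Next Update: {self.next_update}",
            f"IC: {self.ic}",
        ])
```

Validating the status field is a small guard against free-form wording that makes successive updates hard to compare.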

Blameless Post-Mortems

A post-mortem is a structured review after an incident that identifies what happened, why, and what will prevent recurrence. The key word is blameless — the goal is to understand the system, not to punish individuals.

Post-Mortem Document Template

Post-Mortem: [Incident Title]

Date: YYYY-MM-DD | Severity: SEV-X | Duration: Xh Ym | Author: [Name]

1. Summary — 2-3 sentences describing the incident and its impact.

2. Impact — Number of affected users, error budget consumed, revenue impact, duration of user-facing impact.

3. Timeline — Chronological list of events from detection to resolution, with timestamps.

4. Root Cause — Technical explanation of what caused the incident. Focus on systemic causes, not individual actions.

5. Contributing Factors — What conditions allowed the root cause to have this impact? (Missing monitoring, no circuit breaker, insufficient testing, etc.)

6. What Went Well — Detection speed, team coordination, mitigation effectiveness.

7. What Went Poorly — Delayed detection, confusing communication, missing runbooks.

8. Action Items — Specific, assigned, time-bound improvements with owners and due dates.
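
Item 8 asks for action items that are specific, assigned, and time-bound. One way to enforce that is to record them as structured data and check each one. The sketch below uses hypothetical names (`ActionItem`, `is_well_formed`, `completion_rate`), and "completed on time" is assumed to mean completed on or before the due date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    completed_on: Optional[date] = None

    def is_well_formed(self) -> bool:
        """Specific, assigned, and time-bound: description, owner, and due date all set."""
        return bool(self.description.strip()) and bool(self.owner) and self.due is not None

    def completed_on_time(self) -> bool:
        """True only if the item was finished on or before its due date."""
        return (
            self.due is not None
            and self.completed_on is not None
            and self.completed_on <= self.due
        )


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items completed within their deadline (0.0 if there are none)."""
    if not items:
        return 0.0
    return sum(item.completed_on_time() for item in items) / len(items)
```

A post-mortem review could refuse to close until every item passes `is_well_formed()`, which keeps vague items such as "improve monitoring" with no owner out of the tracker.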

Building Blameless Culture

                            
Blameless ≠ Accountability-free: Blameless means focusing on systemic factors rather than individual mistakes. A blameless post-mortem asks "Why did the system allow this to happen?" not "Who caused this?" But it still produces action items with owners and deadlines. The system must change; that is accountability.
                        

Practices that build blameless culture:

Language matters: Replace "Engineer X caused the outage" with "The deployment pipeline lacked a canary stage, allowing a bad config to reach production"
Celebrate post-mortems: Share them widely, present at team meetings, treat them as learning opportunities rather than failure reports
Follow up on action items: Track completion rates — if action items are never done, the post-mortem process loses credibility
Leadership participation: Managers and directors should attend post-mortem reviews to show that learning is valued
No post-mortem for blame: If post-mortems are used to assign blame or justify disciplinary action, people will stop reporting honestly

Incident Metrics & Tracking

Track incident metrics over time to measure whether your reliability processes are improving:

Metric	Definition	Target Direction
MTTD (Mean Time to Detect)	Time from incident start to alert firing	↓ Decrease
MTTA (Mean Time to Acknowledge)	Time from alert to human acknowledgement	↓ Decrease
MTTR (Mean Time to Resolve)	Time from detection to full resolution	↓ Decrease
MTTM (Mean Time to Mitigate)	Time from detection to user-impact mitigated	↓ Decrease (most important)
Incident Count	Number of SEV1/SEV2 incidents per month	↓ Decrease
Action Item Completion Rate	Post-mortem action items completed within deadline	↑ Increase (target: > 80%)
Repeat Incidents	Incidents with the same root cause as a previous incident	↓ Decrease (target: 0)

                            
MTTM over MTTR: Mean Time to Mitigate (MTTM) is more important than Mean Time to Resolve (MTTR). Mitigation restores service for users (rollback, failover); resolution fixes the root cause (which can happen later). Optimise for MTTM: users do not care about root cause while they are unable to use your service.
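
The four time-based metrics can be computed directly from incident timestamps. The following is a sketch under stated assumptions: `IncidentTimes` and `incident_metrics` are illustrative names, each incident record is assumed to carry all five timestamps, and the intervals follow the definitions in the table above (MTTM and MTTR are both measured from detection).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimes:
    started: datetime       # when the incident actually began
    detected: datetime      # when the alert fired
    acknowledged: datetime  # when a human acknowledged the alert
    mitigated: datetime     # when user impact ended
    resolved: datetime      # when the root cause was fixed


def _mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents: list[IncidentTimes]) -> dict[str, float]:
    """Return MTTD, MTTA, MTTM, and MTTR in minutes for a non-empty incident list."""
    return {
        "MTTD": _mean_minutes([i.detected - i.started for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "MTTM": _mean_minutes([i.mitigated - i.detected for i in incidents]),
        "MTTR": _mean_minutes([i.resolved - i.detected for i in incidents]),
    }
```

Reporting MTTM next to MTTR makes the gap between "users are fine again" and "root cause fixed" visible, which is the distinction the callout above draws. Note that `statistics.mean` raises an error on an empty list, so a month with no incidents needs separate handling.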
                        

Conclusion & Next Steps

Incident management is a skill that improves with practice and structure. Key takeaways from Part 10:

The incident lifecycle has clear phases: detect → triage → respond → mitigate → resolve → post-mortem
Always escalate up when severity is uncertain — downgrading is cheap, slow response is expensive
Incident Commander coordinates but does not debug — separation of concerns prevents tunnel vision
Regular structured updates every 30-60 minutes prevent information vacuums
Blameless post-mortems focus on systemic factors, not individual mistakes — but still produce concrete action items
Track MTTM (mitigation time) as the primary incident metric — users care about service restoration, not root cause
