Part 10: Incident Management & Post-Mortems

May 14, 2026 | Wasil Zafar | 17 min read

Incidents are inevitable in distributed systems. What separates great engineering organisations from the rest is not whether incidents happen, but how they are managed during the crisis and what is learned afterwards. This part covers structured incident response, communication protocols, and blameless post-mortems.

Table of Contents

  1. The Incident Lifecycle
  2. Severity Classification
  3. Incident Roles
  4. Communication Protocols
  5. Blameless Post-Mortems
  6. Incident Metrics & Tracking
  7. Conclusion & Next Steps

The Incident Lifecycle

Incident Lifecycle: From Detection to Learning

flowchart LR
    A[Detection\nAlert fires or user reports] --> B[Triage\nAssess severity + impact]
    B --> C[Response\nAssemble team, start investigating]
    C --> D[Mitigation\nRestore service, stop bleeding]
    D --> E[Resolution\nFix root cause]
    E --> F[Post-Mortem\nAnalyse, learn, improve]
    F --> G[Action Items\nPrevent recurrence]

Each phase has distinct goals and different people involved:

  • Detection: Goal is speed — reduce MTTD (Mean Time to Detect) with SLO burn rate alerts and user-facing synthetic monitoring
  • Triage: Classify severity, determine impact scope (how many users, which regions, which services)
  • Response: Assemble the right people, establish communication channels, start structured investigation
  • Mitigation: Restore service first — rollback, scale up, failover. Root cause comes later
  • Resolution: Fix the underlying issue after mitigation stabilises the system
  • Post-Mortem: Document what happened, why, and what will prevent recurrence
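
The ordering of these phases can be sketched as a small state machine. This is a minimal illustration rather than a prescribed implementation: the `Phase` enum and the `next_phase` and `advance` helpers are hypothetical names introduced here, and the sketch assumes an incident moves strictly forward one phase at a time.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Lifecycle phases in the order listed above (names are illustrative)."""
    DETECTION = 1
    TRIAGE = 2
    RESPONSE = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_MORTEM = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the post-mortem."""
    if current is Phase.POST_MORTEM:
        return None
    return Phase(current.value + 1)


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` only if it is the immediate successor of `current`."""
    if next_phase(current) is not target:
        raise ValueError(f"cannot go from {current.name} to {target.name}")
    return target
```

Under this assumption, `advance(Phase.TRIAGE, Phase.RESOLUTION)` raises a `ValueError`, which mirrors the rule that mitigation comes before the root-cause fix.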

Severity Classification

| Severity | Impact | Response Time | Who Is Involved |
|---|---|---|---|
| SEV1 (Critical) | Complete outage, data loss, security breach affecting all users | Immediate (15 min response) | Incident Commander, on-call engineers, management notified |
| SEV2 (Major) | Significant degradation affecting many users, critical feature broken | 30 minutes | On-call engineer, service owners, IC if needed |
| SEV3 (Minor) | Partial degradation, non-critical feature affected, workaround exists | 4 hours (business hours) | Service owner, handled during normal work |
| SEV4 (Low) | Cosmetic issue, minor bug, no user impact | Next sprint | Ticket created, handled in backlog |

Escalation Rule: If the severity is uncertain, always escalate up. It is better to over-classify a SEV3 as SEV2 and stand down quickly than to under-classify a SEV1 as SEV3 and lose critical response time. Downgrading is cheap; the cost of a slow response to a real SEV1 is enormous.
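
The severity levels and the escalation rule can be encoded in a few lines. The sketch below is one possible reading, not a standard API: `Severity`, `RESPONSE_TARGET`, and `escalate_if_uncertain` are hypothetical names, and bumping the estimate by exactly one level when the triager is unsure is an assumption made for illustration.

```python
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, matching SEV1..SEV4."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response targets taken from the severity table.
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: timedelta(hours=4),  # within business hours
    Severity.SEV4: None,                # no timed target; goes to the backlog
}


def escalate_if_uncertain(estimate: Severity, uncertain: bool) -> Severity:
    """Return a one-level more severe classification when the triager is unsure."""
    if uncertain and estimate > Severity.SEV1:
        return Severity(estimate - 1)
    return estimate
```

For example, an uncertain SEV3 estimate becomes SEV2, while SEV1 stays SEV1 because there is nothing more severe to escalate to.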

Incident Roles

Clear roles prevent chaos during incidents. A model adapted from the Google and PagerDuty incident response practices defines three primary roles:

| Role | Responsibilities | NOT Responsible For |
|---|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, manages communication, tracks timeline | Debugging code or writing fixes; the IC does not touch the keyboard |
| Technical Lead | Leads investigation, proposes mitigation, directs debugging efforts | Communication with stakeholders (owned by the IC and the Communications Lead) |
| Communications Lead | Updates status page, notifies affected customers, writes internal updates | Technical investigation |

The IC Does Not Debug: The most common mistake in incident response is having the most senior engineer both coordinate the response AND debug the problem. This leads to tunnel vision, poor communication, and missed escalation windows. The IC's job is to orchestrate: deciding who investigates what and when to escalate, and keeping everyone informed.

Communication Protocols

During the Incident

  • Dedicated Slack channel: Create #incident-YYYY-MM-DD-description immediately. All discussion happens here, not in DMs
  • Regular updates: The IC posts a structured update every 30 minutes (for SEV1) or 60 minutes (for SEV2)
  • Status page: Update the public status page within 15 minutes of SEV1 declaration
  • Bridge call: For SEV1, open a video/voice bridge for real-time coordination
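
The channel naming convention and the update cadence above are easy to automate. The following is a minimal sketch that only builds the name and computes the next update time; it does not call any chat API. The function names `incident_channel_name` and `next_update_due` are hypothetical, as is the choice to slugify the description into lowercase words joined by hyphens.

```python
import re
from datetime import date, datetime, timedelta

# Update cadence from the list above; other severities have no fixed cadence here.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(minutes=60),
}


def incident_channel_name(day: date, description: str) -> str:
    """Build '#incident-YYYY-MM-DD-description' with a lowercase, hyphenated slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


def next_update_due(severity: str, last_update: datetime) -> datetime:
    """Return when the IC should post the next structured update."""
    return last_update + UPDATE_INTERVAL[severity]
```

For instance, `incident_channel_name(date(2026, 5, 14), "Checkout API 5xx errors")` returns `#incident-2026-05-14-checkout-api-5xx-errors`.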

Update Template

Incident Update Template:
Status: Investigating / Identified / Mitigating / Resolved
Impact: [Who is affected and how]
Current Theory: [What we think is happening]
Actions in Progress: [What we are doing right now]
Next Update: [When the next update will be posted]
IC: [@name]
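
Teams that post updates from a bot or script may find it useful to render the template from structured fields so that no field is forgotten. This is a sketch under that assumption: the `IncidentUpdate` class and its field names are hypothetical and simply mirror the template above.

```python
from dataclasses import dataclass

# The four status values named in the template.
STATUSES = ("Investigating", "Identified", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    status: str
    impact: str
    current_theory: str
    actions_in_progress: str
    next_update: str
    ic: str

    def render(self) -> str:
        """Return the update as plain text, one template field per line."""
        if self.status not in STATUSES:
            raise ValueError(f"status must be one of {STATUSES}")
        return "\n".join([
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Current Theory: {self.current_theory}",
            f"Actions in Progress: {self.actions_in_progress}",
            f"Next Update: {self.next_update}",
            f"IC: {self.ic}",
        ])
```

Rejecting unknown status values keeps every update on the same four-step vocabulary, which makes a long incident channel easier to scan.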

Blameless Post-Mortems

A post-mortem is a structured review after an incident that identifies what happened, why, and what will prevent recurrence. The key word is blameless — the goal is to understand the system, not to punish individuals.

Post-Mortem Document Template

Post-Mortem: [Incident Title]

Date: YYYY-MM-DD  |  Severity: SEV-X  |  Duration: Xh Ym  |  Author: [Name]

1. Summary — 2-3 sentences describing the incident and its impact.

2. Impact — Number of affected users, error budget consumed, revenue impact, duration of user-facing impact.

3. Timeline — Chronological list of events from detection to resolution, with timestamps.

4. Root Cause — Technical explanation of what caused the incident. Focus on systemic causes, not individual actions.

5. Contributing Factors — What conditions allowed the root cause to have this impact? (Missing monitoring, no circuit breaker, insufficient testing, etc.)

6. What Went Well — Detection speed, team coordination, mitigation effectiveness.

7. What Went Poorly — Delayed detection, confusing communication, missing runbooks.

8. Action Items — Specific, assigned, time-bound improvements with owners and due dates.
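
The requirement that action items be specific, assigned, and time-bound can be enforced in whatever tool tracks them. The sketch below shows one way to model this; the `ActionItem` class, its fields, and the `is_overdue` helper are hypothetical and not tied to any particular ticketing system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    completed_on: Optional[date] = None

    def __post_init__(self) -> None:
        # An action item without an owner is not "assigned".
        if not self.owner.strip():
            raise ValueError("every action item needs an owner")

    def is_overdue(self, today: date) -> bool:
        """True if the item is still open after its due date."""
        return self.completed_on is None and today > self.due
```

Making `owner` and `due` mandatory at construction time means an unowned or undated action item cannot be recorded in the first place.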

[Figure: Blameless Post-Mortem Continuous Improvement]

Building Blameless Culture

Blameless ≠ Accountability-free: Blameless means focusing on systemic factors rather than individual mistakes. A blameless post-mortem asks "Why did the system allow this to happen?" not "Who caused this?" But it still produces action items with owners and deadlines. The system must change — that is accountability.

Practices that build blameless culture:

  • Language matters: Replace "Engineer X caused the outage" with "The deployment pipeline lacked a canary stage, allowing a bad config to reach production"
  • Celebrate post-mortems: Share them widely, present at team meetings, treat them as learning opportunities rather than failure reports
  • Follow up on action items: Track completion rates — if action items are never done, the post-mortem process loses credibility
  • Leadership participation: Managers and directors should attend post-mortem reviews to show that learning is valued
  • No post-mortem for blame: If post-mortems are used to assign blame or justify disciplinary action, people will stop reporting honestly

Incident Metrics & Tracking

Track incident metrics over time to measure whether your reliability processes are improving:

| Metric | Definition | Target Direction |
|---|---|---|
| MTTD (Mean Time to Detect) | Time from incident start to alert firing | ↓ Decrease |
| MTTA (Mean Time to Acknowledge) | Time from alert to human acknowledgement | ↓ Decrease |
| MTTR (Mean Time to Resolve) | Time from detection to full resolution | ↓ Decrease |
| MTTM (Mean Time to Mitigate) | Time from detection to user-impact mitigated | ↓ Decrease (most important) |
| Incident Count | Number of SEV1/SEV2 incidents per month | ↓ Decrease |
| Action Item Completion Rate | Post-mortem action items completed within deadline | ↑ Increase (target: > 80%) |
| Repeat Incidents | Incidents with the same root cause as a previous incident | ↓ Decrease (target: 0) |

MTTM over MTTR: Mean Time to Mitigate (MTTM) is more important than Mean Time to Resolve (MTTR). Mitigation restores service for users (rollback, failover); resolution fixes the root cause (which can happen later). Optimise for MTTM: users do not care about root cause while they are unable to use your service.
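
The four time-based metrics follow directly from five timestamps per incident. The sketch below computes them in minutes using the definitions in the table; the `IncidentTimes` record and the `incident_metrics` function are hypothetical names, and the sketch assumes every incident has all five timestamps recorded.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimes:
    started: datetime       # when the incident actually began
    detected: datetime      # when the alert fired
    acknowledged: datetime  # when a human acknowledged the alert
    mitigated: datetime     # when user impact ended
    resolved: datetime      # when the root cause was fixed


def mean_minutes(deltas) -> float:
    """Average a collection of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents) -> dict:
    """Compute MTTD, MTTA, MTTM and MTTR (in minutes) per the table definitions."""
    return {
        "MTTD": mean_minutes(i.detected - i.started for i in incidents),
        "MTTA": mean_minutes(i.acknowledged - i.detected for i in incidents),
        "MTTM": mean_minutes(i.mitigated - i.detected for i in incidents),
        "MTTR": mean_minutes(i.resolved - i.detected for i in incidents),
    }
```

Because MTTM and MTTR share the same starting point (detection), the gap between them shows how much of the total resolution time users spent with service already restored.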

Conclusion & Next Steps

Incident management is a skill that improves with practice and structure. Key takeaways from Part 10:

  • The incident lifecycle has clear phases: detect → triage → respond → mitigate → resolve → post-mortem
  • Always escalate up when severity is uncertain — downgrading is cheap, slow response is expensive
  • Incident Commander coordinates but does not debug — separation of concerns prevents tunnel vision
  • Regular structured updates every 30-60 minutes prevent information vacuums
  • Blameless post-mortems focus on systemic factors, not individual mistakes — but still produce concrete action items
  • Track MTTM (mitigation time) as the primary incident metric — users care about service restoration, not root cause