Design an incident management system with alert routing and escalation, on-call rotation management, pager fatigue reduction, runbook automation, and SLO-based alerting that scales to a 500-engineer organization.
## Problem
Your organization has grown to 500 engineers across 50 teams, each owning 2-5 services. The current on-call experience is painful: engineers receive 25+ pages per week, most of which are noise. Alert fatigue has led to missed critical alerts. There is no standardized escalation process, and runbooks are outdated Google Docs that nobody reads. Design an incident management system that reduces pager fatigue, automates common remediations, and scales with the organization.
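One way to make the "most pages are noise" complaint concrete is to page on error-budget burn rate rather than on raw error counts. The sketch below is a minimal illustration, not a prescribed solution. The names `WindowStats`, `burn_rate`, and `should_page` are hypothetical. The default threshold of 14.4 is an assumption: it corresponds to spending 2% of a 30-day error budget in one hour. Requiring both a long and a short window to exceed the threshold is likewise an assumed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(stats: WindowStats, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if stats.total == 0:
        return 0.0  # no traffic, so no budget is being consumed
    error_budget = 1.0 - slo_target
    return (stats.errors / stats.total) / error_budget


def should_page(
    long_window: WindowStats,
    short_window: WindowStats,
    slo_target: float,
    threshold: float = 14.4,  # assumed: 2% of a 30-day budget spent in 1 hour
) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    The long window filters out brief blips; the short window lets the
    alert clear soon after the problem stops.
    """
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Sustained outage: both windows are well above the threshold.
    print(should_page(WindowStats(100_000, 2_000), WindowStats(10_000, 300), 0.999))
    # Short blip: only the short window is hot, so nobody is paged.
    print(should_page(WindowStats(100_000, 100), WindowStats(10_000, 500), 0.999))
```

Under these assumptions a brief error spike does not wake anyone, while a sustained burn does, which is the behavior the problem statement asks for when it calls for SLO-based alerting.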