Design a self-healing platform that automatically detects and remediates infrastructure failures including node replacement, capacity adjustment, and anomaly response, with blast radius limiting and human escalation.
## Problem
Your infrastructure fleet of 5,000 nodes experiences approximately 50 node failures, 200 service-level incidents, and 30 capacity-related issues per week. Currently, on-call engineers handle all of these manually, leading to 40+ hours of toil per week and frequent overnight pages. Design a self-healing infrastructure platform that automatically detects and remediates common failure modes while ensuring safety through blast radius limiting and human escalation.
Sign up to access the full problem
Design canvas, rubric, hints, and model solutions.
Identify and Measure Toil
Junior · Scenario