Diagnose and halt a cascading failure spreading across a microservice architecture, identifying blast radius, breaking retry amplification loops, and coordinating war room response.
## Problem
You are the incident commander for a cascading failure. It started with elevated latency in the payment service, but within 10 minutes, the order service, inventory service, and API gateway are all returning errors. Error rates across the platform have gone from 0.1% to 45%. Describe your approach to diagnosing the cascade mechanism, containing the blast radius, and coordinating recovery.
Sign up to access the full problem
Design canvas, rubric, hints, and model solutions.