Design a metrics and alerting platform addressing cardinality management, alert fatigue reduction, signal correlation, business vs infrastructure metrics, and cost optimization for an organization running 300 services.
## Problem
Your organization runs 300 services that together generate 10 million active time series. The current metrics platform is struggling: queries are slow, cardinality explosions cause storage costs to spike, on-call engineers receive more than 200 alerts per week (most of them noise), and leadership cannot connect infrastructure issues to business impact. Design a metrics and alerting platform that scales efficiently, reduces alert fatigue, and bridges technical and business metrics.
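To make the scale concrete, here is a rough back-of-envelope sketch of how label choices drive cardinality. It assumes the usual worst-case model in which a metric's series count is bounded by the product of the distinct values of each label. The label names and value counts below are hypothetical and are not part of the problem statement; only the 300 services and 10 million series come from it.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label set with bounded values (assumption, for illustration only).
bounded = {"service": 300, "status_class": 5, "region": 4}

# Same metric after someone adds a hypothetical unbounded label such as a user ID.
unbounded = {**bounded, "user_id": 10_000}

# Figures taken from the problem statement: 10 million series across 300 services.
avg_series_per_service = 10_000_000 // 300

print(series_upper_bound(bounded))
print(series_upper_bound(unbounded))
print(avg_series_per_service)
```

Under these assumed numbers, one added high-cardinality label on a single metric has a worst-case bound larger than the platform's entire current footprint, which is the kind of explosion the design needs to detect and contain.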