Design a unified observability platform for 500 microservices integrating distributed tracing, log correlation, service dependency mapping, anomaly detection, and cost-effective retention strategies.
## Problem
Your organization has grown to 500 microservices across 5 regions. The existing observability setup (disparate Prometheus instances, Elasticsearch for logs, and no tracing) cannot support efficient incident investigation. During an incident, the mean time to identify the root-cause service is 25 minutes, because engineers search logs manually across services. Design a unified observability platform that integrates distributed tracing, log correlation, service dependency mapping, and anomaly detection, while keeping costs under 8% of infrastructure spend.