InfraPrep

## Problem

Your organization has grown to 500 microservices across 5 regions. The existing observability setup (disparate Prometheus instances, Elasticsearch for logs, and no tracing) cannot support efficient incident investigation. During an incident, the mean time to identify the root-cause service is 25 minutes, with engineers manually searching logs across services. Design a unified observability platform that integrates distributed tracing, log correlation, service dependency mapping, and anomaly detection, while keeping the platform's cost under 8% of infrastructure spend.
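To make two of the requirements concrete, here is a minimal Python sketch of what "log correlation" and the 8% cost cap could look like. It is not a proposed solution. It assumes every log record carries a `trace_id` propagated by the tracing layer; the `LogRecord` fields, the function names, and the "earliest error" heuristic are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    # Hypothetical log schema: assumes each service stamps its logs
    # with the trace ID of the request being handled.
    trace_id: str
    service: str
    timestamp_ms: int
    level: str
    message: str


def correlate_by_trace(records):
    """Group log records by trace ID, sorted by timestamp within each trace."""
    by_trace = defaultdict(list)
    for rec in records:
        by_trace[rec.trace_id].append(rec)
    for recs in by_trace.values():
        recs.sort(key=lambda r: r.timestamp_ms)
    return dict(by_trace)


def first_error_service(trace_records):
    """Return the service with the earliest ERROR in a time-sorted trace.

    This is only a heuristic for a root-cause candidate, not a diagnosis.
    """
    for rec in trace_records:
        if rec.level == "ERROR":
            return rec.service
    return None


def within_cost_budget(observability_spend, infrastructure_spend, cap=0.08):
    """Check the constraint that observability costs stay under the cap."""
    return observability_spend < cap * infrastructure_spend


# Example: three services log against the same trace.
logs = [
    LogRecord("t1", "checkout", 1030, "ERROR", "payment call failed"),
    LogRecord("t1", "payments", 1020, "ERROR", "db timeout"),
    LogRecord("t1", "gateway", 1000, "INFO", "request received"),
    LogRecord("t2", "gateway", 2000, "INFO", "request received"),
]
traces = correlate_by_trace(logs)
candidate = first_error_service(traces["t1"])
```

With a shared trace ID, the manual search across services reduces to a single lookup per request, which is the kind of improvement the 25-minute figure is meant to motivate.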

Design an Observability Platform for Microservices

Constraints
