Design an observability strategy for a team operating 15 microservices. The strategy should cover the three pillars (metrics, logs, traces) and specify instrumentation points, alert hygiene practices, and dashboard design.
## Problem
Your team operates 15 microservices across Go, Python, and Node.js. Observability is ad hoc: some services have Prometheus metrics, some have structured logs, and distributed tracing is not implemented anywhere. When an incident occurs, the on-call engineer spends the first 15 minutes just working out which service is involved. Design an observability strategy that gives the team clear visibility into system behavior.