SLOs, error budgets, capacity planning, toil reduction, change management, and observability culture.
SLOs and error budgets: 4 problems
Define meaningful SLOs for a user-facing web service handling 50K RPM, selecting appropriate SLIs for availability, latency, and correctness, and setting realistic targets with measurement windows.
Design an error budget policy for a platform serving 100M requests/day, including budget calculation, burn rate alerts, exhaustion responses, and a stakeholder buy-in framework.
Design a multi-burn-rate alerting system for 200+ services that detects SLO violations early using fast-burn and slow-burn windows while minimizing false positives and alert fatigue.
Analyze advanced SLO challenges including over-reliance on SLOs, gaming metrics, platform team SLOs, dependency chains, and composite SLO design for a 500-service organization.
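As a rough illustration of the budget calculation and burn-rate alerting these problems ask for, the sketch below works through the arithmetic for the 100M requests/day platform. The 99.9% target, the 30-day window, and the 14.4x and 6x thresholds are assumptions chosen for the example (they follow a commonly cited multi-window convention), and the function names are hypothetical.

```python
# Sketch only: the 99.9% target, 30-day window, and the 14.4x / 6x thresholds
# are assumptions for illustration, not values given in the problems above.

REQUESTS_PER_DAY = 100_000_000   # from the error budget problem statement
SLO_TARGET = 0.999               # assumed availability target
WINDOW_DAYS = 30                 # assumed rolling measurement window


def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests allowed to fail over the window."""
    return total_requests * (1.0 - slo_target)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo_target)


def should_alert(long_ratio: float, short_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_ratio, slo_target) >= threshold
            and burn_rate(short_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    budget = error_budget(REQUESTS_PER_DAY * WINDOW_DAYS, SLO_TARGET)
    print(f"30-day budget: about {budget:,.0f} failed requests")
    # Fast burn: 2% errors in both the long (e.g. 1h) and short (e.g. 5m) window.
    print(should_alert(0.02, 0.02, SLO_TARGET, 14.4))
    # Slow burn: the long window is elevated but the short window has recovered.
    print(should_alert(0.008, 0.0005, SLO_TARGET, 6.0))
```

Requiring the short window to agree with the long one is what keeps an alert from firing after the problem has already stopped, which is one way to limit false positives across a large number of services.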
Capacity planning: 3 problems
Design an automated capacity planning system with demand forecasting, reservation management, bin-packing optimization, failure domain headroom, and what-if modeling for a 10,000-server fleet.
Design a comprehensive load testing strategy covering load, stress, and soak testing, traffic replay, synthetic traffic generation, and safe production load testing for a microservices architecture.
Explain the fundamentals of capacity planning for a growing web application, covering demand forecasting, organic vs inorganic growth, headroom targets, lead times, and capacity models.
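A minimal sketch of the kind of capacity model the problems above refer to: project peak demand forward by the procurement lead time, then size the fleet so it stays under a target utilization with spare capacity for a failure domain. Every number and parameter name here is a hypothetical placeholder, and compound monthly growth is only one possible forecasting assumption.

```python
import math


def required_servers(peak_rps: float,
                     monthly_growth: float,
                     lead_time_months: int,
                     rps_per_server: float,
                     target_utilization: float,
                     spare_servers: int = 0) -> int:
    """Servers needed when capacity ordered today actually arrives.

    peak_rps           -- current observed peak demand
    monthly_growth     -- assumed organic growth rate per month (0.10 = 10%)
    lead_time_months   -- time between ordering and serving traffic
    rps_per_server     -- measured per-server capacity (e.g. from load tests)
    target_utilization -- fraction of capacity to use at peak (headroom = 1 - this)
    spare_servers      -- extra servers to survive losing a failure domain
    """
    projected_peak = peak_rps * (1 + monthly_growth) ** lead_time_months
    usable_per_server = rps_per_server * target_utilization
    return math.ceil(projected_peak / usable_per_server) + spare_servers


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 RPS peak, 10% monthly growth, 3-month lead time.
    print(required_servers(1000, 0.10, 3, rps_per_server=100,
                           target_utilization=0.5, spare_servers=1))
```

Inorganic growth (a launch or migration) does not follow the growth curve, so in a model like this it would be added to the projected peak as a separate step.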
Toil reduction: 3 problems
Design a runbook automation platform that converts manual incident response procedures into automated workflows with human-in-the-loop safety checks, rollback capabilities, and audit trails.
Design a self-healing platform that automatically detects and remediates infrastructure failures including node replacement, capacity adjustment, and anomaly response, with blast radius limiting and human escalation.
Define toil in an SRE context, create a toil inventory for a team managing 30 production services, establish a measurement framework, and prioritize automation opportunities using the 50% toil budget cap.
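One way to make the toil measurement framework concrete is to record each toil item with its weekly cost, compare the total against the 50% cap, and rank automation candidates by payback time. The sketch below assumes a simple payback heuristic; the `ToilItem` fields and the sample data are hypothetical.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the 50% toil budget cap named in the problem above


@dataclass
class ToilItem:
    name: str
    hours_per_week: float         # recurring manual effort
    automation_cost_hours: float  # one-off estimate to automate it


def toil_fraction(items: list[ToilItem], team_hours_per_week: float) -> float:
    """Share of the team's weekly hours spent on toil."""
    return sum(i.hours_per_week for i in items) / team_hours_per_week


def prioritize(items: list[ToilItem]) -> list[ToilItem]:
    """Order items by payback period in weeks, shortest first."""
    return sorted(items, key=lambda i: i.automation_cost_hours / i.hours_per_week)


if __name__ == "__main__":
    inventory = [
        ToilItem("certificate rotation", 4, 40),
        ToilItem("manual deploys", 10, 20),
    ]
    fraction = toil_fraction(inventory, team_hours_per_week=40)
    print(f"toil: {fraction:.0%}, over cap: {fraction > TOIL_CAP}")
    print([item.name for item in prioritize(inventory)])
```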
Change management: 3 problems
Design a change automation platform with change request workflows, automated risk scoring, intelligent approval routing, deployment dependency graphs, and change failure rate tracking for an organization making 500 changes per day.
Design a progressive rollout system with automated canary analysis, traffic shifting from 1% to 100%, bake time enforcement, automated rollback triggers, and deployment velocity optimization.
Explain production change management principles including change review, rollback plans, blast radius reduction, change windows, feature flags, and canary deployments for a team deploying 10 times per day.
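The progressive rollout described above can be sketched as a loop over traffic steps with a health check at each one and a rollback on the first failure. The step values, the 2x error-ratio rule, and the function names are assumptions for illustration; a real system would also enforce a bake time at each step before evaluating the canary.

```python
from typing import Callable, Sequence, Tuple


def canary_healthy(baseline_error_rate: float,
                   canary_error_rate: float,
                   max_ratio: float = 2.0) -> bool:
    """Assumed rule: the canary may not exceed max_ratio times the baseline."""
    return canary_error_rate <= baseline_error_rate * max_ratio


def rollout(steps: Sequence[int],
            check: Callable[[int], bool]) -> Tuple[int, bool]:
    """Shift traffic step by step; return (final_percent, rolled_back).

    check(percent) is called after traffic reaches each step (and, in a real
    system, after its bake time). The first failed check triggers a rollback
    to 0% of traffic on the new version.
    """
    reached = 0
    for percent in steps:
        if not check(percent):
            return 0, True
        reached = percent
    return reached, False


if __name__ == "__main__":
    steps = (1, 5, 25, 50, 100)
    # Hypothetical canary that degrades once it receives 25% of traffic.
    errors_at = {1: 0.001, 5: 0.001, 25: 0.010, 50: 0.010, 100: 0.010}
    print(rollout(steps, lambda p: canary_healthy(0.001, errors_at[p])))
    print(rollout(steps, lambda p: True))
```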
Observability: 3 problems
Design a metrics and alerting platform addressing cardinality management, alert fatigue reduction, signal correlation, business vs infrastructure metrics, and cost optimization for an organization running 300 services.
Design a unified observability platform for 500 microservices integrating distributed tracing, log correlation, service dependency mapping, anomaly detection, and cost-effective retention strategies.
Design an observability strategy covering the three pillars (metrics, logs, traces), choosing instrumentation points, alert hygiene practices, and dashboard design for a team operating 15 microservices.
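For the cardinality management part of the metrics platform problem, a useful back-of-the-envelope check is that the worst-case number of time series for a metric is the product of the distinct values of each of its labels. The sketch below applies that to a hypothetical metric inventory; the metric names, label counts, and series budget are made up for the example.

```python
import math


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return math.prod(label_cardinalities.values())


def over_budget(metrics: dict[str, dict[str, int]], budget: int) -> list[str]:
    """Names of metrics whose worst-case series count exceeds the budget."""
    return sorted(name for name, labels in metrics.items()
                  if worst_case_series(labels) > budget)


if __name__ == "__main__":
    # Hypothetical inventory for a 300-service organization.
    inventory = {
        "http_requests_total": {"service": 300, "endpoint": 20, "status": 5},
        "http_requests_by_user": {"service": 300, "user_id": 100_000},
        "build_info": {"service": 300},
    }
    for name, labels in inventory.items():
        print(name, worst_case_series(labels))
    print(over_budget(inventory, budget=1_000_000))
```

A label with unbounded values, such as a user ID, dominates the product, which is why high-cardinality identifiers are usually kept in logs or traces rather than in metric labels.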