SRE Practices

SLOs, error budgets, capacity planning, toil reduction, change management, and observability culture.

16 problems, 5 modules

4 problems

SLOs, SLIs & Error Budgets

Junior, 30m

Define SLOs for a Web Service

Define meaningful SLOs for a user-facing web service handling 50,000 requests per minute (50K RPM), selecting appropriate SLIs for availability, latency, and correctness, and setting realistic targets with measurement windows.

slo, sli, error budget

Mid-Senior, 35m

Design an Error Budget Policy

Design an error budget policy for a platform serving 100M requests/day, including budget calculation, burn rate alerts, exhaustion responses, and a stakeholder buy-in framework.

error budget, slo, sli
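
For orientation, the budget calculation named in this problem can be sketched as below. Only the 100M requests/day figure comes from the problem statement; the 99.9% availability target, the 30-day window, and the function names are assumptions chosen for illustration, not part of the problem.

```python
# Only REQUESTS_PER_DAY comes from the problem text; the rest is assumed.
REQUESTS_PER_DAY = 100_000_000
SLO_TARGET = 0.999   # assumed availability target (hypothetical)
WINDOW_DAYS = 30     # assumed rolling measurement window (hypothetical)


def error_budget(requests_per_day: int, slo_target: float, window_days: int) -> int:
    """Number of failed requests the SLO tolerates over the whole window."""
    total_requests = requests_per_day * window_days
    return round(total_requests * (1 - slo_target))


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (failed / total) / (1 - slo_target)


budget = error_budget(REQUESTS_PER_DAY, SLO_TARGET, WINDOW_DAYS)
print(f"Error budget over {WINDOW_DAYS} days: {budget:,} failed requests")

# Example: 2,000 failures observed in a sample of 1,000,000 requests.
print(f"Burn rate: {burn_rate(2_000, 1_000_000, SLO_TARGET):.2f}")
```

A burn rate of 1 spends the budget exactly by the end of the window; a sustained rate above 1 exhausts it early, which is the signal that burn rate alerts are built on.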

Senior, 45m

Design an SLO-Based Alerting System

Design a multi-burn-rate alerting system for 200+ services that detects SLO violations early using fast-burn and slow-burn windows while minimizing false positives and alert fatigue.

slo, alerting, error budget

Staff, 45m

Explain SLO Trade-offs at Staff Level

Analyze advanced SLO challenges including over-reliance on SLOs, gaming metrics, platform team SLOs, dependency chains, and composite SLO design for a 500-service organization.

slo, sli, error budget

3 problems

Capacity Planning

Senior, 45m

Design a Capacity Planning System

Design an automated capacity planning system with demand forecasting, reservation management, bin-packing optimization, failure domain headroom, and what-if modeling for a 10,000-server fleet.

capacity planning, automation, monitoring

Mid-Senior, 35m

Design a Load Testing Strategy

Design a comprehensive load testing strategy covering load, stress, and soak testing, traffic replay, synthetic traffic generation, and safe production load testing for a microservices architecture.

load testing, capacity planning, monitoring

Junior, 30m

Explain Capacity Planning Fundamentals

Explain the fundamentals of capacity planning for a growing web application, covering demand forecasting, organic vs. inorganic growth, headroom targets, lead times, and capacity models.

capacity planning, load testing, monitoring

3 problems

Toil Reduction & Automation

Mid-Senior, 35m

Design a Runbook Automation System

Design a runbook automation platform that converts manual incident response procedures into automated workflows with human-in-the-loop safety checks, rollback capabilities, and audit trails.

automation, toil, monitoring

Senior, 45m

Design a Self-Healing Infrastructure Platform

Design a self-healing platform that automatically detects and remediates infrastructure failures including node replacement, capacity adjustment, and anomaly response, with blast radius limiting and human escalation.

self healing, automation, toil

Junior, 25m

Identify and Measure Toil

Define toil in an SRE context, create a toil inventory for a team managing 30 production services, establish a measurement framework, and prioritize automation opportunities using the 50% cap on time spent on toil.

toil, automation, slo

3 problems

Change Management

Senior, 45m

Design a Change Automation Platform

Design a change automation platform with change request workflows, automated risk scoring, intelligent approval routing, deployment dependency graphs, and change failure rate tracking for an organization making 500 changes per day.

change management, automation, canary

Mid-Senior, 35m

Design a Progressive Rollout System

Design a progressive rollout system with automated canary analysis, traffic shifting from 1% to 100%, bake time enforcement, automated rollback triggers, and deployment velocity optimization.

canary, rollout, change management

Junior, 25m

Explain Production Change Management

Explain production change management principles including change review, rollback plans, blast radius reduction, change windows, feature flags, and canary deployments for a team deploying 10 times per day.

change management, canary, rollout

3 problems

Observability Culture

Mid-Senior, 35m

Design a Metrics and Alerting Platform

Design a metrics and alerting platform addressing cardinality management, alert fatigue reduction, signal correlation, business vs. infrastructure metrics, and cost optimization for an organization running 300 services.

monitoring, alerting, observability

Senior, 45m

Design an Observability Platform for Microservices

Design a unified observability platform for 500 microservices integrating distributed tracing, log correlation, service dependency mapping, anomaly detection, and cost-effective retention strategies.

observability, tracing, logging

Junior, 30m

Design an Observability Strategy

Design an observability strategy covering the three pillars (metrics, logs, traces), choosing instrumentation points, alert hygiene practices, and dashboard design for a team operating 15 microservices.

observability, monitoring, logging
