On-Call & Troubleshooting

Systematic debugging, performance investigation, infrastructure incidents, and incident management processes.

15 problems · 4 modules

3 problems

Systematic Debugging

Senior · 45m

Diagnose a Cascading Failure

Diagnose and halt a cascading failure spreading across a microservice architecture, identifying blast radius, breaking retry amplification loops, and coordinating war room response.

troubleshooting, debugging, incident response

Mid-Senior · 35m

Diagnose a Memory Leak in Production

Diagnose and resolve a memory leak in a production service where RSS grows linearly until the OOM killer terminates the process, using heap profiling, cgroup memory accounting, and systematic elimination.

troubleshooting, debugging, memory

Junior · 25m

Walk Through a 5xx Error Investigation

Systematically investigate a spike in 5xx errors across a microservice architecture, from initial alert to root cause identification, using structured debugging methodology.

troubleshooting, debugging, incident response

4 problems

Performance Incidents

Senior · 45m

Diagnose a Thundering Herd Problem

Diagnose and resolve thundering herd problems in three scenarios (cache stampede after key expiry, connection storm after a service restart, and lock convoy on hot rows) using request coalescing, jittered expiry, and probabilistic early recomputation.

troubleshooting, debugging, performance

Mid-Senior · 35m

Diagnose Database Performance Degradation

Diagnose a PostgreSQL database that has degraded from 2ms to 200ms average query latency, using slow query logs, EXPLAIN plans, index analysis, lock waits, and replication lag investigation.

troubleshooting, debugging, database

Junior · 25m

Diagnose High CPU Usage

Systematically diagnose high CPU usage on a production Linux server, distinguishing between user, system, and steal time, using top, perf, and flame graphs to identify the offending process and code path.

troubleshooting, debugging, performance

Mid-Senior · 35m

Diagnose Latency Spikes

Diagnose intermittent latency spikes where p99 latency reaches 100x the p50, using distributed tracing, GC analysis, lock contention profiling, and network diagnostics to isolate the cause.

troubleshooting, debugging, performance

5 problems

Infrastructure Incidents

Mid-Senior · 30m

Diagnose a Certificate Expiry

Diagnose a production outage caused by an expired TLS certificate, covering chain validation, intermediate vs leaf expiry, SNI-based multi-cert edge cases, OCSP stapling, and safe rotation without dropping connections.

troubleshooting, debugging, tls

Mid-Senior · 35m

Diagnose a Disk Space Issue

Diagnose a production disk-full incident where df reports 100% usage but du disagrees, covering deleted-but-open files, inode exhaustion, filesystem reserved blocks, and safe recovery without restarting stateful services.

troubleshooting, debugging, linux

Mid-Senior · 35m

Diagnose a DNS Outage

Diagnose a DNS outage affecting service discovery, analyzing the resolver chain, TTL behavior during failures, negative caching effects, and designing DNS failover strategies.

troubleshooting, debugging, dns

Senior · 45m

Diagnose a Split-Brain Scenario

Diagnose and resolve a split-brain scenario in a distributed database cluster where a network partition has caused two nodes to independently accept writes, leading to data divergence and consistency violations.

troubleshooting, debugging, networking

Junior · 30m

Diagnose Network Connectivity Issues

Systematically diagnose network connectivity issues in a cloud environment, using ping, traceroute, mtr, DNS resolution checks, and firewall rule analysis to identify whether the problem is DNS, routing, firewall, or application-level.

troubleshooting, debugging, networking

3 problems

Incident Management

Senior · 45m

Design an Incident Management System

Design an incident management system with alert routing and escalation, on-call rotation management, pager fatigue reduction, runbook automation, and SLO-based alerting that scales to a 500-engineer organization.

incident response, oncall, troubleshooting

Junior · 30m

Run an Effective Incident Response

Design and execute an effective incident response process including incident commander responsibilities, severity classification, communication cadence, escalation paths, and status page updates.

incident response, oncall, troubleshooting

Mid-Senior · 35m

Write a Blameless Postmortem

Write a blameless postmortem for a production incident, including timeline reconstruction, root cause analysis using the 5 Whys method, identification of contributing factors, and action items with owners and deadlines.

postmortem, incident response, oncall
