Systematic debugging, performance investigation, infrastructure incidents, and incident management processes.
3 problems
Diagnose and halt a cascading failure spreading across a microservice architecture by identifying the blast radius, breaking retry amplification loops, and coordinating the war-room response.
Diagnose and resolve a memory leak in a production service where RSS grows linearly until the kernel OOM killer terminates the process, using heap profiling, cgroup memory accounting, and systematic elimination.
Systematically investigate a spike in 5xx errors across a microservice architecture, from initial alert to root cause identification, using structured debugging methodology.
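The retry amplification mentioned in the cascading-failure problem can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the default delays, and the choice of exponential backoff with full jitter are assumptions, not something this catalog prescribes.

```python
import random


def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on calls hitting the deepest service for one user request
    when every layer in the call chain makes up to `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers


def backoff_with_full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0,
                             rng=random.random) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)), which
    spreads retries out instead of letting clients retry in lockstep.
    `rng` is injectable so the behaviour can be checked deterministically.
    """
    return rng() * min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    # Four layers that each try 3 times can multiply load on the bottom layer 81x.
    print(worst_case_attempts(3, 4))
    print([round(backoff_with_full_jitter(n), 3) for n in range(5)])
```

Capping attempts at a single layer (or enforcing a retry budget) removes the exponent; jitter only spreads the remaining retries over time.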
4 problems
Diagnose and resolve thundering herd problems across multiple scenarios: cache stampede after key expiry, connection storm after service restart, and lock convoy on hot rows, using request coalescing, jittered expiry, and probabilistic early recomputation.
Diagnose a PostgreSQL database that has degraded from 2ms to 200ms average query latency, using slow query logs, EXPLAIN plans, index analysis, lock waits, and replication lag investigation.
Systematically diagnose high CPU usage on a production Linux server, distinguishing between user, system, and steal time, using top, perf, and flame graphs to identify the offending process and code path.
Diagnose intermittent latency spikes where p99 latency exceeds p50 by 100x, using distributed tracing, GC analysis, lock contention profiling, and network diagnostics to isolate the cause.
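Two of the cache stampede mitigations named in the thundering herd problem, jittered expiry and probabilistic early recomputation, fit in a few lines. This is a minimal sketch under stated assumptions: the function names, the `beta` tuning knob, and the 10% default spread are hypothetical, and the early-recompute rule follows the commonly described "XFetch" formulation rather than anything specified in this catalog.

```python
import math
import random


def jittered_ttl(base_ttl: float, spread: float = 0.1, rng=random.random) -> float:
    """Scale the TTL by a random factor in [1 - spread, 1 + spread) so that keys
    written at the same moment do not all expire at the same moment."""
    return base_ttl * (1.0 + spread * (2.0 * rng() - 1.0))


def should_recompute_early(now: float, expiry: float, recompute_cost: float,
                           beta: float = 1.0, rng=random.random) -> bool:
    """Decide whether this request should refresh the cached value before expiry.

    `recompute_cost` is how long the last recomputation took (seconds).
    The random gap grows with that cost, so expensive values start being
    refreshed earlier, and only a few requests volunteer at any instant.
    """
    # 1 - rng() lies in (0, 1], so log() never receives zero.
    gap = -recompute_cost * beta * math.log(1.0 - rng())
    return now + gap >= expiry


if __name__ == "__main__":
    print(jittered_ttl(300.0))
    print(should_recompute_early(now=95.0, expiry=100.0, recompute_cost=2.0))
```

Raising `beta` above 1 favours earlier refreshes; once `now` passes `expiry` the function always returns True.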
5 problems
Diagnose a production outage caused by an expired TLS certificate, covering chain validation, intermediate vs leaf expiry, SNI-based multi-cert edge cases, OCSP stapling, and safe rotation without dropping connections.
Diagnose a production disk-full incident where df reports 100% usage but du disagrees, covering deleted-but-open files, inode exhaustion, filesystem reserved blocks, and safe recovery without restarting stateful services.
Diagnose a DNS outage affecting service discovery, analyzing the resolver chain, TTL behavior during failures, negative caching effects, and designing DNS failover strategies.
Diagnose and resolve a split-brain scenario in a distributed database cluster where a network partition has caused two nodes to independently accept writes, leading to data divergence and consistency violations.
Systematically diagnose network connectivity issues in a cloud environment, using ping, traceroute, mtr, DNS resolution checks, and firewall rule analysis to identify whether the problem is DNS, routing, firewall, or application-level.
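For the disk-full problem in this group, two of the listed causes (filesystem reserved blocks and inode exhaustion) can be read straight from `statvfs`. The sketch below is an assumption-laden illustration: the `FsUsage` type and function names are hypothetical, and the percentage is computed as used / (used + available to unprivileged users), which approximates how `df` reports usage.

```python
import os
from dataclasses import dataclass


@dataclass
class FsUsage:
    block_used_pct: float   # usage as seen by unprivileged users
    reserved_blocks: int    # blocks free on disk but reserved for root
    inode_used_pct: float   # 100.0 means no new files can be created


def summarize(f_blocks: int, f_bfree: int, f_bavail: int,
              f_files: int, f_ffree: int) -> FsUsage:
    """Break raw statvfs counters into the figures that matter in a disk-full incident."""
    used = f_blocks - f_bfree
    reserved = f_bfree - f_bavail
    denom = used + f_bavail
    block_pct = 100.0 * used / denom if denom else 0.0
    # Some filesystems report zero total inodes; treat that as "not applicable".
    inode_pct = 100.0 * (f_files - f_ffree) / f_files if f_files else 0.0
    return FsUsage(block_pct, reserved, inode_pct)


def summarize_path(path: str = "/") -> FsUsage:
    """Unix-only convenience wrapper around os.statvfs."""
    st = os.statvfs(path)
    return summarize(st.f_blocks, st.f_bfree, st.f_bavail, st.f_files, st.f_ffree)
```

A result with `block_used_pct` at 100 and a nonzero `reserved_blocks` points at the root reserve; a low block percentage with `inode_used_pct` at 100 points at inode exhaustion. Deleted-but-open files need a separate check (for example with `lsof`), since `statvfs` still counts their blocks as used.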
3 problems
Design an incident management system with alert routing and escalation, on-call rotation management, pager fatigue reduction, runbook automation, and SLO-based alerting that scales to a 500-engineer organization.
Design and execute an effective incident response process including incident commander responsibilities, severity classification, communication cadence, escalation paths, and status page updates.
Write a blameless postmortem for a production incident, including timeline reconstruction, root cause analysis using the 5 Whys method, contributing factors identification, and action items with owners and deadlines.
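The SLO-based alerting mentioned in the incident management system design is often implemented as error-budget burn-rate alerts. The following is a rough sketch, not a definitive design: the function names are hypothetical, and the 14.4 threshold and 30-day (720-hour) SLO period are assumed example values of the kind used in multi-window burn-rate alerting.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    error_ratio: failed requests / total requests in the window (0..1).
    slo_target:  e.g. 0.999 for a 99.9% availability SLO.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float,
                    slo_period_hours: float = 720.0) -> float:
    """Fraction of the whole period's error budget spent by burning at `rate`
    for `window_hours`."""
    return rate * window_hours / slo_period_hours


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the threshold: the long window shows the
    burn is significant, the short window shows it is still happening."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Requiring both windows to breach is one way to reduce pager fatigue: a brief error spike that has already recovered fails the short-window check and does not page.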