Mid-Senior · Troubleshooting · 35m · Google, Meta, Netflix, Datadog, Hashicorp

Diagnose and Resolve OOM Issues

Diagnose and resolve Linux OOM killer events across a production fleet: oom_score_adj tuning, cgroup memory limits, overcommit policies, and memory leak detection.

linux, memory, oom, cgroups, performance

## Problem

Your team operates a fleet of 500 Linux servers running a mix of microservices in containers. Over the past week, you have seen 15 OOM kill events across the fleet, causing service disruptions. Some kills target critical services while leaving memory-hungry batch jobs untouched. Your manager asks you to design a comprehensive strategy for diagnosing, preventing, and handling OOM events.
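
A first diagnostic step for the scenario above is finding out which processes were killed and whether each kill came from a cgroup memory limit or from host-wide memory exhaustion. The following is a minimal sketch, assuming the kernel log wording used by recent Linux kernels (`Out of memory: Killed process ...` and `Memory cgroup out of memory: Killed process ...`); older kernels word these lines differently, so the patterns may need adjusting. The sample log lines, process names, and the `OomKill` and `parse_oom_kills` names are hypothetical and exist only for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional


class OomKill(NamedTuple):
    pid: int
    comm: str                      # process name as reported by the kernel
    cgroup_limited: bool           # True when a memory cgroup limit triggered the kill
    oom_score_adj: Optional[int]   # None when the log line does not report it


# Assumed log wording; verify against the kernel versions in your fleet.
KILL_RE = re.compile(
    r"(?P<source>Memory cgroup out of memory|Out of memory): "
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
)
ADJ_RE = re.compile(r"oom_score_adj:(?P<adj>-?\d+)")


def parse_oom_kills(lines: Iterable[str]) -> List[OomKill]:
    """Extract OOM kill records from kernel log lines, skipping unrelated lines."""
    kills = []
    for line in lines:
        match = KILL_RE.search(line)
        if match is None:
            continue
        adj = ADJ_RE.search(line)
        kills.append(
            OomKill(
                pid=int(match["pid"]),
                comm=match["comm"],
                cgroup_limited=match["source"].startswith("Memory cgroup"),
                oom_score_adj=int(adj["adj"]) if adj else None,
            )
        )
    return kills


# Hypothetical sample lines, not taken from a real system.
SAMPLE_LOG = [
    "kernel: Out of memory: Killed process 4121 (payments-api) "
    "total-vm:8123456kB, anon-rss:6123456kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:1000 pgtables:12345kB oom_score_adj:0",
    "kernel: Memory cgroup out of memory: Killed process 977 (batch-etl) "
    "total-vm:2123456kB, anon-rss:1023456kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:1001 pgtables:4321kB oom_score_adj:500",
    "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

if __name__ == "__main__":
    kills = parse_oom_kills(SAMPLE_LOG)
    for kill in kills:
        scope = "cgroup limit" if kill.cgroup_limited else "host-wide"
        print(f"{kill.comm} (pid {kill.pid}): {scope}, oom_score_adj={kill.oom_score_adj}")
    print(Counter(kill.comm for kill in kills))
```

In practice the input could come from `dmesg` or `journalctl -k` output collected from each host. Grouping the results by process name and by kill scope shows whether critical services are being chosen as victims, which is the signal that `oom_score_adj` values or cgroup memory limits need revisiting.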

## Constraints

Detection Time

Blast Radius

Memory Limit Accuracy

Leak Detection
