Diagnose and resolve Linux OOM killer events across a production fleet: oom_score_adj tuning, cgroup memory limits, overcommit policies, and memory leak detection.
## Problem
Your team operates a fleet of 500 Linux servers running a mix of microservices in containers. Over the past week, you have seen 15 OOM kill events across the fleet, causing service disruptions. Some kills target critical services while leaving memory-hungry batch jobs untouched. Your manager asks you to design a comprehensive strategy for diagnosing, preventing, and handling OOM events.
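One starting point for the observation that critical services die while batch jobs survive is to look at how the kernel currently ranks processes on an affected host. The sketch below reads the standard procfs files `/proc/<pid>/oom_score`, `/proc/<pid>/oom_score_adj`, and `/proc/<pid>/comm` and sorts processes by score, highest first. The function name `read_oom_scores` and its `proc_root` parameter are hypothetical names chosen for illustration, and this is a minimal diagnostic aid rather than a full solution to the problem.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return (pid, name, oom_score, oom_score_adj) tuples, highest score first.

    proc_root is a hypothetical parameter so the function can be pointed at
    a test directory; on a real host it defaults to /proc.
    """
    rows = []
    for entry in Path(proc_root).iterdir():
        # Only numeric directory names correspond to processes.
        if not entry.name.isdigit():
            continue
        try:
            score = int((entry / "oom_score").read_text().strip())
            adj = int((entry / "oom_score_adj").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            # The process exited mid-scan or its files are unreadable; skip it.
            continue
        rows.append((int(entry.name), name, score, adj))
    # A higher oom_score means the process is a more likely OOM kill candidate.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Calling `read_oom_scores()[:10]` on a host lists the ten most likely victims. If a critical service appears near the top while a batch job sits lower, that suggests the `oom_score_adj` values need review.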