Diagnose a production disk-full incident where df reports 100% usage but du disagrees, covering deleted-but-open files, inode exhaustion, filesystem reserved blocks, and safe recovery without restarting stateful services.
## Problem
A stateful production service on a 500 GB ext4 volume mounted at /var started returning 5xx errors fifteen minutes ago. Application logs show `write: no space left on device`. `df -h /var` reports 100% usage. But when you run `du -sh /var`, the total is only about 300 GB — well under the volume size. You cannot restart the service without dropping ~30,000 active sessions and triggering a 15-minute cold start. Diagnose the root cause, restore writes, and produce a plan that does not require a full restart.
Sign up to access the full problem
Design canvas, rubric, hints, and model solutions.