Diagnose a production outage caused by an expired TLS certificate, covering chain validation, intermediate vs leaf expiry, SNI-based multi-cert edge cases, OCSP stapling, and safe rotation without dropping connections.
## Problem
Your edge tier is fronted by six TLS terminators. At 03:47 UTC, PagerDuty fired on a 30% spike in 5xx errors scoped to one hostname, `checkout.example.com`. Browsers show a certificate error; internal service-to-service calls log `x509: certificate has expired or is not yet valid`. The other five listeners and most hostnames are serving normally. The checkout hostname runs long-lived WebSocket sessions that must not be dropped during rotation. Diagnose the failure, rotate the certificate without killing connections, and make this failure mode impossible to repeat quietly.
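A first diagnostic step for the leaf-versus-intermediate question, as an added example that is not part of the original problem page: audit the validity window of every certificate a listener presents for the affected SNI name, not only the leaf. The names `AuditChain` and `mkCert`, the 30-day warning window, the incident date and the certificates themselves are constructed for illustration.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// AuditChain reports validity per presented certificate (index 0 = leaf), not only for the leaf.
func AuditChain(now time.Time, warn time.Duration, chain ...*x509.Certificate) (out []string) {
	for i, c := range chain {
		st := "ok"
		switch {
		case now.After(c.NotAfter):
			st = "expired"
		case now.Before(c.NotBefore):
			st = "not-yet-valid"
		case c.NotAfter.Sub(now) < warn:
			st = "expiring-soon"
		}
		out = append(out, fmt.Sprintf("%d %s %s", i, c.Subject.CommonName, st))
	}
	return out
}

// mkCert builds a constructed certificate (dates only, no CA flags); self-signed when parent is nil.
func mkCert(cn string, notAfter time.Time, parent *x509.Certificate, pk ed25519.PrivateKey) (*x509.Certificate, ed25519.PrivateKey) {
	pub, key, _ := ed25519.GenerateKey(rand.Reader)
	tpl := &x509.Certificate{SerialNumber: big.NewInt(1), Subject: pkix.Name{CommonName: cn}, NotBefore: notAfter.AddDate(0, 0, -90), NotAfter: notAfter}
	if parent == nil {
		parent, pk = tpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tpl, parent, pub, pk)
	if err != nil {
		panic(err)
	}
	c, _ := x509.ParseCertificate(der)
	return c, key
}

func main() {
	now := time.Date(2030, 1, 1, 3, 47, 0, 0, time.UTC) // constructed incident time
	mid, mk := mkCert("Constructed Intermediate CA", now.AddDate(0, 0, 10), nil, nil) // stand-in intermediate
	leaf, _ := mkCert("checkout.example.com", now.Add(-time.Hour), mid, mk)
	fmt.Println(AuditChain(now, 30*24*time.Hour, leaf, mid))
}
```

Against a live terminator, the chain argument would be the `PeerCertificates` slice from the `tls.ConnectionState` of a probe connection made with `ServerName` set to the hostname under test, repeated for each of the six listeners.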