A detailed account of building a watchdog service for an RPG game server, which evolved through multiple incidents. It started with simple process monitoring, then progressed to HTTP health checks, websocket-based validation, population thresholds, and timeout detection. The watchdog now detects deadlocks and runtime starvation.

17 min read, from kittygiraudel.com
Table of contents
Ensuring the server always runs
Incident #1: running doesn’t mean healthy
Incident #2: healthy is all relative
Incident #3: health checks are not live traffic
Incident #4: when minutes last forever
Why so many incidents?
Bonus: meaningful status page
Bonus: operational ergonomics
Lessons learned
