A detailed account of building a watchdog service for an RPG game server, and of how that watchdog evolved through multiple incidents. It started with simple process monitoring, then progressed to HTTP health checks, websocket-based validation, population thresholds, and timeout detection. The watchdog now detects deadlocks and runtime starvation.
17 min read. From kittygiraudel.com
Table of contents
Ensuring the server always runs
Incident #1: running doesn’t mean healthy
Incident #2: healthy is all relative
Incident #3: health checks are not live traffic
Incident #4: when minutes last forever
Why so many incidents?
Bonus: meaningful status page
Bonus: operational ergonomics
Lessons learned