A detailed account of building a watchdog service for an RPG game server, and of how it evolved through multiple incidents. It started with simple process monitoring, then progressed to HTTP health checks, WebSocket-based validation, population thresholds, and timeout detection. The watchdog now detects deadlocks and runtime starvation.

17 min read time. From kittygiraudel.com
Table of contents
Ensuring the server always runs
Incident #1: running doesn’t mean healthy
Incident #2: healthy is all relative
Incident #3: health checks are not live traffic
Incident #4: when minutes last forever
Why so many incidents?
Bonus: meaningful status page
Bonus: operational ergonomics
Lessons learned
