A conference talk covering chaos engineering and game days as a structured practice for improving incident response. The speaker explains the four principles of chaos engineering (define the steady state, form a hypothesis, run the experiment, verify the results) plus a fifth principle of improvement. Game days are presented as controlled chaos experiments that double as incident response drills, using a full incident command structure that includes commanders, scribes, and subject matter experts. The talk covers when to run game days (stable error budgets, calm periods, post-release), when not to run them (known issues, post-reorg, pre-peak season, no leadership buy-in), how PagerDuty structures its 'Failure Fridays', and how to get organizational buy-in by framing chaos engineering as a proactive alternative to unpredictable real incidents.

42m watch time
