GitHub CTO Vlad Fedorov published a post addressing three recent availability incidents (Feb 2, Feb 9, and Mar 5, 2026). This commentary analyzes the key failure patterns: a database saturation event caused by compounding factors, including a cache TTL reduction and a new AI model release; a failover-triggered incident in which a telemetry gap caused security policies to block VM metadata access; and a Redis cluster failover that left no writable primary because of a latent config issue. Key themes include brittle collapse (systems showing no warning signs until they tip), the reliability-security tradeoff, and the importance of giving incident responders manual controls and room to maneuver, not just automated mechanisms.
Table of contents
Saturation, again and again and again
Taking it to the limit, and then over it
The thing about tipping points is that you don’t notice until you tip
Failovers are a different mode of operation
Reliability vs security, the eternal struggle
It’s not just about automation, it’s about more options for responders