GitHub's CTO details three major availability incidents that occurred on February 2, February 9, and March 5. Root causes included a database cluster overloaded by compounding load factors (a cache TTL change, increased API traffic from client apps, and new model releases), a telemetry gap that caused security policies to block VM metadata access for Actions hosted runners, and a Redis failover that left a cluster with no writable primary. Contributing factors included insufficient architectural isolation, inadequate load shedding, and monitoring gaps. Remediation includes redesigning the user cache system, isolating critical dependencies, accelerating the Azure migration (currently at 12.5% of traffic, targeting 50% by July), and breaking the monolith apart into isolated services.
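
The summary names inadequate load shedding as a contributing factor. As a rough illustration of the general idea only, the sketch below shows a simple admission controller that keeps part of a backend's capacity reserved for critical requests and rejects everything else once the unreserved share is used up. It is not GitHub's implementation: the class name `LoadShedder`, its fields, and the numbers are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class LoadShedder:
    """Hypothetical admission controller; names and limits are illustrative."""

    capacity: int                   # max concurrent requests the backend tolerates
    reserved_for_critical: int = 0  # slots only critical requests may use
    in_flight: int = 0              # requests currently admitted

    def try_admit(self, critical: bool) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        # Non-critical traffic may only use the unreserved part of capacity.
        limit = self.capacity if critical else self.capacity - self.reserved_for_critical
        if self.in_flight >= limit:
            return False  # shed: the caller should fail fast instead of queueing
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)


if __name__ == "__main__":
    shedder = LoadShedder(capacity=10, reserved_for_critical=2)
    admitted = sum(shedder.try_admit(critical=False) for _ in range(12))
    print(f"non-critical admitted: {admitted} of 12")
    print(f"critical still admitted: {shedder.try_admit(critical=True)}")
```

Rejecting early keeps an overloaded dependency from accumulating queued work, and the reserved slots let the most important traffic continue while lower-priority callers back off.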

8m read time. From github.blog
