GitHub's CTO Vlad Fedorov addresses two recent incidents: a merge queue regression on April 23 that corrupted squash merges in 230 repositories, and an Elasticsearch overload on April 27 that disrupted search-backed UI features. He explains that GitHub is raising its capacity target from 10X to 30X, driven by explosive growth in agentic development workflows since late 2025. Reliability improvements include isolating critical services, reducing single points of failure, migrating performance-sensitive code from Ruby to Go, moving to public cloud and pursuing multi-cloud, and optimizing merge queue and large monorepo handling. GitHub has also committed to greater transparency via updated status pages and proactive incident communication.