Uber redesigned its MySQL infrastructure by replacing external failover mechanisms with MySQL Group Replication (MGR), a Paxos-based consensus protocol. The new architecture embeds leader election and failure detection directly in the database layer using three-node MGR clusters, reducing failover time from minutes to under 10 seconds. Scalable read replicas fan out from secondaries to separate read scaling from write availability. Flow control within MGR prevents replication lag and errant GTIDs. The rollout was automated with a control plane handling onboarding, node replacement, topology rebalancing, and quorum protection. Trade-offs include a slight increase in write latency (hundreds of microseconds), but write unavailability during failures dropped dramatically. Single-primary mode was chosen over multi-primary for operational simplicity.

3m read timeFrom infoq.com
Post cover image

Sort: