Airbnb deployed distributed SQL databases across multiple Kubernetes clusters, each mapped to a different AWS Availability Zone, to achieve high availability and fault tolerance. They built custom Kubernetes operators to safely manage stateful workloads, coordinate node replacements, and maintain quorum during failures. Using AWS EBS for persistent storage, PVCs for volume management, and techniques like replica reads and stale reads, they mitigated latency issues while maintaining consistency. Their largest production cluster handles 3 million queries per second across 150 nodes with 300TB of data, achieving 99.95% availability through careful sequencing of upgrades, canary deployments, and overprovisioning for resilience.
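To make the quorum reasoning concrete, here is a minimal sketch of the kind of check an operator could run before taking a node down or when judging whether a layout tolerates the loss of one Availability Zone. It assumes simple majority quorum, which the summary does not spell out, and the function names and zone labels are hypothetical rather than Airbnb's actual operator code.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica group (assumes majority-based quorum)."""
    return replicas // 2 + 1


def safe_to_take_down(healthy: int, total: int, taking_down: int = 1) -> bool:
    """True if a majority of replicas remains healthy after removing nodes."""
    return healthy - taking_down >= quorum_size(total)


def survives_zone_loss(replicas_per_zone: dict[str, int]) -> bool:
    """True if losing any single zone still leaves a majority of replicas."""
    total = sum(replicas_per_zone.values())
    return all(
        total - in_zone >= quorum_size(total)
        for in_zone in replicas_per_zone.values()
    )


if __name__ == "__main__":
    # Hypothetical layout: one replica in each of three zones.
    layout = {"zone-a": 1, "zone-b": 1, "zone-c": 1}
    print(survives_zone_loss(layout))
    # With one replica already down, replacing another would break quorum.
    print(safe_to_take_down(healthy=2, total=3))
```

Under these assumptions, spreading replicas evenly across three zones keeps a majority when one zone fails, whereas placing two of three replicas in the same zone does not. The same arithmetic shows why a node replacement should wait until every replica group is fully healthy.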
Table of contents
Running Databases on Kubernetes
Node Replacement Coordination
Kubernetes Upgrades
Multi-Cluster Deployment for Fault Tolerance
Leveraging AWS EBS
Conclusion