Airbnb deployed distributed SQL databases across multiple Kubernetes clusters, each mapped to a different AWS Availability Zone, to achieve high availability and fault tolerance. They built custom Kubernetes operators to safely manage stateful workloads, coordinate node replacements, and maintain quorum during failures. Using AWS EBS for persistent storage, PVCs for volume management, and techniques like replica reads and stale reads, they mitigated latency issues while maintaining consistency. Their largest production cluster handles 3 million queries per second across 150 nodes with 300TB of data, achieving 99.95% availability through careful sequencing of upgrades, canary deployments, and overprovisioning for resilience.
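The quorum constraint behind these node replacements can be sketched in a few lines. This is only an illustration, assuming a majority-based replication scheme with one replica group spread across zones. The function names (`has_quorum`, `can_replace_node`, `survives_zone_loss`) and the zone labels are hypothetical and are not taken from Airbnb's operators.

```python
def has_quorum(healthy: int, total: int) -> bool:
    """Return True when a strict majority of replicas is healthy."""
    return healthy > total // 2


def can_replace_node(total_replicas: int, unavailable: int) -> bool:
    """Return True if taking one more replica offline still leaves a majority.

    `unavailable` counts replicas that are already down or being replaced.
    """
    healthy_after = total_replicas - unavailable - 1
    return has_quorum(healthy_after, total_replicas)


def survives_zone_loss(replicas_per_zone: dict[str, int]) -> bool:
    """Return True if losing any single zone still leaves a majority of replicas."""
    total = sum(replicas_per_zone.values())
    return all(
        has_quorum(total - count, total)
        for count in replicas_per_zone.values()
    )


if __name__ == "__main__":
    # Three replicas, one per zone (hypothetical labels): one zone can fail.
    print(survives_zone_loss({"az-a": 1, "az-b": 1, "az-c": 1}))
    # With one replica already down, replacing another would break quorum.
    print(can_replace_node(total_replicas=3, unavailable=1))
```

An operator built on a check like this would refuse to start a replacement until the previous replica is healthy again, which is one way to read the "careful sequencing" described above.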