Airbnb's engineering team shares how they built an internally operated metrics storage system capable of ingesting 50 million samples per second and storing 2.5 petabytes of logical time series data across 1.3 billion active time series. Key challenges included multi-tenant isolation using shuffle sharding, stabilizing writes and reads with per-tenant guardrails, compaction sharding for large tenants, and transitioning from a single-cluster to a multi-cluster architecture to reduce blast radius. The team automated tenant onboarding via a consolidated control plane, used Promxy for cross-cluster federated querying, and leveraged Grafana Kubernetes OSS rollout operators for consistent stateful deployments. Key learnings include that federated queries are 5–10x more expensive than single-cluster queries, and that treating clusters as cattle rather than pets enabled scalable cluster management.
Table of contents
Navigating tenancyObservability at scaleGet Rishabh Kumar’s stories in your inboxThe outcomeConclusionAcknowledgmentsSort: