Building a fault-tolerant metrics storage system at Airbnb

Airbnb's engineering team describes how it built an internally operated metrics storage system that ingests 50 million samples per second and stores 2.5 petabytes of logical time series data across 1.3 billion active time series. The main challenges were isolating tenants with shuffle sharding, stabilizing writes and reads with per-tenant guardrails, sharding compaction for large tenants, and moving from a single-cluster to a multi-cluster architecture to reduce blast radius. The team automated tenant onboarding through a consolidated control plane, used Promxy for federated queries across clusters, and relied on Grafana's open-source Kubernetes rollout operators for consistent stateful deployments. Two lessons stand out: federated queries are 5–10x more expensive than single-cluster queries, and treating clusters as cattle rather than pets made cluster management scalable.
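To make the shuffle sharding idea concrete, here is a minimal sketch of the general technique. It is not Airbnb's implementation, and the function name, node names, and shard size are all hypothetical. Each tenant is mapped deterministically to a small pseudo-random subset of nodes, so a misbehaving tenant can affect only its own subset, and two tenants rarely share every node.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Return a deterministic pseudo-random subset of nodes for one tenant.

    Hypothetical helper for illustration only; production systems typically
    use more elaborate schemes (for example, zone-aware hash rings).
    """
    if not 0 < shard_size <= len(nodes):
        raise ValueError("shard_size must be between 1 and len(nodes)")
    # Derive the seed from a stable hash of the tenant ID so the same tenant
    # maps to the same shard on every call and in every process.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort the input first so the result does not depend on caller ordering.
    return sorted(rng.sample(sorted(nodes), shard_size))


# Assumed example fleet: 12 nodes, 3 nodes per tenant.
nodes = [f"ingester-{i}" for i in range(12)]
shard_a = shuffle_shard("tenant-a", nodes, 3)
shard_b = shuffle_shard("tenant-b", nodes, 3)

# Nodes that both tenants share; a fault caused by tenant-a can reach
# tenant-b only through these.
shared = set(shard_a) & set(shard_b)
print(shard_a, shard_b, shared)
```

With 12 nodes and a shard size of 3 there are 220 possible shards, so the chance that two tenants land on exactly the same three nodes is small. A larger shard size gives each tenant more capacity but also more overlap with other tenants.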

#data-science

#observability

#distributed-systems

#prometheus

#multi-tenancy

Apr 21 • 11m read time • From medium.com • By Rishabh Kumar

Table of contents

Navigating tenancy
Observability at scale
The outcome
Conclusion
Acknowledgments
