Databricks engineering team shares how they scaled their monitoring infrastructure to handle 10 trillion metric samples per day and 5 billion active timeseries. Three key solutions were developed: (1) Pantheon, a fork of open-source Thanos TSDB with a custom control plane, tiered storage, and automated lifecycle management that reduced downtime 5x and saved millions in cloud costs; (2) a metric aggregation pipeline built on Telegraf and Dicer to shield TSDBs from cardinality explosion caused by serverless workloads launching tens of millions of short-lived VMs daily; and (3) Hydra, a lakehouse-native platform storing 20 billion unaggregated timeseries in Delta Lake with PromQL-to-SQL translation for Grafana integration, enabling high-cardinality debugging at 50x lower storage cost than Thanos.

10m read timeFrom databricks.com
Post cover image
Table of contents
Thanos timeseries databasesCardinality and aggregationHigh-cardinality data on the lakehouseTakeaways

Sort: