10 trillion samples a day: Scaling beyond traditional monitoring infra at Databricks
The Databricks engineering team shares how it scaled its monitoring infrastructure to handle 10 trillion metric samples per day and 5 billion active timeseries. The team developed three key solutions: (1) Pantheon, a fork of the open-source Thanos TSDB with a custom control plane, tiered storage, and automated lifecycle management, which reduced downtime 5x and saved millions in cloud costs; (2) a metric aggregation pipeline built on Telegraf and Dicer that shields the TSDBs from the cardinality explosion caused by serverless workloads launching tens of millions of short-lived VMs daily; and (3) Hydra, a lakehouse-native platform that stores 20 billion unaggregated timeseries in Delta Lake and translates PromQL to SQL for Grafana integration, enabling high-cardinality debugging at 50x lower storage cost than Thanos.
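To make the cardinality point concrete, here is a minimal sketch of the general idea behind label-dropping aggregation. It is not the Telegraf or Dicer API and not Databricks' actual pipeline: the function `aggregate_samples`, the label names (`service`, `region`, `vm_id`), and the choice to sum values (which assumes counter-like metrics) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate_samples(samples, drop_labels):
    """Sum sample values after removing the given high-cardinality labels.

    `samples` is an iterable of (labels_dict, value) pairs. Hypothetical
    helper for illustration only; not part of any real library.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # Build a hashable series key from the labels that are kept.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


# Three series that differ by a per-VM label (assumed name: vm_id).
samples = [
    ({"service": "api", "region": "us-west", "vm_id": "vm-1"}, 3.0),
    ({"service": "api", "region": "us-west", "vm_id": "vm-2"}, 4.0),
    ({"service": "api", "region": "eu-west", "vm_id": "vm-3"}, 5.0),
]

# Dropping vm_id collapses the per-VM series into one series per
# (region, service) pair.
aggregated = aggregate_samples(samples, drop_labels={"vm_id"})
for series_key, total in sorted(aggregated.items()):
    print(dict(series_key), total)
```

With a per-VM label present, every short-lived VM creates a new timeseries; once that label is dropped before the data reaches the TSDB, the number of stored series depends only on the remaining labels, however many VMs are launched.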
Table of contents
Thanos timeseries databases
Cardinality and aggregation
High-cardinality data on the lakehouse
Takeaways