This post shares how we’ve tackled 3 key problems to scale Databricks’ monitoring infrastructure:

1. Developing a reliable and efficient timeseries database (TSDB) architecture
2. Introducing metric aggregation to shield TSDBs from cardinality
3. Enabling highly dimensional troubleshooting with the Databricks lakehouse

databricks

The Databricks engineering team shares how it scaled its monitoring infrastructure to handle 10 trillion metric samples per day and 5 billion active timeseries. The team developed three key solutions: (1) Pantheon, a fork of the open-source Thanos TSDB with a custom control plane, tiered storage, and automated lifecycle management, which reduced downtime 5x and saved millions in cloud costs; (2) a metric aggregation pipeline, built on Telegraf and Dicer, that shields TSDBs from the cardinality explosion caused by serverless workloads launching tens of millions of short-lived VMs daily; and (3) Hydra, a lakehouse-native platform that stores 20 billion unaggregated timeseries in Delta Lake and translates PromQL to SQL for Grafana integration, enabling high-cardinality debugging at 50x lower storage cost than Thanos.
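To make the idea of shielding a TSDB from cardinality concrete, here is a minimal sketch of label-dropping aggregation. It is not the Telegraf and Dicer pipeline itself. The `aggregate` function and the label names `service` and `vm_id` are hypothetical, and it assumes the values are counters that can safely be summed.

```python
from collections import defaultdict


def aggregate(samples, drop_labels):
    """Sum sample values after removing the given labels from each series.

    `samples` is an iterable of (labels_dict, value) pairs. Series that
    differ only in a dropped label collapse into a single output series.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # Build a hashable, order-independent key from the labels we keep.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        totals[kept] += value
    return dict(totals)


# Hypothetical input: one series per short-lived VM.
samples = [
    ({"service": "api", "vm_id": "vm-1"}, 3.0),
    ({"service": "api", "vm_id": "vm-2"}, 5.0),
    ({"service": "db", "vm_id": "vm-3"}, 2.0),
]

# Dropping the per-VM label leaves one series per service.
aggregated = aggregate(samples, drop_labels={"vm_id"})
```

In this sketch the number of output series depends on the number of services rather than the number of VMs, which is the property that keeps short-lived instances from inflating the active series count.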

10 trillion samples a day: Scaling beyond traditional monitoring infra at Databricks