10,000 GPUs, One TSDB: Cardinality at GPU Scale

This title could be clearer and more informative.Try out Clickbait Shieldfor free (5 uses left this month).

GPU monitoring at scale creates a cardinality explosion in time-series databases. A 1,000-node cluster with 8 GPUs each and 60 metrics per GPU generates 1.4M unique time series from hardware metrics alone — before adding pod names, Slurm job IDs, or model versions. The core architectural solution is splitting high-cardinality fields (pod names, job IDs, model versions, ECC per-block counters) onto logs rather than metrics, since logs scale linearly with volume rather than with unique label combinations. Additional strategies include dimension pruning (only attach labels a dashboard actually needs to GROUP BY), interval processing to reduce 10-15s raw collection to 60s resolution, pre-aggregation at collection time, and Prometheus recording rules for commonly-queried aggregations. Specific failure modes at 1,000+ GPU deployments are covered: TSDB memory exhaustion from abandoned series due to pod churn, query timeouts on high-cardinality aggregations, AMD MI300X ECC block explosion (40 blocks × 8 GPUs per node), and NCCL rank cardinality in distributed training. A scaling table maps deployment size to recommended strategy, from no special handling under 100 GPUs to per-cluster sharding above 10,000.

#observability

#opentelemetry

#prometheus

Apr 21•8m read time•From last9.io

Table of contents

Why GPU Monitoring Explodes Cardinality The Logs vs Metrics Split Dimension Pruning Interval Processing and Aggregation What Breaks at Scale Design for Scale on Day One

Comment

Bookmark

Copy

Sort: