10,000 GPUs, One TSDB: Cardinality at GPU Scale
This title could be clearer and more informative.Try out Clickbait Shieldfor free (5 uses left this month).
GPU monitoring at scale creates a cardinality explosion in time-series databases. A 1,000-node cluster with 8 GPUs each and 60 metrics per GPU generates 1.4M unique time series from hardware metrics alone — before adding pod names, Slurm job IDs, or model versions. The core architectural solution is splitting high-cardinality fields (pod names, job IDs, model versions, ECC per-block counters) onto logs rather than metrics, since logs scale linearly with volume rather than with unique label combinations. Additional strategies include dimension pruning (only attach labels a dashboard actually needs to GROUP BY), interval processing to reduce 10-15s raw collection to 60s resolution, pre-aggregation at collection time, and Prometheus recording rules for commonly-queried aggregations. Specific failure modes at 1,000+ GPU deployments are covered: TSDB memory exhaustion from abandoned series due to pod churn, query timeouts on high-cardinality aggregations, AMD MI300X ECC block explosion (40 blocks × 8 GPUs per node), and NCCL rank cardinality in distributed training. A scaling table maps deployment size to recommended strategy, from no special handling under 100 GPUs to per-cluster sharding above 10,000.
Table of contents
Why GPU Monitoring Explodes CardinalityThe Logs vs Metrics SplitDimension PruningInterval Processing and AggregationWhat Breaks at ScaleDesign for Scale on Day OneSort: