GPU Usage Monitor is an open-source Helm chart that bundles DCGM Exporter, kube-state-metrics, Prometheus, and Grafana into a single deployment for real-time GPU observability on Kubernetes. It addresses common pain points for platform teams: over-provisioned GPU allocations, silent idle workloads, and scheduling bottlenecks that only surface when users complain. A single helm install command delivers pre-built dashboards showing GPU allocation trends, per-workload memory consumption, compute utilization with configurable thresholds, and running vs. pending pod counts. The chart supports external Prometheus integration, custom resource limits, and GPU type filtering for heterogeneous fleets (Hopper, Blackwell, etc.). Available under Apache 2.0 on GitHub, it requires Kubernetes 1.19+, Helm 3.0+, and DCGM Exporter on GPU nodes.
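The "compute utilization with configurable thresholds" idea can be illustrated with a small Python sketch. This is not the chart's own dashboard query, which the summary does not show. It assumes the standard DCGM Exporter metric `DCGM_FI_DEV_GPU_UTIL` and the usual Prometheus instant-query response shape; the helper names (`build_idle_query`, `idle_gpus`), the 15-minute window, the `Hostname` and `gpu` label choice, and the sample data are all hypothetical.

```python
# Sketch only: flags GPUs whose average utilization falls below a threshold.
# Function names, the window, and SAMPLE_RESPONSE are illustrative assumptions.

# DCGM Exporter metric for GPU compute utilization (percent).
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def build_idle_query(threshold_pct, window="15m"):
    """Build a PromQL expression that keeps only GPUs whose average
    utilization over `window` is below `threshold_pct`."""
    return f"avg_over_time({GPU_UTIL_METRIC}[{window}]) < {threshold_pct}"


def idle_gpus(response):
    """Turn a Prometheus instant-query response (already parsed from JSON)
    into a list of (hostname, gpu_index, utilization), lowest first."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rows = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # Prometheus returns the value as a string
        rows.append((labels.get("Hostname", "?"), labels.get("gpu", "?"), float(value)))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical response, shaped like the body returned by /api/v1/query.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "node-a", "gpu": "0"}, "value": [1700000000, "3.5"]},
            {"metric": {"Hostname": "node-b", "gpu": "1"}, "value": [1700000000, "0"]},
        ],
    },
}

if __name__ == "__main__":
    print(build_idle_query(10))
    for host, gpu, util in idle_gpus(SAMPLE_RESPONSE):
        print(f"{host} gpu={gpu} avg_util={util:.1f}%")
```

In practice the query string would be sent to whichever Prometheus instance the chart is configured to use, and the threshold would be tuned per GPU type.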

5m read time · From developer.nvidia.com
Table of contents
The observability gap in GPU-Accelerated Kubernetes clusters
What is the GPU Usage Monitor?
What the dashboards surface
Configuration
Learn more
