Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters

GPU Usage Monitor is an open-source Helm chart that bundles DCGM Exporter, kube-state-metrics, Prometheus, and Grafana into a single deployment for real-time GPU observability on Kubernetes. It addresses common pain points for platform teams: over-provisioned GPU allocations, silent idle workloads, and scheduling bottlenecks that only surface when users complain. A single helm install command delivers pre-built dashboards showing GPU allocation trends, per-workload memory consumption, compute utilization with configurable thresholds, and running vs. pending pod counts. The chart supports external Prometheus integration, custom resource limits, and GPU type filtering for heterogeneous fleets (Hopper, Blackwell, etc.). Available under Apache 2.0 on GitHub, it requires Kubernetes 1.19+, Helm 3.0+, and DCGM Exporter on GPU nodes.
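
As a rough illustration of what a configurable utilization threshold means in practice, the sketch below labels workloads by their mean GPU utilization. It is not taken from the chart: the workload names, sample values, and the 5% / 80% cutoffs are hypothetical. The only assumption about the real stack is that utilization arrives as percentages, as DCGM Exporter reports through its DCGM_FI_DEV_GPU_UTIL metric.

```python
from statistics import mean

# Hypothetical samples: GPU utilization percentages per workload, of the kind
# DCGM Exporter reports through its DCGM_FI_DEV_GPU_UTIL metric.
SAMPLES = {
    "training-job-a": [92.0, 88.5, 95.0],
    "notebook-b": [0.0, 1.5, 0.5],
    "inference-c": [35.0, 40.0, 30.0],
}


def classify_utilization(samples, idle_below=5.0, busy_above=80.0):
    """Label each workload 'idle', 'moderate', or 'busy' by mean utilization.

    The thresholds are illustrative defaults, not values from the chart.
    """
    labels = {}
    for workload, values in samples.items():
        if not values:
            # No samples collected yet, so no judgment can be made.
            labels[workload] = "no-data"
            continue
        avg = mean(values)
        if avg < idle_below:
            labels[workload] = "idle"
        elif avg > busy_above:
            labels[workload] = "busy"
        else:
            labels[workload] = "moderate"
    return labels


if __name__ == "__main__":
    for workload, label in classify_utilization(SAMPLES).items():
        print(f"{workload}: {label}")
```

Raising or lowering `idle_below` changes which workloads count as silently idle, which is the kind of tuning a configurable dashboard threshold allows.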

#kubernetes

#grafana

#prometheus

May 21 • 5m read time • From developer.nvidia.com

Table of contents

The observability gap in GPU-Accelerated Kubernetes clusters
What is the GPU Usage Monitor?
What the dashboards surface
Configuration
Learn more
