Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling

NVIDIA GB200/GB300 NVL72 rack-scale supercomputers present a topology mismatch with traditional schedulers, which treat GPUs as a flat pool. The post explains how cluster UUIDs and clique IDs encode NVLink domain membership, and how NVIDIA Mission Control bridges hardware topology to schedulers. For Slurm, the topology/block plugin maps NVLink partitions to scheduling blocks for locality-aware placement. For Kubernetes, the NVIDIA DRA GPU driver introduces ComputeDomains, first-class objects that group nodes sharing an NVLink/MNNVL fabric and manage the IMEX lifecycle per workload. NVIDIA Run:ai automates topology detection, ComputeDomain creation, and topology-aware pod placement on top of Kubernetes. Topograph, an open-source tool, auto-discovers cluster topology and feeds it to schedulers, eliminating manual topology modeling.
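To make the idea of NVLink domain membership concrete, here is a minimal Python sketch of how a scheduler might group nodes instead of treating them as a flat pool. It assumes that nodes reporting the same cluster UUID and clique ID belong to the same NVLink domain. The `Node` record, its field names, and the `place_job` helper are hypothetical and only illustrate the idea; they are not part of Slurm, Kubernetes, Run:ai, or Topograph.

```python
from collections import defaultdict
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    cluster_uuid: str  # hypothetical field: cluster UUID reported by the node
    clique_id: int     # hypothetical field: clique ID reported by the node


def group_by_nvlink_domain(nodes):
    """Map each (cluster_uuid, clique_id) pair to the sorted node names in it."""
    domains = defaultdict(list)
    for node in nodes:
        domains[(node.cluster_uuid, node.clique_id)].append(node.name)
    return {key: sorted(names) for key, names in domains.items()}


def place_job(domains, nodes_needed):
    """Return nodes from the first single domain that fits the job, else None.

    Assumption: a job should never span two domains, because nodes in
    different domains do not share an NVLink fabric.
    """
    for key in sorted(domains):
        if len(domains[key]) >= nodes_needed:
            return domains[key][:nodes_needed]
    return None
```

A flat-pool scheduler would accept any `nodes_needed` free nodes. This sketch instead returns `None` when no single domain can hold the job, which is the kind of locality constraint that block-based and ComputeDomain-based placement is meant to express.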

#kubernetes

Apr 07 • 9m read time • From developer.nvidia.com

Table of contents

The core challenge: Rack-scale topology meets AI workload scheduling
Scheduling Multi-Node NVLink workloads with Slurm
IMEX management with Slurm: From rack-level service to per-job isolation
Extending multi-node NVLink support to Kubernetes and NVIDIA Run:ai
How NVIDIA Run:ai simplifies distributed workloads on NVLink domains
Automatic topology detection with Topograph
Learn more about advanced AI operations
