NVIDIA GB200/GB300 NVL72 rack-scale supercomputers present a topology mismatch with traditional schedulers that treat GPUs as a flat pool. The post explains how cluster UUIDs and clique IDs encode NVLink domain membership, and how NVIDIA Mission Control bridges hardware topology to schedulers. For Slurm, the topology/block
Table of contents
The core challenge: Rack-scale topology meets AI workload scheduling
Scheduling Multi-Node NVLink workloads with Slurm
IMEX management with Slurm: From rack-level service to per-job isolation
Extending multi-node NVLink support to Kubernetes and NVIDIA Run:ai
How NVIDIA Run:ai simplifies distributed workloads on NVLink domains
Automatic topology detection with Topograph
Learn more about advanced AI operations