Slinky is an open-source project by SchedMD (now NVIDIA) that runs full Slurm clusters on Kubernetes infrastructure using a slurm-operator. It maps Slurm daemons (slurmctld, slurmdbd, slurmd, slurmrestd) to Kubernetes CRDs and pods, enabling high availability, autoscaling via HPA, and bidirectional state synchronization between Kubernetes and Slurm. Key integrations include the NVIDIA GPU Operator for automated GPU management, DCGM Exporter for per-job GPU metrics, and ComputeDomains for multinode NVLink connectivity on GB200 hardware. NVIDIA runs this in production on clusters with 8,000+ GPUs for large-scale LLM training, achieving the same NCCL benchmark performance as bare-metal Slurm. The recently released v1.1.0 adds dynamic topology support, DaemonSet-style worker pod scaling, and automatic remediation of unregistered worker pods. The main constraint is that the operator currently assumes one worker pod per node, which makes it best suited to multinode job workloads.
Table of contents
How does Slinky slurm-operator work?
How to deploy Slinky slurm-operator
What is the benefit of running Slurm on Kubernetes?
Slinky slurm-operator at scale
Slinky slurm-operator v1.1.0 release highlights
Get started with the Slinky slurm-operator