If you’re an infrastructure or MLOps engineer at a large company, you know the drill. The ML team comes to you with requirements that change weekly. They need GPUs yesterday, but the budget was set six months ago. They want to use the latest framework, but it breaks your carefully crafted Kubernetes deployments. They need to comply with data locality requirements while also optimizing for cost.
Sound familiar? You’re not alone, and there’s a better way.

SkyPilot

AI neoclouds like CoreWeave, Lambda Labs, and Nebius have democratized GPU access with specialized infrastructure featuring InfiniBand networking and costs 30-50% lower than those of hyperscalers. However, Kubernetes was designed around stateless services and remains poorly suited to ML workloads, creating friction through steep learning curves, gang scheduling problems, and single-cluster limitations. While solutions like Slurm-on-Kubernetes attempt to bridge this gap, they add complexity without addressing the fundamental mismatch between traditional orchestration tools and modern AI workflows.

The Evolution of AI Job Orchestration. Part 1: Running AI jobs on GPU Neoclouds

The GPU Gold Rush and the Rise of Neoclouds

Why Kubernetes Still Fails Your ML Team

Trying to Bridge the Gap: AI Schedulers and Slurm-on-K8s

The Missing Piece: Making Neocloud Power Usable