Disaggregated LLM inference splits the prefill and decode stages into independent services, each with different GPU resource profiles and scaling needs. This post explains how to deploy such architectures on Kubernetes using LeaderWorkerSet (LWS) and NVIDIA Grove, covering gang scheduling, hierarchical gang scheduling, and the scaling patterns these workloads require.
Table of contents

- How do aggregated and disaggregated inference differ?
- Why scheduling is the key to multi-pod inference performance on Kubernetes
- Deploying disaggregated inference
- Scaling disaggregated workloads
- How inference frameworks coordinate scaling
- Getting started
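Before diving in, it may help to see the shape of an LWS deployment. The sketch below shows a minimal LeaderWorkerSet for a decode service, where each replica is a leader pod plus a group of worker pods that LWS creates and scales as a unit. The names, image, and sizes here are illustrative assumptions, not taken from the post; note also that LWS groups pods together, while all-or-nothing placement of a group typically relies on a gang-aware scheduler, as discussed later.

```yaml
# Minimal sketch of a LeaderWorkerSet for a decode service.
# Names, image, and sizes are hypothetical.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: decode
spec:
  replicas: 2            # two decode groups; scaling adds or removes whole groups
  leaderWorkerTemplate:
    size: 4              # pods per group: 1 leader + 3 workers
    leaderTemplate:
      spec:
        containers:
        - name: decode-leader
          image: example.com/llm-decode:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"
    workerTemplate:
      spec:
        containers:
        - name: decode-worker
          image: example.com/llm-decode:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"
```

A prefill service would be a second LeaderWorkerSet with its own group size and replica count, which is what lets the two stages scale independently.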