Disaggregated LLM inference splits the prefill and decode stages into independent services, each with its own GPU resource profile and scaling needs. This post explains how to deploy such architectures on Kubernetes using LeaderWorkerSet (LWS) and NVIDIA Grove, covering gang scheduling, hierarchical gang scheduling, and scaling of disaggregated workloads.
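As a concrete reference point, the sketch below shows what a LeaderWorkerSet for one stage (here, prefill) might look like, assuming the `leaderworkerset.x-k8s.io/v1` API; the resource name, image, replica count, and group size are placeholders, not values from the post.

```yaml
# Hypothetical sketch: one LeaderWorkerSet per inference stage.
# Image names, sizes, and replica counts are illustrative placeholders.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill
spec:
  replicas: 2                # number of prefill groups (each group is a gang)
  leaderWorkerTemplate:
    size: 4                  # pods per group, scheduled together as a unit
    leaderTemplate:
      spec:
        containers:
        - name: prefill-leader
          image: my-inference-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
    workerTemplate:
      spec:
        containers:
        - name: prefill-worker
          image: my-inference-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

Under this pattern, the decode stage would be a second LeaderWorkerSet with its own `size` and `replicas`, which is what lets the two stages scale independently.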

14 min read · From developer.nvidia.com
Table of contents
How do aggregated and disaggregated inference differ?
Why scheduling is the key to multi-pod inference performance on Kubernetes
Deploying disaggregated inference
Scaling disaggregated workloads
How inference frameworks coordinate scaling
Getting started
