Disaggregated LLM inference splits the prefill and decode stages into independent services, each with its own GPU resource profile and scaling needs. This post explains how to deploy such architectures on Kubernetes using LeaderWorkerSet (LWS) and NVIDIA Grove, covering gang scheduling, hierarchical gang scheduling, and topology-aware placement via KAI Scheduler. It walks through concrete YAML manifests for prefill workers, decode workers, and routers, then compares the coordination trade-offs between managing separate LWS resources and using Grove's unified PodCliqueSet API. It also covers scaling strategies, including per-role HPA, tensor-parallel group scaling, and cross-role coordination using tools such as the NVIDIA Dynamo planner and llm-d's workload variant autoscaler.
Table of contents
How do aggregated and disaggregated inference differ?
Why scheduling is the key to multi-pod inference performance on Kubernetes
Deploying disaggregated inference
Scaling disaggregated workloads
How inference frameworks coordinate scaling
Getting started