Deploying Disaggregated LLM Inference Workloads on Kubernetes

Disaggregated LLM inference splits the prefill and decode stages into independent services, each with its own GPU resource profile and scaling needs. This post explains how to deploy such architectures on Kubernetes using LeaderWorkerSet (LWS) and NVIDIA Grove, covering gang scheduling, hierarchical gang scheduling, and topology-aware placement via KAI Scheduler. It walks through concrete YAML manifests for prefill workers, decode workers, and routers, then compares the coordination trade-offs of managing separate LWS resources against using Grove's unified PodCliqueSet API. It also covers scaling strategies, including per-role HPA, Tensor Parallel group scaling, and cross-role coordination using tools like NVIDIA Dynamo planner and llm-d's workload variant autoscaler.

#kubernetes

Mar 23 • 14m read time • From developer.nvidia.com

Table of contents

How do aggregated and disaggregated inference differ?
Why scheduling is the key to multi-pod inference performance on Kubernetes
Deploying disaggregated inference
Scaling disaggregated workloads
How inference frameworks coordinate scaling
Getting started
