Disaggregated Serving for Hybrid SSM Models in vLLM

vLLM's NIXL-based disaggregated prefill/decode (P/D) has been extended to support hybrid SSM-FA models like NVIDIA Nemotron-H. The core challenge is that FA and Mamba layers store fundamentally different state (KV cache vs. conv/SSM state) with different sizes and layouts, making the existing uniform descriptor scheme insufficient. Three key techniques were introduced: dual descriptor views that register two separate descriptor lists over the same physical memory; physical/logical block bridging to handle the mismatch between logical block abstractions and kernel-required physical block sizes; and a 3-descriptor conv transfer using the DS layout (dim, state_len) that enables heterogeneous tensor-parallel transfers without data reshuffling or staging buffers. Benchmarks on 8x H200 GPUs with Nemotron Super 120B show disaggregated P/D Pareto-dominates co-located serving at high concurrency. Available in vllm>=v0.20.0.
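The dual-descriptor idea can be pictured with a small sketch: two lists of (address, length) descriptors built over the same block-allocated buffer, one sliced the way an FA layer would read it and one sliced the way a Mamba layer would. Everything here is an illustrative assumption rather than vLLM or NIXL code: the `Desc` type, the `fa_view` and `mamba_view` helpers, the K/V half-split, and the conv-then-SSM packing are hypothetical, and no NIXL calls are made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Desc:
    """Hypothetical transfer descriptor: a byte range in registered memory."""
    addr: int
    length: int


def fa_view(base: int, num_blocks: int, page_bytes: int) -> list[Desc]:
    """Describe each block as a K half followed by a V half (assumed layout)."""
    half = page_bytes // 2
    descs = []
    for b in range(num_blocks):
        start = base + b * page_bytes
        descs.append(Desc(start, half))         # K region of block b
        descs.append(Desc(start + half, half))  # V region of block b
    return descs


def mamba_view(base: int, num_blocks: int, page_bytes: int,
               conv_bytes: int, ssm_bytes: int) -> list[Desc]:
    """Describe each block as conv state followed by SSM state (assumed layout)."""
    if conv_bytes + ssm_bytes > page_bytes:
        raise ValueError("conv + SSM state does not fit in one block")
    descs = []
    for b in range(num_blocks):
        start = base + b * page_bytes
        descs.append(Desc(start, conv_bytes))              # conv state
        descs.append(Desc(start + conv_bytes, ssm_bytes))  # SSM state
    return descs


# Two views over the same physical region: same base, different slicing.
BASE, BLOCKS, PAGE = 0x1000, 2, 4096
fa_descs = fa_view(BASE, BLOCKS, PAGE)
mamba_descs = mamba_view(BASE, BLOCKS, PAGE, conv_bytes=512, ssm_bytes=3000)
```

In this sketch both lists start at the same base address and never leave the shared buffer. A transfer for an FA layer would index into `fa_descs`, while a transfer for a Mamba layer would index into `mamba_descs`, so neither layer type needs its own copy of the memory.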

#ai-inference

#vllm

May 10 • 14m read time • From vllm.ai

Table of contents

Introduction
Background: The NIXL KV Transfer Workflow
The Challenge: FA and SSM State Are Fundamentally Different
Dual Descriptor Views
Physical vs. Logical Block Sizes
The 3-Descriptors Conv Transfer
Putting It Together: Nemotron-H Example
Performance
Getting Started
Limitations and Future Work
Acknowledgments
