vLLM's NIXL-based disaggregated prefill/decode (P/D) has been extended to support hybrid SSM-FA models like NVIDIA Nemotron-H. The core challenge is that FA and Mamba layers store fundamentally different state (KV cache vs. conv/SSM state) with different sizes and layouts, making the existing uniform descriptor scheme insufficient. Three key techniques were introduced: dual descriptor views that register two separate descriptor lists over the same physical memory; physical/logical block bridging to handle the mismatch between logical block abstractions and kernel-required physical block sizes; and a 3-descriptor conv transfer using the DS layout (dim, state_len) that enables heterogeneous tensor-parallel transfers without data reshuffling or staging buffers. Benchmarks on 8x H200 GPUs with Nemotron Super 120B show disaggregated P/D Pareto-dominates co-located serving at high concurrency. Available in vllm>=v0.20.0.
Table of contents
Introduction
Background: The NIXL KV Transfer Workflow
The Challenge: FA and SSM State Are Fundamentally Different
Dual Descriptor Views
Physical vs. Logical Block Sizes
The 3-Descriptors Conv Transfer
Putting It Together: Nemotron-H Example
Performance
Getting Started
Limitations and Future Work
Acknowledgments