vLLM achieves 26.2K prefill tokens per GPU-second (TPGS) and 10.1K decode TPGS on NVIDIA GB200 for DeepSeek-style MoE models, a 3-5x improvement over H200. Key optimizations include lower-precision operations (NVFP4 and FP8 GEMM), kernel fusions (RoPE+Quant+Q write, Concat K), weight offloading v2 with asynchronous prefetching, and minimized chunking overheads. The gains leverage GB200's higher memory bandwidth (8 TB/s), stronger FP4/FP8 tensor core throughput, and the NVLink-C2C interconnect. The benchmarked deployment uses 4 prefill instances (2 GB200 each) and 1 decode instance (8 GB200) with data parallelism and expert parallelism, on a workload of 2K input and 2K output tokens.
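As a rough sanity check, the per-GPU numbers and instance counts above can be combined into aggregate cluster throughput. The sketch below is our own back-of-the-envelope arithmetic, not from the post, and it deliberately ignores load balancing and KV-transfer overheads between the prefill and decode pools:

```python
# Aggregate-throughput estimate for the deployment described above.
# Per-GPU rates (26.2K prefill TPGS, 10.1K decode TPGS) come from the post;
# the simple multiplication is an illustrative assumption.

PREFILL_TPGS = 26_200   # prefill tokens per GPU-second on GB200
DECODE_TPGS = 10_100    # decode tokens per GPU-second on GB200

prefill_gpus = 4 * 2    # 4 prefill instances, 2 GB200 GPUs each
decode_gpus = 1 * 8     # 1 decode instance, 8 GB200 GPUs

prefill_tps = PREFILL_TPGS * prefill_gpus   # ~209.6K prefill tokens/s
decode_tps = DECODE_TPGS * decode_gpus      # ~80.8K decode tokens/s

# With the stated 2K-in/2K-out workload, each request consumes equal
# prefill and decode token budgets, so the slower stage (decode here)
# bounds end-to-end request throughput.
requests_per_sec = min(prefill_tps, decode_tps) / 2_000

print(f"prefill: {prefill_tps:,.0f} tok/s, decode: {decode_tps:,.0f} tok/s")
print(f"upper-bound request rate: {requests_per_sec:,.1f} req/s")
```

Under these assumptions the decode pool is the bottleneck (~80.8K tokens/s against ~209.6K for prefill), which is consistent with the deployment dedicating a full 8-GPU instance to decode.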

Table of contents
- Lower-Precision Operations
- Kernel Fusion
- Scaling Down Prefill
- Minimize Chunking Overheads
