vLLM achieves 26.2K prefill tokens per GPU-second (TPGS) and 10.1K decode TPGS on NVIDIA GB200 for DeepSeek-style MoE models, a 3-5x improvement over H200. Key optimizations include lower-precision operations (NVFP4 and FP8 GEMM), kernel fusion (RoPE + quantization + Q write, K concatenation), and weight offloading v2.
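The lower-precision path hinges on dynamic scaling: pick a per-tensor scale so the largest activation maps to the format's maximum (448 for FP8 E4M3), quantize, then divide the scale back out after the GEMM. Below is a minimal Python sketch of that pattern; the function names are illustrative, and the rounding models only the 3-bit mantissa of E4M3 (subnormals and the exponent floor are ignored), not vLLM's actual kernels.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Round x to a value on the FP8 E4M3 grid (normals only, simplified)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    x = abs(x)
    _, e = math.frexp(x)          # x = m * 2**e with m in [0.5, 1)
    step = 2.0 ** (e - 4)         # 3 mantissa bits -> grid spacing 2**(e-4)
    q = round(x / step) * step
    return sign * min(q, E4M3_MAX)


def quantize_dequantize(values: list[float]) -> tuple[list[float], float]:
    """Per-tensor dynamic scaling: map max |value| onto E4M3_MAX."""
    max_abs = max(abs(v) for v in values)
    scale = E4M3_MAX / max_abs if max_abs > 0 else 1.0
    quantized = [round_to_e4m3(v * scale) for v in values]
    # In a real FP8 GEMM the scale is folded into the epilogue; here we
    # just dequantize to inspect the round-trip error.
    dequantized = [q / scale for q in quantized]
    return dequantized, scale


activations = [0.02, -1.5, 3.1]
recovered, scale = quantize_dequantize(activations)
for orig, deq in zip(activations, recovered):
    rel_err = abs(orig - deq) / abs(orig)
    print(f"{orig:+.4f} -> {deq:+.6f}  (rel err {rel_err:.3%})")
```

With 3 mantissa bits the worst-case relative rounding error for in-range values is about 6.25%, which is why dynamic per-tensor (or finer-grained) scaling matters: without it, small activations would fall into the poorly resolved bottom of the format's range.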
Source: blog.vllm.ai (9-minute read)
Table of contents

- Lower-Precision Operations
- Kernel Fusion
- Scaling Down Prefill
- Minimize Chunking Overheads