vLLM achieves 26.2K prefill tokens per GPU-second (TPGS) and 10.1K decode TPGS on NVIDIA GB200 for DeepSeek-style MoE models, a 3-5x improvement over H200. Key optimizations include lower-precision operations (NVFP4 and FP8 GEMM), kernel fusions (RoPE+Quant+Q write, Concat K), and weight offloading v2, as sketched below.
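To make the kernel-fusion item concrete, here is a minimal, unfused PyTorch sketch of the three steps that a RoPE+Quant+Q-write fusion combines. The function name, tensor shapes, and per-tensor FP8 scale are illustrative assumptions, not vLLM's actual kernel or API; the real fused CUDA kernel performs all three steps in one pass.

```python
# Unfused reference for the three steps a "RoPE+Quant+Q write" kernel fuses.
# Hypothetical names/shapes for illustration; not vLLM's actual kernel or API.
import torch


def rope_quant_q_write_reference(q: torch.Tensor,
                                 cos: torch.Tensor,
                                 sin: torch.Tensor,
                                 q_buffer: torch.Tensor,
                                 scale: float) -> None:
    """Unfused reference: RoPE, then FP8 quantization, then the Q write."""
    # 1. Rotary position embedding: rotate the two halves of each head dim.
    half = q.shape[-1] // 2
    q1, q2 = q[..., :half], q[..., half:]
    rotated = torch.cat((q1 * cos - q2 * sin, q1 * sin + q2 * cos), dim=-1)
    # 2. Quantize to FP8 (e4m3) with a per-tensor scale (assumed here).
    q_fp8 = (rotated / scale).to(torch.float8_e4m3fn)
    # 3. Write the result into the preallocated query buffer.
    q_buffer.copy_(q_fp8)


if __name__ == "__main__":
    tokens, heads, dim = 4, 8, 64
    q = torch.randn(tokens, heads, dim)
    # Standard RoPE angle table, broadcast over heads.
    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(tokens).float(), inv_freq)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    q_buffer = torch.empty(tokens, heads, dim, dtype=torch.float8_e4m3fn)
    rope_quant_q_write_reference(q, cos, sin, q_buffer, scale=1.0)
```

Each unfused step reads and writes the full query tensor, so fusing them removes two HBM round-trips per token, which is where this kind of fusion saves time on memory-bound paths.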

Source: blog.vllm.ai
Table of contents
- Lower-Precision Operations
- Kernel Fusion
- Scaling Down Prefill
- Minimize Chunking Overheads
