vLLM

DeepSeek-V3.2 and DeepSeek-R1 achieve significant performance gains on NVIDIA's GB300 (Blackwell Ultra) GPUs using FP4 quantization. DeepSeek-V3.2 reaches 7360 tokens/GPU/second (TGS) in prefill-only scenarios with TP2 (tensor parallelism across two GPUs), while DeepSeek-R1 achieves 22476 TGS. Compared to the Hopper H200, Blackwell shows an 8x improvement in prefill and a 10-20x improvement in mixed-context scenarios. The article provides detailed benchmarks across parallelization strategies (TP2 vs EP2, i.e. tensor vs expert parallelism), quantization formats (FP4 vs FP8), and deployment patterns, including disaggregated prefill/decode architectures. DeepSeek-V3.2's Sparse MLA introduces overhead that limits its prefill performance relative to R1, indicating room for optimization.
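To put the quoted throughput figures side by side, the following minimal Python sketch works through the arithmetic. It assumes that both numbers are per-GPU throughput measured under comparable prefill conditions, which the summary implies but does not state outright. The helper names (`aggregate_throughput`, `throughput_ratio`) are hypothetical and are not part of vLLM.

```python
# Figures quoted in the summary, in tokens per GPU per second (TGS).
V32_PREFILL_TGS = 7360   # DeepSeek-V3.2, prefill-only, TP2
R1_TGS = 22476           # DeepSeek-R1 (assumed comparable prefill setup)
TP_SIZE = 2              # TP2: the model is sharded across two GPUs


def aggregate_throughput(tgs: float, num_gpus: int) -> float:
    """Total tokens/second across all GPUs, given per-GPU throughput."""
    return tgs * num_gpus


def throughput_ratio(a: float, b: float) -> float:
    """How many times faster a is than b."""
    return a / b


if __name__ == "__main__":
    total = aggregate_throughput(V32_PREFILL_TGS, TP_SIZE)
    gap = throughput_ratio(R1_TGS, V32_PREFILL_TGS)
    print(f"V3.2 aggregate over TP{TP_SIZE}: {total} tokens/s")
    print(f"R1 vs V3.2 per-GPU ratio: {gap:.2f}x")
```

Under these assumptions, R1's per-GPU figure is roughly three times that of V3.2, which is the gap the summary attributes to Sparse MLA overhead.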

DeepSeek-V3.2 on GB300: Performance Breakthrough

Basic Recipe with FP4 Weight Quantization

Performance Boost by Blackwell Architecture

Disaggregated Prefill (for DeepSeek-V3.2)