DeepSeek-V3.2 on GB300: Performance Breakthrough


DeepSeek-V3.2 and DeepSeek-R1 achieve significant performance gains on NVIDIA's GB300 (Blackwell Ultra) GPUs using FP4 quantization. DeepSeek-V3.2 reaches 7360 tokens/GPU/second (TGS) in prefill-only scenarios with TP2 parallelization, while DeepSeek-R1 achieves 22476 TGS. Compared to Hopper H200, Blackwell shows an 8x improvement in prefill and a 10-20x improvement in mixed-context scenarios. The article provides detailed benchmarks across parallelization strategies (TP2 vs. EP2), quantization formats (FP4 vs. FP8), and deployment patterns, including disaggregated prefill/decode architectures. DeepSeek-V3.2's Sparse MLA introduces overhead that limits its prefill performance relative to R1, indicating room for optimization.
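To make the throughput metric concrete, here is a minimal Python sketch of how a tokens/GPU/second (TGS) figure can be computed and how the two quoted numbers compare. The function name and the run values (token count, duration, GPU count) are hypothetical and chosen only for illustration. The comparison also assumes both quoted figures come from the same prefill-only setup, which the summary does not state explicitly.

```python
def tokens_per_gpu_per_second(total_tokens: int, elapsed_s: float, num_gpus: int) -> float:
    """Throughput normalized by GPU count (the TGS metric)."""
    if elapsed_s <= 0 or num_gpus <= 0:
        raise ValueError("elapsed_s and num_gpus must be positive")
    return total_tokens / (elapsed_s * num_gpus)


# Hypothetical run: 147,200 prompt tokens processed in 10 s on 2 GPUs (e.g. TP2).
example_tgs = tokens_per_gpu_per_second(147_200, 10.0, 2)

# Figures quoted in the summary above.
V32_TGS = 7360   # DeepSeek-V3.2, prefill-only, TP2
R1_TGS = 22476   # DeepSeek-R1 (assumed here to be the comparable prefill figure)

# How far V3.2 trails R1 under that assumption.
r1_over_v32 = R1_TGS / V32_TGS

print(f"example TGS: {example_tgs:.0f}")
print(f"R1 / V3.2 throughput ratio: {r1_over_v32:.2f}")
```

Under that assumption, R1's quoted throughput is roughly three times that of V3.2, which is the gap the summary attributes to Sparse MLA overhead.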

11 min read
From blog.vllm.ai
Table of contents
Summary
Benchmark Setup
Basic Recipe with FP4 Weight Quantization
Performance Boost by Blackwell Architecture
Deployment Tuning
DeepSeek V3.2 - Still Way To Go
Disaggregated Prefill (for DeepSeek-V3.2)
Acknowledgements
