NVIDIA's MLPerf Inference v6.0 results show Blackwell Ultra GPUs achieving record throughput across the broadest range of models and scenarios. Key highlights include a 2.77x performance improvement on the DeepSeek-R1 server scenario for GB300 NVL72 compared with six months ago, driven by TensorRT-LLM software optimizations including disaggregated serving, Wide Expert Parallel, and Multi-Token Prediction. New benchmarks added this round are DeepSeek-R1 Interactive, Qwen3-VL-235B (the first multimodal model in MLPerf), GPT-OSS-120B, WAN-2.2 text-to-video, and DLRMv3. NVIDIA was the only platform to submit results on all newly added models. At scale, four GB300 NVL72 systems, totaling 288 Blackwell Ultra GPUs interconnected via Quantum-X800 InfiniBand, achieved over 2.4 million tokens/sec on DeepSeek-R1 offline. NVIDIA's cumulative MLPerf wins since 2018 now stand at 291, 9x the total of all other submitters combined.
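As a rough sanity check on the scale-out figure, the arithmetic can be sketched in a few lines of Python. The per-GPU number below is derived here from the stated totals, not a reported MLPerf result, and it is only a lower bound because the source says "over" 2.4 million tokens/sec. The function name `per_gpu_throughput` is a hypothetical helper introduced for illustration.

```python
# Stated in the summary: four GB300 NVL72 systems, 288 GPUs in total,
# and more than 2.4 million tokens/sec on DeepSeek-R1 offline.
GPUS_PER_NVL72 = 72            # 288 GPUs / 4 systems
NUM_SYSTEMS = 4
TOTAL_TOKENS_PER_SEC = 2_400_000  # lower bound ("over 2.4 million")


def per_gpu_throughput(total_tokens_per_sec: float,
                       num_systems: int,
                       gpus_per_system: int = GPUS_PER_NVL72) -> float:
    """Average tokens/sec per GPU, assuming throughput is spread evenly."""
    total_gpus = num_systems * gpus_per_system
    return total_tokens_per_sec / total_gpus


total_gpus = NUM_SYSTEMS * GPUS_PER_NVL72
floor_per_gpu = per_gpu_throughput(TOTAL_TOKENS_PER_SEC, NUM_SYSTEMS)

print(f"Total GPUs: {total_gpus}")
print(f"Per-GPU throughput (lower bound): {floor_per_gpu:,.0f} tokens/sec")
```

Under these assumptions, the aggregate figure corresponds to at least roughly 8,300 tokens/sec per GPU on average; the actual per-GPU rate depends on how much the measured total exceeds 2.4 million.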
Table of contents
New benchmarks, new performance records
NVIDIA TensorRT-LLM software updates unlock up to 2.7X performance gains on the same Blackwell Ultra GPUs
Scale-out inference with NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second
Looking ahead to MLPerf Endpoints