NVIDIA Extreme Co-Design Delivers New MLPerf Inference Records

NVIDIA's MLPerf Inference v6.0 results show Blackwell Ultra GPUs achieving record throughput across the broadest range of models and scenarios. Key highlights include a 2.77x performance improvement for GB300 NVL72 in the DeepSeek-R1 server scenario compared with results from six months earlier, driven by TensorRT-LLM software optimizations including disaggregated serving, Wide Expert Parallel, and Multi-Token Prediction. Benchmarks added this round include DeepSeek-R1 Interactive, Qwen3-VL-235B (the first multimodal model in MLPerf), GPT-OSS-120B, WAN-2.2 text-to-video, and DLRMv3. NVIDIA was the only platform to submit results on all of the newly added models. At scale, four GB300 NVL72 systems with a total of 288 Blackwell Ultra GPUs, interconnected via Quantum-X800 InfiniBand, achieved over 2.4 million tokens/sec on DeepSeek-R1 offline. NVIDIA's cumulative MLPerf wins since 2018 now stand at 291, 9x the total of all other submitters combined.
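As a rough sanity check on the scale-out figures in the summary above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are hypothetical, the inputs are the numbers quoted in the summary, and because the summary says "over 2.4 million tokens/sec", the per-GPU value is an approximate lower bound rather than a reported result.

```python
# Back-of-the-envelope check of the scale-out figures quoted in the summary.
# All inputs come from the summary text; the helper names are illustrative only.

SYSTEMS = 4                  # GB300 NVL72 systems
TOTAL_GPUS = 288             # Blackwell Ultra GPUs across all systems
TOKENS_PER_SEC = 2_400_000   # "over 2.4 million", so treat this as a lower bound


def gpus_per_system(total_gpus: int, systems: int) -> int:
    """Return the GPU count per system, assuming an even split."""
    if systems <= 0 or total_gpus % systems != 0:
        raise ValueError("GPUs must divide evenly across a positive number of systems")
    return total_gpus // systems


def tokens_per_gpu(tokens_per_sec: float, total_gpus: int) -> float:
    """Return the average throughput per GPU in tokens/sec."""
    if total_gpus <= 0:
        raise ValueError("total_gpus must be positive")
    return tokens_per_sec / total_gpus


if __name__ == "__main__":
    print(gpus_per_system(TOTAL_GPUS, SYSTEMS))            # 72
    print(round(tokens_per_gpu(TOKENS_PER_SEC, TOTAL_GPUS)))  # 8333
```

The even split of 72 GPUs per system is consistent with the "NVL72" name, and the quoted aggregate works out to at least roughly 8,300 tokens/sec per GPU on average.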

#llm

#nvidia

#ai-inference

Apr 01 • 10m read time • From developer.nvidia.com

Table of contents

New benchmarks, new performance records
NVIDIA TensorRT-LLM software updates unlock up to 2.7X performance gains on the same Blackwell Ultra GPUs
Scale-out inference with NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second
Looking ahead to MLPerf Endpoints
