vLLM and NVIDIA achieved significant performance improvements for the gpt-oss-120b model on Blackwell GPUs through FlashInfer integration, torch.compile-based kernel fusion, and runtime optimizations. Together these pushed the throughput-latency Pareto frontier outward, with 38% higher maximum throughput and 13% better interactivity. Key techniques include fusing AllReduce with RMSNorm, async scheduling to hide CPU overhead, stream-interval buffering to reduce network I/O, and FP8 KV-cache support. The gains hold across the entire performance curve, benefiting deployments from high-throughput batch serving to low-latency interactive use.
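
To make the stream-interval idea concrete, here is a minimal sketch of the underlying technique: instead of one network write per decoded token, the server accumulates a few tokens and flushes them together, cutting the number of response packets at a small cost in per-token latency. This is not vLLM's actual implementation; the function and parameter names (`buffered_stream`, `stream_interval`) are illustrative.

```python
import asyncio
from typing import AsyncIterator


async def buffered_stream(
    tokens: AsyncIterator[str], stream_interval: int = 4
) -> AsyncIterator[str]:
    """Group decoded tokens into chunks of `stream_interval` so each
    network write carries several tokens instead of one."""
    buffer: list[str] = []
    async for tok in tokens:
        buffer.append(tok)
        if len(buffer) >= stream_interval:
            yield "".join(buffer)  # one flush covers several tokens
            buffer.clear()
    if buffer:  # flush the tail at end of generation
        yield "".join(buffer)


async def demo() -> None:
    # Stand-in for a model's token stream (hypothetical, for illustration).
    async def fake_decoder() -> AsyncIterator[str]:
        for tok in ["The", " sky", " is", " blue", " today", "."]:
            yield tok

    async for chunk in buffered_stream(fake_decoder(), stream_interval=3):
        print(repr(chunk))  # two writes instead of six


asyncio.run(demo())
```

Under many concurrent streams, a larger interval means the serving frontend spends less time on per-token serialization and socket writes, which is where the network I/O savings described above come from.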

From blog.vllm.ai · 8 min read
Table of Contents
- Introduction
- FlashInfer Integration and torch.compile based fusion
- Runtime Improvements
- Deployment Recipes
- Results
- Next steps
- Acknowledgements
