vLLM and NVIDIA achieved significant performance improvements for the gpt-oss-120b model on Blackwell GPUs through FlashInfer integration, torch.compile-based kernel fusion, and runtime optimizations. Together these pushed the throughput-latency Pareto frontier outward, with 38% higher maximum throughput and 13% better interactivity. Key techniques include fusing AllReduce with RMSNorm, async scheduling to hide CPU overhead, stream-interval buffering to reduce network I/O, and FP8 KV-cache support. The gains hold across the entire performance curve, benefiting deployments from high-throughput batch serving to low-latency interactive use.
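
To make the stream-interval idea concrete, here is a minimal sketch of the underlying technique: instead of one network write per decoded token, the server accumulates a few tokens and flushes them together, cutting the number of response packets at a small cost in per-token latency. This is not vLLM's actual implementation; the function and parameter names (`buffered_stream`, `stream_interval`) are illustrative.

```python
import asyncio
from typing import AsyncIterator


async def buffered_stream(
    tokens: AsyncIterator[str], stream_interval: int = 4
) -> AsyncIterator[str]:
    """Group decoded tokens into chunks of `stream_interval` so each
    network write carries several tokens instead of one."""
    buffer: list[str] = []
    async for tok in tokens:
        buffer.append(tok)
        if len(buffer) >= stream_interval:
            yield "".join(buffer)  # one flush covers several tokens
            buffer.clear()
    if buffer:  # flush the tail at end of generation
        yield "".join(buffer)


async def demo() -> None:
    # Stand-in for a model's token stream (hypothetical, for illustration).
    async def fake_decoder() -> AsyncIterator[str]:
        for tok in ["The", " sky", " is", " blue", " today", "."]:
            yield tok

    async for chunk in buffered_stream(fake_decoder(), stream_interval=3):
        print(repr(chunk))  # two writes instead of six


asyncio.run(demo())
```

Under many concurrent streams, a larger interval means the serving frontend spends less time on per-token serialization and socket writes, which is where the network I/O savings described above come from.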

From blog.vllm.ai · 8 min read
Table of Contents
- Introduction
- FlashInfer Integration and torch.compile based fusion
- Runtime Improvements
- Deployment Recipes
- Results
- Next steps
- Acknowledgements
