vLLM and NVIDIA achieved significant performance improvements for the gpt-oss-120b model on Blackwell GPUs through FlashInfer integration, torch.compile-based kernel fusion, and runtime optimizations. These optimizations pushed the Pareto frontier, delivering 38% higher maximum throughput and 13% better interactivity.
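As a rough sketch of how such optimizations are typically switched on, vLLM exposes the FlashInfer attention backend via an environment variable and torch.compile behavior via a server flag; the specific values below are illustrative assumptions, not settings taken from the post, and flag spellings may differ across vLLM versions.

```shell
# Sketch only: select the FlashInfer attention backend before serving.
export VLLM_ATTENTION_BACKEND=FLASHINFER

# Serve gpt-oss-120b; --compilation-config controls torch.compile-based
# compilation/fusion (the JSON value here is illustrative).
vllm serve openai/gpt-oss-120b \
  --compilation-config '{"level": 3}'
```

This is a configuration fragment, not a benchmarked recipe; the post's own deployment recipes should be preferred where they differ.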
8 min read · From blog.vllm.ai
Table of contents
- Introduction
- FlashInfer Integration and torch.compile based fusion
- Runtime Improvements
- Deployment Recipes
- Results
- Next steps
- Acknowledgements