vLLM and NVIDIA achieved significant performance improvements for the gpt-oss-120b model on Blackwell GPUs through FlashInfer integration, torch.compile-based kernel fusion, and runtime optimizations. Together, these optimizations pushed the Pareto frontier with 38% higher maximum throughput and 13% better interactivity.
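As a rough illustration of how such a setup might be deployed, here is a hedged sketch of serving the model with vLLM while selecting the FlashInfer attention backend via the `VLLM_ATTENTION_BACKEND` environment variable; the exact flags and recipe used in the post are covered in its Deployment Recipes section, and the tensor-parallel size here is an assumption for illustration only.

```shell
# Sketch: serve gpt-oss-120b with vLLM using the FlashInfer attention backend.
# The tensor-parallel degree (--tensor-parallel-size 4) is an illustrative
# assumption, not the recipe from the post.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 4
```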

From blog.vllm.ai
Table of Contents

- Introduction
- FlashInfer Integration and torch.compile based fusion
- Runtime Improvements
- Deployment Recipes
- Results
- Next steps
- Acknowledgements
