torch.compile is PyTorch's just-in-time compiler that automatically generates optimized kernels for faster model execution without manual optimization. vLLM integrates torch.compile by default, using compilation caching, dynamic batch size support, and piecewise CUDA Graphs to improve LLM inference performance. The integration includes custom compiler passes for operations like SiLU+quantization fusion and sequence parallelism, achieving performance improvements of 8-15% in various scenarios. Future work focuses on improving stability, reducing startup times, and enhancing custom pass mechanisms.
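As a minimal sketch of the idea (a toy model, not vLLM's actual integration), wrapping a module with `torch.compile` makes PyTorch trace it on the first call and generate optimized code for subsequent calls; `dynamic=True` asks the compiler to handle varying input (batch) sizes without recompiling. The `backend="eager"` choice here is only for portability of the sketch; in practice the default `"inductor"` backend does the kernel generation.

```python
import torch

class MLP(torch.nn.Module):
    """Toy two-layer MLP with a SiLU activation, the kind of
    elementwise op torch.compile can fuse with its neighbors."""
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        return self.fc2(torch.nn.functional.silu(self.fc1(x)))

model = MLP()
# dynamic=True: allow varying batch sizes without retracing.
# backend="eager" keeps this sketch runnable anywhere; real deployments
# use the default "inductor" backend for actual kernel generation.
compiled = torch.compile(model, dynamic=True, backend="eager")

out = compiled(torch.randn(4, 16))   # first call triggers tracing
out2 = compiled(torch.randn(7, 16))  # different batch size, same compiled fn
```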

13 min read · From blog.vllm.ai
Table of contents

- Introduction
- What Is torch.compile?
- Why Use torch.compile?
- How torch.compile Works
- vLLM Integration
- Custom Compiler Passes in vLLM
- Future Work
- Conclusion
