torch.compile is PyTorch's just-in-time compiler that automatically generates optimized kernels for faster model execution without manual optimization. vLLM integrates torch.compile by default, using compilation caching, dynamic batch size support, and piecewise CUDA Graphs to improve LLM inference performance. The integration includes custom compiler passes for operations like SiLU+quantization fusion and sequence parallelism, achieving performance improvements of 8-15% in various scenarios. Future work focuses on improving stability, reducing startup times, and enhancing custom pass mechanisms.
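As a minimal sketch of the idea (a toy model, not vLLM's actual integration), wrapping a module with `torch.compile` makes PyTorch trace it on the first call and generate optimized code for subsequent calls; `dynamic=True` asks the compiler to handle varying input (batch) sizes without recompiling. The `backend="eager"` choice here is only for portability of the sketch; in practice the default `"inductor"` backend does the kernel generation.

```python
import torch

class MLP(torch.nn.Module):
    """Toy two-layer MLP with a SiLU activation, the kind of
    elementwise op torch.compile can fuse with its neighbors."""
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        return self.fc2(torch.nn.functional.silu(self.fc1(x)))

model = MLP()
# dynamic=True: allow varying batch sizes without retracing.
# backend="eager" keeps this sketch runnable anywhere; real deployments
# use the default "inductor" backend for actual kernel generation.
compiled = torch.compile(model, dynamic=True, backend="eager")

out = compiled(torch.randn(4, 16))   # first call triggers tracing
out2 = compiled(torch.randn(7, 16))  # different batch size, same compiled fn
```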

13 min read · From blog.vllm.ai
Table of contents

- Introduction
- What Is torch.compile?
- Why Use torch.compile?
- How torch.compile Works
- vLLM Integration
- Custom Compiler Passes in vLLM
- Future Work
- Conclusion
