vLLM V1 is a major upgrade to vLLM's core architecture, designed to improve flexibility, scalability, and performance with zero-configuration defaults. Key improvements include a simpler, modular codebase, an optimized execution loop, a flexible scheduler, efficient input processing, and better support for multimodal and vision-language models. V1 re-architects core components such as the scheduler and API server, and adds features such as piecewise CUDA graphs, FlashAttention 3, and near-zero-overhead prefix caching. Benchmarks show up to 1.7x higher throughput than V0.

11 min read. From blog.vllm.ai
Table of contents
Why vLLM V1?
What’s New in vLLM V1?
Performance
Limitations & Future Work
How to Get Started
Acknowledgment
