vLLM V1 is a significant upgrade to vLLM's core architecture, designed to improve flexibility, scalability, and performance while requiring zero configuration from the user. Key improvements include a simple, modular codebase, an optimized execution loop, a flexible scheduler, efficient input processing, and stronger support for multimodal and vision-language models. V1 re-architects core components such as the scheduler and API server, and introduces features such as piecewise CUDA graphs, FlashAttention 3, and prefix caching with near-zero overhead. Performance benchmarks show up to 1.7x higher throughput compared to V0.
Table of contents
Why vLLM V1?
What’s New in vLLM V1?
Performance
Limitations & Future Work
How to Get Started
Acknowledgment