vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM V1 is a major upgrade to vLLM's core architecture, designed to improve flexibility, scalability, and performance with a zero-configuration design in which features and optimizations are enabled by default. Key improvements include a simple, modular codebase, an optimized execution loop, a flexible scheduler, efficient input processing, and stronger support for multimodal and vision-language models. V1 re-architects core components such as the scheduler and API server, and introduces features such as piecewise CUDA graphs, FlashAttention 3, and near-zero-overhead prefix caching. Performance benchmarks show up to 1.7x higher throughput than V0.

vLLM V1: A Major Upgrade to vLLM's Core Architecture