vLLM is a high-throughput, memory-efficient inference and serving engine for large language models. Developed at UC Berkeley, it offers state-of-the-art serving throughput, efficient memory management with PagedAttention, continuous batching of incoming requests, and optimized CUDA kernels. vLLM supports several quantization methods and popular open-source models from Hugging Face. It runs on NVIDIA, AMD, and Intel hardware and on cloud platforms such as AWS and Google Cloud. The project is community-driven, with contributions and support from both academia and industry.
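
To give a rough feel for the idea behind the PagedAttention memory management mentioned above, here is a minimal sketch of a paged key-value cache allocator: each sequence receives fixed-size blocks on demand rather than one large contiguous reservation. This is a toy illustration under stated assumptions, not vLLM's actual implementation; the class `PagedKVCache`, its method names, and the block size of 16 tokens are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block-table allocator (illustrative only, not vLLM's code)."""

    num_blocks: int                 # hypothetical size of the block pool
    block_size: int = 16            # tokens per block (assumed value)
    free_blocks: list[int] = field(default_factory=list)
    block_tables: dict[str, list[int]] = field(default_factory=dict)
    seq_lens: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of the given sequence."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

In this sketch a sequence wastes at most one partially filled block, and blocks released by a finished request are immediately reusable by others, which is the property that makes batching many concurrent requests memory-efficient.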

From github.com