Nano-vLLM is a lightweight implementation of vLLM built from scratch in approximately 1,200 lines of Python code. It offers inference speeds comparable to the original vLLM while providing optimizations such as prefix caching, tensor parallelism, and CUDA graph support. Benchmark results show it achieving 1,434 tokens/s.

From github.com
Table of contents

- Key Features
- Installation
- Quick Start
- Benchmark
- Star History
