vLLM announces Model Runner V2 (MRV2), a ground-up rewrite of its model runner built around three principles: modularity, GPU-native execution, and async-first design. Key changes include a decoupling of persistent batch state from per-step inputs, GPU-side input preparation via Triton kernels, a new Triton-native sampler with Gumbel-Max sampling and more efficient top-k logprobs, and a new ModelState abstraction that isolates model-specific logic. Performance benchmarks show a 56% throughput increase on small models (Qwen3-0.6B on GB200) and 6.3% lower TPOT for speculative decoding. The largest source file shrank from 6,700 lines to under 1,300. MRV2 is experimental in v0.18.0, with some features not yet supported (LoRA, logits processors, linear attention models). No API changes are required; enable it with VLLM_USE_V2_MODEL_RUNNER=1.
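The post does not include sampler code, but the Gumbel-Max trick it names is standard, so here is a minimal PyTorch sketch of the idea (the function name and the guard against log(0) are ours; MRV2's actual sampler is a fused Triton kernel, not this reference implementation):

```python
import torch

def gumbel_max_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample token ids from softmax(logits / temperature) via the Gumbel-Max trick."""
    # argmax(logits / T + g), with g ~ Gumbel(0, 1) drawn i.i.d. per logit,
    # is distributed identically to sampling from softmax(logits / T).
    # Sampling reduces to one noise draw plus an argmax: no explicit softmax
    # and no cumulative sum, which is what makes it easy to fuse into a kernel.
    u = torch.rand_like(logits).clamp_min(torch.finfo(logits.dtype).tiny)
    gumbel = -torch.log(-torch.log(u))
    return torch.argmax(logits / temperature + gumbel, dim=-1)

logits = torch.randn(4, 32000)                    # (batch, vocab)
token_ids = gumbel_max_sample(logits, temperature=0.8)  # shape (4,)
```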
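Since the user-facing API is unchanged, opting in is just the environment variable from the post. A minimal sketch using vLLM's offline `LLM` API (the model is the one from the post's benchmark; setting the variable in-process before engine construction is our assumption, and exporting it in the shell works just as well):

```python
import os

# VLLM_USE_V2_MODEL_RUNNER=1 is the opt-in flag named in the announcement.
# Set it before constructing the engine (or export it in the shell).
os.environ["VLLM_USE_V2_MODEL_RUNNER"] = "1"

from vllm import LLM, SamplingParams

# No code changes are needed: the runner swap is transparent to callers.
llm = LLM(model="Qwen/Qwen3-0.6B")
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```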

Table of contents

- Why Model Runner V2?
- What's New in Model Runner V2?
- Performance
- Limitations and Current Status
- Getting Started
- Acknowledgments