vLLM announces Model Runner V2 (MRV2), a ground-up rewrite of its model runner built around three principles: modularity, GPU-native execution, and async-first design. Key changes include a decoupling of persistent batch state from per-step inputs, GPU-side input preparation via Triton kernels, a new Triton-native sampler with Gumbel-Max sampling and more efficient top-k logprobs, and a new ModelState abstraction that isolates model-specific logic. Performance benchmarks show a 56% throughput increase on small models (Qwen3-0.6B on GB200) and 6.3% lower TPOT for speculative decoding. The largest source file shrank from 6,700 lines to under 1,300. MRV2 is experimental in v0.18.0, with some features not yet supported (LoRA, logits processors, linear attention models). No API changes are required; enable it with VLLM_USE_V2_MODEL_RUNNER=1.
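The post does not include sampler code, but the Gumbel-Max trick it names is standard, so here is a minimal PyTorch sketch of the idea (the function name and the guard against log(0) are ours; MRV2's actual sampler is a fused Triton kernel, not this reference implementation):

```python
import torch

def gumbel_max_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample token ids from softmax(logits / temperature) via the Gumbel-Max trick."""
    # argmax(logits / T + g), with g ~ Gumbel(0, 1) drawn i.i.d. per logit,
    # is distributed identically to sampling from softmax(logits / T).
    # Sampling reduces to one noise draw plus an argmax: no explicit softmax
    # and no cumulative sum, which is what makes it easy to fuse into a kernel.
    u = torch.rand_like(logits).clamp_min(torch.finfo(logits.dtype).tiny)
    gumbel = -torch.log(-torch.log(u))
    return torch.argmax(logits / temperature + gumbel, dim=-1)

logits = torch.randn(4, 32000)                    # (batch, vocab)
token_ids = gumbel_max_sample(logits, temperature=0.8)  # shape (4,)
```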
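Since the user-facing API is unchanged, opting in is just the environment variable from the post. A minimal sketch using vLLM's offline `LLM` API (the model is the one from the post's benchmark; setting the variable in-process before engine construction is our assumption, and exporting it in the shell works just as well):

```python
import os

# VLLM_USE_V2_MODEL_RUNNER=1 is the opt-in flag named in the announcement.
# Set it before constructing the engine (or export it in the shell).
os.environ["VLLM_USE_V2_MODEL_RUNNER"] = "1"

from vllm import LLM, SamplingParams

# No code changes are needed: the runner swap is transparent to callers.
llm = LLM(model="Qwen/Qwen3-0.6B")
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```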

Table of contents

- Why Model Runner V2?
- What's New in Model Runner V2?
- Performance
- Limitations and Current Status
- Getting Started
- Acknowledgments