Performance improvements with speculative decoding in vLLM for gpt-oss

Benchmarking results show that Eagle3 speculative decoding in vLLM delivers consistent throughput and latency improvements for the gpt-oss-120B mixture-of-experts model, even at up to 200 concurrent requests. This contradicts the common assumption that speculative decoding only helps at low concurrency. Across three real-world datasets (ShareGPT, MLPerf, SWE-bench) and two tensor-parallelism settings, output throughput improves by 10–21% and end-to-end latency drops by 12–20%. On code-heavy SWE-bench workloads, the gains translate to a 19.4% reduction in cost per 1M output tokens on H200 GPUs. The analysis also covers the optimal draft token count: 2–3 draft tokens is the sweet spot, while 4 tokens add overhead without proportional benefit.
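One way to build intuition for why extra draft tokens give diminishing returns is a simplified textbook model of speculative decoding. The sketch below is not taken from the article and does not reproduce its measurements. It assumes every draft token is accepted independently with the same probability, and the function names and the acceptance rate used in the usage note are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Simplifying assumption (not from the article): each draft token is
    accepted independently with probability `acceptance_rate`, and the first
    rejection discards all later draft tokens. Every step still yields at
    least one token produced by the target model itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    # Geometric sum: 1 + a + a^2 + ... + a^k
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


def marginal_gain(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Extra expected tokens per step from adding the k-th draft token (k >= 1)."""
    if num_draft_tokens < 1:
        raise ValueError("num_draft_tokens must be at least 1")
    return (expected_tokens_per_step(acceptance_rate, num_draft_tokens)
            - expected_tokens_per_step(acceptance_rate, num_draft_tokens - 1))


if __name__ == "__main__":
    hypothetical_acceptance_rate = 0.6  # illustrative value, not a measured one
    for k in range(1, 5):
        total = expected_tokens_per_step(hypothetical_acceptance_rate, k)
        gain = marginal_gain(hypothetical_acceptance_rate, k)
        print(f"k={k}: expected tokens/step={total:.3f}, marginal gain={gain:.3f}")
```

Under this model the k-th draft token adds only acceptance_rate^k expected tokens, while each additional draft token still costs drafting and verification work that the model ignores. That is one plausible reading of why a fourth draft token can add overhead without proportional benefit.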

#ai-inference

#vllm

Apr 16 • 13m read time • From developers.redhat.com

Table of contents

Why it matters: 19% cost savings at scale
Experimental setup
Metrics
Benchmarking on ShareGPT: Speculative decoding versus baseline
Tuning the draft token count: How many speculative tokens?
Validating on MLPerf: Cross-dataset and cross-TP confirmation
Coding workloads: SWE-bench validation
Cost implications for production deployments
Conclusion
Get started
