Benchmarking results show that Eagle3 speculative decoding in vLLM delivers consistent throughput and latency improvements for the gpt-oss-120B mixture-of-experts model, even at up to 200 concurrent requests. This contradicts the common assumption that speculative decoding only helps at low concurrency. Across three real-world datasets (ShareGPT, MLPerf, SWE-bench) and two tensor-parallelism settings, output throughput improves by 10–21% and end-to-end latency drops by 12–20%. On code-heavy SWE-bench workloads, the gains translate to a 19.4% reduction in cost per 1M output tokens on H200 GPUs. The analysis also covers the optimal draft token count: 2–3 draft tokens is the sweet spot, while 4 tokens adds overhead without proportional benefit.
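
To make the link between throughput and cost per 1M output tokens concrete, here is a minimal sketch of one common way to compute it. The cost model (GPU-hours divided by tokens produced) is an assumption, not the article's stated methodology, and the GPU price, GPU count, baseline throughput, and 15% gain are hypothetical placeholders rather than benchmark results.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_second: float) -> float:
    """USD needed to generate 1M output tokens at a steady output throughput.

    Assumed model: total GPU cost per hour divided by tokens produced per hour.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


def cost_reduction_from_speedup(throughput_gain: float) -> float:
    """Fractional cost reduction when throughput rises by `throughput_gain`
    (0.15 means +15%) on the same hardware at the same hourly price."""
    return 1 - 1 / (1 + throughput_gain)


# Hypothetical inputs, chosen only for illustration.
GPU_HOURLY_USD = 4.0      # assumed price per GPU-hour
NUM_GPUS = 2              # assumed tensor-parallel size
BASELINE_TPS = 5000.0     # assumed aggregate output tokens/second
GAIN = 0.15               # assumed throughput gain from speculative decoding

baseline = cost_per_million_tokens(GPU_HOURLY_USD, NUM_GPUS, BASELINE_TPS)
speculative = cost_per_million_tokens(GPU_HOURLY_USD, NUM_GPUS,
                                      BASELINE_TPS * (1 + GAIN))

print(f"baseline:    ${baseline:.4f} per 1M output tokens")
print(f"speculative: ${speculative:.4f} per 1M output tokens")
print(f"reduction:   {cost_reduction_from_speedup(GAIN):.1%}")
```

Under this assumed model, a throughput gain of g lowers cost per token by 1 - 1/(1 + g), so the cost reduction is always somewhat smaller than the throughput gain itself. The article's own cost figures may rest on a different calculation.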

13m read time, from developers.redhat.com
Table of contents
Why it matters: 19% cost savings at scale
Experimental setup
Metrics
Benchmarking on ShareGPT: Speculative decoding versus baseline
Tuning the draft token count: How many speculative tokens?
Validating on MLPerf: Cross-dataset and cross-TP confirmation
Coding workloads: SWE-bench validation
Cost implications for production deployments
Conclusion
Get started
