Benchmarking results show that Eagle3 speculative decoding in vLLM delivers consistent throughput and latency improvements for the gpt-oss-120B mixture-of-experts model, even at up to 200 concurrent requests, contradicting the common assumption that speculative decoding only helps at low concurrency. Across three real-world workloads (ShareGPT, MLPerf, and SWE-bench), the gains hold up at production-scale concurrency.
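For context, here is a minimal sketch of how a setup like this can be enabled through vLLM's offline `LLM` API, assuming a recent vLLM release that accepts a `speculative_config` dictionary. The Eagle3 draft-model path, tensor-parallel size, and draft token count below are illustrative placeholders, not the exact configuration benchmarked in this post.

```python
# Sketch: serving gpt-oss-120B with an Eagle3 draft model in vLLM.
# The draft-model path and num_speculative_tokens are placeholders,
# not the values used in the benchmarks described here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    tensor_parallel_size=4,  # adjust to the number of available GPUs
    speculative_config={
        "method": "eagle3",                                 # Eagle3 speculative decoding path
        "model": "path/to/eagle3-draft-for-gpt-oss-120b",   # placeholder draft model
        "num_speculative_tokens": 3,                        # draft tokens proposed per step (tunable)
    },
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```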

Table of contents
- Why it matters: 19% cost savings at scale
- Experimental setup
- Metrics
- Benchmarking on ShareGPT: Speculative decoding versus baseline
- Tuning the draft token count: How many speculative tokens?
- Validating on MLPerf: Cross-dataset and cross-TP confirmation
- Coding workloads: SWE-bench validation
- Cost implications for production deployments
- Conclusion
- Get started