EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

EAGLE 3.1 is a new release of the EAGLE speculative decoding algorithm, developed jointly by the EAGLE team, vLLM, and TorchSpec. It targets a fragility issue called "attention drift", in which the drafter shifts attention away from sink tokens at deeper speculation depths. Two architectural fixes address it: FC normalization after each target hidden state, and feeding post-norm hidden states into the next decoding step. Compared with EAGLE 3, these changes yield up to 2× longer acceptance length in long-context workloads, better robustness to chat templates and system prompts, and more stable acceptance lengths across serving environments. EAGLE 3.1 is integrated into vLLM as a config-driven extension that remains fully backward compatible with EAGLE 3 checkpoints, and it will ship in vLLM v0.22.0. TorchSpec now supports EAGLE 3.1 training. A draft model for Kimi K2.6 has been open-sourced; it achieves 2.03× per-user output throughput at concurrency 1 on coding benchmarks.
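Since the headline results are stated as acceptance length, a small illustration of that metric may help. The sketch below assumes plain greedy verification, where a drafted token is kept only if it equals the token the target model would have chosen, and it counts the target's own "bonus" token in the total. The function names are hypothetical and are not taken from EAGLE, vLLM, or TorchSpec; whether the figures quoted above use exactly this counting convention is also an assumption.

```python
from typing import Iterable, Sequence, Tuple


def accepted_length(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Tokens emitted by one draft-then-verify step under greedy verification.

    draft_tokens:  the k tokens proposed by the drafter.
    target_tokens: the k + 1 tokens the target model picks at the same
                   positions; the extra entry is the token that follows
                   the last drafted position.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    matched = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # the first mismatch ends the accepted prefix
        matched += 1

    # The target's own token at the stopping position is always emitted,
    # so every step yields at least one token.
    return matched + 1


def mean_acceptance_length(steps: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average number of tokens emitted per step over (draft, target) pairs."""
    lengths = [accepted_length(draft, target) for draft, target in steps]
    if not lengths:
        raise ValueError("at least one step is required")
    return sum(lengths) / len(lengths)
```

Under this convention, a drafter whose proposals diverge early contributes close to one token per step, while one that stays aligned with the target at deeper speculation depths contributes close to k + 1. That is why a drop in acceptance length at depth is the symptom the summary associates with attention drift.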

#ai-inference

#vllm

May 26 • 4m read time • From vllm.ai

Table of contents

EAGLE 3.1 Innovations
EAGLE 3.1 Training with TorchSpec
EAGLE 3.1 Integration with vLLM
Open-Source Collaboration Across the Ecosystem
