EAGLE 3.1 is a new release of the speculative decoding algorithm, developed collaboratively by the EAGLE team, vLLM, and TorchSpec. It addresses a fragility issue called 'attention drift', in which the drafter shifts attention away from sink tokens at deeper speculation depths, through two architectural fixes: FC normalization after each target hidden state, and feeding post-norm hidden states into the next decoding step. Compared to EAGLE 3, these changes yield up to 2× longer acceptance length in long-context workloads, better robustness to chat templates and system prompts, and more stable acceptance lengths across serving environments. EAGLE 3.1 is integrated into vLLM as a config-driven extension with full backward compatibility for EAGLE 3 checkpoints, and will ship in vLLM v0.22.0. TorchSpec now supports EAGLE 3.1 training. A draft model for Kimi K2.6 has been open-sourced; it achieves 2.03× per-user output throughput at concurrency 1 on coding benchmarks.

From vllm.ai
Table of contents
EAGLE 3.1 Innovations
EAGLE 3.1 Training with TorchSpec
EAGLE 3.1 Integration with vLLM
Open-Source Collaboration Across the Ecosystem
