Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family, enabling up to 3x faster inference through speculative decoding with no quality degradation. The technique pairs a lightweight drafter model with the heavier target model: the drafter predicts multiple future tokens in parallel, and the target model verifies them in a single forward pass. This addresses the memory-bandwidth bottleneck inherent in standard autoregressive LLM inference. The MTP drafters share the target model's KV cache and activations, avoiding redundant computation. Additional optimizations include an efficient embedding-clustering technique for the edge models (E2B/E4B) and hardware-specific tuning for Apple Silicon and NVIDIA GPUs. The drafters are available under Apache 2.0 on Hugging Face and Kaggle, with support in vLLM, MLX, SGLang, and Ollama.
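
To make the draft-then-verify step concrete, here is a minimal sketch of greedy speculative decoding in Python. The `draft_model` and `target_model` callables are hypothetical stand-ins (not Google's or any library's API) that map a token sequence to next-token logits. With greedy acceptance, a drafted token is kept only while it matches the target's own argmax, so the output is token-for-token identical to running the target model alone, which is why the speedup comes without quality loss.

```python
# Minimal sketch of one speculative-decoding step with greedy acceptance.
# `draft_model` and `target_model` are hypothetical callables standing in
# for the MTP drafter and the Gemma target model described above.
import numpy as np

def speculative_step(target_model, draft_model, tokens, k=4):
    """Draft k tokens cheaply, then verify them with one target pass.

    Assumes draft_model(seq) returns next-token logits for seq, and
    target_model(seq) returns logits for the last k + 1 positions,
    i.e. all_logits[i] are the target's logits after seq[:len(tokens)+i].
    Returns the tokens accepted this step.
    """
    # 1) Drafter proposes k future tokens autoregressively (cheap).
    draft = list(tokens)
    for _ in range(k):
        logits = draft_model(draft)           # next-token logits
        draft.append(int(np.argmax(logits)))
    proposed = draft[len(tokens):]

    # 2) Target scores all k + 1 positions in a single forward pass.
    all_logits = target_model(draft)          # shape: (k + 1, vocab)

    # 3) Accept the longest prefix where drafter and target agree.
    accepted = []
    for i, tok in enumerate(proposed):
        if int(np.argmax(all_logits[i])) == tok:
            accepted.append(tok)
        else:
            break

    # The first disagreement (or the bonus k+1-th position) comes from
    # the target itself, so every step emits at least one correct token.
    accepted.append(int(np.argmax(all_logits[len(accepted)])))
    return accepted
```

In real systems the acceptance rule is the stochastic rejection-sampling test from the speculative decoding literature, which preserves the target model's output distribution under sampling as well, not just under greedy decoding.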

From blog.google (5 min read)
Table of contents:
- Why speculative decoding?
- How speculative decoding works
- Unlocking faster AI from the edge to the workstation
