Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family, enabling up to 3x faster inference through speculative decoding with no quality degradation. The technique pairs a lightweight drafter model with the heavier target model: the drafter predicts multiple future tokens in parallel, and the target model verifies them in a single forward pass. This addresses the memory-bandwidth bottleneck inherent in standard autoregressive LLM inference. The MTP drafters share the target model's KV cache and activations, avoiding redundant computation. Additional optimizations include an efficient embedding-clustering technique for the edge models (E2B/E4B) and hardware-specific tuning for Apple Silicon and NVIDIA GPUs. The drafters are available under Apache 2.0 on Hugging Face and Kaggle, with support in vLLM, MLX, SGLang, and Ollama.
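
To make the draft-then-verify step concrete, here is a minimal sketch of greedy speculative decoding in Python. The `draft_model` and `target_model` callables are hypothetical stand-ins (not Google's or any library's API) that map a token sequence to next-token logits. With greedy acceptance, a drafted token is kept only while it matches the target's own argmax, so the output is token-for-token identical to running the target model alone, which is why the speedup comes without quality loss.

```python
# Minimal sketch of one speculative-decoding step with greedy acceptance.
# `draft_model` and `target_model` are hypothetical callables standing in
# for the MTP drafter and the Gemma target model described above.
import numpy as np

def speculative_step(target_model, draft_model, tokens, k=4):
    """Draft k tokens cheaply, then verify them with one target pass.

    Assumes draft_model(seq) returns next-token logits for seq, and
    target_model(seq) returns logits for the last k + 1 positions,
    i.e. all_logits[i] are the target's logits after seq[:len(tokens)+i].
    Returns the tokens accepted this step.
    """
    # 1) Drafter proposes k future tokens autoregressively (cheap).
    draft = list(tokens)
    for _ in range(k):
        logits = draft_model(draft)           # next-token logits
        draft.append(int(np.argmax(logits)))
    proposed = draft[len(tokens):]

    # 2) Target scores all k + 1 positions in a single forward pass.
    all_logits = target_model(draft)          # shape: (k + 1, vocab)

    # 3) Accept the longest prefix where drafter and target agree.
    accepted = []
    for i, tok in enumerate(proposed):
        if int(np.argmax(all_logits[i])) == tok:
            accepted.append(tok)
        else:
            break

    # The first disagreement (or the bonus k+1-th position) comes from
    # the target itself, so every step emits at least one correct token.
    accepted.append(int(np.argmax(all_logits[len(accepted)])))
    return accepted
```

In real systems the acceptance rule is the stochastic rejection-sampling test from the speculative decoding literature, which preserves the target model's output distribution under sampling as well, not just under greedy decoding.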

From blog.google (5 min read)
Table of contents:
- Why speculative decoding?
- How speculative decoding works
- Unlocking faster AI from the edge to the workstation
