

InfoQ

Google has released multi-token prediction (MTP) drafter models for Gemma 4 that use speculative decoding to achieve up to ~3x faster token generation without quality loss. Lightweight auxiliary models work alongside Gemma 4 by predicting several future tokens in parallel while the main model verifies them in a single pass, addressing the memory-bandwidth bottleneck common in LLM inference. The technique benefits personal computers, consumer GPUs, and mobile devices running various Gemma 4 variants. A key architectural improvement is that the MTP drafters share the target model's KV cache, reducing memory overhead. Community feedback notes the approach is most effective in low-concurrency scenarios like mobile and edge deployments, with limited gains for large-scale API providers.
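
The draft-then-verify loop described above can be illustrated with a toy sketch. This is not Google's implementation or any Gemma API: `target_model` and `draft_model` are hypothetical stand-in functions over integer tokens, greedy decoding is assumed, and the shared KV cache is not modeled. It only shows why accepting the longest matching prefix of the draft leaves the output identical to what the target model would have produced on its own.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    # Hypothetical stand-in for the large model: a fixed deterministic rule.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Hypothetical stand-in for the drafter: agrees with the target except
    # when the context length is a multiple of 4, where it guesses wrong.
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The drafter proposes k tokens autoregressively (cheap calls).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target scores all k + 1 positions. A real transformer would do
        #    this in one batched forward pass; this toy loops but counts it once.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Accept the longest prefix where the draft matches the target, then
        #    append the target's own token at the first mismatch (or a bonus
        #    token if every drafted token was accepted).
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(proposal[:n_accept])
        tokens.append(verified[n_accept])

    # The last round may overshoot the requested length; trim it.
    return tokens[:goal], target_passes
```

Calling `speculative_decode(target_model, draft_model, [1, 2, 3], 12)` returns the same token sequence as `greedy_decode(target_model, [1, 2, 3], 12)` while counting fewer target passes than generated tokens. In a real deployment the saving depends on how often the drafter's guesses are accepted, which this toy fixes artificially.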

Gemma 4 Multi-Token Prediction Delivers Up to ~3x Faster Token Generation