Google has released multi-token prediction (MTP) drafter models for Gemma 4 that use speculative decoding to generate tokens up to roughly 3x faster without quality loss. These lightweight auxiliary models run alongside Gemma 4: the drafter proposes several future tokens at once, and the main model verifies them in a single pass, which addresses the memory-bandwidth bottleneck common in LLM inference. The technique benefits personal computers, consumer GPUs, and mobile devices running various Gemma 4 variants. A key architectural improvement is that the MTP drafters share the target model's KV cache, which reduces memory overhead. Community feedback notes that the approach is most effective in low-concurrency scenarios such as mobile and edge deployments, with limited gains for large-scale API providers.
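To make the draft-then-verify idea concrete, here is a minimal sketch of greedy speculative decoding in Python. It is not Gemma 4's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the large model and the drafter, and the verification loop only simulates what a real target model would compute in one batched forward pass. KV-cache sharing is not modeled.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical toy 'large' model: a deterministic next-token rule."""
    return sum(context) % 7


def draft_next(context: List[Token]) -> Token:
    """Hypothetical toy drafter: agrees with the target unless the last token is 3."""
    guess = sum(context) % 7
    return (guess + 1) % 7 if context[-1] == 3 else guess


def greedy_generate(prompt: List[Token], target: NextTokenFn, num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_generate(
    prompt: List[Token],
    target: NextTokenFn,
    drafter: NextTokenFn,
    num_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_new:
        # 1. The cheap drafter proposes k tokens ahead.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. Verification. A real target model would score all k + 1 positions
        #    in a single forward pass; the loop below simulates that pass.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(k + 1):
            expected = target(out + accepted)
            accepted.append(expected)
            # Stop at the first disagreement (keeping the target's correction),
            # or after the bonus token when every drafted token was accepted.
            if i == k or draft[i] != expected:
                break
        out.extend(accepted)
    return out[: len(prompt) + num_new], target_passes


if __name__ == "__main__":
    tokens, passes = speculative_generate([1, 2], target_next, draft_next, num_new=12)
    print(tokens == greedy_generate([1, 2], target_next, 12), passes)
```

Because every emitted token is the one the target itself would have chosen, the output matches plain greedy decoding exactly, which is the sense in which the speedup comes "without quality loss". The gain comes from needing fewer target passes than generated tokens whenever the drafter guesses correctly.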