Google Releases MTP to Speed Up Gemma 4 Inference
- Google releases 'Multi-Token Prediction' (MTP) technology, accelerating Gemma 4 inference by up to 3x.
- The method uses speculative decoding: a lightweight draft model predicts tokens that the target model then verifies in parallel.
- Model weights are available on platforms like Hugging Face, enabling low-latency use cases for chat and agent workflows.
On May 5, 2026, Google announced the release of 'Multi-Token Prediction' (MTP) drafters designed to improve inference speed for the open-model Gemma 4 family. The architecture pairs a lightweight draft model, which pre-computes future tokens, with the larger target model, which verifies them in parallel. According to Google, this approach delivers up to a 3x increase in inference speed without compromising output quality or logical consistency.
MTP targets the Gemma 4 series announced in March 2026, which comes in four sizes: E2B, E4B, 31B, and 26B A4B; drafter support for these models was released on April 16. Standard LLM inference is typically bottlenecked by memory bandwidth, because the model's parameters must be streamed from memory for every single token generated. With speculative decoding, the draft model proposes multiple candidate tokens at once, and the target model validates them in a single forward pass.
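The draft-and-verify loop is easy to sketch. In the toy below, both "models" are dummy Python functions rather than Gemma weights, and the misprediction rule is invented purely for illustration; only the control flow (propose k tokens, accept the longest verified prefix, bank one extra token from the verify pass) reflects the actual technique.

```python
def target_next_token(prefix: list[int]) -> int:
    """Stand-in for the large target model: defines the 'correct' next token."""
    return (sum(prefix) * 7 + 3) % 50

def draft_next_tokens(prefix: list[int], k: int) -> list[int]:
    """Stand-in for the cheap drafter: usually right, occasionally wrong."""
    seq, out = list(prefix), []
    for _ in range(k):
        tok = target_next_token(seq)
        if len(seq) % 3 == 0:        # injected misprediction every few steps
            tok = (tok + 1) % 50
        out.append(tok)
        seq.append(tok)
    return out

def speculative_decode(prompt: list[int], n_new: int, k: int = 4) -> list[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < n_new:
        draft = draft_next_tokens(seq, k)
        # In a real system, one target forward pass scores all k draft
        # positions at once; this loop simulates that parallel verification.
        accepted = 0
        for i, tok in enumerate(draft):
            if target_next_token(seq + draft[:i]) != tok:
                break
            accepted += 1
        seq += draft[:accepted]
        # The same verify pass also yields the target's own token at the
        # first mismatch, so every iteration gains at least one token.
        seq.append(target_next_token(seq))
    return seq[: len(prompt) + n_new]

print(speculative_decode([1, 2, 3], n_new=12))
```

Because accepted tokens match what the target would have produced anyway, the output sequence is identical to plain autoregressive decoding; only the number of expensive target passes shrinks.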
The draft model shares the target model's input embeddings and conditions on its final-layer activations to keep its predictions accurate. Google confirms that output quality is equivalent to standard autoregressive generation. Users can enable the feature by designating the lightweight four-layer MTP drafter as an assistant model in frameworks such as Hugging Face Transformers, as sketched below.
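A minimal sketch of that setup with the Transformers assisted-generation API follows. The checkpoint names are placeholders, since the article does not give the exact repository IDs.

```python
# Assisted generation in Hugging Face Transformers: pass the drafter as
# `assistant_model` to generate(). Repo IDs below are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-4-31b"       # placeholder target repo ID
DRAFTER_ID = "google/gemma-4-31b-mtp"  # placeholder MTP drafter repo ID

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)

# With an assistant model, generate() drafts tokens cheaply and lets the
# target verify them in parallel; outputs match target-only decoding.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```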
Google expects the technology to improve responsiveness for low-latency chat, voice applications, agent workflows, and on-device mobile experiences. The integration enables faster execution of the 26B MoE and 31B dense models on personal computers and consumer GPUs. Performance gains depend on the environment; the Gemma 4 26B A4B (the MoE model) showed up to roughly a 2.2x speedup on Apple Silicon at batch sizes of 4 to 8. The model weights are licensed under Apache 2.0 and available via Hugging Face and Kaggle, with support for serving stacks including vLLM, SGLang, and Ollama.
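For serving, a hedged vLLM sketch under the same assumptions: the repository IDs are again placeholders, and the speculative-decoding options shown follow recent vLLM releases, whose exact interface has varied across versions.

```python
# Hypothetical vLLM setup; adjust to the options your vLLM version exposes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31b",             # placeholder target repo ID
    speculative_config={
        "model": "google/gemma-4-31b-mtp",  # placeholder drafter repo ID
        "num_speculative_tokens": 4,        # draft length per verify step
    },
)

params = SamplingParams(max_tokens=64)
print(llm.generate(["Summarize speculative decoding."], params)[0].outputs[0].text)
```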