Google Boosts Text Generation Speed Up to 4x via Diffusion Models

• Google DeepMind released DiffusionGemma, an experimental model using diffusion techniques to accelerate text generation up to 4x.
• The 26B MoE model features 3.8B active inference parameters, enabling local execution on hardware with 18 GB of VRAM.
• A new parallel block-generation and bidirectional refinement method improves efficiency for code completion and inline editing tasks.

Google DeepMind released DiffusionGemma, an experimental open model designed to accelerate text generation, on June 10, 2026. By applying principles from image-focused diffusion models to text, the system generates text up to 4x faster than traditional autoregressive models on GPU hardware.

Built upon research from Gemma 4 and Gemini Diffusion, DiffusionGemma is a 26B MoE model released under the Apache 2.0 license. Unlike standard autoregressive models, which generate tokens one at a time, this architecture uses a 256-token canvas to generate and refine entire text blocks in parallel. Because each forward pass produces many tokens, the model weights are read from memory fewer times per generated token, which eases the memory bandwidth bottleneck and improves GPU compute utilization.
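The difference between sequential decoding and block-parallel generation can be sketched with a toy example. This is only an illustration, assuming a masked-diffusion-style loop in which every open position on the canvas is predicted in each pass and the most confident predictions are committed. The source does not describe DiffusionGemma's actual sampling procedure; `toy_denoiser`, `refine_block`, the confidence rule, and `tokens_per_pass` are hypothetical stand-ins, and the sketch does not revisit tokens once they are committed.

```python
import random

MASK = None  # placeholder for a canvas position that is not yet decided


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model (it simply reads the known target).

    In one pass it proposes a token and a confidence score for every masked
    position, looking at the whole canvas rather than only the left context.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok is MASK:
            # Confidence grows when neighbours on either side are already
            # committed, mimicking bidirectional context.
            left = i > 0 and canvas[i - 1] is not MASK
            right = i + 1 < len(canvas) and canvas[i + 1] is not MASK
            confidence = 0.3 + 0.3 * left + 0.3 * right + 0.1 * rng.random()
            proposals[i] = (target[i], confidence)
    return proposals


def refine_block(target, tokens_per_pass=32, seed=0):
    """Fill a canvas by committing the most confident proposals in each pass.

    Returns the finished canvas and the number of model passes used.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in canvas):
        proposals = toy_denoiser(canvas, target, rng)
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:tokens_per_pass]:
            canvas[i] = tok
        passes += 1
    return canvas, passes


def autoregressive_passes(target):
    """A sequential decoder needs one model pass per generated token."""
    return len(target)


if __name__ == "__main__":
    block = [f"tok{i}" for i in range(256)]  # 256-token canvas, as in the article
    canvas, passes = refine_block(block)
    print("parallel passes:", passes)
    print("sequential passes:", autoregressive_passes(block))
```

In this toy setup the pass count depends entirely on the assumed `tokens_per_pass`, so it says nothing about the real model's speedup; it only shows why committing many positions per pass reduces the number of times the weights must be traversed.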

The model records over 1,000 tokens per second on NVIDIA H100 hardware and over 700 tokens per second on NVIDIA GeForce RTX 5090 hardware. Because the model can reference text in both directions across an entire block during generation, it is well-suited for code completion and inline editing while maintaining structural consistency. The total parameter count is 25.2B, but only 3.8B parameters are active during inference, and the model fits within 18 GB of VRAM for local operation.
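A back-of-the-envelope calculation shows how the total parameter count relates to that memory budget. This is a rough sketch: the bit-widths are assumptions chosen for illustration (the source does not state the precision behind the 18 GB figure), and the estimate covers weights only, ignoring activations and any caches.

```python
TOTAL_PARAMS = 25.2e9   # total parameter count, from the article
ACTIVE_PARAMS = 3.8e9   # parameters active during inference, from the article
VRAM_BUDGET_GB = 18     # VRAM figure, from the article


def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, not stated in the source
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits within" if gb <= VRAM_BUDGET_GB else "exceeds"
        print(f"{bits}-bit weights: {gb:.1f} GB ({verdict} {VRAM_BUDGET_GB} GB)")
    print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, only the 4-bit case lands below 18 GB. In a typical MoE deployment all expert weights stay resident in memory, so the active-parameter count mainly affects compute per token rather than the storage footprint.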

Google positions this model for local conversational applications where speed is a priority, or for tasks with specific constraints. Because this is an experimental release, Google recommends the standard Gemma 4 when output quality is the priority. The model is currently available on Hugging Face and supports inference and fine-tuning through frameworks such as vLLM and MLX.
