Google Releases DiffusionGemma for Rapid Text Generation

• Google released the experimental open-source DiffusionGemma model, which uses diffusion techniques to generate blocks of text simultaneously.
• The 26B parameter model achieves 4x faster inference, hitting 1000+ tokens per second on NVIDIA H100 GPUs.
• Designed for speed-critical local workflows, the model uses bi-directional attention and iterative refinement to optimize interactive generation tasks.

On June 10, 2026, Google researchers Brendan O'Donoghue and Sebastian Flennerhag released DiffusionGemma, an experimental open-source model designed for high-speed text generation. Unlike standard autoregressive models, which produce text one token at a time, this 26B Mixture of Experts (MoE) model uses a diffusion-based approach to generate blocks of text simultaneously. The model is available under an Apache 2.0 license and is compatible with frameworks like MLX, vLLM, and Hugging Face Transformers.

Performance benchmarks show that DiffusionGemma achieves up to 4x faster text generation on dedicated hardware. It reaches speeds exceeding 1000 tokens per second on a single NVIDIA H100 and over 700 tokens per second on an NVIDIA GeForce RTX 5090. Although the model has 26B parameters in total, it activates only 3.8B of them during inference, and it fits into 18 GB of VRAM when quantized. It supports bi-directional attention, which enables the generation of 256 tokens per forward pass.
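The memory figure can be sanity-checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a published calculation: it assumes that all 26B weights must be resident in memory (as is typical for MoE models, even though only 3.8B are active per pass) and counts weights only, ignoring activations and runtime overhead. The bit widths are illustrative assumptions; the announcement does not say which quantization yields the 18 GB figure.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9    # total parameter count reported in the article
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, per the article

# Illustrative bit widths (assumed, not taken from the announcement).
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")

# Share of the weights that take part in any single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {active_fraction:.1%}")
```

Under these assumptions, unquantized 16-bit weights would need roughly 52 GB, so an 18 GB footprint implies a quantization somewhere between 4 and 8 bits per weight on average.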

This architecture is optimized for speed-critical workflows such as in-line editing, rapid code iteration, and mathematical graph generation. The system employs an iterative refinement process that functions similarly to image diffusion, starting from random placeholders and polishing text blocks in real time. While it excels at low-to-medium batch sizes on local accelerators, Google notes that DiffusionGemma produces lower-quality output than standard Gemma 4 models and is therefore not intended for production environments requiring maximum precision.
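To make the refinement idea concrete, here is a toy sketch of block-wise iterative refinement. It is not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the target text and assigns random confidences, whereas a real model would predict tokens and confidences from context. The sketch only illustrates the control flow: start from random placeholders, propose tokens for every open position at once, and commit the most confident proposals at each step.

```python
import random
import string


def toy_denoiser(block, fixed, target, rng):
    """Hypothetical model stand-in: one (confidence, position, token)
    proposal for every position that has not been committed yet."""
    return [(rng.random(), i, target[i])
            for i in range(len(block)) if not fixed[i]]


def refine(target, steps, seed=0):
    """Refine a block of random placeholders into `target` in `steps` passes."""
    rng = random.Random(seed)
    # Start from random placeholder characters, analogous to noise.
    block = [rng.choice(string.ascii_lowercase) for _ in target]
    fixed = [False] * len(target)
    history = ["".join(block)]
    for step in range(steps):
        proposals = toy_denoiser(block, fixed, target, rng)
        if not proposals:
            break
        # Spread the open positions evenly over the remaining steps
        # (ceiling division), so the last step commits everything left.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)
        proposals.sort(reverse=True)  # most confident first
        for _, i, tok in proposals[:k]:
            block[i] = tok
            fixed[i] = True
        history.append("".join(block))
    return "".join(block), history


final, history = refine("hello diffusion", steps=3)
for snapshot in history:
    print(snapshot)
```

Each printed snapshot shows the whole block after one pass. The number of passes is fixed in advance and independent of block length, which is where the speed advantage over token-by-token decoding would come from in this simplified picture.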