Google Boosts Text Generation Speed 4x via Diffusion Models

Ledge AI
Monday, June 15, 2026
  • Google DeepMind released DiffusionGemma, an experimental model using diffusion techniques to accelerate text generation up to 4x.
  • The 26B MoE model features 3.8B active inference parameters, enabling local execution on hardware with 18GB of VRAM.
  • A new parallel block-generation and bidirectional refinement method improves efficiency for code completion and inline editing tasks.

On June 10, 2026, Google DeepMind released DiffusionGemma, an experimental open model designed to accelerate text generation. By applying the principles of diffusion models, best known from image generation, to text, the system generates output up to 4x faster than traditional autoregressive models on GPU hardware.

Built upon research from Gemma 4 and Gemini Diffusion, DiffusionGemma is a 26B MoE model released under the Apache 2.0 license. Unlike standard autoregressive models that generate tokens sequentially, this architecture uses a 256-token canvas to generate and refine entire text blocks in parallel. This approach mitigates memory bandwidth bottlenecks and maximizes GPU compute utilization.
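The contrast between sequential and block-parallel generation can be illustrated with a toy sketch. This is not DiffusionGemma's actual decoding algorithm, which the article does not describe in detail. It assumes a simple confidence-ordered unmasking schedule, and `toy_denoiser`, `decode_block`, and the `steps` parameter are hypothetical names introduced only for illustration. The stand-in "model" is handed the target text, so the sketch demonstrates the control flow only, not any learned behavior.

```python
import random

MASK = None          # placeholder for a canvas position that is not yet committed
CANVAS_SIZE = 256    # canvas length reported in the article


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model: proposes a token and a confidence
    for every masked position, looking at the whole canvas at once."""
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok is MASK:
            # Confidence rises when a neighbour on either side is already filled,
            # a crude imitation of bidirectional context.
            left = i > 0 and canvas[i - 1] is not MASK
            right = i + 1 < len(canvas) and canvas[i + 1] is not MASK
            confidence = rng.random() + 0.5 * (left + right)
            proposals[i] = (target[i], confidence)
    return proposals


def decode_block(target, steps=8, seed=0):
    """Fill a canvas in at most `steps` parallel passes instead of one pass per token."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while any(tok is MASK for tok in canvas):
        proposals = toy_denoiser(canvas, target, rng)
        remaining_steps = max(steps - passes, 1)
        # Commit an even share of the remaining positions, most confident first.
        quota = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in best:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes
```

With a 256-token target and `steps=8`, the loop commits 32 positions per pass and finishes in 8 passes, whereas a token-by-token decoder would need 256 sequential steps for the same block. The real speedup depends on how many refinement passes the model needs and how expensive each pass is, neither of which the article specifies.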

In terms of performance, the model exceeds 1,000 tokens per second on NVIDIA H100 hardware and 700 tokens per second on NVIDIA GeForce RTX 5090 hardware. Because it can reference context in both directions across an entire block during generation, it is well suited to code completion and inline editing that must preserve structural consistency. Although the total parameter count is 25.2B, only 3.8B parameters are active during inference, and the model is reported to fit within 18GB of VRAM for local operation.
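The parameter and VRAM figures can be sanity-checked with rough weights-only arithmetic. The sketch below assumes that every expert's weights stay resident in VRAM and ignores activations, caches, and runtime overhead. Both are assumptions, not statements from the article, and the function names are hypothetical.

```python
TOTAL_PARAMS = 25.2e9    # total parameter count from the article
ACTIVE_PARAMS = 3.8e9    # parameters active during inference, from the article
VRAM_BUDGET_GB = 18      # VRAM figure from the article


def weight_gb(params, bits_per_param):
    """Weights-only footprint in decimal gigabytes (assumption: no overhead counted)."""
    return params * bits_per_param / 8 / 1e9


def max_bits_per_param(params, budget_gb):
    """Largest average bits per weight that fits the budget, weights only."""
    return budget_gb * 1e9 * 8 / params


# Footprint of all 25.2B weights at common precisions.
footprints = {bits: weight_gb(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}

# Average precision that an 18GB budget would allow if all weights are resident.
bits_that_fit = max_bits_per_param(TOTAL_PARAMS, VRAM_BUDGET_GB)

# Footprint of only the active parameters at 16 bits, for comparison.
active_only_gb = weight_gb(ACTIVE_PARAMS, 16)
```

Under these assumptions the full weights take about 50.4GB at 16 bits, 25.2GB at 8 bits, and 12.6GB at 4 bits, so an 18GB budget corresponds to roughly 5.7 bits per weight on average. That suggests the 18GB figure relies on quantized weights, while the 3.8B active parameters mainly reduce the compute and memory traffic per token rather than the amount of memory needed to hold the model.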

Google positions this model for local conversational applications where speed is a priority, or for tasks with specific constraints. Because the release is experimental, Google recommends the standard Gemma 4 when output quality is the priority. The model is currently available on Hugging Face and supports inference and fine-tuning through frameworks such as vLLM and MLX.

Read original (Japanese)·Jun 14, 2026
#diffusiongemma #google deepmind #moe #text generation #gpu optimization