NVIDIA Optimizes Google DeepMind’s DiffusionGemma for Local AI

NVIDIA
Thursday, June 11, 2026
  • Google DeepMind released DiffusionGemma, a parallel-processing open model for rapid text generation.
  • NVIDIA optimized the model to achieve 4x faster performance than standard autoregressive LLMs.
  • DiffusionGemma runs locally on NVIDIA hardware, reaching up to 2,000 tokens/sec on DGX Station.

Google DeepMind released DiffusionGemma, an experimental open model designed for high-speed text generation, on June 10, 2026. NVIDIA has optimized the model for its hardware ecosystem, including GeForce RTX GPUs, the NVIDIA RTX PRO platform, and DGX Spark systems, enabling performance up to 4x faster than traditional autoregressive models. Unlike standard LLMs, which generate one token at a time, DiffusionGemma denoises up to 256 tokens per step in parallel, significantly reducing latency for single-user workloads such as interactive chat and agentic loops.
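To make the difference in step counts concrete, the sketch below counts forward passes for a response of a given length. It is an illustration under simplifying assumptions, not a description of DiffusionGemma's actual decoder: the function names are hypothetical, and `passes_per_block` (how many denoising passes a block needs before it is final) is a made-up knob that the article does not specify. Only the 256-token block size comes from the text above.

```python
import math

BLOCK_SIZE = 256  # tokens denoised per step, as stated in the article


def autoregressive_steps(num_tokens: int) -> int:
    """An autoregressive decoder needs one forward pass per generated token."""
    return num_tokens


def block_diffusion_steps(
    num_tokens: int,
    passes_per_block: int = 1,  # hypothetical: denoising passes needed per block
    block_size: int = BLOCK_SIZE,
) -> int:
    """Forward passes when each pass refines a whole block of tokens in parallel."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * passes_per_block


if __name__ == "__main__":
    n = 1024  # hypothetical response length
    print("autoregressive passes:", autoregressive_steps(n))
    print("block passes, 1 pass per block:", block_diffusion_steps(n))
    print("block passes, 32 passes per block:", block_diffusion_steps(n, passes_per_block=32))
```

The ratio of the two counts is only a rough ceiling on any speedup, since a single block pass does more work than a single-token pass.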

Built on the Gemma 4 architecture, the model features a 26-billion-parameter mixture-of-experts design that activates 3.8 billion parameters per step. The architecture combines a diffusion head with the underlying Gemma 4 framework to process blocks of text in parallel rather than sequentially. Processing a whole block at once shifts the workload from memory-bound to compute-bound, which lets NVIDIA Tensor Cores accelerate the underlying matrix operations efficiently. The model is available under an Apache 2.0 license and supports day-zero integration with frameworks like Hugging Face Transformers, vLLM, and Unsloth.
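The parameter counts and the memory-bound versus compute-bound point can be checked with some back-of-the-envelope arithmetic. The sketch below is a rough roofline-style estimate, not a measurement: it assumes 2 bytes per parameter (16-bit weights, which the article does not state), treats every active weight as read once per pass and used in one multiply-add per token, and ignores attention, activations, and the fact that mixture-of-experts routing spreads tokens across different experts. The helper names are hypothetical.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters activated per step, from the article


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Share of the model's parameters used in a single step."""
    return active / total


def active_weight_gb(active: float = ACTIVE_PARAMS, bytes_per_param: int = 2) -> float:
    """Weight bytes touched per pass, in GB (bytes_per_param is an assumption)."""
    return active * bytes_per_param / 1e9


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs per weight byte read.

    Each weight contributes one multiply-add (2 FLOPs) per token in the pass,
    and is read from memory once regardless of how many tokens share the pass.
    """
    return 2 * tokens_per_pass / bytes_per_param


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    print(f"active weights per pass: {active_weight_gb():.1f} GB")
    print(f"FLOPs/byte at 1 token per pass: {arithmetic_intensity(1):.0f}")
    print(f"FLOPs/byte at 256 tokens per pass: {arithmetic_intensity(256):.0f}")
```

Under these assumptions, a pass over one token does very little math per byte of weights loaded, while a pass over a 256-token block reuses each loaded weight many times, which is the sense in which the workload moves toward being compute-bound.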

Benchmarks on NVIDIA hardware show the model reaching 1,000 tokens/sec on a single NVIDIA H100 GPU, 150 tokens/sec on NVIDIA DGX Spark, and up to 2,000 tokens/sec on NVIDIA DGX Station. The NVIDIA DGX Spark, a deskside supercomputer powered by the GB10 Grace Blackwell Superchip with 128GB of unified memory, provides a local environment for prototyping and fine-tuning. Developers can also use NVIDIA NeMo frameworks for domain-specific adaptation, or test the model for free via NVIDIA-hosted APIs at build.nvidia.com.
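To put the throughput figures in terms of waiting time, the short sketch below converts each quoted tokens/sec number into the time needed to generate a response. The 600-token response length is a hypothetical example, the function name is illustrative, and the estimate ignores prompt processing and any startup latency, so it is only a rough guide. The DGX Station figure is an "up to" number, so its result is a best case.

```python
# Throughput figures quoted in the article, in tokens per second.
THROUGHPUT_TOKENS_PER_SEC = {
    "NVIDIA H100": 1000,
    "NVIDIA DGX Spark": 150,
    "NVIDIA DGX Station": 2000,  # quoted as "up to"
}


def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    response_tokens = 600  # hypothetical response length
    for system, rate in THROUGHPUT_TOKENS_PER_SEC.items():
        seconds = generation_seconds(response_tokens, rate)
        print(f"{system}: {seconds:.2f} s for {response_tokens} tokens")
```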

Read original (English)·Jun 10, 2026
Tags: diffusiongemma, gemma 4, nvidia dgx, geforce rtx, parallel generation, inference