Qwen-Image-2.0 Foundation Model Released
- Qwen-Image-2.0 unifies high-fidelity generation and precise image editing within a single, advanced foundation model framework.
- The model uses Qwen3-VL as a condition encoder and a Multimodal Diffusion Transformer for joint target modeling.
- Qwen-Image-2.0 supports prompts of up to 1K tokens, improving text-rich generation, multilingual typography, and photorealism.
Qwen-Image-2.0 is an omni-capable image generation foundation model that unifies high-fidelity synthesis and precise editing within a single framework. Introduced by a large team of researchers on May 11, 2026, the model utilizes Qwen3-VL as a condition encoder (a component that processes input prompts for guidance) paired with a Multimodal Diffusion Transformer for joint condition-target modeling. This architectural integration, combined with large-scale data curation and a multi-stage training pipeline, allows the system to address challenges in ultra-long text rendering, multilingual typography, and complex visual composition.
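To make the joint condition-target idea concrete, the sketch below is a minimal, self-contained PyTorch illustration: prompt embeddings from a stand-in condition encoder are concatenated with noisy image-latent tokens and processed jointly by one transformer, which produces a denoising prediction only for the image tokens. All module names, shapes, and hyperparameters here are illustrative assumptions, not the actual Qwen-Image-2.0 implementation.

```python
import torch
import torch.nn as nn

class ToyMMDiT(nn.Module):
    """Illustrative joint condition-target transformer.

    Stand-in for a Multimodal Diffusion Transformer: text-condition
    tokens and noisy image-latent tokens attend to each other in a
    single sequence (dimensions are arbitrary, not Qwen-Image-2.0's).
    """
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.time_embed = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.out = nn.Linear(dim, dim)  # per-token denoising prediction

    def forward(self, cond_tokens, image_tokens, t):
        # t: (batch,) diffusion timestep, injected as one extra token
        t_tok = self.time_embed(t[:, None].float()).unsqueeze(1)
        # Joint modeling: timestep, condition, and target share one sequence
        seq = torch.cat([t_tok, cond_tokens, image_tokens], dim=1)
        seq = self.blocks(seq)
        n_img = image_tokens.shape[1]
        # Read out predictions only at the image-token positions
        return self.out(seq[:, -n_img:])

# Hypothetical shapes: a 1K-token prompt embedding standing in for the
# Qwen3-VL condition-encoder output, plus a 32x32 grid of latent tokens.
cond = torch.randn(2, 1024, 256)
latents = torch.randn(2, 1024, 256)
t = torch.randint(0, 1000, (2,))
pred = ToyMMDiT()(cond, latents, t)
print(pred.shape)  # torch.Size([2, 1024, 256])
```

The point of the shared sequence is that condition tokens and image tokens exchange information in every attention layer, rather than the prompt being injected only through cross-attention.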
The model is engineered to support prompts of up to 1K tokens, facilitating the creation of text-rich assets such as posters, infographics, and comics. It demonstrates improved multilingual text fidelity and more reliable instruction following across diverse styles. Furthermore, Qwen-Image-2.0 enhances photorealistic generation through richer details, realistic textures, and coherent lighting. According to human evaluations, the model significantly outperforms previous versions of Qwen-Image in both generation and editing tasks, marking progress toward more reliable and versatile image generation foundation models.
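For a sense of how a long, text-rich prompt might be used in practice, here is a hedged usage sketch in the style of a Hugging Face diffusers text-to-image pipeline. The model id `Qwen/Qwen-Image-2.0` is a hypothetical placeholder: the first-generation Qwen-Image ships as a diffusers pipeline, and whether 2.0 exposes the same interface is an assumption.

```python
# Hypothetical usage sketch: assumes Qwen-Image-2.0 is exposed as a
# standard diffusers text-to-image pipeline; the model id mirrors the
# first-generation Qwen-Image release and is not a confirmed repo name.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-2.0",  # placeholder id
    torch_dtype=torch.bfloat16,
).to("cuda")

# Long, text-rich prompts (up to ~1K tokens) are the advertised use case:
# layout, typography, and copy for a poster described in a single prompt.
prompt = (
    "A concert poster, headline 'MIDNIGHT ORBIT' in bold retro-futuristic "
    "lettering at the top, subheading 'Live at the Observatory, June 21' "
    "in a smaller serif face, a starfield background with a stylized "
    "telescope silhouette, and ticket details in the bottom-left corner."
)

image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("poster.png")
```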