
InterleaveThinker Enables Sequential Text-Image Generation

HuggingFace
Saturday, June 13, 2026
  • InterleaveThinker enables interleaved text-image generation through a new multi-agent pipeline built around a planner and a critic.
  • The system is trained on specialized datasets, including Interleave-Planner-SFT-80k for supervised fine-tuning and Interleave-Critic-RL-13k for reinforcement learning.
  • InterleaveThinker achieves performance comparable to GPT-5 and Nano Banana on established visual and reasoning benchmarks.

On June 11, 2026, Dian Zheng and colleagues released InterleaveThinker, a multi-agent framework that adds interleaved text-image sequence generation to existing image generators. Current models struggle with sequential visual narratives; this pipeline uses a planner agent to organize the input sequence and a critic agent to refine outputs based on how well they adhere to the instructions.
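The planner-and-critic flow described above can be pictured with a minimal Python sketch. It is not the authors' implementation: the `Step` type, the `run_interleaved` function, the retry limit, and the three toy agents are all hypothetical names introduced here, and the real agents are models rather than simple functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    kind: str     # "text" or "image" (hypothetical step types)
    content: str  # text body, or the prompt handed to the image generator


# Hypothetical agent interfaces, passed in as plain callables.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]              # prompt -> image handle
Critic = Callable[[str, str], Optional[str]]  # (prompt, image) -> revised prompt, or None to accept


def run_interleaved(request: str, planner: Planner, generator: Generator,
                    critic: Critic, max_retries: int = 2) -> List[Tuple[str, str]]:
    """Plan a text/image sequence, then let the critic refine each image."""
    output: List[Tuple[str, str]] = []
    for step in planner(request):
        if step.kind == "text":
            output.append(("text", step.content))
            continue
        prompt = step.content
        image = generator(prompt)
        for _ in range(max_retries):
            revised = critic(prompt, image)
            if revised is None:  # critic accepts the image
                break
            prompt = revised     # critic asks for a corrected instruction
            image = generator(prompt)
        output.append(("image", image))
    return output


# Toy stand-ins so the sketch runs without any model (assumptions, not real agents).
def toy_planner(request: str) -> List[Step]:
    return [Step("text", f"Story: {request}"), Step("image", "a cat"), Step("text", "The end.")]


def toy_generator(prompt: str) -> str:
    return f"<image:{prompt}>"


def toy_critic(prompt: str, image: str) -> Optional[str]:
    return None if "red" in prompt else prompt + ", red hat"


result = run_interleaved("cat story", toy_planner, toy_generator, toy_critic)
```

In this toy run the critic rejects the first image once, so the generator is called twice for the single image step. That is the kind of repeated generator call that makes long trajectories expensive.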

To build the framework, the team created the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets for initial supervised fine-tuning. They then developed the Interleave-Critic-RL-13k dataset to reinforce instruction correction using GRPO (Group Relative Policy Optimization, a reinforcement learning algorithm that scores each sampled output relative to the others in its group). Because a trajectory can exceed 25 generator calls, the authors implemented accuracy and step-wise rewards to make single-step reinforcement learning efficient.
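As a rough illustration of the reward setup, the sketch below shows the group-relative advantage at the core of GRPO: each sampled output's reward is normalized against the mean and standard deviation of its group. How the authors actually combine the accuracy and step-wise rewards is not stated here, so `combined_reward` and its `step_weight` parameter are hypothetical, as is the choice of population standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, step_score: float, step_weight: float = 0.5) -> float:
    """Hypothetical weighted sum of an accuracy reward and a step-wise reward."""
    return accuracy + step_weight * step_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic outputs for the same single step (made-up scores).
rewards = [
    combined_reward(1.0, 1.0),
    combined_reward(0.0, 1.0),
    combined_reward(1.0, 0.0),
    combined_reward(0.0, 0.0),
]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive a positive advantage and those below it a negative one. No separate value model is needed, which is one reason GRPO suits settings where each rollout is expensive.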

InterleaveThinker performs comparably to Nano Banana and GPT-5 on visual benchmarks. It also improves the base model's reasoning capabilities, as shown by substantial gains on the WISE and RISE benchmarks with the 4-step FLUX.2-klein model. The research is hosted on GitHub, with project documentation available on the InterleaveThinker page.

Tags: interleavethinker, multimodal, grpo, image generation, reinforcement learning