InterleaveThinker Enables Sequential Text-Image Generation
- InterleaveThinker enables interleaved text-image generation via a new multi-agent planning and critic pipeline.
- The system is trained on purpose-built datasets, including Interleave-Planner-SFT-80k for supervised fine-tuning and Interleave-Critic-RL-13k for reinforcement learning.
- InterleaveThinker achieves performance comparable to GPT-5 and Nano Banana on established visual and reasoning benchmarks.
On June 11, 2026, Dian Zheng and colleagues released InterleaveThinker, a multi-agent framework that adds interleaved text-image sequence generation to existing image generators. Current models struggle with sequential visual narratives, so the pipeline uses a planner agent to organize input sequences and a critic agent to refine outputs according to how well they follow the instructions.
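The planner-and-critic division of labor can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `Step` type, the retry limit, and the toy agents are all hypothetical stand-ins, and the sketch assumes the critic works by returning a corrected instruction (or nothing when the output is acceptable).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    instruction: str  # final text instruction used for this step
    image: str        # placeholder for the generated image (a string here)


def run_pipeline(
    request: str,
    planner: Callable[[str], List[str]],
    generator: Callable[[str], str],
    critic: Callable[[str, str], Optional[str]],
    max_retries: int = 2,  # hypothetical cap on critic rounds per step
) -> List[Step]:
    """Plan a sequence of instructions, then generate and refine each step."""
    steps: List[Step] = []
    for instruction in planner(request):
        image = generator(instruction)
        for _ in range(max_retries):
            # Assumption: the critic returns a corrected instruction,
            # or None when the image already follows the instruction.
            revised = critic(instruction, image)
            if revised is None:
                break
            instruction = revised
            image = generator(instruction)
        steps.append(Step(instruction, image))
    return steps


# Toy stand-ins so the sketch runs without any model.
def toy_planner(request: str) -> List[str]:
    return [part.strip() for part in request.split(";") if part.strip()]


def toy_generator(instruction: str) -> str:
    return f"<image of {instruction}>"


def toy_critic(instruction: str, image: str) -> Optional[str]:
    # Pretend every instruction needs the word "detailed" exactly once.
    return None if "detailed" in instruction else instruction + ", detailed"


steps = run_pipeline("draw a seed; draw a sprout",
                     toy_planner, toy_generator, toy_critic)
for step in steps:
    print(step.instruction, "->", step.image)
```

In a real system each callable would wrap a model, and a single request could trigger many generator calls, since every critic correction costs one more generation.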
To build the framework, the team created the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets for initial supervised fine-tuning. They then developed the Interleave-Critic-RL-13k dataset to reinforce the critic's instruction correction using GRPO (Group Relative Policy Optimization, a reinforcement learning algorithm that scores each sampled output relative to the others in its group). Because a single trajectory can exceed 25 generator calls, the authors designed accuracy and step-wise rewards that enable efficient single-step reinforcement learning.
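The group-relative scoring at the core of GRPO can be sketched in a few lines of Python. The normalization below (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage; the way the accuracy and step-wise rewards are combined, the `step_weight` value, and the sample numbers are assumptions made for illustration, since the article does not give the authors' reward formula.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, step_score: float,
                    step_weight: float = 0.5) -> float:
    """Hypothetical mix of an accuracy reward and a step-wise reward."""
    return accuracy + step_weight * step_score


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the group it was sampled with."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic outputs for the same single step:
# (accuracy reward, step-wise reward), values invented for the example.
samples = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rewards = [combined_reward(acc, step) for acc, step in samples]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Outputs that beat the group average receive a positive advantage and are reinforced, while below-average ones are discouraged. When every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.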
InterleaveThinker performs comparably to Nano Banana and GPT-5 on visual benchmarks. The system also improves the reasoning capabilities of its base model, as shown by substantial gains on the WISE and RISE benchmarks when using the 4-step FLUX.2-klein model. The project is hosted on GitHub, and documentation is available on the InterleaveThinker project page.