Qwen-AgentWorld Language Models for Agent Simulation

• Researchers launched Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B for agentic environment simulation.
• The models use 10M interaction trajectories and a three-stage training pipeline involving CPT, SFT, and RL.
• The new AgentWorldBench benchmark shows Qwen-AgentWorld outperforms existing frontier models across 9 established tasks.

Researchers introduced Qwen-AgentWorld, a framework of language-based world models designed to improve agentic environment simulation and downstream reasoning. The team released two versions: Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B. Both simulate agentic environments across 7 distinct domains using long chain-of-thought reasoning (reasoning steps generated before answering). Training draws on more than 10M environment interaction trajectories and proceeds in three stages: CPT (continual pre-training) for general capabilities, SFT (supervised fine-tuning) for next-state prediction, and RL (reinforcement learning) to refine simulation fidelity via hybrid rubric-and-rule rewards.
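The hybrid rubric-and-rule reward used in the RL stage can be illustrated with a minimal sketch. The source does not say how the two signals are computed or combined, so everything here is an assumption made for illustration: the names `rule_reward`, `rubric_reward`, `hybrid_reward`, and `RewardWeights` are hypothetical, the rule is a simple JSON equality check on the predicted next state, the rubric scores are assumed to come from an external grader, and the weighted sum is one plausible way to merge them.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights; the source gives no values."""
    rule: float = 0.5
    rubric: float = 0.5


def rule_reward(predicted: str, reference: str) -> float:
    """Rule-based signal (assumed): 1.0 if both strings parse as JSON
    and decode to equal objects, otherwise 0.0."""
    try:
        return 1.0 if json.loads(predicted) == json.loads(reference) else 0.0
    except json.JSONDecodeError:
        return 0.0


def rubric_reward(rubric_scores: dict[str, float]) -> float:
    """Rubric signal (assumed): mean of per-criterion scores,
    each clamped to the range [0, 1]."""
    if not rubric_scores:
        return 0.0
    clamped = [min(1.0, max(0.0, score)) for score in rubric_scores.values()]
    return sum(clamped) / len(clamped)


def hybrid_reward(
    predicted: str,
    reference: str,
    rubric_scores: dict[str, float],
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the rule-based and rubric-based signals."""
    return (
        weights.rule * rule_reward(predicted, reference)
        + weights.rubric * rubric_reward(rubric_scores)
    )


if __name__ == "__main__":
    # Hypothetical predicted and reference next states for one transition.
    reward = hybrid_reward(
        '{"cwd": "/tmp", "exit_code": 0}',
        '{"exit_code": 0, "cwd": "/tmp"}',
        {"format": 1.0, "plausibility": 0.5},
    )
    print(reward)
```

In this sketch the rule check rewards exact agreement on structured state, while the rubric scores leave room for graded judgments about properties that a fixed rule cannot capture.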

To measure progress, the team released AgentWorldBench, a benchmark built from the interactions of 5 frontier models across 9 established tasks. Qwen-AgentWorld significantly outperformed existing models on this benchmark. Beyond general simulation, the model can serve as a decoupled simulator for scalable agentic reinforcement learning, and it provides an effective warm-up phase that improves downstream performance across 7 agentic benchmarks.
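To make the idea of a decoupled simulator concrete, the sketch below shows one way a language-based world model could stand in for a real environment during agent training. It is an assumption-laden illustration, not the Qwen-AgentWorld interface: `SimulatedEnvironment`, `Predictor`, and `echo_predictor` are hypothetical names, and the stub predictor simply formats a string where a real setup would call the world model to predict the next state.

```python
from typing import Callable, List, Tuple

# A predictor maps (history of (action, observation) pairs, new action)
# to the predicted next observation. In practice this would wrap a call
# to a world model; the signature here is an assumption.
Predictor = Callable[[List[Tuple[str, str]], str], str]


class SimulatedEnvironment:
    """Environment whose transitions come from a next-state predictor
    rather than from a real tool, shell, or website."""

    def __init__(self, predict_next_state: Predictor, initial_observation: str):
        self._predict = predict_next_state
        self._initial_observation = initial_observation
        self.history: List[Tuple[str, str]] = []

    def reset(self) -> str:
        """Clear the interaction history and return the first observation."""
        self.history = []
        return self._initial_observation

    def step(self, action: str) -> str:
        """Ask the predictor for the next observation and record the turn."""
        observation = self._predict(self.history, action)
        self.history.append((action, observation))
        return observation


def echo_predictor(history: List[Tuple[str, str]], action: str) -> str:
    """Stub standing in for a world model: reports the step index and action."""
    return f"step {len(history) + 1}: executed {action!r}"


if __name__ == "__main__":
    env = SimulatedEnvironment(echo_predictor, initial_observation="ready")
    print(env.reset())
    print(env.step("ls"))
    print(env.step("pwd"))
```

Because the agent only sees `reset` and `step`, the predictor behind them can be swapped without changing the training loop, which is the sense in which the simulator is decoupled.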
