New AI Orchestrator Dynamically Builds Teams of Experts
- Sakana AI introduces 'Conductor,' a model that delegates complex tasks to specialized AI teams
- System outperforms individual models on LiveCodeBench and GPQA-Diamond with superior cost-efficiency
- Recursive self-correction allows agents to review their own work and fix errors autonomously
For years, the gold standard for getting the best results out of Large Language Models (LLMs) was prompt engineering—the meticulous art of crafting instructions to guide AI behavior. However, this manual process is hitting a ceiling as tasks become more complex and multi-layered. Researchers at Sakana AI have proposed a new paradigm: rather than training an AI to solve a specific problem, why not train it to act as a manager? Their new work, accepted at ICLR 2026, introduces the Conductor, a model designed to orchestrate entire workflows by delegating subtasks to a diverse pool of expert AI agents.
The Conductor functions much like a human project manager. When presented with a prompt, it does not immediately attempt a direct answer. Instead, it analyzes the complexity of the request and decides which models are best suited for the job. A simple factual query might go to a single model; a demanding coding problem prompts it to autonomously spin up a pipeline of planners, coders, and verifiers. This dynamic adaptation is the core breakthrough: the system assembles a custom, specialized team on the fly, tailored to the specific needs of each request.
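To make the routing idea concrete, here is a minimal Python sketch of complexity-based delegation. Everything in it is illustrative: the `call_model` helper, the model names, and the keyword heuristic are stand-ins for the learned routing the paper describes, not Sakana AI's actual implementation.

```python
from dataclasses import dataclass

def call_model(model: str, prompt: str) -> str:
    # Stub standing in for a real LLM API call.
    return f"[{model}] response to: {prompt[:48]}"

@dataclass
class Step:
    role: str   # e.g. "planner", "coder", "verifier"
    model: str  # expert model assigned to that role

def conduct(task: str) -> str:
    """Route a task to one model or a multi-step expert pipeline."""
    # Toy heuristic; the actual Conductor learns this routing decision.
    is_complex = any(kw in task.lower() for kw in ("implement", "debug", "prove"))

    if not is_complex:
        # Simple factual query: a single generalist model is enough.
        return call_model("generalist", task)

    # Demanding problem: assemble a planner -> coder -> verifier team.
    pipeline = [
        Step("planner", "planning-expert"),
        Step("coder", "coding-expert"),
        Step("verifier", "verification-expert"),
    ]
    context = task
    for step in pipeline:
        # Each expert works on the accumulated context from earlier steps.
        context = call_model(step.model, f"Act as {step.role}: {context}")
    return context

print(conduct("When was the transistor invented?"))
print(conduct("Implement a thread-safe LRU cache."))
```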
Perhaps the most fascinating capability is what the team calls recursive test-time scaling. Because the Conductor can include itself in its own workflows, the system can review its previous output, identify logical gaps or failures, and immediately launch a corrective pass. This ability to 'self-reflect' and improve during execution changes where inference compute goes: instead of simply scaling up the size of a model, the approach scales the intelligence of the workflow itself, achieving high-performance results at a fraction of the cost of traditional, rigid multi-agent systems.
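A minimal sketch of such a self-correction loop, again with hypothetical helpers: `generate` and `critique` stand in for the orchestrated pipeline and a self-review call, and the depth cap is one plausible way to keep recursive re-invocation from consuming unbounded compute, not a detail confirmed by the paper.

```python
def generate(task: str) -> str:
    # Stub for the full orchestrated pipeline from the previous sketch.
    return f"draft answer for: {task}"

def critique(answer: str) -> str | None:
    # Stub for a self-review pass: return a flaw description, or None if
    # the answer passes inspection. Here the check is trivially simulated.
    return None if "revised" in answer else "logical gap in step 2"

def solve_with_self_correction(task: str, max_depth: int = 3) -> str:
    """Re-enter the workflow until self-review passes or the budget runs out."""
    answer = generate(task)
    for _ in range(max_depth):
        flaw = critique(answer)
        if flaw is None:
            return answer  # self-review passed; stop spending compute
        # Corrective pass: re-run the workflow with the flaw as added context.
        answer = generate(f"{task} (revised; previous flaw: {flaw})")
    return answer

print(solve_with_self_correction("Plan a data migration with zero downtime."))
```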
The numbers support this shift in strategy, with the 7B Conductor model setting new records on industry-standard benchmarks like LiveCodeBench and GPQA-Diamond. By acting as a meta-prompt engineer that harnesses the collective intelligence of frontier models, the Conductor suggests where AI development is headed: away from monolithic models that try to do everything alone, and toward intelligent, flexible, decentralized systems that work collaboratively to solve the world's most difficult problems.