AI Agents Struggle with End-to-End Scientific Research Benchmarks

• Yale researchers introduce ResearchGym to test AI agents on real-world scientific research tasks.
• GPT-5 and other frontier models show a major reliability gap, beating human baselines only 6.7% of the time.
• Long-horizon failures like poor resource management and context limits hinder autonomous research capabilities.

While AI agents are increasingly capable of writing code and answering queries, their ability to conduct autonomous scientific research remains inconsistent. Researchers from Yale University have introduced ResearchGym, a new benchmark designed to evaluate these agents on the complex, multi-step process of AI research. Built from repurposed papers from top conferences like ICML and ICLR, the environment challenges agents to propose hypotheses, run experiments, and attempt to beat human-established baselines. The result is a closed-loop setting in which the agent must handle everything from initial ideation to final implementation.

The results reveal a stark capability-reliability gap in today's most advanced models. Even when powered by frontier systems like GPT-5 and Claude Code, these agents struggled to maintain performance over long-running tasks. In testing, GPT-5 surpassed the original paper's baseline in only 6.7% of evaluations. It did achieve one standout success by outperforming the original solution on a 2025 spotlight task, but such instances were the exception rather than the rule, highlighting how unpredictable current agentic systems are when held to the standard of published academic research.

The study identifies several long-horizon failure modes that prevent AI agents from acting as truly autonomous researchers. These include overconfidence in weak ideas, difficulty managing parallel experiments, and hard limits imposed by the model's memory capacity or context length. By providing the infrastructure to track how models handle these hurdles, ResearchGym aims to help developers bridge the gap between occasional brilliance and the steady reliability required to accelerate scientific discovery.
