NatureBench Evaluates AI Coding Agents on Scientific Discovery

• NatureBench introduces 90 scientific tasks from Nature-family journals to evaluate AI coding agents.
• The strongest tested model outperformed the reported state-of-the-art results on only 17.8% of tasks.
• Agents primarily succeed through methodological translation rather than original scientific discovery or innovation.

NatureBench is a new cross-disciplinary benchmark of 90 distinct scientific tasks derived from peer-reviewed Nature-family publications. Researchers developed the suite to evaluate whether AI coding agents (systems capable of writing and executing software) can perform scientific discovery instead of merely reproducing known results. The benchmark is built with NatureGym, an automated pipeline that turns each source paper into a standardized, containerized environment (a self-contained system for running software), which addresses the problem of environment fragmentation.

An evaluation of ten frontier agent configurations under a strict web-search-disabled protocol shows that the most capable model outperforms reported state-of-the-art results on only 17.8% of tasks under the g > 0.1 criterion. Further analysis indicates that agents succeed largely through methodological translation, which means recasting scientific tasks as familiar supervised prediction problems rather than producing genuine scientific innovation. Most failures stem from incorrect method choices or insufficient compute budgets rather than from a failure to understand the task itself. The researchers released the benchmark, the NatureGym pipeline, and a public leaderboard for verification.
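To make the headline number concrete, here is a minimal sketch of how a threshold-based success rate of this kind could be tallied. It rests on assumptions: this summary does not define g, so the code treats it as an opaque per-task score, and the function name beat_sota_rate and the example values are hypothetical placeholders, not benchmark data. The only grounded figures are the 90 tasks, the 0.1 threshold, and the 17.8% rate.

```python
from typing import Sequence


def beat_sota_rate(gaps: Sequence[float], threshold: float = 0.1) -> float:
    """Return the fraction of tasks whose score g strictly exceeds the threshold.

    `gaps` holds one hypothetical g value per task; how g is actually
    computed is not specified in this summary.
    """
    if not gaps:
        raise ValueError("gaps must be non-empty")
    wins = sum(1 for g in gaps if g > threshold)
    return wins / len(gaps)


# Invented placeholder scores: 16 tasks above the threshold, 74 at or below it.
example_gaps = [0.25] * 16 + [0.0] * 74

rate = beat_sota_rate(example_gaps)
print(f"{rate:.1%}")  # 17.8%
```

With 90 tasks, 16 is the only whole number of successes that rounds to 17.8% (15 gives 16.7% and 17 gives 18.9%), so the reported rate corresponds to 16 tasks if all 90 are counted in the denominator.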