AI Research Is Automating Its Own Development
- AI agents now demonstrate the ability to independently complete complex research tasks lasting up to 12 hours.
- Coding benchmarks like SWE-Bench are effectively saturated, signaling AI's competency in fully automated software engineering.
- Experts estimate a 60% probability of no-human-involved AI R&D emerging by the end of 2028.
We are approaching a historic pivot point where artificial intelligence research may begin to iterate on and improve itself without direct human intervention. This phenomenon—often described as recursive self-improvement—is no longer the exclusive domain of science fiction. Instead, it is emerging as a tangible outcome of recent breakthroughs in system reliability and coding autonomy. Aggregate trends across scientific benchmarks suggest that the foundational components for fully automated research are already being assembled.
The most striking evidence of this shift lies in coding competency. Consider SWE-Bench, a rigorous standard used to evaluate how well models solve real-world GitHub issues. Where previous systems struggled to reach even single-digit success rates, modern models have achieved near-total saturation of the benchmark. This implies that AI is no longer just assisting developers; it is effectively functioning as a software engineer capable of writing, testing, and debugging code independently.
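The "success rate" at the heart of this claim is straightforward: the fraction of benchmark issues whose generated patch passes the issue's test suite. As a minimal, purely illustrative sketch (not the actual SWE-Bench evaluation harness, and using hypothetical outcomes):

```python
# Illustrative sketch of a benchmark "resolve rate": the fraction of
# issues whose generated patch made the hidden tests pass. This is NOT
# the real SWE-Bench harness; the data below is hypothetical.

def resolve_rate(results: list[bool]) -> float:
    """results[i] is True if the model's patch resolved issue i."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Hypothetical outcomes for a 10-issue slice of a benchmark.
outcomes = [True, True, False, True, True, True, False, True, True, True]
print(f"resolve rate: {resolve_rate(outcomes):.0%}")  # 8 of 10 resolved
```

"Saturation" in this framing simply means the resolve rate approaches 100%, leaving little headroom for the benchmark to distinguish between systems.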
Beyond pure coding, we are witnessing a dramatic increase in the time horizons over which these models operate. Measurements from the METR (Model Evaluation and Threat Research) initiative indicate that the duration a model can remain reliable while working autonomously has grown from minutes to over 12 hours in just a few years. This capacity for sustained, independent work is essential for the unglamorous aspects of research: cleaning datasets, launching experiments, and verifying results.
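To see how steep that trend is, consider the implied doubling time under simple exponential growth. The figures below are assumptions chosen only to match the rough shape described above (minutes to roughly 12 hours over a few years), not METR's actual measurements:

```python
import math

# Hedged illustration: if the reliable autonomous-work horizon grew
# from ~10 minutes to ~12 hours over ~4 years (assumed figures, not
# METR's data), exponential growth end = start * 2**(years / T)
# implies a doubling time T as computed here.

def doubling_time(start_hours: float, end_hours: float, years: float) -> float:
    """Years per doubling, assuming exponential growth."""
    return years / math.log2(end_hours / start_hours)

start_hours = 10 / 60   # ~10 minutes, expressed in hours (assumption)
end_hours = 12.0        # ~12 hours (assumption)
print(f"{doubling_time(start_hours, end_hours, 4.0):.2f} years per doubling")
```

Under these assumed numbers, the horizon doubles well under once a year, which is what makes multi-hour autonomy arrive so quickly after multi-minute autonomy.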
Critically, this automation is extending into the very infrastructure of science. Recent experiments demonstrate models optimizing their own machine learning kernels—the foundational code that dictates hardware efficiency—and even performing automated alignment research. This involves AI agents autonomously identifying and solving safety problems, a task once thought to be exclusively human-centric. As these systems learn to manage other sub-agents, we see the early architecture of a self-sustaining research loop.
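The "agents managing sub-agents" pattern can be sketched as a parent orchestrator routing specialized tasks to workers. The structure and names below are hypothetical, illustrating the delegation loop rather than any specific lab's architecture:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an agent-orchestration loop: a parent agent
# dispatches research tasks to specialist sub-agents. Names and roles
# are illustrative assumptions, not a real system's API.

@dataclass
class SubAgent:
    name: str
    run: Callable[[str], str]  # takes a task description, returns a summary

class Orchestrator:
    """Parent agent that routes tasks to specialist sub-agents by kind."""
    def __init__(self, workers: dict[str, SubAgent]):
        self.workers = workers

    def dispatch(self, kind: str, task: str) -> str:
        worker = self.workers[kind]
        return f"[{worker.name}] {worker.run(task)}"

agents = {
    "code": SubAgent("coder", lambda t: f"patched: {t}"),
    "eval": SubAgent("evaluator", lambda t: f"verified: {t}"),
}
boss = Orchestrator(agents)
print(boss.dispatch("code", "optimize attention kernel"))
print(boss.dispatch("eval", "optimize attention kernel"))
```

The self-sustaining loop emerges when the evaluator's output feeds back into the coder's next task, closing the generate-test-refine cycle without a human in between.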
While AI models cannot yet generate the paradigm-shifting creative leaps that define true breakthroughs, they are exceptionally adept at the relentless, iterative experimentation that drives scientific progress. If scaling trends persist, the possibility of fully automated, no-human-involved R&D by 2028 appears increasingly plausible. We are effectively crossing a technological Rubicon, moving toward a future where the pace of discovery may be dictated by the speed of computation rather than the capacity of human researchers.