New Framework Improves Error Detection in Research Agents

• Researchers developed a span-level error localization framework to audit the reasoning trajectories of deep-research agents.
• The new TELBench benchmark provides 1,000 instances for evaluating error identification in agent trajectories.
• The DRIFT auditing framework improved first-error accuracy by up to 30 percentage points in experimental testing.

Researchers at NJU-LINK Lab have developed a framework for identifying the specific points at which the reasoning of deep-research agents goes wrong. Deep-research agents are systems that execute long-running tasks involving search, tool use, and evidence synthesis. Traditional evaluation judges an agent solely by its final answer; this study instead focuses on span-level error localization, pinpointing the exact segments of an agent's trajectory that lead to unreliable outcomes. The team compiled 2,790 real agent trajectories drawn from two agent frameworks, three backbone models, and three benchmarks. By converting raw logs into semantic segments and applying LLM-assisted expert review, they annotated harmful error spans and built TELBench, a 1,000-instance benchmark for testing error identification.
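To make span-level localization concrete, the following minimal sketch represents a trajectory as a list of semantic segments and scores a predicted set of error segments against annotated ones. The `Segment` schema, the F1 overlap score, and the first-error check are illustrative assumptions; the article does not describe TELBench's actual data format or metric definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One semantic segment of an agent trajectory (hypothetical schema)."""
    index: int
    kind: str   # e.g. "search", "reasoning", "answer"
    text: str


def span_f1(gold: set[int], predicted: set[int]) -> float:
    """F1 overlap between gold and predicted error-segment indices (assumed metric)."""
    if not gold and not predicted:
        return 1.0
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def first_error_hit(gold: set[int], predicted: set[int]) -> bool:
    """True when the earliest predicted error segment is the earliest gold one."""
    if not gold or not predicted:
        return not gold and not predicted
    return min(gold) == min(predicted)


# Toy trajectory: the agent misreads a search snippet in segment 1,
# and the wrong number carries through to the answer in segment 2.
trajectory = [
    Segment(0, "search", "query: population of city X"),
    Segment(1, "reasoning", "the snippet says 1.2M, so I will use 2.1M"),
    Segment(2, "answer", "City X has 2.1M residents"),
]
gold_errors = {1, 2}        # annotated harmful segments
predicted_errors = {1}      # what a hypothetical auditor flagged

print(span_f1(gold_errors, predicted_errors))
print(first_error_hit(gold_errors, predicted_errors))
```

Here the auditor finds the first error but misses its downstream effect, so it scores a first-error hit while its span-level overlap stays below 1. An answer-only evaluation would report nothing more than that the final answer was wrong.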

To address these reliability challenges, the researchers propose DRIFT, a claim-centric auditing framework. DRIFT tracks the claims an agent makes and verifies them against the evidence spans gathered during the trajectory, highlighting where unsupported or conflicting claims deviate from the answer path. Experimental results indicate that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points across various model families. The approach gives a process-level view of how agents operate, distinguishing between harmless noise, failed searches, and genuine logic errors. By recording which claims depend on which evidence, the framework builds a structured ledger that helps developers see where and why an agent fails before it produces a final, potentially incorrect conclusion.
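The idea of a claim ledger can be sketched in a few lines. This is not DRIFT's implementation, which the article does not detail: the `Claim` record, the `audit` function, the status labels, and the substring check standing in for a real verifier are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A claim made by the agent and the evidence segments it cites (hypothetical)."""
    segment: int                 # trajectory segment where the claim appears
    value: str                   # the asserted fact, simplified to a string
    evidence: list[int] = field(default_factory=list)


def audit(claims: list[Claim], evidence_text: dict[int, str]) -> list[tuple[int, str]]:
    """Return (segment, status) for each claim its cited evidence does not back.

    Support is approximated by substring matching, a stand-in for whatever
    verifier a real auditor would use (for example, an LLM judge).
    """
    flagged = []
    for claim in sorted(claims, key=lambda c: c.segment):
        if not claim.evidence:
            # The claim cites nothing gathered during the trajectory.
            flagged.append((claim.segment, "unsupported"))
            continue
        cited = [evidence_text.get(i, "") for i in claim.evidence]
        if not any(claim.value in text for text in cited):
            # Evidence is cited, but it does not contain the asserted value.
            flagged.append((claim.segment, "mismatch"))
    return flagged


evidence_text = {0: "City X population: 1.2M (2020 census)"}
claims = [
    Claim(segment=1, value="1.2M", evidence=[0]),   # backed by segment 0
    Claim(segment=2, value="2.1M", evidence=[0]),   # cited evidence disagrees
    Claim(segment=3, value="capital of region Y"),  # no evidence cited
]

flagged = audit(claims, evidence_text)
first_error_segment = flagged[0][0] if flagged else None
print(flagged)
print(first_error_segment)
```

Because the ledger is ordered by trajectory position, the earliest flagged claim serves as a candidate for the first error, and a failed search that no later claim depends on is never flagged.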
