EnterpriseClawBench Benchmarks Agents Using Real Workplace Sessions

• Researchers launched EnterpriseClawBench, a benchmark for enterprise agents built from 852 tasks drawn from real workplace sessions.
• The evaluation protocol prioritizes metrics such as artifact delivery, cost, and skill transfer over a single performance score.
• The top-performing configuration, a combination of Codex and GPT-5.5, scored 0.663 on the benchmark.

A research team led by Jincheng Zhong and Kaiyan Zhang introduced EnterpriseClawBench, a benchmark that evaluates enterprise AI agents on 852 reproducible tasks derived from real-world workplace sessions. Unlike benchmarks that rely on synthetic environments, this framework draws on proprietary logs of agents reading heterogeneous files, invoking tools, and generating business artifacts. The protocol compares harness-model combinations on artifact delivery success, visual quality, operational cost, runtime, and skill-transfer behavior.

Because the original workplace data is sensitive, the team did not release the raw dataset. Instead, they published the construction and evaluation methodology so that organizations can apply the protocol to their own private sessions. In performance testing, the top-performing configuration, a combination of Codex and GPT-5.5, scored only 0.663. The authors emphasize that enterprise performance cannot be reduced to a single metric and argue that multi-faceted evaluation is essential for understanding how agents function in complex business environments.
