
EnterpriseClawBench Benchmarks Agents Using Real Workplace Sessions

HuggingFace
Wednesday, June 24, 2026
  • Researchers launched EnterpriseClawBench, a benchmark for enterprise agents built from 852 tasks drawn from real workplace sessions.
  • The evaluation protocol reports artifact delivery, cost, and skill-transfer behavior rather than a single performance score.
  • The top-performing configuration, Codex with GPT-5.5, scored 0.663 on the benchmark.

A research team led by Jincheng Zhong and Kaiyan Zhang introduced EnterpriseClawBench, a benchmark that evaluates enterprise AI agents on 852 reproducible tasks derived from real-world workplace sessions. Unlike existing benchmarks that rely on synthetic environments, it draws on proprietary logs of agents reading heterogeneous files, invoking tools, and generating business artifacts. The protocol compares harness-model combinations on artifact delivery success, visual quality, operational cost, runtime, and skill-transfer behavior.

Because the original workplace data is sensitive, the team did not release the raw dataset. Instead, they published the construction and evaluation methodology so that organizations can apply the protocol to their own private sessions. In performance testing, the top-performing configuration, Codex combined with GPT-5.5, scored only 0.663. The authors emphasize that enterprise performance cannot be reduced to a single metric and argue that multi-faceted evaluation is essential for understanding how agents function in complex business environments.
