AA-Briefcase Benchmark Analyzes AI Task Efficiency

• The AA-Briefcase benchmark measures model efficiency on long-horizon, realistic, multi-week knowledge-work projects.
• GPT-5.5 (xhigh) averages 11 minutes per task, roughly half of Claude Opus 4.8’s 23 minutes.
• GLM-5.2 is the top-performing open-weights model, with an Elo score of 1261.

Artificial Analysis released its new AA-Briefcase benchmark on June 24, 2026. The benchmark evaluates AI models on long-horizon, realistic knowledge-work projects such as financial modeling and presentation creation. A central metric it introduces is average time per task, derived by combining the tokens used during evaluation, each model's output speed, and tool execution duration. Claude Opus 4.8 emerged as a top-scoring model but required approximately 23 minutes per task. By comparison, GPT-5.5 (xhigh) was more efficient, averaging 11 minutes per task while holding a top-five position in the AA-Briefcase Elo rankings.

The analysis also places GLM-5.2 on the Pareto frontier, with an Elo score of 1261 and an average of 16.3 minutes per task; it is currently the leading open-weights model, ahead of MiniMax-M3 at 1113. For the discontinued Claude Fable 5, historical data suggests it would have required 28.5 minutes per task, based on a measured output speed of approximately 91 tokens per second and 139,000 output tokens per task. The study notes that tool execution is a relatively minor component, accounting for only about 12% of total time; most of the duration is driven by output verbosity, turn usage, and raw inference speed.
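
The Claude Fable 5 figure can be roughly sanity-checked with back-of-the-envelope arithmetic. The sketch below is not Artificial Analysis's actual formula, which the summary does not spell out. It assumes generation time is simply output tokens divided by output speed, and that tool execution makes up a fixed share of the total; the function names and the `tool_share` parameter are hypothetical.

```python
def generation_minutes(output_tokens: float, tokens_per_second: float) -> float:
    """Minutes spent generating output, assuming a constant output speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second / 60.0


def total_minutes(output_tokens: float, tokens_per_second: float,
                  tool_share: float = 0.12) -> float:
    """Estimated total minutes per task.

    Assumption: tool execution is `tool_share` of the TOTAL time, so
    generation fills the remaining (1 - tool_share).
    """
    if not 0.0 <= tool_share < 1.0:
        raise ValueError("tool_share must be in [0, 1)")
    return generation_minutes(output_tokens, tokens_per_second) / (1.0 - tool_share)


if __name__ == "__main__":
    # Figures quoted for Claude Fable 5: ~139,000 output tokens at ~91 tokens/s.
    gen = generation_minutes(139_000, 91)
    total = total_minutes(139_000, 91, tool_share=0.12)
    print(f"generation only: {gen:.1f} min")
    print(f"with tool share: {total:.1f} min")
```

Under these assumptions, generation alone comes to about 25.5 minutes and the total to about 28.9 minutes, in the same range as the reported 28.5. The small gap is unsurprising given rounded inputs and a simplified model that ignores details such as input processing and per-turn latency.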
