TerminalBench Hard

About This Benchmark

A hard-difficulty benchmark that evaluates AI agents' ability to execute complex shell commands, perform file operations, and complete system tasks in a real terminal environment. The score is the success rate (%), i.e. the percentage of tasks completed successfully.
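
A minimal sketch of how a success-rate score of this kind could be computed, assuming each task produces a single pass/fail outcome. The function name `success_rate` and the list-of-booleans input are hypothetical and not taken from the benchmark's own harness:

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks that passed.

    `outcomes` holds one boolean per task (True = task succeeded).
    This input format is an assumption made for illustration.
    """
    if not outcomes:
        # A rate is undefined without any attempted tasks.
        raise ValueError("at least one task outcome is required")
    passed = sum(1 for ok in outcomes if ok)
    return 100.0 * passed / len(outcomes)


if __name__ == "__main__":
    # Four hypothetical tasks, three of which succeeded.
    print(success_rate([True, False, True, True]))
```

Under this assumption every task counts equally; a harness that weights tasks or awards partial credit would need a different formula.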

Source: Artificial Analysis