TAU2

About This Benchmark

TAU2 is an agentic benchmark that measures tool-use and planning ability on multi-step tasks that simulate real user workflows. The score is the task success rate: the percentage of tasks the agent completes successfully.
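
As a rough illustration of how a task success rate can be computed, the sketch below assumes each task yields a single pass/fail outcome. The function name `success_rate` and the sample outcomes are hypothetical and are not taken from the benchmark's own tooling or results.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks marked successful.

    Assumption: each entry is one task's pass/fail result.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for five tasks (True = task completed successfully).
example_outcomes = [True, False, True, True, False]
print(f"{success_rate(example_outcomes):.1f}%")  # 3 of 5 tasks -> 60.0%
```

Under this assumption every task carries equal weight, so a score of 60% means the agent completed three out of every five tasks.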

Source: Artificial Analysis