An agentic benchmark measuring tool-use and planning ability on multi-step tasks that simulate real user workflows. Score is task success rate (%).
| Rank | Developer | Model |
| --- | --- | --- |
| 1 | Z.ai | GLM-5 |
| 2 | Z.ai | GLM-5.1 |
| 3 | Alibaba | Qwen3.6 Plus |
| 4 | DeepSeek | DeepSeek V4 Pro |
| 5 | Moonshot AI | Kimi K2.5 |
| 6 | Google | Gemini 3.1 Pro |
| 7 | MiniMax | MiniMax M2.5 |
| 8 | DeepSeek | DeepSeek V4 Flash |
| 9 | Xiaomi | MiMo-V2-Pro |
| 10 | Grok | Grok 4.1 Fast (Reasoning) |
| 11 | Grok | Grok 4.20 (Reasoning) |
| 12 | Anthropic | Claude Opus 4.6 |
| 13 | Meta | Muse Spark |
| 14 | DeepSeek | DeepSeek V3.2 |
| 15 | Arcee AI | Trinity Large Thinking |
| 16 | Anthropic | Claude Opus 4.5 |
| 17 | Anthropic | Claude Opus 4.7 |
| 18 | OpenAI | GPT-5.4 |
| 19 | MiniMax | MiniMax M2.7 |
| 20 | Baidu | ERNIE 5.0 Thinking |
| 21 | Alibaba | Qwen3.5 397B A17B |
| 22 | Google | Gemini 3 Flash |
| 23 | Meituan | Longcat Flash Chat |
| 24 | Anthropic | Claude Sonnet 4.6 |
| 25 | Anthropic | Claude Sonnet 4.5 |
| 26 | Anthropic | Claude Opus 4 |
| 27 | LG AI Research | K-EXAONE |
| 28 | Anthropic | Claude Opus 4.1 |
| 29 | NVIDIA | Nemotron 3 Super |
| 30 | Anthropic | Claude Sonnet 4 |
| 31 | Grok | Grok 4.1 Fast |
| 32 | Amazon | Nova 2 Lite |
| 33 | Google | Gemma 4 31B |
| 34 | Grok | Grok 4.20 |
| 35 | Anthropic | Claude Haiku 4.5 |
| 36 | Google | Gemini 2.5 Pro |
| 37 | OpenAI | GPT-5.4 Nano |
| 38 | OpenAI | GPT-4.1 |
| 39 | OpenAI | GPT OSS 120B |
| 40 | Mistral AI | Mistral Small 4 |
| 41 | OpenAI | GPT-5.4 Mini |
| 42 | OpenAI | GPT-5 Mini |
| 43 | Google | Gemini 2.5 Flash |
| 44 | Google | Gemini 3.1 Flash Lite |
| 45 | Google | Gemini 2.5 Flash Lite |
| 46 | OpenAI | GPT-5 Nano |
| 47 | Meta | Llama 4 Maverick |
| 48 | Meta | Llama 4 Scout |
| 49 | Baidu | ERNIE 4.5 300B A47B |
| 50 | OpenAI | GPT-5 |
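As a minimal illustration of the metric only (not the benchmark's actual harness, whose task set and grading logic are not described here), a task counts as a success only when its full multi-step workflow completes, and the reported score is the percentage of tasks fully completed. The `TaskResult` type and task names below are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one multi-step task (hypothetical record type)."""
    task_id: str
    # True only if every required step (tool call, plan action) succeeded.
    succeeded: bool


def success_rate(results: list[TaskResult]) -> float:
    """Task success rate (%): share of tasks completed end-to-end."""
    if not results:
        return 0.0
    return 100.0 * sum(r.succeeded for r in results) / len(results)


# Illustrative run: 3 of 4 simulated user workflows completed.
results = [
    TaskResult("book-flight", True),
    TaskResult("file-expense", False),
    TaskResult("schedule-meeting", True),
    TaskResult("update-crm", True),
]
print(f"{success_rate(results):.1f}")  # prints 75.0
```

Note the all-or-nothing grading: partial progress through a workflow contributes nothing, which is what makes long multi-step tasks a harder target than single-turn tool calls.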