A benchmark that evaluates how accurately a model can recall specific information from long documents, measuring long-context retrieval and processing capability. The score is accuracy (%).
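As a rough illustration of how a long-context recall benchmark of this kind can be structured, here is a minimal needle-in-a-haystack-style sketch. The benchmark's actual documents, questions, and scoring procedure are not given here, so the filler text, the `build_document` helper, and the `toy_model` stub (a stand-in for a real LLM call) are all illustrative assumptions, not the benchmark's implementation.

```python
import random

# Illustrative sketch only: a "needle" fact is planted at a random depth in a
# long filler document, the model is asked to retrieve it, and the score is
# the percentage of trials answered correctly.

FILLER = "The sky was grey and the streets were quiet that morning. "

def build_document(needle: str, n_filler: int, depth: float) -> str:
    """Embed `needle` at a relative `depth` (0.0-1.0) inside filler text."""
    sentences = [FILLER] * n_filler
    sentences.insert(int(depth * n_filler), needle + " ")
    return "".join(sentences)

def toy_model(document: str, question: str) -> str:
    """Stand-in for a real LLM call: scans the document for the answer pattern."""
    marker = "The secret code is "
    i = document.find(marker)
    if i == -1:
        return "unknown"
    return document[i + len(marker):].split(".")[0]

def run_eval(trials: int = 20) -> float:
    """Return retrieval accuracy (%) over randomly placed needles."""
    correct = 0
    for _ in range(trials):
        answer = str(random.randint(1000, 9999))
        needle = f"The secret code is {answer}."
        doc = build_document(needle, n_filler=2000, depth=random.random())
        if toy_model(doc, "What is the secret code?") == answer:
            correct += 1
    return 100.0 * correct / trials

print(f"accuracy: {run_eval():.1f}%")
```

A real harness would replace `toy_model` with an API call to each model under test and vary document length and needle depth systematically; accuracy is then aggregated across those conditions.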
| # | Organization | Model |
|---|--------------|-------|
| 1 | Anthropic | Claude Opus 4.5 |
| 2 | OpenAI | GPT-5.4 |
| 3 | Google | Gemini 3.1 Pro |
| 4 | Anthropic | Claude Opus 4.6 |
| 5 | Anthropic | Claude Haiku 4.5 |
| 6 | Anthropic | Claude Opus 4.7 |
| 7 | Meta | Muse Spark |
| 8 | Alibaba | Qwen3.6 Plus |
| 9 | MiniMax | MiniMax M2.7 |
| 10 | Grok | Grok 4.1 Fast (Reasoning) |
| 11 | Anthropic | Claude Opus 4.1 |
| 12 | DeepSeek | DeepSeek V4 Pro |
| 13 | Google | Gemini 3 Flash |
| 14 | Google | Gemini 2.5 Pro |
| 15 | MiniMax | MiniMax M2.5 |
| 16 | Anthropic | Claude Sonnet 4.5 |
| 17 | Google | Gemini 3.1 Flash Lite |
| 18 | Moonshot AI | Kimi K2.5 |
| 19 | DeepSeek | DeepSeek V3.2 |
| 20 | Anthropic | Claude Sonnet 4 |
| 21 | OpenAI | GPT-5 |
| 22 | Z.ai | GLM-5 |
| 23 | DeepSeek | DeepSeek V4 Flash |
| 24 | Z.ai | GLM-5.1 |
| 25 | Google | Gemma 4 31B |
| 26 | Google | Gemini 2.5 Flash |
| 27 | OpenAI | GPT-5.4 Mini |
| 28 | OpenAI | GPT-4.1 |
| 29 | Xiaomi | MiMo-V2-Pro |
| 30 | NVIDIA | Nemotron 3 Super |
| 31 | Anthropic | Claude Sonnet 4.6 |
| 32 | Grok | Grok 4.20 (Reasoning) |
| 33 | Alibaba | Qwen3.5 397B A17B |
| 34 | OpenAI | GPT-5.4 Nano |
| 35 | Google | Gemini 2.5 Flash Lite |
| 36 | Meta | Llama 4 Maverick |
| 37 | Mistral AI | Mistral Small 4 |
| 38 | OpenAI | GPT OSS 120B |
| 39 | OpenAI | GPT-5 Nano |
| 40 | OpenAI | GPT-5 Mini |
| 41 | Anthropic | Claude Opus 4 |
| 42 | Arcee AI | Trinity Large Thinking |
| 43 | Meta | Llama 4 Scout |
| 44 | Meituan | Longcat Flash Chat |
| 45 | Grok | Grok 4.1 Fast |
| 46 | Amazon | Nova 2 Lite |
| 47 | Grok | Grok 4.20 |
| 48 | Baidu | ERNIE 5.0 Thinking |
| 49 | Baidu | ERNIE 4.5 300B A47B |