PlanBench-XL Evaluates LLM Agent Planning in Complex Environments

• PlanBench-XL evaluates LLM agent planning capabilities across 327 retail tasks and 1,665 tools.
• GPT-5.4's accuracy fell from 51.90% to 11.36% when challenged with simulated environmental disruptions.
• Benchmark analysis confirms agents struggle to recover when tool failures lack clear error signals.

On June 21, researchers from the University of Illinois at Urbana-Champaign introduced PlanBench-XL, a new interactive benchmark designed to evaluate how well LLM agents navigate large-scale, complex tool environments. The benchmark comprises 327 retail tasks involving a total of 1,665 tools, challenging models to iteratively retrieve, invoke, and chain tool functions to achieve final objectives. PlanBench-XL includes a blocking mechanism that simulates real-world unpredictability by introducing failing, missing, or distracting tools, forcing agents to detect path disruptions and adapt their strategies at runtime.
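
To make the blocking idea concrete, here is a minimal Python sketch of a toy tool environment. It is not PlanBench-XL's actual code or API: the class `ToyToolEnv`, the helpers `run_path` and `solve`, and the four retail tool names are all hypothetical, and the three block types (explicit failure, silent failure, missing tool) are a simplified reading of the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


class ToolError(Exception):
    """Explicit failure signal raised by a blocked or unknown tool."""


@dataclass
class ToyToolEnv:
    """Hypothetical toy environment; not the benchmark's real interface."""
    tools: Dict[str, Callable[[dict], dict]]
    failing: Set[str] = field(default_factory=set)  # raise an explicit error
    silent: Set[str] = field(default_factory=set)   # return nothing, no error
    missing: Set[str] = field(default_factory=set)  # hidden from retrieval

    def retrieve(self, names: List[str]) -> List[str]:
        # Only tools that exist and are not blocked as "missing" are visible.
        return [n for n in names if n in self.tools and n not in self.missing]

    def invoke(self, name: str, state: dict) -> dict:
        if name not in self.tools or name in self.missing:
            raise ToolError(f"{name}: tool not found")
        if name in self.failing:
            raise ToolError(f"{name}: tool failed")
        if name in self.silent:
            return {}  # failure with no explicit error signal
        return self.tools[name](state)


def run_path(env: ToyToolEnv, path: List[str], state: dict) -> Optional[dict]:
    """Chain tools along one path; return None if the path is disrupted."""
    state = dict(state)
    for name in path:
        if not env.retrieve([name]):
            return None  # tool is not visible to the agent
        try:
            out = env.invoke(name, state)
        except ToolError:
            return None  # explicit failure
        if not out:
            return None  # silent failure, caught only by checking the output
        state.update(out)
    return state


def solve(env: ToyToolEnv, paths: List[List[str]],
          state: dict) -> Tuple[Optional[List[str]], Optional[dict]]:
    """Try candidate paths in order and fall back when one is blocked."""
    for path in paths:
        result = run_path(env, path, state)
        if result is not None:
            return path, result
    return None, None


# Hypothetical retail tools: a short refund path and a longer alternative.
TOOLS = {
    "find_order": lambda s: {"order_id": "A1"},
    "quick_refund": lambda s: {"refund": 20},
    "get_items": lambda s: {"items": ["mug"]},
    "price_items": lambda s: {"refund": 20},
}
PATHS = [
    ["find_order", "quick_refund"],
    ["find_order", "get_items", "price_items"],
]

env = ToyToolEnv(tools=TOOLS, silent={"quick_refund"})
chosen_path, final_state = solve(env, PATHS, {})
print(chosen_path, final_state)
```

In this sketch the short path fails silently at `quick_refund`, so the solver only recovers because it inspects the empty output and then falls back to the longer three-tool path. An agent that trusted the empty result would stop with no refund computed.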

Experimental results across ten leading models reveal significant vulnerabilities in current agentic planning. GPT-5.4 achieved a 51.90% accuracy rate in block-free scenarios, but its performance dropped sharply to 11.36% when subjected to severe blocking conditions. Analysis indicates that agents struggle most when failures provide no explicit error signals or when they must identify alternative, longer tool-use paths to recover from an obstacle. These findings highlight a critical gap in the ability of current models to manage long-horizon planning tasks in imperfect, large-scale ecosystems where tool visibility is limited.
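
For a sense of scale, the reported GPT-5.4 figures can be restated as an absolute and a relative drop. The short calculation below uses only the two accuracy values quoted above; the variable names are illustrative.

```python
# Reported GPT-5.4 accuracy (percent) from the article.
block_free_acc = 51.90
severe_block_acc = 11.36

# Absolute drop in percentage points.
abs_drop = block_free_acc - severe_block_acc

# Relative drop as a share of the block-free accuracy.
rel_drop = abs_drop / block_free_acc * 100

print(f"absolute drop: {abs_drop:.2f} percentage points")
print(f"relative drop: {rel_drop:.1f}% of block-free accuracy")
```

That works out to a drop of about 40.54 percentage points, or roughly 78% of the model's block-free accuracy.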