PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

PlanBench-XL is an interactive benchmark comprising 327 retail tasks and 1,665 tools, designed to assess whether large language model agents can iteratively retrieve, select, and invoke tools under retrieval‑limited visibility to achieve long‑horizon planning objectives.

Motivation and Research Gap

Large language model (LLM) agents are increasingly deployed in complex tool ecosystems, where real‑world tasks require discovering relevant tools, inferring implicit sub‑goals, and adapting to dynamic environments over extended planning horizons. Existing benchmarks predominantly evaluate short‑horizon tasks or assume fully observable tool access, and therefore fail to capture the challenges posed by retrieval‑limited visibility and sustained, multi‑step planning.

PlanBench-XL Design

The benchmark comprises 327 distinct retail tasks spanning a variety of consumer‑facing scenarios and a collection of 1,665 tools covering diverse capabilities such as product lookup, price comparison, order placement, and inventory checking. Each task is presented in an interactive setting in which agents must:

  • Identify and retrieve usable tools from a partially observable pool.
  • Determine appropriate sub‑goals that enable progressive tool invocation.
  • Execute tool calls iteratively, updating their internal state based on tool feedback.
  • Adapt to dynamic changes in the tool ecosystem or task environment.
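
The interaction pattern listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than part of PlanBench-XL: the `Tool`, `ToolPool`, `run_episode`, and `demo_pool` names, the keyword-overlap retriever, and the three toy tools are hypothetical, and the sub-goals are supplied as a fixed plan instead of being inferred by a model. Dynamic changes to the environment are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[dict], dict]  # takes the current state, returns updates


class ToolPool:
    """Hypothetical tool pool; the agent sees only what retrieve() returns."""

    def __init__(self, tools: List[Tool]) -> None:
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}

    def retrieve(self, query: str, k: int = 2) -> List[Tool]:
        # Toy keyword-overlap ranking, standing in for a real retriever.
        words = set(query.lower().split())
        scored = [
            (len(words & set(t.description.lower().split())), t.name)
            for t in self._tools.values()
        ]
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [self._tools[name] for score, name in scored[:k] if score > 0]


def run_episode(
    pool: ToolPool, plan: List[Tuple[str, str]], state: dict
) -> Tuple[dict, List[str]]:
    """Walk a fixed plan of (retrieval query, tool name) sub-goals."""
    visible: Dict[str, Tool] = {}  # tools the agent has discovered so far
    trace: List[str] = []          # names of tools actually invoked
    for query, tool_name in plan:
        if tool_name not in visible:
            # Expand the visible toolset through a retrieval action.
            for tool in pool.retrieve(query):
                visible[tool.name] = tool
        if tool_name not in visible:
            break  # retrieval did not surface the tool; stop the episode
        # Invoke the tool and fold its feedback into the agent's state.
        state.update(visible[tool_name].run(state))
        trace.append(tool_name)
    return state, trace


def demo_pool() -> ToolPool:
    # Three made-up retail tools; the real benchmark has 1,665.
    return ToolPool([
        Tool("lookup_product", "look up product by name",
             lambda s: {"product_id": "sku-1"}),
        Tool("check_inventory", "check inventory stock for product",
             lambda s: {"in_stock": True}),
        Tool("place_order", "place order for product",
             lambda s: {"order_placed": s.get("in_stock", False)}),
    ])
```

Calling `run_episode(demo_pool(), plan, {})` with a plan such as `[("look up product", "lookup_product"), ("check inventory stock", "check_inventory"), ("place order", "place_order")]` walks the sub-goals in order, issuing a retrieval action only when the needed tool is not yet visible.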

Evaluation Protocol

PlanBench-XL adopts a retrieval‑limited protocol: agents initially have access to a restricted subset of tools and must progressively expand their toolset through retrieval actions. Success is measured by whether the agent completes the full task sequence within a bounded number of tool invocations, which tests both planning depth and tool‑use efficiency.
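
One way to read the bounded-invocation success criterion is sketched below in Python. The scoring rule is an assumption, not the benchmark's published metric: it treats a task as solved when the required tool calls appear in order within the first `max_invocations` entries of the agent's trace, and it counts only tool invocations (not retrieval actions) against the budget. The names `EpisodeResult`, `score_episode`, and `success_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EpisodeResult:
    success: bool
    invocations_used: int


def score_episode(
    trace: Sequence[str], required: Sequence[str], max_invocations: int
) -> EpisodeResult:
    """Score one episode under an assumed in-order, budgeted criterion."""
    budgeted = list(trace[:max_invocations])  # calls beyond the budget are ignored
    remaining = iter(budgeted)
    # `step in remaining` advances the iterator, so this checks that
    # `required` is an ordered subsequence of the budgeted trace.
    success = all(step in remaining for step in required)
    return EpisodeResult(success=success, invocations_used=len(budgeted))


def success_rate(results: Sequence[EpisodeResult]) -> float:
    """Fraction of episodes that succeeded (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)
```

Under this reading, an agent that wastes calls on irrelevant tools can fail a task it would otherwise solve, which is how a single budget can probe both planning depth and tool‑use efficiency.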