CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Researchers have introduced CoffeeBench, a benchmark for evaluating Large Language Model (LLM) agents that operate in complex, heterogeneous multi-agent economies over extended time horizons.

The Challenge of Economic Evaluation in LLM Agents

As LLMs evolve into autonomous agents capable of executing long-horizon tasks, there is a growing need for rigorous evaluation frameworks that mirror real-world complexity. Most existing benchmarks focus on a single agent acting in a passive environment, so they fail to capture the dynamics of economic systems. Economic environments are inherently multi-agent: autonomous entities must communicate continuously, negotiate strategically, and exchange goods or payments to achieve their objectives over prolonged periods.

Introducing CoffeeBench

CoffeeBench addresses this gap by providing a specialized environment for testing how LLM agents navigate heterogeneous economies. Unlike static tests, CoffeeBench emphasizes the ability of agents to pursue self-defined goals while interacting with other autonomous agents. Doing so requires high-level reasoning, long-term planning, and the ability to manage resources and relationships within an ecosystem that is both competitive and collaborative.

Key Evaluation Dimensions

The benchmark focuses on several critical capabilities essential for deployment in economic simulations:

  • Long-Horizon Planning: The ability to maintain goal consistency and execute multi-step strategies over extended operational windows.
  • Multi-Agent Coordination: The capacity to communicate and negotiate effectively with other agents to facilitate transactions.
  • Heterogeneity: The way agents with different objectives and constraints interact within a shared economic framework.
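
To make these three dimensions concrete, the following is a minimal toy sketch of a two-agent market, not CoffeeBench's actual environment or API, which this summary does not describe. Every name in it (Agent, quote, run_episode, the prices and the concession rule) is a hypothetical chosen for illustration, and a simple scripted policy stands in for the LLM. It assumes a seller and a buyer with different private price limits (heterogeneity) who exchange quotes each step (coordination) and concede gradually as a deadline approaches (long-horizon planning).

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical market participant; a scripted policy stands in for an LLM."""
    name: str
    role: str           # "seller" or "buyer": the two heterogeneous roles
    cash: float
    stock: int
    limit_price: float  # seller: lowest acceptable price; buyer: highest

    def quote(self, step: int, horizon: int) -> float:
        """Concede linearly from an opening quote toward the private limit."""
        progress = step / max(horizon - 1, 1)
        factor = 1.5 if self.role == "seller" else 0.5
        opening = self.limit_price * factor
        return opening + (self.limit_price - opening) * progress


def run_episode(seller: Agent, buyer: Agent, horizon: int) -> list:
    """Negotiate one unit per step; trade at the midpoint when bid meets ask."""
    trades = []
    for step in range(horizon):
        ask = seller.quote(step, horizon)
        bid = buyer.quote(step, horizon)
        price = (ask + bid) / 2
        if bid >= ask and seller.stock > 0 and buyer.cash >= price:
            seller.stock -= 1
            buyer.stock += 1
            seller.cash += price
            buyer.cash -= price
            trades.append((step, price))
    return trades


# Assumed example values: a farmer who will not sell below 4.0 and a cafe
# that will not pay above 6.0, negotiating over five steps.
farmer = Agent("farmer", "seller", cash=0.0, stock=10, limit_price=4.0)
cafe = Agent("cafe", "buyer", cash=20.0, stock=0, limit_price=6.0)
trades = run_episode(farmer, cafe, horizon=5)
```

In this sketch no trade happens in the early steps because the opening quotes are far apart; deals only close once both sides have conceded enough. A benchmark in this spirit could score each agent on outcomes accumulated over the whole episode, such as final cash or the margin between trade prices and its private limit, rather than on any single step.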

Note: The available source text does not include detailed metrics or quantitative results from the CoffeeBench experiments.
