CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Researchers introduce CoffeeBench, a novel benchmarking framework designed to evaluate the performance of Large Language Model (LLM) agents operating within complex, heterogeneous multi-agent economic systems over extended time horizons.
The Challenge of Economic Evaluation in LLM Agents
As Large Language Models (LLMs) evolve into autonomous agents capable of executing long-horizon tasks, there is a growing need for rigorous evaluation frameworks that mirror real-world complexity. Most existing benchmarks focus on single-agent interactions within passive environments and therefore fail to capture the dynamic nature of economic systems. Economic environments are inherently multi-agent: autonomous entities must communicate continuously, negotiate strategically, and carry out transactions to achieve their objectives over prolonged periods.
Introducing CoffeeBench
CoffeeBench addresses these gaps by providing a specialized environment to test how LLM agents navigate heterogeneous economies. Unlike static tests, CoffeeBench emphasizes the ability of agents to pursue self-defined goals while interacting with other autonomous agents. This requires the models to exhibit high-level reasoning, long-term planning, and the ability to manage resources and relationships within a competitive yet collaborative ecosystem.
Key Evaluation Dimensions
The benchmark focuses on several critical capabilities essential for deployment in economic simulations:
- Long-Horizon Planning: The ability to maintain goal consistency and execute multi-step strategies over extended operational windows.
- Multi-Agent Coordination: The capacity to communicate and negotiate effectively with other agents to facilitate transactions.
- Heterogeneity: The ability to interact with agents that have different objectives and constraints within a shared economic framework.
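To make the three dimensions above concrete, here is a minimal toy sketch of a multi-round economy with heterogeneous agents. It is not CoffeeBench's environment or API, which the text above does not describe. Every name in it (`Agent`, `negotiate`, `run_economy`, the roles, and the midpoint pricing rule) is a hypothetical assumption chosen only for illustration, and the scripted agents stand in for what would be LLM-driven agents in a real benchmark.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical economic agent; not part of CoffeeBench itself."""
    name: str
    role: str           # "seller" or "buyer" (heterogeneous objectives)
    cash: float
    stock: int
    limit_price: float  # seller: lowest acceptable price; buyer: highest


def negotiate(seller: Agent, buyer: Agent):
    """Return an agreed unit price, or None if no deal is possible.

    Assumed rule: the two sides split the difference between their limits.
    """
    if seller.stock == 0 or buyer.limit_price < seller.limit_price:
        return None
    price = (seller.limit_price + buyer.limit_price) / 2
    if buyer.cash < price:
        return None
    return price


def run_economy(agents, rounds):
    """Run a fixed number of rounds and return a ledger of transactions."""
    sellers = [a for a in agents if a.role == "seller"]
    buyers = [a for a in agents if a.role == "buyer"]
    ledger = []
    for t in range(rounds):  # the long horizon: many sequential rounds
        for buyer in buyers:
            # Multi-agent coordination: ask every seller for a price.
            offers = [(negotiate(s, buyer), s) for s in sellers]
            offers = [(p, s) for p, s in offers if p is not None]
            if not offers:
                continue
            price, seller = min(offers, key=lambda o: o[0])
            # Settle the transaction: one unit for the agreed price.
            seller.stock -= 1
            seller.cash += price
            buyer.stock += 1
            buyer.cash -= price
            ledger.append((t, seller.name, buyer.name, price))
    return ledger


if __name__ == "__main__":
    agents = [
        Agent("farm", "seller", cash=0.0, stock=3, limit_price=2.0),
        Agent("roaster", "seller", cash=0.0, stock=5, limit_price=4.0),
        Agent("cafe", "buyer", cash=10.0, stock=0, limit_price=5.0),
    ]
    for entry in run_economy(agents, rounds=4):
        print(entry)
```

In this sketch, a benchmark-style score could be derived from the final state, for example each agent's cash and stock after the last round. Which metrics CoffeeBench actually reports is not stated in the text above.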