Research
CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
CoffeeBench is a new benchmark for evaluating long-horizon LLM agents in heterogeneous multi-agent economies. It simulates two farmers, two roasters, and two retailers over a 90-day period and assesses models on how well they communicate, negotiate, and manage resources. Most models outperform a passive baseline, but agent behavior differs notably; Claude Haiku 4.5, for example, tends toward inaction. For practitioners, the benchmark offers a structured way to analyze LLM capabilities in complex, dynamic environments, which can inform how such agents are designed and applied in economic contexts.
benchmarking, llm, multi-agent