ai-digest.dev
Research · arXiv cs.AI · 8 d ago

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

CoffeeBench is a new benchmark for evaluating long-horizon LLM agents in heterogeneous multi-agent economies. Its simulation features two farmers, two roasters, and two retailers interacting over a 90-day period. Models are assessed on how well they communicate, negotiate, and manage resources. Most models outperform a passive baseline, but agent behavior differs notably between models; Claude Haiku 4.5, for example, tends towards inaction. For practitioners, the benchmark offers a structured way to analyze LLM capabilities in complex, dynamic environments, which can inform how agents are designed and applied in economic settings.

benchmarking · llm · multi-agent