Agents · arXiv cs.CL · 14 d ago

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench is a benchmark that evaluates the long-term decision-making of language model agents in a simulated startup environment spanning 500 days. It assesses how well agents navigate uncertainty, adapt to change, and manage multiple tasks. Only Claude Opus 4.8 and GPT-5.5 achieved a positive financial outcome, which highlights how difficult complex, realistic scenarios remain for state-of-the-art models. For practitioners, the results point to a need for stronger sustained adaptive reasoning and strategic decision-making.

long-term planning · multi-agent · llm
