Agents
ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
The article introduces ORAgentBench, a benchmark for evaluating autonomous agents on complex operations research (OR) tasks in executable environments. It comprises 107 human-reviewed tasks that require agents to write and execute solution code, and it scores submissions on schema validity and objective quality. Results indicate that current agent models are unreliable: they achieve only 35.51% success across all tasks and show significant weaknesses in strategic decision-making and solution quality. The findings underscore the need for advances in agents' operational decision-making capabilities.
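To make the two scoring criteria concrete, here is a minimal Python sketch of how a check on schema validity and objective quality could be combined. It is an illustration under stated assumptions, not the benchmark's actual grader: the function names, the `objective` key, the relative-gap metric, and the tolerance value are all hypothetical and do not come from the article.

```python
def is_schema_valid(solution, required):
    """Return True if `solution` is a dict with every required key of the expected type.

    `required` maps key names to a type or tuple of types (hypothetical schema format).
    """
    if not isinstance(solution, dict):
        return False
    return all(isinstance(solution.get(key), kind) for key, kind in required.items())


def relative_gap(objective, reference):
    """Relative distance between a submitted objective value and a reference value."""
    return abs(objective - reference) / max(abs(reference), 1e-9)


def passes(solution, required, reference, tolerance=0.01):
    """Assumed pass rule: valid schema AND objective within `tolerance` of the reference."""
    if not is_schema_valid(solution, required):
        return False
    return relative_gap(float(solution["objective"]), reference) <= tolerance


# Hypothetical submission for a small assignment-style task.
required = {"objective": (int, float), "assignment": list}
submission = {"objective": 101.0, "assignment": [0, 1]}
print(passes(submission, required, reference=100.0, tolerance=0.05))
```

The symmetric gap is a simplification: a real grader would likely distinguish minimization from maximization and would also verify that the submitted solution is feasible.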
operations-research benchmark