Agents · arXiv cs.AI — 8 d ago

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

RetailBench is a simulation benchmark for evaluating long-horizon reasoning and coherent decision-making in LLM agents operating a supermarket. It models retail management as a partially observable decision process on a thousand-day scale and compares seven contemporary LLMs against a privileged oracle policy over a 180-day evaluation. The results show large performance disparities, with most models struggling to sustain effective decision-making over the long term, which points to a need for better evidence acquisition and policy consistency when LLMs operate in complex, dynamic environments.

long horizon reasoning · retail · llm
