Agents
RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments
RetailBench is a simulation benchmark for evaluating long-horizon reasoning and coherent decision-making in LLM agents that operate a supermarket. It models retail management as a partially observable decision process on a thousand-day scale and compares seven contemporary LLMs against a privileged oracle policy over a 180-day evaluation. The results show substantial performance disparities: most models struggle to sustain effective decision-making over the long term, which points to a need for better evidence acquisition and policy consistency when LLMs are applied to complex, dynamic environments.
long horizon reasoning, retail, llm