Agents
ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents
ShoppingBench is an end-to-end benchmark for evaluating LLM-based agents in e-commerce on complex user intents such as applying vouchers and searching for multiple products. Its scalable framework includes a shopping sandbox with over 2.5 million real-world products. Even advanced models such as GPT-4.1 achieve success rates below 50% on the benchmark tasks. The research also introduces a trajectory distillation strategy in which a smaller agent, trained with supervised fine-tuning and reinforcement learning, reaches competitive performance, which makes the work relevant to practitioners improving LLM agents in real-world applications.
shopping-benchmark, intent-grounded, llm, e-commerce