Agents
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
EComAgentBench is a new benchmark of 662 tasks for evaluating LLM-based shopping agents on long-horizon tasks in which the user's intent is hidden and spread across several data sources, including queries and profiles. Grading uses automated, source-tagged rubrics, so each failure can be attributed to a specific requirement. In initial evaluations of seven models, the best-performing model reaches only 57.1% accuracy, which shows how hard it remains for agents to recover complex user requirements. Closing this gap is critical for moving shopping assistants beyond simple queries toward more reliable, context-aware interactions.
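To make the idea of source-tagged rubrics concrete, here is a minimal sketch of how a failed requirement might be traced back to the source that stated it. It is an illustration only, not the benchmark's actual grader: the names `RubricItem` and `grade`, the source labels, the example requirements, and the all-or-nothing pass rule are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    requirement: str  # what the agent's result must satisfy (hypothetical text)
    source: str       # where the requirement was stated, e.g. "query" or "profile"


def grade(rubric, satisfied):
    """Return (task_passed, failures_by_source) for one task.

    Assumption: a task passes only if every rubric item is satisfied.
    """
    failures = Counter(
        item.source for item in rubric if item.requirement not in satisfied
    )
    return not failures, dict(failures)


# Hypothetical task: one requirement from the query, two from the user profile.
rubric = [
    RubricItem("under $50", "query"),
    RubricItem("vegan leather", "profile"),
    RubricItem("ships to Canada", "profile"),
]

passed, by_source = grade(rubric, satisfied={"under $50", "ships to Canada"})
print(passed, by_source)  # False {'profile': 1}
```

In this sketch the agent met the explicit query constraint but missed one requirement that appeared only in the profile, so the failure is attributed to the profile source.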
llm shopping benchmark