arXiv cs.CL · 7 d ago

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

The paper introduces Shopping Reasoning Bench, a benchmark for evaluating multi-turn conversational shopping assistants on open-ended reasoning and domain expertise. It comprises 525 missions and 10,863 importance-weighted binary rubrics written by retail experts and organized into five reasoning categories. Across nine evaluated models, including GPT, Claude, and Gemini, pass rates reach only 57-77%, with models struggling most on optional criteria in multi-turn dialogues. The benchmark is intended to expose this gap and guide improvements in shopping assistant capabilities.
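To make "importance-weighted binary rubrics" concrete, here is a minimal sketch of one way such rubrics could be combined into a single score. The summary does not state the paper's scoring formula, so the weighted-mean aggregation, the `Rubric` type, the `weighted_pass_rate` function, and the example weights are all illustrative assumptions, not the benchmark's actual method.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    """One binary rubric (hypothetical structure, not the benchmark's schema)."""
    weight: float  # assumed importance weight
    passed: bool   # binary judgment: did the response satisfy the rubric?


def weighted_pass_rate(rubrics: list[Rubric]) -> float:
    """Return the share of total importance weight carried by passed rubrics.

    Assumption: the score is a weighted mean of the binary outcomes.
    """
    total = sum(r.weight for r in rubrics)
    if total <= 0:
        raise ValueError("rubrics must carry positive total weight")
    return sum(r.weight for r in rubrics if r.passed) / total


# Made-up example: two heavier rubrics pass, one lighter rubric fails.
mission = [Rubric(3.0, True), Rubric(2.0, True), Rubric(1.0, False)]
print(f"{weighted_pass_rate(mission):.2f}")  # prints 0.83
```

Under this assumed scheme, failing a low-weight rubric costs less than failing a high-weight one, so the score differs from a plain count of passed rubrics.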

conversational assistants · benchmark · shopping
