Agents
$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems
The article introduces $\tau$-Rec, a verifiable benchmark for agentic recommender systems that addresses a limitation of current evaluation methods: their reliance on subjective assessments. It uses a reveal-tagged elicitation (RTE) mechanism and a pass^k reliability metric to evaluate nine configurations across five model families, including GPT-5.4 and Claude Sonnet 4.6. Even the top models reach only ~57% at pass^1 and ~38% at pass^4. For practitioners, the benchmark offers a systematic way to evaluate conversational recommender agents, and the drop from pass^1 to pass^4 shows how much reliability still needs to improve before real-world deployment.
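The summary does not define pass^k, so the following is a minimal sketch under an assumption: that $\tau$-Rec follows the definition popularized by $\tau$-bench, where pass^k is the probability that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the toy data are hypothetical and are not taken from the benchmark.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Assumed estimator: C(c, k) / C(n, k), with c successes out of n trials.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # math.comb returns 0 when k > num_successes, so a task with fewer than
    # k successes contributes 0 to the score.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Toy data (illustrative only): three tasks, four trials each.
toy_successes = [4, 2, 0]
pass_1 = benchmark_pass_hat_k(toy_successes, num_trials=4, k=1)
pass_4 = benchmark_pass_hat_k(toy_successes, num_trials=4, k=4)
```

Under this assumed definition, pass^1 is the ordinary average success rate, while pass^4 only credits tasks the agent solves on every one of four attempts. That is why a score can fall from pass^1 to pass^4, as in the reported ~57% to ~38%: tasks solved only some of the time count toward the former but not the latter.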
recommender systems, benchmark, llm