Agents · arXiv cs.AI · 4 d ago

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

The paper introduces $\tau$-Rec, a verifiable benchmark for agentic recommender systems that addresses a limitation of current evaluation methods: their reliance on subjective assessments. It uses a reveal-tagged elicitation (RTE) mechanism and a pass^k reliability metric to evaluate nine configurations across five model families, including GPT-5.4 and Claude Sonnet 4.6. Even the top models reach only ~57% reliability at pass^1 and ~38% at pass^4. For practitioners, the benchmark offers a systematic way to evaluate conversational agents, and its results highlight the need for better reliability in real-world applications.
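The summary does not define pass^k, so the following is only a minimal sketch. It assumes $\tau$-Rec follows the common definition in which pass^k is the probability that all k of k independent trials on a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and the toy results are hypothetical and are not taken from the paper.

```python
from math import comb


def task_pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed.

    Assumed estimator: C(c, k) / C(n, k), with c successes out of n trials.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks (hypothetical helper)."""
    per_task = [
        task_pass_hat_k(len(trials), sum(trials), k)
        for trials in results.values()
    ]
    return sum(per_task) / len(per_task)


# Toy data for illustration only: two tasks, four trials each.
toy_results = {
    "task_a": [True, True, True, True],     # always succeeds
    "task_b": [True, False, True, False],   # succeeds half the time
}

for k in (1, 2, 4):
    print(f"pass^{k} = {benchmark_pass_hat_k(toy_results, k):.3f}")
```

Under this definition pass^1 is the ordinary average success rate, and the score can only fall as k grows, because a task that fails even occasionally contributes less and less. That matches the direction of the reported drop from ~57% at pass^1 to ~38% at pass^4.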

recommender systems · benchmark · llm
