RAG
DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
The article introduces DailyReport, an open-ended benchmark for evaluating search agents (SAs) on daily search tasks. It comprises 150 tasks and 3,546 associated rubrics, about 24 rubrics per task on average. The benchmark uses a cascade rubric system for detailed performance attribution and user-centric scoring, and it highlights the limitations of current agentic systems in meeting user expectations. For practitioners developing SAs, it offers a more realistic assessment framework than traditional task-specific evaluations.
search, agents, llm