Agents · arXiv cs.CL · 7 d ago

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

LoHoSearch is a benchmark for long-horizon search agents. It consists of 544 human-verified questions across 11 domains, generated by an automated pipeline built on a knowledge graph of more than 7 million Wikipedia entities. The benchmark targets a limitation of existing human-authored benchmarks, which have reached a difficulty ceiling: its structurally complex questions still challenge current models, with the top model achieving only 34.74% accuracy. For practitioners, it provides a harder test of the reasoning and context-management capabilities of AI search agents.

search · benchmark · long-horizon
