Agents
LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling
LoHoSearch is a benchmark for long-horizon search agents: 544 human-verified questions across 11 domains, generated by an automated pipeline built on a knowledge graph of more than 7 million Wikipedia entities. It addresses a limitation of existing human-authored benchmarks, which have reached a difficulty ceiling, by posing structurally complex questions that remain hard for current models; the top model reaches only 34.74% accuracy. For practitioners, it provides a more demanding test of reasoning and context management in AI search agents.
search, benchmark, long-horizon