arXiv cs.AI · 7 d ago

Agents' Last Exam

The paper introduces Agents' Last Exam (ALE), a benchmark for evaluating AI agents on economically valuable, long-horizon tasks with verifiable outcomes, developed with input from more than 250 industry experts. ALE organizes 1,000+ tasks into a taxonomy of 55 subfields across 13 industry clusters. On the hardest tier, current AI systems achieve an average full pass rate below 1%, leaving substantial room for improvement. The benchmark is intended to bridge the gap between AI performance on traditional benchmarks and practical impact on economic workflows, and it is positioned as a dynamic tool for ongoing evaluation and development of AI applications.

benchmark · evaluation · agents
