Agents
Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
The paper introduces an evaluation framework for large language model (LLM) agents, arguing that existing aggregate-score leaderboards fail to account for the complexities of deployment. It presents a twelve-tier measurement apparatus centered on predictive validity, correlating in-sample rankings with out-of-sample rankings, and proposes new criteria for assessing agent performance in diverse, real-world scenarios. The aim is to make benchmarks more reliable for practitioners by characterizing agent capabilities in more detail than a static ranking can.
benchmarks, llm, evaluation
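
The summary describes predictive validity as a correlation between in-sample and out-of-sample rankings. As a rough illustration of that idea only, the sketch below computes a Spearman rank correlation between two sets of agent scores. The choice of Spearman's coefficient, the agent names, and the scores are all assumptions made for this example; the summary does not say which statistic or data the paper uses.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied values share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores, invented for illustration (not taken from the paper).
in_sample = {"agent_a": 0.82, "agent_b": 0.75, "agent_c": 0.61, "agent_d": 0.58}
out_of_sample = {"agent_a": 0.64, "agent_b": 0.70, "agent_c": 0.41, "agent_d": 0.45}

agents = sorted(in_sample)
rho = spearman([in_sample[a] for a in agents], [out_of_sample[a] for a in agents])
print(f"Spearman correlation between rankings: {rho:.2f}")
```

Under this reading, a coefficient near 1 would suggest that a leaderboard's ordering carries over to held-out tasks, while a value near 0 would suggest the in-sample ranking says little about out-of-sample performance.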