Research
A Framework for Evaluating Agentic Skills at Scale
The paper introduces an evaluation framework for agent skills in large language model (LLM) agents, one that lets skill authors create realistic tasks for rigorous evaluation. The authors apply the framework to 500 real-world skills, generating 1,000 tasks with scoring rubrics, and evaluate 19 proprietary and open-source agent-model configurations. Performance varies significantly depending on how skills are integrated. For practitioners, the framework offers a structured way to quantify skill utility and model behavior when developing LLM agents.
agentic skills, evaluation framework