Agents
SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval
SkillResolve-Bench 1.0 is a benchmark for agent skill retrieval that targets same-capability ambiguity: a helpful skill and a risky skill can offer the same capability, so a retriever may surface the risky one for execution. It provides 661 helpful/risky skill pairs and a candidate pool of 7,982. The benchmark reports how well the helpful skill is ranked alongside the harmful sibling rate (HSR@K). On this benchmark, the SkillResolve method achieves Recall@3 of 0.766 and NDCG@3 of 0.699, improving recall and reducing harmful exposure relative to the earlier SkillRouter. For practitioners, it offers a structured way to measure and reduce the risk that a retrieval system hands an agent a harmful skill.
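The metrics named above can be illustrated with a minimal sketch. It is not the benchmark's official scorer. The summary does not define HSR@K, so the code assumes it is the fraction of queries whose top-K results contain the risky sibling of the helpful skill, and it assumes exactly one helpful skill per query. All function names, field names, and skill identifiers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def recall_at_k(ranked: Sequence[str], helpful_id: str, k: int) -> float:
    """1.0 if the helpful skill appears in the top-k results, else 0.0."""
    return 1.0 if helpful_id in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], helpful_id: str, k: int) -> float:
    """NDCG with a single relevant item (assumption): the ideal DCG is 1,
    so the score is 1 / log2(rank + 1) when the item is in the top-k."""
    for position, skill_id in enumerate(ranked[:k]):
        if skill_id == helpful_id:
            return 1.0 / math.log2(position + 2)  # position is 0-based
    return 0.0


def hsr_at_k(ranked: Sequence[str], risky_id: str, k: int) -> float:
    """Assumed definition: 1.0 if the risky sibling appears in the top-k."""
    return 1.0 if risky_id in ranked[:k] else 0.0


def evaluate(queries: List[Dict], k: int = 3) -> Dict[str, float]:
    """Average the three per-query metrics over all queries."""
    n = len(queries)
    return {
        "recall": sum(recall_at_k(q["ranked"], q["helpful"], k) for q in queries) / n,
        "ndcg": sum(ndcg_at_k(q["ranked"], q["helpful"], k) for q in queries) / n,
        "hsr": sum(hsr_at_k(q["ranked"], q["risky"], k) for q in queries) / n,
    }


# Toy data with made-up skill identifiers, for illustration only.
toy_queries = [
    {"ranked": ["pdf_merge_safe", "pdf_merge_risky", "zip_files"],
     "helpful": "pdf_merge_safe", "risky": "pdf_merge_risky"},
    {"ranked": ["csv_clean_risky", "csv_clean_safe", "plot_chart"],
     "helpful": "csv_clean_safe", "risky": "csv_clean_risky"},
    {"ranked": ["translate", "summarize", "ocr", "send_email_safe"],
     "helpful": "send_email_safe", "risky": "send_email_risky"},
]

scores = evaluate(toy_queries, k=3)
print(scores)
```

Under these assumptions a retriever can score well on Recall@3 while still exposing the agent to the risky sibling, as in the first two toy queries. That is why a harmful-exposure metric is reported next to the ranking metrics.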
agents, skill retrieval, benchmark