Agents
JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
The article introduces JADE, a two-layer framework for evaluating agentic AI on open-ended professional tasks. The first layer encodes expert knowledge into stable evaluation criteria; the second applies flexible, claim-level evaluation to each response. Together, the two layers make assessments more stable and expose agent failure modes that traditional LLM-based evaluators overlook. JADE is reported to align well with expert rubrics on several benchmarks, including BizBench, HealthBench, and DR.BENCH, which makes it relevant to practitioners who need evaluation that is both rigorous and adaptable.
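To make the two-layer idea concrete, here is a minimal sketch of how stable criteria and claim-level checks could be combined into a score. It is not JADE's actual implementation: the names `Criterion`, `ClaimCheck`, `score_response`, and `failed_claims`, the weights, and the weighted-average aggregation rule are all assumptions made for illustration, and the `supported` verdicts would in practice come from a judge model or a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Layer 1 (hypothetical): a stable, expert-authored criterion."""
    name: str
    weight: float


@dataclass(frozen=True)
class ClaimCheck:
    """Layer 2 (hypothetical): a verdict on one claim from the agent's response."""
    claim: str
    criterion: str   # name of the Layer 1 criterion this claim bears on
    supported: bool  # assumed to be supplied by a judge model or a human


def score_response(criteria, checks):
    """Return the weighted average of per-criterion support rates.

    Assumption: a criterion with no associated claims scores 0.0,
    i.e. missing coverage is treated as a failure.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria must have positive total weight")
    score = 0.0
    for c in criteria:
        relevant = [k for k in checks if k.criterion == c.name]
        rate = sum(k.supported for k in relevant) / len(relevant) if relevant else 0.0
        score += c.weight * rate
    return score / total_weight


def failed_claims(checks):
    """List the unsupported claims, grouped by criterion name."""
    failures = {}
    for k in checks:
        if not k.supported:
            failures.setdefault(k.criterion, []).append(k.claim)
    return failures


criteria = [Criterion("accuracy", 2.0), Criterion("completeness", 1.0)]
checks = [
    ClaimCheck("Revenue grew 12% year over year", "accuracy", True),
    ClaimCheck("The filing deadline is in March", "accuracy", False),
    ClaimCheck("All three risk factors are covered", "completeness", True),
]
score = score_response(criteria, checks)
failures = failed_claims(checks)
```

Keeping the criteria fixed while the claim checks vary per response is what lets the same rubric apply to very different answers, and the per-claim record gives a direct view of where an agent failed rather than only a single number.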
evaluation, AI, tasks