Agents
Offline Preference-Based Trajectory Evaluation
The paper introduces a preference-based method for offline evaluation of agentic systems that compares trajectories by temporal preferences rather than by terminal success alone. The approach reduces tied comparisons from approximately 75% to 35%, improving discriminative power and data efficiency across various benchmarks. The results suggest that traditional success-based metrics may contribute to benchmark saturation, which underscores how much the choice of evaluation measure shapes conclusions about AI system performance.
offline evaluation, agentic systems
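
To see why comparing on terminal success alone produces many ties, consider the minimal sketch below. It is not the paper's method: the `Trajectory` record, the per-step `progress` scores, and the tie-breaking rule (terminal success first, then mean per-step progress) are all assumptions made for illustration, and the toy data is invented rather than taken from any benchmark.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Trajectory:
    # Hypothetical record: per-step progress scores in [0, 1] plus a terminal flag.
    progress: list
    success: bool


def mean_progress(t):
    """Average per-step progress; 0.0 for an empty trajectory."""
    return sum(t.progress) / len(t.progress) if t.progress else 0.0


def compare_terminal(a, b):
    """Return 1 if a wins, -1 if b wins, 0 on a tie, using terminal success only."""
    return int(a.success) - int(b.success)


def compare_temporal(a, b, tol=1e-9):
    """Assumed preference rule: terminal success first, then mean per-step progress."""
    base = compare_terminal(a, b)
    if base != 0:
        return base
    diff = mean_progress(a) - mean_progress(b)
    if abs(diff) <= tol:
        return 0
    return 1 if diff > 0 else -1


def tie_rate(trajs, compare):
    """Fraction of unordered trajectory pairs that the comparator cannot separate."""
    pairs = list(combinations(trajs, 2))
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)


# Invented toy data: two successes and two failures with different step-level progress.
trajs = [
    Trajectory([0.5, 1.0], True),
    Trajectory([0.25, 0.5, 1.0], True),
    Trajectory([0.25, 0.5], False),
    Trajectory([0.0, 0.25], False),
]

print("terminal-only tie rate:", tie_rate(trajs, compare_terminal))
print("temporal tie rate:", tie_rate(trajs, compare_temporal))
```

In this toy set the success-only comparator cannot separate the two successful runs or the two failed runs, while the assumed step-level rule separates every pair. The numbers illustrate the mechanism only and say nothing about the 75% and 35% figures reported in the paper.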