Agents
Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
The paper introduces WebStep, a benchmark of 1,800 task instances for process-level evaluation of web agents, with automatic semantic state tracking. By capturing high-level states and the transitions between them, the framework supports detailed evaluation of agent performance and reveals behavior that traditional outcome-based metrics overlook. The study identifies specific skill discrepancies and error sources and highlights areas for improvement, particularly as task complexity increases, giving practitioners a more nuanced understanding of agent capabilities and weaknesses in real-world applications.
web agents, semantic state tracking, benchmark