AgentsarXiv cs.AI — 8 d ago

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

The paper introduces WebStep, a new benchmark for evaluating web agents through process-level analysis, featuring 1,800 task instances with automatic semantic state tracking. This framework enables detailed evaluation of agent performance by capturing high-level states and transitions, revealing critical insights into agent behavior that traditional outcome-based metrics overlook. By identifying specific skill discrepancies and error sources, the study highlights areas for improvement, particularly as task complexity increases, thereby offering practitioners a more nuanced understanding of agent capabilities and weaknesses in real-world applications.

web agentssemantic state trackingbenchmarkrelevance 0.00 · engagement 0.00

Read at source ↗← all news