Agents · arXiv cs.AI — 4 d ago

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Workflow-GYM is a benchmark for evaluating AI agents on long-horizon, high-value tasks in professional domains, carried out through graphical user interfaces (GUIs). Initial experiments show that even state-of-the-art models achieve success rates only slightly above 30%, which points to significant difficulty in maintaining workflow consistency, avoiding errors, and understanding domain-specific software. The results expose critical limitations in current AI agent capabilities and identify areas for future research in GUI-agent development.

ai agents · gui · workflow
