Agents
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Workflow-GYM is a benchmark for evaluating AI agents on long-horizon, high-value tasks in professional domains, carried out through graphical user interfaces (GUIs). Initial experiments show that even state-of-the-art models achieve success rates only slightly above 30%, pointing to significant difficulty in maintaining workflow consistency, avoiding errors, and understanding domain-specific software. The benchmark exposes these limitations in current AI agents and identifies areas for future research in GUI-agent development.
ai agents, gui, workflow