ai-digest.dev
Agents · arXiv cs.AI · 4 d ago

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Workflow-GYM is a benchmark for evaluating AI agents on long-horizon, high-value tasks carried out through graphical user interfaces (GUIs) in professional domains. In initial experiments, even state-of-the-art models achieve success rates only slightly above 30%. The failures point to three difficulties: keeping a workflow consistent over many steps, avoiding errors, and understanding domain-specific software. The benchmark thus exposes critical limitations of current AI agents and identifies areas for future research in GUI-agent development.

ai agents · gui · workflow · relevance 0.00 · engagement 0.00