Agents
STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
The paper introduces STAGE-Claw, an automated framework designed for evaluating personal agents in realistic state-based computing environments. It generates benchmark tasks from a task hint, providing a comprehensive evaluation that measures agents' performance based on the correctness of the final system state rather than just textual responses. This framework includes 40 challenging tasks and evaluates 11 frontier models, addressing limitations of existing benchmarks and enhancing the scalability and reliability of personal-agent evaluation for practitioners.
llmbenchmarkingevaluation