Agents · arXiv cs.AI · 7 d ago

WorkBench Revisited: Workplace Agents Two Years On

The article announces an updated benchmark for workplace agents and reports significant performance improvements. With its release, Claude Opus 4.8 achieves an 89% task completion rate and reduces unintended harmful actions to 2.5%. A key finding is that capability and safety are positively correlated, although some basic errors, such as misdirected emails, still persist. The emergence of open-weight models has also made high-performance capabilities more accessible and cost-effective. These developments prompted an update to the benchmark, with enhanced data, improved code quality, and updated model evaluations.

workplace · agents · benchmark
