Coding · arXiv cs.CL · 8 d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Claw-SWE-Bench is a new benchmark for evaluating OpenClaw-style agent harnesses on coding tasks. It comprises 350 instances across 8 programming languages and 43 repositories, and ships as a full benchmark plus a faster validation subset, Claw-SWE-Bench Lite. Adapter design has a large effect on results: OpenClaw's Pass@1 is 19.1% with a minimal adapter versus 73.4% with a full adapter. The benchmark emphasizes harness design and cost accounting when assessing coding agents, and offers practitioners a standardized framework for doing so.
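
To put the two reported Pass@1 figures in perspective, the sketch below computes the gap between them. It assumes the common reading of Pass@1 as the percentage of instances resolved on a single attempt; the summary does not spell out the paper's exact definition, and the helper name `pass_at_1` and the toy result list are hypothetical, used for illustration only.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Percentage of instances resolved on a single attempt (assumed definition)."""
    if not resolved:
        raise ValueError("need at least one instance")
    return 100.0 * sum(resolved) / len(resolved)


# Toy data, not taken from the benchmark: 2 of 4 instances resolved.
toy_score = pass_at_1([True, False, False, True])

# Pass@1 scores reported in the summary for OpenClaw.
minimal_adapter = 19.1
full_adapter = 73.4

# Absolute gap in percentage points and relative improvement factor.
gap_points = round(full_adapter - minimal_adapter, 1)
ratio = round(full_adapter / minimal_adapter, 2)

print(f"gap: {gap_points} points, ratio: {ratio}x")
```

Under these assumptions the full adapter scores 54.3 percentage points higher, roughly 3.8 times the minimal adapter's score, which illustrates why the adapter has to be reported alongside the result.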

benchmark · agents · coding tasks
