Research
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
ResearchClawBench is a new benchmark for evaluating autonomous scientific research. It comprises 40 tasks drawn from 10 scientific domains, and results are assessed against expert-curated multimodal rubrics. Seven autonomous research agents and seventeen native LLMs were tested. The best-performing systems, Claude Code and Claude-Opus-4.7, reached average scores of 21.5 and 20.7, respectively, showing that current systems are still far from reliably re-discovering scientific knowledge. For practitioners, the benchmark provides a reproducible evaluation framework for tracking progress in AI coding agents applied to scientific research.
benchmark, scientific research, auto-research