Agents
Benchmarking Agentic Review Systems
A new study benchmarks agentic review systems, evaluating two open-source systems (OpenAIReview and coarse) and one proprietary system (Reviewer3) against six LLMs, including the frontier model GPT-5.5. The best-performing configuration, OpenAIReview + GPT-5.5, reached 83.0% pairwise accuracy in tracking paper quality with its reviews and detected 71.6% of injected errors in a perturbation benchmark. The findings suggest that AI-assisted review systems show promise for quality tracking and error detection, but substantial room for improvement remains: even the best configuration gets 17.0% of pairwise comparisons wrong and misses 28.4% of injected errors. That gap matters to practitioners building AI-driven peer review tools.
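The two headline metrics can be illustrated with a minimal Python sketch. It assumes that pairwise accuracy means the fraction of paper pairs that the review scores rank in the same order as a reference quality signal, and that the detection rate is the fraction of injected errors a review flags. The study's exact definitions may differ, and the function names `pairwise_accuracy` and `detection_rate` are hypothetical, not taken from the benchmark's code.

```python
from itertools import combinations


def pairwise_accuracy(predicted, reference):
    """Fraction of item pairs that both score lists order the same way.

    Assumption: pairs tied in the reference are skipped, and a tie in the
    predicted scores counts as a miss. The study may handle ties differently.
    """
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue  # no ground-truth ordering for this pair
        total += 1
        pred_diff = predicted[i] - predicted[j]
        if pred_diff * ref_diff > 0:  # same sign means same ordering
            agree += 1
    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


def detection_rate(detected_flags):
    """Fraction of injected errors that were flagged (True means detected)."""
    flags = list(detected_flags)
    if not flags:
        raise ValueError("no injected errors to score")
    return sum(1 for flag in flags if flag) / len(flags)


if __name__ == "__main__":
    # Illustrative toy scores only, not data from the study.
    review_scores = [3, 1, 2, 4]
    paper_quality = [4, 1, 2, 3]
    print(pairwise_accuracy(review_scores, paper_quality))
    print(detection_rate([True, False, True, True]))
```

Under these assumptions, a system that ranks papers at random would score about 50% on pairwise accuracy, which gives the reported 83.0% a baseline to be read against.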
peer-review evaluation llm