Coding · arXiv cs.AI · 12 d ago

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

The paper argues that existing coding benchmarks fail to accurately assess agentic software engineering systems. It identifies three problems: the benchmarks conflate the performance of the underlying model with that of the overall system, they penalize alternative solutions by scoring against a single reference, and they lack the granularity needed for component-level evaluation, which hinders iterative improvement. This misalignment points to a need for new benchmarking methodologies that better reflect the complexities of coding agents, which matters for practitioners developing robust AI systems.

coding benchmarks · software engineering · agentic skills
