Coding
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
The paper argues that existing coding benchmarks do not accurately assess agentic software engineering systems. It identifies three problems: the benchmarks conflate the performance of the underlying model with that of the overall system, they penalize alternative solutions by grading against a single reference point, and they lack the granularity needed for component-level evaluation, which hinders iterative improvement. This misalignment suggests a need for new benchmarking methodologies that better reflect the complexity of coding agents, a point that matters to practitioners building robust AI systems.
coding benchmarks, software engineering, agentic skills