Coding
Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents
The article introduces Dialogue SWE-Bench, a benchmark dataset that evaluates how well coding agents resolve software engineering problems through dialogue with a user, rather than as fully autonomous systems. It features a persona-grounded user simulator and automatic evaluations of dialogue quality. The article also proposes a schema-guided agent that improves on strong baselines by 3-14%, which highlights the need for stronger dialogue capabilities in coding agents, an area that existing benchmarks leave underexplored.
coding agents, benchmark, dialogue