Coding · arXiv cs.CL · 8 d ago

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

The article introduces Dialogue SWE-Bench, a benchmark dataset for evaluating how well coding agents resolve software engineering problems through dialogue with a user, rather than as fully autonomous systems. It features a persona-grounded user simulator and automatic evaluations of dialogue quality. The authors also propose a schema-guided agent that improves on strong baselines by 3-14%. The result underscores the need for stronger dialogue capabilities in coding agents, an area that existing benchmarks leave underexplored.

coding agents · benchmark · dialogue
