Coding
DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents
DualGauge is presented as the first automated framework for jointly evaluating the functional correctness and security of code generated from natural-language specifications. It is accompanied by DualGauge-Bench, a benchmark of 307 coding tasks with functional and security tests. An evaluation of 10 LLMs across Python, C++, and JavaScript finds that even the top-performing model achieves less than 15% joint success on security and functionality, which suggests that larger models and additional tuning do not necessarily improve secure code generation. The framework exposes critical security weaknesses in current code generation and underlines the need for integrated benchmarks that check functional and security compliance together.
llm, code-generation, security, benchmark
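To make the notion of "joint success" in the summary above concrete, here is a minimal sketch of how such a rate could be computed. It assumes a task counts as a joint success only when the generated code passes both its functional tests and its security tests. The `TaskResult` type, the `pass_rates` function, and the sample data are hypothetical illustrations, not part of DualGauge, and the paper's exact metric may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome; not DualGauge's actual data model."""
    task_id: str
    functional_pass: bool  # all functional tests passed
    security_pass: bool    # all security tests passed


def pass_rates(results):
    """Return functional, security, and joint pass rates over a list of results."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    functional = sum(r.functional_pass for r in results) / n
    secure = sum(r.security_pass for r in results) / n
    # Assumption: joint success requires passing both test suites on the same task.
    joint = sum(r.functional_pass and r.security_pass for r in results) / n
    return {"functional": functional, "secure": secure, "joint": joint}


# Made-up outcomes for illustration only.
results = [
    TaskResult("t1", True, True),
    TaskResult("t2", True, False),
    TaskResult("t3", False, True),
    TaskResult("t4", True, False),
]
rates = pass_rates(results)
print(rates)
```

Under this definition the joint rate can never exceed the lower of the two individual rates, which is why a model can look strong on functionality alone and still score poorly on the joint measure.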