Coding
Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
The article introduces UOJ-Bench, a benchmark for evaluating large language models (LLMs) on competitive programming. It covers three tasks (code generation, hacking, and repair), all derived from real-world submissions on the Universal Online Judge (UOJ). The study finds that, under one-shot evaluation, even top-performing models fail to detect more than 50% of the errors in incorrect submissions. Test-time scaling can raise success rates above 90%, but at high computational cost. The results suggest that LLMs can help in educational settings by identifying errors, serving as complementary tools alongside traditional judging systems.
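To give a rough feel for why test-time scaling is both effective and expensive, the sketch below computes how many attempts are needed to reach a target success rate. It rests on assumptions that are not from the article: attempts are treated as independent, success means at least one attempt finds the error, and the one-shot rate of 0.45 is a hypothetical value chosen only because it is below 50%. The function names are illustrative.

```python
def success_after_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds,
    given a per-attempt success probability p (independence is an assumption)."""
    return 1.0 - (1.0 - p) ** k


def attempts_needed(p: float, target: float) -> int:
    """Smallest number of attempts k whose combined success rate reaches target."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    k = 1
    while success_after_k(p, k) < target:
        k += 1
    return k


if __name__ == "__main__":
    one_shot = 0.45  # hypothetical one-shot success rate, not a figure from the article
    k = attempts_needed(one_shot, 0.90)
    print(k, round(success_after_k(one_shot, k), 4))
```

Under these assumptions the cost grows linearly with the number of attempts, which is one way to read the trade-off the summary describes. Real attempts on the same submission are likely correlated, so the actual number of samples required may be considerably higher.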
code generation, llm, benchmark