Safety
Auditing Reward Hackability in Code RL Training Environments
The paper audits how vulnerable code reinforcement learning (RL) environments are to incorrect solutions that still pass their tests. It finds that 28.5% of tasks in SWE-bench Verified and 25.0% of tasks in R2E-Gym can be exploited because their test suites are too weak. A meta-analysis of 134 model submissions indicates that models score significantly higher on hackable tasks, with Pass@1 higher by 14.14 percentage points. The authors propose a hardening procedure that combines an inline LLM judge with a Docker gold-sanity gate. The procedure reveals a high defect rate in LLM-generated tests, which underscores the need for robust validation mechanisms in code RL environments to prevent exploitation.
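The two ideas in the summary above can be illustrated with a toy sketch. It is not the authors' pipeline: it assumes that "hackable" means at least one known-incorrect solution passes the task's full test suite, and that a gold-sanity gate keeps a generated test only if the reference (gold) solution passes it. The paper's gate runs in Docker and is paired with an LLM judge; here everything runs in-process and the judge is not modeled. All names (`Task`, `is_hackable`, `gold_sanity_gate`, `hackable_rate`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

Solution = Callable[[int], int]
Test = Callable[[Solution], bool]  # returns True if the solution passes


@dataclass
class Task:
    name: str
    gold: Solution              # reference (correct) solution
    incorrect: List[Solution]   # known-wrong candidate solutions
    tests: List[Test]           # the environment's test suite


def passes(solution: Solution, tests: List[Test]) -> bool:
    """A solution is accepted only if every test passes."""
    return all(test(solution) for test in tests)


def is_hackable(task: Task) -> bool:
    """Assumed definition: some incorrect solution passes the whole suite."""
    return any(passes(s, task.tests) for s in task.incorrect)


def hackable_rate(tasks: List[Task]) -> float:
    """Fraction of tasks that are hackable under the definition above."""
    return sum(is_hackable(t) for t in tasks) / len(tasks)


def gold_sanity_gate(task: Task, candidate_tests: List[Test]) -> List[Test]:
    """Keep only generated tests that the gold solution itself passes."""
    return [test for test in candidate_tests if test(task.gold)]


# Toy task 1: the suite only checks a positive input, so the identity
# function (a wrong implementation of abs) slips through.
abs_task = Task(
    name="abs",
    gold=abs,
    incorrect=[lambda x: x],
    tests=[lambda f: f(3) == 3],
)

# Toy task 2: the single test already rejects the wrong solution.
square_task = Task(
    name="square",
    gold=lambda x: x * x,
    incorrect=[lambda x: 2 * x],
    tests=[lambda f: f(3) == 9],
)

# Two generated tests: one valid, one defective (it contradicts the gold).
generated = [lambda f: f(-2) == 2, lambda f: f(-2) == -2]
kept = gold_sanity_gate(abs_task, generated)

hardened_abs = Task(
    name=abs_task.name,
    gold=abs_task.gold,
    incorrect=abs_task.incorrect,
    tests=abs_task.tests + kept,
)

if __name__ == "__main__":
    print("hackable rate before:", hackable_rate([abs_task, square_task]))
    print("generated tests kept:", len(kept), "of", len(generated))
    print("abs hackable after hardening:", is_hackable(hardened_abs))
```

In this toy setup the gate discards the defective generated test and the remaining one closes the loophole. A real audit would also need a way to obtain incorrect-but-plausible solutions, which this sketch simply takes as given.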
rl, reward hacking, auditing