Signed Compression Progress on a Sealed Audit is Goodhart-Resistant
The article presents a formal proof that "signed compression progress" as an intrinsic motivation mechanism in reinforcement learning is resistant to manipulation, specifically in the context of sealed audits. It introduces a method where the intrinsic reward is based on the signed decrease of a sealed-audit loss, ensuring that cumulative rewards correlate with actual performance improvements, while also identifying failure modes that could undermine this guarantee. The findings are supported by experiments using Lean 4 and ARC-TGI grid-transformation generators, demonstrating that the approach effectively mitigates issues like clip-farming and stream leakage, making it relevant for practitioners focused on robust reinforcement learning frameworks.