One Token to Fool LLM-as-a-Judge
The paper shows that generative reward models, that is, large language models (LLMs) used as automated judges, are vulnerable to reward hacking: superficial inputs, termed "master keys," such as a lone punctuation mark or a generic reasoning opener, can elicit false positive rewards even though they contain no actual answer. The research identifies this issue across various models, including GPT-o1 and Claude-4, and proposes a data augmentation strategy that leverages truncated model outputs to train robust Master Reward Models (Master-RMs), which resist these attacks better while still performing well in standard evaluations. For practitioners, the work highlights critical flaws in LLM-based evaluation and provides a framework for building more reliable reward systems in AI applications.
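The two ideas in the summary, measuring false positives under a master key and augmenting training data with truncated outputs, can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the paper's implementation: the `JudgeExample` record, the `"YES"`/`"NO"` label strings, the first-sentence truncation rule, the choice to label truncated outputs as negatives, and the `judge` callable signature are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class JudgeExample:
    """One training or evaluation record for a judge (hypothetical schema)."""
    question: str
    reference: str
    response: str
    label: str  # "YES" if the response matches the reference, else "NO"


def first_sentence(text: str) -> str:
    """Return the text up to the first '.', '!', '?' or ':' that is followed
    by whitespace or end of string. Assumed truncation rule, for illustration."""
    match = re.search(r"[.!?:](\s|$)", text)
    return text[: match.end()].strip() if match else text.strip()


def augment_with_truncations(examples: List[JudgeExample]) -> List[JudgeExample]:
    """Add one negative example per record by truncating its response.

    Assumption: a truncated opener carries no answer, so it is labeled "NO".
    """
    augmented = list(examples)
    for ex in examples:
        opener = first_sentence(ex.response)
        # Skip responses that are already a single sentence.
        if opener and opener != ex.response.strip():
            augmented.append(
                JudgeExample(ex.question, ex.reference, opener, "NO")
            )
    return augmented


def false_positive_rate(
    judge: Callable[[str, str, str], str],
    examples: List[JudgeExample],
    master_key: str,
) -> float:
    """Fraction of questions for which the judge accepts the master key
    itself as a correct response."""
    if not examples:
        return 0.0
    hits = sum(
        judge(ex.question, ex.reference, master_key) == "YES" for ex in examples
    )
    return hits / len(examples)
```

In this sketch, `judge` stands in for any function that takes a question, a reference answer, and a candidate response and returns a label. A robust judge should yield a false positive rate near zero for an input such as `":"`, since that input contains no answer at all.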