Safety · arXiv cs.CL · 7 d ago

One Token to Fool LLM-as-a-Judge

The paper shows that generative reward models, that is, large language models (LLMs) used as automated judges, are vulnerable to reward hacking: superficial inputs, termed "master keys," can elicit false positive rewards. The authors observe the issue across a range of models, including GPT-o1 and Claude-4, and propose a data augmentation strategy that uses truncated model outputs to train more robust Master Reward Models (Master-RMs). These models resist the attacks better while still performing well on standard evaluations. For practitioners, the work exposes a critical flaw in LLM-based evaluation and offers an approach for building more reliable reward systems.
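To make the two ideas in the summary concrete, here is a minimal Python sketch under stated assumptions. It is not the paper's implementation. The names `JudgeExample`, `first_sentence`, `augment_with_truncations`, `false_positive_rate`, and `naive_judge` are hypothetical, the strings in `CANDIDATE_KEYS` are illustrative picks for this sketch, and the rule "truncate to the first sentence and label it incorrect" is one assumed reading of "truncated model outputs."

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class JudgeExample:
    """One judged sample (hypothetical schema for this sketch)."""
    question: str
    reference: str
    response: str
    label: bool  # True if the response should be judged correct


# Illustrative answer-free inputs; assumed for this sketch only.
CANDIDATE_KEYS = [
    ":",
    ".",
    "Thought process:",
    "Let's solve this problem step by step.",
]

# A judge maps (question, reference, response) to an accept/reject verdict.
Judge = Callable[[str, str, str], bool]


def first_sentence(text: str) -> str:
    """Return the text up to and including the first '.', '!', '?' or ':'."""
    match = re.search(r"[.!?:](\s|$)", text)
    return text[: match.end()].strip() if match else text.strip()


def augment_with_truncations(examples: Iterable[JudgeExample]) -> List[JudgeExample]:
    """Append a negative example built from each response's opening sentence."""
    originals = list(examples)
    augmented = list(originals)
    for ex in originals:
        opener = first_sentence(ex.response)
        # Skip one-sentence responses: truncation would leave them unchanged,
        # so relabeling them as incorrect would corrupt the data.
        if opener and opener != ex.response.strip():
            augmented.append(
                JudgeExample(ex.question, ex.reference, opener, label=False)
            )
    return augmented


def false_positive_rate(
    judge: Judge, examples: Iterable[JudgeExample], keys: List[str]
) -> float:
    """Fraction of (example, key) pairs where the judge accepts a bare key."""
    trials = 0
    accepted = 0
    for ex in examples:
        for key in keys:
            trials += 1
            if judge(ex.question, ex.reference, key):
                accepted += 1
    return accepted / trials if trials else 0.0


def naive_judge(question: str, reference: str, response: str) -> bool:
    """Deliberately flawed stand-in judge: fooled by anything ending in ':'."""
    return reference in response or response.endswith(":")
```

In practice `naive_judge` would be replaced by a call to a real reward model. A nonzero `false_positive_rate` on answer-free keys would indicate the kind of weakness the summary describes, and the output of `augment_with_truncations` could serve as extra training data for a more robust judge.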

llm · reward-hacking · evaluation
