Training
Uncertainty-Aware Reward Modeling for Stable RLHF
The article presents Uncertainty-Aware Reward Modeling (UARM), an approach to reinforcement learning from human feedback (RLHF) that addresses a limitation of deterministic reward models: they output point estimates and are susceptible to unreliable predictions. UARM uses quantile-based conformal prediction to obtain calibrated uncertainty estimates, and a heteroscedastic variance decomposition to reweight advantages in group-based policy optimization, building on the GRPO framework. Experimental results indicate that UARM improves reward model calibration and reduces reward hacking, both of which matter for robust alignment of large language models across diverse responses.
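The two ingredients named in the summary can be illustrated with a small sketch. This is not the article's implementation: it assumes a plain split-conformal quantile over absolute calibration residuals (the article's quantile-based variant may differ), and it assumes a simple inverse-variance weight `1 / (1 + variance)` applied to GRPO-style group-normalized advantages. The function names `conformal_quantile` and `uncertainty_weighted_advantages` are hypothetical.

```python
import math
from statistics import mean, pstdev


def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal half-width from absolute calibration residuals.

    Assumption: residuals are |true_reward - predicted_reward| on a held-out
    calibration set. The rank uses the finite-sample correction
    ceil((n + 1) * (1 - alpha)); it is clipped to n here for simplicity,
    whereas a strict treatment would return an infinite width in that case.
    """
    n = len(residuals)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(residuals)[k - 1]


def uncertainty_weighted_advantages(rewards, variances, eps=1e-8):
    """Group-normalized advantages, down-weighted by predicted variance.

    Assumption: `rewards` and `variances` are the reward model's predicted
    mean and variance for each response sampled for the same prompt.
    The weight 1 / (1 + variance) is an illustrative choice, not the
    article's decomposition.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # GRPO-style normalization within the group of sampled responses.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Responses with uncertain rewards contribute less to the policy update.
    weights = [1.0 / (1.0 + v) for v in variances]
    return [w * a for w, a in zip(weights, advantages)]


if __name__ == "__main__":
    calibration_residuals = [0.1, 0.4, 0.2, 0.3, 0.5, 0.25, 0.15, 0.35, 0.45]
    half_width = conformal_quantile(calibration_residuals, alpha=0.5)
    print("interval half-width:", half_width)

    group_rewards = [1.0, 2.0, 3.0]
    group_variances = [0.0, 0.0, 1.0]
    print(uncertainty_weighted_advantages(group_rewards, group_variances))
```

In this sketch, a prediction interval for a new response would be the predicted reward plus or minus the returned half-width. The third response in the usage example has the highest reward but also the highest predicted variance, so its advantage is halved relative to an equally distant but confident response.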
reinforcement learning, human feedback, reward modeling