Training · arXiv cs.AI · 7 d ago

A Unifying Lens on Reward Uncertainty in RLHF

The paper addresses reward uncertainty in Reinforcement Learning from Human Feedback (RLHF) by introducing a distributional reward model \( p(r \mid x, y) \). Under KL-regularized RLHF, it derives a closed-form effective reward that unifies existing heuristics for reward model aggregation, such as mean aggregation and worst-case optimization. The framework gives practitioners a more principled way to account for reward uncertainty and to mitigate reward hacking when building robust RLHF systems.
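To make the two named heuristics concrete, here is a minimal sketch of how mean aggregation and worst-case optimization can sit at the two ends of a single family. The paper's actual closed form is not given in this summary, so the entropic (soft-min) aggregate used here is an assumption chosen for illustration, not the authors' formula. The function name `entropic_reward`, the risk parameter `lam`, and the sample values are all hypothetical.

```python
import numpy as np

def entropic_reward(samples, lam):
    """Aggregate reward samples as -(1/lam) * log(mean(exp(-lam * r))).

    ASSUMPTION: this soft-min form is an illustrative stand-in, not the
    paper's derived effective reward. lam = 0 gives the mean; large lam
    approaches the worst-case (minimum) sample.
    """
    r = np.asarray(samples, dtype=float)
    if lam == 0.0:
        return float(r.mean())
    z = -lam * r
    m = z.max()  # shift by the max for a numerically stable log-mean-exp
    log_mean_exp = m + np.log(np.mean(np.exp(z - m)))
    return float(-log_mean_exp / lam)

# Hypothetical reward samples for one (x, y) pair, e.g. from an ensemble.
samples = np.array([0.2, 0.9, 1.4, 2.1])

mean_r = entropic_reward(samples, 0.0)    # mean aggregation
soft_r = entropic_reward(samples, 1.0)    # intermediate risk aversion
worst_r = entropic_reward(samples, 1e3)   # close to the worst-case sample

print(mean_r, soft_r, worst_r)
```

As `lam` grows, the aggregate moves monotonically from the sample mean toward the minimum, which is one way a single parameter can interpolate between the two heuristics.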

rlhf · reward uncertainty · optimization · relevance 0.00 · engagement 0.00
