Training
Boosting Direct Preference Optimization with Penalization
The paper introduces Direct Preference Optimization with Penalization (DPOP), an extension of Direct Preference Optimization (DPO) that incorporates a gated penalty so that offline preference optimization more reliably favors the preferred response. On the AlpacaEval 2.0 benchmark, DPOP outperforms DPO, SimPO, and AlphaDPO, with relative win-rate gains of 5.3% for Llama-3-8b-it and 4.4% for Gemma-2-9b-it. The method uses additional signals from the reference model's outputs, which makes it relevant to practitioners who tune models on offline preference data.
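To make the idea of a gated penalty concrete, here is a minimal sketch in plain Python. The `dpo_loss` function follows the standard DPO objective for a single preference pair. The `gated_penalty_loss` function is only an illustration: the gate used here (a penalty that is active only when the policy assigns the preferred response a lower log-probability than the reference model does) is an assumption, not the formulation from the paper, and the names `pol_w`, `pol_l`, `ref_w`, `ref_l`, `beta`, and `lam` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are sequence log-probabilities: pol_* under the policy,
    ref_* under the frozen reference model; *_w is the preferred
    response and *_l the rejected one.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def gated_penalty_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
                       beta: float = 0.1, lam: float = 1.0) -> float:
    """DPO loss with an ASSUMED gated penalty (illustrative only).

    The gate max(0, ref_w - pol_w) is zero while the policy keeps the
    preferred response at least as likely as the reference does, and
    grows once the policy drops below the reference.
    """
    penalty = max(0.0, ref_w - pol_w)  # hypothetical gate, not from the paper
    margin = (pol_w - ref_w) - (pol_l - ref_l) - lam * penalty
    return -log_sigmoid(beta * margin)
```

With this gate the two losses coincide whenever `pol_w >= ref_w`. They differ only in the case plain DPO can overlook: the margin between the two responses grows even though the preferred response itself has become less likely than under the reference model.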
preference-optimization reinforcement-learning llm