Research · arXiv cs.AI · 21 h ago

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

This survey covers Direct Preference Optimization (DPO), a reinforcement-learning-free alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning policy models with human preferences. It categorizes recent DPO research, covering theoretical analyses, variants, relevant datasets, and applications, and identifies future research directions for model alignment. For practitioners, it consolidates existing knowledge on preference-based alignment of LLMs into a single reference.
