Training
Autoregressive Direct Preference Optimization
The paper introduces Autoregressive Direct Preference Optimization (ADPO), a new formulation of direct preference optimization (DPO) that builds autoregressive assumptions into preference alignment for large language models (LLMs). ADPO reformulates the DPO objective by moving the summation outside the log-sigmoid function. The paper also distinguishes two length measures, token length and feedback length, that affect the design of DPO-based algorithms. For practitioners, the aim is closer alignment of LLMs with human preferences, which may improve model behavior in practical applications.
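To make "moving the summation outside the log-sigmoid" concrete, here is a minimal sketch in plain Python. It is not the paper's exact objective, which this summary does not spell out. The function names, the `beta` default, and the requirement that the chosen and rejected responses have equal length are all assumptions made for illustration. Each input is a list of per-token log-ratios, log π(token) − log π_ref(token).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(chosen_logratios, rejected_logratios, beta=0.1):
    """Standard DPO for one preference pair: sum inside the log-sigmoid.

    The per-token log-ratios are summed into one sequence-level margin,
    and log-sigmoid is applied once to that margin.
    """
    margin = sum(chosen_logratios) - sum(rejected_logratios)
    return -log_sigmoid(beta * margin)


def sum_outside_loss(chosen_logratios, rejected_logratios, beta=0.1):
    """Illustrative variant: sum outside the log-sigmoid.

    Log-sigmoid is applied to each per-step margin, and the results are
    summed. ASSUMPTION: both responses have the same number of steps;
    this sketch does not model how the paper treats unequal lengths.
    """
    if len(chosen_logratios) != len(rejected_logratios):
        raise ValueError("this sketch assumes equal-length responses")
    return -sum(
        log_sigmoid(beta * (c - r))
        for c, r in zip(chosen_logratios, rejected_logratios)
    )
```

With a single step the two functions coincide. With several steps they differ because log-sigmoid is nonlinear. For example, when every log-ratio is zero, `dpo_loss` returns log 2 regardless of length, while `sum_outside_loss` returns log 2 per step. That length dependence is one reason the choice of length measure matters when designing such an objective.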
llm, preference-optimization