Research
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
The paper introduces Token-level Bregman Preference Optimization (TBPO), which extends Direct Preference Optimization (DPO) by modeling preferences at the token level through a Bregman-divergence density-ratio matching objective. It comes in two variants: TBPO-Q, which adds a lightweight state baseline, and TBPO-A, which applies advantage normalization. Both are reported to improve alignment quality, training stability, and output diversity across various benchmarks. For practitioners, the method offers finer-grained preference optimization for language models than sequence-level DPO, which may help on tasks that require nuanced understanding and generation.
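For readers who want a concrete reference point, the sketch below computes the standard sequence-level DPO loss for one preference pair from per-token log-probabilities. It is not the TBPO objective, which the summary above does not spell out. It only exposes the per-token log-ratios between policy and reference model, the quantities a token-level method would work with directly instead of summing them. The function names, the example log-probabilities, and the value `beta=0.1` are illustrative assumptions.

```python
import math


def token_log_ratios(policy_logps, ref_logps):
    """Per-token log pi_theta(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    return [p - r for p, r in zip(policy_logps, ref_logps)]


def dpo_loss(chosen_ratios, rejected_ratios, beta=0.1):
    """Standard sequence-level DPO loss for a single preference pair.

    DPO sums the token log-ratios of each response, so all token-level
    detail collapses into one scalar margin per pair.
    """
    margin = beta * (sum(chosen_ratios) - sum(rejected_ratios))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities for a chosen and a rejected response.
chosen = token_log_ratios([-1.0, -0.5, -2.0], [-1.2, -0.9, -2.1])
rejected = token_log_ratios([-0.8, -1.5], [-0.7, -1.1])
loss = dpo_loss(chosen, rejected, beta=0.1)
```

When the policy equals the reference model, every log-ratio is zero and the loss is log 2. It falls below that value once the policy favors the chosen response more strongly than the reference does.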
preference-optimization, llm, token-level