Research · arXiv cs.AI · 9 d ago

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

The paper introduces Token-level Bregman Preference Optimization (TBPO), a method that extends Direct Preference Optimization (DPO) by modeling preferences at the token level through a Bregman-divergence density-ratio matching objective. TBPO comes in two variants: TBPO-Q, which incorporates a lightweight state baseline, and TBPO-A, which uses advantage normalization. Both are reported to improve alignment quality, training stability, and output diversity across various benchmarks. For practitioners, the method offers a finer-grained alternative to sequence-level preference optimization, which may help on tasks where the quality of a response depends on individual token choices.
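For context, the sequence-level DPO loss that TBPO is described as extending can be sketched as below. This is the standard DPO objective for a single preference pair, not the paper's TBPO objective; the function name `dpo_loss`, the list-of-log-probabilities input format, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities of the chosen
    or rejected response under the policy or the frozen reference model.
    """
    # A sequence log-ratio is the sum of its per-token log-ratios.
    chosen_logratio = sum(policy_chosen_logps) - sum(ref_chosen_logps)
    rejected_logratio = sum(policy_rejected_logps) - sum(ref_rejected_logps)

    # One scalar margin for the whole response pair.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss([-1.0, -2.0], [-1.5], [-1.0, -2.0], [-1.5]))
```

Note that every token's log-ratio is folded into a single margin per response pair. That sequence-level granularity is what a token-level method such as TBPO aims to refine.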

preference-optimization · llm · token-level
