Training · arXiv cs.CL · 7 d ago

TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

The paper introduces Token-Adaptive Barrier Preference Optimization (TAB-PO), a post-SFT preference objective for token-critical structured generation, where the correctness of an output hinges on a small number of tokens. TAB-PO adds a confidence-gated, token-level barrier aimed at two problems the authors associate with standard Direct Preference Optimization (DPO): gradient dilution and token erosion. On the SciERC scientific information-extraction task, using Llama and Qwen models from 1.5B to 70B parameters, TAB-PO improved ontology-critical metrics by an average of 11.59% over standard SFT and outperformed existing DPO variants and leading models by 14.71%, which makes it relevant to practitioners working on structured prediction.
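For context on the baseline TAB-PO modifies, here is a minimal sketch of the standard sequence-level DPO loss computed from per-token log-probabilities. It is not TAB-PO: the summary does not give the barrier's formula, so none is attempted here. It does show one plausible reading of "gradient dilution": every token in the chosen response receives the same scalar gradient, so an ontology-critical label token gets no more signal than a formatting token. The function names and toy log-probabilities are hypothetical illustrations.

```python
import math


def softplus(x: float) -> float:
    """log(1 + e^x), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def log_ratio(policy_logps, ref_logps):
    """Sequence log-ratio: sum over tokens of log pi(y_t) - log pi_ref(y_t)."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def dpo_loss_and_token_grads(chosen_pol, chosen_ref,
                             rejected_pol, rejected_ref, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Also returns d(loss)/d(log pi(y_t)) for each token of the chosen response.
    """
    margin = beta * (log_ratio(chosen_pol, chosen_ref)
                     - log_ratio(rejected_pol, rejected_ref))
    loss = softplus(-margin)  # equals -log(sigmoid(margin))
    # d(loss)/d(margin) = -sigmoid(-margin), and d(margin)/d(log pi(y_t)) = beta
    # for every chosen token, so all chosen tokens share one gradient value.
    g = -beta * sigmoid(-margin)
    return loss, [g] * len(chosen_pol)
```

In a toy pair where only the third token (imagined as a relation label) differs between the chosen and rejected outputs, and the policy still equals the reference, the loss is log 2 and each of the four chosen tokens receives a gradient of -0.05. The sequence-level objective does not single out the critical token; that gap is the kind a token-level mechanism would target.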

llm · preference-optimization · dpo
