Training · arXiv cs.AI · 14 d ago

Which Pairs to Compare for LLM Post-Training?

The paper studies which comparison pairs to label in preference-based post-training of language models. It frames pair selection as a sampling-design problem and gives a framework for Direct Preference Optimization (DPO) that links the choice of labeled pairs to downstream policy performance through a design-dependent information matrix. The findings indicate that strategic pair selection can improve sample efficiency, giving practitioners a practical way to get more out of their labeling budgets in alignment tasks.
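To make the idea of a design-dependent information matrix concrete, here is a minimal sketch. It is not the paper's method. It assumes a simple Bradley-Terry (logistic) preference model over hypothetical 2-D feature differences, and it uses a generic greedy D-optimal rule (maximize the determinant of the accumulated Fisher information) as a stand-in for "strategic" pair selection. All names (`pair_information`, `greedy_d_optimal`, `ridge`) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_information(diff, theta):
    """Fisher information of one comparison under an assumed logistic model.

    diff  : feature difference between the two responses (hypothetical, 2-D)
    theta : current parameter estimate
    """
    p = sigmoid(sum(t * d for t, d in zip(theta, diff)))
    w = p * (1.0 - p)  # label variance; largest when the pair is a coin flip
    return [[w * a * b for b in diff] for a in diff]

def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def greedy_d_optimal(diffs, theta, budget, ridge=1e-3):
    """Greedily pick pairs that most increase det of the information matrix."""
    info = [[ridge, 0.0], [0.0, ridge]]  # small ridge keeps the determinant nonzero
    chosen, remaining = [], list(range(len(diffs)))
    for _ in range(min(budget, len(remaining))):
        best = max(
            remaining,
            key=lambda i: det2(add(info, pair_information(diffs[i], theta))),
        )
        chosen.append(best)
        remaining.remove(best)
        info = add(info, pair_information(diffs[best], theta))
    return chosen, info

# Three candidate pairs; the second is nearly redundant with the first.
diffs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
chosen, info = greedy_d_optimal(diffs, theta=(0.0, 0.0), budget=2)
print(sorted(chosen))
```

With a budget of two labels, the sketch skips the near-duplicate pair and selects the two pairs that probe different directions, which is the intuition behind choosing comparisons by the information they add and not at random.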

llm · post-training · preference
