Training
Which Pairs to Compare for LLM Post-Training?
The paper studies how to choose which comparison pairs to label in preference-based post-training of language models. It treats pair selection as a sampling-design problem and gives a framework for Direct Preference Optimization (DPO) in which a design-dependent information matrix links the choice of labeled pairs to downstream policy performance. The findings indicate that selecting comparisons strategically improves sample efficiency, giving practitioners a practical way to get more out of a fixed labeling budget in alignment tasks.
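To make the idea of a design-dependent information matrix concrete, here is a minimal sketch. It is not the paper's method, which this summary does not spell out. It assumes a Bradley-Terry preference model with a reward that is linear in a feature difference between the two responses, and it uses a generic greedy D-optimal rule (maximize the log-determinant of the accumulated Fisher information). The names `pair_weight` and `greedy_d_optimal`, the `ridge` parameter, and the feature setup are all hypothetical.

```python
import numpy as np


def pair_weight(diff, theta):
    """Bradley-Terry variance term p * (1 - p) for one candidate pair.

    `diff` is the feature difference between the two responses (assumed setup).
    """
    p = 1.0 / (1.0 + np.exp(-float(diff @ theta)))
    return p * (1.0 - p)


def greedy_d_optimal(diffs, theta, budget, ridge=1e-3):
    """Greedily pick `budget` pairs that maximize log det of the information matrix.

    The information matrix is ridge * I plus the sum of w_i * x_i x_i^T over
    the chosen pairs, so it depends on the design (which pairs get labeled).
    """
    diffs = np.asarray(diffs, dtype=float)
    dim = diffs.shape[1]
    info = ridge * np.eye(dim)  # small ridge keeps the matrix invertible
    remaining = list(range(len(diffs)))
    chosen = []
    for _ in range(min(budget, len(remaining))):
        inv = np.linalg.inv(info)
        best, best_gain = None, -np.inf
        for i in remaining:
            x = diffs[i]
            w = pair_weight(x, theta)
            # Matrix determinant lemma: the log-det gain of adding w * x x^T.
            gain = np.log1p(w * float(x @ inv @ x))
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        remaining.remove(best)
        x = diffs[best]
        info = info + pair_weight(x, theta) * np.outer(x, x)
    return chosen, info
```

On a toy pool with `diffs = [[1, 0], [1, 0], [0, 1]]`, `theta = np.zeros(2)`, and `budget=2`, the second pick is the orthogonal pair rather than the duplicate, because a repeated direction adds little new information. This is the sense in which the choice of pairs, not only their number, shapes what the labels can teach.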
llm, post-training, preference