Training · arXiv cs.AI · 15 d ago

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

The study examines how sequential Direct Preference Optimization (DPO) affects language models, using Llama-3.1-8B-Instruct with LoRA adapters. It finds that sequential training does not cause uniform forgetting of earlier-learned preferences; instead, the effect depends on the relationship between objectives, signal strength, and training order. The results suggest that alignment pipelines should account for objective compatibility and signal strength, and that practitioners planning multi-objective training should not assume a later stage will necessarily degrade previously acquired preferences.

preference optimization · language models · sequential training
