MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
The paper introduces MixSD, a method for knowledge injection in language models that avoids the degradation of pretrained capabilities associated with traditional supervised fine-tuning (SFT). MixSD dynamically mixes tokens from two conditionals of the base model: an expert conditional that incorporates the injected knowledge, and a naive conditional that reflects the model's original distribution. Because the resulting supervision stays aligned with the model's native generation distribution, the approach mitigates catastrophic forgetting. MixSD achieves a better memorization-retention trade-off than SFT, retaining up to 100% of the base model's capabilities while reaching near-perfect training accuracy. This matters for practitioners who want to add knowledge to a language model without sacrificing its original reasoning abilities.
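
The token-mixing idea can be illustrated with a toy sketch. This summary does not state the paper's actual mixing rule, so the fixed per-token probability `p_expert`, the lookup-table `toy_conditional`, and the function `mixed_rollout` are all hypothetical stand-ins, not MixSD's implementation. The sketch only shows the general shape: at each position, the next token is drawn either from the expert conditional (the model that sees the new fact) or from the naive conditional (the model that does not), and the mixed sequence would then presumably serve as the fine-tuning target.

```python
import random

def toy_conditional(prefix, knows_fact):
    """Toy next-token distribution standing in for the base model.

    knows_fact=True  mimics the expert conditional (new fact available).
    knows_fact=False mimics the naive conditional (original distribution).
    """
    template = ["The", "capital", "is"]
    pos = len(prefix)
    if pos < len(template):
        return {template[pos]: 1.0}
    if pos == len(template):
        # The two conditionals differ only where the new fact matters.
        if knows_fact:
            return {"Paris": 0.95, "Lyon": 0.05}
        return {"Paris": 0.2, "Lyon": 0.8}
    return {"<eos>": 1.0}

def sample(dist, rng):
    """Draw one token from a {token: probability} dict."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def mixed_rollout(p_expert, rng, max_len=8):
    """Build one sequence, choosing the source conditional per token.

    ASSUMPTION: a fixed probability p_expert picks the expert at each step;
    the paper's real mixing schedule may differ.
    """
    prefix, sources = [], []
    while len(prefix) < max_len:
        use_expert = rng.random() < p_expert
        dist = toy_conditional(prefix, knows_fact=use_expert)
        token = sample(dist, rng)
        prefix.append(token)
        sources.append("expert" if use_expert else "naive")
        if token == "<eos>":
            break
    return prefix, sources

if __name__ == "__main__":
    tokens, sources = mixed_rollout(p_expert=0.5, rng=random.Random(0))
    print(list(zip(tokens, sources)))
```

Setting `p_expert=1.0` reduces the sketch to sampling purely from the expert conditional, and `p_expert=0.0` to sampling purely from the naive one; intermediate values interpolate between injecting the fact and staying on the model's original distribution.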