Training
PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning
The article presents Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teacher models, which have no router, into mixture-of-experts (MoE) students. PADD is a four-stage process whose stages include teacher neuron clustering, online adaptive distillation, and reward-augmented load balancing. The article reports significant performance improvements on mathematical reasoning benchmarks at unchanged inference cost. The approach targets practitioners who want to increase model capacity and efficiency within a fixed computational budget, since it lets MoE architectures draw on knowledge from larger dense models.
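To make the teacher neuron clustering stage concrete, here is a minimal sketch of one way a dense feed-forward layer could be partitioned into expert-sized groups. The summary does not say how PADD clusters neurons, so everything here is an assumption for illustration: plain k-means over the rows of the input weight matrix, a ReLU activation, and the hypothetical helper names `cluster_neurons` and `split_ffn`.

```python
import numpy as np

def cluster_neurons(w_in, num_experts, iters=20, seed=0):
    """Group FFN hidden neurons into clusters with plain k-means.

    Assumption: neurons are compared by their input weights only.
    w_in has shape (hidden, d_model); row i belongs to neuron i.
    Returns an int array of shape (hidden,) with one cluster id per neuron.
    """
    rng = np.random.default_rng(seed)
    hidden = w_in.shape[0]
    # Initialize centroids from randomly chosen distinct neurons.
    picks = rng.choice(hidden, size=num_experts, replace=False)
    centroids = w_in[picks].copy()
    assign = np.zeros(hidden, dtype=np.int64)
    for _ in range(iters):
        # Squared Euclidean distance from every neuron to every centroid.
        diffs = w_in[:, None, :] - centroids[None, :, :]
        assign = (diffs ** 2).sum(axis=-1).argmin(axis=1)
        for k in range(num_experts):
            members = w_in[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return assign

def split_ffn(w_in, w_out, assign, num_experts):
    """Slice dense FFN weights into per-expert (w_in, w_out) pairs.

    w_out has shape (d_model, hidden); column i belongs to neuron i.
    """
    experts = []
    for k in range(num_experts):
        idx = np.where(assign == k)[0]
        experts.append((w_in[idx], w_out[:, idx]))
    return experts

# Toy dense layer: 32 hidden neurons, model width 8, split into 4 groups.
rng = np.random.default_rng(1)
w_in = rng.normal(size=(32, 8))
w_out = rng.normal(size=(8, 32))
assign = cluster_neurons(w_in, num_experts=4)
experts = split_ffn(w_in, w_out, assign, num_experts=4)

# With every expert active, the split layer matches the dense layer.
x = rng.normal(size=8)
dense_out = w_out @ np.maximum(w_in @ x, 0.0)
moe_out = sum(wo @ np.maximum(wi @ x, 0.0) for wi, wo in experts)
```

Because the activation is elementwise, summing all expert outputs reproduces the dense output exactly, so the partition itself loses nothing. Any approximation error appears only once a router activates a subset of the groups, which is where a distillation signal from the teacher would come in.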
distillation, moe, llm