Research
A Survey of On-Policy Distillation for Large Language Models
This survey covers On-Policy Distillation (OPD), a family of methods for transferring knowledge from large language models (LLMs) to smaller student models. It critiques traditional static imitation, where exposure bias worsens as sequences grow longer, and contrasts it with an iterative correction process in which the teacher provides feedback on outputs the student itself generates. The survey formalizes OPD as f-divergence minimization and organizes the field around three axes: optimization strategies, signal sources, and training stabilization. It also discusses open challenges and future research directions relevant to practitioners who distill and deploy models.
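To make the f-divergence framing concrete, here is a minimal sketch, not the survey's own formulation or notation. It assumes the standard definition D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) with P the teacher and Q the student, and averages the per-token divergence over a sequence sampled from the student. The names `toy_teacher`, `toy_student`, and `on_policy_loss` are hypothetical stand-ins for real models and a real training objective.

```python
import math
import random

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete p, q with q(x) > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def forward_kl_generator(t):
    # f(t) = t log t recovers KL(P || Q), i.e. KL(teacher || student).
    return t * math.log(t) if t > 0 else 0.0

def reverse_kl_generator(t):
    # f(t) = -log t recovers KL(Q || P), i.e. KL(student || teacher).
    return -math.log(t)

def toy_teacher(prefix):
    # Hypothetical teacher over a 3-token vocabulary: favors (last token + 1) mod 3.
    probs = [0.1, 0.1, 0.1]
    target = (prefix[-1] + 1) % 3 if prefix else 0
    probs[target] += 0.7
    return probs

def toy_student(prefix):
    # Hypothetical untrained student: ignores the prefix entirely.
    return [0.4, 0.3, 0.3]

def on_policy_loss(student, teacher, f, length, rng):
    """Average per-token D_f(teacher || student) along a student-sampled sequence."""
    prefix, total = [], 0.0
    for _ in range(length):
        q = student(prefix)   # student's next-token distribution
        p = teacher(prefix)   # teacher's feedback on the same student-made prefix
        total += f_divergence(p, q, f)
        # On-policy step: the next token is drawn from the student, not the teacher.
        prefix.append(rng.choices(range(len(q)), weights=q)[0])
    return total / length

if __name__ == "__main__":
    rng = random.Random(0)
    print("forward KL:", on_policy_loss(toy_student, toy_teacher, forward_kl_generator, 8, rng))
    rng = random.Random(0)
    print("reverse KL:", on_policy_loss(toy_student, toy_teacher, reverse_kl_generator, 8, rng))
```

The only difference from static imitation in this sketch is the sampling line: the prefixes the teacher scores come from the student's own distribution, so the student is corrected on the states it actually visits. Swapping the generator function `f` changes the divergence without touching the rest of the loop.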
knowledge-distillation, llm, on-policy