Training
Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation
The paper introduces Anchored Residual On-Policy Distillation (AR-OPD), a framework for on-policy distillation of large language models that separates privileged information into locally reachable reasoning steps and future-conditioned signals. Compared with standard privileged on-policy distillation, AR-OPD is reported to improve performance by 2.3 points and reduce hindsight leakage by 21.7%, particularly on long-horizon tasks exceeding 768 tokens. For practitioners, the method addresses the reasoning and capacity gaps that arise between student and teacher models during training.
llm, on-policy distillation, training