Training
PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation
The paper introduces PowerOPD, an on-policy distillation (OPD) method that uses bounded, sign-consistent rewards derived from the Box-Cox power transformation to address the training inefficiency and instability observed in standard OPD. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD improves Avg@8/Pass@8 by up to +6.37 over standard OPD while reducing wall-clock time by 59.2% and peak GPU memory usage by 23.1%. The method is relevant to practitioners who want more stable training dynamics and better model performance without additional computational cost.
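To make "bounded, sign-consistent" concrete, here is a minimal sketch of the textbook Box-Cox transform applied to a per-token teacher/student probability ratio. It is an illustration under assumptions, not the paper's implementation: the summary does not give PowerOPD's exact reward, so the function names `log_ratio_reward` and `power_reward`, the choice of the probability ratio as input, and the default `lam=0.5` are all hypothetical. The sketch only shows two properties of the standard transform: for `lam > 0` the value never drops below `-1/lam`, and its sign always matches the sign of the log-ratio.

```python
import math


def log_ratio_reward(logp_teacher: float, logp_student: float) -> float:
    """Unbounded log-ratio signal: log p_teacher - log p_student (assumed baseline)."""
    return logp_teacher - logp_student


def power_reward(logp_teacher: float, logp_student: float, lam: float = 0.5) -> float:
    """Box-Cox transform of the ratio r = p_teacher / p_student (hypothetical helper).

    Standard Box-Cox: (r**lam - 1) / lam for lam != 0, and log(r) for lam == 0.
    Evaluated in log space as expm1(lam * log r) / lam for numerical accuracy.
    """
    log_r = logp_teacher - logp_student
    if abs(lam) < 1e-12:
        # The lam -> 0 limit of Box-Cox is the plain logarithm.
        return log_r
    return math.expm1(lam * log_r) / lam


if __name__ == "__main__":
    # A token the student sampled but the teacher finds very unlikely.
    lt, ls = -20.0, -1.0
    print("log-ratio reward:", log_ratio_reward(lt, ls))
    print("Box-Cox reward (lam=0.5):", power_reward(lt, ls))
```

With `lam = 0.5`, the transformed value stays above `-2` however small the teacher probability becomes, whereas the log-ratio decreases without bound. How PowerOPD selects the exponent, and how it bounds the reward on the positive side, are not stated in this summary.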
on-policy, distillation, llm, training, framework