Training · arXiv cs.AI · 12 d ago

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

The paper introduces PowerOPD, an approach to on-policy distillation (OPD) that uses bounded, sign-consistent rewards derived from the Box-Cox power transformation to address the training inefficiency and instability observed in standard OPD. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD reports gains of up to +6.37 in Avg@8/Pass@8 over standard OPD, while reducing wall-clock time by 59.2% and peak GPU memory usage by 23.1%. The method is relevant to practitioners who want more stable distillation training and better student performance at lower compute cost.
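To make "bounded, sign-consistent" concrete, here is a minimal sketch of the textbook Box-Cox transform applied to a positive ratio. It is not the paper's reward definition: the summary does not give PowerOPD's exact formula, so the function name `box_cox`, the choice of a teacher/student probability ratio as input, and the values of `lam` are all assumptions made for illustration.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Textbook Box-Cox power transform for x > 0.

    Illustrative only: this is the standard transform, not necessarily
    the exact reward used by PowerOPD.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if abs(lam) < 1e-12:
        # The limit of (x**lam - 1) / lam as lam -> 0 is log(x).
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Hypothetical teacher/student probability ratios for a sampled token.
ratios = [0.01, 0.5, 1.0, 2.0, 100.0]

for r in ratios:
    # log(r) is unbounded in both directions; the power transform keeps
    # the same sign as log(r) but is bounded on one side by -1/lam.
    print(r, math.log(r), box_cox(r, 0.5), box_cox(r, -0.5))
```

With `lam > 0` the output is bounded below by `-1/lam`, and with `lam < 0` it is bounded above by `-1/lam`. In both cases the output is negative for ratios below 1, zero at 1, and positive above 1, which is the sign-consistency property the summary refers to.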

on-policy · distillation · llm · training · framework
