Training
When Context Returns: Toward Robust Internalization in On-Policy Distillation
The article studies on-policy distillation, in which a student model internalizes privileged context so that the context is no longer needed at inference. It introduces a lightweight consistency regularizer, built on a stop-gradient and a forward KL divergence, to prevent the degradation that can occur when the context is supplied again. The regularizer improves context-conditioned accuracy across 12 configurations. For practitioners, this means a distilled model performs stably whether or not the original context is present at deployment.
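To make the regularizer idea concrete, here is a minimal pure-Python sketch of a forward-KL consistency term with a stop-gradient. The summary does not say which distributions are compared or which side is frozen, so the pairing below is an assumption: the student's context-free prediction is held fixed as the target, and the student's context-conditioned prediction is pulled toward it. All function names are hypothetical, and the stop-gradient is emulated by returning a zero gradient for the target branch rather than by an autodiff framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(target_probs, pred_probs, eps=1e-12):
    """Forward KL divergence KL(target || pred)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(target_probs, pred_probs)
        if p > 0.0
    )


def consistency_loss_and_grads(logits_no_ctx, logits_with_ctx):
    """Hypothetical consistency regularizer (an assumed form, not the paper's).

    Returns the loss, its gradient w.r.t. the context-conditioned logits,
    and the (zero) gradient w.r.t. the context-free logits.
    """
    # Assumption: the context-free prediction is the stop-gradient target,
    # so it is treated as a constant and receives no gradient.
    target = softmax(logits_no_ctx)
    pred = softmax(logits_with_ctx)
    loss = forward_kl(target, pred)
    # Analytic gradient of KL(target || softmax(z)) w.r.t. z is pred - target.
    grad_with_ctx = [q - p for p, q in zip(target, pred)]
    grad_no_ctx = [0.0] * len(logits_no_ctx)  # blocked by the stop-gradient
    return loss, grad_with_ctx, grad_no_ctx


if __name__ == "__main__":
    no_ctx = [0.0, 1.0, 0.0]     # student logits without the context
    with_ctx = [2.0, 0.0, -1.0]  # student logits with the context restored
    loss, g_ctx, g_no_ctx = consistency_loss_and_grads(no_ctx, with_ctx)
    print("loss:", loss)
    print("grad (with context):", g_ctx)
    print("grad (without context):", g_no_ctx)
```

In a real training loop this term would be added, with some weight, to the distillation loss. That weighting, like the choice of target, is not specified in the summary above.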
distillation context internalization