Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
The paper introduces Adaptive Teacher Exposure for Self-Distillation (ATESD), an approach that optimizes how much a teacher model is exposed to reference reasoning during on-policy self-distillation, with the goal of improving reasoning in large language models (LLMs). ATESD uses a learnable Beta-policy controller to adjust the teacher's exposure to reference reasoning dynamically. With Qwen3 models (1.7B, 4B, and 8B parameters), it improves performance on the AIME 24, AIME 25, and HMMT 25 benchmarks, with reported gains over existing self-distillation and reinforcement learning methods. The work highlights the importance of adaptive exposure strategies when training LLMs and gives practitioners a new mechanism for tuning training to improve reasoning capabilities.
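
To make the idea of a learnable Beta-policy controller more concrete, here is a minimal sketch in Python. It is not the paper's implementation, and it rests on several assumptions that the summary above does not confirm: that "exposure" is a fraction in [0, 1], that the teacher sees that fraction of the reference reasoning steps as a prefix, and that the controller's Beta parameters are updated with a REINFORCE-style score-function step. All function names (digamma, beta_score, expose_reference, controller_step) and all hyperparameters are hypothetical.

```python
import math
import random


def digamma(x: float, h: float = 1e-5) -> float:
    """Approximate digamma(x) as a central difference of log-gamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def beta_score(x: float, alpha: float, beta: float) -> tuple[float, float]:
    """Gradient of log Beta(x; alpha, beta) with respect to (alpha, beta)."""
    x = min(max(x, 1e-6), 1.0 - 1e-6)  # keep both logarithms finite
    common = digamma(alpha + beta)
    grad_alpha = math.log(x) - digamma(alpha) + common
    grad_beta = math.log(1.0 - x) - digamma(beta) + common
    return grad_alpha, grad_beta


def expose_reference(reference_steps: list[str], exposure: float) -> list[str]:
    """Reveal a prefix of the reference steps (assumed masking rule)."""
    if not 0.0 <= exposure <= 1.0:
        raise ValueError("exposure must lie in [0, 1]")
    n_visible = round(exposure * len(reference_steps))
    return reference_steps[:n_visible]


def controller_step(alpha: float, beta: float, exposure: float,
                    advantage: float, lr: float = 0.05,
                    floor: float = 0.1) -> tuple[float, float]:
    """One REINFORCE-style update of the Beta parameters (assumed rule).

    `advantage` stands in for whatever training signal rewards a good
    exposure level; the summary does not say how it is defined.
    """
    grad_alpha, grad_beta = beta_score(exposure, alpha, beta)
    new_alpha = max(floor, alpha + lr * advantage * grad_alpha)
    new_beta = max(floor, beta + lr * advantage * grad_beta)
    return new_alpha, new_beta


if __name__ == "__main__":
    rng = random.Random(0)
    alpha, beta = 1.0, 1.0  # start from a uniform policy over exposure
    steps = ["step 1", "step 2", "step 3", "step 4"]
    exposure = rng.betavariate(alpha, beta)
    visible = expose_reference(steps, exposure)
    # Placeholder advantage; a real one would come from the training loop.
    alpha, beta = controller_step(alpha, beta, exposure, advantage=1.0)
    print(len(visible), alpha, beta)
```

In this sketch, a positive advantage after a high-exposure sample raises alpha and lowers beta, so the policy's mean exposure, alpha / (alpha + beta), moves upward; a negative advantage pushes it the other way. Updating the parameters directly with a lower bound is a simplification chosen for brevity; a practical controller would more likely learn unconstrained parameters and map them to positive values.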