Training
The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning
The article introduces the "Quality-Utility Paradox" in knowledge distillation for Small Language Models (SLMs) on mathematical reasoning tasks. It reports that training data with higher reward scores from powerful models, such as Qwen2.5, LLaMA-3, and DeepSeek, can make the SLM underperform, because that data exhibits distributional drift away from the SLM's native reasoning patterns. To address this, the authors propose "Style-Aligned Refinement," which balances an Oracle's logical improvements against preservation of the SLM's original reasoning distribution, so the refined data is easier for the SLM to adapt to and more useful for training.
knowledge distillation, mathematical reasoning
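
The trade-off described in the summary, reward quality versus closeness to the SLM's own reasoning distribution, can be pictured with a toy selection rule. The sketch below is an illustration only and is not the authors' Style-Aligned Refinement procedure: the `Candidate` type, the `utility` score, and the weight `lam` are all hypothetical names introduced here, and the per-token log-probabilities are assumed to have been computed by the small model beforehand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One candidate solution for a training problem (hypothetical structure)."""
    text: str
    reward: float              # assumed reward-model score, higher is better
    slm_logprobs: List[float]  # assumed per-token log-probs under the small model


def style_alignment(c: Candidate) -> float:
    """Mean token log-probability under the SLM: closer to 0 means more 'native'."""
    if not c.slm_logprobs:
        raise ValueError("candidate has no token log-probabilities")
    return sum(c.slm_logprobs) / len(c.slm_logprobs)


def utility(c: Candidate, lam: float) -> float:
    """Illustrative score: reward plus lam times style alignment (an assumption)."""
    return c.reward + lam * style_alignment(c)


def select(candidates: List[Candidate], lam: float) -> Candidate:
    """Pick the candidate with the highest illustrative utility."""
    return max(candidates, key=lambda c: utility(c, lam))


# Made-up numbers purely for demonstration.
oracle_rewrite = Candidate("oracle rewrite", 0.95, [-3.0, -2.5, -3.5])
light_refinement = Candidate("light refinement", 0.85, [-0.5, -0.7, -0.6])

by_reward_only = select([oracle_rewrite, light_refinement], lam=0.0)
with_style_term = select([oracle_rewrite, light_refinement], lam=0.1)
```

With `lam=0.0` the rule reduces to picking the highest-reward candidate; a positive `lam` lets a slightly lower-reward candidate win when it sits much closer to the small model's own distribution, which is the intuition behind the paradox named in the title.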