Training
Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning
The paper introduces Hindsight Self-Distillation (HSD), a method that improves token-level credit assignment in long reasoning traces of large language models (LLMs) by using successful peer rollouts as a teacher signal. The resulting dense per-token feedback is particularly beneficial for terse-answer tasks. On math and code benchmarks with Qwen3-8B and Qwen3-32B, HSD outperforms existing methods such as GRPO variants and standard on-policy distillation. For practitioners, this offers a more effective way to train LLMs by refining credit assignment during the reasoning phase.
reinforcement learning, credit assignment, llm