Training · arXiv cs.AI · 9 d ago

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

The paper introduces Hindsight Self-Distillation (HSD), a method that improves token-level credit assignment in long LLM reasoning traces by using successful peer rollouts as the teacher signal. This gives the model dense per-token feedback, which is particularly useful on terse-answer tasks. On math and code benchmarks with Qwen3-8B and Qwen3-32B, HSD is reported to outperform GRPO variants and standard on-policy distillation. For practitioners, the appeal is finer-grained credit assignment when training LLMs on reasoning.

reinforcement learning · credit assignment · llm
