Training
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
The article presents Self-Distillation Zero (SD-Zero), a training method that improves sample efficiency by turning binary rewards into dense token-level supervision, without external teachers or high-quality demonstrations. SD-Zero uses a dual-role setup: a Generator produces an initial response, and a Reviser conditions on that response and its binary reward to produce a refined one. With Qwen3-4B-Instruct and Olmo-3-7B-Instruct, the method is reported to improve performance on math and code reasoning benchmarks by at least 10%, outperforming Rejection Fine-Tuning and Self-Distillation Fine-Tuning.
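To make the Generator/Reviser idea concrete, here is a minimal sketch under stated assumptions. The prompt template in `build_reviser_prompt` and the per-token KL divergence in `per_token_kl` are illustrative choices, not details taken from the article: the summary only says that the Reviser conditions on the response and its binary reward, and that the result is dense token-level supervision. The sketch shows one plausible way a single 0/1 reward can end up producing one training signal per token.

```python
import math


def build_reviser_prompt(question: str, response: str, reward: int) -> str:
    """Hypothetical prompt format: the Reviser sees the first attempt
    together with its binary reward (1 = correct, 0 = incorrect)."""
    verdict = "correct" if reward == 1 else "incorrect"
    return (
        f"Question: {question}\n"
        f"Previous attempt: {response}\n"
        f"The previous attempt was {verdict}.\n"
        "Revised answer:"
    )


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at every token position.

    Assumption: the Reviser's next-token distributions act as the teacher
    and the Generator's as the student. Returns one loss value per token,
    which is what makes the supervision dense rather than a single scalar.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses


# Toy usage: 2 token positions over a 3-token vocabulary.
prompt = build_reviser_prompt("What is 2 + 2?", "5", reward=0)
reviser_logits = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]]    # teacher (made-up numbers)
generator_logits = [[0.3, 1.5, 0.1], [0.1, 2.5, 0.4]]  # student (made-up numbers)
token_losses = per_token_kl(reviser_logits, generator_logits)
```

In this toy setup the binary reward enters only through the Reviser's prompt, yet `token_losses` contains one value per token position, so every token of the response receives its own gradient signal.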
llm, self-distillation, reinforcement-learning