TrainingarXiv cs.CL — 7 d ago

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

The article presents Self-Distillation Zero (SD-Zero), a novel training method that enhances sample efficiency by transforming binary rewards into dense token-level supervision without the need for external teachers or high-quality demonstrations. SD-Zero employs a dual-role model architecture, where a Generator produces initial responses and a Reviser conditions on these responses and their binary rewards to refine them. The method shows at least a 10% performance improvement on math and code reasoning benchmarks using Qwen3-4B-Instruct and Olmo-3-7B-Instruct, outperforming existing techniques like Rejection Fine-Tuning and Self-Distillation Fine-Tuning.

llmself-distillationreinforcement-learningrelevance 0.00 · engagement 0.00

Read at source ↗← all news