Training · arXiv cs.CL · 15 d ago

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

The paper introduces Rubric-Conditioned Self-Distillation, a post-training framework for reasoning language models that draws fine-grained feedback from structured rubrics instead of relying solely on expensive chain-of-thought annotations or scalar rewards. Because rubric feedback is more detailed than a single scalar, credit can be assigned at a finer granularity within the reasoning process. On science reasoning benchmarks, the method outperforms GRPO and OPSD by an average of 1.0 and 0.9 points, respectively. For practitioners, the takeaway is that task-specific rubrics can guide self-distillation and improve accuracy with less dependence on costly chain-of-thought annotations.

self-distillation · rubric · feedback
