Research
Constitutional On-Policy Safe Distillation
Constitutional On-Policy Safe Distillation (COPSD) is an on-policy self-distillation method that addresses the collapse observed in safety alignment tasks. It uses a Cross-SFT cold-start to calibrate the teacher and applies constitutional conditioning. Across 12 benchmarks, COPSD improves the safety-helpfulness trade-off while mitigating the reduction in expressiveness typically associated with safety pressure. For practitioners, this means models can remain robust on reasoning tasks while staying safety compliant.
self-distillation, safety