Safety
SHARD: Safe and Helpful Alignment via Self-Reframing Distillation
The paper introduces SHARD, a self-reframing distillation method designed to enhance the safety and helpfulness of large language models (LLMs) when handling sensitive prompts. SHARD rewrites prompts to reveal benign intent, reframes responses to be safer and more helpful, and fine-tunes the model on these self-reframed outputs. Benchmark results on DNA and the English subset of LINGUASAFE indicate improved helpfulness across various model families while maintaining safety, highlighting a potential method for LLMs to internalize beneficial behaviors without relying on larger teacher models.
alignmentsafetyllm