Training
Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning
The paper introduces DualSelect, a framework for coupled task and reference selection that aims to improve the safety of fine-tuned large language models (LLMs) while maintaining task utility. Using entropy-regularized scoring surrogates and gradient correction, DualSelect preserves safety in models ranging from 1B to 8B parameters and improves Safety Avg. by at least 5.10 points over existing baselines. For practitioners, it addresses the problem of safety degrading during adaptation, offering a way to keep safety alignment without sacrificing performance on downstream tasks.
llm fine-tuning safety