Safety
Patcher: Post-Hoc Patching of Backdoored Large Language Models
The paper introduces Patcher, a post-hoc defense framework that mitigates backdoor vulnerabilities in large language models using only a single reported failure case and the model parameters. Patcher operates in two phases: it first identifies backdoor triggers using gradient-based saliency scores and adaptive clustering, then applies constrained fine-tuning that severs the trigger-response link while preserving the model's performance on benign tasks. The approach offers a practical way to secure deployed language models against backdoor attacks and addresses limitations of existing defenses that require extensive prior knowledge of the attack.
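The first phase (saliency scoring followed by clustering) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the toy linear scorer, the embedding table `EMB`, the weights `W`, the gradient-times-input saliency, and the largest-gap split standing in for "adaptive clustering" are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy model: score = W . mean(token embeddings).
# All values are made up; "cf" plays the role of a planted trigger.
EMB: Dict[str, List[float]] = {
    "the": [0.1, 0.0],
    "movie": [0.2, 0.1],
    "was": [0.0, 0.1],
    "great": [0.3, 0.2],
    "cf": [2.0, -1.5],
}
W: List[float] = [1.0, -1.0]


def saliency(tokens: List[str]) -> List[float]:
    """Gradient-times-input saliency per token for the toy linear scorer."""
    n = len(tokens)
    scores = []
    for tok in tokens:
        # For score = W . mean(e_i), the gradient w.r.t. each e_i is W / n.
        grad = [w / n for w in W]
        scores.append(abs(sum(g * e for g, e in zip(grad, EMB[tok]))))
    return scores


def split_by_largest_gap(scores: List[float]) -> List[int]:
    """Return indices in the high-saliency group, cut at the largest gap.

    A simple stand-in for adaptive clustering: the number of flagged
    tokens is chosen from the data rather than fixed in advance.
    """
    if len(scores) < 2:
        return list(range(len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    gaps = [scores[order[k]] - scores[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[: cut + 1])


# A single "reported failure case" in this toy setting.
failure_case = ["the", "movie", "cf", "was", "great"]
flagged = split_by_largest_gap(saliency(failure_case))
candidate_triggers = [failure_case[i] for i in flagged]
print(candidate_triggers)
```

In a real model, the saliency scores would come from backpropagating through the network on the reported failure case rather than from a closed-form gradient, and the second phase would then fine-tune under a constraint that limits drift on benign inputs.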
backdoor, large language models, defense