Safety · arXiv cs.AI · 8 d ago

Patcher: Post-Hoc Patching of Backdoored Large Language Models

The paper introduces Patcher, a post-hoc defense framework that mitigates backdoor vulnerabilities in large language models using only a single reported failure case and the model parameters. Patcher operates in two phases: it first identifies backdoor triggers using gradient-based saliency scores and adaptive clustering, then applies constrained fine-tuning that severs the trigger-response link while preserving the model's performance on benign tasks. The approach offers a practical way to secure deployed language models against backdoor attacks, and it addresses a limitation of existing defenses, which require extensive prior knowledge of the attack.
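To make the first phase more concrete, here is a minimal sketch of how per-token saliency scores could be separated into "suspect" and "benign" groups. It is not the paper's algorithm: the scores are made-up numbers standing in for gradient-based saliency, the largest-gap split is a simple stand-in for the adaptive clustering the summary mentions, and the names `split_by_largest_gap` and `candidate_trigger_tokens` are hypothetical.

```python
def split_by_largest_gap(scores):
    """Return a threshold placed in the middle of the widest gap between sorted scores.

    Assumption: a simple 1-D stand-in for adaptive clustering, not the paper's method.
    """
    ordered = sorted(scores)
    if len(ordered) < 2:
        raise ValueError("need at least two scores to find a gap")
    # Pair each gap width with the index of its lower neighbour, then take the widest.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, idx = max(gaps)
    return (ordered[idx] + ordered[idx + 1]) / 2.0


def candidate_trigger_tokens(tokens, scores):
    """Return the tokens whose saliency lies above the largest-gap threshold."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    threshold = split_by_largest_gap(scores)
    return [tok for tok, s in zip(tokens, scores) if s > threshold]


# Illustrative input only: in practice the scores would come from gradients of the
# reported failure case with respect to the input embeddings.
tokens = ["please", "summarize", "cf", "this", "report"]
scores = [0.11, 0.19, 0.93, 0.08, 0.15]
suspects = candidate_trigger_tokens(tokens, scores)
print(suspects)
```

In this toy input a single token has far higher saliency than the rest, so the widest gap isolates it. Real saliency distributions are rarely this clean, which is presumably why the paper uses an adaptive clustering step rather than a fixed rule.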

backdoor · large language models · defense
