Safety
InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset
The paper introduces InstantForget, a backdoor-unlearning approach that never updates model parameters and instead acts on features at inference time. A clean-calibrated gated reset mechanism uses a Mahalanobis score to identify and neutralize malicious features. This reduces the average attack success rate (ASR) to 0.071 on CIFAR-10 without requiring any triggered samples. The method also detects malicious features reliably (0.981 AUROC) across multiple model architectures. Because it mitigates backdoors while leaving the model weights untouched, it is a promising option for practitioners.
backdoor, unlearning, features
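The gated-reset idea summarized above can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes that features are penultimate-layer vectors and that clean statistics are a mean and a ridge-regularized covariance. It also assumes the gate threshold is a quantile of Mahalanobis scores on clean data and that "reset" means replacing a flagged feature with the clean mean. All function names, the `ridge` and `quantile` parameters, and the reset target are hypothetical choices made for illustration. No model weights are touched, which matches the update-free setting.

```python
import numpy as np


def fit_clean_stats(clean_feats, ridge=1e-3):
    """Estimate mean and inverse covariance from clean features of shape (n, d).

    Assumption: a small ridge term keeps the covariance invertible.
    """
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False)
    cov += ridge * np.eye(cov.shape[0])
    return mu, np.linalg.inv(cov)


def mahalanobis(feats, mu, prec):
    """Mahalanobis distance of each row of feats from the clean mean."""
    diff = feats - mu
    # Row-wise quadratic form diff_i^T * prec * diff_i.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, prec, diff))


def calibrate_threshold(clean_feats, mu, prec, quantile=0.99):
    """Hypothetical calibration: gate threshold = quantile of clean scores."""
    return np.quantile(mahalanobis(clean_feats, mu, prec), quantile)


def gated_reset(feats, mu, prec, tau):
    """Reset features whose score exceeds tau; leave all others unchanged.

    Assumption: the reset target is the clean mean.
    """
    scores = mahalanobis(feats, mu, prec)
    gate = scores > tau
    out = feats.copy()
    out[gate] = mu  # broadcast the clean mean into every flagged row
    return out, scores, gate


# Usage on synthetic data: 5 in-distribution rows plus 5 strongly shifted rows
# that stand in for (hypothetical) trigger-induced features.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))
mu, prec = fit_clean_stats(clean)
tau = calibrate_threshold(clean, mu, prec)
test = np.vstack([rng.normal(size=(5, 8)), rng.normal(size=(5, 8)) + 10.0])
reset, scores, gate = gated_reset(test, mu, prec, tau)
```

Calibrating the threshold only on clean data mirrors the summary's point that no triggered samples are needed. The shifted rows score far above the clean quantile and are reset, while unflagged rows pass through unchanged.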