Research
Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP
Arditi et al. (2024) compare difference-in-means (DiM) interventions with Iterative Nullspace Projection (INLP) techniques for steering refusal in safety fine-tuned chat models, analyzing five open-weight models. They find that INLP's counterfactual flipping is competitive with DiM's directional ablation for refusal suppression, while nullspace projection is less effective. The study also finds that INLP interventions operate in distinct activation spaces, which bears on how models encode harmful versus harmless concepts and is relevant to practitioners refining model safety mechanisms.
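To make the two families of interventions concrete, here is a minimal NumPy sketch of DiM directional ablation and INLP-style nullspace projection. It is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, activations are assumed to be plain `(n_samples, d_model)` arrays taken from one layer, and a least-squares linear probe stands in for whatever linear classifier the study trains at each INLP iteration. Counterfactual flipping is not shown.

```python
import numpy as np


def dim_direction(harmful, harmless):
    """Unit-norm difference-in-means direction between two activation sets."""
    r = harmful.mean(axis=0) - harmless.mean(axis=0)
    return r / np.linalg.norm(r)


def ablate(x, direction):
    """Directional ablation: remove each row's component along a unit direction."""
    return x - np.outer(x @ direction, direction)


def inlp_projection(x, y, n_iters=3):
    """INLP-style projection matrix built from n_iters probe directions.

    Assumption: each probe is a least-squares fit on centered data, used here
    as a simple stand-in for a trained linear classifier.
    """
    d = x.shape[1]
    P = np.eye(d)
    for _ in range(n_iters):
        xp = x @ P                      # activations with earlier directions removed
        xc = xp - xp.mean(axis=0)       # center features
        w, *_ = np.linalg.lstsq(xc, y - y.mean(), rcond=1e-10)
        norm = np.linalg.norm(w)
        if norm < 1e-10:                # no linear signal left to remove
            break
        w = w / norm
        P = P @ (np.eye(d) - np.outer(w, w))  # project onto the nullspace of w
    return P
```

In this sketch, `ablate(x, dim_direction(harmful, harmless))` removes a single direction, whereas `x @ inlp_projection(x, y)` removes one probe direction per iteration, so the two interventions generally edit different subspaces of the activations.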