Safety
From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks
The paper presents a mechanistic analysis of jailbreak vulnerabilities in safety-aligned LLMs, using Gemma-2-2B as the test model. It introduces a token-driven pipeline that decomposes the model's residual stream into Sparse Autoencoder (SAE) features and identifies subgroups of features linked to unsafe behaviors, using three feature-grouping strategies. The findings indicate that individual tokens from harmful prompts can effectively localize these vulnerabilities within the model, which suggests that fine-grained feature analysis is useful for improving model safety and robustness against adversarial prompts.
jailbreaks, llm, mechanistic
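
To make the summary's "decompose the residual stream into SAE features" step concrete, here is a minimal sketch of how per-token residual vectors could be encoded into sparse feature activations and how the most active features per token could be read off. It is not the paper's pipeline: the plain ReLU encoder form, the function names `sae_encode` and `top_features_per_token`, the toy dimensions, and the random weights are all assumptions made for illustration. The SAEs actually used with Gemma-2-2B may use a different activation function, and the paper's three grouping strategies are not reproduced here.

```python
import numpy as np


def sae_encode(resid, W_enc, b_enc, b_dec):
    """Encode residual-stream vectors into non-negative feature activations.

    resid: (n_tokens, d_model) array, one residual vector per token.
    W_enc: (d_model, n_features) encoder weights (hypothetical here).
    Assumes a plain ReLU sparse autoencoder; other SAE variants differ.
    """
    return np.maximum((resid - b_dec) @ W_enc + b_enc, 0.0)


def top_features_per_token(acts, k):
    """Return, for each token, the indices of its k most active features,
    ordered from strongest to weakest."""
    order = np.argsort(-acts, axis=1)
    return order[:, :k]


# Toy stand-ins for a real model's residual stream and a trained SAE.
rng = np.random.default_rng(0)
n_tokens, d_model, n_features = 4, 8, 32
resid = rng.normal(size=(n_tokens, d_model))
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

acts = sae_encode(resid, W_enc, b_enc, b_dec)
top = top_features_per_token(acts, k=3)
print(top.shape)
```

In a real analysis, `resid` would come from a chosen layer of the model while it processes a prompt, and the per-token feature indices would be the raw material for any grouping of features by token or concept.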