Safety · arXiv cs.CL · 15 d ago

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

The paper presents a mechanistic analysis of jailbreak vulnerabilities in safety-aligned LLMs, using Gemma-2-2B as the test model. It introduces a token-driven pipeline that decomposes the model's residual stream into Sparse Autoencoder (SAE) features and identifies feature subgroups linked to unsafe behavior through three feature-grouping strategies. The findings indicate that individual harmful prompt tokens can effectively localize these vulnerabilities within the model, which points to fine-grained feature analysis as a way to improve safety and robustness against adversarial prompts.
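To make the idea of decomposing a residual-stream vector into SAE features concrete, here is a minimal sketch. It is not the paper's pipeline: the weights are random toy values rather than a trained SAE for Gemma-2-2B, the encoder uses a plain ReLU (the SAEs used in the paper may use a different activation), and the helper names `sae_encode` and `top_k_features` are hypothetical. It only shows how one token's residual vector could be mapped to sparse feature activations and how the strongest features could be read off as candidates for later grouping.

```python
import random


def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def sae_encode(x, W_enc, b_enc, b_dec):
    """Toy SAE encoder: f = ReLU(W_enc @ (x - b_dec) + b_enc).

    x      -- residual-stream vector for one token (length d_model)
    W_enc  -- encoder weights, one row per feature (n_features x d_model)
    b_enc  -- encoder bias (length n_features)
    b_dec  -- decoder bias subtracted from the input (length d_model)
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre_acts = [
        sum(w * c for w, c in zip(row, centered)) + b
        for row, b in zip(W_enc, b_enc)
    ]
    return relu(pre_acts)


def top_k_features(acts, k):
    """Indices of the k most active features, strongest first.

    Inactive features (activation 0) are never returned.
    """
    active = [(a, i) for i, a in enumerate(acts) if a > 0.0]
    active.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in active[:k]]


# Assumption: random toy weights stand in for a trained SAE.
rng = random.Random(0)
d_model, n_features = 8, 32
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

# Assumption: this vector stands in for the residual stream at one prompt token.
token_residual = [rng.gauss(0, 1) for _ in range(d_model)]

acts = sae_encode(token_residual, W_enc, b_enc, b_dec)
candidate_features = top_k_features(acts, k=5)
print(candidate_features)
```

In a real analysis the feature indices collected per token would then be passed to whatever grouping strategy is under study; this sketch stops before that step because the summary does not describe the three strategies.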

jailbreaks · llm · mechanistic · relevance 0.00 · engagement 0.00
