Safety
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
The study presents an approach for detecting and mitigating emergent misalignment in four instruction-tuned language models: Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, and Ministral-3-3B. The authors identify a causally actionable activation-space direction that achieves 99.6% separation of aligned and misaligned activations, and they show that causal steering along this direction reduces code spillover by 21 to 51 points. The results support within-model probing as an auditing tool and delineate the limits of linear cross-architecture corrections, which may not preserve content specificity.
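To make the idea of an activation-space direction concrete, here is a minimal sketch on synthetic vectors. It assumes the direction is estimated as a normalized difference of class means, which is a common probing baseline and not necessarily the authors' method. The function names, the threshold rule, and the steering strength `alpha` are all hypothetical and chosen for illustration only.

```python
import numpy as np

def mean_difference_direction(aligned, misaligned):
    """Unit vector pointing from the aligned mean to the misaligned mean.

    Assumption: a difference-of-means estimator; the summary does not say
    how the direction is actually obtained.
    """
    d = misaligned.mean(axis=0) - aligned.mean(axis=0)
    return d / np.linalg.norm(d)

def separation_accuracy(aligned, misaligned, direction):
    """Fraction of activations classified correctly by a midpoint
    threshold on their projection onto `direction`."""
    a = aligned @ direction
    m = misaligned @ direction
    threshold = (a.mean() + m.mean()) / 2.0
    correct = (a < threshold).sum() + (m >= threshold).sum()
    return correct / (len(a) + len(m))

def steer(hidden, direction, alpha):
    """Shift activations by `alpha` units against `direction`
    (hypothetical additive steering)."""
    return hidden - alpha * direction

def ablate(hidden, direction):
    """Remove the component of each activation along `direction`."""
    return hidden - np.outer(hidden @ direction, direction)

# Synthetic stand-ins for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
dim = 64
axis = np.zeros(dim)
axis[0] = 1.0
aligned = rng.normal(size=(200, dim))
misaligned = rng.normal(size=(200, dim)) + 6.0 * axis

direction = mean_difference_direction(aligned, misaligned)
accuracy = separation_accuracy(aligned, misaligned, direction)
steered = steer(misaligned, direction, alpha=6.0)
```

In a real audit the two activation sets would come from the same layer of one model, which is consistent with the summary's point that probing works within a model, while a direction transferred linearly across architectures may lose content specificity.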
misalignment, activation, llm