Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models
The paper introduces instruction-based vector steering for Large Audio-Language Models (LALMs). The method contrasts the activations produced by different prompts while the audio input is held constant, and uses that contrast to redirect the model's temporal attention over the audio signal. On sound event localization, steering concentrates attention on acoustically relevant regions, reaching 60.87% overlap with ground-truth intervals for Qwen2-Audio and 68.72% for Audio Flamingo 3, both higher than the overlap obtained with standard prompting. The approach is training-free and offers a new way to probe the temporal structure encoded in LALMs, supporting their interpretability and their use in audio understanding tasks.
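The contrast-and-steer idea summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the mean pooling over token positions, the scaling factor `alpha`, and the attention-mass definition of overlap are all assumptions made for illustration, and the summary does not say which layer is steered or how overlap is computed.

```python
import numpy as np


def steering_vector(acts_target, acts_baseline):
    """Difference of mean activations for two prompts over the SAME audio.

    Both inputs have shape (num_tokens, hidden_dim). Mean pooling over
    tokens is an assumption, not something the summary specifies.
    """
    return acts_target.mean(axis=0) - acts_baseline.mean(axis=0)


def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every token's hidden state.

    `alpha` is a hypothetical strength parameter.
    """
    return hidden + alpha * vector


def attention_overlap(attn, frame_times, intervals):
    """Fraction of attention mass that falls inside ground-truth intervals.

    attn:        non-negative attention weight per audio frame
    frame_times: start time of each frame, in seconds
    intervals:   list of (start, end) ground-truth event intervals
    This is one plausible overlap measure, assumed for illustration.
    """
    attn = np.asarray(attn, dtype=float)
    frame_times = np.asarray(frame_times, dtype=float)
    inside = np.zeros(attn.shape, dtype=bool)
    for start, end in intervals:
        inside |= (frame_times >= start) & (frame_times < end)
    total = attn.sum()
    return float(attn[inside].sum() / total) if total > 0 else 0.0
```

In a real model, `acts_target` and `acts_baseline` would be hidden states captured at a chosen layer for two instructions paired with the same audio clip, and the steered hidden state would be passed on to the remaining layers before the attention over audio frames is read out.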