Multimodal
Gaze Heads: How VLMs Look at What They Describe
The paper introduces the concept of "gaze heads," a specific set of attention heads in vision-language models (VLMs) that track the image regions being described, enhancing the model's ability to focus on relevant visual information. By applying attention-mask interventions on these gaze heads, the authors achieve an 83.1% accuracy in redirecting descriptions to chosen comic panels, demonstrating that this mechanism can also be applied to natural COCO images across various model sizes (2B to 32B parameters). This finding is significant for practitioners as it offers a method for real-time steering of VLM behavior without retraining, enabling more precise control in multimodal applications.
vision-languageattention headsgaze heads