Research
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models
The paper introduces the VLM Reliability Probe (VRP) and uses it to challenge the Attention-Confidence Assumption in Vision-Language Models (VLMs), the assumption that spatial attention indicates whether a prediction is correct. It finds that spatial attention is essentially uncorrelated with accuracy (R ≈ 0.001), whereas Self-Consistency across reasoning paths is a considerably stronger predictor of correctness (R = 0.429). The study also highlights architectural differences among models: LLaVA's predictions are fragile, while PaliGemma and Qwen2-VL remain reliable even under significant layer destruction. The practical takeaway is that practitioners assessing model reliability should look at generation dynamics rather than spatial attention.
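To make the Self-Consistency signal concrete, here is a minimal Python sketch. It assumes self-consistency is measured as the fraction of sampled answers that agree with the majority answer, and that it is related to correctness through a Pearson correlation. Both choices are assumptions for illustration, since the summary does not specify the paper's exact metric. The function names `self_consistency` and `pearson_r` are hypothetical and are not taken from the VRP code.

```python
from collections import Counter
from statistics import mean


def self_consistency(answers):
    """Return (majority_answer, agreement_fraction) for one question.

    `answers` holds the final answers from several independently sampled
    reasoning paths. Majority agreement is an assumed definition of
    self-consistency, not necessarily the one used in the paper.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so "Cat" and "cat " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy, made-up samples for illustration only (not data from the paper).
    sampled = [
        (["cat", "cat", "cat", "dog"], "cat"),
        (["red", "blue", "green", "blue"], "red"),
        (["two", "two", "two", "two"], "two"),
    ]
    scores, correct = [], []
    for answers, truth in sampled:
        majority, score = self_consistency(answers)
        scores.append(score)
        correct.append(1.0 if majority == truth else 0.0)
    print("consistency scores:", scores)
    print("correlation with correctness:", pearson_r(scores, correct))
```

In practice the sampled answers would come from running the same image and question through the model several times with a non-zero sampling temperature. The same `pearson_r` helper could be applied to an attention-based score to compare the two signals on equal footing.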
vision-language models, attention, reliability