RAG
When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs
The paper introduces MAD-RAG, an intervention that addresses Attention Distraction (AD) in Retrieval-Augmented Generation (RAG) for Large Vision-Language Models (LVLMs). MAD-RAG decouples visual grounding from context integration through a dual-question formulation and attention mixing. On knowledge-based visual question answering benchmarks such as OK-VQA and E-VQA, it reports gains of up to 9.20% over existing baselines. For practitioners, the appeal is improved reliability: attention-related failures are mitigated without substantial added computational cost.
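To make "attention mixing" concrete, here is a minimal sketch of one plausible reading: blending two attention distributions over the same tokens, one from a pass that sees only the question and one from a pass that also sees retrieved context. The summary does not specify the actual mixing rule, so the convex combination, the `alpha` weight, and the names `softmax` and `mix_attention` are all assumptions for illustration, not the paper's implementation.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_attention(grounded, contextual, alpha):
    """Convex combination of two attention distributions over the same tokens.

    alpha=1.0 keeps only the question-only pass; alpha=0.0 keeps only the
    retrieval-augmented pass. Hypothetical rule, not taken from the paper.
    """
    if len(grounded) != len(contextual):
        raise ValueError("distributions must cover the same tokens")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * g + (1.0 - alpha) * c for g, c in zip(grounded, contextual)]


# Toy scores over four tokens (made-up numbers for illustration only).
question_only = softmax([2.0, 0.5, 0.1, 0.1])          # pass without retrieved context
question_with_context = softmax([0.3, 0.4, 0.2, 0.1])  # pass with retrieved context
mixed = mix_attention(question_only, question_with_context, alpha=0.5)
```

Because both inputs sum to one and `alpha` lies in [0, 1], the mixed weights also form a valid distribution. In this toy case the first token regains weight that the context-augmented pass had spread across the other tokens.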
attention, retrieval-augmented, lvms