Agents
See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL
The paper introduces Visual Evidence Pre-Alignment (VEPA), a training stage for multimodal large language models (MLLMs) that aims to improve how responses draw on visual evidence. VEPA uses Group Relative Policy Optimization (GRPO) with a sufficiency-driven objective to optimize descriptions of visual evidence conditioned on the question, and the paper reports improved performance on visually demanding benchmarks. The approach addresses the limitations of traditional caption-based pretraining by providing stronger visual grounding, which is relevant to practitioners building more accurate, context-aware MLLMs.
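To make the GRPO step mentioned in the summary more concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO is known for: several outputs are sampled for one prompt, each is scored, and the scores are normalized within the group. The reward function here is an assumption for illustration only, not the paper's definition of sufficiency: it gives 1.0 when a hypothetical downstream answerer, reading only the generated evidence description, returns the reference answer. The names `sufficiency_reward` and `group_relative_advantages` are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def sufficiency_reward(predicted_answer: str, gold_answer: str) -> float:
    """Hypothetical binary reward (an assumption, not the paper's objective).

    Returns 1.0 if the answer derived from an evidence description matches
    the reference answer after trimming whitespace and lowercasing.
    """
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here; some implementations
    use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: answers a hypothetical answerer produced from four sampled
# evidence descriptions for the same (image, question) pair.
predicted = ["red", "blue", "Red ", "green"]
rewards = [sufficiency_reward(p, "red") for p in predicted]
advantages = group_relative_advantages(rewards)
```

In this toy group, descriptions that led to the correct answer receive positive advantages and the others receive negative ones. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.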
multimodal, visual evidence