Multimodal
Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
The paper introduces Iterative Visual Thinking (IVT), a framework that strengthens the self-correction capabilities of vision-language models (VLMs) through visual feedback. Its two-phase training approach generates corrective reasoning from the model's own predictions and applies Group Relative Policy Optimization (GRPO) with an Intersection over Union (IoU) reward to improve multi-step refinement. Results show accuracy gains on benchmarks such as RefCOCOg, where Acc@0.5 rises from 79.6% to 82.0% (+2.4 percentage points). This suggests that effective spatial self-correction can be achieved with limited data and computational resources, which matters for practitioners developing robust VLMs.
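To make the quantities in the summary concrete, here is a minimal Python sketch of an IoU score between two bounding boxes, the Acc@0.5 metric built on it, and the group-relative reward normalization commonly used in GRPO. It is illustrative only and not the paper's implementation: the function names, the `(x1, y1, x2, y2)` box format, the `>=` threshold convention, and the `eps` constant are all assumptions.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2) (assumed format)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold.

    Using >= rather than > at the threshold is an assumption; conventions vary.
    """
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO-style training.

    Each reward is centered on the group mean and scaled by the group standard
    deviation; eps (an assumed constant) avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, the IoU of each sampled box against the ground truth would serve as the reward, and `group_relative_advantages` would turn a group of such rewards into the relative signal that favors the better refinements within that group.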
vision-language models, self-correction