Multimodal
Thinking with Visual Grounding
The paper introduces visually grounded thinking for vision-language models (VLMs): natural-language reasoning interleaved with explicit visual evidence in the form of point or box groundings. Its training pipeline uses a SAM3-based agent to produce groundings and applies grounding-aware reinforcement learning, which improves performance on counting and spatial reasoning tasks. The 4B models trained with visually grounded thinking match or exceed the performance of larger models, indicating that explicit grounding improves reasoning accuracy in VLMs.
visual grounding, reasoning
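
To make the idea of reasoning interleaved with box groundings more concrete, here is a minimal sketch of how a grounded reasoning trace might be parsed and scored. Everything in it is an assumption for illustration only: the `<box>x1,y1,x2,y2</box>` markup, the helper names (`extract_boxes`, `iou`, `grounding_score`, `toy_reward`), the greedy matching, and the `weight` parameter are hypothetical and are not taken from the paper. The SAM3-based agent and the actual reinforcement learning objective are not modeled here.

```python
import re

# Hypothetical inline markup for a box grounding inside a reasoning trace.
BOX_RE = re.compile(
    r"<box>\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*</box>"
)


def extract_boxes(trace):
    """Return every (x1, y1, x2, y2) box mentioned in the trace, in order."""
    return [tuple(float(v) for v in m.groups()) for m in BOX_RE.finditer(trace)]


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_score(pred_boxes, ref_boxes):
    """Greedily match each reference box to its best unused predicted box.

    Dividing by the larger of the two counts penalizes both missed
    objects and spurious extra boxes.
    """
    if not ref_boxes or not pred_boxes:
        return 0.0
    remaining = list(pred_boxes)
    total = 0.0
    for ref in ref_boxes:
        if not remaining:
            break
        best = max(remaining, key=lambda p: iou(p, ref))
        total += iou(best, ref)
        remaining.remove(best)
    return total / max(len(ref_boxes), len(pred_boxes))


def toy_reward(trace, answer, ref_answer, ref_boxes, weight=0.5):
    """Toy reward: answer correctness plus a weighted grounding term.

    This is an illustrative stand-in, not the paper's reward.
    """
    correct = 1.0 if answer.strip() == ref_answer.strip() else 0.0
    return correct + weight * grounding_score(extract_boxes(trace), ref_boxes)


# A counting-style trace whose reasoning cites one box per counted object.
trace = (
    "One cat is at <box>0,0,10,10</box> and another is at "
    "<box>20,20,30,30</box>, so there are 2 cats."
)
reference_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
reward = toy_reward(trace, "2", "2", reference_boxes)
```

In this sketch a trace that reaches the right count but cites no boxes, or cites boxes in the wrong places, earns only the correctness term, which is one simple way a reward could favor reasoning that is tied to visual evidence.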