Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
The article introduces the PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent that improves spatial reasoning through active evidence acquisition and multi-step visual interaction. PERIA uses two lightweight tool families, one for visual perception and one for interaction, and is trained with a method that combines supervised trajectory synthesis with Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). In the reported experiments, PERIA-8B outperforms its Qwen3-8B backbone by 10% on in-distribution benchmarks and is competitive with larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, which makes it relevant to practitioners working on spatial reasoning in AI applications.
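
To make the perceive-interact-reason pattern concrete, here is a minimal Python sketch of a tool-augmented agent loop. It is an illustration under stated assumptions, not PERIA's implementation: the tool names (`perceive`, `interact`), the `Step` record, the `run_agent` function, and the scripted policy are all hypothetical, and the tools return placeholder strings rather than real visual outputs. In a real system the policy would be the vision-language model choosing the next tool call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# NOTE: every name here is hypothetical; the source does not specify
# PERIA's tool interfaces, prompts, or stopping rule.


@dataclass
class Step:
    tool: str         # which tool family was called
    argument: str     # what the policy asked the tool to do
    observation: str  # what the tool returned


Tool = Callable[[str], str]
Policy = Callable[[str, List[Step]], Tuple[str, str]]


def run_agent(question: str, policy: Policy, tools: Dict[str, Tool],
              max_steps: int = 4) -> Tuple[str, List[Step]]:
    """Alternate between tool calls and reasoning until the policy answers."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        action, argument = policy(question, trajectory)
        if action == "answer":
            return argument, trajectory
        if action in tools:
            observation = tools[action](argument)
        else:
            observation = f"unknown tool: {action}"
        trajectory.append(Step(action, argument, observation))
    # Step budget exhausted without a final answer.
    return "no answer", trajectory


# Placeholder tools standing in for the two tool families.
TOOLS: Dict[str, Tool] = {
    "perceive": lambda arg: f"[perception result for '{arg}']",
    "interact": lambda arg: f"[interaction result for '{arg}']",
}


def scripted_policy(question: str, trajectory: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: perceive once, interact once, then answer."""
    if len(trajectory) == 0:
        return "perceive", "locate objects"
    if len(trajectory) == 1:
        return "interact", "zoom on region 1"
    return "answer", f"answered after {len(trajectory)} tool calls"


if __name__ == "__main__":
    answer, steps = run_agent("Which object is closer?", scripted_policy, TOOLS)
    for s in steps:
        print(s.tool, "->", s.observation)
    print(answer)
```

The loop keeps the full trajectory of tool calls and observations, which is the kind of multi-step record that trajectory-based training methods typically consume; how PERIA actually structures and scores such trajectories is not described in this summary.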