Multimodal · arXiv cs.AI · 7 d ago

VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA (View-Consistent Self-Verified Training) is a training framework for GUI grounding that extends Group Relative Policy Optimization (GRPO) by drawing model rollouts from multiple target-preserving views of the same GUI instance. The method stabilizes coordinate generation and improves grounding accuracy across five benchmarks; for Qwen3-VL, the 4B model improves from 55.5 to 63.4 on ScreenSpot-Pro. Because each instance is seen through several views, the framework also supports a more robust evaluation of model performance: it reduces prediction variability across views and improves worst-view accuracy, which matters for practitioners building reliable GUI interaction systems.
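The summary does not say how views are constructed or how rollouts are scored, so the following Python sketch is illustrative only and is not the paper's implementation. It assumes that a view is a random crop that still contains the target box, that the reward is a binary point-in-box hit after mapping the prediction back to original pixels, and that advantages use the standard GRPO normalization (reward minus group mean, divided by group standard deviation). All names (`View`, `target_preserving_crop`, `hit_reward`, `group_relative_advantages`, `worst_view_accuracy`) are hypothetical.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """A crop of the screenshot: offset (x0, y0) and size (w, h) in original pixels."""
    x0: float
    y0: float
    w: float
    h: float

    def to_original(self, px, py):
        """Map a point in normalized view coordinates [0, 1] back to original pixels."""
        return self.x0 + px * self.w, self.y0 + py * self.h


def target_preserving_crop(img_w, img_h, bbox, rng):
    """Sample a random crop that fully contains bbox = (left, top, right, bottom).

    Assumption: cropping is one possible target-preserving transform.
    """
    left, top, right, bottom = bbox
    x0 = rng.uniform(0, left)
    y0 = rng.uniform(0, top)
    x1 = rng.uniform(right, img_w)
    y1 = rng.uniform(bottom, img_h)
    return View(x0, y0, x1 - x0, y1 - y0)


def hit_reward(point, bbox):
    """Binary reward (assumed): 1.0 if the point lies inside the target box."""
    x, y = point
    left, top, right, bottom = bbox
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: normalize rewards within one group.

    Here the group would pool rollouts from all views of the same instance.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def worst_view_accuracy(rewards_per_view):
    """Lowest mean hit rate over the views of one instance."""
    return min(statistics.fmean(r) for r in rewards_per_view)
```

In use, each rollout's predicted point would be mapped with `View.to_original`, scored with `hit_reward`, and the pooled rewards passed to `group_relative_advantages`; `worst_view_accuracy` then reports how the weakest view performs.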

