Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning
Ouroboros-Spatial is a self-evolving training framework for spatial reasoning in multimodal large language models (MLLMs), in which the model alternates between generating and solving spatial question-answer (QA) pairs. A frozen proposer creates QA pairs from 3D scene metadata and video frames, while a learnable solver is fine-tuned according to its confidence in its own predictions, so the training distribution shifts as the solver's capabilities change. The framework improves the performance of Qwen3-VL-4B and Qwen3-VL-8B on six benchmarks. On VSI-Bench, it achieves these gains with fewer training examples than traditional large-scale datasets require, which makes it a data-efficient option for practitioners.
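The summary does not specify how solver confidence is measured or how it drives fine-tuning, so the following is only a minimal sketch of one possible reading. It assumes confidence is the agreement rate among several sampled answers, and that QA pairs are kept for training when that rate falls inside a band. The function names, the `low` and `high` thresholds, and the `sample_answers` callback are all hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def self_consistency(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def select_training_pairs(
    questions: Sequence[str],
    sample_answers: Callable[[str], Sequence[str]],
    low: float = 0.3,   # assumed lower bound: below this the solver is mostly guessing
    high: float = 0.9,  # assumed upper bound: above this the question is already easy
) -> List[Tuple[str, str, float]]:
    """Keep proposer questions whose solver confidence lies in [low, high].

    `sample_answers` stands in for the learnable solver: given a question,
    it returns several sampled answers. The returned triples are
    (question, majority answer, confidence).
    """
    selected = []
    for question in questions:
        answer, confidence = self_consistency(sample_answers(question))
        if low <= confidence <= high:
            selected.append((question, answer, confidence))
    return selected
```

Under these assumptions, the selected set changes from round to round: as the solver improves, questions that used to fall inside the band become too easy and drop out, which is one way a training distribution could track the model's capabilities.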