Agents
Planning with the Views via Scene Self-Exploration
The paper introduces a framework for view planning in visual language models (VLMs), improving their ability to predict and compose camera movements in 3D environments. The framework combines scene self-exploration with view graph distillation and raises the Qwen2.5-VL-7B model from 2.5% to 47.8% on interactive view planning tasks, outperforming GPT-5.4 Pro and Gemini 3.1 Pro. For practitioners, it addresses the weakness of existing VLMs in multi-turn planning and offers a structured approach to navigating complex 3D scenes.
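To make the idea of composing camera movements over a view graph concrete, here is a minimal sketch. It is not the paper's method: the graph layout, the view IDs, the action names, and the function `plan_views` are all hypothetical, and breadth-first search stands in for whatever planner the framework actually learns.

```python
from collections import deque

# Hypothetical view graph: each node is a view ID, each edge a camera action.
# The view IDs and action names are illustrative, not taken from the paper.
VIEW_GRAPH = {
    "v0": {"turn_left": "v1", "move_forward": "v2"},
    "v1": {"turn_right": "v0", "move_forward": "v3"},
    "v2": {"turn_left": "v3"},
    "v3": {},
}


def plan_views(graph, start, goal):
    """Return the shortest list of camera actions from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        view, actions = queue.popleft()
        if view == goal:
            return actions
        # Expand every camera action available from the current view.
        for action, next_view in graph.get(view, {}).items():
            if next_view not in visited:
                visited.add(next_view)
                queue.append((next_view, actions + [action]))
    return None  # goal view is unreachable from start


if __name__ == "__main__":
    print(plan_views(VIEW_GRAPH, "v0", "v3"))
```

In this toy setting a multi-turn plan is just a path through the graph; a VLM-based planner would have to predict each action from images rather than look it up.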
view-planning, visual-language-models, 3d