Research
GeoWorld-VLM: Geometry from World Models for Vision-Language Models
GeoWorld-VLM is a framework that improves the spatial reasoning of Vision-Language Models (VLMs) by integrating geometric structure from frozen camera-conditioned video world models. Only the image encoder and the multimodal projector are fine-tuned; the language model is left unchanged, which preserves its linguistic capabilities. Across different VLM architectures, the method improves performance on the What'sUp and VSR spatial-reasoning benchmarks by approximately 4%. For practitioners, the result suggests that spatial reasoning can be strengthened by adapting the visual pathway alone, without retraining the language model.
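The training split described above (image encoder and projector updated, language model and world model frozen) can be illustrated with a minimal, dependency-free sketch. The module prefixes `image_encoder.`, `projector.`, `language_model.`, and `world_model.` and the helper `split_parameters` are hypothetical names chosen for illustration; they are not taken from the GeoWorld-VLM code, and real checkpoints will name their modules differently.

```python
# Hypothetical module prefixes (assumptions for illustration only).
TRAINABLE_PREFIXES = ("image_encoder.", "projector.")
FROZEN_PREFIXES = ("language_model.", "world_model.")


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists by prefix."""
    trainable, frozen = [], []
    for name in param_names:
        if name.startswith(TRAINABLE_PREFIXES):
            trainable.append(name)
        elif name.startswith(FROZEN_PREFIXES):
            frozen.append(name)
        else:
            # Fail loudly rather than silently training or freezing
            # a module that the split does not account for.
            raise ValueError(f"unrecognized parameter group: {name}")
    return trainable, frozen


# Example parameter names, invented for this sketch.
names = [
    "image_encoder.blocks.0.attn.weight",
    "projector.linear.weight",
    "language_model.layers.0.mlp.weight",
    "world_model.unet.conv_in.weight",
]
trainable, frozen = split_parameters(names)
```

In a PyTorch-style setup, the same partition would typically be applied by setting `requires_grad` to `False` on every frozen parameter and passing only the trainable ones to the optimizer.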
vision-language, geometry, world models