Multimodal
m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
The article introduces m2sv, a scalable benchmark for map-to-street-view spatial reasoning. It comprises two datasets, m2sv-20k and m2sv-sft-11k, aimed at improving how vision-language models (VLMs) align overhead maps with Street View images. The best VLM evaluated reaches only 65.2% accuracy, 6.8 percentage points below the human annotators' average of 72.0%, which points to persistent challenges in geometric alignment and reasoning consistency. The work underscores the need for advances in grounded spatial reasoning and is relevant to practitioners building multimodal AI applications.
spatial reasoning, vision-language