Research
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
The article presents Reason, then Re-reason (ReRe), a framework for spatial reasoning over egocentric videos that addresses the limitations of single-turn inference. ReRe operates in two phases. In the Reason Phase, a multi-modal large language model (MLLM) generates a spatial hypothesis. In the Re-reason Phase, that hypothesis is revised using novel-view videos synthesized by a Geometry-to-Video pipeline. Evaluations on VSI-Bench and STI-Bench show that ReRe significantly improves the performance of open-source MLLMs, enough for them to compete with proprietary models. This matters to practitioners who need robust spatial reasoning in AI applications.
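The two-phase control flow described above can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the names `rere`, `ReReResult`, `reason`, `synthesize_views`, and `re_reason` are hypothetical, the three callables stand in for the MLLM and the Geometry-to-Video pipeline, and the assumption that view synthesis depends only on the input video is ours.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder type: a video represented as a sequence of frame identifiers.
Video = Sequence[str]


@dataclass
class ReReResult:
    hypothesis: str    # answer produced in the Reason Phase
    final_answer: str  # answer after the Re-reason Phase
    num_views: int     # number of novel-view videos consulted


def rere(
    video: Video,
    question: str,
    reason: Callable[[Video, str], str],
    synthesize_views: Callable[[Video], List[Video]],
    re_reason: Callable[[Video, List[Video], str, str], str],
) -> ReReResult:
    """Hypothetical two-phase loop; all three callables are stand-ins."""
    # Reason Phase: form a spatial hypothesis from the original video.
    hypothesis = reason(video, question)

    # Stand-in for the Geometry-to-Video pipeline: synthesize novel views.
    novel_views = synthesize_views(video)

    # Re-reason Phase: revisit the hypothesis with the extra views.
    final_answer = re_reason(video, novel_views, question, hypothesis)
    return ReReResult(hypothesis, final_answer, len(novel_views))


# Toy stubs so the sketch runs end to end without any model.
def toy_reason(video: Video, question: str) -> str:
    return "left"


def toy_views(video: Video) -> List[Video]:
    return [list(reversed(video))]


def toy_re_reason(
    video: Video, views: List[Video], question: str, hypothesis: str
) -> str:
    # Pretend the extra view overturns the first guess.
    return "right" if views else hypothesis


result = rere(
    ["f0", "f1"],
    "Is the chair left or right of the table?",
    toy_reason,
    toy_views,
    toy_re_reason,
)
```

Passing the phases in as callables keeps the sketch independent of any particular model or rendering backend; a real system would replace the toy stubs with actual MLLM calls and a view-synthesis step.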
spatial reasoning, mllm, inference