Multimodal
Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations
Gen-VCoT is a proposed framework that augments multimodal large language models (MLLMs) with RGB images as intermediate reasoning steps, making the reasoning more interpretable than traditional text-based chain-of-thought (CoT). It uses a three-stage pipeline: visual grounding with SAM segmentation, geometric reasoning with Marigold depth maps, and semantic reasoning with Qwen2-VL, plus an adaptive router that selects the reasoning depth. Evaluations show significant accuracy gains on spatial and depth questions, while text-based CoT remains superior for simple factual queries, which suggests that the best intermediate representation for an MLLM is task-dependent.
visual reasoning, diffusion, Gen-VCoT, RGB
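
The summary above does not say how the adaptive router decides on a reasoning depth, so the following is only a minimal sketch of one way such routing could look. Everything in it is an assumption made for illustration: the keyword cue sets, the `route` and `run_pipeline` names, and the stub stage functions, which stand in for SAM, Marigold, and Qwen2-VL without calling any of them.

```python
from typing import Callable, Dict, List

# Hypothetical cue words; the source does not describe the router's criteria.
DEPTH_CUES = {"closer", "closest", "farther", "farthest", "nearer", "nearest", "depth", "distance"}
SPATIAL_CUES = {"left", "right", "behind", "between", "above", "below", "where"}


def route(question: str) -> List[str]:
    """Pick which visual stages to run; an empty list means text-only CoT."""
    words = {w.strip("?.,!").lower() for w in question.split()}
    if words & DEPTH_CUES:
        # Depth questions: run all three stages.
        return ["grounding", "geometric", "semantic"]
    if words & SPATIAL_CUES:
        # Spatial questions: skip the depth-map stage (an assumption).
        return ["grounding", "semantic"]
    # Simple factual queries: fall back to text-based CoT.
    return []


def run_pipeline(question: str, stages: Dict[str, Callable[[str], str]]) -> List[str]:
    """Run the routed stages in order and return the intermediate trace."""
    return [stages[name](question) for name in route(question)]


# Stub stages (placeholders, not real model calls).
STUB_STAGES: Dict[str, Callable[[str], str]] = {
    "grounding": lambda q: "segmentation mask (stub)",
    "geometric": lambda q: "depth map (stub)",
    "semantic": lambda q: "answer from visual intermediates (stub)",
}

if __name__ == "__main__":
    for q in ["What color is the car?", "Which object is closer to the camera?"]:
        print(q, "->", route(q), run_pipeline(q, STUB_STAGES))
```

A learned classifier would be a more realistic router than keyword matching. The sketch only illustrates the control flow implied by the summary: factual questions bypass the visual stages, while spatial and depth questions trigger progressively more of them.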