Multimodal · arXiv cs.AI · 11 d ago

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

The paper introduces MathVis-Fine, a framework for multimodal mathematical reasoning that targets a weakness of existing Chain-of-Thought (CoT) approaches: their handling of visual inputs. It presents the MathVis-Fine dataset, which includes fine-grained visual annotations and visual dependency ratings, and proposes a two-stage progressive visual enhancement training paradigm that adjusts rewards according to the visual dependency level of each sample. For practitioners, the relevance is that supervision becomes more precise and reward bias is mitigated, with the aim of improving accuracy on tasks that require integrating text and visual information.
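As a rough illustration of what "adjusting rewards by visual dependency level" could look like, here is a minimal Python sketch. The summary does not give the paper's actual reward formula, so everything here is an assumption made for illustration: the function name `dependency_scaled_reward`, the integer dependency scale from 0 to `max_dependency`, the `visual_score` input in [0, 1], and the linear weighting.

```python
def dependency_scaled_reward(
    answer_correct: bool,
    visual_score: float,
    dependency: int,
    max_dependency: int = 3,
) -> float:
    """Hypothetical reward: answer correctness plus a visual term that is
    scaled by how much the sample depends on its image.

    This is NOT the paper's formula; it only illustrates the general idea
    that visual supervision should count more when the image is needed.
    """
    if not 0 <= dependency <= max_dependency:
        raise ValueError("dependency must lie in [0, max_dependency]")
    if not 0.0 <= visual_score <= 1.0:
        raise ValueError("visual_score must lie in [0, 1]")

    # Assumed linear weighting: 0 for text-only samples, 1 for samples
    # that cannot be solved without the image.
    visual_weight = dependency / max_dependency

    return float(answer_correct) + visual_weight * visual_score


# A sample that does not need the image ignores the visual term entirely,
# while a fully image-dependent sample is rewarded for visual grounding.
text_only = dependency_scaled_reward(True, visual_score=0.5, dependency=0)
image_heavy = dependency_scaled_reward(True, visual_score=0.5, dependency=3)
```

Under this assumed scheme, a text-only sample is never penalized for weak visual grounding, which is one plausible way to avoid biasing the reward on samples where the image is not needed.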

visual reasoning · training · dependencies
