Research
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
The paper introduces CORA, an approach to thinking-answer inconsistency in reinforcement learning with verifiable rewards (RLVR) for large vision-language models (LVLMs), where a model's reasoning and its final answer fail to agree. CORA combines a lightweight consistency reward model with Hybrid Reward Advantage Splitting (HRAS) to improve semantic consistency in the reasoning process. The authors report improved task performance and fewer inconsistencies across multimodal reasoning benchmarks. For practitioners, it offers a way to get more reliable reasoning outputs from LVLM applications.
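To make "thinking-answer inconsistency" concrete, here is a minimal Python sketch of a rule-based check. It is not CORA's method: per the summary, the paper uses a learned consistency reward model, while this is only a toy heuristic. The `<think>`/`<answer>` tag format, the "answer is" cue phrase, and the function names are all assumptions made for illustration.

```python
import re

# Assumed (hypothetical) output format: <think>...</think><answer>...</answer>
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Assumed cue phrase that introduces a conclusion inside the reasoning.
CONCLUSION_RE = re.compile(r"answer is\s+([^\s.,;<]+)", re.IGNORECASE)


def normalize(text):
    """Lowercase and trim whitespace and trailing periods for comparison."""
    return text.strip().strip(".").lower()


def is_consistent(response):
    """Compare the last conclusion in the reasoning with the final answer.

    Returns True or False when both parts are found, and None when the
    response has nothing comparable (missing tags or no cue phrase).
    """
    think = THINK_RE.search(response)
    answer = ANSWER_RE.search(response)
    if not think or not answer:
        return None
    conclusions = CONCLUSION_RE.findall(think.group(1))
    if not conclusions:
        return None
    return normalize(conclusions[-1]) == normalize(answer.group(1))


consistent = "<think>The 2020 bar is tallest, so the answer is B.</think><answer>B</answer>"
inconsistent = "<think>Counting gives 4 cats, so the answer is 4.</think><answer>5</answer>"

print(is_consistent(consistent))
print(is_consistent(inconsistent))
```

A string-matching check like this misses paraphrases and implicit conclusions, which is one reason a learned consistency model is a plausible design choice for this problem.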
llm reasoning reinforcement-learning