Research
PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
PaLMR is a framework for improving visual reasoning in Multimodal Large Language Models (MLLMs) by addressing process-level misalignment. It has two components: a perception-aligned data layer that generates structured pseudo-ground-truths, and a process-aligned optimization layer that uses a hierarchical reward fusion scheme. Together, these improve reasoning fidelity and reduce hallucinations. Experiments with Qwen2.5-VL-7B show that PaLMR achieves state-of-the-art results on HallusionBench while maintaining strong performance on other benchmarks, which suggests it can improve the reliability and interpretability of multimodal reasoning systems.
visual reasoning, multimodal, reinforcement learning