ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
ProcessThinker is a post-training pipeline that improves the reasoning of multi-modal large language models by supplying step-level process rewards without a dedicated process reward model. It first rewrites reasoning traces into a step-tagged format and fine-tunes the model on them, then applies Group Relative Policy Optimization (GRPO) with rollout-based rewards derived from the empirical success rates of intermediate reasoning steps. The resulting model improves consistently over the baseline Qwen3-VL-8B-Instruct on four video benchmarks. The approach targets inconsistent reasoning and yields more reliable logical conclusions, which makes it relevant to practitioners building multi-step reasoning applications.
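The rollout-based reward can be illustrated with a minimal sketch. It assumes that the "empirical success rate" of a step is the fraction of sampled continuations from that step's prefix that reach a correct final answer, and that rewards within a group are normalized by the group mean and standard deviation, as in the commonly described form of GRPO. The names `rollout_step_rewards`, `group_relative_advantages`, `rollout`, and `is_correct` are hypothetical and do not come from the ProcessThinker implementation; how the step rewards are combined with any outcome reward is not specified here and is left out.

```python
from statistics import mean, pstdev
from typing import Callable, List


def rollout_step_rewards(
    steps: List[str],
    rollout: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 8,
) -> List[float]:
    """Score each step by the empirical success rate of rollouts from its prefix.

    `rollout` is a hypothetical sampler: given the steps so far, it continues
    the reasoning and returns a final answer. `is_correct` checks that answer.
    """
    rewards = []
    for k in range(len(steps)):
        prefix = steps[: k + 1]
        # Count how many sampled continuations of this prefix succeed.
        successes = sum(
            1 for _ in range(n_rollouts) if is_correct(rollout(prefix))
        )
        rewards.append(successes / n_rollouts)
    return rewards


def group_relative_advantages(
    rewards: List[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one sampled group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In practice `rollout` would call the policy model to complete the response from the given prefix; the number of rollouts per step trades reward variance against sampling cost.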