Agents
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
The paper examines a limitation of verifier-driven self-DPO in visual-language models (VLMs): a stronger verifier does not guarantee improvement and can cause performance to regress on new tasks. It presents empirical evidence that the same verifiers that improve a Qwen-3-VL-2B model on MathVista lower its accuracy on MMMU by 3.4 to 10.9 percentage points. The findings indicate that verifier quality should be judged by the verifier's accuracy on the target task rather than by its model size, and that practitioners should measure this accuracy before relying on a verifier for self-improvement.
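The recommendation to check a verifier on the target task before using it can be illustrated with a minimal sketch. This is not the paper's code: the function names, the example format, the toy verifier, and the 0.7 threshold are all assumptions chosen for illustration. The idea is simply to score the verifier against a small labeled held-out set from the new task and to skip it when its task-specific accuracy is too low.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration only (not from the paper):
# a verifier maps (question, candidate_answer) to an accept/reject decision.
Verifier = Callable[[str, str], bool]
LabeledExample = Tuple[str, str, bool]  # (question, candidate_answer, is_correct)


def verifier_accuracy(verifier: Verifier, examples: Sequence[LabeledExample]) -> float:
    """Fraction of held-out examples where the verifier agrees with the gold label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(verifier(q, a) == label for q, a, label in examples)
    return hits / len(examples)


def should_use_verifier(
    verifier: Verifier,
    examples: Sequence[LabeledExample],
    min_accuracy: float = 0.7,  # assumed threshold; tune per task
) -> bool:
    """Gate: only use the verifier if it is accurate enough on the target task."""
    return verifier_accuracy(verifier, examples) >= min_accuracy


# Toy verifier that accepts any purely numeric answer. It behaves
# reasonably on arithmetic questions and poorly on everything else.
def toy_verifier(question: str, answer: str) -> bool:
    return answer.strip().isdigit()


# Tiny labeled held-out set standing in for a sample of the new task.
held_out = [
    ("2 + 2 = ?", "4", True),
    ("2 + 2 = ?", "5", False),
    ("Capital of France?", "Paris", True),
    ("Capital of France?", "Rome", False),
]

accuracy = verifier_accuracy(toy_verifier, held_out)
use_it = should_use_verifier(toy_verifier, held_out)
```

In this toy setup the verifier agrees with the labels on only half of the held-out examples, so the gate rejects it. In practice the held-out set would be sampled from the task the model is about to be trained on, and the threshold would be set empirically.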
visual-language models, self-improvement, verifiers