Multimodal
Enhancing Pathological VLMs with Cross-scale Reasoning
The paper introduces a cross-scale reasoning paradigm for vision-language models (VLMs) in pathology, addressing the need to reason across multiple magnifications when interpreting images. It presents Scale-VQA, a benchmark of 4,685 questions built on 2,537 pathology images, and ScaleReasoner-R1, a model trained with reinforcement learning that is reported to achieve state-of-the-art results on both Scale-VQA and established single-scale benchmarks. For practitioners, the work matters because it improves a VLM's ability to integrate multi-scale evidence, which may improve diagnostic accuracy in pathological assessment.
pathology, vision-language model, cross-scale reasoning