Multimodal
The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
The study evaluates 12 open-weight vision-language models (VLMs) on binary classification tasks across two clinical neuroimaging datasets, FOR2107 and OASIS-3. It finds that smaller models can achieve F1 score improvements of up to 58% when neuroimaging context is introduced, but that these gains stem largely from prompt framing rather than from actual integration of the imaging data, a phenomenon termed the "scaffold effect." The results highlight the pitfalls of relying on surface-level performance metrics in clinical AI applications and underscore the need for deeper evaluation of multimodal reasoning capabilities.
clinical, vlm, neuroimaging