Research · arXiv cs.AI · 12 d ago

Vision-language models for chest radiography do not always need the image

The study introduces a causal audit for vision-language models in chest radiography and finds that a text-only model, with no access to the image, comes within 5.7 points of the best multimodal model's accuracy. A 119-billion-parameter multimodal model was statistically equivalent to a 7-billion-parameter text-only model, which suggests that some multimodal models make little effective use of the image. The finding points to the need for grounding audits, rather than accuracy metrics alone, in clinical applications: a model that scores well without relying on the image may not be diagnostically reliable.
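The comparison behind the headline number can be illustrated with a simple image-ablation check: score the same cases with and without image access and measure the accuracy gap. The sketch below is not the paper's causal audit, whose procedure this summary does not describe. It is a minimal illustration that assumes exact-match scoring, and the function names (`accuracy`, `image_gap_points`, `bootstrap_gap_ci`) and the toy data are hypothetical.

```python
import random
from typing import Sequence, Tuple


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def image_gap_points(with_image: Sequence[str],
                     text_only: Sequence[str],
                     labels: Sequence[str]) -> float:
    """Accuracy gained from image access, in percentage points."""
    return 100.0 * (accuracy(with_image, labels) - accuracy(text_only, labels))


def bootstrap_gap_ci(with_image: Sequence[str],
                     text_only: Sequence[str],
                     labels: Sequence[str],
                     n_resamples: int = 2000,
                     alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Paired percentile-bootstrap interval for the gap.

    Cases are resampled with replacement, and each case keeps both of its
    predictions, so the pairing between the two conditions is preserved.
    """
    rng = random.Random(seed)
    n = len(labels)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(image_gap_points([with_image[i] for i in idx],
                                     [text_only[i] for i in idx],
                                     [labels[i] for i in idx]))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


if __name__ == "__main__":
    # Toy data, invented for illustration only (not results from the study).
    labels = ["yes", "no"] * 5
    with_image = ["no", "yes"] + labels[2:]        # wrong on cases 0 and 1
    text_only = ["yes", "yes", "no", "yes"] + labels[4:]  # wrong on cases 1, 2, 3
    print(image_gap_points(with_image, text_only, labels))
    print(bootstrap_gap_ci(with_image, text_only, labels))
```

Under these assumptions, an interval that sits entirely inside a margin chosen in advance would be one way to argue that two models are statistically equivalent; the study may use a different test.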

vision-language models · chest radiography · causal audit
