Multimodal · arXiv cs.CL · 2 d ago

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

A new benchmark for vision-language models (VLMs) comprises 540 images and four question variants designed to separate reliance on textual priors from actual use of image content. The study evaluates eleven VLMs and finds that every model loses accuracy on questions that minimize text leakage, with open-weight models showing the largest drops. The results indicate that practitioners need to address textual-prior reliance in VLMs, and suggest that targeted training methods such as GRPO post-training can improve performance by making models more image-dependent.

vision-language · benchmark · llm
