Multimodal
Would you still call this Dax? Novel Visual References in VLMs and Humans
The article introduces the Novel Visual References Dataset (NVRD), which comprises 19,176 images covering 90 visual concepts and is designed to investigate how vision-language models (VLMs) and humans map novel visual references to language. The dataset includes progressively perturbed versions of objects to assess how far each group generalizes a newly learned reference. The results show that models struggle to learn novel concepts in context when those concepts contradict their prior knowledge, and that they overgeneralize relative to human judgments. The work provides a new benchmark for studying visual concept learning, which matters for improving VLM performance in real-world applications.
vision-language-models, novel-concepts