Research · arXiv cs.CL · 8 d ago

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

MET-Bench is a multimodal entity tracking benchmark that evaluates how well vision-language models maintain coherent entity representations over time. It reveals a significant performance gap between text-based and image-based tracking, attributed primarily to deficiencies in visual reasoning. Text-based reasoning strategies improve performance, but challenges persist in long-horizon multimodal tasks. The work underscores the need for better multimodal representations and reasoning methods that integrate textual and image-based state updates.

entity tracking · vision-language · benchmark
