Research
MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
MET-Bench is a multimodal entity tracking benchmark for evaluating how well vision-language models maintain coherent entity representations over time. The benchmark reveals a significant performance gap between text-based and image-based tracking, attributed primarily to deficiencies in visual reasoning. Text-based reasoning strategies improve performance, but challenges persist in long-horizon multimodal tasks. The work underscores the need for better multimodal representations and reasoning methods so that AI systems can integrate textual and image-based state updates.
entity tracking, vision-language, benchmark