Agents
Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation
The article introduces the \textsc{\benchmarkname{}} benchmark for assessing visual social intelligence in multimodal agents. It comprises 240 scenarios, 585 role instances, and 2,340 role-task instances that integrate textual and visual cues. An evaluation of seven recent multimodal large language models (MLLMs) shows that performance saturates on role-specific tasks, while interaction regulation and visually grounded outcomes remain significantly challenging. The benchmark provides a structured framework for improving AI systems' understanding of social dynamics, which is crucial for developing more effective multimodal agents.
multimodal, social intelligence, benchmark