Agents · arXiv cs.CL · 11 d ago

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

The article introduces a benchmark for assessing visual social intelligence in multimodal agents. It comprises 240 scenarios, 585 role instances, and 2,340 role-task instances that integrate textual and visual cues. An evaluation of seven recent multimodal language models (MLLMs) finds that performance saturates on role-specific tasks, while interaction regulation and visually grounded outcomes remain significantly challenging. The benchmark offers a structured framework for improving how multimodal agents understand social dynamics.

multimodal · social intelligence · benchmark
