Agents · arXiv cs.AI · 15 d ago

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

The paper introduces ROSE (Reference-conditioned Oddity and Symbolic Execution), a benchmark that evaluates how well multimodal large language models (MLLMs) translate visual information into context-specific actions. Across nine recent MLLMs, performance drops by up to 44.5 percentage points when moving from counting tasks to region-conditioned actions, exposing a gap in the models' ability to use visual evidence when the context changes. The result points to a need for architectures and training methods that close this perception-to-action gap, particularly for applications that depend on fine-grained interaction with visual data.

