Agents
ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models
The article introduces ROSE (Reference-conditioned Oddity and Symbolic Execution), a benchmark that evaluates how well multimodal large language models (MLLMs) translate visual information into context-specific actions. Across nine recent MLLMs, performance drops by up to 44.5 percentage points when moving from counting tasks to region-conditioned actions, highlighting a gap in the models' ability to use visual evidence when the required action depends on context. The finding points to a need for architectures and training methods that narrow this perception-to-action gap, which matters for applications that require nuanced interaction with visual data.
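To make the "percentage points" framing concrete, the sketch below shows one way such a gap could be computed from per-task accuracies. It is only an illustration: the function name, the model names, and every accuracy value are hypothetical and are not taken from ROSE or its results.

```python
def percentage_point_gap(counting_acc: float, action_acc: float) -> float:
    """Return the drop from counting accuracy to action accuracy, in percentage points.

    Both inputs are accuracies expressed as fractions in [0, 1].
    """
    return round((counting_acc - action_acc) * 100, 1)


# Hypothetical accuracies for illustration only; these are NOT results from ROSE.
hypothetical_scores = {
    "model_a": {"counting": 0.80, "action": 0.50},
    "model_b": {"counting": 0.70, "action": 0.60},
}

# Gap per model: counting accuracy minus region-conditioned action accuracy.
gaps = {
    name: percentage_point_gap(scores["counting"], scores["action"])
    for name, scores in hypothetical_scores.items()
}

# The model with the largest drop is the one a "up to N points" claim would refer to.
largest_gap_model = max(gaps, key=gaps.get)
```

A percentage-point gap is an absolute difference between two accuracies, not a relative change: in this made-up data, falling from 80% to 50% is a 30-point drop, even though it is a 37.5% relative decline.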
multimodal models, action, benchmark