Agents
Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use
The paper introduces PhysTool-Bench, a benchmark that assesses how well Multimodal Large Language Models (MLLMs) recognize and use physical tools in real-world scenarios. It comprises 2,510 queries covering 2,678 tools from various domains. An evaluation of 13 leading MLLMs, including Gemini-3.1-Pro, shows that even the best model identifies only 58.7% of tools and completes only 21.0% of tasks, pointing to significant deficits in both perception and planning. The results underscore the difficulty MLLMs have with functional commonsense reasoning, a capability critical for advancing embodied AI applications.
MLLM, physical tool use, benchmark