Agents
MedCTA: A Benchmark for Clinical Tool Agents
MedCTA is a newly introduced benchmark for evaluating medical AI agents on complex, clinician-validated tasks that require tool retrieval and evidence integration from multimodal clinical inputs. It comprises 107 real-world tasks and evaluates aspects such as tool selection and execution stability across 18 multimodal models. The results show that even advanced systems struggle with multi-step clinical processes, owing to problems such as protocol failures and incorrect tool recruitment. For practitioners, the benchmark offers a framework for assessing the reliability and effectiveness of AI agents in clinical environments, and it highlights the gap between perception capabilities and practical agentic behavior.
benchmark, clinical tools, ml