Agents · arXiv cs.AI · 4 d ago

MedCTA: A Benchmark for Clinical Tool Agents

MedCTA is a benchmark for evaluating medical AI agents on complex, clinician-validated tasks that require retrieving tools and integrating evidence from multimodal clinical inputs. It comprises 107 real-world tasks and measures aspects such as tool selection and execution stability across 18 multimodal models. The results show that even advanced systems struggle with multi-step clinical processes, with failure modes that include protocol failures and recruiting the wrong tools. For practitioners, the benchmark offers a framework for assessing the reliability and effectiveness of AI agents in clinical environments, and it highlights the gap between perception capabilities and practical agentic behavior.

