Research
MedicalAgentsBench for Complex Medical Reasoning: Comparing Internalized Reasoning Models versus Externalized Agent-based Frameworks
The article introduces MedicalAgentsBench, a benchmark for evaluating complex medical reasoning that consists of 862 clinical questions curated from eight datasets. It compares three internalized reasoning models (DeepSeek-R1, o1-mini, o3-mini) with nine externalized agent-based frameworks and finds that pairing an internalized model (o3-mini) with an externalized agent framework (MDAgents) achieves the highest accuracy, 35.1%. The results indicate that the two approaches are complementary, and suggest that practitioners in resource-constrained environments can improve performance by layering models strategically and optimizing the combination.
medical reasoning, llm, benchmark