Research · arXiv cs.AI · 8 d ago

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

The study evaluates a Jury of large language models (LLMs) for scoring 3,334 medical diagnoses drawn from hospital cases in low- and middle-income countries, comparing the Jury's scores with those of expert clinician panels. The uncalibrated LLM Jury preserves ordinal agreement with expert scores but scores systematically lower. It also shows a lower probability of severe-risk errors and excellent agreement with calibrated scores. This suggests that LLMs can serve as reliable proxies for expert evaluation in medical AI, potentially making clinical assessment more efficient and flagging high-risk diagnoses for expert review.
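To make "ordinal agreement but systematically lower" concrete, here is a minimal Python sketch on synthetic toy scores that are not taken from the study. It assumes Spearman rank correlation as the ordinal-agreement measure and a least-squares linear map as the calibration step; the summary names neither, so both are illustrative choices, and the helper names (`ranks`, `spearman`, `fit_linear_calibration`) are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def fit_linear_calibration(jury, expert):
    """Least-squares slope and intercept mapping jury scores onto expert scores."""
    mj, me = mean(jury), mean(expert)
    slope = (sum((j - mj) * (e - me) for j, e in zip(jury, expert))
             / sum((j - mj) ** 2 for j in jury))
    return slope, me - slope * mj


# Synthetic toy data (assumption, NOT study data): the jury ranks cases
# exactly as the experts do but scores every case 0.8 points lower.
expert = [1.0, 2.0, 3.0, 4.0, 5.0, 3.5]
jury = [e - 0.8 for e in expert]

rho = spearman(jury, expert)                           # ordinal agreement
bias = mean(j - e for j, e in zip(jury, expert))       # systematic offset
slope, intercept = fit_linear_calibration(jury, expert)
calibrated = [slope * j + intercept for j in jury]     # jury scores after calibration

print(f"rho={rho:.3f} bias={bias:.3f} slope={slope:.3f} intercept={intercept:.3f}")
```

In this toy case the rank correlation stays at its maximum even though every jury score is lower, which is why a rank-based check alone cannot reveal a systematic offset and why a separate calibration step can matter.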

medical diagnosis · llm · evaluation
