Research
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
The study presents an evaluation framework for radiology report generation that uses Large Language Models (LLMs) to assess the clinical significance of generated content. Using the ReEvalMed benchmark, the researchers identified a bias in LLM evaluators: they detect errors effectively but over-penalize harmless variations. To address this, they synthesized 4,000 report pairs and trained lightweight, interpretable metrics on Qwen3-8B and MedGemma-4B. These metrics outperform larger 32B medical LLMs, and the results suggest that one-pass trained metrics are better suited to cost-sensitive applications. The dataset and metric will be released for further use.
llm, radiology, evaluation