Research · arXiv cs.CL · 14 d ago

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

The study presents an evaluation framework for radiology report generation that uses Large Language Models (LLMs) to assess the clinical significance of generated content. Using the ReEvalMed benchmark, the researchers identified a bias in LLM evaluators: they detect errors effectively but over-penalize harmless variations. They synthesized 4,000 report pairs and used them to train lightweight, interpretable metrics built on Qwen3-8B and MedGemma-4B. These metrics outperform larger 32B medical LLMs, and the results suggest that one-pass trained metrics are better suited to cost-sensitive applications. The dataset and metric will be released for further use.

