Research
Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text
This article presents a benchmark for evaluating how well large language models (LLMs) preserve diagnostic uncertainty in clinical text. The benchmark consists of 1,200 documents with 9,184 annotations across five levels of uncertainty. An evaluation of three LLMs showed that they often fail to maintain the original uncertainty cues, with preservation rates below 50%, and that they struggle with nuanced distinctions between adjacent levels. The research highlights a critical failure mode that standard metrics do not capture and underscores the need for improved evaluation methods to ensure the safe deployment of LLMs in clinical settings.
llm clinical uncertainty