Training
I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.
A benchmark study evaluated eight large language models (LLMs) on medical scribing, using 300 synthetic doctor-patient dialogues to test how well each model generates SOAP notes. It found 12 confirmed high-impact hallucinations against 520 omissions of clinically relevant safety facts, which makes omissions the larger problem. GPT-5.4-mini stood out for cost and speed. DeepSeek showed promise in prose quality but omitted many facts, which suggests that pairing lower-cost models with a safety layer could improve their clinical utility.
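To put the two headline counts on a common scale, the short sketch below converts them into rates per 100 notes. It rests on two assumptions that the summary does not confirm: that every model produced one SOAP note for each dialogue (so 8 × 300 notes in total), and that the 12 and 520 figures are totals across all models. The names used here are illustrative only.

```python
# Figures quoted in the summary.
DIALOGUES = 300
MODELS = 8
HALLUCINATIONS = 12   # confirmed high-impact hallucinations
OMISSIONS = 520       # omitted clinically relevant safety facts

# Assumption: each model generated exactly one note per dialogue,
# and both counts are totals over all models.
total_notes = DIALOGUES * MODELS


def per_100_notes(count: int, notes: int) -> float:
    """Return the number of events per 100 generated notes."""
    return 100 * count / notes


halluc_rate = per_100_notes(HALLUCINATIONS, total_notes)
omission_rate = per_100_notes(OMISSIONS, total_notes)
ratio = OMISSIONS / HALLUCINATIONS

print(f"notes generated (assumed): {total_notes}")
print(f"hallucinations per 100 notes: {halluc_rate:.2f}")
print(f"omissions per 100 notes: {omission_rate:.2f}")
print(f"omissions per hallucination: {ratio:.1f}")
```

Under those assumptions, omissions outnumber high-impact hallucinations by roughly 43 to 1. That ratio depends only on the two quoted counts, while the per-note rates change if the number of notes differs from the assumed total.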
benchmark medical scribing