Safety
First, do NOHARM: towards clinically safe large language models
The article introduces NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark of 1,100 tasks across 10 medical specialties that evaluates the safety of medical recommendations generated by large language models (LLMs). An analysis of 28 LLMs found that up to 22.6% of generated recommendations posed a risk of severe harm, primarily through errors of omission. The results underscore the need for ongoing assessment of clinical safety in AI applications: current models can still produce dangerous medical advice despite strong performance on measures of general intelligence and medical knowledge.
llm, healthcare, risk assessment