Training
Automatic identification of diagnosis from hospital discharge letters via weakly supervised Natural Language Processing
A weakly supervised Natural Language Processing (NLP) pipeline was developed to automatically identify diagnoses from Italian hospital discharge letters without extensive manual annotation. The method uses a transformer model pre-trained on Italian medical documents to generate semantic embeddings, then applies a two-level clustering procedure to those embeddings to generate weak labels. On a dataset of 33,176 discharge letters, it achieves an AUROC of 77.68% and an F1-score of 78.14%. The approach reduces manual annotation effort while performing comparably to fully supervised models, which supports its use for large-scale clinical data analysis.
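The weak-labelling step can be illustrated with a minimal sketch. The summary above does not say which clustering algorithm is used or how the two levels relate, so everything here is an assumption made for illustration: plain k-means stands in for the clustering algorithm, the second level is assumed to re-cluster each first-level cluster, and hand-made 2-D points stand in for the transformer embeddings. The names `kmeans`, `two_level_weak_labels`, `k_coarse`, and `k_fine` are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point seeding (an assumed stand-in)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def two_level_weak_labels(embeddings, k_coarse=2, k_fine=2):
    """Hypothetical two-level scheme: coarse clusters first, then fine clusters inside each."""
    coarse = kmeans(embeddings, k_coarse)
    labels = [None] * len(embeddings)
    for c in range(k_coarse):
        idx = [i for i, a in enumerate(coarse) if a == c]
        if not idx:
            continue
        fine = kmeans([embeddings[i] for i in idx], min(k_fine, len(idx)))
        for i, f in zip(idx, fine):
            # The weak label is the (coarse, fine) cluster pair.
            labels[i] = (c, f)
    return labels


# Toy 2-D "embeddings": two distant groups, each made of two tight subgroups.
toy_embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),        # subgroup A1
    (1.0, 0.0), (1.1, 0.0), (1.0, 0.1),        # subgroup A2
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.1),  # subgroup B1
    (11.0, 10.0), (11.1, 10.0), (11.0, 10.1),  # subgroup B2
]
weak_labels = two_level_weak_labels(toy_embeddings)
print(weak_labels)
```

In a real setting, the toy points would be replaced by the document embeddings, and the resulting cluster pairs would still need to be mapped to diagnosis categories before they could serve as training labels. That mapping is not described in the summary above and is left out of the sketch.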
natural language processing, diagnosis, weak supervision