Coding · arXiv cs.AI · 7 d ago

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

The EDEN dataset, comprising approximately 4 million anonymized clinical notes from Italian emergency departments, has been released to support the development of large language models for medical applications. A subset of around 6,000 notes has been manually annotated with 132 items relevant to dyspnea and loss of consciousness, which supports structured information extraction tasks. Together with a proposed CRF-filling (case report form) benchmark and baseline results from Gemma-27B and MedGemma-27B, the dataset addresses a significant data gap for practitioners working on clinical language processing in Italian.

