The DRAGON benchmark for clinical NLP
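| Main Authors: | Joeran S. Bosma, Koen Dercksen, Luc Builtjes, Romain André, Christian Roest, Stefan J. Fransen, Constant R. Noordman, Mar Navarro-Padilla, Judith Lefkes, Natália Alves, Max J. J. de Grauw, Leander van Eekelen, Joey M. A. Spronck, Megan Schuurmans, Bram de Wilde, Ward Hendrix, Witali Aswolinskiy, Anindo Saha, Jasper J. Twilt, Daan Geijs, Jeroen Veltman, Derya Yakar, Maarten de Rooij, Francesco Ciompi, Alessa Hering, Jeroen Geerdink, Henkjan Huisman, on behalf of the DRAGON consortium |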
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-05-01 |
| Series: | npj Digital Medicine |
| Online Access: | https://doi.org/10.1038/s41746-025-01626-x |
| _version_ | 1850273364118077440 |
|---|---|
| author | Joeran S. Bosma, Koen Dercksen, Luc Builtjes, Romain André, Christian Roest, Stefan J. Fransen, Constant R. Noordman, Mar Navarro-Padilla, Judith Lefkes, Natália Alves, Max J. J. de Grauw, Leander van Eekelen, Joey M. A. Spronck, Megan Schuurmans, Bram de Wilde, Ward Hendrix, Witali Aswolinskiy, Anindo Saha, Jasper J. Twilt, Daan Geijs, Jeroen Veltman, Derya Yakar, Maarten de Rooij, Francesco Ciompi, Alessa Hering, Jeroen Geerdink, Henkjan Huisman, on behalf of the DRAGON consortium |
| collection | DOAJ |
| description | Abstract: Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available. |
| format | Article |
| id | doaj-art-30f59bf928b14aa1a2cf997e1473604a |
| institution | OA Journals |
| issn | 2398-6352 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | npj Digital Medicine |
| spelling | doaj-art-30f59bf928b14aa1a2cf997e1473604a; 2025-08-20T01:51:31Z; eng; Nature Portfolio; npj Digital Medicine; 2398-6352; 2025-05-01; 81110; 10.1038/s41746-025-01626-x; The DRAGON benchmark for clinical NLP. Authors and affiliations: Joeran S. Bosma (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Koen Dercksen (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Luc Builtjes (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Romain André (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Christian Roest (Department of Radiology, University Medical Center Groningen); Stefan J. Fransen (Department of Radiology, University Medical Center Groningen); Constant R. Noordman (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Mar Navarro-Padilla (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Judith Lefkes (Computational Pathology Group, Department of Pathology, Radboud University Medical Center); Natália Alves (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Max J. J. de Grauw (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Leander van Eekelen (Computational Pathology Group, Department of Pathology, Radboud University Medical Center); Joey M. A. Spronck (Computational Pathology Group, Department of Pathology, Radboud University Medical Center); Megan Schuurmans (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Bram de Wilde (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Ward Hendrix (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Witali Aswolinskiy (Computational Pathology Group, Department of Pathology, Radboud University Medical Center); Anindo Saha (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Jasper J. Twilt (Minimally Invasive Image-Guided Intervention Center, Department of Medical Imaging, Radboud University Medical Center); Daan Geijs (Computational Pathology Group, Department of Pathology, Radboud University Medical Center); Jeroen Veltman (Department of Radiology, Ziekenhuisgroep Twente); Derya Yakar (Department of Radiology, Netherlands Cancer Institute); Maarten de Rooij (Department of Medical Imaging, Radboud University Medical Center); Francesco Ciompi (Computational Pathology Group, Department of Pathology, Radboud University Medical Center); Alessa Hering (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); Jeroen Geerdink (Department of Health & Information Technology, Ziekenhuisgroep Twente); Henkjan Huisman (Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center); on behalf of the DRAGON consortium. |
| title | The DRAGON benchmark for clinical NLP |
| url | https://doi.org/10.1038/s41746-025-01626-x |