The DRAGON benchmark for clinical NLP

Abstract Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to...

Bibliographic Details
Main Authors: Joeran S. Bosma, Koen Dercksen, Luc Builtjes, Romain André, Christian Roest, Stefan J. Fransen, Constant R. Noordman, Mar Navarro-Padilla, Judith Lefkes, Natália Alves, Max J. J. de Grauw, Leander van Eekelen, Joey M. A. Spronck, Megan Schuurmans, Bram de Wilde, Ward Hendrix, Witali Aswolinskiy, Anindo Saha, Jasper J. Twilt, Daan Geijs, Jeroen Veltman, Derya Yakar, Maarten de Rooij, Francesco Ciompi, Alessa Hering, Jeroen Geerdink, Henkjan Huisman, On behalf of the DRAGON consortium
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01626-x
_version_ 1850273364118077440
author Joeran S. Bosma
Koen Dercksen
Luc Builtjes
Romain André
Christian Roest
Stefan J. Fransen
Constant R. Noordman
Mar Navarro-Padilla
Judith Lefkes
Natália Alves
Max J. J. de Grauw
Leander van Eekelen
Joey M. A. Spronck
Megan Schuurmans
Bram de Wilde
Ward Hendrix
Witali Aswolinskiy
Anindo Saha
Jasper J. Twilt
Daan Geijs
Jeroen Veltman
Derya Yakar
Maarten de Rooij
Francesco Ciompi
Alessa Hering
Jeroen Geerdink
Henkjan Huisman
On behalf of the DRAGON consortium
author_sort Joeran S. Bosma
collection DOAJ
description Abstract Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.
format Article
id doaj-art-30f59bf928b14aa1a2cf997e1473604a
institution OA Journals
issn 2398-6352
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series npj Digital Medicine
spelling doaj-art-30f59bf928b14aa1a2cf997e1473604a
2025-08-20T01:51:31Z
eng
Nature Portfolio
npj Digital Medicine
2398-6352
2025-05-01
81110
10.1038/s41746-025-01626-x
The DRAGON benchmark for clinical NLP
Joeran S. Bosma: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Koen Dercksen: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Luc Builtjes: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Romain André: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Christian Roest: Department of Radiology, University Medical Center Groningen
Stefan J. Fransen: Department of Radiology, University Medical Center Groningen
Constant R. Noordman: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Mar Navarro-Padilla: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Judith Lefkes: Computational Pathology Group, Department of Pathology, Radboud University Medical Center
Natália Alves: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Max J. J. de Grauw: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Leander van Eekelen: Computational Pathology Group, Department of Pathology, Radboud University Medical Center
Joey M. A. Spronck: Computational Pathology Group, Department of Pathology, Radboud University Medical Center
Megan Schuurmans: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Bram de Wilde: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Ward Hendrix: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Witali Aswolinskiy: Computational Pathology Group, Department of Pathology, Radboud University Medical Center
Anindo Saha: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Jasper J. Twilt: Minimally Invasive Image-Guided Intervention Center, Department of Medical Imaging, Radboud University Medical Center
Daan Geijs: Computational Pathology Group, Department of Pathology, Radboud University Medical Center
Jeroen Veltman: Department of Radiology, Ziekenhuisgroep Twente
Derya Yakar: Department of Radiology, Netherlands Cancer Institute
Maarten de Rooij: Department of Medical Imaging, Radboud University Medical Center
Francesco Ciompi: Computational Pathology Group, Department of Pathology, Radboud University Medical Center
Alessa Hering: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
Jeroen Geerdink: Department of Health & Information Technology, Ziekenhuisgroep Twente
Henkjan Huisman: Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center
On behalf of the DRAGON consortium
https://doi.org/10.1038/s41746-025-01626-x
title The DRAGON benchmark for clinical NLP
title_sort dragon benchmark for clinical nlp
url https://doi.org/10.1038/s41746-025-01626-x