The DRAGON benchmark for clinical NLP

Bibliographic Details
Main Authors: Joeran S. Bosma, Koen Dercksen, Luc Builtjes, Romain André, Christian Roest, Stefan J. Fransen, Constant R. Noordman, Mar Navarro-Padilla, Judith Lefkes, Natália Alves, Max J. J. de Grauw, Leander van Eekelen, Joey M. A. Spronck, Megan Schuurmans, Bram de Wilde, Ward Hendrix, Witali Aswolinskiy, Anindo Saha, Jasper J. Twilt, Daan Geijs, Jeroen Veltman, Derya Yakar, Maarten de Rooij, Francesco Ciompi, Alessa Hering, Jeroen Geerdink, Henkjan Huisman, On behalf of the DRAGON consortium
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01626-x
Description
Summary: Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756) compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.
ISSN: 2398-6352