Evaluation of SURUS: a named entity recognition NLP system to extract knowledge from interventional study records
Abstract

Background: Medical decision-making is commonly guided by evidence-based analyses from systematic literature reviews (SLRs), which require large amounts of time and subject-matter expertise to perform. Automated extraction of key datapoints from clinical publications could speed up the assembly of systematic literature reviews. To this end, we built SURUS, a named entity recognition (NER) system composed of a Bidirectional Encoder Representations from Transformers (BERT) model trained on a fine-grained dataset. The aim of this study was to assess the quality of SURUS classifications of PICO (patient, intervention, comparator and outcome) and study design elements in clinical study abstracts.

Methods: The PubMedBERT-based model was trained and evaluated on a dataset of 39,531 labels across 400 clinical abstracts, with an inter-annotator agreement of 0.81 (Cohen's κ) and 0.88 (F1). The labels were manually annotated following a strict annotation guide. We evaluated the quality of the dataset and tested the utility of the model in the practice of systematic literature screening by comparing SURUS predictions to expert PICO and study design classifications. Additionally, we tested the out-of-domain quality of the model across 7 other therapeutic areas and another study design.

Results: The SURUS NER system achieved an overall F1 score of 0.95, with minor deviation between labels. In addition, SURUS achieved NER F1 scores of 0.90 and 0.84 on out-of-domain therapeutic-area and observational-study abstracts, respectively. Finally, the F1 of PICO and study design classifications was 0.89, with a recall of 0.96 compared to expert classifications.

Conclusion: The system reaches an F1 score of 0.95 across 25 contextually different medical named entities. This high-quality in-domain medical entity prediction by a fine-tuned BERT-based model was the result of a strict annotation guideline and high inter-annotator agreement. The prediction accuracy was largely preserved during extensive out-of-domain evaluation, indicating its utility across other indication areas and study types. Current approaches in the field lack the fine-grained training data and versatility demonstrated here. We think this approach sets a new standard in medical literature analysis and paves the way for creating fine-grained datasets of labelled entities that can be used for downstream analysis outside of traditional SLRs.
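The record describes SURUS as a PubMedBERT-based token-classification (NER) model fine-tuned on manually annotated clinical abstracts. The paper's exact label set, data format and hyperparameters are not given in this record, so the sketch below is only illustrative: the checkpoint name, the BIO label list, the file names and the training settings are assumptions, and it uses the standard Hugging Face transformers/datasets APIs rather than anything SURUS-specific.

```python
# Minimal sketch of fine-tuning a PubMedBERT-style encoder for token
# classification (NER). Checkpoint name, label list, file paths and
# hyperparameters are illustrative assumptions, not the SURUS setup.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed
LABELS = ["O", "B-POPULATION", "I-POPULATION", "B-INTERVENTION",
          "I-INTERVENTION", "B-OUTCOME", "I-OUTCOME"]  # placeholder subset; SURUS uses 25 entity types
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT, num_labels=len(LABELS), id2label=id2label, label2id=label2id)

def tokenize_and_align(batch):
    # Word-level BIO tags must be re-aligned to sub-word tokens; special
    # tokens and sub-word continuations get the ignore index -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):  # ner_tags = label ids
        word_ids, prev, labels = enc.word_ids(batch_index=i), None, []
        for wid in word_ids:
            labels.append(-100 if wid is None or wid == prev else word_labels[wid])
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

# Hypothetical BIO-tagged abstracts in JSON lines with "tokens" and "ner_tags".
dataset = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})
dataset = dataset.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="surus-ner-sketch", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=3e-5),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```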
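The Methods and Results report two kinds of scores: token-level inter-annotator agreement (Cohen's κ of 0.81, F1 of 0.88) and entity-level NER F1 (0.95 in-domain). The sketch below shows how such numbers are typically computed over BIO-tagged sequences; the toy label sequences and the use of scikit-learn and seqeval are illustrative assumptions, not the authors' evaluation code or exact matching rules.

```python
# Minimal sketch: Cohen's kappa for inter-annotator agreement and
# entity-level NER F1. All label sequences below are hypothetical.
from sklearn.metrics import cohen_kappa_score
from seqeval.metrics import classification_report, f1_score

# Two annotators' token-level labels for the same abstract (toy example).
annotator_a = ["O", "B-INTERVENTION", "I-INTERVENTION", "O", "B-OUTCOME"]
annotator_b = ["O", "B-INTERVENTION", "O", "O", "B-OUTCOME"]
kappa = cohen_kappa_score(annotator_a, annotator_b)

# Model predictions vs. gold labels, one list of BIO tags per abstract.
gold = [["O", "B-POPULATION", "I-POPULATION", "O", "B-OUTCOME"]]
pred = [["O", "B-POPULATION", "I-POPULATION", "O", "O"]]
ner_f1 = f1_score(gold, pred)  # entity-level F1, as reported for NER systems

print(f"kappa={kappa:.2f}  NER F1={ner_f1:.2f}")
print(classification_report(gold, pred))
```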
| Main Authors: | Casper Peeters, Koen Vijverberg, Marianne Pouwer, Bart Westerman, Maikel Boot, Suzan Verberne |
|---|---|
| Author Affiliations: | Medstone Science (Peeters, Vijverberg, Pouwer, Boot); Amsterdam University Medical Center (UMC) (Westerman); Leiden Institute of Advanced Computer Science (LIACS), Leiden University (Verberne) |
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-07-01 |
| Series: | BMC Medical Research Methodology |
| ISSN: | 1471-2288 |
| Subjects: | Language model; Evidence-based medicine; PICO; Systematic literature review; Natural language processing; Bidirectional encoder representations from transformers |
| Online Access: | https://doi.org/10.1186/s12874-025-02624-z |