Clinical applications of large language models in medicine and surgery: A scoping review

Bibliographic Details
Main Authors: Eric Nan Liang, Sophia Pei, Phillip Staibano, Benjamin van der Woerd
Format: Article
Language: English
Published: SAGE Publishing 2025-07-01
Series: Journal of International Medical Research
Online Access: https://doi.org/10.1177/03000605251347556
Description:
Objective: To provide a comprehensive overview of the current use of large language models in clinical medicine and surgery, with emphasis on model characteristics, clinical applications, and readiness for adoption.
Methods: A scoping review of studies on the use of large language models in clinical medicine and surgery was conducted in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-scoping review and JBI methodology (protocol registration: 10.37766/inplasy2025.3.0102). A comprehensive search of EMBASE, PubMed, CINAHL, and IEEE Xplore identified 3313 articles published between 2018 and 2023. After screening of articles and full-text review, 156 studies were included. Data were extracted for study type, sample size, clinical specialty, model architecture, training methods, application purpose, and performance metrics. Descriptive analyses were performed.
Results: Most studies were proof-of-concept studies (55.8%) or clinical trials (21.2%), with a steady rise in publications since 2022. Large language models were most frequently used for data extraction (69.9%), followed by clinical recommendations (11.5%), report generation (9.0%), and patient-facing chatbots (7.1%). Proprietary models were used in 57.7% of the studies, whereas 39.7% used open-source models. ChatGPT-3.5, ChatGPT-4, and Bidirectional Encoder Representations from Transformers (BERT) were the most commonly reported models. Only 25.0% of the studies reported models as ready for clinical use, whereas 67.9% stated that the models required further validation. F-score (30.8%) and area under the curve (15.4%) were the most common performance metrics; 10.9% of the studies used expert opinion for validation.
Conclusions: Large language models are increasingly being used in clinical medicine. Although most applications focus on data extraction and summarization, emerging studies are beginning to explore higher-level tasks such as clinical decision-making and multidisciplinary simulation. Significant heterogeneity continues to exist in model architecture, evaluation methods, and reporting standards. Further standardization is needed to develop transparent evaluation frameworks and ensure safe, reliable integration of large language models into complex clinical workflows.
ISSN: 1473-2300
Institution: DOAJ
Record ID: doaj-art-3489bc0df60448019469210e2b264662