Clinical applications of large language models in medicine and surgery: A scoping review

Objective To provide a comprehensive overview of the current use of large language models in clinical medicine and surgery, with emphasis on model characteristics, clinical applications, and readiness for adoption. Methods A scoping review of studies on the use of large language models in clinical m...

Bibliographic Details
Main Authors:	Eric Nan Liang, Sophia Pei, Phillip Staibano, Benjamin van der Woerd
Format:	Article
Language:	English
Published:	SAGE Publishing 2025-07-01
Series:	Journal of International Medical Research
Online Access:	https://doi.org/10.1177/03000605251347556

_version_	1850081089819770880
author	Eric Nan Liang, Sophia Pei, Phillip Staibano, Benjamin van der Woerd
author_facet	Eric Nan Liang, Sophia Pei, Phillip Staibano, Benjamin van der Woerd
author_sort	Eric Nan Liang
collection	DOAJ
description	Objective: To provide a comprehensive overview of the current use of large language models in clinical medicine and surgery, with emphasis on model characteristics, clinical applications, and readiness for adoption. Methods: A scoping review of studies on the use of large language models in clinical medicine and surgery was conducted in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-scoping review and JBI methodology (protocol registration: 10.37766/inplasy2025.3.0102). A comprehensive search of EMBASE, PubMed, CINAHL, and IEEE Xplore identified 3313 articles published between 2018 and 2023. After screening of articles and full-text review, 156 studies were included. Data were extracted for study type, sample size, clinical specialty, model architecture, training methods, application purpose, and performance metrics. Descriptive analyses were performed. Results: Most studies were proof-of-concept studies (55.8%) or clinical trials (21.2%), with a steady rise in publications since 2022. Large language models were most frequently used for data extraction (69.9%), followed by clinical recommendations (11.5%), report generation (9.0%), and patient-facing chatbots (7.1%). Proprietary models were used in 57.7% of the studies, whereas 39.7% used open-source models. ChatGPT-3.5, ChatGPT-4, and Bidirectional Encoder Representations from Transformers (BERT) were the most commonly reported models. Only 25.0% of the studies reported models as ready for clinical use, whereas 67.9% stated that the models required further validation. F-score (30.8%) and area under the curve (15.4%) were the most common performance metrics; 10.9% of the studies used expert opinion for validation. Conclusions: Large language models are increasingly being used in clinical medicine. Although most applications focus on data extraction and summarization, emerging studies are beginning to explore higher-level tasks such as clinical decision-making and multidisciplinary simulation. Significant heterogeneity continues to exist in model architecture, evaluation methods, and reporting standards. Further standardization is needed to develop transparent evaluation frameworks and ensure safe, reliable integration of large language models into complex clinical workflows.
format	Article
id	doaj-art-3489bc0df60448019469210e2b264662
institution	DOAJ
issn	1473-2300
language	English
publishDate	2025-07-01
publisher	SAGE Publishing
record_format	Article
series	Journal of International Medical Research
spelling	doaj-art-3489bc0df60448019469210e2b264662 | 2025-08-20T02:44:49Z | eng | SAGE Publishing | Journal of International Medical Research | 1473-2300 | 2025-07-01 | 53 | 10.1177/03000605251347556 | Clinical applications of large language models in medicine and surgery: A scoping review | Eric Nan Liang | Sophia Pei | Phillip Staibano | Benjamin van der Woerd | Objective: To provide a comprehensive overview of the current use of large language models in clinical medicine and surgery, with emphasis on model characteristics, clinical applications, and readiness for adoption. Methods: A scoping review of studies on the use of large language models in clinical medicine and surgery was conducted in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-scoping review and JBI methodology (protocol registration: 10.37766/inplasy2025.3.0102). A comprehensive search of EMBASE, PubMed, CINAHL, and IEEE Xplore identified 3313 articles published between 2018 and 2023. After screening of articles and full-text review, 156 studies were included. Data were extracted for study type, sample size, clinical specialty, model architecture, training methods, application purpose, and performance metrics. Descriptive analyses were performed. Results: Most studies were proof-of-concept studies (55.8%) or clinical trials (21.2%), with a steady rise in publications since 2022. Large language models were most frequently used for data extraction (69.9%), followed by clinical recommendations (11.5%), report generation (9.0%), and patient-facing chatbots (7.1%). Proprietary models were used in 57.7% of the studies, whereas 39.7% used open-source models. ChatGPT-3.5, ChatGPT-4, and Bidirectional Encoder Representations from Transformers (BERT) were the most commonly reported models. Only 25.0% of the studies reported models as ready for clinical use, whereas 67.9% stated that the models required further validation. F-score (30.8%) and area under the curve (15.4%) were the most common performance metrics; 10.9% of the studies used expert opinion for validation. Conclusions: Large language models are increasingly being used in clinical medicine. Although most applications focus on data extraction and summarization, emerging studies are beginning to explore higher-level tasks such as clinical decision-making and multidisciplinary simulation. Significant heterogeneity continues to exist in model architecture, evaluation methods, and reporting standards. Further standardization is needed to develop transparent evaluation frameworks and ensure safe, reliable integration of large language models into complex clinical workflows. | https://doi.org/10.1177/03000605251347556
title	Clinical applications of large language models in medicine and surgery: A scoping review
url	https://doi.org/10.1177/03000605251347556

Similar Items