Effectiveness of a large language model for clinical information retrieval regarding shoulder arthroplasty

Abstract

Purpose: To determine the scope and accuracy of medical information provided by ChatGPT‐4 in response to clinical queries concerning total shoulder arthroplasty (TSA), and to compare these results with those of the Google search engine.

Methods: A patient‐replicated query for ‘total shoulder replacement’ was performed using both Google Web Search (the most frequently used search engine worldwide) and ChatGPT‐4, and the top 10 frequently asked questions (FAQs), answers, and associated sources were extracted. The search was then repeated independently to identify the top 10 FAQs requiring numerical responses, so that the concordance of answers between Google and ChatGPT‐4 could be compared. Two blinded orthopaedic shoulder surgeons graded the clinical relevance and accuracy of the provided information.

Results: Among FAQs with numeric responses, 8 of 10 (80%) had identical answers or substantial overlap between ChatGPT‐4 and Google, and the accuracy of the information did not differ significantly (p = 0.32). Google sources comprised 40% medical practices, 30% academic, 20% single‐surgeon practices, and 10% social media, whereas ChatGPT‐4 cited 100% academic sources, a statistically significant difference (p = 0.001). Only 3 of 10 (30%) FAQs with open‐ended answers were identical between ChatGPT‐4 and Google; the clinical relevance of the FAQs did not differ significantly (p = 0.18). Google sources for open‐ended questions comprised academic (60%), social media (20%), medical practice (10%), and single‐surgeon practice (10%) sites, whereas 100% of ChatGPT‐4 sources were academic, again a statistically significant difference (p = 0.0025).

Conclusion: ChatGPT‐4 provided trustworthy academic sources for medical information retrieval concerning TSA, whereas the sources used by Google were heterogeneous. The accuracy and clinical relevance of information did not differ significantly between ChatGPT‐4 and Google.

Level of Evidence: Level IV, cross‐sectional study.
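To make the query protocol in the Methods concrete, below is a minimal sketch of how the ChatGPT‐4 arm could be reproduced programmatically. This is an illustration only: the study queried ChatGPT‐4 interactively rather than through an API, so the OpenAI client usage, the model identifier, and the prompt wording are all assumptions, not the authors' procedure.

```python
# Hypothetical reproduction of the ChatGPT-4 query arm. NOT the authors'
# method: the study used the ChatGPT interface interactively, so the client
# usage, model name, and prompt wording below are all assumed.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = (
    "Acting as a patient, list the 10 most frequently asked questions about "
    "'total shoulder replacement'. For each question, provide a concise "
    "answer and name the source of the information."
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed identifier corresponding to ChatGPT-4
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```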
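The Results compare the distribution of source categories between the two engines. As a second sketch, the counts for the numeric-response FAQs can be reconstructed from the reported percentages (10 sources per engine) and compared with a chi-squared test of independence. The abstract does not name the test actually used, and with expected counts this small an exact test would likely be needed to reproduce the reported p = 0.001, so the p-value printed below will differ.

```python
# Source-category comparison for the numeric-response FAQs, with counts
# reconstructed from the reported percentages. The chi-squared test is an
# assumption; the paper does not specify its statistical test.
from scipy.stats import chi2_contingency

# Columns: academic, medical practice, single-surgeon practice, social media
google_sources = [3, 4, 2, 1]    # 30%, 40%, 20%, 10% of 10 sources
chatgpt_sources = [10, 0, 0, 0]  # 100% academic

chi2, p, dof, _ = chi2_contingency([google_sources, chatgpt_sources])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```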

Bibliographic Details
Main Authors: Jacob F. Oeding, Amy Z. Lu, Michael Mazzucco, Michael C. Fu, David M. Dines, Russell F. Warren, Lawrence V. Gulotta, Joshua S. Dines, Kyle N. Kunze
Affiliations: Jacob F. Oeding: Department of Orthopaedics, Institute of Clinical Sciences, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden. Amy Z. Lu and Michael Mazzucco: Weill Cornell Medical College, New York, New York, USA. Michael C. Fu, David M. Dines, Russell F. Warren, Lawrence V. Gulotta, Joshua S. Dines, and Kyle N. Kunze: Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
Format: Article
Language: English
Published: Wiley, 2024-10-01
Series: Journal of Experimental Orthopaedics, Volume 11, Issue 4
ISSN: 2197-1153
Collection: DOAJ (Directory of Open Access Journals)
Subjects: ChatGPT; information retrieval; large language model; LLM; total shoulder arthroplasty
Online Access: https://doi.org/10.1002/jeo2.70114