Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study

Bibliographic Details
Main Authors: Stefano Pagano, Luigi Strumolo, Katrin Michalk, Julia Schiegl, Loreto C. Pulido, Jan Reinhard, Guenther Maderbacher, Tobias Renkawitz, Marie Schuster
Format: Article
Language: English
Published: Elsevier, 2025-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Large Language Models (LLMs); GPT-4o; ChatGPT; Gemini; Llama; Gemma 2
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037024004343
author Stefano Pagano
Luigi Strumolo
Katrin Michalk
Julia Schiegl
Loreto C. Pulido
Jan Reinhard
Guenther Maderbacher
Tobias Renkawitz
Marie Schuster
collection DOAJ
description Background: Large Language Models (LLMs) such as ChatGPT are gaining attention for their potential applications in healthcare. This study aimed to evaluate the diagnostic sensitivity of various LLMs in detecting hip or knee osteoarthritis (OA) using only patient-reported data collected via a structured questionnaire, without prior medical consultation. Methods: A prospective observational study was conducted at an orthopaedic outpatient clinic specialising in hip and knee OA treatment. A total of 115 patients completed a paper-based questionnaire covering symptoms, medical history, and demographic information. The diagnostic performance of nine LLM versions from five model families (four versions of ChatGPT, two of Gemini, plus Llama, Gemma 2, and Mistral-Nemo) was analysed. Model-generated diagnoses were compared against those provided by experienced orthopaedic clinicians, which served as the reference standard. Results: GPT-4o achieved the highest diagnostic sensitivity at 92.3%, significantly outperforming the other LLMs. The completeness of patient responses to symptom-related questions was the strongest predictor of GPT-4o's accuracy (p < 0.001). Inter-model agreement was moderate among the GPT-4 versions, whereas models such as Llama-3.1 showed notably lower accuracy and concordance. Conclusions: GPT-4o diagnosed OA with high accuracy and consistency from patient-reported questionnaires alone, underscoring its potential as a supplementary diagnostic tool in clinical settings. Nevertheless, relying on patient-reported data without direct physician involvement highlights the critical need for medical oversight to ensure diagnostic accuracy. Further research is needed to refine LLM capabilities and to extend their utility to broader diagnostic applications.
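The abstract's two headline metrics, diagnostic sensitivity against a clinician reference standard and chance-corrected inter-model agreement, can be illustrated with a short sketch. This is not the study's code; the labels below are made-up placeholders, and the functions show only the standard definitions (sensitivity = TP / (TP + FN); Cohen's kappa for pairwise agreement).

```python
# Illustrative sketch, not the study's analysis code.
# All case labels below are hypothetical placeholders.

def sensitivity(reference, predicted, positive="OA"):
    """True-positive rate: share of reference-positive cases the model also labels positive."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == positive and p == positive)
    fn = sum(1 for r, p in zip(reference, predicted) if r == positive and p != positive)
    return tp / (tp + fn)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters/models over the same cases."""
    labels = set(a) | set(b)
    n = len(a)
    p_obs = sum(1 for x, y in zip(a, b) if x == y) / n          # observed agreement
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical diagnoses for eight patients.
clinician = ["OA", "OA", "OA", "other", "OA", "other", "OA", "OA"]   # reference standard
model_a   = ["OA", "OA", "other", "other", "OA", "OA", "OA", "OA"]
model_b   = ["OA", "other", "other", "other", "OA", "OA", "OA", "other"]

print(f"sensitivity of model A: {sensitivity(clinician, model_a):.3f}")  # → 0.833
print(f"kappa(A, B): {cohens_kappa(model_a, model_b):.3f}")              # → 0.500
```

In the study's terms, the clinician list plays the role of the reference standard, each model list a different LLM, and kappa the inter-model concordance reported for the GPT-4 versions.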
format Article
id doaj-art-05d6d2abb5c54877a32dcafcd7fbb479
institution Kabale University
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
affiliations Stefano Pagano: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany (corresponding author)
Luigi Strumolo: Freelance health consultant and senior data analyst, Avellino, Italy
Katrin Michalk: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
Julia Schiegl: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
Loreto C. Pulido: Department of Orthopaedics Hospital of Trauma Surgery, Marktredwitz Hospital, Marktredwitz, Germany
Jan Reinhard: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
Guenther Maderbacher: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
Tobias Renkawitz: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
Marie Schuster: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
title Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study
topic Large Language Models (LLMs)
GPT-4o
ChatGPT
Gemini
Llama
Gemma 2
url http://www.sciencedirect.com/science/article/pii/S2001037024004343