Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study

Abstract. Background: The integration of artificial intelligence (AI) in healthcare has rapidly expanded, particularly in clinical decision-making. Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown potential in various medical applications, including diagnostics and treatment planning....

Bibliographic Details
Main Authors: Sönmez Saglam, Veysel Uludag, Zekeriya Okan Karaduman, Mehmet Arıcan, Mücahid Osman Yücel, Raşit Emin Dalaslan
Format: Article
Language: English
Published: BMC 2025-04-01
Series: BMC Medical Informatics and Decision Making
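Subjects: Artificial intelligence; Large language models; Sports surgery; Physiotherapy; Clinical decision-making; Rehabilitation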
Online Access: https://doi.org/10.1186/s12911-025-02996-8
author Sönmez Saglam
Veysel Uludag
Zekeriya Okan Karaduman
Mehmet Arıcan
Mücahid Osman Yücel
Raşit Emin Dalaslan
author_sort Sönmez Saglam
collection DOAJ
description Abstract. Background: The integration of artificial intelligence (AI) in healthcare has rapidly expanded, particularly in clinical decision-making. Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown potential in various medical applications, including diagnostics and treatment planning. However, their efficacy in specialized fields like sports surgery and physiotherapy remains underexplored. This study aims to compare the performance of GPT-4 and GPT-3.5 in clinical decision-making within these domains using a structured assessment approach. Methods: This cross-sectional study included 56 professionals specializing in sports surgery and physiotherapy. Participants evaluated 10 standardized clinical scenarios generated by GPT-4 and GPT-3.5 using a 5-point Likert scale. The scenarios encompassed common musculoskeletal conditions, and assessments focused on diagnostic accuracy, treatment appropriateness, surgical technique detailing, and rehabilitation plan suitability. Data were collected anonymously via Google Forms. Statistical analysis included paired t-tests for direct model comparisons, one-way ANOVA to assess performance across multiple criteria, and Cronbach’s alpha to evaluate inter-rater reliability. Results: GPT-4 significantly outperformed GPT-3.5 across all evaluated criteria. Paired t-test results (t(55) = 10.45, p < 0.001) demonstrated that GPT-4 provided more accurate diagnoses, superior treatment plans, and more detailed surgical recommendations. ANOVA results confirmed the higher suitability of GPT-4 in treatment planning (F(1, 55) = 35.22, p < 0.001) and rehabilitation protocols (F(1, 55) = 32.10, p < 0.001). Cronbach’s alpha values indicated higher internal consistency for GPT-4 (α = 0.478) compared to GPT-3.5 (α = 0.234), reflecting more reliable performance. Conclusions: GPT-4 demonstrates superior performance compared to GPT-3.5 in clinical decision-making for sports surgery and physiotherapy. These findings suggest that advanced AI models can aid in diagnostic accuracy, treatment planning, and rehabilitation strategies. However, AI should function as a decision-support tool rather than a substitute for expert clinical judgment. Future studies should explore the integration of AI into real-world clinical workflows, validate findings using larger datasets, and compare additional AI models beyond the GPT series.
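Note: the abstract names a paired t-test and Cronbach’s alpha but does not show how they are computed. The following is a minimal sketch of both statistics using only the Python standard library. It is not the authors' analysis code; the function names (paired_t, cronbach_alpha) and the example ratings are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(a, b):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table of scores."""
    k = len(rows[0])  # number of items (columns)
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_vars = [variance(col) for col in zip(*rows)]  # per-item variance
    total_var = variance([sum(r) for r in rows])  # variance of row totals
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical mean Likert ratings from six raters (not study data).
    ratings_model_a = [5, 4, 5, 4, 4, 5]
    ratings_model_b = [3, 3, 4, 2, 3, 4]
    t, df = paired_t(ratings_model_a, ratings_model_b)
    print(f"t({df}) = {t:.2f}")

    # Hypothetical raters-by-criteria score table (not study data).
    scores = [[4, 5, 4], [3, 3, 2], [5, 5, 5], [2, 3, 3]]
    print(f"alpha = {cronbach_alpha(scores):.3f}")
```

With 56 raters, the same paired_t call would return 55 degrees of freedom, matching the t(55) reported in the abstract. The p-value is omitted here because it requires the t distribution, which the standard library does not provide.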
format Article
id doaj-art-79ae1b54e7a04cbeb76dffd4c95d468e
institution DOAJ
issn 1472-6947
language English
publishDate 2025-04-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
spelling doaj-art-79ae1b54e7a04cbeb76dffd4c95d468e
2025-08-20T03:18:41Z
eng
BMC
BMC Medical Informatics and Decision Making
1472-6947
2025-04-01
25118
10.1186/s12911-025-02996-8
Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study
Sönmez Saglam [0]
Veysel Uludag [1]
Zekeriya Okan Karaduman [2]
Mehmet Arıcan [3]
Mücahid Osman Yücel [4]
Raşit Emin Dalaslan [5]
[0] Department of Orthopaedics and Traumatology, Faculty of Medicine, Duzce University
[1] Department of Physiotherapy and Rehabilitation, Faculty of Health Sciences, Duzce University
[2] Department of Orthopaedics and Traumatology, Faculty of Medicine, Duzce University
[3] Department of Orthopaedics and Traumatology, Faculty of Medicine, Duzce University
[4] Department of Orthopaedics and Traumatology, Faculty of Medicine, Duzce University
[5] Department of Orthopaedics and Traumatology, Faculty of Medicine, Duzce University
https://doi.org/10.1186/s12911-025-02996-8
Artificial intelligence; Large language models; Sports surgery; Physiotherapy; Clinical decision-making; Rehabilitation
title Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study
title_sort comparative evaluation of artificial intelligence models gpt 4 and gpt 3 5 in clinical decision making in sports surgery and physiotherapy a cross sectional study
topic Artificial intelligence
Large language models
Sports surgery
Physiotherapy
Clinical decision-making
Rehabilitation
url https://doi.org/10.1186/s12911-025-02996-8