Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study

Bibliographic Details
Main Authors: Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Dao-fang Ding, Xiao-Ye Tang, Qian Zhang, Yong He
Format: Article
Language:English
Published: BMC 2025-08-01
Series: BMC Medical Informatics and Decision Making (ISSN 1472-6947)
Subjects: Large language model; Patient education; Rotator cuff injury; Real world interview
Online Access:https://doi.org/10.1186/s12911-025-03105-5
description Abstract
Objective: This study evaluated the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Overall, Gemini excelled in accuracy, while ChatGPT-4o performed better in patient interactions.
Methods: In Phase 1, the four LLM chatbots answered physician test questions on rotator cuff injuries, interacting with patients and students; their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. In Phase 2, twenty patients questioned the top two chatbots from Phase 1 (ChatGPT-4o and Gemini), rating the responses for satisfaction and readability, and three physicians evaluated accuracy, usefulness, safety, and completeness on a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10. The Friedman test compared evaluation and readability scores among the chatbots, with Bonferroni-corrected pairwise comparisons; the Mann-Whitney U test compared ChatGPT-4o with Gemini; statistical significance was set at p < 0.05.
Results: In Phase 1, Gemini achieved the highest average accuracy; in the second part of the test, it also showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70; post hoc Dunnett test, p < 0.05). ChatGPT-4o had the highest reading difficulty score (14.22; post hoc Dunnett test, p < 0.05), indicating college-level reading difficulty. In Phase 2, patient ratings differed significantly between the two models in satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23), and the orthopedic surgeons rated ChatGPT-4o higher than Gemini in accuracy, completeness, readability, usefulness, and safety (all p < 0.05).
Conclusion: LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.
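The abstract's statistical workflow (Friedman test across all four chatbots, Bonferroni-corrected pairwise comparisons, and a Mann-Whitney U test for the ChatGPT-4o versus Gemini head-to-head) can be sketched in Python with SciPy. This is a minimal illustration, not the authors' SPSS analysis: the Likert ratings below are fabricated, and the choice of Wilcoxon signed-rank for the paired follow-ups is an assumption (the abstract does not name the pairwise test).

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 5-point Likert ratings of the same 20 clinical questions for
# each chatbot. The study's actual data are not public; these values are
# fabricated purely to illustrate the analysis pipeline.
ratings = {
    "ChatGPT-4o": rng.integers(3, 6, size=20),
    "ChatGPT-o1": rng.integers(2, 6, size=20),
    "Gemini":     rng.integers(3, 6, size=20),
    "ERNIE Bot":  rng.integers(2, 5, size=20),
}

# Friedman test: nonparametric comparison of all four chatbots across the
# same question set (repeated measures).
chi2, p_friedman = stats.friedmanchisquare(*ratings.values())
print(f"Friedman chi-square = {chi2:.2f}, p = {p_friedman:.4f}")

# Bonferroni correction: divide alpha by the number of pairwise comparisons.
pairs = list(combinations(ratings, 2))
alpha_corrected = 0.05 / len(pairs)  # 6 pairs -> 0.0083
for a, b in pairs:
    # Wilcoxon signed-rank suits paired ratings; zsplit keeps tied scores usable.
    _, p_pair = stats.wilcoxon(ratings[a], ratings[b], zero_method="zsplit")
    verdict = "significant" if p_pair < alpha_corrected else "n.s."
    print(f"{a} vs {b}: p = {p_pair:.4f} ({verdict})")

# Mann-Whitney U test for the Phase 2 head-to-head comparison of
# ChatGPT-4o and Gemini.
u, p_mwu = stats.mannwhitneyu(ratings["ChatGPT-4o"], ratings["Gemini"])
print(f"Mann-Whitney U = {u:.1f}, p = {p_mwu:.4f}")
```

The Friedman test is the nonparametric analogue of repeated-measures ANOVA, appropriate here because the same questions are rated for every chatbot and Likert scores are ordinal.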
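The "reading difficulty score" of 14.22 reported for ChatGPT-4o behaves like a Flesch-Kincaid grade level, where the score approximates years of schooling (14 corresponds to college level). The abstract does not name the exact metric, so the standard Flesch-Kincaid grade-level formula below is an assumption, and the syllable counter is a deliberately naive illustration.

```python
def fk_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid grade level from raw text counts:
    0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    """
    return (0.39 * total_words / total_sentences
            + 11.8 * total_syllables / total_words
            - 15.59)


def naive_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    vowels = "aeiouy"
    word = word.lower()
    count = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)


# Example counts: 100 words, 5 sentences, 150 syllables -> grade 9.91.
# A score of 14.22, as reported for ChatGPT-4o, implies noticeably
# longer sentences and/or more polysyllabic words than this.
print(round(fk_grade(100, 5, 150), 2))  # -> 9.91
```

Patient-education guidance commonly targets roughly a sixth- to eighth-grade level, which is why a grade-14 output is flagged as a readability concern.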
Author affiliations:
Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Yong He: Guanghua Hospital Affiliated to Shanghai University of Traditional Chinese Medicine
Dao-fang Ding, Xiao-Ye Tang: Shanghai University of Traditional Chinese Medicine
Qian Zhang: Department of Orthopedics, Wuxi No 9. People's Hospital Affiliated to Soochow University