Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study

Bibliographic Details
Main Authors: Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Dao-fang Ding, Xiao-Ye Tang, Qian Zhang, Yong He
Format: Article
Language:English
Published: BMC 2025-08-01
Series: BMC Medical Informatics and Decision Making (ISSN 1472-6947)
Subjects: Large language model; Patient education; Rotator cuff injury; Real world interview
Online Access:https://doi.org/10.1186/s12911-025-03105-5
description Abstract
Objective: This study evaluated the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Overall, Gemini excelled in accuracy, while ChatGPT-4o performed better in patient interactions.
Methods: In Phase 1, the four LLM chatbots answered physician test questions on rotator cuff injuries, interacting with patients and students; their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. In Phase 2, twenty patients questioned the top two chatbots from Phase 1 (ChatGPT-4o and Gemini), rating the responses for satisfaction and readability, and three physicians evaluated accuracy, usefulness, safety, and completeness on a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10. The Friedman test compared evaluation and readability scores among the chatbots, with Bonferroni-corrected pairwise comparisons; the Mann-Whitney U test compared ChatGPT-4o with Gemini; statistical significance was set at p < 0.05.
Results: In Phase 1, Gemini achieved the highest average accuracy; in the second part of the test, it also showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70; post hoc Dunnett test, p < 0.05). ChatGPT-4o had the highest reading difficulty score (14.22; post hoc Dunnett test, p < 0.05), indicating college-level reading difficulty. In Phase 2, patient ratings differed significantly between the two models in satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23), and the orthopedic surgeons rated ChatGPT-4o higher than Gemini in accuracy, completeness, readability, usefulness, and safety (all p < 0.05).
Conclusion: LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.
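The abstract's statistical workflow (Friedman test across all four chatbots, Bonferroni-corrected pairwise comparisons, and a Mann-Whitney U test for the ChatGPT-4o versus Gemini head-to-head) can be sketched in Python with SciPy. This is a minimal illustration, not the authors' SPSS analysis: the Likert ratings below are fabricated, and the choice of Wilcoxon signed-rank for the paired follow-ups is an assumption (the abstract does not name the pairwise test).

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 5-point Likert ratings of the same 20 clinical questions for
# each chatbot. The study's actual data are not public; these values are
# fabricated purely to illustrate the analysis pipeline.
ratings = {
    "ChatGPT-4o": rng.integers(3, 6, size=20),
    "ChatGPT-o1": rng.integers(2, 6, size=20),
    "Gemini":     rng.integers(3, 6, size=20),
    "ERNIE Bot":  rng.integers(2, 5, size=20),
}

# Friedman test: nonparametric comparison of all four chatbots across the
# same question set (repeated measures).
chi2, p_friedman = stats.friedmanchisquare(*ratings.values())
print(f"Friedman chi-square = {chi2:.2f}, p = {p_friedman:.4f}")

# Bonferroni correction: divide alpha by the number of pairwise comparisons.
pairs = list(combinations(ratings, 2))
alpha_corrected = 0.05 / len(pairs)  # 6 pairs -> 0.0083
for a, b in pairs:
    # Wilcoxon signed-rank suits paired ratings; zsplit keeps tied scores usable.
    _, p_pair = stats.wilcoxon(ratings[a], ratings[b], zero_method="zsplit")
    verdict = "significant" if p_pair < alpha_corrected else "n.s."
    print(f"{a} vs {b}: p = {p_pair:.4f} ({verdict})")

# Mann-Whitney U test for the Phase 2 head-to-head comparison of
# ChatGPT-4o and Gemini.
u, p_mwu = stats.mannwhitneyu(ratings["ChatGPT-4o"], ratings["Gemini"])
print(f"Mann-Whitney U = {u:.1f}, p = {p_mwu:.4f}")
```

The Friedman test is the nonparametric analogue of repeated-measures ANOVA, appropriate here because the same questions are rated for every chatbot and Likert scores are ordinal.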
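The "reading difficulty score" of 14.22 reported for ChatGPT-4o behaves like a Flesch-Kincaid grade level, where the score approximates years of schooling (14 corresponds to college level). The abstract does not name the exact metric, so the standard Flesch-Kincaid grade-level formula below is an assumption, and the syllable counter is a deliberately naive illustration.

```python
def fk_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch-Kincaid grade level from raw text counts:
    0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    """
    return (0.39 * total_words / total_sentences
            + 11.8 * total_syllables / total_words
            - 15.59)


def naive_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    vowels = "aeiouy"
    word = word.lower()
    count = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)


# Example counts: 100 words, 5 sentences, 150 syllables -> grade 9.91.
# A score of 14.22, as reported for ChatGPT-4o, implies noticeably
# longer sentences and/or more polysyllabic words than this.
print(round(fk_grade(100, 5, 150), 2))  # -> 9.91
```

Patient-education guidance commonly targets roughly a sixth- to eighth-grade level, which is why a grade-14 output is flagged as a readability concern.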
Author affiliations:
Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Yong He: Guanghua Hospital Affiliated to Shanghai University of Traditional Chinese Medicine
Dao-fang Ding, Xiao-Ye Tang: Shanghai University of Traditional Chinese Medicine
Qian Zhang: Department of Orthopedics, Wuxi No 9. People's Hospital Affiliated to Soochow University