Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study
Abstract Objective This study evaluates the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Results show Gemini excels in accuracy, while ChatGPT-4o performs better in patient interactions. Methods Phase 1: Four LLM chatbots answered physician test questions on rotator cuff injuries and interacted with patients and students; their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. Phase 2: Twenty patients questioned the top two chatbots (ChatGPT-4o and Gemini), with responses rated for satisfaction and readability; three physicians evaluated accuracy, usefulness, safety, and completeness on a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10; the Friedman test with Bonferroni-corrected pairwise comparisons compared evaluation and readability scores among chatbots, the Mann-Whitney U test compared ChatGPT-4o versus Gemini, and significance was set at p < 0.05. Results Gemini achieved the highest average accuracy. In the second part, Gemini showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70; post hoc Dunnett test, p < 0.05). In Phase 2, 20 rotator cuff injury patients questioned the top two models from Phase 1 (ChatGPT-4o and Gemini). ChatGPT-4o had the highest reading difficulty score (14.22; post hoc Dunnett test, p < 0.05), suggesting a middle school reading level or above. Statistical analysis showed significant differences in patient satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23). Orthopedic surgeons rated ChatGPT-4o higher in accuracy, completeness, readability, usefulness, and safety (all p < 0.05), outperforming Gemini in all aspects. Conclusion The study found that LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.
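The statistical pipeline described in the Methods (Friedman test across the four chatbots, Bonferroni-corrected pairwise follow-ups, and a Mann-Whitney U test for the ChatGPT-4o versus Gemini comparison) can be sketched in Python. The study itself used IBM SPSS and Prism; the SciPy calls and the Likert-style scores below are illustrative placeholders, not the study's data.

```python
# Illustrative sketch of the study's analysis plan using SciPy.
# All scores are randomly generated placeholders, NOT the published data.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, mannwhitneyu, wilcoxon

rng = np.random.default_rng(0)
# Hypothetical 5-point Likert ratings: one row per question, one column per chatbot.
scores = {
    "ChatGPT-4o": rng.integers(3, 6, 20),
    "ChatGPT-o1": rng.integers(2, 6, 20),
    "Gemini":     rng.integers(3, 6, 20),
    "ERNIE Bot":  rng.integers(2, 5, 20),
}

# Friedman test: do the four chatbots differ on the same set of questions?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")

# Bonferroni-corrected pairwise comparisons (6 pairs among 4 chatbots).
pairs = list(combinations(scores, 2))
alpha_corrected = 0.05 / len(pairs)
for a, b in pairs:
    w, pw = wilcoxon(scores[a], scores[b], zero_method="zsplit")
    print(f"{a} vs {b}: p={pw:.4f} (significant if < {alpha_corrected:.4f})")

# Phase 2 comparison: Mann-Whitney U on ChatGPT-4o vs Gemini ratings.
u, pu = mannwhitneyu(scores["ChatGPT-4o"], scores["Gemini"],
                     alternative="two-sided")
print(f"Mann-Whitney U={u:.1f}, p={pu:.4f}")
```

Note the design choice implied by the abstract: the Friedman test is the nonparametric analogue of repeated-measures ANOVA, appropriate because all four chatbots answered the same questions (paired data), while the Phase 2 ChatGPT-4o/Gemini contrast is analyzed with Mann-Whitney U.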
Saved in:
| Main Authors: | Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Dao-fang Ding, Xiao-Ye Tang, Qian Zhang, Yong He |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-08-01 |
| Series: | BMC Medical Informatics and Decision Making |
| Subjects: | Large language model; Patient education; Rotator cuff injury; Real world interview |
| Online Access: | https://doi.org/10.1186/s12911-025-03105-5 |
| _version_ | 1849234625931509760 |
|---|---|
| author | Yi-Lin Wang Li-Chao Tian Jing-Yuan Meng Jie-Chao Zhang Zhi-Xing Nie Wen-Rui Wei Dao-fang Ding Xiao-Ye Tang Qian Zhang Yong He |
| author_facet | Yi-Lin Wang Li-Chao Tian Jing-Yuan Meng Jie-Chao Zhang Zhi-Xing Nie Wen-Rui Wei Dao-fang Ding Xiao-Ye Tang Qian Zhang Yong He |
| author_sort | Yi-Lin Wang |
| collection | DOAJ |
| description | Abstract Objective This study evaluates the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Results show Gemini excels in accuracy, while ChatGPT-4o performs better in patient interactions. Methods Phase 1: Four LLM chatbots answered physician test questions on rotator cuff injuries, interacting with patients and students. Their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. Phase 2: Twenty patients questioned the top two chatbots (ChatGPT-4o, Gemini), with responses rated for satisfaction and readability. Three physicians evaluated accuracy, usefulness, safety, and completeness using a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10; Friedman test compared evaluation and readability scores among chatbots with Bonferroni-corrected pairwise comparisons, Mann-Whitney U test compared ChatGPT-4o versus Gemini; statistical significance at p < 0.05. Results Gemini achieved the highest average accuracy. In the second part, Gemini showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70, post hoc Dunnett test, p < 0.05). Additionally, 20 rotator cuff injury patients questioned the top two models from Phase 1 (ChatGPT-4o and Gemini). ChatGPT-4o had the highest reading difficulty score (14.22, post hoc Dunnett test, p < 0.05), suggesting a middle school reading level or above. Statistical analysis showed significant differences in patient satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23). Orthopedic surgeons rated ChatGPT-4o higher in accuracy, completeness, readability, usefulness, and safety (all p < 0.05), outperforming Gemini in all aspects. 
Conclusion The study found that LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development. |
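The "reading difficulty score" of 14.22 reported above is consistent with a grade-level readability index such as the Flesch-Kincaid Grade Level, though the abstract does not name the formula used. As an illustration only, a minimal FKGL computation looks like this (the syllable counter is a crude heuristic; production tools such as the textstat library use more careful rules, and the sample sentence is invented):

```python
# Illustrative Flesch-Kincaid Grade Level computation (assumption: the study
# used a grade-level index like FKGL; the formula below is the standard FKGL).
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups, minimum one per word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text: str) -> float:
    # FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = ("A rotator cuff tear is an injury to the muscles and tendons "
          "that stabilize the shoulder joint.")
print(f"FK grade: {fk_grade(sample):.2f}")
```

A score of roughly 14 on this scale corresponds to text requiring around 14 years of schooling, which is why high scores flag chatbot answers as potentially hard for lay patients to read.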
| format | Article |
| id | doaj-art-7938d66e22a443bd8e71d53c4db27cdb |
| institution | Kabale University |
| issn | 1472-6947 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Informatics and Decision Making |
| spelling | doaj-art-7938d66e22a443bd8e71d53c4db27cdb 2025-08-20T04:03:06Z eng BMC, BMC Medical Informatics and Decision Making (ISSN 1472-6947), 2025-08-01, 25 1 1 10, 10.1186/s12911-025-03105-5. Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study. Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei (Guanghua Hospital Affiliated to Shanghai University of Traditional Chinese Medicine); Dao-fang Ding, Xiao-Ye Tang (Shanghai University of Traditional Chinese Medicine); Qian Zhang (Department of Orthopedics, Wuxi No 9. People’s Hospital Affiliated to Soochow University); Yong He (Guanghua Hospital Affiliated to Shanghai University of Traditional Chinese Medicine). Abstract as given above. Keywords: Large language model; Patient education; Rotator cuff injury; Real world interview. https://doi.org/10.1186/s12911-025-03105-5 |
| spellingShingle | Yi-Lin Wang Li-Chao Tian Jing-Yuan Meng Jie-Chao Zhang Zhi-Xing Nie Wen-Rui Wei Dao-fang Ding Xiao-Ye Tang Qian Zhang Yong He Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study BMC Medical Informatics and Decision Making Large language model Patient education Rotator cuff injury Real world interview |
| title | Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study |
| title_full | Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study |
| title_fullStr | Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study |
| title_full_unstemmed | Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study |
| title_short | Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study |
| title_sort | evaluation of large language models in patient education and clinical decision support for rotator cuff injury a two phase benchmarking study |
| topic | Large language model Patient education Rotator cuff injury Real world interview |
| url | https://doi.org/10.1186/s12911-025-03105-5 |
| work_keys_str_mv | AT yilinwang evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT lichaotian evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT jingyuanmeng evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT jiechaozhang evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT zhixingnie evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT wenruiwei evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT daofangding evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT xiaoyetang evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT qianzhang evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy AT yonghe evaluationoflargelanguagemodelsinpatienteducationandclinicaldecisionsupportforrotatorcuffinjuryatwophasebenchmarkingstudy |