Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study

Bibliographic Details
Main Authors: Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, Fang Zhang
Format: Article
Language:English
Published: JMIR Publications 2025-01-01
Series:JMIR Medical Informatics
Online Access:https://medinform.jmir.org/2025/1/e63731
_version_ 1841546284699222016
author Shiben Zhu
Wanqin Hu
Zhi Yang
Jiani Yan
Fang Zhang
author_facet Shiben Zhu
Wanqin Hu
Zhi Yang
Jiani Yan
Fang Zhang
author_sort Shiben Zhu
collection DOAJ
description Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs because it requires both deep domain-specific nursing knowledge and complex clinical decision-making, which differentiates it from more general medical examinations. However, the potential application of LLMs to the CNNLE remains unexplored. Objective: This study aims to evaluate the accuracy of 7 LLMs, namely GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE administered between 2019 and 2023. Seven LLMs were evaluated on these questions, and 9 machine learning models (Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost) were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy at 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well on brief clinical case summaries and on questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using the 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE, and it shows that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.
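The ensemble idea in the Methods can be illustrated with a minimal sketch: each model's chosen option for a question is a feature, and a combiner picks the final answer. The study's meta-learners (XGBoost and others) are replaced here by a plain accuracy-weighted vote so the example needs no third-party libraries; the model count, weights, and answers below are invented for illustration, not taken from the study.

```python
# Sketch of combining several LLMs' multiple-choice answers.
# XGBoost is stood in for by an accuracy-weighted vote; all data are toy data.
from collections import defaultdict

def weighted_vote(answers, weights):
    """Combine one question's answers from several models.

    answers: list of option letters, one per model.
    weights: per-model weights (e.g. each model's standalone accuracy).
    Returns the option with the highest total weight.
    """
    scores = defaultdict(float)
    for option, w in zip(answers, weights):
        scores[option] += w
    return max(scores, key=scores.get)

# Three hypothetical models, weighted by made-up standalone accuracies.
weights = [0.889, 0.807, 0.781]
per_question = [
    ["A", "A", "B"],  # the two stronger models agree -> "A"
    ["C", "B", "B"],  # combined weight of "B" (0.807 + 0.781) beats "C"
]
final = [weighted_vote(q, weights) for q in per_question]
print(final)  # ['A', 'B']
```

A learned meta-model such as XGBoost generalizes this: instead of fixed weights, it is trained on (model answers, correct answer) pairs and can exploit patterns such as one model being reliable only for certain question types.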
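The evaluation metrics reported for XGBoost (sensitivity, specificity, F1-score, positive and negative predictive value) all derive from a binary confusion matrix. The sketch below shows those standard definitions; the counts are invented for illustration and are not the study's data.

```python
# Standard binary classification metrics from confusion-matrix counts.
def binary_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)  # recall / true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "f1": f1}

# Toy counts for illustration.
m = binary_metrics(tp=90, fp=10, tn=95, fn=5)
print(round(m["sensitivity"], 3), round(m["f1"], 3))  # 0.947 0.923
```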
format Article
id doaj-art-7ee1e61297c14468a260346d02590467
institution Kabale University
issn 2291-9694
language English
publishDate 2025-01-01
publisher JMIR Publications
record_format Article
series JMIR Medical Informatics
spelling doaj-art-7ee1e61297c14468a260346d025904672025-01-10T21:30:49ZengJMIR PublicationsJMIR Medical Informatics2291-96942025-01-0113e6373110.2196/63731Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative StudyShiben Zhuhttps://orcid.org/0000-0002-0846-0453Wanqin Huhttps://orcid.org/0000-0003-2548-5801Zhi Yanghttps://orcid.org/0009-0007-7354-5879Jiani Yanhttps://orcid.org/0009-0007-5373-0305Fang Zhanghttps://orcid.org/0009-0005-8263-0755https://medinform.jmir.org/2025/1/e63731
spellingShingle Shiben Zhu
Wanqin Hu
Zhi Yang
Jiani Yan
Fang Zhang
Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
JMIR Medical Informatics
title Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_full Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_fullStr Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_full_unstemmed Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_short Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_sort qwen 2 5 outperforms other large language models in the chinese national nursing licensing examination retrospective cross sectional comparative study
url https://medinform.jmir.org/2025/1/e63731
work_keys_str_mv AT shibenzhu qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT wanqinhu qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT zhiyang qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT jianiyan qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT fangzhang qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy