Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study

Bibliographic Details
Main Authors: Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, Fang Zhang
Format: Article
Language:English
Published: JMIR Publications 2025-01-01
Series:JMIR Medical Informatics
Online Access:https://medinform.jmir.org/2025/1/e63731
_version_ 1841546284699222016
author Shiben Zhu
Wanqin Hu
Zhi Yang
Jiani Yan
Fang Zhang
author_facet Shiben Zhu
Wanqin Hu
Zhi Yang
Jiani Yan
Fang Zhang
author_sort Shiben Zhu
collection DOAJ
description Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs because it requires both deep domain-specific nursing knowledge and complex clinical decision-making, which differentiates it from more general medical examinations. However, the potential application of LLMs to the CNNLE remains unexplored. Objective: This study aims to evaluate the accuracy of 7 LLMs, namely GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE administered between 2019 and 2023. Seven LLMs were evaluated on these questions, and 9 machine learning models (Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost) were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy at 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well on brief clinical case summaries and on questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using the 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE, and it shows that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.
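The ensemble idea in the Methods can be illustrated with a minimal sketch: each model's chosen option for a question is a feature, and a combiner picks the final answer. The study's meta-learners (XGBoost and others) are replaced here by a plain accuracy-weighted vote so the example needs no third-party libraries; the model count, weights, and answers below are invented for illustration, not taken from the study.

```python
# Sketch of combining several LLMs' multiple-choice answers.
# XGBoost is stood in for by an accuracy-weighted vote; all data are toy data.
from collections import defaultdict

def weighted_vote(answers, weights):
    """Combine one question's answers from several models.

    answers: list of option letters, one per model.
    weights: per-model weights (e.g. each model's standalone accuracy).
    Returns the option with the highest total weight.
    """
    scores = defaultdict(float)
    for option, w in zip(answers, weights):
        scores[option] += w
    return max(scores, key=scores.get)

# Three hypothetical models, weighted by made-up standalone accuracies.
weights = [0.889, 0.807, 0.781]
per_question = [
    ["A", "A", "B"],  # the two stronger models agree -> "A"
    ["C", "B", "B"],  # combined weight of "B" (0.807 + 0.781) beats "C"
]
final = [weighted_vote(q, weights) for q in per_question]
print(final)  # ['A', 'B']
```

A learned meta-model such as XGBoost generalizes this: instead of fixed weights, it is trained on (model answers, correct answer) pairs and can exploit patterns such as one model being reliable only for certain question types.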
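The evaluation metrics reported for XGBoost (sensitivity, specificity, F1-score, positive and negative predictive value) all derive from a binary confusion matrix. The sketch below shows those standard definitions; the counts are invented for illustration and are not the study's data.

```python
# Standard binary classification metrics from confusion-matrix counts.
def binary_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)  # recall / true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "f1": f1}

# Toy counts for illustration.
m = binary_metrics(tp=90, fp=10, tn=95, fn=5)
print(round(m["sensitivity"], 3), round(m["f1"], 3))  # 0.947 0.923
```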
format Article
id doaj-art-7ee1e61297c14468a260346d02590467
institution Kabale University
issn 2291-9694
language English
publishDate 2025-01-01
publisher JMIR Publications
record_format Article
series JMIR Medical Informatics
spelling doaj-art-7ee1e61297c14468a260346d025904672025-01-10T21:30:49ZengJMIR PublicationsJMIR Medical Informatics2291-96942025-01-0113e6373110.2196/63731Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative StudyShiben Zhuhttps://orcid.org/0000-0002-0846-0453Wanqin Huhttps://orcid.org/0000-0003-2548-5801Zhi Yanghttps://orcid.org/0009-0007-7354-5879Jiani Yanhttps://orcid.org/0009-0007-5373-0305Fang Zhanghttps://orcid.org/0009-0005-8263-0755https://medinform.jmir.org/2025/1/e63731
spellingShingle Shiben Zhu
Wanqin Hu
Zhi Yang
Jiani Yan
Fang Zhang
Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
JMIR Medical Informatics
title Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_full Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_fullStr Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_full_unstemmed Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_short Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study
title_sort qwen 2 5 outperforms other large language models in the chinese national nursing licensing examination retrospective cross sectional comparative study
url https://medinform.jmir.org/2025/1/e63731
work_keys_str_mv AT shibenzhu qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT wanqinhu qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT zhiyang qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT jianiyan qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy
AT fangzhang qwen25outperformsotherlargelanguagemodelsinthechinesenationalnursinglicensingexaminationretrospectivecrosssectionalcomparativestudy