Large language model comparisons between English and Chinese query performance for cardiovascular prevention

Abstract Background Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of information provided by current general LLMs remain unclear. Methods We evaluated the capabilities of BARD (Google’s bidirectional language model for semantic understanding), ChatGPT-3.5, ChatGPT-4.0 (OpenAI’s conversational models for generating human-like text), and ERNIE (Baidu’s knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese.

Bibliographic Details
Main Authors: Hongwei Ji, Xiaofei Wang, Ching-Hui Sia, Jonathan Yap, Soo Teik Lim, Andie Hartanto Djohan, Yaowei Chang, Ning Zhang, Mengqi Guo, Fuhai Li, Zhi Wei Lim, Ya Xing Wang, Bin Sheng, Tien Yin Wong, Susan Cheng, Khung Keong Yeo, Yih-Chung Tham
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: Communications Medicine
Online Access:https://doi.org/10.1038/s43856-025-00802-0
_version_ 1850154877040197632
author Hongwei Ji
Xiaofei Wang
Ching-Hui Sia
Jonathan Yap
Soo Teik Lim
Andie Hartanto Djohan
Yaowei Chang
Ning Zhang
Mengqi Guo
Fuhai Li
Zhi Wei Lim
Ya Xing Wang
Bin Sheng
Tien Yin Wong
Susan Cheng
Khung Keong Yeo
Yih-Chung Tham
author_facet Hongwei Ji
Xiaofei Wang
Ching-Hui Sia
Jonathan Yap
Soo Teik Lim
Andie Hartanto Djohan
Yaowei Chang
Ning Zhang
Mengqi Guo
Fuhai Li
Zhi Wei Lim
Ya Xing Wang
Bin Sheng
Tien Yin Wong
Susan Cheng
Khung Keong Yeo
Yih-Chung Tham
author_sort Hongwei Ji
collection DOAJ
description Abstract Background Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of information provided by current general LLMs remain unclear. Methods We evaluated the capabilities of BARD (Google’s bidirectional language model for semantic understanding), ChatGPT-3.5, ChatGPT-4.0 (OpenAI’s conversational models for generating human-like text), and ERNIE (Baidu’s knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. A total of 75 CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses (rated as appropriate, borderline, or inappropriate). Results For English prompts, the chatbots’ “appropriate” ratings are as follows: BARD at 88.0%, ChatGPT-3.5 at 92.0%, and ChatGPT-4.0 at 97.3%. All models demonstrate temporal improvement in initially suboptimal responses, with BARD and ChatGPT-3.5 each improving by 67% (6/9 and 4/6, respectively), and ChatGPT-4.0 achieving a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperform ChatGPT-3.5 in recognizing the correctness of their own responses. For Chinese prompts, the “appropriate” ratings are: ERNIE at 84.0%, ChatGPT-3.5 at 88.0%, and ChatGPT-4.0 at 85.3%. However, ERNIE outperforms ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and self-awareness of correctness. Conclusions For CVD prevention queries in English, ChatGPT-4.0 outperforms the other LLMs in generating appropriate responses, temporal improvement, and self-awareness. The LLMs’ performance drops slightly for Chinese queries, reflecting potential language bias in these LLMs. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to thoroughly assess the quality and limitations of the medical information they provide across widely spoken languages.
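The “appropriate” ratings in the abstract are simple proportions out of the 75 questions posed to each model. As a minimal sketch of that arithmetic (the per-model counts below are back-calculated from the reported percentages, e.g. 97.3% of 75 ≈ 73; they are illustrative assumptions, not figures taken from the study’s data tables):

```python
# Sketch: recompute the "appropriate" rating percentages reported in the
# abstract from raw counts out of 75 questions. Counts are back-calculated
# from the reported percentages (hypothetical, not from the paper's data).

TOTAL_QUESTIONS = 75

appropriate_counts = {
    "BARD (English)": 66,         # 66/75 = 88.0%
    "ChatGPT-3.5 (English)": 69,  # 69/75 = 92.0%
    "ChatGPT-4.0 (English)": 73,  # 73/75 ~= 97.3%
    "ERNIE (Chinese)": 63,        # 63/75 = 84.0%
    "ChatGPT-3.5 (Chinese)": 66,  # 66/75 = 88.0%
    "ChatGPT-4.0 (Chinese)": 64,  # 64/75 ~= 85.3%
}

def appropriate_rate(count: int, total: int = TOTAL_QUESTIONS) -> float:
    """Percentage of responses rated 'appropriate', rounded to 1 decimal."""
    return round(100 * count / total, 1)

for model, count in appropriate_counts.items():
    print(f"{model}: {appropriate_rate(count)}%")
```

The same proportion applies to the improvement rates (6/9 and 4/6 both round to 67%; 2/2 is 100%).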
format Article
id doaj-art-d0d50e807b1c4d7fb403efae40d90449
institution OA Journals
issn 2730-664X
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series Communications Medicine
spelling doaj-art-d0d50e807b1c4d7fb403efae40d904492025-08-20T02:25:08ZengNature PortfolioCommunications Medicine2730-664X2025-05-01511810.1038/s43856-025-00802-0Large language model comparisons between English and Chinese query performance for cardiovascular preventionHongwei Ji0Xiaofei Wang1Ching-Hui Sia2Jonathan Yap3Soo Teik Lim4Andie Hartanto Djohan5Yaowei Chang6Ning Zhang7Mengqi Guo8Fuhai Li9Zhi Wei Lim10Ya Xing Wang11Bin Sheng12Tien Yin Wong13Susan Cheng14Khung Keong Yeo15Yih-Chung Tham16Beijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua UniversityKey Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, School of Biological Science and Medical Engineering, Beihang UniversityDepartment of Medicine, Yong Loo Lin School of Medicine, National University of SingaporeDepartment of Cardiology, National Heart Centre SingaporeDepartment of Cardiology, National Heart Centre SingaporeDepartment of Medicine, Yong Loo Lin School of Medicine, National University of SingaporeDivision of Cardiology, Department of Medicine and Clinical Science, Yamaguchi University Graduate School of MedicineDepartment of Cardiology, The Affiliated Hospital of Qingdao UniversityDepartment of Cardiology, The Affiliated Hospital of Qingdao UniversityDepartment of Cardiology, The Affiliated Hospital of Qingdao UniversityDean’s Office, Yong Loo Lin School of Medicine, National University of SingaporeBeijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua UniversityDepartment of Computer Science and Engineering, Shanghai Jiao Tong UniversityBeijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua UniversityDepartment of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical CenterDepartment of Cardiology, National Heart Centre SingaporeDepartment of Ophthalmology, Yong Loo Lin School of Medicine, National University of SingaporeAbstract Background Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of information provided by current general LLMs remain unclear. Methods We evaluated the capabilities of BARD (Google’s bidirectional language model for semantic understanding), ChatGPT-3.5, ChatGPT-4.0 (OpenAI’s conversational models for generating human-like text), and ERNIE (Baidu’s knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. A total of 75 CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses (rated as appropriate, borderline, or inappropriate). Results For English prompts, the chatbots’ “appropriate” ratings are as follows: BARD at 88.0%, ChatGPT-3.5 at 92.0%, and ChatGPT-4.0 at 97.3%. All models demonstrate temporal improvement in initially suboptimal responses, with BARD and ChatGPT-3.5 each improving by 67% (6/9 and 4/6, respectively), and ChatGPT-4.0 achieving a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperform ChatGPT-3.5 in recognizing the correctness of their own responses. For Chinese prompts, the “appropriate” ratings are: ERNIE at 84.0%, ChatGPT-3.5 at 88.0%, and ChatGPT-4.0 at 85.3%. However, ERNIE outperforms ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and self-awareness of correctness. Conclusions For CVD prevention queries in English, ChatGPT-4.0 outperforms the other LLMs in generating appropriate responses, temporal improvement, and self-awareness. The LLMs’ performance drops slightly for Chinese queries, reflecting potential language bias in these LLMs. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to thoroughly assess the quality and limitations of the medical information they provide across widely spoken languages.https://doi.org/10.1038/s43856-025-00802-0
spellingShingle Hongwei Ji
Xiaofei Wang
Ching-Hui Sia
Jonathan Yap
Soo Teik Lim
Andie Hartanto Djohan
Yaowei Chang
Ning Zhang
Mengqi Guo
Fuhai Li
Zhi Wei Lim
Ya Xing Wang
Bin Sheng
Tien Yin Wong
Susan Cheng
Khung Keong Yeo
Yih-Chung Tham
Large language model comparisons between English and Chinese query performance for cardiovascular prevention
Communications Medicine
title Large language model comparisons between English and Chinese query performance for cardiovascular prevention
title_full Large language model comparisons between English and Chinese query performance for cardiovascular prevention
title_fullStr Large language model comparisons between English and Chinese query performance for cardiovascular prevention
title_full_unstemmed Large language model comparisons between English and Chinese query performance for cardiovascular prevention
title_short Large language model comparisons between English and Chinese query performance for cardiovascular prevention
title_sort large language model comparisons between english and chinese query performance for cardiovascular prevention
url https://doi.org/10.1038/s43856-025-00802-0
work_keys_str_mv AT hongweiji largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT xiaofeiwang largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT chinghuisia largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT jonathanyap largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT sooteiklim largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT andiehartantodjohan largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT yaoweichang largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT ningzhang largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT mengqiguo largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT fuhaili largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT zhiweilim largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT yaxingwang largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT binsheng largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT tienyinwong largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT susancheng largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT khungkeongyeo largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention
AT yihchungtham largelanguagemodelcomparisonsbetweenenglishandchinesequeryperformanceforcardiovascularprevention