Large language model comparisons between English and Chinese query performance for cardiovascular prevention
Abstract Background Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of information provided by current general LLMs remain unclear. Methods We evaluated the capabilities of BARD (Google’s bidirectional language model for semantic understanding), ChatGPT-3.5, ChatGPT-4.0 (OpenAI’s conversational models for generating human-like text), and ERNIE (Baidu’s knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. Seventy-five CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses (rated as appropriate, borderline, or inappropriate). Results For English prompts, the chatbots’ “appropriate” ratings are as follows: BARD at 88.0%, ChatGPT-3.5 at 92.0%, and ChatGPT-4.0 at 97.3%. All models demonstrate temporal improvement in initially suboptimal responses, with BARD and ChatGPT-3.5 each improving by 67% (6/9 and 4/6, respectively), and ChatGPT-4.0 achieving a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperform ChatGPT-3.5 in recognizing the correctness of their responses. For Chinese prompts, the “appropriate” ratings are: ERNIE at 84.0%, ChatGPT-3.5 at 88.0%, and ChatGPT-4.0 at 85.3%. However, ERNIE outperforms ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and self-awareness of correctness. Conclusions For CVD prevention queries in English, ChatGPT-4.0 outperforms other LLMs in generating appropriate responses, temporal improvement, and self-awareness. The LLMs’ performance drops slightly for Chinese queries, reflecting potential language bias in these LLMs. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to thoroughly assess the quality and limitations of the medical information they provide across widely spoken languages.
| Main Authors: | Hongwei Ji, Xiaofei Wang, Ching-Hui Sia, Jonathan Yap, Soo Teik Lim, Andie Hartanto Djohan, Yaowei Chang, Ning Zhang, Mengqi Guo, Fuhai Li, Zhi Wei Lim, Ya Xing Wang, Bin Sheng, Tien Yin Wong, Susan Cheng, Khung Keong Yeo, Yih-Chung Tham |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-05-01 |
| Series: | Communications Medicine |
| Online Access: | https://doi.org/10.1038/s43856-025-00802-0 |
| _version_ | 1850154877040197632 |
|---|---|
| author | Hongwei Ji; Xiaofei Wang; Ching-Hui Sia; Jonathan Yap; Soo Teik Lim; Andie Hartanto Djohan; Yaowei Chang; Ning Zhang; Mengqi Guo; Fuhai Li; Zhi Wei Lim; Ya Xing Wang; Bin Sheng; Tien Yin Wong; Susan Cheng; Khung Keong Yeo; Yih-Chung Tham |
| author_facet | Hongwei Ji; Xiaofei Wang; Ching-Hui Sia; Jonathan Yap; Soo Teik Lim; Andie Hartanto Djohan; Yaowei Chang; Ning Zhang; Mengqi Guo; Fuhai Li; Zhi Wei Lim; Ya Xing Wang; Bin Sheng; Tien Yin Wong; Susan Cheng; Khung Keong Yeo; Yih-Chung Tham |
| author_sort | Hongwei Ji |
| collection | DOAJ |
| description | Abstract Background Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of information provided by current general LLMs remain unclear. Methods We evaluated the capabilities of BARD (Google’s bidirectional language model for semantic understanding), ChatGPT-3.5, ChatGPT-4.0 (OpenAI’s conversational models for generating human-like text), and ERNIE (Baidu’s knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. Seventy-five CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses (rated as appropriate, borderline, or inappropriate). Results For English prompts, the chatbots’ “appropriate” ratings are as follows: BARD at 88.0%, ChatGPT-3.5 at 92.0%, and ChatGPT-4.0 at 97.3%. All models demonstrate temporal improvement in initially suboptimal responses, with BARD and ChatGPT-3.5 each improving by 67% (6/9 and 4/6, respectively), and ChatGPT-4.0 achieving a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperform ChatGPT-3.5 in recognizing the correctness of their responses. For Chinese prompts, the “appropriate” ratings are: ERNIE at 84.0%, ChatGPT-3.5 at 88.0%, and ChatGPT-4.0 at 85.3%. However, ERNIE outperforms ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and self-awareness of correctness. Conclusions For CVD prevention queries in English, ChatGPT-4.0 outperforms other LLMs in generating appropriate responses, temporal improvement, and self-awareness. The LLMs’ performance drops slightly for Chinese queries, reflecting potential language bias in these LLMs. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to thoroughly assess the quality and limitations of the medical information they provide across widely spoken languages. |
| format | Article |
| id | doaj-art-d0d50e807b1c4d7fb403efae40d90449 |
| institution | OA Journals |
| issn | 2730-664X |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Communications Medicine |
| spelling | Hongwei Ji (Beijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University); Xiaofei Wang (Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing Advanced Innovation Center for Biomedical Engineering, School of Biological Science and Medical Engineering, Beihang University); Ching-Hui Sia (Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore); Jonathan Yap (Department of Cardiology, National Heart Centre Singapore); Soo Teik Lim (Department of Cardiology, National Heart Centre Singapore); Andie Hartanto Djohan (Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore); Yaowei Chang (Division of Cardiology, Department of Medicine and Clinical Science, Yamaguchi University Graduate School of Medicine); Ning Zhang (Department of Cardiology, The Affiliated Hospital of Qingdao University); Mengqi Guo (Department of Cardiology, The Affiliated Hospital of Qingdao University); Fuhai Li (Department of Cardiology, The Affiliated Hospital of Qingdao University); Zhi Wei Lim (Dean’s Office, Yong Loo Lin School of Medicine, National University of Singapore); Ya Xing Wang (Beijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University); Bin Sheng (Department of Computer Science and Engineering, Shanghai Jiao Tong University); Tien Yin Wong (Beijing Visual Science and Translational Eye Research Institute (BERI), Eye Center of Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University); Susan Cheng (Department of Cardiology, Smidt Heart Institute, Cedars-Sinai Medical Center); Khung Keong Yeo (Department of Cardiology, National Heart Centre Singapore); Yih-Chung Tham (Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore). Nature Portfolio, Communications Medicine (ISSN 2730-664X), 2025-05-01. https://doi.org/10.1038/s43856-025-00802-0 |
| spellingShingle | Hongwei Ji; Xiaofei Wang; Ching-Hui Sia; Jonathan Yap; Soo Teik Lim; Andie Hartanto Djohan; Yaowei Chang; Ning Zhang; Mengqi Guo; Fuhai Li; Zhi Wei Lim; Ya Xing Wang; Bin Sheng; Tien Yin Wong; Susan Cheng; Khung Keong Yeo; Yih-Chung Tham. Large language model comparisons between English and Chinese query performance for cardiovascular prevention. Communications Medicine |
| title | Large language model comparisons between English and Chinese query performance for cardiovascular prevention |
| title_full | Large language model comparisons between English and Chinese query performance for cardiovascular prevention |
| title_fullStr | Large language model comparisons between English and Chinese query performance for cardiovascular prevention |
| title_full_unstemmed | Large language model comparisons between English and Chinese query performance for cardiovascular prevention |
| title_short | Large language model comparisons between English and Chinese query performance for cardiovascular prevention |
| title_sort | large language model comparisons between english and chinese query performance for cardiovascular prevention |
| url | https://doi.org/10.1038/s43856-025-00802-0 |