Large language model comparisons between English and Chinese query performance for cardiovascular prevention

Bibliographic Details
Main Authors: Hongwei Ji, Xiaofei Wang, Ching-Hui Sia, Jonathan Yap, Soo Teik Lim, Andie Hartanto Djohan, Yaowei Chang, Ning Zhang, Mengqi Guo, Fuhai Li, Zhi Wei Lim, Ya Xing Wang, Bin Sheng, Tien Yin Wong, Susan Cheng, Khung Keong Yeo, Yih-Chung Tham
Format: Article
Language: English
Published: Nature Portfolio, 2025-05-01
Series: Communications Medicine
Online Access: https://doi.org/10.1038/s43856-025-00802-0
Description
Summary:
Abstract
Background: Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of the information provided by current general-purpose LLMs remain unclear.
Methods: We evaluated the capabilities of BARD (Google’s bidirectional language model for semantic understanding), ChatGPT-3.5, ChatGPT-4.0 (OpenAI’s conversational models for generating human-like text), and ERNIE (Baidu’s knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. Seventy-five CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses, rated as appropriate, borderline, or inappropriate.
Results: For English prompts, the chatbots’ “appropriate” ratings are as follows: BARD at 88.0%, ChatGPT-3.5 at 92.0%, and ChatGPT-4.0 at 97.3%. All models demonstrate temporal improvement in initially suboptimal responses, with BARD and ChatGPT-3.5 each improving by 67% (6/9 and 4/6, respectively), and ChatGPT-4.0 achieving a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperform ChatGPT-3.5 in recognizing the correctness of their own responses. For Chinese prompts, the “appropriate” ratings are: ERNIE at 84.0%, ChatGPT-3.5 at 88.0%, and ChatGPT-4.0 at 85.3%. However, ERNIE outperforms ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and self-awareness of correctness.
Conclusions: For CVD prevention queries in English, ChatGPT-4.0 outperforms the other LLMs in generating appropriate responses, temporal improvement, and self-awareness. The LLMs’ performance drops slightly for Chinese queries, reflecting potential language bias in these models. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to thoroughly assess the quality and limitations of the medical information they provide across widely spoken languages.
ISSN: 2730-664X
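
The abstract’s rating percentages all follow from the 75-question denominator, and the improvement rates from the fractions given in parentheses. As a worked check, the Python sketch below reproduces the reported figures; the per-model counts of “appropriate” responses are back-calculated assumptions consistent with the percentages, not the authors’ published raw data.

```python
# Sanity-check of the percentages reported in the abstract.
# The "appropriate" counts are assumptions back-calculated from the
# reported percentages (n/75), not the study's raw rating data.

TOTAL_QUESTIONS = 75  # CVD prevention questions posed to each LLM

# (model, prompt language) -> assumed count of "appropriate" responses
appropriate_counts = {
    ("BARD", "English"): 66,         # 66/75 = 88.0%
    ("ChatGPT-3.5", "English"): 69,  # 69/75 = 92.0%
    ("ChatGPT-4.0", "English"): 73,  # 73/75 = 97.3%
    ("ERNIE", "Chinese"): 63,        # 63/75 = 84.0%
    ("ChatGPT-3.5", "Chinese"): 66,  # 66/75 = 88.0%
    ("ChatGPT-4.0", "Chinese"): 64,  # 64/75 = 85.3%
}

# Temporal improvement: (initially suboptimal responses that improved,
# total initially suboptimal responses), as given in the abstract.
improvements = {
    "BARD": (6, 9),          # 6/9 ≈ 67%
    "ChatGPT-3.5": (4, 6),   # 4/6 ≈ 67%
    "ChatGPT-4.0": (2, 2),   # 2/2 = 100%
}

def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, matching the abstract."""
    return round(100 * numerator / denominator, 1)

for (model, lang), n in appropriate_counts.items():
    print(f"{model} ({lang}): {pct(n, TOTAL_QUESTIONS)}% appropriate")

for model, (improved, suboptimal) in improvements.items():
    print(f"{model}: improved {improved}/{suboptimal} = {pct(improved, suboptimal)}%")
```

Running the sketch prints, for example, “ChatGPT-4.0 (English): 97.3% appropriate”, confirming that the reported figures are consistent with whole-number counts out of 75 questions.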