Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study

Abstract Background: Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows. Objective: This study aimed to evaluate the diagnostic...

Bibliographic Details
Main Authors: Wei Zhong, YiFan Liu, Yan Liu, Kai Yang, HuiMin Gao, HuiHui Yan, WenJing Hao, YouSheng Yan, ChengHong Yin
Format: Article
Language: English
Published: JMIR Publications 2025-06-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e69929
_version_ 1850157617525030912
author Wei Zhong
YiFan Liu
Yan Liu
Kai Yang
HuiMin Gao
HuiHui Yan
WenJing Hao
YouSheng Yan
ChengHong Yin
author_sort Wei Zhong
collection DOAJ
description Abstract Background: Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows. Objective: This study aimed to evaluate the diagnostic accuracy of ChatGPT-4o and 4 open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, and Llama3.1:70b) for rare diseases, assess the language effect on diagnostic performance, and explore retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning. Methods: We extracted clinical manifestations of 121 rare diseases from China’s inaugural rare disease catalog. ChatGPT-4o generated a primary and 5 differential diagnoses, while 4 LLMs were assessed in both English and Chinese contexts. The lowest-performing model underwent RAG and CoT re-evaluation. Diagnostic accuracy was compared via the McNemar test. A survey evaluated 11 clinicians’ familiarity with rare diseases. Results: ChatGPT-4o demonstrated the highest diagnostic accuracy (90.1%). Language effects varied across models: qwen2.5:7b showed comparable performance in Chinese (51.2%) and English (47.9%) [...] Conclusions: ChatGPT-4o demonstrated superior diagnostic performance for rare diseases. While Llama3.1:8b demonstrates viability for localized deployment in resource-constrained English diagnostic workflows, Chinese applications require larger models to achieve comparable diagnostic accuracy. This urgency is heightened by the release of open-source models like DeepSeek-R1, which may see rapid adoption without thorough validation. Successful clinical implementation of LLMs requires 3 core elements: model parameterization, user language, and pretraining data. The integration of RAG significantly enhanced open-source LLM accuracy for rare disease diagnosis, although caution remains warranted for low-parameter reasoning models showing substantial performance limitations. We recommend that hospital IT departments and policymakers prioritize language relevance in model selection and consider integrating RAG with curated knowledge bases to enhance diagnostic utility in constrained settings, while exercising caution with low-parameter models.
format Article
id doaj-art-3eaedd8220b4413a9d297529e2dbd64e
institution OA Journals
issn 1438-8871
language English
publishDate 2025-06-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-3eaedd8220b4413a9d297529e2dbd64e; 2025-08-20T02:24:07Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-06-01; 27; e69929; e69929; 10.2196/69929; Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study; Wei Zhong http://orcid.org/0000-0001-9823-9500; YiFan Liu http://orcid.org/0009-0008-7339-4756; Yan Liu http://orcid.org/0000-0003-1698-5783; Kai Yang http://orcid.org/0000-0002-7457-3106; HuiMin Gao http://orcid.org/0009-0004-8874-6022; HuiHui Yan http://orcid.org/0009-0008-2979-9895; WenJing Hao http://orcid.org/0009-0006-8537-0036; YouSheng Yan http://orcid.org/0000-0002-0405-1302; ChengHong Yin http://orcid.org/0000-0002-2503-3285; https://www.jmir.org/2025/1/e69929
title Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study
url https://www.jmir.org/2025/1/e69929