Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study
Abstract. Background: Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows. Objective: This study aimed to evaluate the diagnostic...
| Main Authors: | Wei Zhong, YiFan Liu, Yan Liu, Kai Yang, HuiMin Gao, HuiHui Yan, WenJing Hao, YouSheng Yan, ChengHong Yin |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-06-01 |
| Series: | Journal of Medical Internet Research |
| Online Access: | https://www.jmir.org/2025/1/e69929 |
| _version_ | 1850157617525030912 |
|---|---|
| author | Wei Zhong, YiFan Liu, Yan Liu, Kai Yang, HuiMin Gao, HuiHui Yan, WenJing Hao, YouSheng Yan, ChengHong Yin |
| author_sort | Wei Zhong |
| collection | DOAJ |
| description |
Abstract
Background: Diagnosing rare diseases remains challenging due to their inherent complexity and limited physician knowledge. Large language models (LLMs) offer new potential to enhance diagnostic workflows.
Objective: This study aimed to evaluate the diagnostic accuracy of ChatGPT-4o and 4 open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, and Llama3.1:70b) for rare diseases, assess the effect of language on diagnostic performance, and explore retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning.
Methods: We extracted clinical manifestations of 121 rare diseases from China’s inaugural rare disease catalog. ChatGPT-4o generated a primary and 5 differential diagnoses, while the 4 open-source LLMs were assessed in both English and Chinese contexts. The lowest-performing model underwent RAG and CoT re-evaluation. Diagnostic accuracy was compared via the McNemar test. A survey evaluated 11 clinicians’ familiarity with rare diseases.
Results: ChatGPT-4o demonstrated the highest diagnostic accuracy (90.1%). Language effects varied across models: qwen2.5:7b showed comparable performance in Chinese (51.2%) and English (47.9%) […]
Conclusions: ChatGPT-4o demonstrated superior diagnostic performance for rare diseases. While Llama3.1:8b demonstrates viability for localized deployment in resource-constrained English diagnostic workflows, Chinese applications require larger models to achieve comparable diagnostic accuracy. This urgency is heightened by the release of open-source models like DeepSeek-R1, which may see rapid adoption without thorough validation. Successful clinical implementation of LLMs requires 3 core elements: model parameterization, user language, and pretraining data. The integration of RAG significantly enhanced open-source LLM accuracy for rare disease diagnosis, although caution remains warranted for low-parameter reasoning models showing substantial performance limitations. We recommend that hospital IT departments and policymakers prioritize language relevance in model selection and consider integrating RAG with curated knowledge bases to enhance diagnostic utility in constrained settings, while exercising caution with low-parameter models. |
| format | Article |
| id | doaj-art-3eaedd8220b4413a9d297529e2dbd64e |
| institution | OA Journals |
| issn | 1438-8871 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | JMIR Publications |
| record_format | Article |
| series | Journal of Medical Internet Research |
| spelling | doaj-art-3eaedd8220b4413a9d297529e2dbd64e; 2025-08-20T02:24:07Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-06-01; 27; e69929; e69929; 10.2196/69929; Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study; Wei Zhong (http://orcid.org/0000-0001-9823-9500); YiFan Liu (http://orcid.org/0009-0008-7339-4756); Yan Liu (http://orcid.org/0000-0003-1698-5783); Kai Yang (http://orcid.org/0000-0002-7457-3106); HuiMin Gao (http://orcid.org/0009-0004-8874-6022); HuiHui Yan (http://orcid.org/0009-0008-2979-9895); WenJing Hao (http://orcid.org/0009-0006-8537-0036); YouSheng Yan (http://orcid.org/0000-0002-0405-1302); ChengHong Yin (http://orcid.org/0000-0002-2503-3285); https://www.jmir.org/2025/1/e69929 |
| title | Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study |
| url | https://www.jmir.org/2025/1/e69929 |
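
The Methods in the abstract state that diagnostic accuracy was compared via the McNemar test, which applies to paired binary outcomes (here, whether each model diagnosed the same disease correctly). A minimal sketch of that comparison follows, assuming the exact binomial variant of the test; the record does not say which variant the authors used, and the function name `mcnemar_exact` and the toy data are hypothetical, not taken from the study.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    correct_a[i] and correct_b[i] say whether model A and model B
    diagnosed case i correctly. Returns (b, c, p_value), where b and c
    count the discordant pairs (only A correct, only B correct).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same cases")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under H0 each discordant pair favors A or B with probability 1/2,
    # so min(b, c) follows Binomial(n, 0.5); double the lower tail.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up toy data (NOT the study's results): 10 cases, two models.
model_a = [True] * 6 + [False] * 1 + [True] * 2 + [False] * 1
model_b = [False] * 6 + [True] * 1 + [True] * 2 + [False] * 1
b, c, p = mcnemar_exact(model_a, model_b)
print(b, c, p)  # 6 1 0.125
```

With the study's 121 diseases, each list would hold 121 entries per model and language condition. For larger discordant counts, a chi-square approximation with 1 degree of freedom is a common alternative to the exact form.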