Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro

Abstract Large language models (LLMs) have established a presence in providing medical services to patients and supporting clinical practice for doctors. To explore the ability of LLMs in answering clinical questions related to autoimmune diseases, this study was designed with 65 questions related t...

Bibliographic Details
Main Authors: Juntao Ma, Jie Yu, Anran Xie, Taihong Huang, Wenjing Liu, Mengyin Ma, Yue Tao, Fuyu Zang, Qisi Zheng, Wenbo Zhu, Yuxin Chen, Mingzhe Ning, Yijia Zhu
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: Scientific Reports
Subjects: Large Language models, Autoimmune diseases, Performance evaluation
Online Access: https://doi.org/10.1038/s41598-025-02601-y
_version_ 1850124752509730816
author Juntao Ma
Jie Yu
Anran Xie
Taihong Huang
Wenjing Liu
Mengyin Ma
Yue Tao
Fuyu Zang
Qisi Zheng
Wenbo Zhu
Yuxin Chen
Mingzhe Ning
Yijia Zhu
author_sort Juntao Ma
collection DOAJ
description Abstract Large language models (LLMs) have established a presence in providing medical services to patients and supporting clinical practice for doctors. To explore the ability of LLMs in answering clinical questions related to autoimmune diseases, this study was designed with 65 questions related to autoimmune diseases, covering five domains: concepts, report interpretation, diagnosis, prevention and treatment, and prognosis. Types of diseases include Sjögren’s syndrome, systemic lupus erythematosus, rheumatoid arthritis, systemic sclerosis, and others. These questions were answered by three LLMs: ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The responses were then evaluated by 8 clinicians based on criteria including relevance, completeness, accuracy, safety, readability, and simplicity. We analyzed the scores of the three LLMs across five domains and six dimensions and compared their accuracy in answering the report interpretation section with that of two senior doctors and two junior doctors. The results showed that the performance of the three LLMs in the evaluation of autoimmune diseases significantly surpassed that of both junior and senior doctors. Notably, Claude 3.5 Sonnet excelled in providing comprehensive and accurate responses to clinical questions on autoimmune diseases, demonstrating the great potential of LLMs in assisting doctors with the diagnosis, treatment, and management of autoimmune diseases.
format Article
id doaj-art-61d414e63b3c41b58a2e9cef6b07ac83
institution OA Journals
issn 2045-2322
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-61d414e63b3c41b58a2e9cef6b07ac83
2025-08-20T02:34:14Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-05-01
15119
10.1038/s41598-025-02601-y
Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro
Juntao Ma, Jie Yu, Anran Xie, Taihong Huang, Wenjing Liu, Mengyin Ma, Yue Tao, Fuyu Zang, Qisi Zheng, Wenbo Zhu, Yuxin Chen, Mingzhe Ning, Yijia Zhu
Department of Laboratory Medicine, Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine (affiliation repeated for each author)
https://doi.org/10.1038/s41598-025-02601-y
Large Language models
Autoimmune diseases
Performance evaluation
title Large language model evaluation in autoimmune disease clinical questions comparing ChatGPT 4o, Claude 3.5 Sonnet and Gemini 1.5 pro
title_sort large language model evaluation in autoimmune disease clinical questions comparing chatgpt 4o claude 3 5 sonnet and gemini 1 5 pro
topic Large Language models
Autoimmune diseases
Performance evaluation
url https://doi.org/10.1038/s41598-025-02601-y