Comparing ChatGPT-3.5 and ChatGPT-4’s alignments with the German evidence-based S3 guideline for adult soft tissue sarcoma

Summary: Clinical reliability assessment of large language models is necessary due to their increasing use in healthcare. This study assessed the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions derived from the German evidence-based S3 guideline for adult soft tissue sarcoma (STS)....

Full description

Bibliographic Details
Main Authors: Cheng-Peng Li, Jens Jakob, Franka Menge, Christoph Reißfelder, Peter Hohenberger, Cui Yang
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: iScience
Subjects: Oncology; Artificial intelligence
Online Access: http://www.sciencedirect.com/science/article/pii/S2589004224027202
author Cheng-Peng Li
Jens Jakob
Franka Menge
Christoph Reißfelder
Peter Hohenberger
Cui Yang
collection DOAJ
description Summary: Clinical reliability assessment of large language models is necessary due to their increasing use in healthcare. This study assessed the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions derived from the German evidence-based S3 guideline for adult soft tissue sarcoma (STS). Responses to 80 complex clinical questions covering diagnosis, treatment, and surveillance were independently scored by two sarcoma experts for accuracy and adequacy. ChatGPT-4 outperformed ChatGPT-3.5 overall, with higher median scores in both accuracy (5.5 vs. 5.0) and adequacy (5.0 vs. 4.0). While both versions performed similarly on questions about retroperitoneal/visceral sarcoma and gastrointestinal stromal tumor (GIST)-specific treatment, as well as on questions about surveillance, ChatGPT-4 performed better on questions about general STS treatment and extremity/trunk sarcomas. Despite their potential as supportive tools, both models occasionally offered misleading and potentially life-threatening information. This underscores the importance of cautious adoption and human oversight in clinical settings.
format Article
id doaj-art-051e2cb123514c1697b9b3341b4c5fb2
institution OA Journals
issn 2589-0042
language English
publishDate 2024-12-01
publisher Elsevier
record_format Article
series iScience
spelling doaj-art-051e2cb123514c1697b9b3341b4c5fb2 (indexed 2025-08-20T02:35:01Z)
English. Elsevier, iScience, ISSN 2589-0042, vol. 27, no. 12, article 111493, 2024-12-01. DOI: 10.1016/j.isci.2024.111493
Comparing ChatGPT-3.5 and ChatGPT-4’s alignments with the German evidence-based S3 guideline for adult soft tissue sarcoma
Author affiliations:
Cheng-Peng Li: Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Sarcoma Center, Peking University Cancer Hospital & Institute, Beijing, China; Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Jens Jakob: Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Franka Menge: Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Christoph Reißfelder: Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany; DKFZ-Hector Cancer Institute, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
Peter Hohenberger: Division of Surgical Oncology and Thoracic Surgery, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Cui Yang (corresponding author): Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany; AI Health Innovation Cluster, German Cancer Research Center (DKFZ), Heidelberg, Germany
Online access: http://www.sciencedirect.com/science/article/pii/S2589004224027202
Subjects: Oncology; Artificial intelligence
title Comparing ChatGPT-3.5 and ChatGPT-4’s alignments with the German evidence-based S3 guideline for adult soft tissue sarcoma
topic Oncology
Artificial intelligence
url http://www.sciencedirect.com/science/article/pii/S2589004224027202