ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study

Background Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text asse...

Bibliographic Details
Main Authors:	Artin Entezarjou, Carl Wikberg, Ronny Gunnarsson, David Sundemo, Rasmus Arvidsson
Format:	Article
Language:	English
Published:	BMJ Publishing Group 2024-12-01
Series:	BMJ Open
Online Access:	https://bmjopen.bmj.com/content/14/12/e086148.full

_version_	1850057911960600576
author	Artin Entezarjou Carl Wikberg Ronny Gunnarsson David Sundemo Rasmus Arvidsson
author_sort	Artin Entezarjou
collection	DOAJ
description	Background Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care. Objectives To compare the performance of ChatGPT, version GPT-4, with that of real doctors. Design and setting A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared. Participants Anonymous responses from the Swedish family medicine specialist examination 2017–2022 were used. Outcome measures Primary: the mean difference in scores between GPT-4’s responses and randomly selected responses by human doctors, as well as between GPT-4’s responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories. Results The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044). Conclusion In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.
format	Article
id	doaj-art-e02e6d7b14394ad9aa89ab83d072cc3e
institution	DOAJ
issn	2044-6055
language	English
publishDate	2024-12-01
publisher	BMJ Publishing Group
record_format	Article
series	BMJ Open
spelling	doaj-art-e02e6d7b14394ad9aa89ab83d072cc3e; 2025-08-20T02:51:18Z; eng; BMJ Publishing Group; BMJ Open; ISSN 2044-6055; 2024-12-01; vol. 14; no. 12; doi:10.1136/bmjopen-2024-086148; ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study; Artin Entezarjou (0), Carl Wikberg (1), Ronny Gunnarsson (2), David Sundemo (3), Rasmus Arvidsson (4); affiliation (0 to 4): General Practice / Family Medicine, School of Public Health and Community Medicine, Sahlgrenska Academy, University of Gothenburg Institute of Medicine, Gothenburg, Sweden; https://bmjopen.bmj.com/content/14/12/e086148.full
title	ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study
url	https://bmjopen.bmj.com/content/14/12/e086148.full

Similar Items