Feasibility of GPT-3.5 versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study on Suspected Appendicitis

Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases despite the sparsity of robust, easily accessible, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5...

Bibliographic Details
Main Authors: Sebastian Sanduleanu, Koray Ersahin, Johannes Bremm, Narmin Talibova, Tim Damer, Merve Erdogan, Jonathan Kottlors, Lukas Goertz, Christiane Bruns, David Maintz, Nuran Abdullayev
Format: Article
Language: English
Published: MDPI AG 2024-10-01
Series: AI
Subjects: appendectomy; artificial intelligence; surgical decision-making
Online Access: https://www.mdpi.com/2673-2688/5/4/96
collection DOAJ
description Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases despite the sparsity of robust, easily accessible, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5 model (GPT-3.5) may provide enhanced decision support for surgeons in less certain appendicitis cases or those posing a higher risk for (relative) operative contra-indications. Our objective was to determine whether GPT-3.5, when provided with high-throughput clinical, laboratory, and radiological text-based information, would reach clinical decisions similar to those of a machine learning model and a board-certified surgeon (reference standard) in decision-making for appendectomy versus conservative treatment. Methods: In this cohort study, we randomly selected patients presenting at the emergency department (ED) of two German hospitals (GFO, Troisdorf, and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed using R, version 3.6.2, on RStudio, version 2023.03.0+386. Overall agreement between the GPT-3.5 output and the reference standard was assessed by means of inter-observer kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values with the “caret” and “irr” packages. Statistical significance was defined as p < 0.05. Results: There was agreement between the surgeon’s decision and GPT-3.5 in 102 of 113 cases, and all cases where the surgeon decided upon conservative treatment were correctly classified by GPT-3.5. The estimated model training accuracy was 83.3% (95% CI: 74.0, 90.4), while the validation accuracy for the model was 87.0% (95% CI: 66.4, 97.2). By comparison, the GPT-3.5 accuracy was 90.3% (95% CI: 83.2, 95.0), which was not significantly better than that of the machine learning model (p = 0.21). Conclusions: This study, to our knowledge the first on the “intended use” of GPT-3.5 for surgical treatment, compared surgical decision-making with an algorithm and found a high degree of agreement between board-certified surgeons and GPT-3.5 for surgical decision-making in patients presenting to the emergency department with lower abdominal pain.
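The GPT-3.5 accuracy quoted in the abstract follows directly from the agreement count (102 of 113 cases). The sketch below reproduces that arithmetic and computes an exact (Clopper-Pearson) binomial confidence interval. The abstract does not name the interval method, so the choice of an exact interval is an assumption here, as are the helper names `binom_cdf`, `_bisect`, and `exact_ci`; the study itself used R rather than Python.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Find the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided interval for k successes in n trials.

    Assumption: the abstract does not state which interval was used.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


# Counts taken from the abstract: agreement in 102 of 113 cases.
agree, total = 102, 113
accuracy = agree / total
lower, upper = exact_ci(agree, total)
print(f"accuracy = {accuracy:.1%}, 95% CI = ({lower:.1%}, {upper:.1%})")
```

The point estimate rounds to the 90.3% given in the abstract. The interval can be compared against the reported (83.2, 95.0), keeping in mind that a different interval method would shift the bounds slightly.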
id doaj-art-62340a59cf834f8683a2d8b3500ea14c
issn 2673-2688
doi 10.3390/ai5040096
volume 5
issue 4
pages 1942-1954
affiliation Sebastian Sanduleanu: Department of Emergency Medicine, Vogelsbeek 5, 6001 BE Weert, The Netherlands
Koray Ersahin: Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany
Johannes Bremm: Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
Narmin Talibova: Department of Internal Medicine III, University Hospital, 89081 Ulm, Germany
Tim Damer: Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany
Merve Erdogan: Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany
Jonathan Kottlors: Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
Lukas Goertz: Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
Christiane Bruns: Department of General, Visceral, Tumor and Transplantation Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937 Cologne, Germany
David Maintz: Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
Nuran Abdullayev: Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany
topic appendectomy
artificial intelligence
surgical decision-making