Feasibility of GPT-3.5 versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study on Suspected Appendicitis
Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases despite the sparsity of robust, easy access, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5...
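| Main Authors: | Sebastian Sanduleanu, Koray Ersahin, Johannes Bremm, Narmin Talibova, Tim Damer, Merve Erdogan, Jonathan Kottlors, Lukas Goertz, Christiane Bruns, David Maintz, Nuran Abdullayev |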
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-10-01 |
| Series: | AI |
| Subjects: | appendectomy; artificial intelligence; surgical decision-making |
| Online Access: | https://www.mdpi.com/2673-2688/5/4/96 |
| _version_ | 1850050063384969216 |
|---|---|
| author | Sebastian Sanduleanu, Koray Ersahin, Johannes Bremm, Narmin Talibova, Tim Damer, Merve Erdogan, Jonathan Kottlors, Lukas Goertz, Christiane Bruns, David Maintz, Nuran Abdullayev |
| collection | DOAJ |
| description | Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases, despite the scarcity of robust, easily accessible, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5 model (GPT-3.5) may provide enhanced decision support for surgeons in less certain appendicitis cases or those posing a higher risk for (relative) operative contraindications. Our objective was to determine whether GPT-3.5, when provided with high-throughput clinical, laboratory, and radiological text-based information, reaches clinical decisions similar to those of a machine learning model and a board-certified surgeon (reference standard) when choosing between appendectomy and conservative treatment. Methods: In this cohort study, we randomly selected patients presenting at the emergency department (ED) of two German hospitals (GFO, Troisdorf, and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed using R, version 3.6.2, on RStudio, version 2023.03.0+386. Overall agreement between the GPT-3.5 output and the reference standard was assessed by means of inter-observer kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values with the “caret” and “irr” packages. Statistical significance was defined as *p* < 0.05. Results: The surgeon’s decision and GPT-3.5 agreed in 102 of 113 cases, and all cases in which the surgeon decided upon conservative treatment were correctly classified by GPT-3.5. The estimated training accuracy of the machine learning model was 83.3% (95% CI: 74.0, 90.4), while its validation accuracy was 87.0% (95% CI: 66.4, 97.2). By comparison, GPT-3.5 achieved an accuracy of 90.3% (95% CI: 83.2, 95.0), which was not significantly better than that of the machine learning model (*p* = 0.21). Conclusions: To our knowledge, this is the first study of the “intended use” of GPT-3.5 for surgical treatment decisions. Comparing surgical decision-making against an algorithm, it found a high degree of agreement between board-certified surgeons and GPT-3.5 for surgical decision-making in patients presenting to the emergency department with lower abdominal pain. |
| format | Article |
| id | doaj-art-62340a59cf834f8683a2d8b3500ea14c |
| institution | DOAJ |
| issn | 2673-2688 |
| language | English |
| publishDate | 2024-10-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | AI |
| spelling | Timestamp: 2025-08-20T02:53:34Z. Citation: AI, vol. 5, no. 4, pp. 1942-1954, 2024-10-01. DOI: 10.3390/ai5040096. Affiliations: Sebastian Sanduleanu (Department of Emergency Medicine, Vogelsbeek 5, 6001 BE Weert, The Netherlands); Koray Ersahin (Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany); Johannes Bremm (Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany); Narmin Talibova (Department of Internal Medicine III, University Hospital, 89081 Ulm, Germany); Tim Damer (Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany); Merve Erdogan (Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany); Jonathan Kottlors (Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany); Lukas Goertz (Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany); Christiane Bruns (Department of General, Visceral, Tumor and Transplantation Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937 Cologne, Germany); David Maintz (Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany); Nuran Abdullayev (Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany). |
| title | Feasibility of GPT-3.5 versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study on Suspected Appendicitis |
| topic | appendectomy; artificial intelligence; surgical decision-making |
| url | https://www.mdpi.com/2673-2688/5/4/96 |
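
The GPT-3.5 accuracy quoted in the description (agreement in 102 of 113 cases) can be rechecked with a short calculation. The following is a minimal Python sketch, not the authors' R code: it assumes the reported 95% confidence interval is an exact (Clopper-Pearson) binomial interval, which the record does not state, and the function names `binom_cdf` and `clopper_pearson` are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for x successes in n trials.

    Assumption: the interval method is Clopper-Pearson; the source record
    only reports the interval itself, not how it was computed.
    """

    def solve(f, target):
        # f is monotonically decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid  # p is still too small
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


# Counts taken from the Results: agreement in 102 of 113 cases.
agree, total = 102, 113
accuracy = agree / total
lower, upper = clopper_pearson(agree, total)
print(f"accuracy = {accuracy:.1%}, 95% CI = ({lower:.1%}, {upper:.1%})")
```

The printed values can be compared with the reported 90.3% (95% CI: 83.2, 95.0). The kappa value is not recomputed here because the record does not give the full confusion matrix.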