ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings

Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and o1-preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presen...

Bibliographic Details
Main Authors: Pooya Hosseini-Monfared, Shayan Amiri, Alireza Mirahmadi, Amirhossein Shahbazi, Aliasghar Alamian, Mohammad Azizi, Seyed Morteza Kazemi
Format: Article
Language: English
Published: Shahid Beheshti University of Medical Sciences 2025-04-01
Series: Archives of Academic Emergency Medicine
Subjects: Large Language Models; ChatGPT; Emergency Medicine; Triage; Artificial intelligence; Ankle
Online Access: https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580
_version_ 1849702084814831616
author Pooya Hosseini-Monfared
Shayan Amiri
Alireza Mirahmadi
Amirhossein Shahbazi
Aliasghar Alamian
Mohammad Azizi
Seyed Morteza Kazemi
author_facet Pooya Hosseini-Monfared
Shayan Amiri
Alireza Mirahmadi
Amirhossein Shahbazi
Aliasghar Alamian
Mohammad Azizi
Seyed Morteza Kazemi
author_sort Pooya Hosseini-Monfared
collection DOAJ
description Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and ChatGPT-o1 preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on their order. Results: In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system, with intraclass correlation coefficients of 0.99 (95% CI, 0.998–0.999) for accuracy scores and 0.99 (95% CI, 0.990–0.995) for clarity scores. Conclusion: Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.
format Article
id doaj-art-0909be6ccc4c4cbfbb97d52fd520aae6
institution DOAJ
issn 2645-4904
language English
publishDate 2025-04-01
publisher Shahid Beheshti University of Medical Sciences
record_format Article
series Archives of Academic Emergency Medicine
spelling doaj-art-0909be6ccc4c4cbfbb97d52fd520aae6
2025-08-20T03:17:46Z
eng
Shahid Beheshti University of Medical Sciences
Archives of Academic Emergency Medicine
2645-4904
2025-04-01
13
1
10.22037/aaemj.v13i1.2580
ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
Pooya Hosseini-Monfared [0]
Shayan Amiri [1]
Alireza Mirahmadi [2]
Amirhossein Shahbazi [3]
Aliasghar Alamian [4]
Mohammad Azizi [5]
Seyed Morteza Kazemi [6]
[0] Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
[1] Bone and Joint Reconstruction Research Center, Department of Orthopedics, School of Medicine, Iran University of Medical Sciences, Tehran, Iran
[2] Musculoskeletal Translational Innovation Initiative, Carl J. Shapiro Department of Orthopaedic Surgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
[3] Student Research Committee, School of Medicine, Ilam University of Medical Sciences, Ilam, Iran
[4] Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
[5] Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
[6] Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and ChatGPT-o1 preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on their order. Results: In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system, with intraclass correlation coefficients of 0.99 (95% CI, 0.998–0.999) for accuracy scores and 0.99 (95% CI, 0.990–0.995) for clarity scores. Conclusion: Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.
https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580
Large Language Models
ChatGPT
Emergency Medicine
Triage
Artificial intelligence
Ankle
spellingShingle Pooya Hosseini-Monfared
Shayan Amiri
Alireza Mirahmadi
Amirhossein Shahbazi
Aliasghar Alamian
Mohammad Azizi
Seyed Morteza Kazemi
ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
Archives of Academic Emergency Medicine
Large Language Models
ChatGPT
Emergency Medicine
Triage
Artificial intelligence
Ankle
title ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
title_full ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
title_fullStr ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
title_full_unstemmed ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
title_short ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
title_sort chatgpt o1 preview outperforms chatgpt 4 as a diagnostic support tool for ankle pain triage in emergency settings
topic Large Language Models
ChatGPT
Emergency Medicine
Triage
Artificial intelligence
Ankle
url https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580
work_keys_str_mv AT pooyahosseinimonfared chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings
AT shayanamiri chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings
AT alirezamirahmadi chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings
AT amirhosseinshahbazi chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings
AT aliasgharalamian chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings
AT mohammadazizi chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings
AT seyedmortezakazemi chatgpto1previewoutperformschatgpt4asadiagnosticsupporttoolforanklepaintriageinemergencysettings