ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings

Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and o1-preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presen...

Bibliographic Details
Main Authors:	Pooya Hosseini-Monfared, Shayan Amiri, Alireza Mirahmadi, Amirhossein Shahbazi, Aliasghar Alamian, Mohammad Azizi, Seyed Morteza Kazemi
Format:	Article
Language:	English
Published:	Shahid Beheshti University of Medical Sciences 2025-04-01
Series:	Archives of Academic Emergency Medicine
Subjects:	Large Language Models ChatGPT Emergency Medicine Triage Artificial intelligence Ankle
Online Access:	https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580

_version_	1849702084814831616
author	Pooya Hosseini-Monfared Shayan Amiri Alireza Mirahmadi Amirhossein Shahbazi Aliasghar Alamian Mohammad Azizi Seyed Morteza Kazemi
author_sort	Pooya Hosseini-Monfared
collection	DOAJ
description	Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and ChatGPT-o1 preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on their order. Results: In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system, with intraclass correlation coefficients of 0.99 (95% CI, 0.998–0.999) for accuracy scores and 0.99 (95% CI, 0.990–0.995) for clarity scores. Conclusion: Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise.
format	Article
id	doaj-art-0909be6ccc4c4cbfbb97d52fd520aae6
institution	DOAJ
issn	2645-4904
language	English
publishDate	2025-04-01
publisher	Shahid Beheshti University of Medical Sciences
record_format	Article
series	Archives of Academic Emergency Medicine
spelling	doaj-art-0909be6ccc4c4cbfbb97d52fd520aae6; 2025-08-20T03:17:46Z; eng; Shahid Beheshti University of Medical Sciences; Archives of Academic Emergency Medicine; 2645-4904; 2025-04-01; vol. 13, no. 1; 10.22037/aaemj.v13i1.2580; ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
	Pooya Hosseini-Monfared (0): Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
	Shayan Amiri (1): Bone and Joint Reconstruction Research Center, Department of Orthopedics, School of Medicine, Iran University of Medical Sciences, Tehran, Iran
	Alireza Mirahmadi (2): Musculoskeletal Translational Innovation Initiative, Carl J. Shapiro Department of Orthopaedic Surgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
	Amirhossein Shahbazi (3): Student Research Committee, School of Medicine, Ilam University of Medical Sciences, Ilam, Iran
	Aliasghar Alamian (4): Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
	Mohammad Azizi (5): Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
	Seyed Morteza Kazemi (6): Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
title	ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
topic	Large Language Models ChatGPT Emergency Medicine Triage Artificial intelligence Ankle
url	https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580


Similar Items