ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings
Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and o1-preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presen...
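| Main Authors: | Pooya Hosseini-Monfared, Shayan Amiri, Alireza Mirahmadi, Amirhossein Shahbazi, Aliasghar Alamian, Mohammad Azizi, Seyed Morteza Kazemi |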
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Shahid Beheshti University of Medical Sciences, 2025-04-01 |
| Series: | Archives of Academic Emergency Medicine |
| Subjects: | Large Language Models; ChatGPT; Emergency Medicine; Triage; Artificial intelligence; Ankle |
| Online Access: | https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580 |
| _version_ | 1849702084814831616 |
|---|---|
| author | Pooya Hosseini-Monfared; Shayan Amiri; Alireza Mirahmadi; Amirhossein Shahbazi; Aliasghar Alamian; Mohammad Azizi; Seyed Morteza Kazemi |
| author_sort | Pooya Hosseini-Monfared |
| collection | DOAJ |
| description | Introduction: ChatGPT, a general-purpose language model, is not specifically optimized for medical applications. This study aimed to assess the performance of ChatGPT-4 and o1-preview in generating differential diagnoses for common cases of ankle pain in emergency settings. Methods: Common presentations of ankle pain were identified through consultations with an experienced orthopedic surgeon and a review of relevant hospital and social media sources. To replicate typical patient inquiries, questions were crafted in simple, non-technical language, requesting three possible differential diagnoses for each scenario. The second phase involved designing case vignettes reflecting scenarios typical for triage nurses or physicians. Responses from ChatGPT were evaluated against a benchmark established by two experienced orthopedic surgeons, with a scoring system assessing the accuracy, clarity, and relevance of the differential diagnoses based on their order. Results: In 21 ankle pain presentations, ChatGPT-o1 preview outperformed ChatGPT-4 in both accuracy and clarity, with only the clarity score reaching statistical significance (p < 0.001). ChatGPT-o1 preview also had a significantly higher total score (p = 0.004). In 15 case vignettes, ChatGPT-o1 preview scored better on diagnostic and management clarity, though differences in diagnostic accuracy were not statistically significant. Among 51 questions, ChatGPT-4 and ChatGPT-o1 preview produced incorrect responses for 5 (9.8%) and 4 (7.8%) questions, respectively. Inter-rater reliability analysis demonstrated excellent reliability of the scoring system, with intraclass correlation coefficients of 0.99 (95% CI, 0.998–0.999) for accuracy scores and 0.99 (95% CI, 0.990–0.995) for clarity scores. Conclusion: Our findings demonstrated that both ChatGPT-4 and ChatGPT-o1 preview provide acceptable performance in the triage of ankle pain cases in emergency settings. ChatGPT-o1 preview outperformed ChatGPT-4, offering clearer and more precise responses. While both models show potential as supportive tools, their role should remain supervised and strictly supplementary to clinical expertise. |
| format | Article |
| id | doaj-art-0909be6ccc4c4cbfbb97d52fd520aae6 |
| institution | DOAJ |
| issn | 2645-4904 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Shahid Beheshti University of Medical Sciences |
| record_format | Article |
| series | Archives of Academic Emergency Medicine |
| spelling | doaj-art-0909be6ccc4c4cbfbb97d52fd520aae6; 2025-08-20T03:17:46Z; eng; Shahid Beheshti University of Medical Sciences; Archives of Academic Emergency Medicine; 2645-4904; 2025-04-01; 13; 1; 10.22037/aaemj.v13i1.2580; ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings; Pooya Hosseini-Monfared (0); Shayan Amiri (1); Alireza Mirahmadi (2); Amirhossein Shahbazi (3); Aliasghar Alamian (4); Mohammad Azizi (5); Seyed Morteza Kazemi (6); Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Bone and Joint Reconstruction Research Center, Department of Orthopedics, School of Medicine, Iran University of Medical Sciences, Tehran, Iran; Musculoskeletal Translational Innovation Initiative, Carl J. Shapiro Department of Orthopaedic Surgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA; Student research committee, School of Medicine, Ilam University of Medical Sciences, Ilam, Iran; Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Bone Joint and Related Tissues Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran; https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580; Large Language Models; ChatGPT; Emergency Medicine; Triage; Artificial intelligence; Ankle |
| title | ChatGPT-o1 Preview Outperforms ChatGPT-4 as a Diagnostic Support Tool for Ankle Pain Triage in Emergency Settings |
| topic | Large Language Models; ChatGPT; Emergency Medicine; Triage; Artificial intelligence; Ankle |
| url | https://journals.sbmu.ac.ir/aaem/index.php/AAEM/article/view/2580 |
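
The incorrect-response rates quoted in the description can be checked with simple arithmetic. The following is a minimal Python sketch and is not part of the original record: the function name `error_rate_percent` and the variable names are illustrative assumptions, and only the counts (5 and 4 incorrect responses out of 51 questions) are taken from the abstract.

```python
def error_rate_percent(incorrect: int, total: int) -> float:
    """Return the share of incorrect responses as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * incorrect / total, 1)


# Counts as reported in the abstract; the names below are hypothetical.
TOTAL_QUESTIONS = 51
incorrect_counts = {"ChatGPT-4": 5, "ChatGPT-o1 preview": 4}

# Compute the percentage of incorrect responses for each model.
rates = {
    model: error_rate_percent(count, TOTAL_QUESTIONS)
    for model, count in incorrect_counts.items()
}

for model, rate in rates.items():
    print(f"{model}: {rate}% incorrect")
```

Under these assumptions the computed values agree with the 9.8% and 7.8% figures reported in the Results.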