Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination
| Main Authors: | Andrew Y Xu, Manjot Singh, Mariah Balmaceno-Criss, Allison Oh, David Leigh, Mohammad Daher, Daniel Alsoof, Christopher L McDonald, Bassel G Diebo, Alan H Daniels |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SAGE Publishing, 2025-02-01 |
| Series: | Journal of Orthopaedic Surgery |
| Online Access: | https://doi.org/10.1177/10225536241268789 |
| _version_ | 1850024180508000256 |
|---|---|
| author | Andrew Y Xu Manjot Singh Mariah Balmaceno-Criss Allison Oh David Leigh Mohammad Daher Daniel Alsoof Christopher L McDonald Bassel G Diebo Alan H Daniels |
| author_facet | Andrew Y Xu Manjot Singh Mariah Balmaceno-Criss Allison Oh David Leigh Mohammad Daher Daniel Alsoof Christopher L McDonald Bassel G Diebo Alan H Daniels |
| author_sort | Andrew Y Xu |
| collection | DOAJ |
| description | Background Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Methods Three LLMs, OpenAI’s GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. Results GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, .498, and .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Among the 11 subject categories, all models performed similarly regardless of the subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to performance on non-image-associated questions (p = .045). Conclusion The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard. |
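The abstract reports pairwise p-values for model-versus-model and model-versus-resident comparisons but does not state which statistical test was used. As a minimal, stdlib-only sketch, a pooled two-proportion z-test is one common way such accuracies could be compared on a shared 189-question exam; the correct-answer counts below are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z_test(k1, k2, n=189):
    """Two-sided pooled two-proportion z-test comparing two models'
    accuracies on the same n-question exam.

    k1, k2: number of questions each model answered correctly.
    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)            # pooled success proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return z, p_value

# Hypothetical counts for illustration only (not from the study):
z, p = two_proportion_z_test(140, 103)
```

With these made-up counts the accuracy gap (about 74% vs 55%) is large enough to be significant at conventional thresholds; the study's actual counts and test choice may differ.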
| format | Article |
| id | doaj-art-0db3bcce8ab9435bb522688c9c3c29b7 |
| institution | DOAJ |
| issn | 2309-4990 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | SAGE Publishing |
| record_format | Article |
| series | Journal of Orthopaedic Surgery |
| spelling | doaj-art-0db3bcce8ab9435bb522688c9c3c29b72025-08-20T03:01:11ZengSAGE PublishingJournal of Orthopaedic Surgery2309-49902025-02-013310.1177/10225536241268789Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examinationAndrew Y XuManjot SinghMariah Balmaceno-CrissAllison OhDavid LeighMohammad DaherDaniel AlsoofChristopher L McDonaldBassel G DieboAlan H DanielsBackground Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Methods Three LLMs, OpenAI’s GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. Results GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, .498, and .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Among the 11 subject categories, all models performed similarly regardless of the subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to performance on non-image-associated questions (p = .045). Conclusion The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.https://doi.org/10.1177/10225536241268789 |
| spellingShingle | Andrew Y Xu Manjot Singh Mariah Balmaceno-Criss Allison Oh David Leigh Mohammad Daher Daniel Alsoof Christopher L McDonald Bassel G Diebo Alan H Daniels Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination Journal of Orthopaedic Surgery |
| title | Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination |
| title_full | Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination |
| title_fullStr | Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination |
| title_full_unstemmed | Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination |
| title_short | Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination |
| title_sort | comparative performance of artificial intelligence based large language models on the orthopedic in training examination |
| url | https://doi.org/10.1177/10225536241268789 |
| work_keys_str_mv | AT andrewyxu comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT manjotsingh comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT mariahbalmacenocriss comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT allisonoh comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT davidleigh comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT mohammaddaher comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT danielalsoof comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT christopherlmcdonald comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT basselgdiebo comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination AT alanhdaniels comparativeperformanceofartificialintelligencebasedlargelanguagemodelsontheorthopedicintrainingexamination |