Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination

Bibliographic Details
Main Authors: Andrew Y Xu, Manjot Singh, Mariah Balmaceno-Criss, Allison Oh, David Leigh, Mohammad Daher, Daniel Alsoof, Christopher L McDonald, Bassel G Diebo, Alan H Daniels
Format: Article
Language:English
Published: SAGE Publishing 2025-02-01
Series:Journal of Orthopaedic Surgery
Online Access:https://doi.org/10.1177/10225536241268789
author Andrew Y Xu
Manjot Singh
Mariah Balmaceno-Criss
Allison Oh
David Leigh
Mohammad Daher
Daniel Alsoof
Christopher L McDonald
Bassel G Diebo
Alan H Daniels
collection DOAJ
description Background Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown. Methods Three LLMs, OpenAI’s GPT-4, GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions. Results GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Among the 11 subject categories, all models performed similarly regardless of the subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to performance on non-image-associated questions (p = .045). Conclusion The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.
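The comparative analyses described above can be illustrated with a minimal sketch. The abstract does not state which statistical test the authors used, so this assumes a two-sided two-proportion z-test on accuracy over the 189 questions; the correct-answer counts below are purely hypothetical placeholders, not the study's data.

```python
import math

def two_proportion_z_test(correct_a: int, correct_b: int, n: int = 189):
    """Two-sided two-proportion z-test comparing the fraction of the
    n questions each model answered correctly (pooled-variance form)."""
    p1, p2 = correct_a / n, correct_b / n
    p_pool = (correct_a + correct_b) / (2 * n)       # pooled proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))  # pooled standard error
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided p from the normal CDF
    return z, p_value

# Hypothetical counts for illustration only (NOT the study's results):
# a large accuracy gap on 189 questions yields a small p-value.
z, p = two_proportion_z_test(correct_a=140, correct_b=100)
```

With these placeholder counts the gap of roughly 21 percentage points gives p < .001, the same order of significance the abstract reports for the GPT-4 versus GPT-3.5 comparison.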
format Article
id doaj-art-0db3bcce8ab9435bb522688c9c3c29b7
institution DOAJ
issn 2309-4990
language English
publishDate 2025-02-01
publisher SAGE Publishing
record_format Article
series Journal of Orthopaedic Surgery
title Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination
url https://doi.org/10.1177/10225536241268789