OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board–Style Questions

Purpose: To evaluate and compare the performance of human test takers and three artificial intelligence (AI) models—OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash—on ophthalmology board–style questions, focusing on overall accuracy and performance stratified by ophthalmic subspecialty and cognitive complexity level.

Design: Cross-sectional study.

Subjects: Five hundred questions sourced from the Basic and Clinical Science Course (BCSC) and EyeQuiz question banks.

Methods: Three large language models interpreted the questions using standardized prompting procedures. Subanalyses stratified the questions by subspecialty and by cognitive complexity as defined by the Buckwalter taxonomic schema. Statistical analyses, including analysis of variance and the McNemar test, assessed performance differences.

Main Outcome Measures: Accuracy of responses for each model and for human test takers, stratified by subspecialty and cognitive complexity.

Results: OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). o1 demonstrated superior performance on both BCSC (228/250, 91.2%) and EyeQuiz (195/250, 78.0%) questions compared with GPT-4o (BCSC: 183/250, 73.2%; EyeQuiz: 148/250, 59.2%) and Gemini (BCSC: 163/250, 65.2%; EyeQuiz: 137/250, 54.8%). On BCSC questions, human performance (64.5%) was lower than that of Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed the other models in each of the nine ophthalmic subfields and at all three cognitive complexity levels.

Conclusions: OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board–style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1’s growing potential as an adjunct in ophthalmic education and care.

Financial Disclosure(s): The author(s) have no proprietary or commercial interest in any materials discussed in this article.
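Because all three models answered the same 500 questions, the per-model accuracies are paired observations, and the Methods name the McNemar test for those head-to-head comparisons. Below is a minimal sketch of that comparison in Python, assuming per-question correctness is recorded as booleans; the data are simulated to roughly match the reported accuracies, and none of the names or values come from the authors' actual code or dataset.

```python
# Hedged sketch: exact McNemar test on paired per-question correctness for
# two models, as described in the Methods. Data are simulated, not the study's.
import random

from statsmodels.stats.contingency_tables import mcnemar


def mcnemar_pvalue(correct_a, correct_b):
    """Exact McNemar test on two aligned lists of per-question booleans."""
    pairs = list(zip(correct_a, correct_b))
    both    = sum(a and b for a, b in pairs)
    a_only  = sum(a and not b for a, b in pairs)       # discordant pairs
    b_only  = sum(b and not a for a, b in pairs)       # discordant pairs
    neither = sum(not a and not b for a, b in pairs)
    table = [[both, a_only], [b_only, neither]]
    return mcnemar(table, exact=True).pvalue           # binomial test on discordant pairs


random.seed(0)
o1    = [random.random() < 0.846 for _ in range(500)]  # ~423/500 correct
gpt4o = [random.random() < 0.662 for _ in range(500)]  # ~331/500 correct

print(f"o1 accuracy:     {sum(o1) / 500:.1%}")
print(f"GPT-4o accuracy: {sum(gpt4o) / 500:.1%}")
print(f"McNemar P value: {mcnemar_pvalue(o1, gpt4o):.2e}")
```

With gaps as large as those reported (84.6% vs. 66.2% over 500 shared questions), the test on discordant pairs yields a vanishingly small P value, consistent with the reported P < 0.001.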

Bibliographic Details
Main Authors: Ryan Shean, BA; Tathya Shah, BS; Sina Sobhani, BS; Alan Tang, BS; Ali Setayesh, BA; Kyle Bolo, MD; Van Nguyen, MD; Benjamin Xu, MD, PhD
Affiliations: Keck School of Medicine, University of Southern California, Los Angeles, California (R. Shean, T. Shah, S. Sobhani, A. Tang, A. Setayesh); Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, California (K. Bolo, V. Nguyen, B. Xu)
Correspondence: Benjamin Xu, MD, PhD, Department of Ophthalmology, Keck School of Medicine at the University of Southern California, 1450 San Pablo Street, 4th Floor, Suite 4700, Los Angeles, CA 90033
Format: Article
Language: English
Published: Elsevier, 2025-11-01
Series: Ophthalmology Science, Volume 5, Issue 6, Article 100844
ISSN: 2666-9145
DOI: 10.1016/j.xops.2025.100844
Subjects: Artificial intelligence; Ophthalmology; Medical education; Large language models
Online Access: http://www.sciencedirect.com/science/article/pii/S2666914525001423