Artificial Intelligence vs. Human Cognition: A Comparative Analysis of ChatGPT and Candidates Sitting the European Board of Ophthalmology Diploma Examination

Bibliographic Details
Main Authors: Anna P. Maino, Jakub Klikowski, Brendan Strong, Wahid Ghaffari, Michał Woźniak, Tristan Bourcier, Andrzej Grzybowski
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Vision
Online Access:https://www.mdpi.com/2411-5150/9/2/31
Description
Summary: Background/Objectives: This paper aims to assess ChatGPT’s performance in answering European Board of Ophthalmology Diploma (EBOD) examination papers and to compare these results to pass benchmarks and candidate results. Methods: This cross-sectional study used a sample of past exam papers from the 2012, 2013, and 2020–2023 EBOD examinations. The study analyzed ChatGPT’s responses to 440 multiple choice questions (MCQs), each containing five true/false statements (2200 statements in total), and to 48 single best answer (SBA) questions. Results: For MCQs, ChatGPT scored 64.39% on average, and its strongest metric performance was precision (68.76%). ChatGPT performed best at answering pathology MCQs (Grubbs test p < 0.05), while optics and refraction was the lowest-scoring MCQ topic across all metrics. ChatGPT-3.5 Turbo performed worse than human candidates and ChatGPT-4o on easy questions (75% vs. 100% accuracy) but outperformed humans and ChatGPT-4o on challenging questions (50% vs. 28% accuracy). For SBAs, ChatGPT averaged 28.43%, with its strongest performance again in precision (29.36%); pathology SBA questions were consistently the lowest-scoring topic across most metrics. ChatGPT demonstrated a nonsignificant tendency to select option 1 more frequently (p = 0.19). Human candidates scored higher than ChatGPT on SBAs across all metrics measured. Conclusions: ChatGPT performed better on true/false questions, reaching the pass mark in most instances, and worse on SBA questions, suggesting that its ability in information retrieval is better than its ability in knowledge integration. ChatGPT could become a valuable tool in ophthalmic education, allowing exam boards to test their exam papers to ensure they are pitched at the right level, to mark open-ended questions, and to provide detailed feedback.
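The abstract reports per-statement accuracy and precision for the true/false MCQ statements and an exact-style test of whether option 1 was chosen more often than chance on the SBAs. The Python sketch below shows one way such scoring could be reproduced; the function names, the example grades, the assumption of four options per SBA, and the use of a binomial test are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch: scoring true/false statement responses and checking
# for a positional answer bias. The grades and counts below are made up for
# illustration; the paper's actual scoring pipeline is not specified here.
from scipy.stats import binomtest

def accuracy(predicted: list[bool], correct: list[bool]) -> float:
    """Fraction of statements graded correctly."""
    return sum(p == c for p, c in zip(predicted, correct)) / len(correct)

def precision(predicted: list[bool], correct: list[bool]) -> float:
    """Of the statements marked 'true' by the model, the fraction that are true."""
    true_positives = sum(p and c for p, c in zip(predicted, correct))
    predicted_positives = sum(predicted)
    return true_positives / predicted_positives if predicted_positives else 0.0

# Illustrative (fabricated) grades for a handful of true/false statements.
model_answers = [True, False, True, True, False, True]
answer_key    = [True, False, False, True, True, True]

print(f"accuracy:  {accuracy(model_answers, answer_key):.2%}")
print(f"precision: {precision(model_answers, answer_key):.2%}")

# Positional-bias check on the 48 SBAs: assuming four options per question,
# an unbiased chooser would pick option 1 with probability 0.25. An exact
# binomial test gives a p-value for the observed count (count is made up).
option1_count = 17
result = binomtest(option1_count, n=48, p=0.25)
print(f"option-1 preference p-value: {result.pvalue:.2f}")
```

Other positional-bias tests (e.g. a chi-squared test over all option frequencies) would work equally well; the binomial test is used here only because the abstract singles out option 1.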
ISSN:2411-5150