Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents

Bibliographic Details
Main Author: Melih Can Gül (Department of Gastrointestinal Surgery, Afyonkarahisar State Hospital)
Format: Article
Language: English
Published: BMC, 2025-08-01
Series: BMC Medical Education
ISSN: 1472-6920
Collection: DOAJ
Subjects: Artificial intelligence; Board examinations; Human-AI comparison; Medical education; Surgical training
Online Access: https://doi.org/10.1186/s12909-025-07856-7
Abstract
Background: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.

Methods: A total of 120 multiple-choice questions were systematically sampled from the General Surgery Examination and Board Review question bank using a structured randomization protocol. The questions were administered via Google Forms to four large language models (Llama-3, GPT-4o, Gemini, and Copilot) and 30 surgeons (15 board-certified specialists and 15 residents) under timed, single-session conditions. Participant demographics (age, gender, years of experience) were recorded. Questions were categorized by word count (short, medium, long) and by difficulty level (easy, moderate, hard), rated by three independent board-certified surgeons. Group accuracy rates were compared using ANOVA with appropriate post-hoc tests, and 95% confidence intervals were reported.

Results: Board-certified surgeons achieved the highest accuracy rate at 81.6% (95% CI: 78.9–84.3), followed by surgical residents at 69.9% (95% CI: 66.7–73.1). Among large language models (LLMs), Llama-3 demonstrated the best performance with an accuracy of 65.8% (95% CI: 62.4–69.2), whereas Copilot showed the lowest performance at 51.7% (95% CI: 48.1–55.3). LLM performance declined significantly as item difficulty and length increased, particularly for Copilot (68.3% on short vs. 36.4% on long questions, p < 0.001). In contrast, human participants maintained relatively stable accuracy across difficulty levels. Notably, only Llama-3 ranked within the human performance range, placing 26th among 30 surgeons, while all other LLMs failed to surpass the 60% accuracy threshold (p < 0.001).

Conclusion: Current LLMs underperform compared to human specialists when faced with questions requiring high-level medical knowledge, reinforcing their current role as supplementary tools in surgical education rather than replacements for expert clinical judgment.
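As context for the reported analysis (group accuracy rates compared with ANOVA, post-hoc tests, and 95% confidence intervals), the sketch below shows how such a comparison could be set up in Python with SciPy and statsmodels. It is illustrative only and is not the study's code: the per-respondent scores, the two middle LLM totals, and the pooling of items across respondents are placeholder assumptions.

# Minimal, illustrative sketch (not the study's actual code): group accuracy
# with Wilson 95% CIs, one-way ANOVA, and Tukey HSD post-hoc comparisons.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.multicomp import pairwise_tukeyhsd

N_ITEMS = 120  # questions answered by every participant, as in the study design

# Hypothetical per-respondent counts of correct answers (placeholders, not study data).
specialists = np.array([101, 97, 99, 96, 102, 95, 98, 100, 97, 99, 96, 98, 101, 95, 99])
residents = np.array([85, 88, 82, 80, 86, 83, 84, 81, 87, 79, 82, 85, 83, 84, 86])
# Order: Llama-3, GPT-4o, Gemini, Copilot; only the first and last totals are implied by the abstract.
llms = np.array([79, 71, 67, 62])

def report(name, correct):
    # Pooled accuracy across all items answered by the group, with a Wilson 95% CI.
    k, n = int(correct.sum()), correct.size * N_ITEMS
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{name}: {k / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")

for name, group in (("Specialists", specialists), ("Residents", residents), ("LLMs", llms)):
    report(name, group)

# One-way ANOVA on per-respondent accuracy, then Tukey HSD for pairwise contrasts.
acc = [group / N_ITEMS for group in (specialists, residents, llms)]
print("ANOVA:", f_oneway(*acc))
scores = np.concatenate(acc)
labels = ["specialist"] * 15 + ["resident"] * 15 + ["llm"] * 4
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))

Note that pooling items across respondents for the confidence intervals ignores clustering within respondents; a stricter analysis would use per-respondent or mixed-effects methods, which are beyond this sketch.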