Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents
| Main Author: | |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-08-01 |
| Series: | BMC Medical Education |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s12909-025-07856-7 |
| Summary: | Background: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels. Methods: A total of 120 multiple-choice questions were systematically sampled from the General Surgery Examination and Board Review question bank using a structured randomization protocol. The questions were administered via Google Forms to four large language models (Llama-3, GPT-4o, Gemini, and Copilot) and 30 surgeons (15 board-certified specialists and 15 residents) under timed, single-session conditions. Participant demographics (age, gender, years of experience) were recorded. Questions were categorized by word count (short, medium, long) and by difficulty level (easy, moderate, hard), rated by three independent board-certified surgeons. Group accuracy rates were compared using ANOVA with appropriate post-hoc tests, and 95% confidence intervals were reported. Results: Board-certified surgeons achieved the highest accuracy rate at 81.6% (95% CI: 78.9–84.3), followed by surgical residents at 69.9% (95% CI: 66.7–73.1). Among large language models (LLMs), Llama-3 demonstrated the best performance with an accuracy of 65.8% (95% CI: 62.4–69.2), whereas Copilot showed the lowest performance at 51.7% (95% CI: 48.1–55.3). LLM performance declined significantly as item difficulty and length increased, particularly for Copilot (68.3% on short vs. 36.4% on long questions, p < 0.001). In contrast, human participants maintained relatively stable accuracy across difficulty levels. Notably, only Llama-3 ranked within the human performance range, placing 26th among 30 surgeons, while all other LLMs failed to surpass the 60% accuracy threshold (p < 0.001). Conclusion: Current LLMs underperform compared to human specialists when faced with questions requiring high-level medical knowledge, reinforcing their current role as supplementary tools in surgical education rather than replacements for expert clinical judgment. |
| ISSN: | 1472-6920 |
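
The Methods summarized above compare group accuracy rates with ANOVA and report 95% confidence intervals for each group's accuracy. The sketch below illustrates what such an analysis could look like; it is not the authors' code, the per-question scores are simulated placeholders, and the normal-approximation interval is only one of several ways the reported CIs could have been computed.

```python
# Minimal sketch (assumed, simulated data): one-way ANOVA across groups plus
# normal-approximation 95% confidence intervals for each group's accuracy.
import numpy as np
from scipy import stats

# Hypothetical per-question scores (1 = correct, 0 = incorrect) for 120 items;
# the group labels and accuracy levels loosely mirror those in the abstract.
rng = np.random.default_rng(0)
groups = {
    "specialists": rng.binomial(1, 0.82, 120),
    "residents":   rng.binomial(1, 0.70, 120),
    "Llama-3":     rng.binomial(1, 0.66, 120),
    "Copilot":     rng.binomial(1, 0.52, 120),
}

# One-way ANOVA across the groups; post-hoc pairwise tests would follow if p < 0.05.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Normal-approximation 95% CI for each group's accuracy proportion.
for name, scores in groups.items():
    p_hat = scores.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / len(scores))
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    print(f"{name}: {p_hat:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```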