Large language models in medical education: a comparative cross-platform evaluation in answering histological questions
Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in the basic medical sciences remains incompletely characterized. Medical histology, which requires both factual knowledge and interpretative skill, provides a unique domain for evaluating AI capabilities in medical education. This study aimed to evaluate and compare the performance of five current LLMs (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1) in answering medical histology multiple-choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs across 20 topics; each LLM answered all 200 questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (intraclass correlation coefficient, ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (mean 91.1%, SD 7.2%). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4.1 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (p > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4.1 (ICC = 0.882). Complete accuracy and reproducibility (100%) were observed in Histological Methods, Blood and Hemopoiesis, and the Circulatory System, while Muscle Tissue (76.0%) and the Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histology MCQs, exceeding their reported performance in other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate a continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education. Clinical trial number: not applicable, as this study does not involve medicinal products or therapeutic interventions.
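The reliability metric reported in the abstract, the intraclass correlation coefficient (ICC), can be illustrated with a short sketch. The exact ICC variant used by the study is not specified in this record; the example below assumes ICC(3,1) from a two-way mixed-effects model, and all the accuracy numbers in it are made up for illustration only.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical data: accuracy on 20 topics (rows) over 3 attempts (columns).
# These numbers are illustrative only, not the study's actual results.
base = np.linspace(0.70, 1.00, 20)                    # per-topic "true" accuracy
X = np.column_stack([base, base + 0.01, base - 0.01]) # three near-identical attempts
n, k = X.shape

# Two-way ANOVA decomposition underlying ICC(3,1):
grand = X.mean()
ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between-topic variance
ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between-attempt variance
ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols # residual
ms_rows = ss_rows / (n - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

# ICC(3,1): consistency of single measurements with raters (attempts) fixed.
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
print(f"ICC(3,1) = {icc:.3f}")

# One-way ANOVA across hypothetical per-topic accuracies of five systems,
# analogous to the between-system comparison reported in the abstract.
rng = np.random.default_rng(0)
systems = [np.clip(base + rng.normal(0, 0.02, n), 0, 1) for _ in range(5)]
f_stat, p = f_oneway(*systems)
print(f"F = {f_stat:.2f}, p = {p:.3f}")
```

Because the three simulated attempts agree almost perfectly, the ICC comes out near 1, mirroring the high test-retest reliability the abstract reports; the ANOVA step shows the shape of the between-system comparison, not the study's actual F or p values.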
| Main Authors: | Volodymyr Mavrych, Einas M. Yousef, Ahmed Yaqinuddin, Olena Bolgova |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2025-12-01 |
| Series: | Medical Education Online |
| ISSN: | 1087-2981 |
| Subjects: | Large language models; medical education; histology; artificial intelligence; ChatGPT; Claude |
| Online Access: | https://www.tandfonline.com/doi/10.1080/10872981.2025.2534065 |
| Field | Value |
|---|---|
| author | Volodymyr Mavrych Einas M. Yousef Ahmed Yaqinuddin Olena Bolgova |
| description | Large language models (LLMs) have shown promising capabilities across medical disciplines, yet their performance in the basic medical sciences remains incompletely characterized. Medical histology, which requires both factual knowledge and interpretative skill, provides a unique domain for evaluating AI capabilities in medical education. This study aimed to evaluate and compare the performance of five current LLMs (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.0 Flash, Copilot, and DeepSeek R1) in answering medical histology multiple-choice questions (MCQs). This cross-sectional comparative study used 200 USMLE-style histology MCQs across 20 topics; each LLM answered all 200 questions in three separate attempts. Performance metrics included accuracy rates, test-retest reliability (intraclass correlation coefficient, ICC), and topic-specific analysis. Statistical analysis employed ANOVA with post-hoc Tukey's tests and two-way mixed ANOVA for system-topic interactions. All LLMs achieved exceptionally high accuracy (mean 91.1%, SD 7.2%). Gemini performed best (92.0%), followed by Claude (91.5%), Copilot (91.0%), GPT-4.1 (90.8%), and DeepSeek (90.3%), with no significant differences between systems (p > 0.05). Claude showed the highest reliability (ICC = 0.931), followed by GPT-4.1 (ICC = 0.882). Complete accuracy and reproducibility (100%) were observed in Histological Methods, Blood and Hemopoiesis, and the Circulatory System, while Muscle Tissue (76.0%) and the Lymphoid System (84.7%) presented the greatest challenges. LLMs demonstrate exceptional accuracy and reliability in answering histology MCQs, exceeding their reported performance in other medical disciplines. Minimal inter-system variability suggests technological maturity, though topic-specific challenges and reliability concerns indicate a continued need for human expertise. These findings reflect rapid AI advancement and identify histology as particularly suitable for AI-assisted medical education. Clinical trial number: not applicable, as this study does not involve medicinal products or therapeutic interventions. |
| format | Article |
| institution | DOAJ |
| issn | 1087-2981 |
| language | English |
| publishDate | 2025-12-01 |
| publisher | Taylor & Francis Group |
| record_format | Article |
| series | Medical Education Online |
| affiliation | College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia (all four authors) |
| title | Large language models in medical education: a comparative cross-platform evaluation in answering histological questions |
| topic | Large language models; medical education; histology; artificial intelligence; ChatGPT; Claude |
| url | https://www.tandfonline.com/doi/10.1080/10872981.2025.2534065 |