Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders
The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. To evaluate th...
Saved in:
| Main Authors: | Olena Bolgova, Paul Ganguly, Muhammad Faisal Ikram, Volodymyr Mavrych |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2025-12-01 |
| Series: | Medical Education Online |
| Subjects: | Medical education; assessment; short answer questions; large language models; artificial intelligence; claude |
| Online Access: | https://www.tandfonline.com/doi/10.1080/10872981.2025.2550751 |
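The record's abstract reports inter-rater agreement using Cohen's Kappa. As a minimal illustration of how that statistic compares an LLM grader against an expert, here is a stdlib-only sketch; the grade vectors below are made-up examples, not the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical labels of the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # observed proportion of exact agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # expected agreement if the raters labeled independently
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# hypothetical grades on a 0 (unsatisfactory) to 3 (excellent) scale
expert = [3, 2, 2, 0, 1, 3, 2, 1, 0, 2]
llm    = [3, 2, 1, 0, 1, 3, 2, 2, 0, 2]
print(round(cohens_kappa(expert, llm), 3))  # → 0.722
```

A kappa of 0.722 would fall in the "substantial" agreement band commonly cited for this statistic; the abstract's expert-expert average of 0.69 sits in the same band, while the best expert-LLM values (0.61, 0.53) are lower.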
| _version_ | 1849225398395600896 |
|---|---|
| author | Olena Bolgova Paul Ganguly Muhammad Faisal Ikram Volodymyr Mavrych |
| author_facet | Olena Bolgova Paul Ganguly Muhammad Faisal Ikram Volodymyr Mavrych |
| author_sort | Olena Bolgova |
| collection | DOAJ |
| description | The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. The objective was to evaluate the capability of five LLMs to grade medical SAQs compared to expert human graders across four distinct medical disciplines. This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen’s Kappa and the Intraclass Correlation Coefficient (ICC). Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while degrading others. No single LLM consistently outperformed the others across all domains. LLM strictness in grading unsatisfactory responses differed substantially from that of the experts. LLMs demonstrated domain-specific variations in grading capabilities. The provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight. |
| format | Article |
| id | doaj-art-53ad28eb5be74b5d8e367b9667da2a65 |
| institution | Kabale University |
| issn | 1087-2981 |
| language | English |
| publishDate | 2025-12-01 |
| publisher | Taylor & Francis Group |
| record_format | Article |
| series | Medical Education Online |
| spelling | doaj-art-53ad28eb5be74b5d8e367b9667da2a652025-08-24T18:59:03ZengTaylor & Francis GroupMedical Education Online1087-29812025-12-0130110.1080/10872981.2025.2550751Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human gradersOlena Bolgova0Paul Ganguly1Muhammad Faisal Ikram2Volodymyr Mavrych3College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi ArabiaCollege of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi ArabiaCollege of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi ArabiaCollege of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi ArabiaThe assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. To evaluate the capability of five LLMs to grade medical SAQs compared to expert human graders across four distinct medical disciplines. This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen’s Kappa and Intraclass Correlation Coefficient (ICC). Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while decreasing others. No single LLM consistently outperformed others across all domains. LLM strictness in grading unsatisfactory responses varied substantially from experts. LLMs demonstrated domain-specific variations in grading capabilities. The provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.https://www.tandfonline.com/doi/10.1080/10872981.2025.2550751Medical educationassessmentshort answer questionslarge language modelsartificial intelligenceclaude |
| spellingShingle | Olena Bolgova Paul Ganguly Muhammad Faisal Ikram Volodymyr Mavrych Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders Medical Education Online Medical education assessment short answer questions large language models artificial intelligence claude |
| title | Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders |
| title_full | Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders |
| title_fullStr | Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders |
| title_full_unstemmed | Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders |
| title_short | Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders |
| title_sort | evaluating large language models as graders of medical short answer questions a comparative analysis with expert human graders |
| topic | Medical education assessment short answer questions large language models artificial intelligence claude |
| url | https://www.tandfonline.com/doi/10.1080/10872981.2025.2550751 |
| work_keys_str_mv | AT olenabolgova evaluatinglargelanguagemodelsasgradersofmedicalshortanswerquestionsacomparativeanalysiswithexperthumangraders AT paulganguly evaluatinglargelanguagemodelsasgradersofmedicalshortanswerquestionsacomparativeanalysiswithexperthumangraders AT muhammadfaisalikram evaluatinglargelanguagemodelsasgradersofmedicalshortanswerquestionsacomparativeanalysiswithexperthumangraders AT volodymyrmavrych evaluatinglargelanguagemodelsasgradersofmedicalshortanswerquestionsacomparativeanalysiswithexperthumangraders |