Is GPT-4 fair? An empirical analysis in automatic short answer grading

Bibliographic Details
Main Authors: Luiz Rodrigues, Cleon Xavier, Newarney Costa, Dragan Gasevic, Rafael Ferreira Mello
Format: Article
Language: English
Published: Elsevier 2025-06-01
Series: Computers and Education: Artificial Intelligence
Subjects:
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X25000682
Description
Summary: Short open-ended questions represent a central resource in formative and summative assessments in both face-to-face and online settings, ranging from elementary to higher education. However, grading these questions remains challenging for instructors, drawing attention to the field of Automatic Short Answer Grading (ASAG). While ASAG has yielded valuable contributions to learning analytics, it often faces generalizability issues. Accordingly, the rapid advancement of Large Language Models (LLMs) has motivated their adoption to empower ASAG systems. Despite that, previous research has not investigated whether LLMs are fair graders in the context of ASAG. Therefore, this paper presents an empirical analysis aimed at understanding LLMs' fairness in ASAG by using human grades as a baseline, comparing them to GPT-4's grades, and investigating whether the LLM's grades are equivalent when grading answers from varied groups of humans. Our results demonstrated that, while GPT-4 tended to be more lenient in its grading, it maintained consistent evaluation standards when assessing responses from different groups of students. GPT-4 remained consistent across questions of different subjects and levels of Bloom's taxonomy, and across people with different demographics. These findings suggest GPT-4 is a fair grader, supporting its potential to empower educators and developers in using and designing ASAG systems. Nevertheless, we recommend further research to investigate these findings and understand how to optimize GPT-4's grades.
ISSN: 2666-920X