Grading explanations of problem-solving process and generating feedback using large language models at human-level accuracy


Bibliographic Details
Main Authors: Zhongzhou Chen, Tong Wan
Format: Article
Language: English
Published: American Physical Society 2025-03-01
Series: Physical Review Physics Education Research
Online Access: http://doi.org/10.1103/PhysRevPhysEducRes.21.010126
Description
Summary: [This paper is part of the Focused Collection in Artificial Intelligence Tools in Physics Teaching and Physics Education Research.] This study examines the feasibility and potential advantages of using large language models, in particular GPT-4o, to perform partial-credit grading of large numbers of student written responses to introductory-level physics problems. Students were instructed to write down verbal explanations of their reasoning process when solving one conceptual and two numerical calculation problems on two exams. The explanations were then graded according to a three-item rubric, with each item graded as binary (1 or 0). We first demonstrate that machine grading using GPT-4o with no examples or reference answers can reliably agree with human graders in 70%–80% of all cases, which is equal to or higher than the rate at which two human graders agree with each other. Two methods are essential for achieving this level of accuracy: (i) adding explanation language to each rubric item that targets the errors of initial machine grading and (ii) running the grading process five times and taking the most frequent outcome. Next, we show that the variation in outcomes across the five machine grading attempts can serve as a grading confidence index. The index allows a human expert to identify ∼40% of all potentially incorrect gradings by reviewing just the 10%–15% of responses with the highest variation. Finally, we show that it is straightforward to use GPT-4o to write a clear and detailed explanation of the partial-credit grading outcome. Those explanations can be used as feedback for students, allowing them to understand their grades and to raise differing opinions when necessary. Almost all feedback messages generated were rated three or above on a five-point scale by two instructors who had taught the course multiple times. The entire grading and feedback-generation process costs roughly $5 per 100 student answers, showing immense promise for automating a labor-intensive grading process through a combination of machine grading and human input and supervision.
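The majority-vote grading and confidence-index steps described in the summary lend themselves to a short sketch. The Python snippet below is a minimal illustration under stated assumptions, not code from the paper: the `grade_once` callable stands in for a single GPT-4o grading call against the three-item rubric, and all function and parameter names (`grade_with_majority_vote`, `flag_for_review`, `n_runs`, `budget`) are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# A grade is a tuple of binary rubric-item scores, e.g. (1, 0, 1).
Grade = tuple[int, ...]

def grade_with_majority_vote(
    response: str,
    grade_once: Callable[[str], Grade],  # one GPT-4o grading call (hypothetical wrapper)
    n_runs: int = 5,
) -> tuple[Grade, float]:
    """Grade a response n_runs times; return the most frequent outcome
    and the fraction of runs that agreed with it (a confidence index)."""
    outcomes = [grade_once(response) for _ in range(n_runs)]
    majority, freq = Counter(outcomes).most_common(1)[0]
    return majority, freq / n_runs

def flag_for_review(confidences: Sequence[float], budget: float = 0.15) -> list[int]:
    """Return indices of the lowest-confidence (highest-variation) responses,
    up to roughly the given fraction of the whole set, for human review."""
    k = max(1, round(len(confidences) * budget))
    return sorted(range(len(confidences)), key=lambda i: confidences[i])[:k]
```

Under this reading, a confidence of 1.0 means all five runs agreed; routing the 10%–15% of responses with the lowest values to a human grader corresponds to the review strategy the abstract reports for catching ∼40% of potentially incorrect gradings.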
ISSN: 2469-9896