Evaluating large language models for criterion-based grading from agreement to consistency

Bibliographic Details
Main Authors: Da-Wei Zhang, Melissa Boey, Yan Yu Tan, Alexis Hoh Sheng Jia
Format: Article
Language: English
Published: Nature Portfolio, 2024-12-01
Series: npj Science of Learning
Online Access: https://doi.org/10.1038/s41539-024-00291-1
Description
Summary: This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines the impact of prompt engineering with detailed criteria on grading. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs achieve criterion-based grading when given a detailed understanding of the criteria, underscoring the importance of domain-specific understanding over model complexity. These findings highlight the potential of LLMs to deliver scalable educational feedback.
ISSN: 2056-7936