Evaluating large language models for criterion-based grading: from agreement to consistency

Abstract: This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines the impact of prompt engineering with detailed criteria on grading. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs achieve criterion-based grading with a detailed understanding of the criteria, underscoring the importance of domain-specific understanding over model complexity. These findings highlight the potential of LLMs to deliver scalable educational feedback.
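
As a rough illustration of the terms in the abstract, the minimal Python sketch below shows one plausible reading of "prompt engineering with detailed criteria" (embedding each criterion's descriptor in the grading prompt), of "agreement" with human benchmarks (quadratic weighted kappa over ordinal scores), and of "consistency" (stability of scores across repeated gradings). It assumes a 0-10 integer scale, and every name in it (build_grading_prompt, quadratic_weighted_kappa, consistency) is hypothetical; the paper's actual prompts and metrics may differ.

from collections import Counter

def build_grading_prompt(essay: str, criteria: dict[str, str]) -> str:
    # Embed each criterion and its detailed descriptor directly in the
    # prompt -- the "detailed criteria" style of prompt engineering.
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in criteria.items())
    return (
        "Grade the following essay on a 0-10 integer scale using ONLY "
        "these criteria:\n" + rubric +
        "\n\nEssay:\n" + essay + "\n\nReturn a single integer score."
    )

def quadratic_weighted_kappa(human: list[int], model: list[int], k: int = 11) -> float:
    # Agreement with human benchmarks: 1 means perfect agreement, 0 means
    # chance-level agreement, and larger score gaps are penalized
    # quadratically.
    n = len(human)
    observed = [[0.0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h][m] += 1.0 / n
    hist_h, hist_m = Counter(human), Counter(model)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_m[j] / (n * n)
    return 1.0 - num / den if den else 1.0

def consistency(repeated_scores: list[list[int]]) -> float:
    # Consistency: fraction of essays that receive the identical score on
    # every repeated grading run of the same model and prompt.
    stable = sum(1 for runs in repeated_scores if len(set(runs)) == 1)
    return stable / len(repeated_scores)

Quadratic weighting suits ordinal grades because a score that is off by one point is penalized far less than one that is off by five; intraclass correlation is another common choice for the same purpose.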

Bibliographic Details
Main Authors: Da-Wei Zhang, Melissa Boey, Yan Yu Tan, Alexis Hoh Sheng Jia
Affiliation: Department of Psychology, Jeffrey Cheah School of Medicine and Health Sciences, Monash University Malaysia (all authors)
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
Series: npj Science of Learning
ISSN: 2056-7936
Collection: DOAJ
Online Access:https://doi.org/10.1038/s41539-024-00291-1