Evaluating large language models for criterion-based grading from agreement to consistency
Abstract:
This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines the impact of prompt engineering with detailed criteria on grading. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs achieve criter...
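This record does not reproduce the authors' prompts or code. As a rough illustration of what "prompt engineering with detailed criteria" can look like in practice, below is a minimal Python sketch assuming an OpenAI-style chat API; the rubric text, model name, and sample essay are hypothetical placeholders, not material from the study.

```python
# Minimal sketch of criterion-based grading with an LLM prompt.
# Assumptions (not from the study): the OpenAI Python client (v1.x),
# a hypothetical rubric, and a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical grading criteria; the study's actual rubric is not
# included in this record.
RUBRIC = """Grade the essay on a 0-10 scale using these criteria:
1. Thesis clarity (0-3): the argument is stated explicitly.
2. Evidence (0-4): claims are supported with specific examples.
3. Organization (0-3): paragraphs follow a logical sequence.
Return the three sub-scores, the total, and a one-sentence rationale."""

def grade_essay(essay: str) -> str:
    """Ask the model for a criterion-referenced grade of one essay."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,        # lower run-to-run variance, relevant to consistency checks
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(grade_essay("Sample essay text goes here."))
```

Pinning the rubric in the system message and setting temperature to 0 mirrors the general idea the abstract describes: detailed criteria in the prompt, plus settings that make repeated gradings comparable.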
| Main Authors: | Da-Wei Zhang, Melissa Boey, Yan Yu Tan, Alexis Hoh Sheng Jia |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-12-01 |
| Series: | npj Science of Learning |
| Online Access: | https://doi.org/10.1038/s41539-024-00291-1 |
Similar Items
- Climate Warming in Response to Emission Reductions Consistent with the Paris Agreement
  by: Fang Wang, et al.
  Published: (2018-01-01)
- Is This Reliable Enough? Examining Classification Consistency and Accuracy in a Criterion-Referenced Test
  by: Susanne Alger
  Published: (2016-04-01)
- Reliability Analysis of Horizontal Curves Using Geometric Design Consistency Assessment Criterion
  by: Hossein Saedi, et al.
  Published: (2024-01-01)
- Is This Reliable Enough? Examining Classification Consistency and Accuracy in a Criterion-Referenced Test
  by: Susanne Alger
  Published: (2016-07-01)
- Agreement among the energy expenditure prediction equations with the criterion model in the exhaustive treadmill test protocols
  by: معرفت سیاه کوهیان, et al.
  Published: (2016-11-01)