Evaluating large language models for criterion-based grading: from agreement to consistency
Abstract: This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines the impact of prompt engineering with detailed criteria on grading. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs achieve criterion-based grading with a detailed understanding of the criteria, underscoring the importance of domain-specific understanding over model complexity. These findings highlight the potential of LLMs to deliver scalable educational feedback.
| Main Authors: | Da-Wei Zhang, Melissa Boey, Yan Yu Tan, Alexis Hoh Sheng Jia |
|---|---|
| Affiliation: | Department of Psychology, Jeffrey Cheah School of Medicine and Health Sciences, Monash University Malaysia (all authors) |
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-12-01 |
| Series: | npj Science of Learning |
| ISSN: | 2056-7936 |
| Online Access: | https://doi.org/10.1038/s41539-024-00291-1 |