Can AI provide useful holistic essay scoring?

Researchers have sought for decades to automate holistic essay scoring. Over the years, these programs have improved significantly. However, accuracy requires significant amounts of training on human-scored texts—reducing the expediency and usefulness of such programs for routine uses by teachers ac...

Bibliographic Details
Main Authors: Tamara P. Tate, Jacob Steiss, Drew Bailey, Steve Graham, Youngsun Moon, Daniel Ritchie, Waverly Tseng, Mark Warschauer
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: Computers and Education: Artificial Intelligence
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X24000584
Record ID: doaj-art-b9740b2f8a7544f88e4e523afe5e7e0b
Collection: DOAJ
ISSN: 2666-920X
Volume: 7
Article number: 100255
DOI: 10.1016/j.caeai.2024.100255
Author affiliations:
Tamara P. Tate, University of California, Irvine, USA (corresponding author)
Jacob Steiss, University of California, Irvine, USA
Drew Bailey, University of California, Irvine, USA
Steve Graham, Arizona State University, USA
Youngsun Moon, University of California, Irvine, USA
Daniel Ritchie, University of California, Irvine, USA
Waverly Tseng, University of California, Irvine, USA
Mark Warschauer, University of California, Irvine, USA
Keywords: Artificial intelligence; AI; Automated scoring; Writing; Assessment; Large language models
Description: Researchers have sought for decades to automate holistic essay scoring. Over the years, these programs have improved significantly. However, accuracy requires significant amounts of training on human-scored texts, reducing the expediency and usefulness of such programs for routine uses by teachers across the nation on non-standardized prompts. This study analyzes the output of multiple versions of ChatGPT scoring of secondary student essays from three extant corpora and compares it to quality human ratings. We find that the current iteration of ChatGPT scoring is not statistically significantly different from human scoring; substantial agreement with humans is achievable and may be sufficient for low-stakes, formative assessment purposes. However, as large language models evolve, additional research will be needed to continue to assess their aptitude for this task as well as determine whether their proximity to human scoring can be improved through prompting or training.