Can AI provide useful holistic essay scoring?
Researchers have sought for decades to automate holistic essay scoring. Over the years, these programs have improved significantly. However, accuracy requires significant amounts of training on human-scored texts—reducing the expediency and usefulness of such programs for routine uses by teachers ac...
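| Main Authors: | Tamara P. Tate, Jacob Steiss, Drew Bailey, Steve Graham, Youngsun Moon, Daniel Ritchie, Waverly Tseng, Mark Warschauer |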
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2024-12-01 |
| Series: | Computers and Education: Artificial Intelligence |
| Subjects: | Artificial intelligence; AI; Automated scoring; Writing; Assessment; Large language models |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2666920X24000584 |
| _version_ | 1850054024206745600 |
|---|---|
| author | Tamara P. Tate; Jacob Steiss; Drew Bailey; Steve Graham; Youngsun Moon; Daniel Ritchie; Waverly Tseng; Mark Warschauer |
| author_facet | Tamara P. Tate; Jacob Steiss; Drew Bailey; Steve Graham; Youngsun Moon; Daniel Ritchie; Waverly Tseng; Mark Warschauer |
| author_sort | Tamara P. Tate |
| collection | DOAJ |
| description | Researchers have sought for decades to automate holistic essay scoring. Over the years, these programs have improved significantly. However, accuracy requires significant amounts of training on human-scored texts, reducing the expediency and usefulness of such programs for routine uses by teachers across the nation on non-standardized prompts. This study analyzes the output of multiple versions of ChatGPT scoring of secondary student essays from three extant corpora and compares it to quality human ratings. We find that the current iteration of ChatGPT scoring is not statistically significantly different from human scoring; substantial agreement with humans is achievable and may be sufficient for low-stakes, formative assessment purposes. However, as large language models evolve, additional research will be needed to continue to assess their aptitude for this task as well as determine whether their proximity to human scoring can be improved through prompting or training. |
| format | Article |
| id | doaj-art-b9740b2f8a7544f88e4e523afe5e7e0b |
| institution | DOAJ |
| issn | 2666-920X |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computers and Education: Artificial Intelligence |
| spelling | doaj-art-b9740b2f8a7544f88e4e523afe5e7e0b; 2025-08-20T02:52:23Z; eng; Elsevier; Computers and Education: Artificial Intelligence; 2666-920X; 2024-12-01; 7; 100255; 10.1016/j.caeai.2024.100255; Can AI provide useful holistic essay scoring?; Tamara P. Tate [0]; Jacob Steiss [1]; Drew Bailey [2]; Steve Graham [3]; Youngsun Moon [4]; Daniel Ritchie [5]; Waverly Tseng [6]; Mark Warschauer [7]; [0] University of California, Irvine, USA; Corresponding author. [1] University of California, Irvine, USA; [2] University of California, Irvine, USA; [3] Arizona State University, USA; [4] University of California, Irvine, USA; [5] University of California, Irvine, USA; [6] University of California, Irvine, USA; [7] University of California, Irvine, USA; Researchers have sought for decades to automate holistic essay scoring. Over the years, these programs have improved significantly. However, accuracy requires significant amounts of training on human-scored texts, reducing the expediency and usefulness of such programs for routine uses by teachers across the nation on non-standardized prompts. This study analyzes the output of multiple versions of ChatGPT scoring of secondary student essays from three extant corpora and compares it to quality human ratings. We find that the current iteration of ChatGPT scoring is not statistically significantly different from human scoring; substantial agreement with humans is achievable and may be sufficient for low-stakes, formative assessment purposes. However, as large language models evolve, additional research will be needed to continue to assess their aptitude for this task as well as determine whether their proximity to human scoring can be improved through prompting or training.; http://www.sciencedirect.com/science/article/pii/S2666920X24000584; Artificial intelligence; AI; Automated scoring; Writing; Assessment; Large language models |
| title | Can AI provide useful holistic essay scoring? |
| title_full | Can AI provide useful holistic essay scoring? |
| title_fullStr | Can AI provide useful holistic essay scoring? |
| title_full_unstemmed | Can AI provide useful holistic essay scoring? |
| title_short | Can AI provide useful holistic essay scoring? |
| title_sort | can ai provide useful holistic essay scoring |
| topic | Artificial intelligence; AI; Automated scoring; Writing; Assessment; Large language models |
| url | http://www.sciencedirect.com/science/article/pii/S2666920X24000584 |