Assessing the quality of automatic-generated short answers using GPT-4

Open-ended assessments play a pivotal role in enabling instructors to evaluate student knowledge acquisition and provide constructive feedback. Integrating large language models (LLMs) such as GPT-4 into educational settings presents a transformative opportunity for assessment methodologies. However, the existing literature on LLMs answering open-ended questions lacks breadth, relying on limited data or overlooking question difficulty levels. This study evaluates GPT-4's proficiency in responding to open-ended questions spanning diverse topics and cognitive complexities, in comparison to human responses. To facilitate this assessment, we generated a dataset of 738 open-ended questions across Biology, Earth Sciences, and Physics and systematically categorized it based on Bloom's Taxonomy. Each question included eight human-generated responses and two generated by GPT-4. The outcomes indicate GPT-4's superior performance over humans, both native and non-native speakers, irrespective of gender. Nevertheless, this advantage was not sustained for questions at the 'remembering' or 'creating' levels of Bloom's Taxonomy. These results highlight GPT-4's potential for underpinning advanced question-answering systems, its promising role in supporting non-native speakers, and its capacity to augment teacher assistance in assessments. However, limitations in nuanced argumentation and creativity underscore areas that still require refinement in these models, guiding future research toward bolstering pedagogical support.
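
As a reading aid only (not part of the original record or paper), the following minimal Python sketch models the dataset shape the abstract describes: questions from three subjects, each labeled with a Bloom's Taxonomy level and paired with eight human-written answers and two GPT-4 answers. All names here (QuestionRecord, validate, etc.) are hypothetical, not taken from the study.

```python
# Hypothetical sketch of one record in a dataset shaped like the one the
# abstract describes; field names and the validate() helper are illustrative.
from dataclasses import dataclass, field

SUBJECTS = {"Biology", "Earth Sciences", "Physics"}
BLOOM_LEVELS = {
    "remembering", "understanding", "applying",
    "analyzing", "evaluating", "creating",
}

@dataclass
class QuestionRecord:
    question: str
    subject: str                                        # one of SUBJECTS
    bloom_level: str                                    # one of BLOOM_LEVELS
    human_answers: list = field(default_factory=list)   # 8 per question
    gpt4_answers: list = field(default_factory=list)    # 2 per question

    def validate(self) -> None:
        # Enforce the structure reported in the abstract:
        # eight human-generated responses and two from GPT-4 per question.
        assert self.subject in SUBJECTS, f"unexpected subject: {self.subject}"
        assert self.bloom_level in BLOOM_LEVELS, f"unexpected level: {self.bloom_level}"
        assert len(self.human_answers) == 8, "expected 8 human answers"
        assert len(self.gpt4_answers) == 2, "expected 2 GPT-4 answers"
```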

Bibliographic Details
Main Authors: Luiz Rodrigues, Filipe Dwan Pereira, Luciano Cabral, Dragan Gašević, Geber Ramalho, Rafael Ferreira Mello
Format: Article
Language: English
Published: Elsevier, 2024-12-01
Series: Computers and Education: Artificial Intelligence, Vol. 7 (December 2024), Article 100248
ISSN: 2666-920X
DOI: 10.1016/j.caeai.2024.100248
Subjects: Automatic answer generation; Question-answering; Large language models; GPT-4; Natural language processing
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X24000511

Author Affiliations:
Luiz Rodrigues: Computing Institute, Federal University of Alagoas, Maceió, Alagoas, Brazil (corresponding author)
Filipe Dwan Pereira: Department of Computer Science, Federal University of Roraima, Boa Vista, Brazil
Luciano Cabral: Federal Institute of Pernambuco, Jaboatão dos Guararapes, Brazil
Dragan Gašević: Faculty of Information Technology, Monash University, Melbourne, Australia
Geber Ramalho: Faculty of Information Technology, Monash University, Melbourne, Australia; Federal University of Pernambuco, Recife, Pernambuco, Brazil
Rafael Ferreira Mello: CESAR School, Centro de Estudos e Sistemas Avançados do Recife, Recife, Brazil (corresponding author)