Evaluating the psychometric properties of ChatGPT-generated questions

Bibliographic Details
Main Authors: Shreya Bhandari (University of California, Berkeley, EECS); Yunting Liu, Yerin Kwak, and Zachary A. Pardos (University of California, Berkeley, School of Education; Pardos, corresponding author)
Format: Article
Language: English
Published: Elsevier, 2024-12-01
Series: Computers and Education: Artificial Intelligence, Vol. 7, Article 100284
ISSN: 2666-920X
Subjects: Formative assessment; Generative AI; Item response theory; Psychometric measurement; Large language models
Online Access: http://www.sciencedirect.com/science/article/pii/S2666920X24000870
Description: Little is known about how LLM-generated questions compare to gold-standard, traditional formative assessments in terms of their difficulty and discrimination parameters, properties valued in the psychometric measurement field. We follow a rigorous measurement methodology to compare a set of ChatGPT-generated questions, produced from one lesson summary in a textbook, to existing questions from a published Creative Commons textbook. To do this, we collected and analyzed responses from 207 test respondents who answered questions from both item pools and used a linking methodology to compare IRT properties between the two pools. We find that neither the difficulty nor the discrimination parameters of the 15 items in each pool differ statistically significantly, with some evidence that the ChatGPT items were marginally better at differentiating respondent abilities. Response time also does not differ significantly between the two sources of items. The ChatGPT-generated items showed evidence of unidimensionality and did not affect the unidimensionality of the original set of items when the two were tested together. Finally, through a fine-grained learning objective labeling analysis, we found greater similarity between the learning objective distribution of the ChatGPT-generated items and that of the items from the target OpenStax lesson (0.9666) than between the ChatGPT-generated items and adjacent OpenStax lessons (0.6859 for the previous lesson and 0.6153 for the subsequent lesson). These results support our conclusion that generative AI can produce algebra items of similar quality to existing textbook questions that share the same construct or constructs as those questions.
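
For context on the psychometric terms in the description: in item response theory, difficulty and discrimination are commonly the b and a parameters of the two-parameter logistic (2PL) model. The record does not state the paper's exact model specification, so taking the 2PL as a working assumption, the probability that respondent i answers item j correctly is

\[
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}},
\]

where \(\theta_i\) is the respondent's latent ability, \(b_j\) shifts the curve along the ability scale (difficulty), and \(a_j\) controls its steepness (discrimination). The linking methodology mentioned above places the parameter estimates from both item pools on a common \(\theta\) scale so that a- and b-values can be compared directly.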
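The learning-objective similarity values reported above (0.9666, 0.6859, 0.6153) are consistent with a cosine similarity between per-lesson distributions of learning-objective labels. The record does not name the metric, so the sketch below is an illustration under that assumption; the objective labels and item counts are invented for the example, not taken from the study.

    import numpy as np

    def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
        """Cosine of the angle between two learning-objective count vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Hypothetical learning-objective labels and per-pool item counts
    # (illustrative only; not the study's data).
    objectives = ["solve-linear-eq", "graph-line", "slope-intercept", "systems"]
    chatgpt_pool  = np.array([6, 4, 3, 2])  # ChatGPT-generated items
    target_lesson = np.array([7, 4, 2, 2])  # target OpenStax lesson items
    prev_lesson   = np.array([1, 5, 1, 0])  # previous OpenStax lesson items

    print(cosine_similarity(chatgpt_pool, target_lesson))  # high: similar emphasis
    print(cosine_similarity(chatgpt_pool, prev_lesson))    # lower: different emphasis

A distribution-level comparison like this rewards item pools that cover the same objectives in roughly the same proportions, which is the property the fine-grained labeling analysis tests.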