Evaluating AI-generated vs. human-written reading comprehension passages: an expert SWOT analysis and comparative study for an educational large-scale assessment

Bibliographic Details
Main Authors: Lisa Marie Ripoll Y Schmitz, Philipp Sonnleitner
Format: Article
Language: English
Published: SpringerOpen 2025-07-01
Series: Large-scale Assessments in Education
Subjects:
Online Access: https://doi.org/10.1186/s40536-025-00255-w
collection DOAJ
description Abstract

Background: The increasing capabilities of generative artificial intelligence (AI), exemplified by OpenAI’s transformer-based language model GPT-4 (ChatGPT), have drawn attention to its application in educational contexts. This study evaluates the potential of such models in generating German reading comprehension texts for educational large-scale assessments, within the multilingual context of Luxembourg. Addressing the challenges faced by item developers in sourcing or manually developing numerous suitable texts, the study aims to determine if ChatGPT can assist text creation while maintaining high-quality standards.

Methods: The study employed a mixed-methods approach. In a qualitative focus group discussion, experts identified the strengths, weaknesses, opportunities and threats (SWOT) of using GPT-4 for text generation. These insights informed the construction of a Text Analysis Cognitive Model (TACM), which served as theoretical foundation. Narrative and informative reading comprehension texts were then generated using two distinct prompt engineering techniques, derived from original passages and TACM specifications. In a blinded online review, N = 89 participants evaluated human-written and AI-generated texts with regard to their readability, correctness, coherence, engagement and adequacy for reading assessment.

Results: All administered texts were of similarly high quality, with reviewers being unable to consistently identify authorship origins. Quantitative evaluations indicated that one-shot prompts are effective for creating high-quality informative texts, whereas human-written texts remain superior for narratives. Zero-shot prompts offer considerable flexibility and creativity, but still require human refinement.

Conclusion: These findings offer promising first insights into GPT-4’s capacity to emulate human-written texts which can be used in the large-scale assessment context. The considerable potential of using generative AI-models as a flexible and efficacious assistant in the creation of reading comprehension texts is highlighted. Still, the necessity of human oversight is emphasized through an augmented intelligence-driven perspective. Given the jurisdictional framework of the European Union, an effective implementation of ChatGPT in the test development process remains hypothetical at this time but is likely to change.
id doaj-art-15ff8a051c954d81aa352e212018fba3
institution Kabale University
issn 2196-0739
affiliation Luxembourg Centre for Educational Testing, University of Luxembourg (both authors)
topic Generative artificial intelligence
Large language models
ChatGPT
Reading comprehension
Educational large-scale assessment
Text analysis cognitive model