Evaluating AI-generated vs. human-written reading comprehension passages: an expert SWOT analysis and comparative study for an educational large-scale assessment

Bibliographic Details
Main Authors: Lisa Marie Ripoll Y Schmitz, Philipp Sonnleitner
Format: Article
Language: English
Published: SpringerOpen 2025-07-01
Series: Large-scale Assessments in Education
Online Access: https://doi.org/10.1186/s40536-025-00255-w
Description
Summary:
Background: The increasing capabilities of generative artificial intelligence (AI), exemplified by OpenAI's transformer-based language model GPT-4 (ChatGPT), have drawn attention to its application in educational contexts. This study evaluates the potential of such models for generating German reading comprehension texts for educational large-scale assessments within the multilingual context of Luxembourg. Addressing the challenges item developers face in sourcing or manually writing large numbers of suitable texts, the study aims to determine whether ChatGPT can assist text creation while maintaining high quality standards.

Methods: The study employed a mixed-methods approach. In a qualitative focus group discussion, experts identified the strengths, weaknesses, opportunities, and threats (SWOT) of using GPT-4 for text generation. These insights informed the construction of a Text Analysis Cognitive Model (TACM), which served as the theoretical foundation. Narrative and informative reading comprehension texts were then generated using two distinct prompt engineering techniques, derived from original passages and from TACM specifications. In a blinded online review, N = 89 participants evaluated human-written and AI-generated texts with regard to their readability, correctness, coherence, engagement, and adequacy for reading assessment.

Results: All administered texts were of similarly high quality, and reviewers were unable to consistently identify their authorship. Quantitative evaluations indicated that one-shot prompts are effective for creating high-quality informative texts, whereas human-written texts remain superior for narratives. Zero-shot prompts offer considerable flexibility and creativity but still require human refinement.

Conclusion: These findings offer promising first insights into GPT-4's capacity to emulate human-written texts that can be used in the large-scale assessment context. They highlight the considerable potential of generative AI models as flexible and effective assistants in the creation of reading comprehension texts. Still, the necessity of human oversight is emphasized through an augmented intelligence-driven perspective. Given the jurisdictional framework of the European Union, an effective implementation of ChatGPT in the test development process remains hypothetical at this time but is likely to change.
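The abstract contrasts two prompting strategies without reproducing the study's prompts. The following Python sketch illustrates only the general difference between zero-shot prompting (task specification alone) and one-shot prompting (specification plus one reference passage). It assumes the OpenAI Python SDK; the specification text is hypothetical, since the actual TACM-derived prompts are not given in this record.

# Minimal sketch, assuming the OpenAI Python SDK (pip install openai)
# and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical task specification; the study's TACM requirements are not
# published in this abstract.
spec = ("Write an informative German reading comprehension passage of about "
        "300 words for 9th-grade students on a science topic.")

# Zero-shot: the model receives only the specification, no example text.
zero_shot = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": spec}],
)

# One-shot: the specification is preceded by one human-written reference
# passage so the model can imitate its style, length, and difficulty.
reference_passage = "..."  # an original human-written passage would go here
one_shot = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (f"Example passage:\n{reference_passage}\n\n{spec} "
                    "Match the example's style, length, and difficulty."),
    }],
)

print(zero_shot.choices[0].message.content)
print(one_shot.choices[0].message.content)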
ISSN: 2196-0739