Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling

Bibliographic Details
Main Author: Jiayi Deng
Format: Article
Language:English
Published: SpringerOpen 2025-08-01
Series:Large-scale Assessments in Education
Subjects:
Online Access:https://doi.org/10.1186/s40536-025-00265-8
_version_ 1849343093141143552
author Jiayi Deng
author_facet Jiayi Deng
author_sort Jiayi Deng
collection DOAJ
description Abstract Background Test score comparability in international large-scale assessments (LSAs) is of great importance for ensuring test fairness. To compare test scores on an international scale, score linking is widely used to convert raw scores from different linguistic versions of test forms onto a common score scale. An example is the multigroup concurrent IRT calibration method, which estimates item and ability parameters across multiple linguistic groups of test-takers. Although prior researchers demonstrated its effectiveness in offering greater global comparability of score scales, they assumed comparable test-taking effort across cultural and linguistic populations. This assumption may not hold due to differential rapid guessing (RG) rates, potentially biasing item parameter estimation. To address this gap, this study investigated the linking errors introduced by RG responses when employing multigroup concurrent IRT calibration. Method In the analysis, RG responses were identified using response time data. The study used data from the Arabic and Chinese groups in the PISA 2018 Form 18 science module. Test scores for these two linguistic groups were linked through multigroup concurrent IRT calibration, which applies common item parameters across most items and groups while allowing a select few items to have group-specific parameters. Item-level model fit was assessed to identify items requiring group-specific parameters. Results The Arabic group showed notably higher RG rates on the selected test form than the Chinese group. Despite the observed differential RG, the multigroup concurrent IRT calibration procedure was robust with respect to anchor and misfit item identification and ability estimation. However, differential RG was found to have the potential to reduce the precision of individual ability estimates.
Findings suggest that RG can influence the multigroup concurrent IRT calibration process, potentially compromising the fairness of test scores in international LSAs. Conclusion This study highlights the critical need to identify and address noneffortful test-taking behaviors, such as RG, to ensure comparability of test scores across different linguistic versions of an assessment. Additionally, documenting variations in test-taking effort across countries and languages is essential for accurate student performance evaluation and informed educational decisions.
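The Method passage above describes identifying RG responses from response-time data. A minimal sketch of one common approach in the RG literature, a fixed response-time threshold; the record does not specify the study's actual threshold procedure, so the 5-second cutoff, the function names, and the response-time data below are purely illustrative assumptions:

```python
# Hypothetical sketch: flagging rapid-guessing (RG) responses with a fixed
# response-time threshold. The 5-second cutoff and all data are illustrative
# only; the study's actual threshold method is not described in this record.

def flag_rapid_guesses(response_times, threshold_seconds=5.0):
    """Return True for each response faster than the RG threshold."""
    return [rt < threshold_seconds for rt in response_times]

def rg_rate(response_times, threshold_seconds=5.0):
    """Proportion of a group's responses flagged as rapid guesses."""
    flags = flag_rapid_guesses(response_times, threshold_seconds)
    return sum(flags) / len(flags)

# Illustrative comparison of RG rates for two groups (fabricated times, seconds)
group_a_times = [2.1, 3.4, 12.0, 1.8, 4.9, 30.2]
group_b_times = [14.5, 22.0, 9.8, 3.1, 18.7, 25.0]
print(round(rg_rate(group_a_times), 2))  # 0.67
print(round(rg_rate(group_b_times), 2))  # 0.17
```

Comparing such per-group RG rates is how differential RG, as reported for the Arabic and Chinese groups, would be quantified before calibration.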
format Article
id doaj-art-bf4a86ec0f584c659e5301318a256a77
institution Kabale University
issn 2196-0739
language English
publishDate 2025-08-01
publisher SpringerOpen
record_format Article
series Large-scale Assessments in Education
spelling doaj-art-bf4a86ec0f584c659e5301318a256a77 2025-08-20T03:43:10Z eng SpringerOpen Large-scale Assessments in Education 2196-0739 2025-08-01 Vol. 13, Iss. 1, pp. 1-18 10.1186/s40536-025-00265-8 Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling Jiayi Deng (Human Resources Research Organization) https://doi.org/10.1186/s40536-025-00265-8
spellingShingle Jiayi Deng
Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling
Large-scale Assessments in Education
Multigroup concurrent IRT calibration
Rapid guessing
Score linking
International large-scale assessment
title Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling
title_full Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling
title_fullStr Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling
title_full_unstemmed Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling
title_short Linking errors introduced by rapid guessing responses when employing multigroup concurrent IRT scaling
title_sort linking errors introduced by rapid guessing responses when employing multigroup concurrent irt scaling
topic Multigroup concurrent IRT calibration
Rapid guessing
Score linking
International large-scale assessment
url https://doi.org/10.1186/s40536-025-00265-8
work_keys_str_mv AT jiayideng linkingerrorsintroducedbyrapidguessingresponseswhenemployingmultigroupconcurrentirtscaling