Chatbots’ Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis
| Main Authors: | Enjy Abouzeid, Rita Wassef, Ayesha Jawwad, Patricia Harris |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-05-01 |
| Series: | JMIR Medical Education |
| Online Access: | https://mededu.jmir.org/2025/1/e69521 |
| _version_ | 1850225001202974720 |
|---|---|
| author | Enjy Abouzeid; Rita Wassef; Ayesha Jawwad; Patricia Harris |
| author_sort | Enjy Abouzeid |
| collection | DOAJ |
| description |
Abstract
Background: Programmatic assessment supports flexible learning and individual progression but challenges educators to develop frequent assessments reflecting different competencies. Continuously creating large volumes of assessment items, in a consistent format and within a comparatively restricted time, is laborious. Technological innovations, including artificial intelligence (AI), have been applied to address this challenge. A major concern is the validity of the information produced by AI tools; if not properly verified, they can produce inaccurate and therefore inappropriate assessments.
Objective: This study was designed to examine the content validity and consistency of different AI chatbots in creating single best answer (SBA) questions, a refined form of the multiple-choice question better suited to assessing higher levels of knowledge, for undergraduate medical students.
Methods: This study followed 3 steps. First, 3 researchers used a unified prompt script to generate 10 SBA questions on each of 4 chatbot platforms. Second, assessors evaluated the chatbot outputs for consistency by identifying similarities and differences between users and across chatbots; with 3 assessors and 10 learning objectives, the maximum possible score for any individual chatbot was 30. Third, 7 assessors internally moderated the questions using a rating scale developed by the research team to evaluate scientific accuracy and educational quality.
Results: In response to the prompts, all chatbots generated 10 questions each, except Bing, which failed to respond to 1 prompt. ChatGPT-4 exhibited the highest variation in question generation but did not fully satisfy the “cover test.” Gemini performed well across most evaluation criteria except item balance; it relied heavily on the vignette for answers but showed a preference for one answer option. Bing scored low in most evaluation areas but generated appropriately structured lead-in questions. SBA questions from GPT-3.5, Gemini, and ChatGPT-4 had similar Item Content Validity Index and Scale Level Content Validity Index values, while the Krippendorff alpha coefficient was low (0.016). Bing performed poorly in content clarity, overall validity, and item construction accuracy. A 2-way ANOVA without replication revealed statistically significant differences among chatbots and domains. (Illustrative sketches of the consistency scoring and statistical procedures follow this record.)
Conclusions: AI chatbots can aid the production of questions aligned with learning objectives, and individual chatbots have their own strengths and weaknesses. Nevertheless, all require expert evaluation to ensure their suitability for use. Using AI to generate SBA questions prompts us to reconsider Bloom’s taxonomy of the cognitive domain, which traditionally positions creation as the highest level of cognition. |
| format | Article |
| id | doaj-art-31b8966a2f8b4516ba59e76b4356a455 |
| institution | OA Journals |
| issn | 2369-3762 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | JMIR Publications |
| record_format | Article |
| series | JMIR Medical Education |
| doi | 10.2196/69521 |
| author_orcid | Enjy Abouzeid: 0000-0002-9431-6019; Rita Wassef: 0000-0002-8431-2428; Ayesha Jawwad: 0000-0003-1508-1395; Patricia Harris: 0000-0002-6593-8185 |
| title | Chatbots’ Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis |
| url | https://mededu.jmir.org/2025/1/e69521 |
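The consistency scoring described in the Methods (3 assessors, 10 learning objectives, maximum score of 30 per chatbot) can be illustrated with a minimal sketch. This is not the authors' instrument: the 0/1 award per learning objective and the example judgments below are assumptions made only to show how the ceiling of 30 arises.

```python
# Minimal sketch (assumption: each assessor awards 1 point per learning
# objective when a chatbot's outputs are judged consistent across users,
# 0 otherwise). The judgments below are invented placeholders.

N_ASSESSORS = 3
N_OBJECTIVES = 10
MAX_SCORE = N_ASSESSORS * N_OBJECTIVES  # 30, as stated in the Methods

# consistency_judgments[assessor][objective] -> 1 if judged consistent, else 0
consistency_judgments = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],  # assessor 1
    [1, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # assessor 2
    [1, 1, 1, 1, 0, 1, 0, 1, 1, 1],  # assessor 3
]

chatbot_score = sum(sum(row) for row in consistency_judgments)
print(f"Consistency score for this chatbot: {chatbot_score}/{MAX_SCORE}")
```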
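The Results report Item and Scale Level Content Validity Index values and a low Krippendorff alpha (0.016). A common way to compute these from expert ratings is sketched below; the rating matrix, the 4-point relevance scale, the "rating of 3 or 4 counts as relevant" rule, and the use of the third-party krippendorff package are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative sketch: I-CVI, S-CVI/Ave, and Krippendorff's alpha from a
# hypothetical item-by-assessor rating matrix (values on an assumed 1-4 scale).

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# rows = items (e.g., SBA questions), columns = assessors (e.g., 7 raters)
ratings = np.array([
    [4, 3, 4, 4, 3, 4, 4],
    [3, 4, 4, 2, 4, 3, 4],
    [2, 2, 3, 2, 1, 2, 3],
    [4, 4, 4, 4, 4, 3, 4],
])

# I-CVI: proportion of assessors rating an item 3 or 4 ("relevant")
i_cvi = (ratings >= 3).mean(axis=1)

# S-CVI/Ave: mean of the item-level indices across the scale
s_cvi_ave = i_cvi.mean()

# Krippendorff's alpha expects raters as rows and items as columns
alpha = krippendorff.alpha(reliability_data=ratings.T,
                           level_of_measurement="ordinal")

print("I-CVI per item:", np.round(i_cvi, 2))
print("S-CVI/Ave:", round(s_cvi_ave, 2))
print("Krippendorff alpha:", round(alpha, 3))
```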
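The Results also mention a 2-way ANOVA without replication across chatbots and domains. The sketch below shows how such a design (one observation per chatbot-by-domain cell, main effects only, no interaction term) could be fitted with statsmodels; the domain names and scores are invented placeholders, not the study's data.

```python
# Illustrative sketch: additive two-way ANOVA without replication.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "chatbot": ["GPT-3.5", "ChatGPT-4", "Gemini", "Bing"] * 3,
    "domain":  ["clarity"] * 4 + ["validity"] * 4 + ["construction"] * 4,
    "score":   [3.1, 3.4, 3.3, 2.2,
                3.0, 3.5, 3.2, 2.0,
                2.9, 3.3, 3.1, 1.9],
})

# With one observation per cell there is no interaction term to estimate,
# so the model contains only the two main effects.
model = ols("score ~ C(chatbot) + C(domain)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```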