Chatbots’ Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis
| Main Authors: | Enjy Abouzeid, Rita Wassef, Ayesha Jawwad, Patricia Harris |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | JMIR Publications, 2025-05-01 |
| Series: | JMIR Medical Education |
| Online Access: | https://mededu.jmir.org/2025/1/e69521 |
| _version_ | 1850225001202974720 |
|---|---|
| author | Enjy Abouzeid; Rita Wassef; Ayesha Jawwad; Patricia Harris |
| author_sort | Enjy Abouzeid |
| collection | DOAJ |
| description |
Abstract
Background: Programmatic assessment supports flexible learning and individual progression but challenges educators to develop frequent assessments reflecting different competencies. Continuously creating large volumes of assessment items, in a consistent format and within a comparatively restricted time, is laborious. Technological innovations, including artificial intelligence (AI), have been applied to address this challenge. A major concern is the validity of the information produced by AI tools; if not properly verified, they can produce inaccurate and therefore inappropriate assessments.
Objective: This study was designed to examine the content validity and consistency of different AI chatbots in creating single best answer (SBA) questions, a refined form of the multiple-choice question better suited to assessing higher levels of knowledge, for undergraduate medical students.
Methods: This study followed 3 steps. First, 3 researchers used a unified prompt script to generate 10 SBA questions on each of 4 chatbot platforms. Second, assessors evaluated the chatbot outputs for consistency by identifying similarities and differences between users and across chatbots; with 3 assessors and 10 learning objectives, the maximum possible score for any individual chatbot was 30. Third, 7 assessors internally moderated the questions using a rating scale developed by the research team to evaluate scientific accuracy and educational quality.
Results: In response to the prompts, all chatbots generated 10 questions each, except Bing, which failed to respond to 1 prompt. ChatGPT-4 exhibited the highest variation in question generation but did not fully satisfy the “cover test.” Gemini performed well across most evaluation criteria except item balance; it relied heavily on the vignette for answers but showed a preference for one answer option. Bing scored low in most evaluation areas but generated appropriately structured lead-in questions. SBA questions from GPT-3.5, Gemini, and ChatGPT-4 had similar Item Content Validity Index and Scale Level Content Validity Index values, while the Krippendorff alpha coefficient was low (0.016). Bing performed poorly in content clarity, overall validity, and item construction accuracy. A 2-way ANOVA without replication revealed statistically significant differences among chatbots and domains. (Illustrative sketches of the consistency scoring and statistical procedures follow this record.)
Conclusions: AI chatbots can aid the production of questions aligned with learning objectives, and individual chatbots have their own strengths and weaknesses. Nevertheless, all require expert evaluation to ensure their suitability for use. Using AI to generate SBA questions prompts us to reconsider Bloom’s taxonomy of the cognitive domain, which traditionally positions creation as the highest level of cognition. |
| format | Article |
| id | doaj-art-31b8966a2f8b4516ba59e76b4356a455 |
| institution | OA Journals |
| issn | 2369-3762 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | JMIR Publications |
| record_format | Article |
| series | JMIR Medical Education |
| doi | 10.2196/69521 |
| author_orcid | Enjy Abouzeid: 0000-0002-9431-6019; Rita Wassef: 0000-0002-8431-2428; Ayesha Jawwad: 0000-0003-1508-1395; Patricia Harris: 0000-0002-6593-8185 |
| title | Chatbots’ Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis |
| url | https://mededu.jmir.org/2025/1/e69521 |
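The consistency scoring described in the Methods (3 assessors, 10 learning objectives, maximum score of 30 per chatbot) can be illustrated with a minimal sketch. This is not the authors' instrument: the 0/1 award per learning objective and the example judgments below are assumptions made only to show how the ceiling of 30 arises.

```python
# Minimal sketch (assumption: each assessor awards 1 point per learning
# objective when a chatbot's outputs are judged consistent across users,
# 0 otherwise). The judgments below are invented placeholders.

N_ASSESSORS = 3
N_OBJECTIVES = 10
MAX_SCORE = N_ASSESSORS * N_OBJECTIVES  # 30, as stated in the Methods

# consistency_judgments[assessor][objective] -> 1 if judged consistent, else 0
consistency_judgments = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],  # assessor 1
    [1, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # assessor 2
    [1, 1, 1, 1, 0, 1, 0, 1, 1, 1],  # assessor 3
]

chatbot_score = sum(sum(row) for row in consistency_judgments)
print(f"Consistency score for this chatbot: {chatbot_score}/{MAX_SCORE}")
```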
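The Results report Item and Scale Level Content Validity Index values and a low Krippendorff alpha (0.016). A common way to compute these from expert ratings is sketched below; the rating matrix, the 4-point relevance scale, the "rating of 3 or 4 counts as relevant" rule, and the use of the third-party krippendorff package are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative sketch: I-CVI, S-CVI/Ave, and Krippendorff's alpha from a
# hypothetical item-by-assessor rating matrix (values on an assumed 1-4 scale).

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# rows = items (e.g., SBA questions), columns = assessors (e.g., 7 raters)
ratings = np.array([
    [4, 3, 4, 4, 3, 4, 4],
    [3, 4, 4, 2, 4, 3, 4],
    [2, 2, 3, 2, 1, 2, 3],
    [4, 4, 4, 4, 4, 3, 4],
])

# I-CVI: proportion of assessors rating an item 3 or 4 ("relevant")
i_cvi = (ratings >= 3).mean(axis=1)

# S-CVI/Ave: mean of the item-level indices across the scale
s_cvi_ave = i_cvi.mean()

# Krippendorff's alpha expects raters as rows and items as columns
alpha = krippendorff.alpha(reliability_data=ratings.T,
                           level_of_measurement="ordinal")

print("I-CVI per item:", np.round(i_cvi, 2))
print("S-CVI/Ave:", round(s_cvi_ave, 2))
print("Krippendorff alpha:", round(alpha, 3))
```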
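The Results also mention a 2-way ANOVA without replication across chatbots and domains. The sketch below shows how such a design (one observation per chatbot-by-domain cell, main effects only, no interaction term) could be fitted with statsmodels; the domain names and scores are invented placeholders, not the study's data.

```python
# Illustrative sketch: additive two-way ANOVA without replication.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "chatbot": ["GPT-3.5", "ChatGPT-4", "Gemini", "Bing"] * 3,
    "domain":  ["clarity"] * 4 + ["validity"] * 4 + ["construction"] * 4,
    "score":   [3.1, 3.4, 3.3, 2.2,
                3.0, 3.5, 3.2, 2.0,
                2.9, 3.3, 3.1, 1.9],
})

# With one observation per cell there is no interaction term to estimate,
# so the model contains only the two main effects.
model = ols("score ~ C(chatbot) + C(domain)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```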