AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination

Abstract

Background: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) such as ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes examinations.

Objective: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared with human-created MCQs in a high-stakes medical licensing examination.

Methods: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM), organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs, one AI-generated and one human-generated. Expert reviewers assessed the MCQs for factual correctness, relevance, difficulty, alignment with Bloom's taxonomy (remember, understand, apply, analyse), and item-writing flaws. Psychometric analyses included difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.

Results: Among 24 participants, AI-generated MCQs were easier (mean difficulty index 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed discrimination indices similar to those of human MCQs (mean 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement between the two sets was moderate (ICC = 0.62, p = 0.01, 95% CI 0.12–0.84). Expert review identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in the AI-generated MCQs. AI questions primarily tested lower-order cognitive skills, whereas human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI markedly reduced the time spent on question generation (24.5 vs. 96 person-hours).

Conclusion: ChatGPT-4o shows potential for efficiently generating MCQs but lacks the depth needed for complex assessments; human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes examinations, offering a scalable model for medical education that balances time efficiency with content quality.
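A brief reader's note on the statistics named in the Methods and Results: the expressions below are the standard item-analysis formulas, given only as a reference and not reproduced from the paper; the symbols (c_i, N, p_i, q_i, D_i, k, σ_X²) are introduced here purely for illustration.

\[
p_i = \frac{c_i}{N}, \qquad
D_i = p_i^{U} - p_i^{L}, \qquad
\mathrm{KR\text{-}20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i\, q_i}{\sigma_X^{2}}\right)
\]

Here c_i is the number of the N candidates who answered item i correctly (so p_i is the difficulty index), p_i^U and p_i^L are the corresponding proportions in the highest- and lowest-scoring candidate groups (one common form of the discrimination index), k is the number of items, q_i = 1 − p_i, and σ_X² is the variance of candidates' total scores.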
Bibliographic Details
Main Authors: Alex KK Law, Jerome So, Chun Tat Lui, Yu Fai Choi, Koon Ho Cheung, Kevin Kei-ching Hung, Colin Alexander Graham
Author Affiliations: The Accident and Emergency Medicine Academic Unit (AEMAU), The Chinese University of Hong Kong (CUHK) (Law, Hung, Graham); Department of Accident & Emergency, Tseung Kwan O Hospital (So); Hong Kong College of Emergency Medicine (Lui, Choi, Cheung)
Format: Article
Language: English
Published: BMC, 2025-02-01
Series: BMC Medical Education
ISSN: 1472-6920
Subjects: Artificial intelligence; Educational measurement; Multiple choice questions; Medical education; Cognitive processes
Online Access: https://doi.org/10.1186/s12909-025-06796-6