AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination
Main Authors: | Alex KK Law, Jerome So, Chun Tat Lui, Yu Fai Choi, Koon Ho Cheung, Kevin Kei-ching Hung, Colin Alexander Graham |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2025-02-01 |
Series: | BMC Medical Education |
Subjects: | Artificial intelligence; Educational measurement; Multiple choice questions; Medical education; Cognitive processes |
Online Access: | https://doi.org/10.1186/s12909-025-06796-6 |
_version_ | 1823862009331974144 |
---|---|
author | Alex KK Law; Jerome So; Chun Tat Lui; Yu Fai Choi; Koon Ho Cheung; Kevin Kei-ching Hung; Colin Alexander Graham |
author_sort | Alex KK Law |
collection | DOAJ |
description | Abstract
Background: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) like ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams.
Objective: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam.
Methods: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs: one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom's taxonomy (remember, understand, apply and analyse), and item-writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.
Results: Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed discrimination indices similar to human MCQs (mean = 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12–0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours).
Conclusion: ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality. |
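The psychometric terms cited in the abstract (item difficulty index, discrimination index, KR-20 reliability) are standard item-analysis statistics. The sketch below is an illustrative aid only, not code or data from the article: it computes these three indices from a simulated 0/1 response matrix, and the 27% upper/lower split used for discrimination is an assumption of this example rather than the study's stated method.

```python
import numpy as np

def item_difficulty(responses: np.ndarray) -> np.ndarray:
    """Difficulty index p: proportion of candidates answering each item correctly."""
    return responses.mean(axis=0)

def item_discrimination(responses: np.ndarray, top_frac: float = 0.27) -> np.ndarray:
    """Upper-lower discrimination index: p(upper group) - p(lower group) per item.
    The 27% group size is a common convention assumed here, not taken from the paper."""
    totals = responses.sum(axis=1)                 # total score per candidate
    order = np.argsort(totals)                     # candidates sorted low -> high
    k = max(1, int(round(top_frac * responses.shape[0])))
    lower, upper = responses[order[:k]], responses[order[-k:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 reliability for dichotomously scored items."""
    k = responses.shape[1]                         # number of items
    p = responses.mean(axis=0)                     # per-item difficulty
    q = 1.0 - p
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of candidates' total scores
    return (k / (k - 1)) * (1.0 - p.dot(q) / total_var)

# Simulated example only: 24 candidates x 100 items of 0/1 scored responses.
rng = np.random.default_rng(0)
responses = (rng.random((24, 100)) < 0.75).astype(int)
print(item_difficulty(responses).mean(), item_discrimination(responses).mean(), kr20(responses))
```

On a real administration, `responses` would be the scored candidate-by-item matrix for each 100-item paper; the quantities computed here correspond to the kinds of difficulty, discrimination, and KR-20 figures summarised in the abstract.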
format | Article |
id | doaj-art-de53e05697e64cf089061ab76fe225c3 |
institution | Kabale University |
issn | 1472-6920 |
language | English |
publishDate | 2025-02-01 |
publisher | BMC |
record_format | Article |
series | BMC Medical Education |
spelling | doaj-art-de53e05697e64cf089061ab76fe225c3; 2025-02-09T12:42:27Z; eng; BMC; BMC Medical Education; 1472-6920; 2025-02-01; https://doi.org/10.1186/s12909-025-06796-6; AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination; Alex KK Law (The Accident and Emergency Medicine Academic Unit (AEMAU), The Chinese University of Hong Kong (CUHK)); Jerome So (Department of Accident & Emergency, Tseung Kwan O Hospital); Chun Tat Lui (Hong Kong College of Emergency Medicine); Yu Fai Choi (Hong Kong College of Emergency Medicine); Koon Ho Cheung (Hong Kong College of Emergency Medicine); Kevin Kei-ching Hung (The Accident and Emergency Medicine Academic Unit (AEMAU), The Chinese University of Hong Kong (CUHK)); Colin Alexander Graham (The Accident and Emergency Medicine Academic Unit (AEMAU), The Chinese University of Hong Kong (CUHK)); Artificial intelligence; Educational measurement; Multiple choice questions; Medical education; Cognitive processes |
title | AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination |
topic | Artificial intelligence; Educational measurement; Multiple choice questions; Medical education; Cognitive processes |
url | https://doi.org/10.1186/s12909-025-06796-6 |