Evaluating AI-generated examination papers in periodontology: a comparative study with human-designed counterparts

Bibliographic Details
Main Authors: Xiang Ma, Wei Pan, Xiao-ning Yu
Author Affiliations: The Affiliated Yantai Stomatological Hospital of Binzhou Medical University (Xiang Ma, Wei Pan); Shandong Technology and Business University (Xiao-ning Yu)
Format: Article
Language: English
Published: BMC, 2025-07-01
Series: BMC Medical Education
ISSN: 1472-6920
Subjects: Artificial intelligence; Medical education; Assessment; Periodontology; Automated item generation
Online Access: https://doi.org/10.1186/s12909-025-07706-6
Abstract

Objective: This study systematically evaluates the performance of artificial intelligence (AI)-generated examinations in periodontology education, comparing their quality, student outcomes, and practical applications with those of human-designed examinations.

Methods: A randomized controlled trial was conducted with 126 undergraduate dental students, who were divided into AI (n = 63) and human (n = 63) test groups. The AI-generated examination was developed using GPT-4, while the human examination was derived from the 2024 institutional final exam. Both assessments covered identical content from Periodontology (5th Edition) and comprised 90 multiple-choice questions (MCQs) across five formats: A1, single-sentence best choice; A2, case-summary best choice; A3, case-group best choice; A4, case-chain best choice; and X, multiple correct options. Psychometric properties (reliability, validity, difficulty, discrimination) and student feedback were analyzed using split-half reliability, content coverage analysis, factor analysis, and 5-point Likert scales.

Results: The AI examination demonstrated superior content coverage (81.3% vs. 72.4%) and significantly higher total scores (79.34 ± 6.93 vs. 73.17 ± 9.57, p = 0.027). However, it showed significantly lower discrimination indices overall (0.35 vs. 0.49, p = 0.004). Both examinations exhibited adequate split-half reliability (AI = 0.81, human = 0.84) and comparable difficulty distributions (AI: easy 40.0%, moderate 46.7%, difficult 13.3%; human: easy 30.0%, moderate 50.0%, difficult 20.0%; p = 0.274). Student feedback revealed significantly lower ratings for the AI test in perceived difficulty appropriateness (3.53 ± 1.03 vs. 4.19 ± 0.76, p < 0.001), knowledge coverage (3.67 ± 0.89 vs. 4.19 ± 0.72, p < 0.001), and learning inspiration (3.79 ± 0.90 vs. 4.25 ± 0.67, p = 0.001).

Conclusion: While AI-generated examinations improve content breadth and efficiency, their limited clinical contextualization and discrimination constrain their use in high-stakes applications. A hybrid “AI-human collaborative generation” framework, integrating medical knowledge graphs for contextual optimization, is proposed to balance automation with assessment precision. This study provides empirical evidence for the role of AI in enhancing dental education assessment systems.
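For readers unfamiliar with the classical item-analysis statistics named in the abstract, the following Python sketch illustrates how item difficulty, discrimination indices, and split-half reliability are conventionally computed. The top/bottom 27% grouping and the Spearman-Brown correction are standard textbook conventions assumed here, not details reported by the authors, and the data are synthetic.

```python
# Minimal sketch of classical item-analysis statistics (difficulty,
# discrimination, split-half reliability). Grouping threshold and the
# Spearman-Brown correction are assumed conventions; data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
# scores: students x items, 1 = correct, 0 = incorrect (synthetic example:
# 63 students, 90 MCQs, mirroring the group size and test length above)
scores = (rng.random((63, 90)) < 0.7).astype(int)

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """Difficulty index P: proportion of students answering each item correctly."""
    return scores.mean(axis=0)

def item_discrimination(scores: np.ndarray, frac: float = 0.27) -> np.ndarray:
    """Discrimination index D = P(upper group) - P(lower group),
    using the conventional top/bottom 27% split by total score."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)          # students sorted by total score
    k = max(1, int(round(frac * scores.shape[0])))
    lower, upper = scores[order[:k]], scores[order[-k:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

def split_half_reliability(scores: np.ndarray) -> float:
    """Correlate odd- and even-item half scores, then apply the
    Spearman-Brown prophecy formula to estimate full-test reliability."""
    odd, even = scores[:, 0::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

print("mean difficulty:", item_difficulty(scores).mean().round(3))
print("mean discrimination:", item_discrimination(scores).mean().round(3))
print("split-half reliability:", round(split_half_reliability(scores), 3))
```

Under these conventions, the paper’s headline comparison (mean discrimination 0.35 vs. 0.49; split-half reliability 0.81 vs. 0.84) corresponds to running such an analysis separately on the AI and human answer matrices.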