Evaluating AI-generated examination papers in periodontology: a comparative study with human-designed counterparts

Bibliographic Details
Main Authors: Xiang Ma, Wei Pan, Xiao-ning Yu
Author Affiliations: The Affiliated Yantai Stomatological Hospital of Binzhou Medical University (Xiang Ma, Wei Pan); Shandong Technology and Business University (Xiao-ning Yu)
Format: Article
Language: English
Published: BMC, 2025-07-01
Series: BMC Medical Education
ISSN: 1472-6920
Subjects: Artificial intelligence; Medical education; Assessment; Periodontology; Automated item generation
Online Access: https://doi.org/10.1186/s12909-025-07706-6
Abstract

Objective: This study systematically evaluates the performance of artificial intelligence (AI)-generated examinations in periodontology education, comparing their quality, student outcomes, and practical applications with those of human-designed examinations.

Methods: A randomized controlled trial was conducted with 126 undergraduate dental students, who were divided into AI (n = 63) and human (n = 63) test groups. The AI-generated examination was developed using GPT-4, while the human examination was derived from the 2024 institutional final exam. Both assessments covered identical content from Periodontology (5th Edition) and comprised 90 multiple-choice questions (MCQs) across five formats: A1, single-sentence best choice; A2, case-summary best choice; A3, case-group best choice; A4, case-chain best choice; and X, multiple correct options. Psychometric properties (reliability, validity, difficulty, discrimination) and student feedback were analyzed using split-half reliability, content coverage analysis, factor analysis, and 5-point Likert scales.

Results: The AI examination demonstrated superior content coverage (81.3% vs. 72.4%) and significantly higher total scores (79.34 ± 6.93 vs. 73.17 ± 9.57, p = 0.027). However, it showed significantly lower discrimination indices overall (0.35 vs. 0.49, p = 0.004). Both examinations exhibited adequate split-half reliability (AI = 0.81, human = 0.84) and comparable difficulty distributions (AI: easy 40.0%, moderate 46.7%, difficult 13.3%; human: easy 30.0%, moderate 50.0%, difficult 20.0%; p = 0.274). Student feedback revealed significantly lower ratings for the AI test in perceived difficulty appropriateness (3.53 ± 1.03 vs. 4.19 ± 0.76, p < 0.001), knowledge coverage (3.67 ± 0.89 vs. 4.19 ± 0.72, p < 0.001), and learning inspiration (3.79 ± 0.90 vs. 4.25 ± 0.67, p = 0.001).

Conclusion: While AI-generated examinations improve content breadth and efficiency, their limited clinical contextualization and discrimination constrain their use in high-stakes applications. A hybrid “AI-human collaborative generation” framework, integrating medical knowledge graphs for contextual optimization, is proposed to balance automation with assessment precision. This study provides empirical evidence for the role of AI in enhancing dental education assessment systems.
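For readers unfamiliar with the classical item-analysis statistics named in the abstract, the following Python sketch illustrates how item difficulty, discrimination indices, and split-half reliability are conventionally computed. The top/bottom 27% grouping and the Spearman-Brown correction are standard textbook conventions assumed here, not details reported by the authors, and the data are synthetic.

```python
# Minimal sketch of classical item-analysis statistics (difficulty,
# discrimination, split-half reliability). Grouping threshold and the
# Spearman-Brown correction are assumed conventions; data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
# scores: students x items, 1 = correct, 0 = incorrect (synthetic example:
# 63 students, 90 MCQs, mirroring the group size and test length above)
scores = (rng.random((63, 90)) < 0.7).astype(int)

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """Difficulty index P: proportion of students answering each item correctly."""
    return scores.mean(axis=0)

def item_discrimination(scores: np.ndarray, frac: float = 0.27) -> np.ndarray:
    """Discrimination index D = P(upper group) - P(lower group),
    using the conventional top/bottom 27% split by total score."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)          # students sorted by total score
    k = max(1, int(round(frac * scores.shape[0])))
    lower, upper = scores[order[:k]], scores[order[-k:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

def split_half_reliability(scores: np.ndarray) -> float:
    """Correlate odd- and even-item half scores, then apply the
    Spearman-Brown prophecy formula to estimate full-test reliability."""
    odd, even = scores[:, 0::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

print("mean difficulty:", item_difficulty(scores).mean().round(3))
print("mean discrimination:", item_discrimination(scores).mean().round(3))
print("split-half reliability:", round(split_half_reliability(scores), 3))
```

Under these conventions, the paper’s headline comparison (mean discrimination 0.35 vs. 0.49; split-half reliability 0.81 vs. 0.84) corresponds to running such an analysis separately on the AI and human answer matrices.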