Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data

Abstract Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public universi...

Bibliographic Details
Main Authors: Felipe Emiliano Arévalo-Cordovilla, Marta Peña
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Reports
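Subjects: Academic performance; Early prediction; Ensemble model; Gradient boosting; Learning analytics; Stacking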
Online Access: https://doi.org/10.1038/s41598-025-15388-9
_version_ 1849344109659029504
author Felipe Emiliano Arévalo-Cordovilla
Marta Peña
collection DOAJ
description Abstract Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public university in Ecuador, addressed these gaps. The objective was to develop a robust predictive framework by integrating Moodle interactions, academic history, and demographic data, using SMOTE for class balancing. The methodology involved a comparative evaluation of seven base learners, including traditional algorithms, Random Forest, and gradient boosting ensembles (XGBoost, LightGBM), and a final stacking model, all validated using 5-fold stratified cross-validation. While the LightGBM model emerged as the best-performing base model (Area Under the Curve (AUC) = 0.953, F1 = 0.950), the stacking ensemble (AUC = 0.835) did not offer a significant performance improvement and showed considerable instability. SHAP analysis confirmed that early grades were the most influential predictors across top models. The final model demonstrated strong fairness across gender, ethnicity, and socioeconomic status (consistency = 0.907). These findings enable institutions to identify at-risk students using state-of-the-art, interpretable, and fair models, advancing learning analytics by validating key success predictors against contemporary benchmarks.
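The class balancing and validation steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the abstract does not say whether SMOTE was applied inside each training fold (assumed here, to keep synthetic points out of the held-out data), the helper names `stratified_folds` and `smote` are hypothetical, and the two-feature toy data is made up purely for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def smote(minority, n_new, k_neighbors=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k_neighbors]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy, made-up data: 80 majority-class and 20 minority-class samples.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(80)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(20)]
y = [0] * 80 + [1] * 20

folds = stratified_folds(y, k=5)
balanced_train_sets = []
for held_out in folds:
    test_idx = set(held_out)
    train_idx = [i for i in range(len(y)) if i not in test_idx]
    majority = [X[i] for i in train_idx if y[i] == 0]
    minority = [X[i] for i in train_idx if y[i] == 1]
    # Oversample only the training portion; the held-out fold stays untouched.
    minority += smote(minority, len(majority) - len(minority))
    balanced_train_sets.append((majority, minority))
```

In a fuller pipeline, each candidate model would be fitted on one balanced training set and scored (for example by AUC and F1) on the corresponding untouched held-out fold, with the five scores averaged per model.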
format Article
id doaj-art-0e016835959c42f7b6ef163bdf202cb2
institution Kabale University
issn 2045-2322
language English
publishDate 2025-08-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-0e016835959c42f7b6ef163bdf202cb2 2025-08-20T03:42:45Z eng Nature Portfolio Scientific Reports 2045-2322 2025-08-01 vol. 15, no. 1, pp. 1-14 10.1038/s41598-025-15388-9
Felipe Emiliano Arévalo-Cordovilla (Faculty of Science and Engineering, Universidad Estatal de Milagro)
Marta Peña (Department of Mathematics and IOC Research Institute, Universitat Politècnica de Catalunya - BarcelonaTech)
title Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data
topic Academic performance
Early prediction
Ensemble model
Gradient boosting
Learning analytics
Stacking
url https://doi.org/10.1038/s41598-025-15388-9