Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data
Abstract: Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public universi...
| Main Authors: | Felipe Emiliano Arévalo-Cordovilla, Marta Peña |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-08-01 |
| Series: | Scientific Reports |
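| Subjects: | Academic performance, Early prediction, Ensemble model, Gradient boosting, Learning analytics, Stacking |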
| Online Access: | https://doi.org/10.1038/s41598-025-15388-9 |
| _version_ | 1849344109659029504 |
|---|---|
| author | Felipe Emiliano Arévalo-Cordovilla Marta Peña |
| author_facet | Felipe Emiliano Arévalo-Cordovilla Marta Peña |
| author_sort | Felipe Emiliano Arévalo-Cordovilla |
| collection | DOAJ |
| description | Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public university in Ecuador, addressed these gaps. The objective was to develop a robust predictive framework by integrating Moodle interactions, academic history, and demographic data, with SMOTE used for class balancing. The methodology involved a comparative evaluation of seven base learners, including traditional algorithms, Random Forest, and gradient boosting ensembles (XGBoost, LightGBM), and a final stacking model, all validated using 5-fold stratified cross-validation. While the LightGBM model emerged as the best-performing base model (Area Under the Curve (AUC) = 0.953, F1 = 0.950), the stacking ensemble (AUC = 0.835) did not offer a significant performance improvement and showed considerable instability. SHAP analysis confirmed that early grades were the most influential predictors across top models. The final model demonstrated strong fairness across gender, ethnicity, and socioeconomic status (consistency = 0.907). These findings enable institutions to identify at-risk students using state-of-the-art, interpretable, and fair models, advancing learning analytics by validating key success predictors against contemporary benchmarks. |
| format | Article |
| id | doaj-art-0e016835959c42f7b6ef163bdf202cb2 |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-0e016835959c42f7b6ef163bdf202cb2; 2025-08-20T03:42:45Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-08-01; 151114; 10.1038/s41598-025-15388-9; Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data; Felipe Emiliano Arévalo-Cordovilla (0); Marta Peña (1); Faculty of Science and Engineering, Universidad Estatal de Milagro; Department of Mathematics and IOC Research Institute, Universitat Politècnica de Catalunya—BarcelonaTech; Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public university in Ecuador, addressed these gaps. The objective was to develop a robust predictive framework by integrating Moodle interactions, academic history, and demographic data, with SMOTE used for class balancing. The methodology involved a comparative evaluation of seven base learners, including traditional algorithms, Random Forest, and gradient boosting ensembles (XGBoost, LightGBM), and a final stacking model, all validated using 5-fold stratified cross-validation. While the LightGBM model emerged as the best-performing base model (Area Under the Curve (AUC) = 0.953, F1 = 0.950), the stacking ensemble (AUC = 0.835) did not offer a significant performance improvement and showed considerable instability. SHAP analysis confirmed that early grades were the most influential predictors across top models. The final model demonstrated strong fairness across gender, ethnicity, and socioeconomic status (consistency = 0.907). These findings enable institutions to identify at-risk students using state-of-the-art, interpretable, and fair models, advancing learning analytics by validating key success predictors against contemporary benchmarks.; https://doi.org/10.1038/s41598-025-15388-9; Academic performance; Early prediction; Ensemble model; Gradient boosting; Learning analytics; Stacking |
| spellingShingle | Felipe Emiliano Arévalo-Cordovilla Marta Peña Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data Scientific Reports Academic performance Early prediction Ensemble model Gradient boosting Learning analytics Stacking |
| title | Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data |
| title_full | Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data |
| title_fullStr | Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data |
| title_full_unstemmed | Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data |
| title_short | Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data |
| title_sort | evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data |
| topic | Academic performance Early prediction Ensemble model Gradient boosting Learning analytics Stacking |
| url | https://doi.org/10.1038/s41598-025-15388-9 |
| work_keys_str_mv | AT felipeemilianoarevalocordovilla evaluatingensemblemodelsforfairandinterpretablepredictioninhighereducationusingmultimodaldata AT martapena evaluatingensemblemodelsforfairandinterpretablepredictioninhighereducationusingmultimodaldata |
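
The abstract in this record reports model comparison under 5-fold stratified cross-validation with AUC as a headline metric. As a rough illustration of what those two terms mean, here is a minimal pure-Python sketch. It is not the authors' code: the function names `stratified_kfold` and `roc_auc`, the synthetic labels, and the `early_grade` scores are all hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds keep class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's indices round-robin so every fold gets a similar share.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Synthetic, imbalanced toy data (assumption): 80 passing and 20 failing students,
    # scored by a single made-up "early grade" feature instead of a trained model.
    rng = random.Random(42)
    labels = [1] * 80 + [0] * 20
    early_grade = [rng.gauss(7.0 if y == 1 else 5.0, 1.0) for y in labels]
    for fold, (_, test) in enumerate(stratified_kfold(labels, k=5), start=1):
        auc = roc_auc([labels[i] for i in test], [early_grade[i] for i in test])
        print(f"fold {fold}: {len(test)} test rows, AUC = {auc:.3f}")
```

In practice a study like this would more likely rely on library implementations (for example scikit-learn's stratified splitter and AUC scorer), and any resampling such as SMOTE would normally be fitted on the training folds only; the sketch above only shows the splitting and scoring logic.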