Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study

The pervasive challenge of missing data in scientific research forces a critical trade-off: discarding incomplete observations risks significant information loss, while conventional imputation methods struggle to maintain accuracy in high-dimensional settings. Although approaches like multipl...

Bibliographic Details
Main Authors: Oyebayo Ridwan Olaniran, Ali Rashash R. Alzahrani
Format: Article
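Language: English
Published: MDPI AG 2025-03-01
Series: Mathematics
Subjects: multiple imputation; missing data; Bayesian random forest; high-dimensional analysis; random forest; simulation study
Online Access: https://www.mdpi.com/2227-7390/13/6/956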
author Oyebayo Ridwan Olaniran
Ali Rashash R. Alzahrani
collection DOAJ
description The pervasive challenge of missing data in scientific research forces a critical trade-off: discarding incomplete observations risks significant information loss, while conventional imputation methods struggle to maintain accuracy in high-dimensional settings. Although approaches like multiple imputation (MI) and random forest (RF) proximity-based imputation offer improvements over naive deletion, they exhibit limitations in complex missing data scenarios or sparse high-dimensional settings. To address these gaps, we propose a novel integration of Multiple Imputation by Chained Equations (MICE) with Bayesian Random Forest (BRF), leveraging MICE’s iterative flexibility and BRF’s probabilistic robustness to enhance imputation accuracy and downstream predictive performance. Our hybrid framework, BRF-MICE, combines the efficiency of MICE’s chained equations with BRF’s ability to quantify uncertainty through Bayesian tree ensembles, providing stable parameter estimates even under extreme missingness. We empirically validate this approach using synthetic datasets with controlled missingness mechanisms (MCAR, MAR, MNAR) and dimensionality, contrasting it against established methods, including RF and Bayesian Additive Regression Trees (BART). The results demonstrate that BRF-MICE achieves superior performance in classification and regression tasks, with a 15–20% lower error than RF and BART under varying missingness conditions, while maintaining computational scalability. The method’s iterative Bayesian updates effectively propagate imputation uncertainty, reducing overconfidence in high-dimensional predictions, a key weakness of frequentist alternatives.
format Article
id doaj-art-3aec15bb0aa14b3fa13b2186cd6fd122
institution Kabale University
issn 2227-7390
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj-art-3aec15bb0aa14b3fa13b2186cd6fd122 2025-08-20T03:43:16Z eng
MDPI AG, Mathematics, ISSN 2227-7390, 2025-03-01, vol. 13, no. 6, article 956
DOI 10.3390/math13060956
Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study
Oyebayo Ridwan Olaniran (Department of Statistics, Faculty of Physical Sciences, University of Ilorin, Ilorin 1515, Nigeria)
Ali Rashash R. Alzahrani (Mathematics Department, Faculty of Sciences, Umm Al-Qura University, Makkah 24382, Saudi Arabia)
https://www.mdpi.com/2227-7390/13/6/956
multiple imputation; missing data; Bayesian random forest; high-dimensional analysis; random forest; simulation study
title Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study
topic multiple imputation
missing data
Bayesian random forest
high-dimensional analysis
random forest
simulation study
url https://www.mdpi.com/2227-7390/13/6/956