Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study

The pervasive challenge of missing data in scientific research forces a critical trade-off: discarding incomplete observations, which risks significant information loss, while conventional imputation methods struggle to maintain accuracy in high-dimensional settings. Although approaches like multipl...

Full description

Saved in:
Bibliographic Details
Main Authors: Oyebayo Ridwan Olaniran, Ali Rashash R. Alzahrani
Format: Article
Language:English
Published: MDPI AG 2025-03-01
Series:Mathematics
Subjects:
Online Access:https://www.mdpi.com/2227-7390/13/6/956
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The pervasive challenge of missing data in scientific research forces a critical trade-off: discarding incomplete observations, which risks significant information loss, while conventional imputation methods struggle to maintain accuracy in high-dimensional settings. Although approaches like multiple imputation (MI) and random forest (RF) proximity-based imputation offer improvements over naive deletion, they exhibit limitations in complex missing data scenarios or sparse high-dimensional settings. To address these gaps, we propose a novel integration of Multiple Imputation by Chained Equations (MICE) with Bayesian Random Forest (BRF), leveraging MICE’s iterative flexibility and BRF’s probabilistic robustness to enhance the imputation accuracy and downstream predictive performance. Our hybrid framework, BRF-MICE, uniquely combines the efficiency of MICE’s chained equations with BRF’s ability to quantify uncertainty through Bayesian tree ensembles, providing stable parameter estimates even under extreme missingness. We empirically validate this approach using synthetic datasets with controlled missingness mechanisms (MCAR, MAR, MNAR) and dimensionality, contrasting it against established methods, including RF and Bayesian Additive Regression Trees (BART). The results demonstrate that BRF-MICE achieves a superior performance in classification and regression tasks, with a 15–20% lower error under varying missingness conditions compared to RF and BART while maintaining computational scalability. The method’s iterative Bayesian updates effectively propagate imputation uncertainty, reducing overconfidence in high-dimensional predictions, a key weakness of frequentist alternatives.
ISSN:2227-7390