A comparison of various imputation algorithms for missing data.

Bibliographic Details
Main Authors: Jürgen Kampf, Iryna Dykun, Tienush Rassaf, Amir Abbas Mahabadi
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0319784
Description: <h4>Background</h4>Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data.<h4>Objectives</h4>We take the point of view that it has already been decided that the imputation should be carried out using multiple imputation by chained equations, and that the only decision left is the choice of a subroutine for the one-dimensional imputations. The subroutines compared are predictive mean matching, weighted predictive mean matching, sampling, classification and regression trees, and random forests.<h4>Methods</h4>We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances, and coefficients of linear regression models, logistic regression models and Cox regression models. As real data we use survival times after the diagnosis of obstructive coronary artery disease, with systolic blood pressure, LDL, diabetes, smoking behavior and family history of premature heart disease as the variables for which values have to be imputed. While we are mainly interested in statistical properties such as biases, mean squared errors and coverage probabilities of confidence intervals, we also consider computation time.<h4>Results</h4>Weighted predictive mean matching had to be excluded from the statistical comparison because of its enormous computation time. Among the remaining algorithms, predictive mean matching performed best in most of the situations we tested.<h4>Novelty</h4>This is by far the largest comparison study of subroutines for multiple imputation by chained equations performed to date.
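As an illustration of what a one-dimensional imputation subroutine does, the following is a minimal sketch of predictive mean matching for a single incomplete variable. It is not the implementation evaluated in the article: the function name `pmm_impute`, the donor count `k`, and the synthetic data are assumptions made for this example, and the sketch omits the random draw of regression parameters that full multiple-imputation implementations typically add.

```python
import numpy as np


def pmm_impute(X, y, k=5, rng=None):
    """Fill NaNs in y by simplified predictive mean matching (hypothetical helper).

    X: predictors, shape (n,) or (n, p), fully observed.
    y: target with NaN marking missing entries, shape (n,).
    k: number of candidate donors per missing entry (assumed default).
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    if not missing.any():
        return y.copy()

    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])

    # Ordinary least squares fitted on the observed rows only.
    beta, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    pred = A @ beta

    observed_pred = pred[~missing]
    observed_y = y[~missing]
    k = min(k, observed_y.size)

    filled = y.copy()
    for i in np.flatnonzero(missing):
        # The k observed rows whose predicted means are closest to row i.
        donors = np.argsort(np.abs(observed_pred - pred[i]))[:k]
        # Impute with the *observed* value of one randomly chosen donor.
        filled[i] = observed_y[rng.choice(donors)]
    return filled


# Synthetic example (assumed data, not from the article).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_true = 2.0 * x + rng.normal(scale=0.5, size=200)
y_missing = y_true.copy()
y_missing[rng.random(200) < 0.3] = np.nan
y_filled = pmm_impute(x, y_missing, k=5, rng=rng)
```

Because every imputed value is copied from an observed donor, the imputations always stay within the range of values actually seen in the data. In a chained-equations scheme, a step like this would be applied to each incomplete variable in turn and repeated over several iterations.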
ISSN: 1932-6203