Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data.

Bibliographic Details
Main Authors: Charlotte S C Woolley, Ian G Handel, B Mark Bronsvoort, Jeffrey J Schoenebeck, Dylan N Clements
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0228154&type=printable
Collection: DOAJ
Description: All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard, that is, returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42%, respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them, and removes duplications intelligently. This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
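The description compares cleaning methods by sensitivity and specificity against simulated errors. As a rough illustration only, the sketch below computes both measures from two boolean lists using the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). The function name, argument names, and the eight example labels are hypothetical and are not taken from the article or its data.

```python
def sensitivity_specificity(is_error, flagged):
    """Return (sensitivity, specificity) for one cleaning method.

    is_error: truth labels, True where a measurement is a simulated error.
    flagged:  method output, True where the method marked the measurement.
    Both names are illustrative assumptions, not the authors' code.
    """
    if len(is_error) != len(flagged):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(is_error, flagged))
    tp = sum(e and f for e, f in pairs)          # errors correctly flagged
    fn = sum(e and not f for e, f in pairs)      # errors missed
    tn = sum(not e and not f for e, f in pairs)  # valid values left alone
    fp = sum(not e and f for e, f in pairs)      # valid values wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels for eight weight measurements (not data from the paper).
truth = [False, False, True, False, True, False, False, True]
flags = [False, False, True, False, False, False, True, True]
sens, spec = sensitivity_specificity(truth, flags)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these made-up labels, two of three errors are caught and four of five valid measurements are left untouched, so the script prints `sensitivity=0.67, specificity=0.80`.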
ID: doaj-art-90d85b6332df42bf8d145fe244552fa8
Institution: DOAJ
ISSN: 1932-6203