K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data

Bibliographic Details
Main Authors: Ayesha Sania, Nicolò Pini, Morgan E. Nelson, Michael M. Myers, Lauren C. Shuffrey, Maristella Lucchini, Amy J. Elliott, Hein J. Odendaal, William P. Fifer
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
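Series: Advances in Drug and Alcohol Research
Subjects: k nearest neighbor; k-NN; machine learning; data missingness; data imputation; prenatal alcohol data
Online Access: https://www.frontierspartnerships.org/articles/10.3389/adar.2024.13449/full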
author Ayesha Sania
Nicolò Pini
Morgan E. Nelson
Michael M. Myers
Lauren C. Shuffrey
Maristella Lucchini
Amy J. Elliott
Hein J. Odendaal
William P. Fifer
author_sort Ayesha Sania
collection DOAJ
description Aims: The objective of this study is to illustrate the application of a machine learning algorithm, K Nearest Neighbor (k-NN), to impute missing alcohol data in a prospective study among pregnant women. Methods: We used data from the Safe Passage study (n = 11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method, which generated a variable amount of missing data per participant. Of the 3.2 million person-days of observation, data were missing for 0.36 million (11.4%). Using k-NN, imputed values were weighted by distance and matched on day of the week. Since participants with no missing days were not comparable to those with missing data, segments of non-missing data from all participants were included as a reference. Validation was done after randomly deleting data for 5–15 consecutive days from the first trimester. Results: We found that data from 5 nearest neighbors (i.e., K = 5) and segments of 55 days provided imputed values with the lowest imputation error. After deleting data segments from the first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within ±1 drink/day of the actual value. Imputation accuracy varied by study site because of differences in the magnitude of drinking and the proportion of missing data. Conclusion: k-NN can be used to impute missing data from longitudinal studies of alcohol during pregnancy with high accuracy.
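The imputation and validation steps summarized in the description can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`knn_impute_segment`, `mask_random_run`), the use of Euclidean distance on observed days, inverse-distance weighting, and the rule of matching on the weekday a segment starts on are all assumptions made for illustration. Only K = 5 and the 5–15 day deletion window come from the record.

```python
import numpy as np


def knn_impute_segment(target, reference, target_dow, reference_dow, k=5, eps=1e-9):
    """Fill NaNs in `target` from the k nearest complete reference segments.

    target        : 1-D array of daily drinks, NaN where missing
    reference     : 2-D array (n_segments, segment_length), no NaNs
    target_dow    : weekday (0-6) on which the target segment starts
    reference_dow : 1-D array of start weekdays, one per reference segment
    """
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)
    reference_dow = np.asarray(reference_dow)

    missing = np.isnan(target)
    if not missing.any():
        return target.copy()
    if missing.all():
        raise ValueError("target has no observed days to match on")

    # Assumption: day-of-week matching is approximated by keeping only
    # reference segments that start on the same weekday as the target.
    candidates = reference[reference_dow == target_dow]
    if len(candidates) == 0:
        raise ValueError("no reference segment starts on the target's weekday")

    # Euclidean distance computed on the observed days only.
    diffs = candidates[:, ~missing] - target[~missing]
    dist = np.sqrt((diffs ** 2).sum(axis=1))

    # Keep the k closest segments and weight each by inverse distance.
    nearest = np.argsort(dist, kind="stable")[:k]
    weights = 1.0 / (dist[nearest] + eps)
    weights /= weights.sum()

    filled = target.copy()
    filled[missing] = weights @ candidates[nearest][:, missing]
    return filled


def mask_random_run(segment, rng, min_len=5, max_len=15):
    """Delete one run of min_len..max_len consecutive days for validation.

    Assumes the segment is at least max_len days long.
    """
    segment = np.asarray(segment, dtype=float)
    run = int(rng.integers(min_len, max_len + 1))  # upper bound is exclusive
    start = int(rng.integers(0, len(segment) - run + 1))
    masked = segment.copy()
    masked[start:start + run] = np.nan
    return masked, slice(start, start + run)
```

For validation in the spirit of the description, one could call `mask_random_run` on a complete first-trimester segment, pass the masked copy to `knn_impute_segment`, and compare the imputed values in the returned slice with the values that were deleted.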
format Article
id doaj-art-f247a30099ca493c83c81a9f8d8ce073
institution Kabale University
issn 2674-0001
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Advances in Drug and Alcohol Research
spelling doaj-art-f247a30099ca493c83c81a9f8d8ce073 2025-01-28T12:09:20Z eng Frontiers Media S.A. Advances in Drug and Alcohol Research 2674-0001 2025-01-01 4 10.3389/adar.2024.13449 13449
K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data
Ayesha Sania: Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States; Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States
Nicolò Pini: Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States; Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States
Morgan E. Nelson: Research Triangle Institute, Research Triangle Park, Durham, NC, United States
Michael M. Myers: Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States; Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States
Lauren C. Shuffrey: Department of Child and Adolescent Psychiatry, NYU Grossman School of Medicine, New York, NY, United States
Maristella Lucchini: Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States; Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States
Amy J. Elliott: Center for Pediatric and Community Research, Avera Health, Sioux Falls, SD, United States; Department of Pediatrics, University of South Dakota School of Medicine, Sioux Falls, SD, United States
Hein J. Odendaal: Department of Obstetrics and Gynecology, Faculty of Medicine and Health Science, Stellenbosch University, Cape Town, Western Cape, South Africa
William P. Fifer: Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States; Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States
title K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data
topic k nearest neighbor
k-NN
machine learning
data missingness
data imputation
prenatal alcohol data
url https://www.frontierspartnerships.org/articles/10.3389/adar.2024.13449/full