K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data
Aims: The objective of this study is to illustrate the application of a machine learning algorithm, k-nearest neighbor (k-NN), to impute missing alcohol data in a prospective study among pregnant women. Methods: We used data from the Safe Passage study (n = 11,083). Daily alcohol consumption for the last...
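Main Authors: | Ayesha Sania, Nicolò Pini, Morgan E. Nelson, Michael M. Myers, Lauren C. Shuffrey, Maristella Lucchini, Amy J. Elliott, Hein J. Odendaal, William P. Fifer |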
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-01-01 |
Series: | Advances in Drug and Alcohol Research |
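Subjects: | k nearest neighbor; k-NN; machine learning; data missingness; data imputation; prenatal alcohol data |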
Online Access: | https://www.frontierspartnerships.org/articles/10.3389/adar.2024.13449/full |
_version_ | 1832583513815121920 |
---|---|
author | Ayesha Sania; Nicolò Pini; Morgan E. Nelson; Michael M. Myers; Lauren C. Shuffrey; Maristella Lucchini; Amy J. Elliott; Hein J. Odendaal; William P. Fifer |
author_facet | Ayesha Sania; Nicolò Pini; Morgan E. Nelson; Michael M. Myers; Lauren C. Shuffrey; Maristella Lucchini; Amy J. Elliott; Hein J. Odendaal; William P. Fifer |
author_sort | Ayesha Sania |
collection | DOAJ |
description | Aims: The objective of this study is to illustrate the application of a machine learning algorithm, k-nearest neighbor (k-NN), to impute missing alcohol data in a prospective study among pregnant women. Methods: We used data from the Safe Passage study (n = 11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method, which generated a variable amount of missing data per participant. Of the 3.2 million person-days of observation, data were missing for 0.36 million (11.4%). With k-NN, imputed values were weighted by distance and matched on day of the week. Because participants with no missing days were not comparable to those with missing data, segments of non-missing data from all participants were included as a reference. Validation was done after randomly deleting data for 5–15 consecutive days from the first trimester. Results: Data from 5 nearest neighbors (i.e., K = 5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from the first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of deleted segments. For 31% of the segments, imputed data were within ±1 drink/day of the actual values. Imputation accuracy varied by study site because of differences in the magnitude of drinking and the proportion of missing data. Conclusion: k-NN can be used to impute missing data from longitudinal studies of alcohol use during pregnancy with high accuracy. |
format | Article |
id | doaj-art-f247a30099ca493c83c81a9f8d8ce073 |
institution | Kabale University |
issn | 2674-0001 |
language | English |
publishDate | 2025-01-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Advances in Drug and Alcohol Research |
spelling | doaj-art-f247a30099ca493c83c81a9f8d8ce073; 2025-01-28T12:09:20Z; eng; Frontiers Media S.A.; Advances in Drug and Alcohol Research; 2674-0001; 2025-01-01; 4; 10.3389/adar.2024.13449; 13449; K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data; Ayesha Sania (Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States / Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States); Nicolò Pini (Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States / Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States); Morgan E. Nelson (Research Triangle Institute, Research Triangle Park, Durham, NC, United States); Michael M. Myers (Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States / Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States); Lauren C. Shuffrey (Department of Child and Adolescent Psychiatry, NYU Grossman School of Medicine, New York, NY, United States); Maristella Lucchini (Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States / Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States); Amy J. Elliott (Center for Pediatric and Community Research, Avera Health, Sioux Falls, SD, United States / Department of Pediatrics, University of South Dakota School of Medicine, Sioux Falls, SD, United States); Hein J. Odendaal (Department of Obstetrics and Gynecology, Faculty of Medicine and Health Science, Stellenbosch University, Cape Town, Western Cape, South Africa); William P. Fifer (Department of Psychiatry, Columbia University Irving Medical Center, New York, NY, United States / Division of Developmental Neuroscience, New York State Psychiatric Institute, New York, NY, United States); https://www.frontierspartnerships.org/articles/10.3389/adar.2024.13449/full; k nearest neighbor; k-NN; machine learning; data missingness; data imputation; prenatal alcohol data |
spellingShingle | Ayesha Sania; Nicolò Pini; Morgan E. Nelson; Michael M. Myers; Lauren C. Shuffrey; Maristella Lucchini; Amy J. Elliott; Hein J. Odendaal; William P. Fifer; K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data; Advances in Drug and Alcohol Research; k nearest neighbor; k-NN; machine learning; data missingness; data imputation; prenatal alcohol data |
title | K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data |
title_full | K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data |
title_fullStr | K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data |
title_full_unstemmed | K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data |
title_short | K-nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data |
title_sort | k nearest neighbor algorithm for imputing missing longitudinal prenatal alcohol data |
topic | k nearest neighbor; k-NN; machine learning; data missingness; data imputation; prenatal alcohol data |
url | https://www.frontierspartnerships.org/articles/10.3389/adar.2024.13449/full |
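
The imputation and validation steps summarized in the description field can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the distance metric or weighting formula, so Euclidean distance over observed days and inverse-distance weights are assumptions here, as are the hypothetical function names `knn_impute_segment` and `mask_consecutive_days`. Day-of-week matching is approximated by assuming every reference segment starts on the same weekday as the target, so each column refers to the same weekday.

```python
import numpy as np


def knn_impute_segment(target, references, k=5, eps=1e-9):
    """Fill NaNs in `target` from the k closest complete reference segments.

    target:     1-D array of daily drinks for one participant, NaN = missing.
    references: 2-D array, one complete (no NaN) segment per row, same length
                as `target` and ASSUMED to start on the same weekday.
    """
    target = np.asarray(target, dtype=float)
    references = np.asarray(references, dtype=float)
    missing = np.isnan(target)
    if not missing.any():
        return target.copy()
    observed = ~missing
    if not observed.any():
        raise ValueError("target has no observed days to match on")

    # Euclidean distance on the observed days only (assumed metric).
    diffs = references[:, observed] - target[observed]
    dist = np.sqrt((diffs ** 2).sum(axis=1))

    # Indices of the k closest reference segments.
    k = min(k, len(references))
    nearest = np.argsort(dist, kind="stable")[:k]

    # Inverse-distance weights (assumed scheme): closer neighbors count more.
    weights = 1.0 / (dist[nearest] + eps)
    weights /= weights.sum()

    # Weighted average of the neighbors' values on the missing days.
    imputed = target.copy()
    imputed[missing] = weights @ references[nearest][:, missing]
    return imputed


def mask_consecutive_days(series, rng, min_len=5, max_len=15):
    """Blank out one random run of min_len..max_len consecutive days."""
    series = np.asarray(series, dtype=float)
    if len(series) <= max_len:
        raise ValueError("series must be longer than max_len")
    length = int(rng.integers(min_len, max_len + 1))  # upper bound exclusive
    start = int(rng.integers(0, len(series) - length + 1))
    masked = series.copy()
    masked[start:start + length] = np.nan
    return masked
```

For validation in the spirit of the abstract, one could apply `mask_consecutive_days` to a complete segment, impute it with `knn_impute_segment` (k = 5 and 55-day segments are the values the study reports as best), and compare the imputed days with the deleted originals.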