Optimizing Parkinson’s Disease Prediction: A Comparative Analysis of Data Aggregation Methods Using Multiple Voice Recordings via an Automated Artificial Intelligence Pipeline

Patient-level grouped data are prevalent in public health and medical fields, and multiple instance learning (MIL) offers a framework to address the challenges associated with this type of data structure. This study compares four data aggregation methods designed to tackle the grouped structure in c...

Bibliographic Details
Main Authors: Zhengxiao Yang, Hao Zhou, Sudesh Srivastav, Jeffrey G. Shaffer, Kuukua E. Abraham, Samuel M. Naandam, Samuel Kakraba
Format: Article
Language: English
Published: MDPI AG 2025-01-01
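Series: Data
Subjects: Parkinson’s disease (PD); machine learning (ML); artificial intelligence (AI); multiple instance learning (MIL); data aggregation; classification
Online Access: https://www.mdpi.com/2306-5729/10/1/4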
author Zhengxiao Yang
Hao Zhou
Sudesh Srivastav
Jeffrey G. Shaffer
Kuukua E. Abraham
Samuel M. Naandam
Samuel Kakraba
collection DOAJ
description Patient-level grouped data are prevalent in public health and medical fields, and multiple instance learning (MIL) offers a framework to address the challenges associated with this type of data structure. This study compares four data aggregation methods designed to tackle the grouped structure in classification tasks: post-mean, post-max, post-min, and pre-mean aggregation. We developed a customized AI pipeline that incorporates twelve machine learning algorithms along with the four aggregation methods to detect Parkinson’s disease (PD) using multiple voice recordings from individuals available in the UCI Machine Learning Repository, which includes 756 voice recordings from 188 PD patients and 64 healthy individuals. Seven performance metrics—accuracy, precision, sensitivity, specificity, F1 score, AUC, and MCC—were utilized for model evaluation. Various techniques, such as Bag Over-Sampling (BOS), cross-validation, and grid search, were implemented to enhance classification performance. Among the four aggregation methods, post-mean aggregation combined with XGBoost achieved the highest accuracy (0.880), F1 score (0.922), and MCC (0.672). Furthermore, we identified potential trends in selecting aggregation methods that are suitable for imbalanced data, particularly based on their differences in sensitivity and specificity. These findings provide meaningful implications for the further exploration of grouped imbalanced data.
format Article
id doaj-art-2ba3eedcc72047959873f236a3698182
institution Kabale University
issn 2306-5729
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Data
spelling doaj-art-2ba3eedcc72047959873f236a3698182
2025-01-24T13:28:32Z; eng; MDPI AG; Data; 2306-5729; 2025-01-01; vol. 10, no. 1, art. 4; 10.3390/data10010004
Zhengxiao Yang: Biostatistics and Data Science Graduate Program, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, 1440 Canal St., New Orleans, LA 70112, USA
Hao Zhou: Biostatistics and Data Science Graduate Program, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, 1440 Canal St., New Orleans, LA 70112, USA
Sudesh Srivastav: Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70112, USA
Jeffrey G. Shaffer: Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70112, USA
Kuukua E. Abraham: Department of Mathematics and Statistics, Minnesota State University, Mankato, MN 60001, USA
Samuel M. Naandam: Department of Mathematics, University of Cape Coast, Cape Coast 00233, Ghana
Samuel Kakraba: Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70112, USA
title Optimizing Parkinson’s Disease Prediction: A Comparative Analysis of Data Aggregation Methods Using Multiple Voice Recordings via an Automated Artificial Intelligence Pipeline
topic Parkinson’s disease (PD)
machine learning (ML)
artificial intelligence (AI)
multiple instance learning (MIL)
data aggregation
classification
url https://www.mdpi.com/2306-5729/10/1/4
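
Illustrative note: the description names four aggregation methods (post-mean, post-max, post-min, pre-mean) but this record does not define them. The sketch below is one plausible reading, not the authors' pipeline. It assumes that "post" means scoring each recording and then combining the scores within a patient, and that "pre-mean" means averaging a patient's features before scoring once. The `score` function, its weights, and the toy data are hypothetical stand-ins for a trained classifier and the UCI recordings.

```python
from collections import defaultdict
from math import exp
from statistics import mean


def score(features):
    """Hypothetical stand-in for a trained classifier: P(PD) for one feature vector."""
    # Assumption: a fixed logistic model with made-up weights, for illustration only.
    weights = [1.5, -2.0]
    bias = 0.1
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))


def group_by_patient(recordings):
    """Collect each patient's recordings into one bag of feature vectors."""
    bags = defaultdict(list)
    for patient_id, features in recordings:
        bags[patient_id].append(features)
    return bags


def post_aggregate(recordings, reducer):
    """Score every recording, then combine the scores within each patient."""
    bags = group_by_patient(recordings)
    return {pid: reducer([score(f) for f in bag]) for pid, bag in bags.items()}


def pre_mean_aggregate(recordings):
    """Average the features within each patient first, then score once per patient."""
    result = {}
    for pid, bag in group_by_patient(recordings).items():
        averaged = [mean(column) for column in zip(*bag)]
        result[pid] = score(averaged)
    return result


# Toy data (made up): (patient id, feature vector) pairs, several recordings per patient.
recordings = [
    ("p1", [0.9, 0.1]),
    ("p1", [0.4, 0.3]),
    ("p1", [0.7, 0.2]),
    ("p2", [0.1, 0.8]),
    ("p2", [0.2, 0.6]),
    ("p2", [0.3, 0.9]),
]

results = {
    "post-mean": post_aggregate(recordings, mean),
    "post-max": post_aggregate(recordings, max),
    "post-min": post_aggregate(recordings, min),
    "pre-mean": pre_mean_aggregate(recordings),
}

for name, probabilities in results.items():
    # Patient-level label at a 0.5 threshold (the threshold is also an assumption).
    labels = {pid: int(p >= 0.5) for pid, p in probabilities.items()}
    print(name, labels)
```

Under this reading, a patient's post-max score is never below the post-mean score, which is never below the post-min score. At a fixed threshold, post-max therefore flags at least as many patients as positive and post-min at most as many, which is one way such methods could trade sensitivity against specificity.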