Optimizing Parkinson’s Disease Prediction: A Comparative Analysis of Data Aggregation Methods Using Multiple Voice Recordings via an Automated Artificial Intelligence Pipeline

Patient-level grouped data are prevalent in public health and medical fields, and multiple instance learning (MIL) offers a framework to address the challenges associated with this type of data structure. This study compares four data aggregation methods designed to tackle the grouped structure in c...


Bibliographic Details
Main Authors:	Zhengxiao Yang, Hao Zhou, Sudesh Srivastav, Jeffrey G. Shaffer, Kuukua E. Abraham, Samuel M. Naandam, Samuel Kakraba
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Data
Subjects:	Parkinson’s disease (PD); machine learning (ML); artificial intelligence (AI); multiple instance learning (MIL); data aggregation; classification
Online Access:	https://www.mdpi.com/2306-5729/10/1/4

_version_	1832588681477619712
author	Zhengxiao Yang; Hao Zhou; Sudesh Srivastav; Jeffrey G. Shaffer; Kuukua E. Abraham; Samuel M. Naandam; Samuel Kakraba
author_sort	Zhengxiao Yang
collection	DOAJ
description	Patient-level grouped data are prevalent in public health and medical fields, and multiple instance learning (MIL) offers a framework to address the challenges associated with this type of data structure. This study compares four data aggregation methods designed to tackle the grouped structure in classification tasks: post-mean, post-max, post-min, and pre-mean aggregation. We developed a customized AI pipeline that incorporates twelve machine learning algorithms along with the four aggregation methods to detect Parkinson’s disease (PD) using multiple voice recordings from individuals available in the UCI Machine Learning Repository, which includes 756 voice recordings from 188 PD patients and 64 healthy individuals. Seven performance metrics—accuracy, precision, sensitivity, specificity, F1 score, AUC, and MCC—were utilized for model evaluation. Various techniques, such as Bag Over-Sampling (BOS), cross-validation, and grid search, were implemented to enhance classification performance. Among the four aggregation methods, post-mean aggregation combined with XGBoost achieved the highest accuracy (0.880), F1 score (0.922), and MCC (0.672). Furthermore, we identified potential trends in selecting aggregation methods that are suitable for imbalanced data, particularly based on their differences in sensitivity and specificity. These findings provide meaningful implications for the further exploration of grouped imbalanced data.
format	Article
id	doaj-art-2ba3eedcc72047959873f236a3698182
institution	Kabale University
issn	2306-5729
language	English
publishDate	2025-01-01
publisher	MDPI AG
record_format	Article
series	Data
spelling	doaj-art-2ba3eedcc72047959873f236a3698182; 2025-01-24T13:28:32Z; eng; MDPI AG; Data; ISSN 2306-5729; 2025-01-01; vol. 10, no. 1, art. 4; doi:10.3390/data10010004
	Zhengxiao Yang (0): Biostatistics and Data Science Graduate Program, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, 1440 Canal St., New Orleans, LA 70112, USA
	Hao Zhou (1): Biostatistics and Data Science Graduate Program, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, 1440 Canal St., New Orleans, LA 70112, USA
	Sudesh Srivastav (2): Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70112, USA
	Jeffrey G. Shaffer (3): Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70112, USA
	Kuukua E. Abraham (4): Department of Mathematics and Statistics, Minnesota State University, Mankato, MN 60001, USA
	Samuel M. Naandam (5): Department of Mathematics, University of Cape Coast, Cape Coast 00233, Ghana
	Samuel Kakraba (6): Department of Biostatistics and Data Science, Celia Scott Weatherhead School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70112, USA
title	Optimizing Parkinson’s Disease Prediction: A Comparative Analysis of Data Aggregation Methods Using Multiple Voice Recordings via an Automated Artificial Intelligence Pipeline
topic	Parkinson’s disease (PD); machine learning (ML); artificial intelligence (AI); multiple instance learning (MIL); data aggregation; classification
url	https://www.mdpi.com/2306-5729/10/1/4

Optimizing Parkinson’s Disease Prediction: A Comparative Analysis of Data Aggregation Methods Using Multiple Voice Recordings via an Automated Artificial Intelligence Pipeline
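
The description in this record names four aggregation methods (post-mean, post-max, post-min, and pre-mean) without defining them. The sketch below shows one plausible reading, stated here as an assumption rather than as the authors' implementation: the "post-" methods pool recording-level predicted probabilities into one score per individual after classification, while pre-mean averages each individual's feature rows before any model is fitted. The function names, the toy numbers, and the 0.5 threshold are all hypothetical. The toy data use three recordings per individual, which matches the average implied by the record (756 recordings from 188 + 64 = 252 individuals).

```python
import numpy as np

# Hypothetical helpers for illustration only. The names and the reading of
# "post-" versus "pre-" aggregation are assumptions, not taken from the article.


def post_aggregate(probs, groups, how="mean"):
    """Collapse recording-level probabilities to one score per individual."""
    reducers = {"mean": np.mean, "max": np.max, "min": np.min}
    reduce_fn = reducers[how]
    probs = np.asarray(probs, dtype=float)
    groups = np.asarray(groups)
    return {g.item(): float(reduce_fn(probs[groups == g]))
            for g in np.unique(groups)}


def pre_mean_aggregate(features, groups):
    """Average each individual's feature rows before any model is fitted."""
    features = np.asarray(features, dtype=float)
    groups = np.asarray(groups)
    ids = np.unique(groups)
    pooled = np.vstack([features[groups == g].mean(axis=0) for g in ids])
    return ids, pooled


# Toy data: two individuals (ids 1 and 2), three recordings each.
groups = [1, 1, 1, 2, 2, 2]
probs = [0.9, 0.6, 0.3, 0.2, 0.4, 0.3]  # made-up classifier outputs
features = [[1, 2], [3, 4], [5, 6], [0, 0], [2, 2], [4, 4]]

post_mean = post_aggregate(probs, groups, "mean")
post_max = post_aggregate(probs, groups, "max")
post_min = post_aggregate(probs, groups, "min")
ids, pooled = pre_mean_aggregate(features, groups)

# Individual-level labels at an assumed 0.5 decision threshold.
labels_max = {g: s >= 0.5 for g, s in post_max.items()}
labels_min = {g: s >= 0.5 for g, s in post_min.items()}
```

For any individual, the maximum of the recording-level scores is at least the mean, which is at least the minimum. At a fixed threshold, post-max therefore flags at least as many individuals as post-mean, and post-mean at least as many as post-min. This is plain arithmetic on the sketch above, not a result reported in the article.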

Similar Items