‘Machine Learning’ multiclassification for stage diagnosis of Alzheimer’s disease utilizing augmented blood gene expression and feature fusion

Abstract. Objective: The present study explores the classification of Alzheimer’s disease (AD) stages, encompassing cognitive normalcy, Mild Cognitive Impairment (MCI), and AD/Dementia, through the application of Machine Learning (ML) multiclassification algorithms. This investigation utilizes blood g...

Bibliographic Details
Main Authors: Manash Sarma, Subarna Chatterjee
Format: Article
Language: English
Published: Springer 2025-06-01
Series: Discover Applied Sciences
Subjects:
Online Access: https://doi.org/10.1007/s42452-025-07237-1
_version_ 1850100565897379840
author Manash Sarma
Subarna Chatterjee
collection DOAJ
description Abstract
Objective: The present study explores the classification of Alzheimer’s disease (AD) stages, encompassing cognitive normalcy (CN), Mild Cognitive Impairment (MCI), and AD/Dementia, through the application of Machine Learning (ML) multiclassification algorithms. The investigation uses blood gene expression datasets obtained from participants in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and from the National Center for Biotechnology Information (NCBI). Three blood gene expression datasets of high dimensionality and low sample size (HDLSS) are used, one of which exhibits significant class imbalance. The study also integrates clinical data from electronic health records (EHRs) with the gene expression datasets, which was found to significantly enhance the accuracy of stage diagnosis.
Methods: Features are selected with a combination of XGBoost and sequential floating backward selection (SFBS). From more than 49,000 transcripts in the ADNI gene expression dataset, we identified a subset of 95 gene transcripts with optimal efficacy. From more than 30,000 candidates in the two integrated NCBI datasets, we identified 125 gene transcripts with superior effectiveness. These findings led to two distinct model categories: one derived from the ADNI dataset and the other from the integrated NCBI dataset. A deep learning (DL) classifier is used to develop models of both categories, while Gradient Boosting (GB) and Support Vector Machine (SVM) classifier-based models are built to identify AD stages for NCBI participants. Because the genomic data are highly imbalanced, borderline oversampling is explored for model training, with the original data used for validation. We also conducted a multimodal analysis and stage classification by integrating the ADNI gene expression and clinical datasets using feature-level fusion.
Results: For ADNI study participants, the best multiclassification performance was ROC AUC scores of 0.76, 0.76, and 0.71 and F1 scores of 0.71, 0.77, and 0.53 for the CN, MCI, and Dementia stages, respectively. For the NCBI-based model, the best AUC scores of 0.82, 0.74, and 0.79 (for CN, MCI, and AD, respectively) and F1 scores of 0.75, 0.60, and 0.77 were attained when evaluated on GSE3060 test data. On GSE3061 test data, the model achieved best AUC scores of 0.81, 0.75, and 0.78 and F1 scores of 0.74, 0.67, and 0.73. This research identified MAPK14, MID1, TEP1, PLG, DRAXIN, and USP47 as genes associated with AD. For the ADNI data, integrating clinical data with gene expression data raised the best F1 scores to 0.85, 0.86, and 0.83 for CN, MCI, and AD, respectively, and the ROC AUC scores to 0.90, 0.85, and 0.89.
Conclusion: Using machine learning multiclassification techniques on blood gene expression profile data from ADNI and NCBI, we achieved the most promising results to date for diagnosing multiple stages of Alzheimer’s disease. This demonstrates the efficacy of our feature selection techniques in finding essential genes associated with AD. Highly accurate diagnosis of stages, including MCI, from genetic data can potentially provide a timely alert for individuals susceptible or predisposed to AD.
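The feature-level fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes fusion means concatenating each participant's gene expression vector with that participant's clinical vector after per-column standardization. The function names, participant IDs, and toy values are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) variance.

    Assumption for illustration only: in a real train/test setup the mean and
    standard deviation would be fitted on the training split alone.
    """
    cols = list(zip(*rows))
    # Fall back to 1.0 when a column is constant, to avoid division by zero.
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def fuse_features(gene_by_id, clinical_by_id):
    """Concatenate both modalities for participants present in both tables."""
    shared_ids = sorted(set(gene_by_id) & set(clinical_by_id))
    gene = zscore_columns([gene_by_id[i] for i in shared_ids])
    clin = zscore_columns([clinical_by_id[i] for i in shared_ids])
    fused = [g + c for g, c in zip(gene, clin)]
    return shared_ids, fused


# Hypothetical toy data: three transcript values and two clinical values each.
gene_by_id = {"P1": [5.1, 7.3, 2.0], "P2": [4.9, 6.8, 2.4], "P3": [5.6, 7.9, 1.7]}
clinical_by_id = {"P1": [72.0, 29.0], "P2": [68.0, 26.0], "P4": [80.0, 21.0]}

ids, fused = fuse_features(gene_by_id, clinical_by_id)
print(ids, len(fused[0]))
```

Only participants with both modalities are kept, and the fused rows can then be passed to any single classifier, which is what distinguishes feature-level fusion from combining the outputs of separate per-modality models.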
format Article
id doaj-art-dd2a2bdbebcf49759eaed7e49c93e0cf
institution DOAJ
issn 3004-9261
language English
publishDate 2025-06-01
publisher Springer
record_format Article
series Discover Applied Sciences
spelling doaj-art-dd2a2bdbebcf49759eaed7e49c93e0cf 2025-08-20T02:40:15Z eng Springer Discover Applied Sciences 3004-9261 2025-06-01 76128 10.1007/s42452-025-07237-1
Manash Sarma, Department of CSE, Ramaiah University of Applied Sciences
Subarna Chatterjee, Department of CSE, Ramaiah University of Applied Sciences
https://doi.org/10.1007/s42452-025-07237-1
title ‘Machine Learning’ multiclassification for stage diagnosis of Alzheimer’s disease utilizing augmented blood gene expression and feature fusion
topic Disease stage diagnosis
Blood gene expression
Data imbalance
Multiclassification
F1 score
AD risk gene
url https://doi.org/10.1007/s42452-025-07237-1