Applying machine learning to high-dimensional proteomics datasets for the identification of Alzheimer’s disease biomarkers

Abstract Purpose This study explores the application of machine learning to high-dimensional proteomics datasets for identifying Alzheimer’s disease (AD) biomarkers. AD, a neurodegenerative disorder affecting millions worldwide, necessitates early and accurate diagnosis for effective management. Met...

Bibliographic Details
Main Authors: Christoffer Ivarsson Orrelid, Oscar Rosberg, Sophia Weiner, Fredrik D. Johansson, Johan Gobom, Henrik Zetterberg, Newton Mwai, Lena Stempfle
Format: Article
Language: English
Published: BMC 2025-03-01
Series: Fluids and Barriers of the CNS
Subjects:
Online Access: https://doi.org/10.1186/s12987-025-00634-z
author Christoffer Ivarsson Orrelid
Oscar Rosberg
Sophia Weiner
Fredrik D. Johansson
Johan Gobom
Henrik Zetterberg
Newton Mwai
Lena Stempfle
author_sort Christoffer Ivarsson Orrelid
collection DOAJ
description Abstract
Purpose: This study explores the application of machine learning to high-dimensional proteomics datasets for identifying Alzheimer’s disease (AD) biomarkers. AD, a neurodegenerative disorder affecting millions worldwide, necessitates early and accurate diagnosis for effective management.
Methods: We leverage Tandem Mass Tag (TMT) proteomics data from cerebrospinal fluid (CSF) samples from the frontal cortex of patients with idiopathic normal pressure hydrocephalus (iNPH), a condition often comorbid with AD, with rare access to both lumbar and ventricular samples. Our methodology includes extensive data preprocessing to address batch effects and missing values, followed by the Synthetic Minority Over-sampling Technique (SMOTE) for data augmentation to overcome the small sample size. We apply linear and non-linear machine learning models, as well as ensemble methods, to a classification task that compares iNPH patients without and with biomarker evidence of AD pathology ($$A\beta^-T^-$$ or $$A\beta^+T^+$$).
Results: We present a machine learning workflow for high-dimensional TMT proteomics data that addresses their inherent data characteristics. Our results demonstrate that batch effect correction has no or only minor impact on the models’ performance, and that robust feature selection is critical for model stability and performance, especially in the high-dimensional proteomics data setting for AD diagnostics. The results further indicate that removing features with missing values produced stronger models than imputing them. Our best-performing disease-progression detection model, a random forest, achieves an AUC of 0.84 (± 0.03).
Conclusion: We identify several novel protein biomarker candidates, such as FABP3 and GOT1, with potential diagnostic value for AD pathology detection. These findings suggest that patients with iNPH may need different biomarkers for AD diagnosis, and that different biomarkers should be considered for ventricular and lumbar CSF samples. This work underscores the importance of a meticulous machine learning process in enhancing biomarker discovery. Our study also provides insights into translating biomarkers from other central nervous system diseases such as iNPH, and into using both ventricular and lumbar CSF samples for biomarker discovery, providing a foundation for future research and clinical applications.
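The preprocessing and evaluation steps named in the abstract (dropping features with missing values, SMOTE oversampling, AUC scoring) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `drop_incomplete_features`, `smote`, and `auc`, the toy data, and the choice of `k` are all assumptions made for illustration only.

```python
import random
from math import dist


def drop_incomplete_features(rows):
    """Keep only the columns that have no missing (None) value in any row."""
    keep = [j for j in range(len(rows[0])) if all(r[j] is not None for r in rows)]
    return [[r[j] for j in keep] for r in rows], keep


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the core idea of SMOTE). Needs at least two samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): four samples, three protein features, None = missing.
rows = [[1.0, None, 0.2], [1.2, 0.5, 0.1], [0.9, 0.4, None], [3.0, 0.7, 0.9]]
complete_rows, kept_columns = drop_incomplete_features(rows)

# Toy minority class (hypothetical) oversampled with five synthetic points.
minority = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0]]
synthetic = smote(minority, n_new=5, k=2, seed=0)

# Toy classifier scores (hypothetical) evaluated with AUC.
example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

In a cross-validated setting, oversampling of this kind is normally applied to the training folds only, so that synthetic points derived from a sample never appear in the fold used to score it.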
format Article
id doaj-art-bd96dd68efbc4d96b4bdad4305b555d1
institution OA Journals
issn 2045-8118
language English
publishDate 2025-03-01
publisher BMC
record_format Article
series Fluids and Barriers of the CNS
spelling doaj-art-bd96dd68efbc4d96b4bdad4305b555d1
2025-08-20T01:57:48Z
eng
BMC
Fluids and Barriers of the CNS
2045-8118
2025-03-01
Volume 22, issue 1, pages 1-18
10.1186/s12987-025-00634-z
Christoffer Ivarsson Orrelid (Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg)
Oscar Rosberg (Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg)
Sophia Weiner (Department of Psychiatry and Neurochemistry, The Sahlgrenska Academy at the University of Gothenburg)
Fredrik D. Johansson (Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg)
Johan Gobom (Department of Psychiatry and Neurochemistry, The Sahlgrenska Academy at the University of Gothenburg)
Henrik Zetterberg (Department of Psychiatry and Neurochemistry, The Sahlgrenska Academy at the University of Gothenburg)
Newton Mwai (Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg)
Lena Stempfle (Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg)
title Applying machine learning to high-dimensional proteomics datasets for the identification of Alzheimer’s disease biomarkers
topic Alzheimer’s disease
Proteomics
Mass spectrometry
High-dimensional data
Biomarkers
Machine learning
url https://doi.org/10.1186/s12987-025-00634-z