Automated sparse feature selection in high-dimensional proteomics data via 1-bit compressed sensing and K-Medoids clustering

Abstract Background High-dimensional proteomics data present significant challenges in biomarker discovery due to technical noise, feature redundancy, and multicollinearity. Current feature selection methods, including filter, wrapper, and embedded approaches, struggle with stability, sparsity, and...

Bibliographic Details
Main Authors: FuDong Wen, Yue Su, Dan Liu, YuPeng Wang, MeiNa Liu
Format: Article
Language: English
Published: BMC 2025-07-01
Series: BMC Bioinformatics
Subjects: Feature selection; Classification; Compressed sensing; K-Medoids clustering; Proteomics
Online Access: https://doi.org/10.1186/s12859-025-06193-2
collection DOAJ
description Abstract
Background: High-dimensional proteomics data present significant challenges in biomarker discovery due to technical noise, feature redundancy, and multicollinearity. Current feature selection methods, including filter, wrapper, and embedded approaches, struggle with stability, sparsity, and computational efficiency. To address these limitations, we propose Soft-Thresholded Compressed Sensing (ST-CS), a hybrid framework integrating 1-bit compressed sensing with K-Medoids clustering. Unlike conventional methods relying on manual thresholds, ST-CS automates feature selection by dynamically partitioning coefficient magnitudes into discriminative biomarkers and noise.
Results: Evaluations on simulated and real-world proteomic datasets demonstrated ST-CS’s superiority in feature selection capability and classification performance. In simulations, ST-CS achieved feature selection robustness with balanced sensitivity (> 80%) and specificity (> 99.8%), reducing false discovery rates (FDR) by 20–50% compared to Hard-Thresholded Compressed Sensing (HT-CS). Additionally, it attained superior F1 scores and Matthews Correlation Coefficients (MCC), outperforming HT-CS, LASSO, and SPLSDA in identifying true biomarkers while suppressing noise. For classification performance, ST-CS surpassed all methods in the area under the receiver operating characteristic curve (AUC) across varying noise levels while maintaining sparsity.
Applied to Clinical Proteomic Tumor Analysis Consortium (CPTAC) datasets, ST-CS matched HT-CS’s classification accuracy (AUC = 97.47% for intrahepatic cholangiocarcinoma) but with 57% fewer selected features (37 vs. 86), demonstrating its dual strength in precision biomarker discovery and predictive accuracy. For glioblastoma data, ST-CS achieved higher AUC (72.71%) than HT-CS (72.15%), LASSO (67.80%), and SPLSDA (71.38%) while retaining a parsimonious feature set (30 vs. 58 features for HT-CS). In ovarian serous cystadenocarcinoma, ST-CS further demonstrated its adaptability, attaining superior AUC (75.86%) over HT-CS (75.61%), LASSO (61.00%), and SPLSDA (70.75%) with only 24 ± 5 selected biomarkers. These results highlight ST-CS’s ability to rigorously automate feature selection while balancing classification efficacy, interpretability, and scalability for translational proteomics.
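The abstract's core idea, splitting coefficient magnitudes into a "biomarker" group and a "noise" group by clustering rather than by a hand-picked cutoff, can be illustrated with a minimal Python sketch. This is not the authors' ST-CS implementation, whose estimator and settings are not given in this record. It assumes a simple back-projection estimate (X^T y / n) as a stand-in for the 1-bit compressed sensing step, a two-cluster 1-D K-Medoids on the absolute coefficients, and made-up simulated data. All function names (two_medoids_1d, one_bit_coefficients, selection_metrics, demo) and parameter values are hypothetical, and only NumPy is required.

```python
import numpy as np


def two_medoids_1d(values, max_iter=100):
    """Split 1-D values into two clusters with alternating k-medoids (k = 2).

    Returns a boolean mask, True for members of the higher-medoid cluster.
    """
    v = np.asarray(values, dtype=float)
    medoids = np.array([v.min(), v.max()])  # deterministic initialisation
    labels = np.zeros(v.size, dtype=int)
    for _ in range(max_iter):
        # Assign each value to its nearest medoid.
        labels = np.argmin(np.abs(v[:, None] - medoids[None, :]), axis=1)
        new_medoids = medoids.copy()
        for k in (0, 1):
            members = v[labels == k]
            if members.size == 0:
                continue  # degenerate case: all values identical
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            cost = np.abs(members[:, None] - members[None, :]).sum(axis=1)
            new_medoids[k] = members[np.argmin(cost)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels == int(np.argmax(medoids))


def one_bit_coefficients(X, y):
    """Back-projection estimate from sign measurements y in {-1, +1}.

    Assumption: used here only as a simple stand-in for a 1-bit
    compressed sensing solver.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return X.T @ y / X.shape[0]


def selection_metrics(selected, truth):
    """Sensitivity, specificity and false discovery rate of a selection."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(selected & truth))
    fp = int(np.sum(selected & ~truth))
    fn = int(np.sum(~selected & truth))
    tn = int(np.sum(~selected & ~truth))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "fdr": fp / (tp + fp) if tp + fp else 0.0,
    }


def demo(seed=0, n=4000, p=200, k=10):
    """Simulate n samples with p features, of which the first k are informative."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:k] = 1.0
    # Only the sign of the noisy linear response is observed (1-bit labels).
    y = np.sign(X @ beta + 0.5 * rng.standard_normal(n))
    y[y == 0] = 1.0
    selected = two_medoids_1d(np.abs(one_bit_coefficients(X, y)))
    return selection_metrics(selected, beta != 0)


if __name__ == "__main__":
    print(demo())
```

Because the two medoids are re-estimated from the data, the boundary between selected and discarded features moves with the observed coefficient distribution, which is the property the abstract contrasts with manually chosen thresholds.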
id doaj-art-9f4a7d7e69f74987afa305d80e2aaab5
issn 1471-2105
spelling doaj-art-9f4a7d7e69f74987afa305d80e2aaab5; 2025-08-20T03:04:11Z; eng; BMC; BMC Bioinformatics; ISSN 1471-2105; 2025-07-01; vol. 26, no. 1, pp. 1-16; doi:10.1186/s12859-025-06193-2; Automated sparse feature selection in high-dimensional proteomics data via 1-bit compressed sensing and K-Medoids clustering; FuDong Wen, Yue Su, Dan Liu, YuPeng Wang, MeiNa Liu (all: Department of Biostatistics, Public Health College, Harbin Medical University)
topic Feature selection
Classification
Compressed sensing
K-Medoids clustering
Proteomics