Machine learning-based feature selection for ultra-high-dimensional survival data: a computational approach

Bibliographic Details
Main Authors: Nahid Salma, Majid Khan Majahar Ali, Raja Aqib Shamim
Format: Article
Language: English
Published: Nigerian Society of Physical Sciences 2025-08-01
Series: Journal of Nigerian Society of Physical Sciences
Online Access: https://journal.nsps.org.ng/index.php/jnsps/article/view/2810
collection DOAJ
description Ultra-high-dimensional (UHD) survival data present significant computational challenges in biomedical research, particularly in Renal Cell Carcinoma (RCC), where genomic complexity complicates risk assessment. Effective feature selection is crucial for identifying key biomarkers that improve RCC diagnosis, prognosis, and treatment. This study evaluates machine learning (ML)-based feature selection methods to address limitations in scalability, feature redundancy, and predictive accuracy in UHD RCC survival data. Gene expression data for 4,224 differentially expressed genes across 74 individuals were analyzed using the Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net (EN), Adaptive LASSO, Group LASSO, Sure Independence Screening (SIS), Iterative SIS (ISIS), Smoothly Clipped Absolute Deviation (SCAD), and Support Vector Machine (SVM). Models were assessed using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² values. SCAD demonstrated the best predictive performance (MSE: 529.00, RMSE: 23.00, R²: 0.69), surpassing ISIS (R²: 0.61), SIS (R²: 0.60), and EN (R²: 0.57). LASSO and Adaptive LASSO underperformed. SCAD identified 14 key genes as potential RCC biomarkers: NCAM1, ATP1B3, NAT8, MT2A, GTF2F2, X4197, GUCY2C, SLC3A1, CRYZ, DES, MT1L, NFYB, PRKAR2B, and CLIP1. Gene interaction network analysis confirmed their role in RCC progression. Despite its strong performance, SCAD left 31% of the variability in the data unexplained, suggesting that hybrid ML models integrating ensemble learning, two-component regression structures, and deep learning-based feature selection could further improve gene selection and predictive accuracy. This research supports SDG 3 (Good Health and Well-being) and SDG 9 (Industry, Innovation, and Infrastructure) by advancing precision medicine, early RCC detection, and biomedical data-driven innovations for improved clinical decision-making.
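The screening and evaluation steps named in the description can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic (only the 74 x 4,224 shape is taken from the abstract), the function names `sis_screen` and `regression_metrics` are hypothetical, the outcome is treated as a plain continuous response with no censoring, and the screening size d = n / log(n) is a commonly used SIS default that the abstract does not confirm. The last lines check that the reported figures are mutually consistent: RMSE is the square root of MSE, and 1 - R² is the unexplained share.

```python
import numpy as np

def sis_screen(X, y, d):
    """Sure Independence Screening: keep the d features with the
    largest absolute marginal correlation with the response."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf  # constant columns get correlation 0
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(corr)[::-1][:d]

def regression_metrics(y_true, y_pred):
    """MSE, RMSE and R² as used to compare the models."""
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {"mse": mse, "rmse": float(np.sqrt(mse)), "r2": 1.0 - ss_res / ss_tot}

# Synthetic stand-in data (assumption): 74 samples, 4,224 features,
# three of which actually drive the response.
rng = np.random.default_rng(0)
n, p = 74, 4224
X = rng.normal(size=(n, p))
y = X[:, [10, 200, 3000]] @ np.array([3.0, 3.0, 3.0]) + rng.normal(scale=0.5, size=n)

d = int(n / np.log(n))            # assumed screening size
keep = sis_screen(X, y, d)

# Ordinary least squares on the screened features (in-sample fit,
# so the resulting metrics are optimistic).
A = np.column_stack([np.ones(n), X[:, keep]])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
metrics = regression_metrics(y, A @ beta)
print(len(keep), metrics)

# Consistency of the figures reported for SCAD in the abstract.
reported_rmse = float(np.sqrt(529.00))       # 23.0, matching the reported RMSE
unexplained_share = round(1.0 - 0.69, 2)     # 0.31, the "31% unexplained"
```

A penalized method such as SCAD or LASSO would replace the least-squares step on the screened features; SIS only reduces the dimension so that such a fit becomes tractable.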
id doaj-art-ef3d79300ad541a488d072f7d3ed7e66
institution DOAJ
issn 2714-2817
2714-4704
doi 10.46481/jnsps.2025.2810
volume 7
issue 3
topic Ultra-high dimension
Machine Learning
Feature Selection
Renal Cell Carcinoma
Survival Data