An integrated approach of feature selection and machine learning for early detection of breast cancer

Abstract Breast cancer ranks among the most prevalent cancers in women globally, with its treatment efficacy heavily reliant on the early identification and diagnosis of the disease. The importance of early detection and diagnosis cannot be overstated in enhancing the survival prospects of those aff...

Bibliographic Details
Main Authors: Jing Zhu, Zhenhang Zhao, Bangzheng Yin, Canpeng Wu, Chan Yin, Rong Chen, Youde Ding
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Subjects: Breast cancer; LightGBM; SHAP; Borderline-SMOTE1; RFE; PSO
Online Access: https://doi.org/10.1038/s41598-025-97685-x
collection DOAJ
description Abstract Breast cancer ranks among the most prevalent cancers in women globally, and the efficacy of its treatment relies heavily on early identification and diagnosis. Early detection and diagnosis are therefore critical to improving the survival prospects of those afflicted with breast cancer. With the increasing application of machine learning technology in the medical field, algorithm-based diagnostic tools provide new possibilities for early prediction of breast cancer. In this study, we introduced a novel feature selection approach that uses Shapley additive explanation (SHAP) values as the basis for Recursive Feature Elimination (RFE), with a Random Forest (RF) algorithm inside the RFE framework. To address the data imbalance challenge, we incorporated Borderline-SMOTE1. The efficacy of the proposed method was assessed using five machine learning models, K-Nearest Neighbor (KNN), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Light Gradient Boosting Machine (LightGBM), applied to the Wisconsin Breast Cancer Diagnosis (WBCD) datasets. The hyperparameters of the five models were optimized using the Particle Swarm Optimization (PSO) algorithm. With 26 features filtered from the datasets using our recommended algorithm, the LightGBM-PSO model demonstrated outstanding performance. It achieved an accuracy of 99.0% in differentiating between benign and malignant cases, with a specificity and precision of 100%, a recall of 97.40%, an F-measure of 98.68%, an AUC of 0.9870, and a 10-fold cross-validation accuracy of 0.9808. Subsequently, we developed a corresponding online tool (https://breast-cancer-prediction-tool-cgbjlhkns7yig6bmzvztmc.streamlit.app/) based on this model for predicting the risk of breast cancer.
Feature selection using the recommended algorithm and optimization of the LightGBM model through PSO can significantly enhance the accuracy of breast cancer prediction. This could potentially improve the prognosis for patients diagnosed with breast cancer.
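
The rates quoted in the abstract are related by standard formulas, and a short worked example may make that relationship easier to follow. The sketch below is not the authors' code. The function names `f_measure` and `classification_metrics` are hypothetical, and the confusion-matrix counts in the usage lines are illustrative values chosen only because they round to the reported rates; they are not taken from the paper.

```python
from typing import Dict


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # convention when both rates are zero
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive common binary-classification rates from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. there is at least one
    actual positive, one actual negative, and one predicted positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure(precision, recall),
    }


if __name__ == "__main__":
    # Precision and recall as reported in the abstract (100% and 97.40%).
    print(f"{f_measure(1.0, 0.9740):.4f}")  # 0.9868

    # Hypothetical counts, NOT from the paper: illustrative values only.
    for name, value in classification_metrics(tp=75, fp=0, tn=123, fn=2).items():
        print(f"{name}: {value:.4f}")
```

With a precision of 100% and a recall of 97.40%, the harmonic mean comes to about 98.68%, which agrees with the F-measure stated in the abstract.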
id doaj-art-fdcb2a6a85f641518bcba71ff0c353e5
institution OA Journals
issn 2045-2322
spelling Jing Zhu (Experimental Centre, Guangzhou University)
Zhenhang Zhao (Electronics and Communication Engineering, Guangzhou University)
Bangzheng Yin (Institute of Information Engineering, Guangzhou Railway Polytechnic)
Canpeng Wu (Electronics and Communication Engineering, Guangzhou University)
Chan Yin (The Central Hospital of Shaoyang)
Rong Chen (The Central Hospital of Shaoyang)
Youde Ding (School of Biomedical Engineering, Guangzhou Medical University)
topic Breast cancer
LightGBM
SHAP
Borderline-SMOTE1
RFE
PSO