Cervical cancer prediction using machine learning models based on routine blood analysis
Abstract Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable i...
| Main Authors: | Jie Su, Hui Lu, Ruihuan Zhang, Na Cui, Chao Chen, Qin Si, Biao Song |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-08166-0 |
| _version_ | 1849335259936587776 |
|---|---|
| author | Jie Su, Hui Lu, Ruihuan Zhang, Na Cui, Chao Chen, Qin Si, Biao Song |
| author_facet | Jie Su, Hui Lu, Ruihuan Zhang, Na Cui, Chao Chen, Qin Si, Biao Song |
| author_sort | Jie Su |
| collection | DOAJ |
| description | Abstract Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable is the occurrence of CC, as confirmed by histopathological diagnosis. We used the Shapley Additive Explanation (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group was composed of 3,794 patients without apparent signs of the disease, which included women with other gynecological conditions as well as healthy individuals undergoing routine check-ups. Age, clinical diagnosis information and 22 blood cell analysis results were considered. Four different algorithms were applied to construct a model for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key routine blood features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), plateletcrit (PCT), and [...]. Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907. The SHAP method revealed the top 6 predictors of CC according to the importance ranking, and the average platelet distribution width (PDW) was recognized as the most important predictor variable for CC occurrence (the primary endpoint variable). |
| format | Article |
| id | doaj-art-05ef6791f8c44b478db20431ebbe67be |
| institution | Kabale University |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-05ef6791f8c44b478db20431ebbe67be; 2025-08-20T03:45:20Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-07-01; 151115; 10.1038/s41598-025-08166-0; Cervical cancer prediction using machine learning models based on routine blood analysis; Jie Su [0], Hui Lu [1], Ruihuan Zhang [2], Na Cui [3], Chao Chen [4], Qin Si [5], Biao Song [6]; [0] Medical neurobiology laboratory, Inner Mongolia Medical University; [1] College of Computer Science, Inner Mongolia University; [2] Medical Intelligent Diagnostics Big Data Research Institute; [3] Inner Mongolia Autonomous Region Cancer Center Gynecological oncology, Peking University Cancer Hospital (Inner Mongolia Campus) & Affiliated Cancer Hospital of Inner Mongolia Medical University; [4] Medical Intelligent Diagnostics Big Data Research Institute; [5] Inner Mongolia Autonomous Region Cancer Center Gynecological oncology, Peking University Cancer Hospital (Inner Mongolia Campus) & Affiliated Cancer Hospital of Inner Mongolia Medical University; [6] Medical Intelligent Diagnostics Big Data Research Institute; Abstract Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable is the occurrence of CC, as confirmed by histopathological diagnosis. We used the Shapley Additive Explanation (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group was composed of 3,794 patients without apparent signs of the disease, which included women with other gynecological conditions as well as healthy individuals undergoing routine check-ups. Age, clinical diagnosis information and 22 blood cell analysis results were considered. Four different algorithms were applied to construct a model for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key routine blood features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), plateletcrit (PCT), and [...]. Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907. The SHAP method revealed the top 6 predictors of CC according to the importance ranking, and the average platelet distribution width (PDW) was recognized as the most important predictor variable for CC occurrence (the primary endpoint variable). https://doi.org/10.1038/s41598-025-08166-0; Blood routine; Cervical cancer; Machine learning; Shapley additive interpretation |
| spellingShingle | Jie Su, Hui Lu, Ruihuan Zhang, Na Cui, Chao Chen, Qin Si, Biao Song; Cervical cancer prediction using machine learning models based on routine blood analysis; Scientific Reports; Blood routine; Cervical cancer; Machine learning; Shapley additive interpretation |
| title | Cervical cancer prediction using machine learning models based on routine blood analysis |
| title_full | Cervical cancer prediction using machine learning models based on routine blood analysis |
| title_fullStr | Cervical cancer prediction using machine learning models based on routine blood analysis |
| title_full_unstemmed | Cervical cancer prediction using machine learning models based on routine blood analysis |
| title_short | Cervical cancer prediction using machine learning models based on routine blood analysis |
| title_sort | cervical cancer prediction using machine learning models based on routine blood analysis |
| topic | Blood routine; Cervical cancer; Machine learning; Shapley additive interpretation |
| url | https://doi.org/10.1038/s41598-025-08166-0 |
| work_keys_str_mv | AT jiesu cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis AT huilu cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis AT ruihuanzhang cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis AT nacui cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis AT chaochen cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis AT qinsi cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis AT biaosong cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis |
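
The description field above compares models by area under the ROC curve (AUC), reporting 0.964 for XGBoost and 0.907 for RF. As a reading aid, the following is a minimal sketch of what that metric measures: the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations; they are not the study's data or code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case, 0 for a control.
    scores: predicted risk, higher meaning more likely to be a case.
    Tied pairs contribute 0.5, matching the usual rank-based definition.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: three cases and four controls.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Of the 12 case-control pairs in the toy data, 11 are ordered correctly, so the sketch yields 11/12. The pairwise loop is quadratic in the sample size; for a cohort of several thousand patients a sort-based routine from an established library would be the practical choice.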