Cervical cancer prediction using machine learning models based on routine blood analysis

Abstract Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable i...

Bibliographic Details
Main Authors: Jie Su, Hui Lu, Ruihuan Zhang, Na Cui, Chao Chen, Qin Si, Biao Song
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Subjects:
Online Access: https://doi.org/10.1038/s41598-025-08166-0
_version_ 1849335259936587776
author Jie Su
Hui Lu
Ruihuan Zhang
Na Cui
Chao Chen
Qin Si
Biao Song
author_facet Jie Su
Hui Lu
Ruihuan Zhang
Na Cui
Chao Chen
Qin Si
Biao Song
author_sort Jie Su
collection DOAJ
description Abstract Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable is the occurrence of CC, as confirmed by histopathological diagnosis. We used the Shapley Additive Explanation (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group was composed of 3,794 patients without apparent signs of the disease, which included women with other gynecological conditions as well as healthy individuals undergoing routine check-ups. Age, clinical diagnosis information, and 22 blood cell analysis results were considered. Four different algorithms were applied to construct a model for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key routine blood features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), plateletcrit (PCT), and [...]. Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907.
The SHAP method revealed the top six predictors of CC according to the importance ranking, and the average platelet distribution width (PDW) was recognized as the most important predictor variable for CC occurrence (the primary endpoint variable).
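The AUC figures quoted in the abstract can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The following is a minimal sketch of that pairwise definition in plain Python. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = case, 0 = control (not taken from the study).
labels = [1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]  # ranks most cases above controls
model_b = [0.9, 0.3, 0.4, 0.5, 0.6, 0.2, 0.1]  # ranks cases less cleanly

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration. For a cohort of several thousand patients, a rank-based implementation such as the one in a standard machine learning library would be the more practical choice.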
format Article
id doaj-art-05ef6791f8c44b478db20431ebbe67be
institution Kabale University
issn 2045-2322
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-05ef6791f8c44b478db20431ebbe67be
2025-08-20T03:45:20Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-07-01
151115
10.1038/s41598-025-08166-0
Cervical cancer prediction using machine learning models based on routine blood analysis
Jie Su 0
Hui Lu 1
Ruihuan Zhang 2
Na Cui 3
Chao Chen 4
Qin Si 5
Biao Song 6
Medical neurobiology laboratory, Inner Mongolia Medical University
College of Computer Science, Inner Mongolia University
Medical Intelligent Diagnostics Big Data Research Institute
Inner Mongolia Autonomous Region Cancer Center Gynecological oncology, Peking University Cancer Hospital (Inner Mongolia Campus) & Affiliated Cancer Hospital of Inner Mongolia Medical University
Medical Intelligent Diagnostics Big Data Research Institute
Inner Mongolia Autonomous Region Cancer Center Gynecological oncology, Peking University Cancer Hospital (Inner Mongolia Campus) & Affiliated Cancer Hospital of Inner Mongolia Medical University
Medical Intelligent Diagnostics Big Data Research Institute
Abstract Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable is the occurrence of CC, as confirmed by histopathological diagnosis. We used the Shapley Additive Explanation (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group was composed of 3,794 patients without apparent signs of the disease, which included women with other gynecological conditions as well as healthy individuals undergoing routine check-ups. Age, clinical diagnosis information, and 22 blood cell analysis results were considered. Four different algorithms were applied to construct a model for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key routine blood features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), plateletcrit (PCT), and [...]. Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907. The SHAP method revealed the top six predictors of CC according to the importance ranking, and the average platelet distribution width (PDW) was recognized as the most important predictor variable for CC occurrence (the primary endpoint variable).
https://doi.org/10.1038/s41598-025-08166-0
Blood routine
Cervical cancer
Machine learning
Shapley additive interpretation
spellingShingle Jie Su
Hui Lu
Ruihuan Zhang
Na Cui
Chao Chen
Qin Si
Biao Song
Cervical cancer prediction using machine learning models based on routine blood analysis
Scientific Reports
Blood routine
Cervical cancer
Machine learning
Shapley additive interpretation
title Cervical cancer prediction using machine learning models based on routine blood analysis
title_full Cervical cancer prediction using machine learning models based on routine blood analysis
title_fullStr Cervical cancer prediction using machine learning models based on routine blood analysis
title_full_unstemmed Cervical cancer prediction using machine learning models based on routine blood analysis
title_short Cervical cancer prediction using machine learning models based on routine blood analysis
title_sort cervical cancer prediction using machine learning models based on routine blood analysis
topic Blood routine
Cervical cancer
Machine learning
Shapley additive interpretation
url https://doi.org/10.1038/s41598-025-08166-0
work_keys_str_mv AT jiesu cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis
AT huilu cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis
AT ruihuanzhang cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis
AT nacui cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis
AT chaochen cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis
AT qinsi cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis
AT biaosong cervicalcancerpredictionusingmachinelearningmodelsbasedonroutinebloodanalysis