Development and validation of a biomarker-based prediction model for metastasis in patients with colorectal cancer: Application of machine learning algorithms

Objective: The purpose of the current study was to develop and validate a biomarker-based prediction model for metastasis in patients with colorectal cancer (CRC). Methods: Two datasets, GSE68468 and GSE41568, were retrieved from the Gene Expression Omnibus (GEO) database. In the GSE68468 dataset, k...

Bibliographic Details
Main Authors: Erfan Ayubi, Sajjad Farashi, Leili Tapak, Saeid Afshar
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Heliyon
Subjects: Colorectal cancer; Metastasis; Machine learning; Biomarker
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844024174749
author Erfan Ayubi
Sajjad Farashi
Leili Tapak
Saeid Afshar
collection DOAJ
description Objective: This study aimed to develop and validate a biomarker-based prediction model for metastasis in patients with colorectal cancer (CRC). Methods: Two datasets, GSE68468 and GSE41568, were retrieved from the Gene Expression Omnibus (GEO) database. In the GSE68468 dataset, key biomarkers were identified through a screening process involving differential expression analysis, redundancy analysis, and recursive feature elimination. The prediction model was then developed and internally validated using five machine learning (ML) algorithms: lasso and elastic-net regularized generalized linear model (glmnet), k-nearest neighbors (kNN), support vector machine (SVM) with a radial basis function kernel, random forest (RF), and eXtreme Gradient Boosting (XGBoost). The algorithm with the highest accuracy was then externally validated on the GSE41568 dataset. Results: Among the 22,283 registered genes in the GSE68468 dataset, the screening process identified 16 key genes (MMP3, CCDC102B, CDH2, SCGB1A1, KRT7, CYP1B1, LAMC3, ALB, DIXDC1, VWF, MMP1, CYP4B1, NKX3-2, TMEM158, GADD45B, and SERPINA1), which were used to build the prediction model. On the internal validation dataset, the five ML algorithms performed as follows: RF (accuracy = 0.97, kappa = 0.91), XGBoost (0.93, 0.81), kNN (0.93, 0.81), glmnet (0.93, 0.82), and SVM (0.92, 0.80). The top five biomarkers were MMP3, CCDC102B, CDH2, VWF, and MMP1. On the external validation dataset, the RF model achieved an accuracy of 0.97, a kappa of 0.92, and an area under the curve (AUC) of 0.99. Conclusion: Using ML algorithms, this study identified biomarkers that help identify patients with CRC who are prone to metastasis.
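The abstract reports accuracy, Cohen's kappa, and AUC for each model. As a reading aid, the sketch below shows how these three metrics can be computed from true labels, predicted labels, and predicted scores. It is not the authors' code: the function names, the class labels "M" (metastatic) and "P" (primary), and the ten example samples are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    # Observed agreement is simply the accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


def auc_score(y_true, scores, positive):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example data (not taken from GSE68468 or GSE41568).
y_true = ["M", "M", "M", "M", "P", "P", "P", "P", "P", "P"]
y_pred = ["M", "M", "M", "P", "P", "P", "P", "P", "P", "M"]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.2, 0.1, 0.6]  # assumed P(metastatic)

acc, kappa = accuracy_and_kappa(y_true, y_pred)
auc = auc_score(y_true, scores, positive="M")
print(round(acc, 3), round(kappa, 3), round(auc, 3))  # prints: 0.8 0.583 0.958
```

Kappa corrects accuracy for the agreement expected by chance, which is why the abstract's kappa values (0.80 to 0.92) sit below the corresponding accuracies (0.92 to 0.97).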
format Article
id doaj-art-8a3582f01f3c442f9b3c56befb527589
institution DOAJ
issn 2405-8440
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Heliyon
spelling doaj-art-8a3582f01f3c442f9b3c56befb527589
2025-08-20T02:46:43Z
eng
Elsevier
Heliyon
2405-8440
2025-01-01
11(1): e41443
10.1016/j.heliyon.2024.e41443
Development and validation of a biomarker-based prediction model for metastasis in patients with colorectal cancer: Application of machine learning algorithms
Erfan Ayubi (0): Cancer Research Center, Institute of Cancer, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran
Sajjad Farashi (1): Neurophysiology Research Center, Institute of Neuroscience and Mental Health, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran
Leili Tapak (2): Modeling of Noncommunicable Diseases Research Center, Institute of Health Sciences and Technologies, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran
Saeid Afshar (3): Cancer Research Center, Institute of Cancer, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran; Department of Medical Biotechnology, School of Advanced Medical Sciences and Technologies, Hamadan University of Medical Sciences, Hamadan, Iran. Corresponding author, Cancer Research Center, Institute of Cancer, Avicenna Health Research Institute, Hamadan University of Medical Sciences, Hamadan, Iran.
http://www.sciencedirect.com/science/article/pii/S2405844024174749
Colorectal cancer
Metastasis
Machine learning
Biomarker
title Development and validation of a biomarker-based prediction model for metastasis in patients with colorectal cancer: Application of machine learning algorithms
topic Colorectal cancer
Metastasis
Machine learning
Biomarker
url http://www.sciencedirect.com/science/article/pii/S2405844024174749