An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer

Abstract. Background and objective: Accurate identification and prioritization of driver mutations in cancer are critical for effective patient management. Despite the presence of numerous bioinformatic algorithms for estimating mutation pathogenicity, there is significant variation in their assessment...

Bibliographic Details
Main Authors:	Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, Anirban Mukhopadhyay, Nidhan K. Biswas
Format:	Article
Language:	English
Published:	BMC 2025-01-01
Series:	BioData Mining
Subjects:	Driver mutation; Machine learning; Pathogenicity prediction algorithm
Online Access:	https://doi.org/10.1186/s13040-024-00420-x

author	Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, Anirban Mukhopadhyay, Nidhan K. Biswas
collection	DOAJ
description	Abstract. Background and objective: Accurate identification and prioritization of driver mutations in cancer are critical for effective patient management. Despite the presence of numerous bioinformatic algorithms for estimating mutation pathogenicity, there is significant variation in their assessments. This inconsistency is evident even for well-established cancer driver mutations. This study aims to develop an ensemble machine learning approach to evaluate the performance (rank) of pathogenic and conservation scoring algorithms (PCSAs) based on their ability to distinguish pathogenic driver mutations from benign passenger (non-driver) mutations in head and neck squamous cell carcinoma (HNSC).
Methods: The study used a dataset from 502 HNSC patients, classifying mutations based on 299 known high-confidence cancer driver genes. Missense somatic mutations in driver genes were treated as driver mutations, while non-driver mutations were randomly selected from other genes. Each mutation was annotated with 41 PCSAs. Three machine learning algorithms (logistic regression, random forest, and support vector machine), along with recursive feature elimination, were used to rank these PCSAs. The final ranking of the PCSAs was determined using rank-average-sort and rank-sum-sort methods.
Results: The random forest algorithm emerged as the top performer among the three tested ML algorithms, with an AUC-ROC of 0.89, compared to 0.83 for the other two, in distinguishing pathogenic driver mutations from benign passenger mutations using all 41 PCSAs. The top 11 PCSAs were selected based on the first quintile cut-off from the final rank-sum distribution. Classifiers built using these top 11 PCSAs (DEOGEN2, Integrated_fitCons, MVP, etc.) demonstrated significantly higher performance (p-value < 2.22e-16) than those using the remaining 30 PCSAs across all three ML algorithms in separating pathogenic driver mutations from benign passenger mutations. The top PCSAs demonstrated strong performance on a validation cohort including independent HNSC and other cancer types (breast, lung, and colorectal), reflecting their consistency, robustness, and generalizability.
Conclusions: The ensemble machine learning approach effectively evaluates the performance of PCSAs based on their ability to differentiate pathogenic driver mutations from benign passenger mutations in HNSC and other cancer types. Notably, some well-known PCSAs performed poorly, underscoring the importance of data-driven selection over relying solely on popularity.
format	Article
id	doaj-art-593c024a4a3c471ba043c61e4b1bbdaf
institution	Kabale University
issn	1756-0381
language	English
publishDate	2025-01-01
publisher	BMC
record_format	Article
series	BioData Mining
spelling	doaj-art-593c024a4a3c471ba043c61e4b1bbdaf; 2025-01-26T12:18:18Z; eng; BMC; BioData Mining; 1756-0381; 2025-01-01; 181125; 10.1186/s13040-024-00420-x; An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer; Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, and Nidhan K. Biswas (Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics); Anirban Mukhopadhyay (Department of Computer Science and Engineering, University of Kalyani); https://doi.org/10.1186/s13040-024-00420-x; Driver mutation; Machine learning; Pathogenicity prediction algorithm
title	An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer
topic	Driver mutation; Machine learning; Pathogenicity prediction algorithm
url	https://doi.org/10.1186/s13040-024-00420-x

Similar Items