An ensemble machine learning-based performance evaluation identifies top In-Silico pathogenicity prediction methods that best classify driver mutations in cancer

Bibliographic Details
Main Authors: Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, Anirban Mukhopadhyay, Nidhan K. Biswas
Format: Article
Language: English
Published: BMC 2025-01-01
Series: BioData Mining
Subjects: Driver mutation; Machine learning; Pathogenicity prediction algorithm
Online Access: https://doi.org/10.1186/s13040-024-00420-x
collection DOAJ
description Abstract
Background and objective: Accurate identification and prioritization of driver mutations in cancer are critical for effective patient management. Despite the presence of numerous bioinformatic algorithms for estimating mutation pathogenicity, there is significant variation in their assessments. This inconsistency is evident even for well-established cancer driver mutations. This study aims to develop an ensemble machine learning approach to evaluate the performance (rank) of pathogenic and conservation scoring algorithms (PCSAs) based on their ability to distinguish pathogenic driver mutations from benign passenger (non-driver) mutations in head and neck squamous cell carcinoma (HNSC).
Methods: The study used a dataset from 502 HNSC patients, classifying mutations based on 299 known high-confidence cancer driver genes. Missense somatic mutations in driver genes were treated as driver mutations, while non-driver mutations were randomly selected from other genes. Each mutation was annotated with 41 PCSAs. Three machine learning algorithms (logistic regression, random forest, and support vector machine), along with recursive feature elimination, were used to rank these PCSAs. The final ranking of the PCSAs was determined using rank-average-sort and rank-sum-sort methods.
Results: The random forest algorithm emerged as the top performer among the three tested ML algorithms, with an AUC-ROC of 0.89, compared to 0.83 for the other two, in distinguishing pathogenic driver mutations from benign passenger mutations using all 41 PCSAs. The top 11 PCSAs were selected based on the first quintile cut-off from the final rank-sum distribution. Classifiers built using these top 11 PCSAs (DEOGEN2, Integrated_fitCons, MVP, etc.) demonstrated significantly higher performance (p-value < 2.22e-16) than those using the remaining 30 PCSAs across all three ML algorithms in separating pathogenic driver mutations from benign passenger mutations. The top PCSAs demonstrated strong performance on a validation cohort including independent HNSC and other cancer types (breast, lung, and colorectal), reflecting their consistency, robustness, and generalizability.
Conclusions: The ensemble machine learning approach effectively evaluates the performance of PCSAs based on their ability to differentiate pathogenic driver mutations from benign passenger mutations in HNSC and other cancer types. Notably, some well-known PCSAs performed poorly, underscoring the importance of data-driven selection over relying solely on popularity.
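The rank-sum and rank-average aggregation step described in the abstract can be illustrated with a small sketch. This is not the authors' code: the record does not give their exact procedure, so the function names (`aggregate_ranks`, `top_quintile`), the placeholder scorer names (`PCSA_A` to `PCSA_J`), the toy ranks, the alphabetical tie-breaking, and the choice of quantile method are all assumptions made for illustration. It assumes each of the three classifiers has already produced a rank per PCSA (1 = most important), for example from recursive feature elimination.

```python
from statistics import quantiles


def aggregate_ranks(ranks_by_model):
    """Combine per-model feature ranks into rank-sum and rank-average tables."""
    features = list(next(iter(ranks_by_model.values())))
    n_models = len(ranks_by_model)
    rank_sum = {f: sum(r[f] for r in ranks_by_model.values()) for f in features}
    rank_avg = {f: rank_sum[f] / n_models for f in features}
    return rank_sum, rank_avg


def top_quintile(rank_sum):
    """Return features whose rank-sum is at or below the first-quintile cut-off."""
    # First of the four cut points that split the distribution into fifths.
    cutoff = quantiles(rank_sum.values(), n=5, method="inclusive")[0]
    selected = sorted(
        (f for f, s in rank_sum.items() if s <= cutoff),
        key=lambda f: (rank_sum[f], f),  # ties broken alphabetically (assumption)
    )
    return selected, cutoff


# Toy, made-up ranks for ten placeholder scorers; not data from the study.
names = [f"PCSA_{c}" for c in "ABCDEFGHIJ"]
ranks_by_model = {
    "logistic_regression": dict(zip(names, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])),
    "random_forest": dict(zip(names, [2, 1, 4, 3, 6, 5, 8, 7, 10, 9])),
    "svm": dict(zip(names, [1, 3, 2, 5, 4, 7, 6, 9, 8, 10])),
}

rank_sum, rank_avg = aggregate_ranks(ranks_by_model)
selected, cutoff = top_quintile(rank_sum)
print("first-quintile cut-off:", cutoff)
print("selected scorers:", selected)
```

Sorting by rank-sum and by rank-average gives the same order whenever every scorer is ranked by every model, since the average is the sum divided by a constant; the two would differ only if some models omitted some scorers.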
id doaj-art-593c024a4a3c471ba043c61e4b1bbdaf
institution Kabale University
issn 1756-0381
affiliation Subrata Das, Vatsal Patel, Shouvik Chakravarty, Arnab Ghosh, Nidhan K. Biswas: Biotechnology Research and Innovation Council-National Institute of Biomedical Genomics (BRIC-NIBMG), National Institute of Biomedical Genomics
Anirban Mukhopadhyay: Department of Computer Science and Engineering, University of Kalyani
topic Driver mutation
Machine learning
Pathogenicity prediction algorithm