Predicting high confidence ctDNA somatic variants with ensemble machine learning models

Abstract Circulating tumour DNA (ctDNA) is a minimally invasive cancer biomarker that can be used to inform treatment of cancer patients. The utility of ctDNA as a cancer biomarker depends on the ability to accurately detect somatic variants associated with cancer. Accurate somatic variant detection...

Bibliographic Details
Main Authors: Rugare Maruzani, Liam Brierley, Andrea Jorgensen, Anna Fowler
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-01326-2
collection DOAJ
description Abstract: Circulating tumour DNA (ctDNA) is a minimally invasive cancer biomarker that can be used to inform the treatment of cancer patients. The utility of ctDNA as a cancer biomarker depends on the ability to accurately detect somatic variants associated with cancer. Accurate somatic variant detection in circulating cell-free DNA (cfDNA) next-generation sequencing (NGS) data requires filtering strategies to remove germline variants and NGS artifacts. Rule-based variant filtering methods either remove a substantial number of true positive ctDNA variants along with false variant calls or retain an implausibly large number of total variants. Machine learning (ML) enables identification of complex patterns, which may improve the ability to distinguish real somatic ctDNA variants from false positive calls. We built two Random Forest (RF) models for predicting high confidence somatic ctDNA variants in low and high depth cfDNA NGS data. Low depth models were fitted and evaluated on whole exome sequencing (WES) cfDNA data at depths of approximately 10X, while the high depth data was sequenced at approximately 500X. Both models use a set of 15 features from variants detected by bcftools, FreeBayes, LoFreq and Mutect2. High confidence ground truth sets were obtained from matched tissue biopsy samples. We benchmarked our models against rule-based filtering with a set of hard, medium, and soft thresholds. Precision-recall curves showed that the high depth model outperformed rule-based filtering at all thresholds in the test data (PR-AUC 0.71). Partial dependence plots showed that membership in the COSMIC database, absence from the dbSNP common variants database, and increasing read depth each increased the mean predicted probability of a high confidence somatic variant in both models. Our results demonstrate the utility of supervised ML models for filtering variants in cfDNA data.
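The trade-off the abstract attributes to rule-based filtering (strict rules discard true ctDNA variants, lenient rules retain many false calls) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the two features used (`depth`, `n_callers`), the threshold values labelled hard, medium and soft, and the eight toy variants. The article's actual 15 features and thresholds are not given in this record.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    depth: int               # read depth at the variant site
    n_callers: int           # how many variant callers reported the variant
    in_dbsnp_common: bool    # present in a common-variant database (likely germline)
    is_true_somatic: bool    # ground-truth label, e.g. from a matched tissue biopsy


# Hypothetical rule sets; the values are illustrative only.
THRESHOLDS = {
    "hard": {"min_depth": 20, "min_callers": 3},
    "medium": {"min_depth": 10, "min_callers": 2},
    "soft": {"min_depth": 5, "min_callers": 1},
}

# Toy data: four true somatic variants and four false calls.
VARIANTS = [
    Variant(30, 4, False, True),
    Variant(25, 3, False, True),
    Variant(12, 2, False, True),
    Variant(8, 1, False, True),
    Variant(15, 2, False, False),
    Variant(6, 1, False, False),
    Variant(7, 1, False, False),
    Variant(40, 4, True, False),   # well supported, but a common germline variant
]


def passes(variant: Variant, rule: dict) -> bool:
    """Keep a variant only if it meets every threshold and is not a common SNP."""
    return (
        variant.depth >= rule["min_depth"]
        and variant.n_callers >= rule["min_callers"]
        and not variant.in_dbsnp_common
    )


def precision_recall(variants: list, rule: dict) -> tuple:
    """Precision and recall of the variants retained by a rule set."""
    kept = [v for v in variants if passes(v, rule)]
    true_positives = sum(v.is_true_somatic for v in kept)
    total_true = sum(v.is_true_somatic for v in variants)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_true if total_true else 0.0
    return precision, recall


RESULTS = {name: precision_recall(VARIANTS, rule) for name, rule in THRESHOLDS.items()}

if __name__ == "__main__":
    for name, (precision, recall) in RESULTS.items():
        print(f"{name:>6}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data the hard rules keep only true variants but miss half of them, while the soft rules recover every true variant at the cost of retaining three false calls. A supervised classifier such as the Random Forest described in the abstract replaces these fixed cut-offs with a predicted probability, so the operating point can be chosen along a precision-recall curve.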
id doaj-art-6ea40d04ef4d4f1bb92ed822b5bc3975
institution OA Journals
issn 2045-2322
volume 15
issue 1
pages 1-13
affiliations Rugare Maruzani: Department of Health Data Science, Institute of Population Health, Great Britain and Northern Ireland, University of Liverpool
Liam Brierley: School Of Infection & Immunity, of Great Britain and Northern Ireland
Andrea Jorgensen: Department of Health Data Science, Institute of Population Health, Great Britain and Northern Ireland, University of Liverpool
Anna Fowler: Department of Health Data Science, Institute of Population Health, Great Britain and Northern Ireland, University of Liverpool