Determination of high-confidence germline genetic variants in next-generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation

Abstract. Background: Orthogonal confirmation of variants identified by next-generation sequencing (NGS) is routinely performed in many clinical laboratories to improve assay specificity. However, confirmatory testing of all clinically significant variants increases both turnaround time and operating...

Bibliographic Details
Main Authors: Muqing Yan, Qiandong Zeng, Zhenxi Zhang, Patricia Okamoto, Stanley Letovsky, Angela Kenyon, Natalia Leach, Jennifer Reiner
Format: Article
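Language: English
Published: BMC 2025-08-01
Series: BMC Genomics
Subjects: Next generation sequencing; Sanger confirmation; Machine learning; Clinical decision-support tool
Online Access: https://doi.org/10.1186/s12864-025-11889-z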
_version_ 1849344711266926592
author Muqing Yan
Qiandong Zeng
Zhenxi Zhang
Patricia Okamoto
Stanley Letovsky
Angela Kenyon
Natalia Leach
Jennifer Reiner
collection DOAJ
description Abstract
Background: Orthogonal confirmation of variants identified by next-generation sequencing (NGS) is routinely performed in many clinical laboratories to improve assay specificity. However, confirmatory testing of all clinically significant variants increases both turnaround time and operating costs for laboratories. Improvements to early NGS methods and bioinformatics algorithms have dramatically improved variant calling accuracy, particularly for single nucleotide variants (SNVs), thus calling into question the necessity of confirmatory testing for all variant types. The purpose of this study is to develop a new machine learning approach to capture false positive heterozygous SNVs from whole exome sequencing (WES) data.
Results: WES variant calls from Genome in a Bottle (GIAB) cell lines and their associated quality features were used to train five different machine learning models to predict whether a variant was a true positive or a false positive based on quality metrics. Logistic regression and random forest models exhibited the highest false positive capture rates among the selected models, but GradientBoosting achieved the best balance between false positive capture rates and true positive flag rates. Further assessment using simulated false positive events, as well as different combinations of quality features, showed that model performance can be refined. Integration of the highest-performing models into a custom two-tiered confirmation bypass pipeline with additional guardrail metrics achieved 99.9% precision and 98% specificity in the identification of true positive heterozygous SNVs within the GIAB benchmark regions. Furthermore, testing on an independent set of heterozygous SNVs (n = 93) detected by exome sequencing of patient samples and cell lines demonstrated 100% accuracy.
Conclusions: Machine learning models can be trained to classify SNVs into high-confidence or low-confidence categories with high precision, thus reducing the level of confirmatory testing required. Laboratories interested in deploying such models should consider incorporating additional quality criteria and thresholds to serve as guardrails in the assessment process.
format Article
id doaj-art-1869beb0c7ea471885bc0b4777fada67
institution Kabale University
issn 1471-2164
language English
publishDate 2025-08-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj-art-1869beb0c7ea471885bc0b4777fada67; 2025-08-20T03:42:37Z; eng; BMC; BMC Genomics; 1471-2164; 2025-08-01; volume 26, issue 1, pages 1-10; 10.1186/s12864-025-11889-z; author affiliation: Labcorp (all eight authors)
title Determination of high-confidence germline genetic variants in next-generation sequencing through machine learning models: an approach to reduce the burden of orthogonal confirmation
topic Next generation sequencing
Sanger confirmation
Machine learning
Clinical decision-support tool
url https://doi.org/10.1186/s12864-025-11889-z