SVLearn: a dual-reference machine learning approach enables accurate cross-species genotyping of structural variants

Bibliographic Details
Main Authors: Qimeng Yang, Jianfeng Sun, Xinyu Wang, Jiong Wang, Quanzhong Liu, Jinlong Ru, Xin Zhang, Sizhe Wang, Ran Hao, Peipei Bian, Xuelei Dai, Mian Gong, Zhuangbiao Zhang, Ao Wang, Fengting Bai, Ran Li, Yudong Cai, Yu Jiang
Format: Article
Language: English
Published: Nature Portfolio 2025-03-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-57756-z
collection DOAJ
description Abstract Structural variations (SVs) are diverse forms of genetic alteration and drive a wide range of human diseases. Accurately genotyping SVs from short-read sequencing data, particularly those occurring in repetitive genomic regions, remains challenging. Here, we introduce SVLearn, a machine-learning approach for genotyping bi-allelic SVs. It exploits a dual-reference strategy to engineer a curated set of genomic, alignment, and genotyping features based on a reference genome in concert with an allele-based alternative genome. Using 38,613 human-derived SVs, we show that SVLearn significantly outperforms four state-of-the-art tools, with precision improvements of up to 15.61% for insertions and 13.75% for deletions in repetitive regions. On two additional sets of 121,435 cattle SVs and 113,042 sheep SVs, SVLearn generalizes well to cross-species SV genotyping, with a weighted genotype concordance score of up to 90%. Notably, SVLearn enables accurate genotyping of SVs at low sequencing coverage, with accuracy comparable to that at 30× coverage. Our studies suggest that SVLearn can accelerate the understanding of associations between genome-scale, high-quality genotyped SVs and diseases across multiple species.
id doaj-art-139943e436484dac92e71bf403bd00fd
institution DOAJ
issn 2041-1723
Record updated: 2025-08-20T03:01:39Z
Citation: Nature Communications, vol. 16, no. 1, pp. 1-14 (2025-03-01), https://doi.org/10.1038/s41467-025-57756-z
Author Affiliations:
Qimeng Yang: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Jianfeng Sun: Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford
Xinyu Wang: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Jiong Wang: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Quanzhong Liu: College of Information Engineering, Northwest A&F University
Jinlong Ru: Institute of Virology, Helmholtz Centre Munich - German Research Centre for Environmental Health
Xin Zhang: College of Information Engineering, Northwest A&F University
Sizhe Wang: College of Information Engineering, Northwest A&F University
Ran Hao: College of Information Engineering, Northwest A&F University
Peipei Bian: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Xuelei Dai: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Mian Gong: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Zhuangbiao Zhang: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Ao Wang: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Fengting Bai: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Ran Li: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Yudong Cai: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University
Yu Jiang: Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University