SVLearn: a dual-reference machine learning approach enables accurate cross-species genotyping of structural variants
Abstract: Structural variations (SVs) are diverse forms of genetic alterations and drive a wide range of human diseases. Accurately genotyping SVs, particularly those occurring in repetitive genomic regions, from short-read sequencing data remains challenging. Here, we introduce SVLearn, a machine-learning...
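| Main Authors: | Qimeng Yang, Jianfeng Sun, Xinyu Wang, Jiong Wang, Quanzhong Liu, Jinlong Ru, Xin Zhang, Sizhe Wang, Ran Hao, Peipei Bian, Xuelei Dai, Mian Gong, Zhuangbiao Zhang, Ao Wang, Fengting Bai, Ran Li, Yudong Cai, Yu Jiang |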
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-03-01 |
| Series: | Nature Communications |
| Online Access: | https://doi.org/10.1038/s41467-025-57756-z |
| _version_ | 1849774586313310208 |
|---|---|
| author | Qimeng Yang Jianfeng Sun Xinyu Wang Jiong Wang Quanzhong Liu Jinlong Ru Xin Zhang Sizhe Wang Ran Hao Peipei Bian Xuelei Dai Mian Gong Zhuangbiao Zhang Ao Wang Fengting Bai Ran Li Yudong Cai Yu Jiang |
| collection | DOAJ |
| description | Structural variations (SVs) are diverse forms of genetic alterations and drive a wide range of human diseases. Accurately genotyping SVs, particularly those occurring in repetitive genomic regions, from short-read sequencing data remains challenging. Here, we introduce SVLearn, a machine-learning approach for genotyping bi-allelic SVs. It exploits a dual-reference strategy to engineer a curated set of genomic, alignment, and genotyping features based on a reference genome in concert with an allele-based alternative genome. Using 38,613 human-derived SVs, we show that SVLearn significantly outperforms four state-of-the-art tools, with precision improvements of up to 15.61% for insertions and 13.75% for deletions in repetitive regions. On two additional sets of 121,435 cattle SVs and 113,042 sheep SVs, SVLearn demonstrates strong generalizability in cross-species SV genotyping, with a weighted genotype concordance score of up to 90%. Notably, SVLearn enables accurate genotyping of SVs at low sequencing coverage, with accuracy comparable to that at 30× coverage. Our studies suggest that SVLearn can accelerate the understanding of associations between genome-scale, high-quality genotyped SVs and diseases across multiple species. |
| id | doaj-art-139943e436484dac92e71bf403bd00fd |
| institution | DOAJ |
| issn | 2041-1723 |
| spelling | doaj-art-139943e436484dac92e71bf403bd00fd; 2025-08-20T03:01:39Z; eng; Nature Portfolio; Nature Communications; 2041-1723; 2025-03-01; 16; 1; 1; 14; 10.1038/s41467-025-57756-z; SVLearn: a dual-reference machine learning approach enables accurate cross-species genotyping of structural variants; Qimeng Yang, Xinyu Wang, Jiong Wang, Peipei Bian, Xuelei Dai, Mian Gong, Zhuangbiao Zhang, Ao Wang, Fengting Bai, Ran Li, Yudong Cai, Yu Jiang (Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University); Jianfeng Sun (Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford); Quanzhong Liu, Xin Zhang, Sizhe Wang, Ran Hao (College of Information Engineering, Northwest A&F University); Jinlong Ru (Institute of Virology, Helmholtz Centre Munich - German Research Centre for Environmental Health); https://doi.org/10.1038/s41467-025-57756-z |
| title | SVLearn: a dual-reference machine learning approach enables accurate cross-species genotyping of structural variants |