Predict the degree of secondary structures of the encoding sequences in DNA storage by deep learning model

Bibliographic Details
Main Authors: Wanmin Lin, Ling Chu, Xiangyu Yao, Zhihua Chen, Peng Xu, Wenbin Liu
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Subjects: DNA storage; Secondary structure; Synthesis; Sequencing; Deep learning model
Online Access: https://doi.org/10.1038/s41598-025-05717-3
collection DOAJ
description Abstract: DNA storage is widely regarded as a promising alternative for storing exponentially growing data. However, complex secondary structures inherent to DNA sequences severely compromise synthesis, PCR amplification, and sequencing, and so interfere with reliable information recovery. In large-scale storage applications, effectively circumventing these negative effects is a critical problem. Because secondary structures are formed by contiguous bases with reverse-complementary relations and are accompanied by released free energy, we construct a BiLSTM-Transformer model with k-mer embedding to predict the free energy of sequences and then screen out sequences with high values. K-mer embedding captures the characteristics of contiguous base pairings through overlapping short subsequences, which further facilitates free-energy prediction. Simulation results show that the BiLSTM-Transformer model with k-mer embedding predicts better than the other deep learning models compared. Application to a real dataset shows that the proposed model can screen out the top high-risk sequences, which are prone to more read errors and fewer retrieved copies in real DNA storage. Screening for top high-risk sequences can serve as a proactive step to prevent severe secondary structures, providing a solution for reliable information retrieval.
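The overlapping k-mer idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k = 3, and the integer vocabulary are assumptions made for illustration only, and the embedding layer and BiLSTM-Transformer model themselves are not shown.

```python
from itertools import product

# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers with stride 1 (k is an assumed value)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k=3):
    """Map every possible k-mer over A, C, G, T to an integer index (hypothetical vocabulary)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def encode(seq, k=3):
    """Turn a sequence into k-mer indices, the usual input to an embedding layer."""
    vocab = build_vocab(k)
    return [vocab[token] for token in kmer_tokens(seq, k)]


def reverse_complement(seq):
    """Return the reverse complement, the relation that lets two stretches pair up."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


if __name__ == "__main__":
    seq = "ACGTTGCA"
    print(kmer_tokens(seq))
    print(encode(seq))
    print(reverse_complement("ACGG"))
```

Because adjacent tokens share k - 1 bases, each token carries local context about contiguous bases, which is the property the abstract credits for helping free-energy prediction.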
id doaj-art-3749f9be815a4707bae17a9f68b03947
institution Kabale University
issn 2045-2322
spelling doaj-art-3749f9be815a4707bae17a9f68b03947 2025-08-20T03:45:26Z eng Nature Portfolio Scientific Reports 2045-2322 2025-07-01 15119 10.1038/s41598-025-05717-3
Wanmin Lin, Ling Chu, Xiangyu Yao, Zhihua Chen, Peng Xu, Wenbin Liu (all: Institute of Computing Science and Technology, Guangzhou University)
topic DNA storage
Secondary structure
Synthesis
Sequencing
Deep learning model
url https://doi.org/10.1038/s41598-025-05717-3