Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding
Data storage and retrieval using DNA sequences have been extensively studied in computer and information sciences because of the increasing demand for archiving large amounts of data over long periods of time. This study introduces an efficient approach to DNA data encoding that takes advantage of p...
| Main Authors: | Kun Tu, Dariusz Puchala |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Data encoding, data compression, DNA sequences, Huffman coding |
| Online Access: | https://ieeexplore.ieee.org/document/11079985/ |
| _version_ | 1849714725695258624 |
|---|---|
| author | Kun Tu; Dariusz Puchala |
| author_facet | Kun Tu; Dariusz Puchala |
| author_sort | Kun Tu |
| collection | DOAJ |
| description | Data storage and retrieval using DNA sequences have been extensively studied in computer and information sciences because of the increasing demand for archiving large amounts of data over long periods of time. This study introduces an efficient approach to DNA data encoding that takes advantage of p-gram Huffman coding, a highly effective technique for lossless data compression. The proposed method combines data compression and encoding, while inherently guaranteeing no-homopolymer and GC-content constraints, both of which are crucial for biologically synthesizing DNA, to ensure the longevity and durability of the generated sequences. This property is achieved using Huffman trees. The proposed method achieves a high efficiency of 2.72 bits per nucleotide when encoding text data. However, it can be extended to any type of data characterized by compressibility. In addition, we perform an error analysis of the method, considering substitution and deletion errors, which result in two more robust variants. |
| format | Article |
| id | doaj-art-bd82c7c248e34dc9967fa0b07d5437ee |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-bd82c7c248e34dc9967fa0b07d5437ee; 2025-08-20T03:13:37Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 123169; 123181; 10.1109/ACCESS.2025.3589079; 11079985; Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding; Kun Tu (0), https://orcid.org/0000-0002-6197-0372; Dariusz Puchala (1), https://orcid.org/0000-0001-9070-8042; (0) School of Mathematical Sciences, Yangzhou University, Yangzhou, China; (1) Institute of Information Technology, Lodz University of Technology, Lodz, Poland; Data storage and retrieval using DNA sequences have been extensively studied in computer and information sciences because of the increasing demand for archiving large amounts of data over long periods of time. This study introduces an efficient approach to DNA data encoding that takes advantage of p-gram Huffman coding, a highly effective technique for lossless data compression. The proposed method combines data compression and encoding, while inherently guaranteeing no-homopolymer and GC-content constraints, both of which are crucial for biologically synthesizing DNA, to ensure the longevity and durability of the generated sequences. This property is achieved using Huffman trees. The proposed method achieves a high efficiency of 2.72 bits per nucleotide when encoding text data. However, it can be extended to any type of data characterized by compressibility. In addition, we perform an error analysis of the method, considering substitution and deletion errors, which result in two more robust variants.; https://ieeexplore.ieee.org/document/11079985/; Data encoding; data compression; DNA sequences; Huffman coding |
| spellingShingle | Kun Tu; Dariusz Puchala; Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding; IEEE Access; Data encoding; data compression; DNA sequences; Huffman coding |
| title | Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding |
| title_sort | efficient data storage in dna sequences aided by p gram huffman coding |
| topic | Data encoding; data compression; DNA sequences; Huffman coding |
| url | https://ieeexplore.ieee.org/document/11079985/ |
| work_keys_str_mv | AT kuntu efficientdatastorageindnasequencesaidedbypgramhuffmancoding AT dariuszpuchala efficientdatastorageindnasequencesaidedbypgramhuffmancoding |
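
The description field in this record names two sequence constraints, no homopolymers and bounded GC content, without giving the thresholds the authors use. The sketch below is only an illustration of how a generated DNA string could be checked against such constraints. It is not the authors' encoder: the function names, the `max_run=1` reading of "no-homopolymer", and the 0.4 to 0.6 GC window are all assumptions made for this example.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of identical nucleotides."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Return the fraction of nucleotides that are G or C (0.0 if empty)."""
    if not seq:
        return 0.0
    return sum(base in ("G", "C") for base in seq) / len(seq)


def satisfies_constraints(seq: str, max_run: int = 1,
                          gc_low: float = 0.4, gc_high: float = 0.6) -> bool:
    """Check a sequence against run-length and GC-content limits.

    max_run, gc_low and gc_high are hypothetical defaults chosen for
    illustration; the record does not state the values used in the paper.
    """
    return (longest_homopolymer(seq) <= max_run
            and gc_low <= gc_content(seq) <= gc_high)


if __name__ == "__main__":
    for candidate in ("ACGTACGT", "AACG", "ATATAT"):
        print(candidate, satisfies_constraints(candidate))
```

A checker like this is useful only as a post-hoc test. According to the description, the proposed method guarantees both constraints inherently through the structure of its Huffman trees, so no separate filtering step would be needed there.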