Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding

Bibliographic Details
Main Authors: Kun Tu, Dariusz Puchala
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Data encoding; data compression; DNA sequences; Huffman coding
Online Access: https://ieeexplore.ieee.org/document/11079985/
author Kun Tu
Dariusz Puchala
author_facet Kun Tu
Dariusz Puchala
author_sort Kun Tu
collection DOAJ
description Data storage and retrieval using DNA sequences have been extensively studied in computer and information sciences because of the growing demand for archiving large amounts of data over long periods. This study introduces an efficient approach to DNA data encoding based on p-gram Huffman coding, a highly effective technique for lossless data compression. The proposed method combines data compression and encoding while inherently guaranteeing the no-homopolymer and GC-content constraints, both of which are crucial for synthesizing DNA biologically and for ensuring the longevity and durability of the generated sequences. These guarantees are achieved using Huffman trees. The method reaches a high efficiency of 2.72 bits per nucleotide when encoding text data, and it can be extended to any type of compressible data. In addition, an error analysis of the method under substitution and deletion errors leads to two more robust variants.
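The description names two sequence constraints without defining them. As a rough illustration only, assuming the common reading that "no-homopolymer" means no nucleotide appears twice in a row and that GC content is the fraction of G and C bases in a sequence, a checker might look like the sketch below. The function names and the sample sequence are hypothetical and are not taken from the paper; the paper's p-gram Huffman encoder itself is not reproduced here.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of one repeated nucleotide (0 for an empty string)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Fraction of nucleotides in seq that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def bits_per_nucleotide(source_bits: int, nucleotides: int) -> float:
    """Source bits represented per nucleotide written (assumed efficiency measure)."""
    return source_bits / nucleotides


if __name__ == "__main__":
    seq = "ACGTGCATCG"  # made-up sequence, not from the paper
    print("longest run:", max_homopolymer_run(seq))
    print("GC content:", gc_content(seq))
```

A four-letter alphabet carries at most 2 bits per nucleotide without compression, so the reported 2.72 bits per nucleotide presumably counts bits of the original, uncompressed source per nucleotide written; `bits_per_nucleotide` reflects that assumed reading.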
format Article
id doaj-art-bd82c7c248e34dc9967fa0b07d5437ee
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-bd82c7c248e34dc9967fa0b07d5437ee
Record timestamp: 2025-08-20T03:13:37Z
Language: eng
Publisher: IEEE
Series: IEEE Access
ISSN: 2169-3536
Published: 2025-01-01
Volume: 13
Pages: 123169-123181
DOI: 10.1109/ACCESS.2025.3589079
IEEE Xplore document: 11079985
Title: Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding
Authors: Kun Tu (https://orcid.org/0000-0002-6197-0372), School of Mathematical Sciences, Yangzhou University, Yangzhou, China
Dariusz Puchala (https://orcid.org/0000-0001-9070-8042), Institute of Information Technology, Lodz University of Technology, Lodz, Poland
Online Access: https://ieeexplore.ieee.org/document/11079985/
Subjects: Data encoding; data compression; DNA sequences; Huffman coding
topic Data encoding
data compression
DNA sequences
Huffman coding
url https://ieeexplore.ieee.org/document/11079985/