Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding

Bibliographic Details
Main Authors: Kun Tu, Dariusz Puchala
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11079985/
Description
Summary: Data storage and retrieval using DNA sequences have been extensively studied in computer and information sciences because of the increasing demand for archiving large amounts of data over long periods of time. This study introduces an efficient approach to DNA data encoding that takes advantage of p-gram Huffman coding, a highly effective technique for lossless data compression. The proposed method combines data compression and encoding, while inherently guaranteeing no-homopolymer and GC-content constraints, both of which are crucial for biologically synthesizing DNA, to ensure the longevity and durability of the generated sequences. This property is achieved using Huffman trees. The proposed method achieves a high efficiency of 2.72 bits per nucleotide when encoding text data. However, it can be extended to any type of data characterized by compressibility. In addition, we perform an error analysis of the method, considering substitution and deletion errors, which result in two more robust variants.
ISSN: 2169-3536
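
Illustrative sketch (not part of the catalog record, and not the authors' construction): the summary describes combining p-gram Huffman compression with a DNA mapping that avoids homopolymers. One way this general idea could look is shown below in Python, under loudly stated assumptions. The text is split into non-overlapping p-grams, a ternary Huffman code is built over them, and each ternary digit selects one of the three nucleotides that differ from the previous one, so no base is ever repeated back to back. The rotation mapping, the virtual start base "A", and all function names (pgrams, ternary_huffman, trits_to_dna, dna_to_trits, encode, decode) are hypothetical choices made for illustration. The sketch does not enforce the GC-content constraint and does not include the error-robust variants mentioned in the summary.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"


def pgrams(text, p):
    """Split text into non-overlapping p-grams (the last one may be shorter)."""
    return [text[i:i + p] for i in range(0, len(text), p)]


def ternary_huffman(freqs):
    """Build a ternary Huffman code: {symbol: tuple of digits in 0..2}."""
    if len(freqs) == 1:
        # Degenerate alphabet: give the single symbol a one-digit code.
        return {next(iter(freqs)): (0,)}
    tie = count()  # unique tie-breaker so heap never compares nodes
    heap = [(w, next(tie), sym) for sym, w in freqs.items()]
    # Pad with zero-weight dummy leaves until (leaves - 1) is even,
    # so that every merge can take exactly three nodes.
    while (len(heap) - 1) % 2 != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(3)]
        weight = sum(c[0] for c in children)
        heapq.heappush(heap, (weight, next(tie), [c[2] for c in children]))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):        # internal node with three children
            for digit, child in enumerate(node):
                walk(child, prefix + (digit,))
        elif node is not None:            # real leaf (dummy leaves are None)
            codes[node] = prefix

    walk(heap[0][2], ())
    return codes


def trits_to_dna(trits, start="A"):
    """Map each ternary digit to a base different from the previous base."""
    prev, out = start, []
    for t in trits:
        prev = [b for b in BASES if b != prev][t]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, start="A"):
    """Invert trits_to_dna."""
    prev, out = start, []
    for b in dna:
        out.append([x for x in BASES if x != prev].index(b))
        prev = b
    return out


def encode(text, p=2):
    """Return (DNA string, code table) for the given text."""
    grams = pgrams(text, p)
    codes = ternary_huffman(Counter(grams))
    trits = [t for g in grams for t in codes[g]]
    return trits_to_dna(trits), codes


def decode(dna, codes):
    """Recover the text; relies on the Huffman code being prefix-free."""
    inverse = {code: gram for gram, code in codes.items()}
    out, buf = [], ()
    for t in dna_to_trits(dna):
        buf += (t,)
        if buf in inverse:
            out.append(inverse[buf])
            buf = ()
    return "".join(out)
```

A possible usage: `dna, codes = encode("some text to archive", p=2)` followed by `decode(dna, codes)` returns the original string. The ratio `8 * len(text) / len(dna)` gives source bits per nucleotide for 8-bit characters, ignoring the cost of storing the code table. Since four nucleotides carry at most 2 bits each, a figure above 2 bits per nucleotide, such as the one quoted in the summary, is only reachable when compression is counted in.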