Efficient Data Storage in DNA Sequences Aided by p-gram Huffman Coding
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11079985/ |
| Summary: | Data storage and retrieval using DNA sequences have been extensively studied in computer and information sciences because of the increasing demand for archiving large amounts of data over long periods of time. This study introduces an efficient approach to DNA data encoding that takes advantage of p-gram Huffman coding, a highly effective technique for lossless data compression. The proposed method combines data compression and encoding, while inherently guaranteeing no-homopolymer and GC-content constraints, both of which are crucial for biologically synthesizing DNA, to ensure the longevity and durability of the generated sequences. This property is achieved using Huffman trees. The proposed method achieves a high efficiency of 2.72 bits per nucleotide when encoding text data. However, it can be extended to any type of data characterized by compressibility. In addition, we perform an error analysis of the method, considering substitution and deletion errors, which leads to two more robust variants. |
|---|---|
| ISSN: | 2169-3536 |
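
The summary does not spell out the authors' construction, so the following is only an illustrative sketch of the general idea, not the published method. It assumes non-overlapping p-grams, a ternary Huffman code over them, and a rotating base mapping in which each trit selects one of the three nucleotides that differ from the previous one; that mapping is a separate, well-known way to avoid homopolymers. All function names and the parameter `p` are hypothetical, and the sketch does not enforce the GC-content constraint mentioned in the summary.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"


def ternary_huffman(freqs):
    """Return a prefix-free code mapping each symbol to a string of trits."""
    tie = count()  # unique tie-breaker so heap entries never compare dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    # Pad with zero-weight dummies so every merge combines exactly 3 nodes.
    while len(heap) < 3 or len(heap) % 2 == 0:
        heap.append((0, next(tie), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for trit in "012":
            weight, _, codes = heapq.heappop(heap)
            total += weight
            for sym, code in codes.items():
                merged[sym] = trit + code
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]


def trits_to_dna(trits, prev="A"):
    """Map each trit to one of the 3 bases different from the previous base."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[int(t)]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna, given the same assumed starting base."""
    out = []
    for base in dna:
        choices = [b for b in BASES if b != prev]
        out.append(str(choices.index(base)))
        prev = base
    return "".join(out)


def encode(text, p=2):
    """Split text into p-grams, Huffman-code them, and emit a DNA string."""
    grams = [text[i:i + p] for i in range(0, len(text), p)]
    code = ternary_huffman(Counter(grams))
    trits = "".join(code[g] for g in grams)
    return trits_to_dna(trits), code


def decode(dna, code):
    """Recover the text from the DNA string and the code table."""
    inverse = {c: s for s, c in code.items()}
    out, buf = [], ""
    for t in dna_to_trits(dna):
        buf += t
        if buf in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Under these assumptions, `dna, code = encode("some text", p=2)` yields a sequence with no two equal adjacent nucleotides, and `decode(dna, code)` returns the original text. A rough efficiency figure for such a sketch would be `8 * len(text) / len(dna)` bits per nucleotide, ignoring the cost of storing the code table; it is not expected to reproduce the 2.72 bits per nucleotide reported in the summary.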