Enhancing Genomic Data Representation Through BERT-LSTM Hybrid Architecture
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10964250/ |
| Summary: | This study proposes a novel approach for effective genetic sequence representation, focusing on the challenges of compressing and analyzing complex genomic data. We introduce a hybrid architecture that combines Bidirectional Encoder Representations from Transformers (BERT) with Long Short-Term Memory (LSTM) networks to generate comprehensive and compact gene embeddings. Our method processes genetic sequence data through k-mer tokenization and employs BERT to capture complex patterns, followed by LSTM to preserve essential sequential information while creating fixed-size representations. Using data from 623 participants from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, we analyzed genetic sequences across 10 genes to evaluate our approach. The effectiveness of our method is demonstrated through both visualization and quantitative evaluation. The t-distributed stochastic neighbor embedding (t-SNE) visualization revealed improved clustering of gene embeddings compared to traditional approaches, while our model achieved 82% accuracy in gene classification tasks. Our findings indicate that the combination of BERT and LSTM effectively captures both local and global genetic patterns while creating meaningful compressed representations, providing a promising framework for genetic sequence analysis. |
|---|---|
| ISSN: | 2169-3536 |
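
The summary mentions k-mer tokenization as the first step before BERT. As a rough illustration only, the sketch below splits a nucleotide sequence into overlapping k-mers. The function name `kmer_tokenize`, the default `k=6`, and the `stride` parameter are assumptions made for this example; the record does not state the value of k or the windowing scheme the authors used.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers using a sliding window.

    Hypothetical helper: k and stride are illustrative defaults,
    not values taken from the article.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k across the sequence, advancing by `stride`.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    # Each token overlaps its neighbor by k - 1 bases when stride is 1.
    print(kmer_tokenize("ATGCGTAC", k=3))
```

In a pipeline like the one described, the resulting tokens would be mapped to vocabulary ids and passed to the BERT encoder, with the LSTM then reducing the per-token outputs to one fixed-size vector per gene.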