Enhancing Genomic Data Representation Through BERT-LSTM Hybrid Architecture

Bibliographic Details
Main Authors: Kyeong Ho Kim, Minji Kim, Sohui Kim, Jong-Min Lee
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10964250/
Description
Summary: This study proposes a novel approach for effective genetic sequence representation, focusing on the challenges of compressing and analyzing complex genomic data. We introduce a hybrid architecture that combines Bidirectional Encoder Representations from Transformers (BERT) with Long Short-Term Memory (LSTM) networks to generate comprehensive and compact gene embeddings. Our method processes genetic sequence data through k-mer tokenization and employs BERT to capture complex patterns, followed by LSTM to preserve essential sequential information while creating fixed-size representations. Using data from 623 participants from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, we analyzed genetic sequences across 10 genes to evaluate our approach. The effectiveness of our method is demonstrated through both visualization and quantitative evaluation. The t-distributed stochastic neighbor embedding (t-SNE) visualization revealed improved clustering of gene embeddings compared to traditional approaches, while our model achieved 82% accuracy in gene classification tasks. Our findings indicate that the combination of BERT and LSTM effectively captures both local and global genetic patterns while creating meaningful compressed representations, providing a promising framework for genetic sequence analysis.
ISSN: 2169-3536
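
Illustrative note: the summary mentions k-mer tokenization as the first processing step before BERT. The sketch below shows one common way such tokenization can be done, using an overlapping sliding window over a nucleotide string. The function name `kmer_tokenize`, the default `k=6`, and the `stride` parameter are assumptions made for illustration; the record does not state the k value or windowing scheme the authors used.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers with a sliding window.

    Hypothetical helper: k and stride are illustrative defaults,
    not values taken from the article.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    # Overlapping 3-mers from a short toy sequence.
    print(kmer_tokenize("ATGCGTAC", k=3))
```

With a stride of 1, adjacent tokens overlap by k-1 bases, which preserves local context at the cost of a longer token sequence; a stride equal to k would give non-overlapping tokens instead.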