Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction

Bibliographic Details
Main Authors: Twaha Ahmed Minai, Zubair Ahmed Shaikh, Asim Imdad Wagan, M. Kamran Azim, Syed Muhammad Muaz, Muhammad Shoaib Siddiqui
Format: Article
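Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Context-aware gene embedding pipeline (CGEP); DNA embedding; upstream; downstream; gene embedding; breast cancer
Online Access: https://ieeexplore.ieee.org/document/11048558/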
author Twaha Ahmed Minai
Zubair Ahmed Shaikh
Asim Imdad Wagan
M. Kamran Azim
Syed Muhammad Muaz
Muhammad Shoaib Siddiqui
collection DOAJ
description In this study, we propose the Context-Aware Gene Embedding Pipeline (CGEP), designed to address critical limitations in genomic deep learning by integrating both functional and regulatory genomic contexts. Unlike conventional approaches, our framework prioritizes disease-associated genes identified through contribution analysis (validated via a breast cancer case study) while remaining extensible to any pathology for which patient-derived FASTA sequences are available. The pipeline processes each target gene together with its upstream and downstream regulatory regions (±2.5 kbp), capturing promoter/enhancer dynamics critical for disease mechanisms. To enable flexible downstream analysis, CGEP generates four distinct embeddings per gene, which may be used independently by task-specific models or fused for enhanced predictive power. By using publicly available reference genomes (GRCh37/GRCh38) as healthy baselines, our method minimizes data procurement barriers and supports decentralized training paradigms that align with institutional data governance requirements. The implementation relies on lightweight feed-forward architectures, ensuring computational accessibility without sacrificing performance. Benchmarking against state-of-the-art models demonstrates competitive performance, with accuracy as high as 99.0%, an F1-score of 0.99, and an AUC-ROC of 0.99, together with superior Generalization Capacity (GC) and reduced Overfitting Likelihood (OL). To foster reproducibility, we provide open-source access to the entire pipeline, including modular scripts for data curation, embedding generation, and model training. This work bridges computational innovation with clinical pragmatism, enabling scalable and interpretable genomic analysis for precision medicine.
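The description above states that each target gene is processed together with ±2.5 kbp of flanking sequence. The record does not say how the authors compute that window, so the following is only a minimal sketch of one way to do it. The function names `flanked_window` and `extract_region`, the 0-based half-open coordinate convention, and the clamping at chromosome ends are all assumptions made for illustration; they are not taken from the CGEP code.

```python
FLANK_BP = 2_500  # +/-2.5 kbp, the flank size quoted in the description


def flanked_window(gene_start, gene_end, chrom_length, flank=FLANK_BP):
    """Return a 0-based, half-open (start, end) interval covering the gene
    plus `flank` bases on each side, clamped to the chromosome bounds.

    Hypothetical helper: the coordinate convention is an assumption.
    """
    if not 0 <= gene_start < gene_end <= chrom_length:
        raise ValueError("gene coordinates must lie within the chromosome")
    start = max(0, gene_start - flank)           # do not run off the 5' end
    end = min(chrom_length, gene_end + flank)    # do not run off the 3' end
    return start, end


def extract_region(chrom_seq, gene_start, gene_end, flank=FLANK_BP):
    """Slice the gene plus its flanks out of an in-memory chromosome string."""
    start, end = flanked_window(gene_start, gene_end, len(chrom_seq), flank)
    return chrom_seq[start:end]
```

Because the window is symmetric, the same interval is returned for genes on either strand; only the labels "upstream" and "downstream" swap sides for a reverse-strand gene.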
format Article
id doaj-art-a62efd63b31d48f7a9b03e6832c36fa5
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-a62efd63b31d48f7a9b03e6832c36fa5
2025-08-20T03:31:49Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pages 124665-124688
DOI: 10.1109/ACCESS.2025.3582511
IEEE Xplore document 11048558
Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction
Twaha Ahmed Minai (https://orcid.org/0009-0003-4691-6575), Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan
Zubair Ahmed Shaikh, Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan
Asim Imdad Wagan (https://orcid.org/0000-0001-9765-5385), Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan
M. Kamran Azim (https://orcid.org/0000-0002-6725-3315), Department of Biosciences, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan
Syed Muhammad Muaz, CodeX, Karachi, Pakistan
Muhammad Shoaib Siddiqui (https://orcid.org/0000-0002-5656-0416), Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia
https://ieeexplore.ieee.org/document/11048558/
title Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction
topic Context-aware gene embedding pipeline (CGEP)
DNA embedding
upstream
downstream
gene embedding
breast cancer
url https://ieeexplore.ieee.org/document/11048558/