Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction
In this study, we propose a novel Context-Aware Gene Embedding Pipeline (CGEP) designed to address critical limitations in genomic deep learning by integrating both functional and regulatory genomic contexts. Unlike conventional approaches, our framework prioritizes disease-associated genes identified through con...
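| Main Authors: | Twaha Ahmed Minai, Zubair Ahmed Shaikh, Asim Imdad Wagan, M. Kamran Azim, Syed Muhammad Muaz, Muhammad Shoaib Siddiqui |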
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
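| Subjects: | Context-aware gene embedding pipeline (CGEP); DNA embedding; upstream; downstream; gene embedding; breast cancer |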
| Online Access: | https://ieeexplore.ieee.org/document/11048558/ |
| _version_ | 1849420171730485248 |
|---|---|
| author | Twaha Ahmed Minai; Zubair Ahmed Shaikh; Asim Imdad Wagan; M. Kamran Azim; Syed Muhammad Muaz; Muhammad Shoaib Siddiqui |
| author_facet | Twaha Ahmed Minai; Zubair Ahmed Shaikh; Asim Imdad Wagan; M. Kamran Azim; Syed Muhammad Muaz; Muhammad Shoaib Siddiqui |
| author_sort | Twaha Ahmed Minai |
| collection | DOAJ |
| description | In this study, we propose a novel Context-Aware Gene Embedding Pipeline (CGEP) designed to address critical limitations in genomic deep learning by integrating both functional and regulatory genomic contexts. Unlike conventional approaches, our framework prioritizes disease-associated genes identified through contribution analysis (validated via a breast cancer case study) while remaining extensible to any pathology for which patient-derived FASTA sequences are available. The pipeline uniquely processes each target gene alongside its upstream and downstream regulatory regions (±2.5 kbp), capturing promoter/enhancer dynamics critical for disease mechanisms. To enable flexible downstream analysis, CGEP generates four distinct embeddings per gene, which may be used independently via task-specific models or fused for enhanced predictive power. By leveraging publicly available reference genomes (GRCh37/GRCh38) as healthy baselines, our method minimizes data procurement barriers and supports decentralized training paradigms that align with institutional data governance requirements. Implementation relies on lightweight feed-forward architectures, ensuring computational accessibility without sacrificing performance. Benchmarking against state-of-the-art models demonstrates competitive performance, with accuracy as high as 99.0%, an F1-score of 0.99, and an AUC-ROC of 0.99, together with superior Generalization Capacity (GC) and reduced Overfitting Likelihood (OL). To foster reproducibility, we provide open-source access to the entire pipeline, including modular scripts for data curation, embedding generation, and model training. This work bridges computational innovation with clinical pragmatism, enabling scalable and interpretable genomic analysis for precision medicine. |
| format | Article |
| id | doaj-art-a62efd63b31d48f7a9b03e6832c36fa5 |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-a62efd63b31d48f7a9b03e6832c36fa5; 2025-08-20T03:31:49Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 124665-124688; DOI 10.1109/ACCESS.2025.3582511; document 11048558; Authors: Twaha Ahmed Minai (https://orcid.org/0009-0003-4691-6575), Zubair Ahmed Shaikh, Asim Imdad Wagan (https://orcid.org/0000-0001-9765-5385), M. Kamran Azim (https://orcid.org/0000-0002-6725-3315), Syed Muhammad Muaz, Muhammad Shoaib Siddiqui (https://orcid.org/0000-0002-5656-0416); Affiliations (in record order): Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan; Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan; Department of Computer Science, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan; Department of Biosciences, Muhammad Ali Jinnah University, Karachi, Sindh, Pakistan; CodeX, Karachi, Pakistan; Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia; https://ieeexplore.ieee.org/document/11048558/; Keywords: Context-aware gene embedding pipeline (CGEP), DNA embedding, upstream, downstream, gene embedding, breast cancer |
| title | Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction |
| topic | Context-aware gene embedding pipeline (CGEP); DNA embedding; upstream; downstream; gene embedding; breast cancer |
| url | https://ieeexplore.ieee.org/document/11048558/ |
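
The description in this record states that each target gene is processed together with its upstream and downstream regulatory regions (±2.5 kbp). The sketch below illustrates only that windowing step on a plain sequence string. It is not the authors' code: the function name `gene_with_flanks`, the 0-based half-open coordinate convention, the forward-strand assumption, and the clipping of flanks at sequence ends are all assumptions made for illustration.

```python
FLANK_BP = 2500  # ±2.5 kbp, the window size given in the record's description


def gene_with_flanks(chrom_seq: str, gene_start: int, gene_end: int,
                     flank: int = FLANK_BP) -> dict:
    """Split a sequence into upstream flank, gene body, and downstream flank.

    Assumptions (not taken from the paper): coordinates are 0-based and
    half-open [gene_start, gene_end), the gene lies on the forward strand,
    and flanks are clipped where they would run past the sequence ends.
    """
    if not 0 <= gene_start < gene_end <= len(chrom_seq):
        raise ValueError("gene coordinates fall outside the sequence")
    up_start = max(0, gene_start - flank)             # clip at the left end
    down_end = min(len(chrom_seq), gene_end + flank)  # clip at the right end
    return {
        "upstream": chrom_seq[up_start:gene_start],
        "gene": chrom_seq[gene_start:gene_end],
        "downstream": chrom_seq[gene_end:down_end],
    }


# Toy sequence: 3000 bp of padding, a 1000 bp "gene", then 1000 bp of padding.
toy = "A" * 3000 + "ACGT" * 250 + "C" * 1000
regions = gene_with_flanks(toy, 3000, 4000)
print({name: len(seq) for name, seq in regions.items()})
```

In the toy case the downstream flank is shorter than 2.5 kbp because the sequence ends first. A real pipeline would read the sequences from reference and patient FASTA files and would also need to handle genes on the reverse strand.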