Tokenization and deep learning architectures in genomics: A comprehensive review

The development of modern DNA sequencing technologies has resulted in the rapid growth of genomic data. Alongside the collection of this data, there is an increasing need for the development of modern computational tools leveraging this data for tasks including but not limited to antimicrobial resis...

Bibliographic Details
Main Authors: Conrad Testagrose, Christina Boucher
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Deep learning; Large language models; Tokenization; Genomics; DNA sequencing
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037025003022
_version_ 1849397967135440896
author Conrad Testagrose
Christina Boucher
collection DOAJ
description The development of modern DNA sequencing technologies has resulted in the rapid growth of genomic data. Alongside the collection of this data, there is an increasing need for the development of modern computational tools leveraging this data for tasks including but not limited to antimicrobial resistance and gene annotation. Current deep learning architectures and tokenization techniques have been explored for the extraction of meaningful underlying information contained within this sequencing data. We aim to survey current and foundational literature surrounding the area of deep learning architectures and tokenization techniques in the field of genomics. Our survey of the literature outlines that significant work remains in developing efficient tokenization techniques that can capture or model underlying motifs within DNA sequences. While deep learning models have become more efficient, many current tokenization methods either reduce scalability through naive sequence representation, incorrectly model motifs or are borrowed directly from NLP tasks for use with biological sequences. Current and future model architectures should seek to implement and support more advanced, and biologically relevant, tokenization techniques to more effectively model the underlying information in biological sequencing data.
format Article
id doaj-art-8d949462e2834af1a40dc57bee5bee5b
institution Kabale University
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj-art-8d949462e2834af1a40dc57bee5bee5b
2025-08-20T03:38:48Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2025-01-01
Volume 27, pages 3547-3555
10.1016/j.csbj.2025.07.038
Tokenization and deep learning architectures in genomics: A comprehensive review
Conrad Testagrose (Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, United States)
Christina Boucher (corresponding author; Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, United States)
http://www.sciencedirect.com/science/article/pii/S2001037025003022
Deep learning
Large language models
Tokenization
Genomics
DNA sequencing
title Tokenization and deep learning architectures in genomics: A comprehensive review
topic Deep learning
Large language models
Tokenization
Genomics
DNA sequencing
url http://www.sciencedirect.com/science/article/pii/S2001037025003022