Spaced Seed Data Structures for De Novo Assembly

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are...

Full description

Saved in:
Bibliographic Details
Main Authors: Inanç Birol, Justin Chu, Hamid Mohamadi, Shaun D. Jackman, Karthika Raghavan, Benjamin P. Vandervalk, Anthony Raymond, René L. Warren
Format: Article
Language:English
Published: Wiley 2015-01-01
Series:International Journal of Genomics
Online Access:http://dx.doi.org/10.1155/2015/196591
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1832556240033546240
author Inanç Birol
Justin Chu
Hamid Mohamadi
Shaun D. Jackman
Karthika Raghavan
Benjamin P. Vandervalk
Anthony Raymond
René L. Warren
author_facet Inanç Birol
Justin Chu
Hamid Mohamadi
Shaun D. Jackman
Karthika Raghavan
Benjamin P. Vandervalk
Anthony Raymond
René L. Warren
author_sort Inanç Birol
collection DOAJ
description De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
format Article
id doaj-art-074da3bfe45a457d9a147da215297915
institution Kabale University
issn 2314-436X
2314-4378
language English
publishDate 2015-01-01
publisher Wiley
record_format Article
series International Journal of Genomics
spelling doaj-art-074da3bfe45a457d9a147da2152979152025-02-03T05:45:56ZengWileyInternational Journal of Genomics2314-436X2314-43782015-01-01201510.1155/2015/196591196591Spaced Seed Data Structures for De Novo AssemblyInanç Birol0Justin Chu1Hamid Mohamadi2Shaun D. Jackman3Karthika Raghavan4Benjamin P. Vandervalk5Anthony Raymond6René L. Warren7Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaDe novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.http://dx.doi.org/10.1155/2015/196591
spellingShingle Inanç Birol
Justin Chu
Hamid Mohamadi
Shaun D. Jackman
Karthika Raghavan
Benjamin P. Vandervalk
Anthony Raymond
René L. Warren
Spaced Seed Data Structures for De Novo Assembly
International Journal of Genomics
title Spaced Seed Data Structures for De Novo Assembly
title_full Spaced Seed Data Structures for De Novo Assembly
title_fullStr Spaced Seed Data Structures for De Novo Assembly
title_full_unstemmed Spaced Seed Data Structures for De Novo Assembly
title_short Spaced Seed Data Structures for De Novo Assembly
title_sort spaced seed data structures for de novo assembly
url http://dx.doi.org/10.1155/2015/196591
work_keys_str_mv AT inancbirol spacedseeddatastructuresfordenovoassembly
AT justinchu spacedseeddatastructuresfordenovoassembly
AT hamidmohamadi spacedseeddatastructuresfordenovoassembly
AT shaundjackman spacedseeddatastructuresfordenovoassembly
AT karthikaraghavan spacedseeddatastructuresfordenovoassembly
AT benjaminpvandervalk spacedseeddatastructuresfordenovoassembly
AT anthonyraymond spacedseeddatastructuresfordenovoassembly
AT renelwarren spacedseeddatastructuresfordenovoassembly