Spaced Seed Data Structures for De Novo Assembly
De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are...
Saved in:
Main Authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Wiley
2015-01-01
|
Series: | International Journal of Genomics |
Online Access: | http://dx.doi.org/10.1155/2015/196591 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
_version_ | 1832556240033546240 |
---|---|
author | Inanç Birol Justin Chu Hamid Mohamadi Shaun D. Jackman Karthika Raghavan Benjamin P. Vandervalk Anthony Raymond René L. Warren |
author_facet | Inanç Birol Justin Chu Hamid Mohamadi Shaun D. Jackman Karthika Raghavan Benjamin P. Vandervalk Anthony Raymond René L. Warren |
author_sort | Inanç Birol |
collection | DOAJ |
description | De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction. |
format | Article |
id | doaj-art-074da3bfe45a457d9a147da215297915 |
institution | Kabale University |
issn | 2314-436X 2314-4378 |
language | English |
publishDate | 2015-01-01 |
publisher | Wiley |
record_format | Article |
series | International Journal of Genomics |
spelling | doaj-art-074da3bfe45a457d9a147da2152979152025-02-03T05:45:56ZengWileyInternational Journal of Genomics2314-436X2314-43782015-01-01201510.1155/2015/196591196591Spaced Seed Data Structures for De Novo AssemblyInanç Birol0Justin Chu1Hamid Mohamadi2Shaun D. Jackman3Karthika Raghavan4Benjamin P. Vandervalk5Anthony Raymond6René L. Warren7Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaCanada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, CanadaDe novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.http://dx.doi.org/10.1155/2015/196591 |
spellingShingle | Inanç Birol Justin Chu Hamid Mohamadi Shaun D. Jackman Karthika Raghavan Benjamin P. Vandervalk Anthony Raymond René L. Warren Spaced Seed Data Structures for De Novo Assembly International Journal of Genomics |
title | Spaced Seed Data Structures for De Novo Assembly |
title_full | Spaced Seed Data Structures for De Novo Assembly |
title_fullStr | Spaced Seed Data Structures for De Novo Assembly |
title_full_unstemmed | Spaced Seed Data Structures for De Novo Assembly |
title_short | Spaced Seed Data Structures for De Novo Assembly |
title_sort | spaced seed data structures for de novo assembly |
url | http://dx.doi.org/10.1155/2015/196591 |
work_keys_str_mv | AT inancbirol spacedseeddatastructuresfordenovoassembly AT justinchu spacedseeddatastructuresfordenovoassembly AT hamidmohamadi spacedseeddatastructuresfordenovoassembly AT shaundjackman spacedseeddatastructuresfordenovoassembly AT karthikaraghavan spacedseeddatastructuresfordenovoassembly AT benjaminpvandervalk spacedseeddatastructuresfordenovoassembly AT anthonyraymond spacedseeddatastructuresfordenovoassembly AT renelwarren spacedseeddatastructuresfordenovoassembly |