General encoding of canonical k-mers
To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet si...
Saved in:
Main Author: | |
---|---|
Format: | Article |
Language: | English |
Published: |
Peer Community In
2023-09-01
|
Series: | Peer Community Journal |
Subjects: | |
Online Access: | https://peercommunityjournal.org/articles/10.24072/pcjournal.323/ |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
_version_ | 1825206423575855104 |
---|---|
author | Wittler, Roland |
author_facet | Wittler, Roland |
author_sort | Wittler, Roland |
collection | DOAJ |
description | To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σk −1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σk/2−1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.
|
format | Article |
id | doaj-art-a463430ca95d4a17aa405c118a29ed58 |
institution | Kabale University |
issn | 2804-3871 |
language | English |
publishDate | 2023-09-01 |
publisher | Peer Community In |
record_format | Article |
series | Peer Community Journal |
spelling | doaj-art-a463430ca95d4a17aa405c118a29ed582025-02-07T10:16:49ZengPeer Community InPeer Community Journal2804-38712023-09-01310.24072/pcjournal.32310.24072/pcjournal.323General encoding of canonical k-mers Wittler, Roland0https://orcid.org/0000-0002-2249-9880Faculty of Technology, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), and Center for Biotechnology (CeBiTec), Bielefeld University, GermanyTo index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σk −1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σk/2−1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet. https://peercommunityjournal.org/articles/10.24072/pcjournal.323/k-merscanonical k-mersencodingminimal perfect hash function |
spellingShingle | Wittler, Roland General encoding of canonical k-mers Peer Community Journal k-mers canonical k-mers encoding minimal perfect hash function |
title | General encoding of canonical k-mers
|
title_full | General encoding of canonical k-mers
|
title_fullStr | General encoding of canonical k-mers
|
title_full_unstemmed | General encoding of canonical k-mers
|
title_short | General encoding of canonical k-mers
|
title_sort | general encoding of canonical k mers |
topic | k-mers canonical k-mers encoding minimal perfect hash function |
url | https://peercommunityjournal.org/articles/10.24072/pcjournal.323/ |
work_keys_str_mv | AT wittlerroland generalencodingofcanonicalkmers |