General encoding of canonical k-mers

To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet si...

Full description

Saved in:
Bibliographic Details
Main Author: Wittler, Roland
Format: Article
Language:English
Published: Peer Community In 2023-09-01
Series:Peer Community Journal
Subjects:
Online Access:https://peercommunityjournal.org/articles/10.24072/pcjournal.323/
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1825206423575855104
author Wittler, Roland
author_facet Wittler, Roland
author_sort Wittler, Roland
collection DOAJ
description To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σk −1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σk/2−1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.
format Article
id doaj-art-a463430ca95d4a17aa405c118a29ed58
institution Kabale University
issn 2804-3871
language English
publishDate 2023-09-01
publisher Peer Community In
record_format Article
series Peer Community Journal
spelling doaj-art-a463430ca95d4a17aa405c118a29ed582025-02-07T10:16:49ZengPeer Community InPeer Community Journal2804-38712023-09-01310.24072/pcjournal.32310.24072/pcjournal.323General encoding of canonical k-mers Wittler, Roland0https://orcid.org/0000-0002-2249-9880Faculty of Technology, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), and Center for Biotechnology (CeBiTec), Bielefeld University, GermanyTo index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σk −1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σk/2−1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet. https://peercommunityjournal.org/articles/10.24072/pcjournal.323/k-merscanonical k-mersencodingminimal perfect hash function
spellingShingle Wittler, Roland
General encoding of canonical k-mers
Peer Community Journal
k-mers
canonical k-mers
encoding
minimal perfect hash function
title General encoding of canonical k-mers
title_full General encoding of canonical k-mers
title_fullStr General encoding of canonical k-mers
title_full_unstemmed General encoding of canonical k-mers
title_short General encoding of canonical k-mers
title_sort general encoding of canonical k mers
topic k-mers
canonical k-mers
encoding
minimal perfect hash function
url https://peercommunityjournal.org/articles/10.24072/pcjournal.323/
work_keys_str_mv AT wittlerroland generalencodingofcanonicalkmers