General encoding of canonical k-mers

Bibliographic Details
Main Author:	Wittler, Roland
Format:	Article
Language:	English
Published:	Peer Community In 2023-09-01
Series:	Peer Community Journal
Subjects:	k-mers canonical k-mers encoding minimal perfect hash function
Online Access:	https://peercommunityjournal.org/articles/10.24072/pcjournal.323/

Description
Summary:	To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σ^k k-mers and the interval [0, σ^k − 1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σ^k − 1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space- and time-efficient bit-based implementation for the DNA alphabet.
ISSN:	2804-3871
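
Illustration:	The naive encoding that the summary contrasts with can be sketched in a few lines of Python. This is not the article's minimal encoding; it only shows the base-σ rank of a k-mer and the resulting gaps when only canonical k-mers are kept. The function names (encode, reverse_complement, canonical) and the alphabet order A < C < G < T are assumptions made for this sketch.

```python
from itertools import product

DNA = "ACGT"  # assumed alphabet order: A < C < G < T
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer, alphabet=DNA):
    """Naive rank of a k-mer, read as a base-sigma number in [0, sigma**k - 1]."""
    sigma = len(alphabet)
    rank = 0
    for ch in kmer:
        rank = rank * sigma + alphabet.index(ch)
    return rank


def reverse_complement(kmer):
    """Reverse the k-mer and complement every base."""
    return "".join(COMPLEMENT[ch] for ch in reversed(kmer))


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# Collect the naive codes of all canonical 3-mers.
k = 3
codes = sorted({encode(canonical("".join(p))) for p in product(DNA, repeat=k)})

# Only half of the 4**3 codes are used, yet they are scattered over the
# full interval instead of filling [0, 4**3 / 2 - 1].
print(len(codes), "canonical codes, largest is", max(codes))
```

For odd k no DNA k-mer equals its own reverse complement, so exactly half of the codes occur. A minimal encoding as described in the summary would instead map these canonical k-mers onto a gap-free interval.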

Similar Items