Section: Mathematical & Computational Biology
Topic: Genetics/Genomics, Applied mathematics

General encoding of canonical k-mers

10.24072/pcjournal.323 - Peer Community Journal, Volume 3 (2023), article no. e87.

Get full text PDF Peer reviewed and recommended by PCI

To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σk k-mers and the interval [0, σk −1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σk −1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σk/2−1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.

Published online:
DOI: 10.24072/pcjournal.323
Type: Research article
Keywords: k-mers, canonical k-mers, encoding, minimal perfect hash function
Wittler, Roland 1

1 Faculty of Technology, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
License: CC-BY 4.0
Copyrights: The authors retain unrestricted copyrights and publishing rights
@article{10_24072_pcjournal_323,
     author = {Wittler, Roland},
     title = {General encoding of canonical \protect\emph{k}-mers},
     journal = {Peer Community Journal},
     eid = {e87},
     publisher = {Peer Community In},
     volume = {3},
     year = {2023},
     doi = {10.24072/pcjournal.323},
     language = {en},
     url = {https://peercommunityjournal.org/articles/10.24072/pcjournal.323/}
}
TY  - JOUR
AU  - Wittler, Roland
TI  - General encoding of canonical k-mers
JO  - Peer Community Journal
PY  - 2023
VL  - 3
PB  - Peer Community In
UR  - https://peercommunityjournal.org/articles/10.24072/pcjournal.323/
DO  - 10.24072/pcjournal.323
LA  - en
ID  - 10_24072_pcjournal_323
ER  - 
%0 Journal Article
%A Wittler, Roland
%T General encoding of canonical k-mers
%J Peer Community Journal
%D 2023
%V 3
%I Peer Community In
%U https://peercommunityjournal.org/articles/10.24072/pcjournal.323/
%R 10.24072/pcjournal.323
%G en
%F 10_24072_pcjournal_323
Wittler, Roland. General encoding of canonical k-mers. Peer Community Journal, Volume 3 (2023), article  no. e87. doi : 10.24072/pcjournal.323. https://peercommunityjournal.org/articles/10.24072/pcjournal.323/

PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.mcb.100188

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

[1] aappleby Smhasher, Online; accessed 15-August-2022, 2022 (https://github.com/aappleby/smhasher)

[2] Crusoe, M. R.; Alameldin, H. F.; Awad, S.; Boucher, E.; Caldwell, A.; Cartwright, R.; Charbonneau, A.; Constantinides, B.; Edvenson, G.; Fay, S.; Fenton, J.; Fenzl, T.; Fish, J.; Garcia-Gutierrez, L.; Garland, P.; Gluck, J.; González, I.; Guermond, S.; Guo, J.; Gupta, A.; Herr, J. R.; Howe, A.; Hyer, A.; Härpfer, A.; Irber, L.; Kidd, R.; Lin, D.; Lippi, J.; Mansour, T.; McA'Nulty, P.; McDonald, E.; Mizzi, J.; Murray, K. D.; Nahum, J. R.; Nanlohy, K.; Nederbragt, A. J.; Ortiz-Zuazaga, H.; Ory, J.; Pell, J.; Pepe-Ranney, C.; Russ, Z. N.; Schwarz, E.; Scott, C.; Seaman, J.; Sievert, S.; Simpson, J.; Skennerton, C. T.; Spencer, J.; Srinivasan, R.; Standage, D.; Stapleton, J. A.; Steinman, S. R.; Stein, J.; Taylor, B.; Trimble, W.; Wiencko, H. L.; Wright, M.; Wyss, B.; Zhang, Q.; zyme, e.; Brown, C. T. The khmer software package: enabling efficient nucleotide sequence analysis, F1000Research, Volume 4 (2015) | DOI

[3] D., W. Compact mapping from an involuted set (Approach #1), 2017 ( https://cs.stackexchange.com/questions/82644/compact-mapping-from-an-involuted-set [Online; accessed 15-August-2022].)

[4] Limasset, A.; Rizk, G.; Chikhi, R.; Peterlongo, P. Fast and scalable minimal perfect hashing for massive key sets, arXiv (2017) | DOI

[5] Medvedev, P. Minimal encodings of canonical k-mers for general alphabets and even k-mer sizes, Peer Community in Mathematical and Computational Biology (2023) | DOI

[6] Marçais, G.; Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, Volume 27 (2011) no. 6, pp. 764-770 | DOI

[7] Patro, R.; Mount, S. M.; Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms, Nature Biotechnology, Volume 32 (2014) no. 5, pp. 462-464 | DOI

[8] Pibiri, G. E. Sparse and skew hashing of K-mers, Bioinformatics, Volume 38 (2022) no. Supplement_1 | DOI

[9] Pibiri, G.; Shibuya, Y.; Limasset, A. Locality-Preserving Minimal Perfect Hashing of k-mers, arXiv (2022) | DOI

[10] Pibiri, G. E.; Trani, R. Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash, IEEE Transactions on Knowledge and Data Engineering (2023), pp. 1-12 | DOI

[11] Cui, P.; Zhang, S.; Ding, F.; Ali, S.; Xiong, L. Dynamic regulation of genome-wide pre-mRNA splicing and stress tolerance by the Sm-like protein LSm5 in Arabidopsis, Genome Biology, Volume 15 (2014) no. 1 | DOI

[12] Wittler, R. MinEncCanKmer (Version 13fecaee), Zenodo, 2023 | DOI

[13] Zentgraf, J.; Rahmann, S. Fast gapped k-mer counting with subdivided multi-way bucketed Cuckoo hash table, In: 22nd International Workshop on Algorithms in Bioinformatics (WABI). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022 | DOI

Cited by Sources: