Accelerating <i>k</i>-mer-based sequence filtering

Martayan, Igor; Vandamme, Léa; Constantinides, Bede; Cazaux, Bastien; Paperman, Charles; Limasset, Antoine

doi:10.24072/pcjournal.735

Browse by volume Browse by section Browse by topic Browse by conference

Section: Mathematical & Computational Biology
Topic: Biophysics and computational biology, Computer sciences

Accelerating k-mer-based sequence filtering

Martayan, Igor ¹ ; Vandamme, Léa ¹ ; Constantinides, Bede ² ; Cazaux, Bastien ¹ ; Paperman, Charles ³ ; Limasset, Antoine ¹

¹Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
²University of Birmingham, Birmingham, United Kingdom
³Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, France

Corresponding author(s): Martayan, Igor (igor.martayan@univ-lille.fr)

10.24072/pcjournal.735 - Peer Community Journal, Volume 6 (2026), article no. e55

Get full text PDF Peer reviewed and recommended by PCI

Abstract

Motivation. The exponential growth of global sequencing data repositories presents both analytical challenges and opportunities. While k-mer-based indexing has improved scalability over traditional alignment for identifying relevant documents, pinpointing the exact sequences matching numerous queries remains a hurdle. In particular, searching for numerous k-mers with a single large query or multiple distinct queries strains existing exact matching tools, whose performance scales poorly with an increasing number of patterns. At the same time, indexing entire vast datasets for infrequent or ad-hoc searches is often resource-prohibitive. Designing fast methods for matching a large number of k-mers without exhaustive pre-indexing is therefore critical. Contributions. We propose an efficient solution to the problem of k-mer-based sequence filtering: given a set of k-mers of interests and a threshold, quickly evaluate whether an arbitrary sequence has a number of k-mer matches above or below the threshold. Our approach demonstrates how minimizer-based based sketching, alongside SIMD acceleration, can enhance the performance of streaming searches, and is implemented as a Rust tool named K2Rmini. On a consumer laptop, K2Rmini is able to filter long reads at 2 Gbp/s. Availability. https://github.com/Malfoy/K2Rmini.

Published online: 2026-06-17

DOI: 10.24072/pcjournal.735
Type: Research article

Classification:
Keywords: High throughput sequencing · Multi-pattern matching · Minimizers · SIMD

Author's affiliations:

Martayan, Igor ¹ ; Vandamme, Léa ¹ ; Constantinides, Bede ² ; Cazaux, Bastien ¹ ; Paperman, Charles ³ ; Limasset, Antoine ¹

¹ Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
² University of Birmingham, Birmingham, United Kingdom
³ Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, France

License:

CC-BY 4.0

Copyrights: The authors retain unrestricted copyrights and publishing rights

Martayan, I.; Vandamme, L.; Constantinides, B.; Cazaux, B.; Paperman, C.; Limasset, A. Accelerating k-mer-based sequence filtering. Peer Community Journal, Volume 6 (2026), article  no. e55. https://doi.org/10.24072/pcjournal.735

@article{10_24072_pcjournal_735,
     author = {Martayan, Igor and Vandamme, L\'ea and Constantinides, Bede and Cazaux, Bastien and Paperman, Charles and Limasset, Antoine},
     title = {Accelerating~\protect\emph{k}-mer-based sequence filtering
},
     journal = {Peer Community Journal},
     eid = {e55},
     year = {2026},
     publisher = {Peer Community In},
     volume = {6},
     doi = {10.24072/pcjournal.735},
     language = {en},
     url = {https://peercommunityjournal.org/articles/10.24072/pcjournal.735/}
}

TY  - JOUR
AU  - Martayan, Igor
AU  - Vandamme, Léa
AU  - Constantinides, Bede
AU  - Cazaux, Bastien
AU  - Paperman, Charles
AU  - Limasset, Antoine
TI  - Accelerating k-mer-based sequence filtering

JO  - Peer Community Journal
PY  - 2026
VL  - 6
PB  - Peer Community In
UR  - https://peercommunityjournal.org/articles/10.24072/pcjournal.735/
DO  - 10.24072/pcjournal.735
LA  - en
ID  - 10_24072_pcjournal_735
ER  -

%0 Journal Article
%A Martayan, Igor
%A Vandamme, Léa
%A Constantinides, Bede
%A Cazaux, Bastien
%A Paperman, Charles
%A Limasset, Antoine
%T Accelerating k-mer-based sequence filtering

%J Peer Community Journal
%] e55
%D 2026
%V 6
%I Peer Community In
%U https://peercommunityjournal.org/articles/10.24072/pcjournal.735/
%R 10.24072/pcjournal.735
%G en
%F 10_24072_pcjournal_735

PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.mcb.100920

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

References
Cited by

[1] Agret, C.; Cazaux, B.; Limasset, A. Toward optimal fingerprint indexing for large scale genomics, bioRxiv (2021) | DOI

[2] Alanko, J. N.; Biagi, E.; Puglisi, S. J. Finimizers: Variable-Length Bounded-Frequency Minimizers for k-mer Sets, IEEE Transactions on Computational Biology and Bioinformatics, Volume 22 (2025) no. 2, pp. 899-910 | DOI

[3] Alanko, J. N.; Puglisi, S. J.; Vuohtoniemi, J. Small Searchable κ-Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform, SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23), Society for Industrial and Applied Mathematics, 2023, pp. 225-236 | DOI

[4] Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool, Journal of molecular biology, Volume 215 (1990) no. 3, pp. 403-410 | DOI

[5] Baire, A.; Marijon, P.; Andreace, F.; Peterlongo, P. Back to sequences: Find the origin of k-mers, Journal of Open Source Software, Volume 9 (2024) no. 101, p. 7066 | DOI

[6] Chikhi, R.; Raffestin, B.; Korobeynikov, A.; Edgar, R.; Babaian, A. Logan: planetary-scale genome assembly surveys life’s diversity, bioRxiv (2024) | DOI

[7] Constantinides, B.; Lees, J.; Crook, D. W. Deacon: fast sequence filtering and contaminant depletion, bioRxiv (2025) | DOI

[8] Crochemore, M.; Czumaj, A.; Gasieniec, L.; Lecroq, T.; Plandowski, W.; Rytter, W. Fast practical multi-pattern matching, Information Processing Letters, Volume 71 (1999) no. 3, pp. 107-113 | DOI

[9] Crosbie, N. D. grepq: A Rust application that quickly filters FASTQ files by matching sequences to a set of regular expressions, Journal of Open Source Software, Volume 10 (2025) no. 110 | DOI

[10] Darvish, M.; Seiler, E.; Mehringer, S.; Rahn, R.; Reinert, K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments, Bioinformatics, Volume 38 (2022) no. 17, pp. 4100-4108 | DOI

[11] Edgar, R. C.; Taylor, B.; Lin, V.; Altman, T.; Barbera, P.; Meleshko, D.; Lohr, D.; Novakovsky, G.; Buchfink, B.; Al-Shayeb, B.; others Petabase-scale sequence alignment catalyses viral discovery, Nature, Volume 602 (2022) no. 7895, pp. 142-147 | DOI

[12] Ekim, B.; Berger, B.; Chikhi, R. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer, Cell systems, Volume 12 (2021) no. 10, pp. 958-968 | DOI

[13] Faro, S.; Lecroq, T. The exact online string matching problem: A review of the most recent results, ACM Computing Surveys (CSUR), Volume 45 (2013) no. 2, pp. 1-42 | DOI

[14] Gallant, A. ripgrep, GitHub repository, version 15.1.0, released 22 October 2025, accessed 15 June 2026, 2025 (https://github.com/BurntSushi/ripgrep/releases/tag/15.1.0)

[15] Golan, S.; Tziony, I.; Kraus, M.; Orenstein, Y.; Shur, A. GreedyMini: generating low-density DNA minimizers, Bioinformatics, Volume 41 (2025) no. Supplement_1, p. i275-i284 | DOI

[16] Groot Koerkamp, R.; Liu, D.; Pibiri, G. E. The open-closed mod-minimizer algorithm, Algorithms for Molecular Biology, Volume 20 (2025) no. 1, p. 4 | DOI

[17] Groot Koerkamp, R.; Martayan, I. SimdMinimizers: Computing Random Minimizers, fast, 23rd International Symposium on Experimental Algorithms (SEA 2025) (Leibniz International Proceedings in Informatics (LIPIcs)), Volume 338, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2025, p. 20 | DOI

[18] Groot Koerkamp, R.; Pibiri, G. E. The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024) (Leibniz International Proceedings in Informatics (LIPIcs)), Volume 312, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2024, p. 11 | DOI

[19] Homer, N.; Stadick, S.; Lambert, S.; Stone, M.; Fennell, T. fulcrumgenomics/fqgrep: v1.1.1, Zenodo, 2025 (https://github.com/fulcrumgenomics/fqgrep) | DOI

[20] Karasikov, M.; Mustafa, H.; Rätsch, G.; Kahles, A. Lossless indexing with counting de Bruijn graphs, Genome Research, Volume 32 (2022) no. 9, pp. 1754-1764 | DOI

[21] Khan, J.; Patro, R.; Pandey, P. kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations, bioRxiv (2026) | DOI

[22] Lemane, T.; Lezzoche, N.; Lecubin, J.; Pelletier, E.; Lescot, M.; Chikhi, R.; Peterlongo, P. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA, Nature Computational Science, Volume 4 (2024) no. 2, pp. 104-109 | DOI

[23] Li, H. Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34 (2018) no. 18, pp. 3094-3100 | DOI

[24] Ma, B.; Lu, C.; Wang, Y.; Yu, J.; Zhao, K.; Xue, R.; Ren, H.; Lv, X.; Pan, R.; Zhang, J.; others A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources, Nature communications, Volume 14 (2023) no. 1 | DOI

[25] Martayan, I.; Limasset, A. Malfoy/K2Rmini: Peer Community Journal version, Zenodo, 2026 | DOI

[26] Martayan, I.; Limasset, A. Malfoy/K2Rmini_experiments: Peer Community Journal version ., Zenodo, 2026 | DOI

[27] Marchet, C.; Boucher, C.; Puglisi, S. J.; Medvedev, P.; Salson, M.; Chikhi, R. Data structures based on k-mers for querying large collections of sequencing data sets, Genome research, Volume 31 (2021) no. 1, pp. 1-12 | DOI

[28] Marchet, C.; Limasset, A. Scalable sequence database search using partitioned aggregated Bloom comb trees, Bioinformatics, Volume 39 (2023) no. Supplement_1, p. i252-i259 | DOI

[29] Marçais, G.; Pellow, D.; Bork, D.; Orenstein, Y.; Shamir, R.; Kingsford, C. Improving the performance of minimizers and winnowing schemes, Bioinformatics, Volume 33 (2017) no. 14, p. i110-i117 | DOI

[30] Mohamadi, H.; Chu, J.; Vandervalk, B. P.; Birol, I. ntHash: recursive nucleotide hashing, Bioinformatics, Volume 32 (2016) no. 22, pp. 3492-3494 | DOI

[31] Mäklin, T.; Alanko, J. N.; Biagi, E.; Puglisi, S. J. Sequence alignment with k-bounded matching statistics, bioRxiv (2025) | DOI

[32] Nayfach, S.; Shi, Z. J.; Seshadri, R.; Pollard, K. S.; Kyrpides, N. C. New insights from uncultivated genomes of the global human gut microbiome, Nature, Volume 568 (2019) no. 7753, pp. 505-510 | DOI

[33] Orenstein, Y.; Pellow, D.; Marçais, G.; Shamir, R.; Kingsford, C. Compact universal k-mer hitting sets, Algorithms in Bioinformatics: 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings 16, Springer, 2016, pp. 257-268 | DOI

[34] Parks, D. H.; Rinke, C.; Chuvochina, M.; Chaumeil, P.-A.; Woodcroft, B. J.; Evans, P. N.; Hugenholtz, P.; Tyson, G. W. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life, Nature microbiology, Volume 2 (2017) no. 11, pp. 1533-1542 | DOI

[35] Patro, R.; Bharti, S.; Singhania, P.; Dhakal, R.; Dahlstrom, T. J.; Groot Koerkamp, R. mim: A lightweight auxiliary index to enable fast, parallel, gzipped FASTQ parsing, bioRxiv (2025) | DOI

[36] Pellow, D.; Pu, L.; Ekim, B.; Kotlar, L.; Berger, B.; Shamir, R.; Orenstein, Y. Efficient minimizer orders for large values of k using minimum decycling sets, Genome Research, Volume 33 (2023) no. 7, pp. 1154-1161 | DOI

[37] Pibiri, G. E. A lossless shortcut for k-mer-based sequence filtering, Peer Community in Mathematical and Computational Biology (2026) | DOI

[38] Rahman Hera, M.; Koslicki, D. Estimating similarity and distance using FracMinHash, Algorithms for Molecular Biology, Volume 20 (2025) no. 1, pp. 1-13 | DOI

[39] Roberts, M.; Hayes, W.; Hunt, B. R.; Mount, S. M.; Yorke, J. A. Reducing storage requirements for biological sequence comparison, Bioinformatics, Volume 20 (2004) no. 18, pp. 3363-3369 | DOI

[40] Sayers, E. W.; Cavanaugh, M.; Clark, K.; Pruitt, K. D.; Sherry, S. T.; Yankie, L.; Karsch-Mizrachi, I. GenBank 2024 update, Nucleic Acids Research, Volume 52 (2024) no. D1, p. D134-D137 | DOI

[41] Schleimer, S.; Wilkerson, D. S.; Aiken, A. Winnowing: local algorithms for document fingerprinting, Proceedings of the 2003 ACM SIGMOD international conference on Management of data (SIGMOD '03), Association for Computing Machinery, New York, NY, USA, 2003, pp. 76-85 | DOI

[42] Shen, W.; Le, S.; Li, Y.; Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation, PloS one, Volume 11 (2016) no. 10 | DOI

[43] Shen, W.; Sipos, B.; Zhao, L. SeqKit2: A Swiss army knife for sequence and alignment processing, Imeta, Volume 3 (2024) no. 3 | DOI

[44] Solomon, B.; Kingsford, C. Fast search of thousands of short-read sequencing experiments, Nature biotechnology, Volume 34 (2016) no. 3, pp. 300-302 | DOI

[45] Teyssier, N.; Dobin, A. BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences, bioRxiv (2025) | DOI

[46] Vandamme, L.; Cazaux, B.; Limasset, A. K2R: Tinted de Bruijn Graphs implementation for efficient read extraction from sequencing datasets, Bioinformatics Advances, Volume 5 (2025) no. 1 | DOI

[47] Wang, X.; Hong, Y.; Chang, H.; Park, K.; Langdale, G.; Hu, J.; Zhu, H. Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs, 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), USENIX Association, 2019, pp. 631-648

[48] Zentgraf, J.; Schmitz, J. E.; Rahmann, S. Cleanifier: contamination removal from microbial sequences using spaced seeds of a human pangenome index, Bioinformatics, Volume 42 (2025) no. 1 | DOI

[49] Zheng, H.; Kingsford, C.; Marçais, G. Improved design and analysis of practical minimizers, Bioinformatics, Volume 36 (2020) no. Supplement_1, p. i119-i127 | DOI

[50] Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, W. M. Alignment-free sequence comparison: benefits, applications, and tools, Genome biology, Volume 18 (2017), pp. 1-17 | DOI

Cited by Sources:

Section: Mathematical & Computational Biology Topic: Biophysics and computational biology, Computer sciences

Accelerating k-mer-based sequence filtering

Section: Mathematical & Computational Biology
Topic: Biophysics and computational biology, Computer sciences