Section: Mathematical & Computational Biology
Topic: Genetics/genomics, Computer sciences

Revisiting pangenome openness with k-mers

10.24072/pcjournal.415 - Peer Community Journal, Volume 4 (2024), article no. e47.


Pangenomics is the collective study of related genomes, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of the pangenome evolved in parallel: the gene-based approach, which defines the pangenome as the union of all genes, and the sequence-based approach, which defines the pangenome as the set of all nonredundant genomic sequences. Estimating the total size of the pangenome for a given species has been a subject of study since the very first mention of pangenomes. Traditionally, this is done by predicting the rate at which new genes are discovered, referred to as the openness of the species. Here, we abstract each genome as a set of items, an abstraction that is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable choice of items, but other choices are also feasible, e.g., genome sequence substrings of fixed length k (k-mers). In the present study, we investigate the use of k-mers as an alternative to genes for estimating the openness, and compare the results. An efficient implementation is also provided.
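
To make the set-of-items abstraction concrete, the following is a minimal Python sketch, not the authors' implementation. It treats each genome as its set of k-mers, averages the number of previously unseen items contributed by the m-th genome over random genome orders, and fits a power law new(m) ≈ κ·m^(−α) by least squares in log-log space. The power-law fit is one common way to quantify openness and is assumed here rather than taken from the abstract. The function names (kmer_set, average_new_items, fit_power_law), the toy sequences, and the choice to ignore reverse complements are all illustrative assumptions.

```python
import math
import random


def kmer_set(sequence, k):
    """Return the set of all substrings of length k of `sequence`.

    Assumption: reverse complements are not merged (no canonical k-mers).
    """
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def average_new_items(item_sets, permutations=200, seed=0):
    """Average count of unseen items added by the m-th genome, m = 1..n.

    The average is taken over `permutations` random genome orders.
    """
    rng = random.Random(seed)
    order = list(range(len(item_sets)))
    totals = [0.0] * len(item_sets)
    for _ in range(permutations):
        rng.shuffle(order)
        seen = set()
        for position, index in enumerate(order):
            totals[position] += len(item_sets[index] - seen)
            seen |= item_sets[index]
    return [total / permutations for total in totals]


def fit_power_law(new_items):
    """Fit new(m) = kappa * m**(-alpha) by linear regression on logs.

    The first genome (m = 1) is skipped because it contributes all of
    its items; positions with zero new items are skipped as well.
    Returns (kappa, alpha).
    """
    points = [(math.log(m), math.log(v))
              for m, v in enumerate(new_items, start=1)
              if m >= 2 and v > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive values for m >= 2")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope


if __name__ == "__main__":
    # Toy sequences, purely illustrative.
    genomes = ["ACGTACGTGA", "ACGTACGTTT", "TTGACGTACC"]
    sets = [kmer_set(g, 4) for g in genomes]
    growth = average_new_items(sets)
    kappa, alpha = fit_power_law(growth)
    print("average new 4-mers per added genome:", growth)
    print("kappa:", kappa, "alpha:", alpha)
```

Under this kind of fit, an exponent α of at most 1 is conventionally read as an open pangenome and α above 1 as a closed one. Because the functions only operate on sets, the same code applies unchanged if the items are gene family identifiers instead of k-mers.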

DOI: 10.24072/pcjournal.415
Type: Research article

Parmigiani, Luca 1, 2, 3; Wittler, Roland 1, 2; Stoye, Jens 1, 2

1 Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University – Bielefeld, Germany
2 Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University – Bielefeld, Germany
3 Graduate School “Digital Infrastructure for the Life Sciences” (DILS), Bielefeld University – Bielefeld, Germany
License: CC-BY 4.0
Copyrights: The authors retain unrestricted copyrights and publishing rights
Parmigiani, Luca; Wittler, Roland; Stoye, Jens. Revisiting pangenome openness with k-mers. Peer Community Journal, Volume 4 (2024), article no. e47. doi: 10.24072/pcjournal.415.

PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.mcb.100185

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
