Section: Mathematical & Computational Biology
Topic: Genetics/genomics, Computer sciences
Revisiting pangenome openness with k-mers
Corresponding author(s): Parmigiani, Luca (luca.parmigiani@uni-bielefeld.de)
10.24072/pcjournal.415 - Peer Community Journal, Volume 4 (2024), article no. e47.
Pangenomics is the study of related genomes collectively, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryotic genomes, two definitions of pangenome evolved in parallel: the gene-based approach, which defines the pangenome as the union of all genes, and the sequence-based approach, which defines the pangenome as the set of all nonredundant genomic sequences. Estimating the total size of the pangenome for a given species has been a subject of study since the very first mention of pangenomes. Traditionally, this is performed by predicting the ratio at which new genes are discovered, referred to as the openness of the species. Here, we abstract each genome as a set of items, an abstraction that is entirely agnostic of the two approaches (gene-based, sequence-based). Genes are a viable option for items, but other choices are feasible as well, e.g., genome sequence substrings of fixed length k (k-mers). In the present study, we investigate the use of k-mers to estimate the openness as an alternative to genes, and compare the results. An efficient implementation is also provided.
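To make the set-of-items abstraction concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain (non-canonical) k-mers over toy sequences invented for illustration, and the function names `kmer_set`, `growth_curve`, and `mean_growth_curve` are hypothetical. It shows how each genome can be reduced to a set of k-mers and how the number of distinct items grows as genomes are added. Averaging over all orderings is done by brute force here, so it is feasible only for a handful of genomes.

```python
from itertools import permutations
from statistics import mean


def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of distinct length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def growth_curve(item_sets: list[set[str]]) -> list[int]:
    """Cumulative number of distinct items as the sets are added in the given order."""
    seen: set[str] = set()
    sizes = []
    for items in item_sets:
        seen |= items
        sizes.append(len(seen))
    return sizes


def mean_growth_curve(item_sets: list[set[str]]) -> list[float]:
    """Average the growth curve over every ordering (brute force, small inputs only)."""
    curves = [growth_curve(list(order)) for order in permutations(item_sets)]
    return [mean(column) for column in zip(*curves)]


# Toy sequences, invented for illustration only (assumption, not data from the paper).
toy_genomes = ["ACGTACGT", "ACGTTCGT", "ACGTACGA"]
toy_sets = [kmer_set(genome, k=3) for genome in toy_genomes]

print(growth_curve(toy_sets))       # distinct 3-mers after 1, 2, 3 genomes
print(mean_growth_curve(toy_sets))  # same curve averaged over all orderings
```

The same two growth functions would work unchanged if the items were gene identifiers instead of k-mers, which is the sense in which the abstraction is agnostic of the gene-based and sequence-based definitions. How quickly the averaged curve flattens is what an openness estimate summarizes.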
Type: Research article
Parmigiani, Luca 1, 2, 3; Wittler, Roland 1, 2; Stoye, Jens 1, 2
Parmigiani, Luca; Wittler, Roland; Stoye, Jens. Revisiting pangenome openness with k-mers. Peer Community Journal, Volume 4 (2024), article no. e47. doi : 10.24072/pcjournal.415. https://peercommunityjournal.org/articles/10.24072/pcjournal.415/
PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.mcb.100185
Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.