Section: Microbiology
Topic: Microbiology, Computer sciences
EnVhogDB: an extended view of the viral protein families on Earth through a vast collection of HMM profiles
Corresponding author(s): Enault, François (francois.enault@uca.fr); Galiez, Clovis (clovis.galiez@univ-grenoble-alpes.fr)
10.24072/pcjournal.627 - Peer Community Journal, Volume 5 (2025), article no. e100
Over the last twenty years, hundreds of metagenomic studies have generated millions of viral genomic sequences from a wide variety of ecosystems. Despite this, the overall genetic diversity of viruses remains elusive, both in terms of the number of protein families they encode and the diversity of these families. Indeed, although it is recognized that organizing the viral protein sequence space requires sensitive homology detection methods, such methods have never been applied at a large scale. To produce a more realistic and comprehensive view of protein diversity in the viral world, we have (i) collected thousands of viromes and identified viral contigs and proteins within them, (ii) retrieved viral proteins available in different public databases, (iii) applied sensitive similarity searches to cluster all these proteins into families, and (iv) annotated the resulting protein clusters. More than 46 million deduplicated proteins were clustered into fewer than 2.3 million protein families. After further removing genomic sequences likely of cellular origin using an iterative procedure, the remaining 2,203,457 clusters were named enVhogs (for environmental Viral homologous groups). Their multiple sequence alignments were transformed into HMMs to constitute the EnVhog database. Although only a small proportion of enVhogs were annotated (15.9%), they encompass almost half of the protein dataset (44.8%). Applied to the annotation of four recently published viromes from diverse environments (sulfuric soil, grassland, surface seawater and human gut), enVhog HMMs doubled the number of viral sequences characterized and increased the number of functionally annotated proteins by 54%-74%.
EnVhogDB, the largest comprehensive compilation of viral protein information to date, is a resource that will help determine the functions of proteins encoded in newly sequenced viral genomes and improve the accuracy of viral sequence detection tools. The EnVhog database is available at http://envhog.u-ga.fr/envhog.
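The gap between the share of annotated enVhogs (15.9%) and the share of proteins they cover (44.8%) is what one would expect if annotated clusters tend to be larger than unannotated ones. The sketch below illustrates that arithmetic on a toy dataset. The cluster sizes and annotation flags are invented for illustration and are not taken from EnVhogDB, and the function name `annotation_coverage` is hypothetical.

```python
# Toy illustration only: the cluster sizes and annotation flags below are
# invented and do not come from EnVhogDB.

def annotation_coverage(clusters):
    """Compute annotation coverage at the cluster level and at the protein level.

    clusters: list of (n_proteins, is_annotated) tuples, one per cluster.
    Returns (fraction of clusters annotated,
             fraction of proteins that belong to an annotated cluster).
    """
    n_clusters = len(clusters)
    n_proteins = sum(size for size, _ in clusters)
    annotated_sizes = [size for size, is_annotated in clusters if is_annotated]
    cluster_fraction = len(annotated_sizes) / n_clusters
    protein_fraction = sum(annotated_sizes) / n_proteins
    return cluster_fraction, protein_fraction


# One large annotated cluster and four small unannotated ones.
toy_clusters = [(40, True), (5, False), (3, False), (1, False), (1, False)]
cluster_fraction, protein_fraction = annotation_coverage(toy_clusters)
print(f"{cluster_fraction:.0%} of clusters annotated, "
      f"covering {protein_fraction:.0%} of proteins")
```

In this toy case a single annotated cluster out of five holds most of the proteins, so protein-level coverage far exceeds cluster-level coverage. The same mechanism, at a much larger scale, would be consistent with the figures reported above.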
Type: Research article
Pérez-Bucio, Rubén 1,2; Enault, François 1; Galiez, Clovis 2
CC-BY 4.0
@article{10_24072_pcjournal_627,
author = {P\'erez-Bucio, Rub\'en and Enault, Fran\c{c}ois and Galiez, Clovis},
title = {EnVhogDB: an extended view of the viral protein families on {Earth} through a vast collection of {HMM} profiles},
journal = {Peer Community Journal},
eid = {e100},
year = {2025},
publisher = {Peer Community In},
volume = {5},
doi = {10.24072/pcjournal.627},
language = {en},
url = {https://peercommunityjournal.org/articles/10.24072/pcjournal.627/}
}
TY - JOUR
AU - Pérez-Bucio, Rubén
AU - Enault, François
AU - Galiez, Clovis
TI - EnVhogDB: an extended view of the viral protein families on Earth through a vast collection of HMM profiles
JO - Peer Community Journal
PY - 2025
VL - 5
PB - Peer Community In
UR - https://peercommunityjournal.org/articles/10.24072/pcjournal.627/
DO - 10.24072/pcjournal.627
LA - en
ID - 10_24072_pcjournal_627
ER -
%0 Journal Article
%A Pérez-Bucio, Rubén
%A Enault, François
%A Galiez, Clovis
%T EnVhogDB: an extended view of the viral protein families on Earth through a vast collection of HMM profiles
%J Peer Community Journal
%] e100
%D 2025
%V 5
%I Peer Community In
%U https://peercommunityjournal.org/articles/10.24072/pcjournal.627/
%R 10.24072/pcjournal.627
%G en
%F 10_24072_pcjournal_627
Pérez-Bucio, R.; Enault, F.; Galiez, C. EnVhogDB: an extended view of the viral protein families on Earth through a vast collection of HMM profiles. Peer Community Journal, Volume 5 (2025), article no. e100. https://doi.org/10.24072/pcjournal.627
PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.microbiol.100152
Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.