<i>mbctools</i>: A User-Friendly Metabarcoding and Cross-Platform Pipeline for Analyzing Multiple Amplicon Sequencing Data across a Large Diversity of Organisms

Barnabé, Christian; Sempéré, Guilhem; Manzanilla, Vincent; Millan, Joel Moo; Amblard-Rambert, Antoine; Waleckx, Etienne

doi:10.24072/pcjournal.501

Section: Genomics
Topic: Biology of interactions, Ecology, Genetics/genomics

mbctools: A User-Friendly Metabarcoding and Cross-Platform Pipeline for Analyzing Multiple Amplicon Sequencing Data across a Large Diversity of Organisms

Barnabé, Christian ¹ ; Sempéré, Guilhem ^2,³ ; Manzanilla, Vincent ¹ ; Millan, Joel Moo ⁴ ; Amblard-Rambert, Antoine ^1,⁴; Waleckx, Etienne ^1,⁴

¹ Institut de Recherche pour le Développement, UMR INTERTRYP IRD, CIRAD, Université de Montpellier, Montpellier, France
² Centre de Coopération Internationale en Recherche Agronomique pour le Développement, UMR INTERTRYP IRD, CIRAD, Université de Montpellier, Montpellier, France
³ South Green Bioinformatics Platform, Biodiversity, Montpellier, France
⁴ Laboratorio de Parasitología, Centro de Investigaciones Regionales “Dr Hideyo Noguchi”, Universidad Autónoma de Yucatán, Mérida, Yucatán, México

Corresponding author(s): Sempéré, Guilhem (guilhem.sempere@cirad.fr); Manzanilla, Vincent (vincent.manzanilla@ird.fr); Waleckx, Etienne (etienne.waleckx@ird.fr)

10.24072/pcjournal.501 - Peer Community Journal, Volume 4 (2024), article no. e114

Get full text PDF Peer reviewed and recommended by PCI

Abstract

We developed a python package called mbctools, designed to offer a cross-platform tool for processing amplicon data from various organisms in the context of metabarcoding studies. It can handle the most common tasks in metabarcoding pipelines such as paired-end merging, primer trimming, quality filtering, sequence denoising, zero-radius operational taxonomic unit (ZOTU) filtering, and has the capability to process multiple genetic markers simultaneously. mbctools is a menu-driven program that eliminates the need for expertise in command-line skills and ensures documentation of each analysis for reproducibility purposes. The software, designed to run in a console, offers an interactive experience, guided by keyboard inputs, assisting users along the way through data processing and hiding the complexity of command lines by letting them concentrate on selecting parameters to apply in each step of the process. In our workflow, VSEARCH is utilized for processing fastq files derived from amplicon-based Next-Generation Sequencing data. This software is a versatile open-source tool for processing amplicon sequences, offering advantages such as high speed, efficient memory usage, and the ability to handle large datasets. It provides functions for various tasks such as dereplication, clustering, chimera detection, and taxonomic assignment. VSEARCH is thus very efficient in retrieving the overall diversity of a sample. To adapt to the diversity of projects in metabarcoding, we facilitate the reprocessing of datasets with the possibility to adjust parameters. mbctools can also be launched in a headless mode, making it suited for integration into pipelines running on High-Performance Computing environments. mbctools is available at https://github.com/GuilhemSempere/mbctools, https://pypi.org/project/mbctools/.

Metadata

Published online: 2024-12-11

DOI: 10.24072/pcjournal.501
Type: Software tool

Keywords: metabarcoding, amplicon, reproducibility, VSEARCH, pipeline

Author's affiliations:

Barnabé, Christian ¹; Sempéré, Guilhem ^{2, 3}; Manzanilla, Vincent ¹; Millan, Joel Moo ⁴; Amblard-Rambert, Antoine ^{1, 4}; Waleckx, Etienne ^{1, 4}

¹ Institut de Recherche pour le Développement, UMR INTERTRYP IRD, CIRAD, Université de Montpellier, Montpellier, France
² Centre de Coopération Internationale en Recherche Agronomique pour le Développement, UMR INTERTRYP IRD, CIRAD, Université de Montpellier, Montpellier, France
³ South Green Bioinformatics Platform, Biodiversity, Montpellier, France
⁴ Laboratorio de Parasitología, Centro de Investigaciones Regionales “Dr Hideyo Noguchi”, Universidad Autónoma de Yucatán, Mérida, Yucatán, México

License:

CC-BY 4.0

Copyrights: The authors retain unrestricted copyrights and publishing rights

@article{10_24072_pcjournal_501,
     author = {Barnab\'e, Christian and Semp\'er\'e, Guilhem and Manzanilla, Vincent and Millan, Joel Moo and Amblard-Rambert, Antoine and Waleckx, Etienne},
     title = {\protect\emph{mbctools}: {A} {User-Friendly} {Metabarcoding} and {Cross-Platform} {Pipeline} for {Analyzing} {Multiple} {Amplicon} {Sequencing} {Data} across a {Large} {Diversity} of {Organisms}},
     journal = {Peer Community Journal},
     eid = {e114},
     year = {2024},
     publisher = {Peer Community In},
     volume = {4},
     doi = {10.24072/pcjournal.501},
     language = {en},
     url = {https://peercommunityjournal.org/articles/10.24072/pcjournal.501/}
}

TY  - JOUR
AU  - Barnabé, Christian
AU  - Sempéré, Guilhem
AU  - Manzanilla, Vincent
AU  - Millan, Joel Moo
AU  - Amblard-Rambert, Antoine
AU  - Waleckx, Etienne
TI  - mbctools: A User-Friendly Metabarcoding and Cross-Platform Pipeline for Analyzing Multiple Amplicon Sequencing Data across a Large Diversity of Organisms
JO  - Peer Community Journal
PY  - 2024
VL  - 4
PB  - Peer Community In
UR  - https://peercommunityjournal.org/articles/10.24072/pcjournal.501/
DO  - 10.24072/pcjournal.501
LA  - en
ID  - 10_24072_pcjournal_501
ER  -

%0 Journal Article
%A Barnabé, Christian
%A Sempéré, Guilhem
%A Manzanilla, Vincent
%A Millan, Joel Moo
%A Amblard-Rambert, Antoine
%A Waleckx, Etienne
%T mbctools: A User-Friendly Metabarcoding and Cross-Platform Pipeline for Analyzing Multiple Amplicon Sequencing Data across a Large Diversity of Organisms
%J Peer Community Journal
%D 2024
%V 4
%I Peer Community In
%U https://peercommunityjournal.org/articles/10.24072/pcjournal.501/
%R 10.24072/pcjournal.501
%G en
%F 10_24072_pcjournal_501

Barnabé, C.; Sempéré, G.; Manzanilla, V.; Millan, J. M.; Amblard-Rambert, A.; Waleckx, E. mbctools: A User-Friendly Metabarcoding and Cross-Platform Pipeline for Analyzing Multiple Amplicon Sequencing Data across a Large Diversity of Organisms. Peer Community Journal, Volume 4 (2024), article  no. e114. https://doi.org/10.24072/pcjournal.501

PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.genomics.100370

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

Full text

The full text below may contain a few conversion errors compared to the version of record of the published article.

Introduction

Metabarcoding consists in identifying a targeted subset of genomes within bulk samples by massive sequencing of amplicons of taxonomically informative genetic markers, also known as barcodes (Figure 1) (Taberlet et al., 2012). Over the past decade, this approach has quickly gained popularity since Hebert et al. (2003) first advocated the use of short variable DNA sequences, amplified using universal primers, for species identification, and discovery of new taxa. Its simplicity of implementation enables biologists, local governments, and NGOs to increase their understanding of spatial and temporal ecological networks (Holdaway et al., 2017). Using universal primers specific of kingdoms or subgroups of organisms of interest (e.g. "arthropods", "vertebrates", "reptiles", "amphibians", "mammals", "birds", ...), metabarcoding is highly effective for species-level identification and assessment of the whole diversity of a sample within the targeted kingdoms or subgroups of organisms of interest. For example, universal primers for vertebrates, targeting the 12S rRNA gene, have been proposed for metabarcoding studies focused on vertebrates (Riaz et al., 2011). In studies targeting animals, primers allowing the amplification of a portion of the mitochondrial marker Cytochrome oxidase 1 (COI) may be used (Hebert et al., 2016). In plants, standard DNA barcoding generally involves one to four plastid DNA regions (rbcL, matK, trnH-psbA, trnL), sometimes in combination with the internal transcribed spacers of nuclear ribosomal DNA (nrDNA, ITS) (CBOL Plant Working Group et al., 2009). For fungi, the internal transcribed spacer (ITS) regions ITS1 and ITS2 of the nuclear ribosomal DNA are the most commonly used barcodes (Schoch et al., 2012). For bacterial and archaeal communities, the 16S ribosomal RNA (16S rRNA) is the most widely used barcode and provides taxonomic resolution to the genus level (Caporaso et al., 2011). Beyond these markers, specific markers are used in various fields to genotype species. For instance, in the study of Trypanosoma cruzi, the causative agent of Chagas disease, the Glucose-6-Phosphate Isomerase (GPI) and the Cytochrome oxidase 2 (COII) genes are frequently used to distinguish between different strains (Lauthier et al., 2012; López-Domínguez et al., 2022; Barnabé et al., 2023). Although single markers are very informative, in many cases, a combination of markers is necessary to fully understand the spatial and temporal dynamics of ecological networks. A huge advantage of metabarcoding is that it is based on massive sequencing. As a consequence, all the amplicons amplified from different targeted markers can be pooled and sequenced simultaneously (Hernández-Andrade et al., 2019). This powerful approach has proven to be useful for diet profile analysis, air and water quality monitoring, biodiversity monitoring, and food quality control, e.g., beverage or ancient ecosystem composition (Leray & Knowlton, 2015; Thomsen & Willerslev, 2015; Raclariu et al., 2017). As another application example, our group has recently proposed to use this approach to untangle the transmission cycles of vector-borne pathogens and their dynamics, using the gut contents of vector insects as bulk samples, and simultaneously sequencing markers allowing for the molecular identification of vector species and/or genetic diversity, blood meal sources (using universal primers for vertebrates which may also serve as pathogen hosts), gut microbiome composition (which may modulate vectorial capacity), and pathogen diversity. Indeed, all the latter components interact together to shape transmission cycles (Dumonteil et al., 2018; Hernández-Andrade et al., 2019).

The increasing demand for efficient and versatile pipelines to analyze amplicon data has led to the development of over 30 programs and pipelines in recent years (Hakimzadeh et al., 2023). However, these applications often present some challenges in terms of reproducibility and user-friendliness for non-expert users in the field of bioinformatics. Indeed, the obtaining of pseudo quantitative taxonomic data from fastq files involves a series of complex processing steps that can be achieved with specific software implying the understanding of a large panel of parameters. To address this problem, mbctools was designed with a simplified interface, allowing scientists without advanced scripting skills to easily install, navigate, utilize the pipeline while focusing on key parameters. This accessibility ensures that non-expert users can effectively and autonomously analyze their data without experiencing difficulties in reproducibility or requiring extensive bioinformatics knowledge.

Among pipelines and algorithms used in the field of metabarcoding data processing, substantial variations in sensitivity and specificity are observed (Bailet et al., 2020; Mathon et al., 2021). From these benchmarking studies, VSEARCH appears appropriate for our applications. mbctools' main purpose is to facilitate the use of VSEARCH, which has demonstrated strong performance in handling large-scale next-generation sequencing (NGS) data (Rognes et al., 2016), and it thus could be considered a "wrapper", hiding the complexity of command-line to the users. It thus allows them to easily process sequencing reads and obtain meaningful outputs, i.e, zero-radius operational taxonomic units (ZOTUs (Edgar, 2016), almost equivalent to ASVs (amplicon sequence variants)) along with abundance information.

mbctools processes demultiplexed metabarcoding data obtained from one or multiple genetic markers to distribute cleaned and filtered sequences into ZOTUs sorted per marker, which can be used for taxonomic assignment or further analyzes such as phylogeny (Figures 1, 2). While the original goal in developing mbctools was providing assistance to novice users through its interface, it can also prove useful to bioinformaticians willing to integrate it as part of a wider data processing pipeline, since its series of mandatory operations (initial analysis described below) can be launched in a headless mode by feeding the program with a configuration file. Additionally, addressing the challenge of processing large numbers of markers simultaneously across different kingdoms was one of the main motivations for designing mbctools, which allows for the simultaneous processing of multiple genetic markers and has a simplified graphical interface that guides the user through key parameters of the analysis.

Prerequisites and dependencies

Input data types - sequencing tech & lab

mbctools enables the analysis of amplicon data from multiple markers, obtained from short reads (Illumina) or long reads (Oxford Nanopore Technology, PacBio). It can handle both single-end reads and paired-end reads, provided that the latter can be merged. In the eventuality where the merging of paired-end reads is prevented by either large amplicon lengths (leading to insufficient overlap zone) or the presence of low complexity sequences, mbctools is able to utilize only forward reads for analysis. As it relies on the underlying VSEARCH software which is highly optimized in terms of efficiency and memory utilization, it is able to accommodate large datasets deriving from numerous samples and amplicons.

Running environment

mbctools is a versatile cross-platform python package running seamlessly on Windows, Unix, or OSX systems. The installation of Python3.7 (or higher) and VSEARCH (version 2.19.0 or higher, available at https://github.com/torognes/vsearch) is required to run the pipeline, and those binaries must be added to the PATH environment variable. Additionally, for Windows users, Powershell script execution must be enabled using the Set-ExecutionPolicy command, as the default Windows command prompt does not provide enough flexibility to run the software. The mbctools package, available from the Python Package Index (PyPI) (https://pypi.org/project/mbctools/), can be installed and upgraded using the command line tool pip (pip3 install mbctools).Before starting the analyses, the package requires mandatory directories and files to be pre-arranged as illustrated in Figure 3:

A default directory, named 'fastq', should contain all the fastq files corresponding to the R1 (and R2, when working with paired-ends sequencing) reads. These files are expected to be demultiplexed, meaning that each fastq file should contain reads from a single sample.

Figure 1 - Metabarcoding workflow example and area of operation of mbctools.

Figure 2 - Overview of the mbctools workflow, where each station icon represents a distinct step in the process. In the initial analysis (menu 1), data processing begins with the merging for paired-end reads (step 1) and the conversion of reads from fastq to fasta (step 2). This is followed by dereplication (step 3), denoising (step 4, which can be re-run with various options via menu 1a, 1b, 1c, 1d), chimera removal (step 5), sequence affiliation to markers (step 6), and finally, sequence orientation harmonization (step 7). Following this, primer removal (step 8) and denoised sequence abundance filtering (step 9) are applied. The last step in the analysis involves obtaining per-marker fasta files containing retained ZOTUs (step 10). The menu labeled as "4" comes into play after taxonomic assignment, tuning outputs for visualization in metaXplor and ensuring long-term data accessibility.

A default directory named 'refs' that contains one reference file per genetic marker in the multi-FASTA format. Each of them should contain one or more unaligned reference sequences, from the same strand and without gaps. Providing multiple reference sequences for a given marker improves diversity support and thus leads to a better assignment accuracy. In addition, the 'sequence orientation fixing' step runs faster when using several reference sequences. Filenames must end with the extension '.fas' and must be called by the name of the corresponding marker, referred to as locus in mbctools (Figure 3).

'lociPE.txt' provides the list of markers (one per line) corresponding to paired-end R1/R2 reads.

'lociSE.txt' provides the list of markers (one per line) corresponding to single-end reads and for which only R1 reads will be analyzed.

'samples.txt' includes a list of sample names, where each name appears on a separate line.

'fastqR1.txt' contains the names of the fastq R1 files corresponding to the samples.

'fastqR2.txt' contains the names of the fastq R2 files corresponding to the samples.

The contents of the last three files must be sorted in the same order, i.e., line (i) of 'samples.txt' shall correspond to line (i) of both 'fastqR1.txt' and 'fastqR2.txt'.

mbctools shall be launched from within the working directory in a shell console, typically by typing 'mbctools' if installed as a package.

Figure 3 - Pre-required files and directories

Software features and interface overview

mbctools features a series of menus, each containing two to four options as illustrated in Figures 4 to 8. On startup, the program displays the main menu shown in Figure 4. This menu presents descriptive keywords that enable users to easily navigate through the available menus and sections. From each menu, the user is required to select one single option at a time.

Figure 4 - Contents of the main menu, explaining navigation conventions and providing access to the four main sections.

To analyze a new dataset, it is necessary to perform a complete initial analysis by selecting the first option from both the first and subsequent menu. The initial analysis processes the data through a succession of steps such as merging, sample-level dereplication, sequence denoising, chimera detection and removal, affiliation of sequences to markers (e.g., loci), and sequence re-orientation (menu 1, Figure 5). Once those are completed, primers are removed, and the denoising process may be refined by tuning the cluster size threshold (menu 2, Figure 6). Finally, centroid sequences associated with each marker can be exported to fasta files (menu 3, Figure 7). The above represents the complete cycle of data processing. Additionally (menu 4, Figure 8), mbctools software offers the possibility to export results in a format that is compatible with the metaXplor platform (Sempéré et al., 2021). The entire process is described as supplementary material.

If, at any point, users wish to exit the program and conduct further analyses of their choice, mbctools allows it since it generates intermediate output files at each step.

Figure 5 - Contents of section 1 menu, in which only option 1, INITIAL ANALYSIS, is mandatory as it executes all steps required to proceed with menu 2. Options 1a to 1d may be used after option 1 in case the user wants to refine some parameters. Analysis 1e is optional and provides information about the reads' quality.

When option 1 (initial analysis) is selected from menu 1, the menu shown on Figure 5 allows to run the initial analysis (section 1), namely read merging, sample-level dereplication, sequence clustering (denoising), chimera detection, affiliation of sequences to markers, and sequence re-orientation. In the same menu, four other sections (1a, 1b, 1c, 1d) provide means to re-process some particular markers and/or samples with adjusted parameters. In most cases the default values for those parameters will work, but there might be situations where they need to be adapted, either globally (e.g., depending on the sequencing depth, the level of similarity between targeted genetic markers), at the marker (i.e., locus) level (e.g., according to its size), or at the sample level (e.g., to account for heterogeneous sample quality). Option 1e may be used to assess the read quality for each sample, and write results to files with suffix "*.quality.txt" in the outputs directory.

When the initial analysis is completed, menu 2 should be used to remove primer sequences, and to exclude denoised sequences with an abundance considered too low for being relevant (Figure 6). This operation may be launched either in batch mode, or sample by sample in case different thresholds need to be applied.

Figure 6 - Menu 2 combines two procedures that need to be launched successively for each marker: primer removal and application of an abundance threshold for retaining centroid sequences.

In menu 3, the program groups sample files for each marker, creating a new file per locus with the suffix "*_allseq_select.fasta" (Figure 7). During this process, mbctools can, as an option, dereplicate sequences at the project level, ensuring that each entry in the fasta file is unique. In such cases, an additional tabulated file is generated to maintain a record of sequence abundance across the samples. Optionally generated files are then named with the following suffixes: "*_allseq_select_derep.fasta", "*_allseq_select_derep.tsv".

Figure 7 - Contents of section 3 menu, allowing to obtain per-marker (per-locus) fasta files. Menu 3 provides means to generate fasta files containing the retained sequences, either for all markers (loci) at once, or for a given marker.

Finally, menu 4 converts mbctools outputs into a format suitable for importing into metaXplor (Figure 8), a web-oriented platform for storing, sharing, exploring and manipulating metagenomic pipeline outputs. metaXplor significantly contributes to enhancing the FAIR (Findable, Accessible, Interoperable, and Reusable) (Wilkinson et al., 2016) characteristics of this specific data type by serving as a robust solution for centralized data management, intuitive visualization, and long-term accessibility.

Figure 8 - Contents of section 4 menu, guiding users in obtaining necessary files to create a metaXplor import archive (i.e., sample metadata, sequence fasta and abundance, assignment to reference taxa) in order to summarize all project results.

Test data

The test data were generated in the scope of a study conducted on Chagas disease, a zoonotic parasitic illness caused by Trypanosoma cruzi. This disease is primarily transmitted to mammals through the feces and urine of hematophageous insects known as triatomines, commonly referred to as kissing bugs (Hemiptera: Reduviidae). Triatoma dimidiata is the main vector in the Yucatan Peninsula in Mexico. In 2019, the owner of an isolated rural dwelling collected thirty-three samples of T. dimidiata individuals from a conserved sylvatic site within the Balam Kú State Reserve, located in Campeche, Mexico (Figure 9). We used a metabarcoding approach to simultaneously characterize the genetic diversity of the vector and identify its blood meals, characterize its hindgut and midgut microbiome composition, and assess the genetic diversity of T. cruzi infecting T. dimidiata in this sylvatic ecotope. We extracted the total DNA from each T. dimidiata gut, and amplified different selected markers of interest for our question: (i) for T. dimidiata genotyping, a fragment of the vector's Internal Transcribed Spacer 2 (ITS-2) nuclear marker was amplified with primers ITS2_200F (5′-TCG YAT CTA GGC ATT GTC TG-3′) and ITS2_200R (5-′CTC GCA GCT ACT AAG GGA ATC C-3′) previously described in (Richards et al., 2013); (ii) for the identification of vertebrate blood meals, a fragment of the vertebrate 12S rRNA gene was amplified with primers L1085 (5′- CCC AAA CTG GGA TTA GAT ACC C-3′) and H1259 (5′- GTT TGC TGA AGA TGG CGG TA-3′) (Kitano et al., 2007; Moo-Millan et al., 2019); (iii) for the identification of hindgut and midgut bacterial microbiome composition, a fragment of the bacterial 16S rRNA gene was amplified with primers E786F (5′-GAT TAG ATA CCC TGG TAG-3′) and U926R (5'-CCG TCA ATT CCT TTR AGT TT-3') (Baker et al. 2003); and (iv) to accurately genotype T. cruzi, we targeted markers COII, GPI, ND1 and the intergenic region of the mini-exon gene, using the primers described in (10). Then we pooled the different amplicons obtained per T. dimidiata specimen and sequenced them on an Illumina MiSeq. On average, we obtained 222,244 reads per sample. The reads were processed with the mbctools package, whose outputs were imported into metaXplor. In mbctools, we retained the reads with a minimum length of 100bp, clusters with a minimum size of 8 and used an alpha value of 2. The taxonomic identification of the ZOTUs was performed using BLASTn (Johnson et al., 2008). The exported CSV file with the metadata served as input for metaXplor. The analysis results are directly available on the web application under the dataset 'mbctools_review' (“metaXplor: mbctools data set,” 2024) (Figure 9).

Figure 9 - Graphical output from the metaXplor web interface. On the left side the geographic origin of the thirty three samples. On the right side the taxonomic proportion of each identified organism

We identified an average of 33 ZOTUs per T. dimidiata sample, and only 9.61% of the reads were discarded during the analysis (Table 1). After analysis, we found thirteen blood-meal sources (the most frequently identified being Homo sapiens, Sus scrofa and Mus musculus), six bacterial Orders (Bacillales, Oceanospirillales and Micrococcales being the most abundant), along with T. cruzi and T. dimidiata (Figure 9). The test data input and output files are freely accessible on the DataSuds repository (Manzanilla et al., 2024).

Table 1 - Statistics concerning the processing of the test data set with the number of reads per sample and the attribution to each marker.

Conclusion

mbctools facilitates the processing of amplicon datasets for metabarcoding studies by offering menus that guide users through the various steps of the data analysis. It is, to our knowledge, the only tool eliminating the need for the computer literacy normally required to utilize command-line based software as complex as the underlying VSEARCH, while allowing to process multiple genetic markers simultaneously. mbctools offers the possibility to reprocess specific markers and samples with customized parameters, providing means to adapt to the diversity of metabarcoding datasets. In the end, it generates fasta files composed of the filtered ZOTUs ready for the taxonomic assignment step. Although BLASTn is a commonly used and straightforward solution for this task, given the diversity of tools and approaches (k-mer based Kraken2 (Wood et al., 2019), Kaiju (Menzel et al. 2016), phylogeny-based PhyloSift (Darling et al., 2014)), we leave to the user to proceed with taxonomic assignment, based on his own practice or the literature. To promote data management, visualization and long-term accessibility, mbctools offers a file conversion functionality compatible with the metaXplor web application which aims at centralizing online meta-omic data, while offering user-friendly means to interact with it. The metaXplor instance hosted at CIRAD (https://metaxplor.cirad.fr/) is indeed used by several teams to keep track of previous project data (be it private or public) and proves useful in helping scientists quickly recover precise information. As an example, metaXplor was utilized in the scope of 24 shotgun metagenomics projects based on VANA (Virion-Associated Nucleic Acids), conducted to uncover viral diversity (Moubset et al., 2022). It allowed for the incremental building of a data repository that efficiently manages and provides means to analyze the extensive sequence datasets thus generated. This platform facilitated similarity-based searches and phylogenetic analyses, significantly enhancing the retrieval and re-analysis of data, thereby promoting viral discovery and classification. Although we initially developed the mbctools pipeline and software for reproducibility purposes across our teams working on infectious diseases, the pipeline's strength resides in its capacity to handle diverse data and be tailored to different research needs. This tool plays a central role in training sessions provided by our teams in developing countries, and its functionalities and user-friendliness are recurrently extended according to the feedback they generate. Depending on the success of this first release, future versions may add support for new sequencing technologies, embedded taxonomic assignment solutions or phylogenetic tree building.

Authors' contributions

CB and GS designed and conceived the software based on EW's specifications and JMM's feedbacks. VM did the Python packaging. CB, GS, VM, JMM, AAR, and EW wrote the article. AAR provided the data to test mbctools. All authors read and approved the final manuscript.

Funding

This work received financial support from CONACYT (National Council of Science and Technology, Mexico) Basic Science (Project ID: CB2015-258752) and National Problems (Project ID: PN2015-893) grants awarded to EW, as well as from IRD (French National Research Institute for Sustainable Development).

Data, scripts, code, and supplementary information availability

mbctools is openly accessible on Github at https://github.com/GuilhemSempere/mbctools and on PyPI at https://pypi.org/project/mbctools/. The main pipeline description is available as supplementary material, as are higher-definition versions of Figures 1 to 3.

Acknowledgements

The authors would like to thank Frédéric Mahé for taking the time to introduce VSEARCH functionalities, as well as Dorian Grasset for testing the OSX version and creating a GitHub action to automatize PyPI package generation. Preprint version 2 of this article has been peer-reviewed and recommended by Peer Community In Genomics (https://doi.org/10.24072/pci.genomics.100370; Pollet, 2024).

Conflict of interest disclosure

The authors declare that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

References

[1] Bailet, B.; Apothéloz-Perret-Gentil, L.; Baričević, A.; Chonova, T.; Franc, A.; Frigerio, J.-M.; Kelly, M.; Mora, D.; Pfannkuchen, M.; Proft, S.; Ramon, M.; Vasselon, V.; Zimmermann, J.; Kahlert, M. Diatom DNA metabarcoding for ecological assessment: Comparison among bioinformatics pipelines used in six European countries reveals the need for standardization, Science of The Total Environment, Volume 745 (2020) | DOI

[2] Baker, G.; Smith, J.; Cowan, D. Review and re-analysis of domain-specific 16S primers, Journal of Microbiological Methods, Volume 55 (2003) no. 3, pp. 541-555 | DOI

[3] Barnabé, C.; Brenière, S. F.; Santillán-Guayasamín, S.; Douzery, E. J.; Waleckx, E. Revisiting gene typing and phylogeny of Trypanosoma cruzi reference strains: Comparison of the relevance of mitochondrial DNA, single-copy nuclear DNA, and the intergenic region of mini-exon gene, Infection, Genetics and Evolution, Volume 115 (2023) | DOI

[4] CBOL Plant Working Group,; Hollingsworth, P. M.; Forrest, L. L.; Spouge, J. L.; Hajibabaei, M.; Ratnasingham, S.; van der Bank, M.; Chase, M. W.; Cowan, R. S.; Erickson, D. L.; Fazekas, A. J.; Graham, S. W.; James, K. E.; Kim, K.-J.; Kress, W. J.; Schneider, H.; van AlphenStahl, J.; Barrett, S. C.; van den Berg, C.; Bogarin, D.; Burgess, K. S.; Cameron, K. M.; Carine, M.; Chacón, J.; Clark, A.; Clarkson, J. J.; Conrad, F.; Devey, D. S.; Ford, C. S.; Hedderson, T. A.; Hollingsworth, M. L.; Husband, B. C.; Kelly, L. J.; Kesanakurti, P. R.; Kim, J. S.; Kim, Y.-D.; Lahaye, R.; Lee, H.-L.; Long, D. G.; Madriñán, S.; Maurin, O.; Meusnier, I.; Newmaster, S. G.; Park, C.-W.; Percy, D. M.; Petersen, G.; Richardson, J. E.; Salazar, G. A.; Savolainen, V.; Seberg, O.; Wilkinson, M. J.; Yi, D.-K.; Little, D. P. A DNA barcode for land plants, Proceedings of the National Academy of Sciences, Volume 106 (2009) no. 31, pp. 12794-12797 | DOI

[5] Caporaso, J. G.; Lauber, C. L.; Walters, W. A.; Berg-Lyons, D.; Lozupone, C. A.; Turnbaugh, P. J.; Fierer, N.; Knight, R. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample, Proceedings of the National Academy of Sciences, Volume 108 (2010) no. supplement_1, pp. 4516-4522 | DOI

[6] Darling, A. E.; Jospin, G.; Lowe, E.; Matsen, F. A.; Bik, H. M.; Eisen, J. A. PhyloSift: phylogenetic analysis of genomes and metagenomes, PeerJ, Volume 2 (2014) | DOI

[7] Dumonteil, E.; Ramirez-Sierra, M.-J.; Pérez-Carrillo, S.; Teh-Poot, C.; Herrera, C.; Gourbière, S.; Waleckx, E. Detailed ecological associations of triatomines revealed by metabarcoding and next-generation sequencing: implications for triatomine behavior and Trypanosoma cruzi transmission cycles, Scientific Reports, Volume 8 (2018) no. 1 | DOI

[8] Edgar, R. UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing, bioRxiv (2016) | DOI

[9] Hakimzadeh, A.; Abdala Asbun, A.; Albanese, D.; Bernard, M.; Buchner, D.; Callahan, B.; Caporaso, J. G.; Curd, E.; Djemiel, C.; Brandström Durling, M.; Elbrecht, V.; Gold, Z.; Gweon, H. S.; Hajibabaei, M.; Hildebrand, F.; Mikryukov, V.; Normandeau, E.; Özkurt, E.; M. Palmer, J.; Pascal, G.; Porter, T. M.; Straub, D.; Vasar, M.; Větrovský, T.; Zafeiropoulos, H.; Anslan, S. A pile of pipelines: An overview of the bioinformatics software for metabarcoding data analyses, Molecular Ecology Resources, Volume 24 (2023) no. 5 | DOI

[10] Hebert, P. D. N.; Cywinska, A.; Ball, S. L.; deWaard, J. R. Biological identifications through DNA barcodes, Proceedings of the Royal Society of London. Series B: Biological Sciences, Volume 270 (2003) no. 1512, pp. 313-321 | DOI

[11] Hebert, P. D. N.; Ratnasingham, S.; Zakharov, E. V.; Telfer, A. C.; Levesque-Beaudin, V.; Milton, M. A.; Pedersen, S.; Jannetta, P.; deWaard, J. R. Counting animal species with DNA barcodes: Canadian insects, Philosophical Transactions of the Royal Society B: Biological Sciences, Volume 371 (2016) no. 1702 | DOI

[12] Hernández-Andrade, A.; Moo-Millan, J.; Cigarroa-Toledo, N.; Ramos-Ligonio, A.; Herrera, C.; Bucheton, B.; Bart, J.-M.; Jamonneau, V.; Bañuls, A.-L.; Paupy, C.; Roiz, D.; Sereno, D.; N. Ibarra-Cerdeña, C.; Machaín-Williams, C.; García-Rejón, J.; Gourbière, S.; Barnabé, C.; Telleria, J.; Oury, B.; Brenière, F.; Simard, F.; Rosado, M.; Solano, P.; Dumonteil, E.; Waleckx, E. Metabarcoding: A Powerful Yet Still Underestimated Approach for the Comprehensive Study of Vector-Borne Pathogen Transmission Cycles and Their Dynamics, Vector-Borne Diseases - Recent Developments in Epidemiology and Control, IntechOpen, 2020 | DOI

[13] Holdaway, R.; Wood, J.; Dickie, I.; Orwin, K.; Bellingham, P.; Richardson, S.; Lyver, P.; Timoti, P.; Buckley, T. Using DNA meta barcoding to assess New Zealand's terrestrial biodiversity, New Zealand Journal of Ecology, Volume 41 (2017) no. 2 | DOI

[14] Johnson, M.; Zaretskaya, I.; Raytselis, Y.; Merezhuk, Y.; McGinnis, S.; Madden, T. L. NCBI BLAST: a better web interface, Nucleic Acids Research, Volume 36 (2008) | DOI

[15] Kitano, T.; Umetsu, K.; Tian, W.; Osawa, M. Two universal primer sets for species identification among vertebrates, International Journal of Legal Medicine, Volume 121 (2006) no. 5, pp. 423-427 | DOI

[16] Lauthier, J. J.; Tomasini, N.; Barnabé, C.; Rumi, M. M. M.; D’Amato, A. M. A.; Ragone, P. G.; Yeo, M.; Lewis, M. D.; Llewellyn, M. S.; Basombrío, M. A.; Miles, M. A.; Tibayrenc, M.; Diosque, P. Candidate targets for Multilocus Sequence Typing of Trypanosoma cruzi: Validation using parasite stocks from the Chaco Region and a set of reference strains, Infection, Genetics and Evolution, Volume 12 (2012) no. 2, pp. 350-358 | DOI

[17] Leray, M.; Knowlton, N. DNA barcoding and metabarcoding of standardized samples reveal patterns of marine benthic diversity, Proceedings of the National Academy of Sciences, Volume 112 (2015) no. 7, pp. 2076-2081 | DOI

[18] López-Domínguez, J.; López-Monteon, A.; Ochoa-Martínez, P.; Dumonteil, E.; Barnabé, C.; Waleckx, E.; Hernández-Giles, R. G.; Ramos-Ligonio, A. Molecular Characterization of Four Mexican Isolates of Trypanosoma cruzi and Their Profile Susceptibility to Nifurtimox, Acta Parasitologica, Volume 67 (2022) no. 4, pp. 1584-1593 | DOI

[19] Manzanilla, V.; Barnabé, C.; Guilhem, S.; Millan, J.; Amblard-Rambert, A. W. E. Test data for mbctools, 2024 | DOI

[20] Mathon, L.; Valentini, A.; Guérin, P.; Normandeau, E.; Noel, C.; Lionnet, C.; Boulanger, E.; Thuiller, W.; Bernatchez, L.; Mouillot, D.; Dejean, T.; Manel, S. Benchmarking bioinformatic tools for fast and accurate eDNA metabarcoding species identification, Molecular Ecology Resources, Volume 21 (2021) no. 7, pp. 2565-2579 | DOI

[21] Menzel, P.; Ng, K. L.; Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju, Nature Communications, Volume 7 (2016) no. 1 | DOI

[22] Moo-Millan, J. I.; Arnal, A.; Pérez-Carrillo, S.; Hernandez-Andrade, A.; Ramírez-Sierra, M.-J.; Rosado-Vallado, M.; Dumonteil, E.; Waleckx, E. Disentangling Trypanosoma cruzi transmission cycle dynamics through the identification of blood meal sources of natural populations of Triatoma dimidiata in Yucatán, Mexico, Parasites & Vectors, Volume 12 (2019) no. 1 | DOI

[23] Moubset, O.; François, S.; Maclot, F.; Palanga, E.; Julian, C.; Claude, L.; Fernandez, E.; Rott, P.; Daugrois, J.-H.; Antoine-Lorquin, A.; Bernardo, P.; Blouin, A. G.; Temple, C.; Kraberger, S.; Fontenele, R. S.; Harkins, G. W.; Ma, Y.; Marais, A.; Candresse, T.; Chéhida, S. B.; Lefeuvre, P.; Lett, J.-M.; Varsani, A.; Massart, S.; Ogliastro, M.; Martin, D. P.; Filloux, D.; Roumagnac, P. Virion-Associated Nucleic Acid-Based Metagenomics: A Decade of Advances in Molecular Characterization of Plant Viruses, Phytopathology, Volume 112 (2022) no. 11, pp. 2253-2272 | DOI

[24] Pollet, N. One tool to metabarcode them all, Peer Community in Genomics (2024) | DOI

[25] Raclariu, A. C.; Paltinean, R.; Vlase, L.; Labarre, A.; Manzanilla, V.; Ichim, M. C.; Crisan, G.; Brysting, A. K.; de Boer, H. Comparative authentication of Hypericum perforatum herbal products using DNA metabarcoding, TLC and HPLC-MS, Scientific Reports, Volume 7 (2017) no. 1 | DOI

[26] Riaz, T.; Shehzad, W.; Viari, A.; Pompanon, F.; Taberlet, P.; Coissac, E. ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis, Nucleic Acids Research, Volume 39 (2011) no. 21 | DOI

[27] Richards, B.; de la Rua, N. M.; Monroy, C.; Stevens, L.; Dorn, P. L. Novel polymerase chain reaction-restriction fragment length polymorphism assay to determine internal transcribed spacer-2 group in the Chagas disease vector, Triatoma dimidiata (Latreille, 1811), Memórias do Instituto Oswaldo Cruz, Volume 108 (2013) no. 4, pp. 395-398 | DOI

[28] Rognes, T.; Flouri, T.; Nichols, B.; Quince, C.; Mahé, F. VSEARCH: a versatile open source tool for metagenomics, PeerJ, Volume 4 (2016) | DOI

[29] Schoch, C. L.; Seifert, K. A.; Huhndorf, S.; Robert, V.; Spouge, J. L.; Levesque, C. A.; Chen, W.; Bolchacova, E.; Voigt, K.; Crous, P. W.; Miller, A. N.; Wingfield, M. J.; Aime, M. C.; An, K.-D.; Bai, F.-Y.; Barreto, R. W.; Begerow, D.; Bergeron, M.-J.; Blackwell, M.; Boekhout, T.; Bogale, M.; Boonyuen, N.; Burgaz, A. R.; Buyck, B.; Cai, L.; Cai, Q.; Cardinali, G.; Chaverri, P.; Coppins, B. J.; Crespo, A.; Cubas, P.; Cummings, C.; Damm, U.; de Beer, Z. W.; de Hoog, G. S.; Del-Prado, R.; Dentinger, B.; Diéguez-Uribeondo, J.; Divakar, P. K.; Douglas, B.; Dueñas, M.; Duong, T. A.; Eberhardt, U.; Edwards, J. E.; Elshahed, M. S.; Fliegerova, K.; Furtado, M.; García, M. A.; Ge, Z.-W.; Griffith, G. W.; Griffiths, K.; Groenewald, J. Z.; Groenewald, M.; Grube, M.; Gryzenhout, M.; Guo, L.-D.; Hagen, F.; Hambleton, S.; Hamelin, R. C.; Hansen, K.; Harrold, P.; Heller, G.; Herrera, C.; Hirayama, K.; Hirooka, Y.; Ho, H.-M.; Hoffmann, K.; Hofstetter, V.; Högnabba, F.; Hollingsworth, P. M.; Hong, S.-B.; Hosaka, K.; Houbraken, J.; Hughes, K.; Huhtinen, S.; Hyde, K. D.; James, T.; Johnson, E. M.; Johnson, J. E.; Johnston, P. R.; Jones, E. G.; Kelly, L. J.; Kirk, P. M.; Knapp, D. G.; Kõljalg, U.; Kovács, G. M.; Kurtzman, C. P.; Landvik, S.; Leavitt, S. D.; Liggenstoffer, A. S.; Liimatainen, K.; Lombard, L.; Luangsa-ard, J. J.; Lumbsch, H. T.; Maganti, H.; Maharachchikumbura, S. S. N.; Martin, M. P.; May, T. W.; McTaggart, A. R.; Methven, A. S.; Meyer, W.; Moncalvo, J.-M.; Mongkolsamrit, S.; Nagy, L. G.; Nilsson, R. H.; Niskanen, T.; Nyilasi, I.; Okada, G.; Okane, I.; Olariaga, I.; Otte, J.; Papp, T.; Park, D.; Petkovits, T.; Pino-Bodas, R.; Quaedvlieg, W.; Raja, H. A.; Redecker, D.; Rintoul, T. L.; Ruibal, C.; Sarmiento-Ramírez, J. M.; Schmitt, I.; Schüßler, A.; Shearer, C.; Sotome, K.; Stefani, F. O.; Stenroos, S.; Stielow, B.; Stockinger, H.; Suetrong, S.; Suh, S.-O.; Sung, G.-H.; Suzuki, M.; Tanaka, K.; Tedersoo, L.; Telleria, M. T.; Tretter, E.; Untereiner, W. A.; Urbina, H.; Vágvölgyi, C.; Vialle, A.; Vu, T. D.; Walther, G.; Wang, Q.-M.; Wang, Y.; Weir, B. S.; Weiß, M.; White, M. M.; Xu, J.; Yahr, R.; Yang, Z. L.; Yurkov, A.; Zamora, J.-C.; Zhang, N.; Zhuang, W.-Y.; Schindel, D. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi, Proceedings of the National Academy of Sciences, Volume 109 (2012) no. 16, pp. 6241-6246 | DOI

[30] Sempéré, G.; Pétel, A.; Abbé, M.; Lefeuvre, P.; Roumagnac, P.; Mahé, F.; Baurens, G.; Filloux, D. metaXplor: an interactive viral and microbial metagenomic data manager, GigaScience, Volume 10 (2021) no. 2 | DOI

[31] Taberlet, P.; Coissac, E.; Pompanon, F.; Brochmann, C.; Willerslev, E. Towards next‐generation biodiversity assessment using DNA metabarcoding, Molecular Ecology, Volume 21 (2012) no. 8, pp. 2045-2050 | DOI

[32] Thomsen, P. F.; Willerslev, E. Environmental DNA – An emerging tool in conservation for monitoring past and present biodiversity, Biological Conservation, Volume 183 (2015), pp. 4-18 | DOI

[33] Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L. B.; Bourne, P. E.; Bouwman, J.; Brookes, A. J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, S.; Evelo, C. T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A. J.; Groth, P.; Goble, C.; Grethe, J. S.; Heringa, J.; ’t Hoen, P. A.; Hooft, R.; Kuhn, T.; Kok, R.; Kok, J.; Lusher, S. J.; Martone, M. E.; Mons, A.; Packer, A. L.; Persson, B.; Rocca-Serra, P.; Roos, M.; van Schaik, R.; Sansone, S.-A.; Schultes, E.; Sengstag, T.; Slater, T.; Strawn, G.; Swertz, M. A.; Thompson, M.; van der Lei, J.; van Mulligen, E.; Velterop, J.; Waagmeester, A.; Wittenburg, P.; Wolstencroft, K.; Zhao, J.; Mons, B. The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data, Volume 3 (2016) no. 1 | DOI

[34] Wood, D. E.; Lu, J.; Langmead, B. Improved metagenomic analysis with Kraken 2, Genome Biology, Volume 20 (2019) no. 1 | DOI

[35] metaXplor: mbctools data set, 2024 (https://metaxplor.cirad.fr/metaXplor/main.jsp?module=mbctools_review)

Section: Genomics Topic: Biology of interactions, Ecology, Genetics/genomics

mbctools: A User-Friendly Metabarcoding and Cross-Platform Pipeline for Analyzing Multiple Amplicon Sequencing Data across a Large Diversity of Organisms

Section: Genomics
Topic: Biology of interactions, Ecology, Genetics/genomics