MATEdb, a data repository of high-quality metazoan transcriptome assemblies to accelerate phylogenomic studies

With the advent of high throughput sequencing, the amount of genomic data available for animals (Metazoa) species has bloomed over the last decade, especially from transcriptomes due to lower sequencing costs and easier assembling process compared to genomes. Transcriptomic data sets have proven useful for phylogenomic studies, such as inference of phylogenetic interrelationships (eg, species tree reconstruction) and comparative genomics analyses (e.g., gene repertoire evolutionary dynamics). However, these data sets are often analyzed following different analytical pipelines, particularly including different software versions, leading to potential methodological biases when analyzed jointly in a comparative framework. Moreover, these analyses are computationally expensive and not affordable for a large part of the scientific community. More importantly, assembled transcriptomes are usually not deposited in public databases. Furthermore, the quality of these data sets is hardly ever taken into consideration, potentially impacting subsequent analyses such as orthology and phylogenetic or gene repertoire evolution inference. To alleviate these issues, we present Metazoan Assemblies from Transcriptomic Ensembles (MATEdb), a curated database of 335 high-quality transcriptome assemblies from different animal phyla analyzed following the same pipeline. The repository is composed, for each species, of (1) a de novo transcriptome assembly, (2) its candidate coding regions within transcripts (both at the level of nucleotide and amino acid sequences), (3) the coding regions filtered using their contamination profile (i.e., only metazoan content), (4) the longest isoform of the amino acid candidate coding regions, (5) the gene content completeness score as assessed against the BUSCO database, and (6) an orthology-based gene annotation. We complement the repository with gene annotations from high-quality genomes, which are often not straightforward to obtain from individual sequencing projects, totalling 423 high-quality genomic and transcriptomic data sets. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, that will be analyzed following the same pipeline and be permanently stored in public repositories. We believe that MATEdb will accelerate research on animal phylogenomics while saving thousands of hours of computational work in a plea for open and collaborative science.


Phylotranscriptomics and the quest for resolving the Animal Tree of Life
With the advent of high throughput sequencing techniques, genomic approaches to phylogenetics, and more specifically analyses based on transcriptomic data (commonly referred to as phylotranscriptomics), are quickly replacing single or few-gene approaches, enabling an in-depth phylogenetic interrogation of multiple lineages at an unprecedented level in a reliable manner (Cheon, Zhang, and Park 2020). Since the cost of sequencing and assembling transcriptomes is still much lower compared to genomes, this source of genomic data has been extensively exploited with the goal of accessing thousands of molecular markers that can be leveraged to explore phylogenetic interrelationships. For example, the inference of phylogenetic relationships based on a handful of genes has historically failed to provide enough resolution to illuminate animal interrelationships at both deep and shallow levels in many metazoan lineages.

Leveraging transcriptomes to interrogate gene repertoire evolution across animal lineages
The plethora of transcriptomic datasets available in public databases can potentially be harnessed not only to understand phylogenetic interrelationships but also to investigate the dynamics of gene repertoire evolution. However, its immediate use would require effective data mining and integration. More specifically, since most genes are expressed basally in most tissues (Ramsköld et al. 2009;Gu et al. 2021), transcriptomic data from a species can be used as a proxy of its proteome. The combination and joint analysis of data from several species allows the interrogation of the dynamics of gene family evolution at macroevolutionary scales. This approach has been followed to investigate evolutionary dynamics in complex gene families in clades such as land plants (Geng et al. 2021), insects (Thoma et al. 2019)  Unique among related databases, we present here Metazoan Assemblies from Transcriptomic Ensembles (MATEdb v1), a continuously updated and curated database of hundreds of high-quality transcriptome assemblies from different animal phyla. The database included both retrieved raw transcriptomic data from public databases and datasets generated by us. We performed a de novo transcriptome assembly and after a quality filtering process, we evaluated the integrity score of the gene content against the scores of the BUSCO database.
We complement the repository with high-quality genome gene annotations, which are often not readily available from sequencing projects. We provide orthology-based gene annotations for all datasets as well. We believe that MATEdb is a valuable resource that will promote and accelerate research on animal evolutionary studies while saving thousands of hours of computational work in a call for open and collaborative science.

Taxonomic coverage
Taxon sampling is represented in Figure 1 and Table S1 (these files will be updated for the subsequent versions of the database). Taxon sampling was based on dataset availability in public repositories while maximizing taxonomic representation at the order and family level over closely related species.

Figure 1.
Taxonomic representation of taxa included in MATEdb v1. Taxonomic rank (e.g., phylum, class, etc.), data type (i.e., genome or transcriptome), BUSCO score for the longest isoform datasets and proteome size is shown in the outer rings.

Analytical pipeline
The analytical pipeline followed is depicted in Figure 2. In brief, the main steps were as follows: (1) raw data were downloaded from public repositories (most of them from the Sequence Read Archive (SRA) in NCBI); (2) adapters and low-quality reads were trimmed; (3) filtered raw reads were de novo assembled; (4) assembly completeness was explored based on the percentage of metazoan single-copy genes recovered; (5) Open Reading Frames (ORFs) were inferred from the assemblies; (6) ORFs were decontaminated (eg, filtering out non-metazoan ORFs); (7) the longest isoform per gene (sensu Trinity) was parsed, which can be directly used as the input for phylogenomic studies. Further details about the software versions and parameters implemented throughout the pipeline are shown in Figure 2 (see also

Compilation of transcriptomic data ensembles
RNA-seq raw data for the selected samples (accession numbers are available as   Supplementary Material, Table S1) were retrieved from SRA (Leinonen et al. 2011) by using the fastq-dump program of the SRA Toolkit v.2.10.7 (http://ncbi.github.io/sra-tools/). Paired-end sequences were preferably chosen over single reads when possible. Moreover, raw data published in MolluscDB (Liu et al. 2021), and molluscDB (Caurcel et al. 2021) repositories were manually downloaded.
For the projects in which transcriptome data sequencing was not extracted from the whole specimen but on specific tissues, a pool was created to combine all raw data referred to the same organism and the same project. Each accession was independently downloaded, although they were considered a unique data set for the rest of the analyses.

Compilation of genomic data
Coding DNA Sequences (CDS) and proteome files including all predicted proteins within genome assembly were downloaded from the different data repositories referenced in Supplementary Material, Table S1.  Table S2). For the cases in which transcriptomes consisted of a pool of libraries belonging to a single specimen, each of the datasets was trimmed independently and then merged together during the assembly process with Trinity.

Transcriptome assembly processing
We assessed the quality of assembled transcriptomes using completeness scores of BUSCO v4.1.4 (Manni et al. 2021). BUSCO software employs sets of benchmarking universal single-copy orthologs from OrthoDB (www.orthodb.org; accessed on 10 February 2021) to provide quantitative measures of the completeness of genomic assemblies in terms of expected gene content.
Transcriptome assemblies with a gene completeness percentage (considering the sum of both complete and fragmented genes) higher than 85% were retained. As an exception to this, there are some transcriptome assemblies which have been included despite their low BUSCO score due to their taxonomic relevance (e.g., they were the only representatives of their lineage, such as in the case of Remipedia). In any case, the lowest threshold for BUSCO scores allowed in MATEdb was ca. 70%.
Candidate coding regions were predicted on assemblies, using TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) in two different steps aiming to improve the reliability of the prediction. First, all ORFs with a minimum length of 100 amino acids were extracted with the TransDecoder.LongOrfs module. Then, 25% of the total number of previously predicted ORF was computed and used as input to train the Markov Model behind the TransDecoder.Predict module, which resulted in a more accurate output than the one done with the default settings. For further details see Supplementary Material, Table S2.
Contaminant sequences present in the transcriptome assemblies were filtered out based on BlobTools2 analysis (Challis et al. 2020). First, assemblies were subjected to a Diamond (Buchfink, Xie, and Huson 2015) BLASTP search (--sensitive --max-target-seqs 1 --evalue 1e-10) against the NCBI non-redundant (NCBI NR) protein sequence database (http://www.ncbi.nlm.nih.gov/). Contigs that were classified as bacterial, plant, or fungal sequences were removed from the assembly. All commands needed during the decontamination process with BlobTools were applied through the execution of several custom scripts ("blobtools.sh" and "extract_phyla_for_blobtools.py"), included in the Github repository.
Finally, the longest ORFs of each transcript were retained as final candidate coding regions for further analyses. This step was performed by applying a modified version of the python script "choose_longest_iso.py" from (Fernández et al. 2014) to adapt to the new Trinity headers, also included in the Github repository ("fetch_longest_iso.py").

Genome data processing
We downloaded the predicted proteome for published genomes mostly from the NCBI genome library (https://www.ncbi.nlm.nih.gov/genome/). Then, gene completeness was assessed using BUSCO in protein mode. As it was established for transcriptome assemblies, only those proteome assemblies with BUSCO completeness score higher than 85% (complete plus fragmented percentages). In addition, we carried out a filtering step to keep the longest isoform of each gene using a custom script similar to that used with transcriptomes filtering ("remove_isoforms_proteome.sh"). This script takes the annotation file as a reference to create the new Trinity-like headers and finally returns the filtered proteome file. For further details on scripts, please check the Github repository.

Functional annotation of gene repertoire
The longest isoform gene list for each dataset was annotated with eggNOG-mapper v2 (Cantalapiedra et al. 2021). This permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).

Scripts and commands
The scripts and commands used for every step and the supplementary Tables S1 and   S2 are publicly available in the following repository: https://github.com/MetazoaPhylogenomicsLab/MATEdb

Files deposited in the repository
The data repository is composed of (1) a de novo transcriptome assembly, (2) its candidate coding regions within transcripts (both at the level of nucleotide and amino acid sequences), (3) the coding regions filtered using their contamination profile (ie, only metazoan content), (4) the longest isoform of the amino acid candidate coding regions, (5) the gene content completeness score as assessed against the BUSCO database, and (6) an orthology-based gene annotation. We complement the repository with genome annotations from high-quality genomes that are often not straightforward to obtain from the sequencing projects (in the case of genomes, only files (4), (5) and (6) are provided in MATEdb).

Software availability
We provide a Docker container for easy deployment of the tools used to generate the files in the database with the appropriate software versions along with their dependencies