EukProt: A database of genome-scale predicted proteins across the diversity of eukaryotes

family evolution. Each species is placed within the UniEuk taxonomic framework in order to facilitate downstream analyses, and each data set is associated with a unique, persistent identifier to facilitate comparison and replication among analyses. The database is regularly updated, and all versions will be permanently stored and made available via FigShare. The current version has a number of updates, notably ‘The Comparative Set’ (TCS), a reduced taxonomic set with high estimated completeness while maintaining a substantial phylogenetic breadth, which comprises 196 predicted proteomes. A BLAST web server and graphical displays of data set completeness are available at http://evocellbio.com/eukprot/. We invite the community to provide suggestions for new data sets and new annotation features to be included in subsequent versions, with the goal of building a collaborative resource that will promote research to understand eukaryotic diversity and diversification.


Over the past 15 years, the discovery of diverse novel microbial eukaryotes, coupled with methods to reconstruct phylogenies based on hundreds of protein-coding genes (frequently referred to as phylogenomics, although the term was originally proposed as the more general integration of genome analysis and evolutionary studies [Eisen, 2003]), has led to a remarkable reshaping in our

To address this gap, we assembled a database of protein sequences, EukProt, either from newly performed protein prediction or by retrieving available sequences from a comprehensive set of species representing known eukaryotic diversity. We note that an existing database of genome-scale protein data sets, PhyloDB (https://github.com/allenlab/PhyloDB), contains 550 eukaryotic species with at least 500 proteins, as of version 1.076. PhyloDB was most recently updated in 2015; EukProt includes data made available since then, allowing it to provide more species with generally higher completeness and representing a greater phylogenetic breadth. In addition, we placed each species within a universal eukaryotic taxonomic framework, UniEuk (Berney et al., 2017), in order to ensure that the evolutionary relationships among data sets are accurately and consistently described. We also provide estimated completeness statistics for each predicted proteome based on BUSCO

We determined species and strain identities by reading the publications that described the data sets, consulting the literature for naming revisions, and comparing 18S ribosomal DNA sequences for each data set to reference sequence databases (see below for a description of how we retrieved 18S sequences). For species that were previously known by other names, we recorded these previous names in the metadata for the data set, except in cases where a species was originally assigned to a genus but not identified to the species level (e.g., Goniomonas sp., now identified as Goniomonas avonlea, is not listed as a previous name). This exception was not followed when the GenBank record description for the 18S sequence contained a different name than the one used in EukProt; in this case this previous name is always listed in the previous names field, in order to avoid confusion when retrieving 18S sequences from GenBank.

When no strain name was available for an MMETSP data set, we used the MMETSP ID as the strain name (for example, EP00362 has the strain name MMETSP1317). For strains with alternative names, we included these in a dedicated field for alternative strain names; this list is not necessarily exhaustive (i.e., it is not guaranteed to contain all alternative names for a given strain).

The full taxonomic pathways provided for all species (which follow the framework developed in the UniEuk project [Berney et al., 2017]) are not based on a fixed number of ranks, but on a free, unlimited number of taxonomic levels, in order to match phylogenetic evidence as closely as possible. This provides end-users more information and flexibility, but could also make it more difficult to summarize results of downstream analyses. Therefore, we provide three additional fields ("supergroup", "taxogroup1" and "taxogroup2") to help end-users whenever it is useful to distribute eukaryotic diversity into a fixed number of taxonomic categories of roughly equivalent phylogenetic depth or ecological relevance. The groupings called "supergroups" in the context of UniEuk resources (36 recognized lineages so far, of which 34 are included in EukProt; Microheliella and Meteora are not represented) consist of strictly monophyletic, deep-branching eukaryotic lineages of a phylogenetic depth equivalent to some of the first proposed supergroups such as Opisthokonta, Alveolata, Rhizaria, and Stramenopiles. UniEuk "supergroups" correspond roughly to the deepest phylogenetic resolution of eukaryotic relationships achievable with ribosomal RNA genes, and not necessarily to the currently recognized highest-level groupings of eukaryotes based on phylogenomic evidence (these groupings are, of course, present in the complete taxonomic pathways). UniEuk "supergroups" are therefore highly variable in relative diversity, ranging from lineages consisting of a single, orphan genus (e.g., Ancoracysta, Mantamonas, Palpitomonas), to Opisthokonta as a whole. The "taxogroup1" (of which there are 72 in EukProt) and "taxogroup2" (of which there are 198 in EukProt) levels allow further subdivision of large supergroups into lineages of relatively equivalent evolutionary or ecological relevance, based on current knowledge. These levels are more arbitrarily defined but are intended to represent strictly monophyletic groupings that match one of the levels in the complete taxonomic pathways. As an illustrative example, diatoms are in the Diatomeae taxogroup2, which is within the Ochrophyta taxogroup1, which is within the Stramenopiles supergroup. Small, ecologically and morphologically homogeneous supergroups are not subdivided further; in such cases the "taxogroup1" and "taxogroup2" levels are the same as the "supergroup" level. The same approach will be used in EukRibo, a manually-curated database of reference ribosomal RNA gene sequences developed in parallel (Berney, 2022) to help users link analyses of different types of genetic data.
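To illustrate how the three fixed-level fields can be used to summarize downstream results, the following is a minimal sketch that tallies data sets at a chosen level. The field names follow those quoted above, but the row layout, the `name` field, and the placeholder data set names are assumptions made for illustration and do not reproduce the actual EukProt metadata file.

```python
from collections import Counter

# Hypothetical metadata rows. Only the three field names and the diatom
# pathway (Stramenopiles > Ochrophyta > Diatomeae) come from the text;
# the data set names are placeholders.
rows = [
    {"name": "diatom_A", "supergroup": "Stramenopiles",
     "taxogroup1": "Ochrophyta", "taxogroup2": "Diatomeae"},
    {"name": "diatom_B", "supergroup": "Stramenopiles",
     "taxogroup1": "Ochrophyta", "taxogroup2": "Diatomeae"},
    # A small supergroup that is not subdivided: all three levels are equal.
    {"name": "orphan_A", "supergroup": "Mantamonas",
     "taxogroup1": "Mantamonas", "taxogroup2": "Mantamonas"},
]


def count_by_level(rows, level):
    """Count data sets per group at one fixed level (e.g. 'supergroup')."""
    return Counter(row[level] for row in rows)


def is_undivided(row):
    """True when a supergroup is not subdivided at the taxogroup levels."""
    return row["supergroup"] == row["taxogroup1"] == row["taxogroup2"]
```

Counting at a fixed level in this way gives categories of comparable depth, which the variable-depth full taxonomic pathways do not directly provide.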

Merging strains from the same species

In general, we only included data from a single strain/isolate per species. However, when only a single transcriptome data set was available for a given strain of a species, and there were additional published transcriptome data sets for other strains of the same species, we combined them using CD-HIT (Li & Godzik, 2006) run with default parameter values, in order to guard against the possibility that a single transcriptome might lack genes expressed only in one condition or experiment. When multiple strains were merged to produce a species' data set (there were 41 such cases), this information is indicated in the metadata for the data set. When a data set is indicated as merged, but the strain column does not list more than one strain name, this indicates that the data set was distributed as merged and we were unable to determine the exact strains used.
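A merge of this kind might be sketched as follows. This is an illustration rather than the exact procedure used for EukProt: it assumes the per-strain inputs are protein FASTA files (the `cd-hit` program; nucleotide input would require `cd-hit-est` instead), and all file and function names are hypothetical. Only the `-i` (input) and `-o` (output) options are passed, so every clustering parameter stays at its CD-HIT default.

```python
import subprocess
from pathlib import Path


def concatenate_fasta(paths, combined_path):
    """Write all per-strain FASTA files into a single combined file."""
    with open(combined_path, "w") as out:
        for path in paths:
            text = Path(path).read_text()
            out.write(text)
            # Guard against a missing final newline fusing two records.
            if text and not text.endswith("\n"):
                out.write("\n")


def cd_hit_command(combined_path, merged_path):
    """Build a cd-hit call that leaves all clustering parameters at defaults."""
    return ["cd-hit", "-i", str(combined_path), "-o", str(merged_path)]


def merge_strains(paths, combined_path, merged_path):
    """Concatenate the strain files, then cluster them with CD-HIT.

    Requires the cd-hit executable to be available on the PATH.
    """
    concatenate_fasta(paths, combined_path)
    subprocess.run(cd_hit_command(combined_path, merged_path), check=True)
```

For example, `merge_strains(["strain1.fasta", "strain2.fasta"], "combined.fasta", "merged.fasta")` would write the representative sequences of each cluster to `merged.fasta`.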

Processing steps applied to publicly available data

All sequences within each FASTA file are assigned a unique, standardized identifier based on the data set's EukProt ID and on the type of data (protein or transcriptome); this identifier is prepended to the existing FASTA header, separated by a space. Illegal characters are removed from sequences.
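The header standardization step can be sketched as below. The identifier layout used here (EukProt ID, an underscore, a one-letter data type code, and a zero-padded counter) is a hypothetical format chosen for illustration, and the function name is likewise an assumption; only the rule that the new identifier is prepended to the existing header, separated by a space, comes from the description above. Removal of illegal characters is not shown.

```python
def prefix_fasta_headers(lines, eukprot_id, data_type):
    """Prepend a standardized identifier to every FASTA header line.

    The identifier format is hypothetical: <EukProt ID>_<P|T><counter>.
    """
    type_code = {"protein": "P", "transcriptome": "T"}[data_type]
    counter = 0
    result = []
    for line in lines:
        if line.startswith(">"):
            counter += 1
            new_id = f"{eukprot_id}_{type_code}{counter:06d}"
            # Keep the original header text after the new identifier.
            result.append(f">{new_id} {line[1:]}")
        else:
            # Sequence lines pass through unchanged.
            result.append(line)
    return result
```

Keeping the original header after the space means the source identifier remains available, while tools that read only the first whitespace-delimited token see the standardized one.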

The following characters are permitted, as defined by NCBI BLAST

which is also the case for the data sets in EukProt. We used the parameter --min_contig_in_predict 200 (as it matched the default minimum contig length in Trinity). By default, we selected the proteins at Tier 2 (predictions supported by at least 2 sources). If Tier 2 produced fewer than 15,000 predicted proteins, we instead selected Tier 1. All other parameter values were left at their defaults. We did not perform gene prediction on unannotated genomes for which a transcriptome was already available for the same species (under the assumption that the gene predictions of the transcriptome would be of higher quality, due to potential errors in the gene annotation process).
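The tier selection rule described above reduces to a single threshold test, sketched here. The function and constant names are hypothetical; only the threshold of 15,000 predicted proteins and the fallback from Tier 2 to Tier 1 come from the text.

```python
# Threshold stated in the text: fewer than this many Tier 2 proteins
# triggers the fallback to Tier 1.
MIN_TIER2_PROTEINS = 15_000


def select_tier(n_tier2_proteins):
    """Return 2 by default, or 1 when Tier 2 yields too few proteins."""
    if n_tier2_proteins < MIN_TIER2_PROTEINS:
        return 1
    return 2
```

Under this reading, a genome with exactly 15,000 Tier 2 proteins keeps Tier 2, because the fallback applies only to counts strictly below the threshold.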

We searched for 18S sequences of at least 1,500 base pairs in length for each data set in the following order. If no sequence, or only a partial sequence was retrieved in a given step, we moved on to the next step. Details specific to each data set are included in the EukProt metadata.

In the course of constructing successive versions of EukProt, we have observed several areas that we believe to be current limitations of the database:

1. As can be observed in Figure 1, representation of species is highly uneven among taxonomic groups. This is largely due to the difficulty in identifying, cultivating and sequencing species from underrepresented groups; however, some representatives of key taxa have been published but are not publicly accessible (a list of these species can be found in the "not included" metadata

4. Proteins smaller than 50 amino acids are not predicted with the settings we used for protein prediction (but may be included in data sets for which we did not perform protein prediction). We used the default settings for protein prediction in TransDecoder (see Methods), in order to compromise between minimum protein length and total number of predicted proteins (as proteomes predicted with protein sizes smaller than 50 amino acids are generally much larger, which may significantly slow many downstream analyses). EukProt users interested in proteins shorter than 50 amino acids (for the species on which we performed protein predictions) would instead have to repeat protein predictions using their desired settings.
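For users who wish to repeat the prediction with a shorter minimum length, a re-run might be set up roughly as follows. This is a sketch under stated assumptions, not the pipeline used to build EukProt: the function name, the input file name and the chosen length are hypothetical, and it relies only on the `-t` (transcripts) option of both TransDecoder programs and the `-m` (minimum protein length) option of `TransDecoder.LongOrfs`.

```python
def transdecoder_commands(transcripts_fasta, min_protein_length):
    """Build the two TransDecoder calls for a custom minimum protein length."""
    return [
        # Step 1: extract long open reading frames, lowering the length cutoff.
        ["TransDecoder.LongOrfs", "-t", transcripts_fasta,
         "-m", str(min_protein_length)],
        # Step 2: predict the likely coding regions among those ORFs.
        ["TransDecoder.Predict", "-t", transcripts_fasta],
    ]
```

Each command list could then be passed in order to `subprocess.run(..., check=True)`. As noted above, lowering the cutoff generally produces a much larger predicted proteome.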

Growing the EukProt database with community involvement

The core functionality of the database is the distribution of genome-scale protein sequences

As new genome-scale eukaryotic protein data sets become available, we plan to add them to the database. As yet, we do not have a formal mechanism to accomplish this, and will instead depend on monitoring the literature and assistance from the community. As an example, for version 3, we relied

In the longer term, we hope the standardization of our database provides a path towards including all data sets in a major sequence repository such as NCBI/EBI/DDBJ, so that they can be more broadly accessible and integrated into the suites of tools available at these repositories.