Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii

10.24072/pcjournal.309 - Peer Community Journal, Volume 3 (2023), article no. e79.


Several studies have highlighted the presence of contaminated entries in public sequence repositories, calling for special attention to the associated metadata. Here, we propose and evaluate a fast and efficient k-mer-based approach to assess the degree of mislabeling or contamination. We applied it to high-throughput whole-genome raw sequence data for 236 Ind-Seq and 22 Pool-Seq samples of the invasive species Drosophila suzukii. We first used the CLARK software to build a dictionary of species-discriminating k-mers from the curated assemblies of 29 target drosophilid species (including D. melanogaster, D. simulans, D. subpulchrella, and D. biarmipes) and 12 common Drosophila pathogens and commensals (including Wolbachia). Counting the k-mers in each query sequence that matched a discriminating k-mer from the dictionary provided a simple criterion for assigning the sequence to a target species and for evaluating the sample as a whole. Analyses of a wide range of samples, representative of both target and other drosophilid species, demonstrated very good performance of the proposed approach, both in terms of run time and accuracy of sequence assignment. Of the 236 D. suzukii individuals, five were reassigned to D. simulans and eleven to D. subpulchrella. Another four showed moderate to substantial microbial contamination. Similarly, among the 22 Pool-Seq samples analyzed, two from the native range were found to be contaminated with 1 and 7 D. subpulchrella individuals, respectively (out of 50), and one from Europe was found to be contaminated with 5 to 6 D. immigrans individuals (out of 100). Overall, the present analysis allowed the definition of a large curated dataset consisting of > 60 population samples representative of the worldwide genetic diversity, which may be valuable for further population genetics studies on D. suzukii.
More generally, while we advocate careful sample identification and verification prior to sequencing, the proposed framework is simple and computationally efficient enough to be included as a routine post-hoc quality check prior to any data analysis and prior to data submission to public repositories.
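To make the assignment criterion described above concrete, here is a minimal Python sketch of the general idea: keep only the k-mers that occur in a single target, then count how many of a query sequence's k-mers hit that dictionary. It is a toy illustration under stated assumptions, not the CLARK implementation and not the scripts used in the article. The function names, the tie-handling rule, the tiny value of k, and the made-up reference strings are all hypothetical, and strand (reverse-complement) handling is deliberately ignored.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq (uppercased), skipping k-mers with non-ACGT symbols."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def build_dictionary(references, k):
    """Map each k-mer seen in exactly one target to that target's label."""
    owners = defaultdict(set)
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(label)
    # k-mers shared by two or more targets are not discriminating and are dropped.
    return {kmer: next(iter(labels))
            for kmer, labels in owners.items() if len(labels) == 1}


def assign_read(read, dictionary, k):
    """Return (label, per-target hit counts) for one query sequence.

    The label is None when no discriminating k-mer matches or when the two
    best targets tie (an assumption made for this sketch).
    """
    hits = Counter(dictionary[km] for km in kmers(read, k) if km in dictionary)
    ranked = hits.most_common(2)
    if not ranked:
        return None, hits
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None, hits
    return ranked[0][0], hits


def summarize_sample(reads, dictionary, k):
    """Count how many query sequences were assigned to each target (or to None)."""
    return Counter(assign_read(read, dictionary, k)[0] for read in reads)


# Hypothetical toy references; real dictionaries are built from genome assemblies.
references = {
    "target_A": "ACGTACGTTTGACCA",
    "target_B": "ACGTACGTGGCATTA",
}
K = 5  # far smaller than any realistic k, chosen only to keep the toy readable
dictionary = build_dictionary(references, K)
reads = ["CGTTTGACC", "GTGGCATT", "ACGTACGT"]
print(summarize_sample(reads, dictionary, K))
```

In this toy run the third read consists only of k-mers shared by both references, so it stays unassigned. The per-sample summary is the kind of table one would inspect to flag a mislabeled or contaminated sample.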

Gautier, Mathieu 1

1 CBGP, INRAE, CIRAD, IRD, Montpellier SupAgro, Université de Montpellier, Montpellier, France
License: CC-BY 4.0
Copyrights: The authors retain unrestricted copyrights and publishing rights
Gautier, Mathieu. Efficient k-mer based curation of raw sequence data: application in Drosophila suzukii. Peer Community Journal, Volume 3 (2023), article no. e79. doi: 10.24072/pcjournal.309.

Peer reviewed and recommended by PCI: 10.24072/pci.genomics.100244

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

