Section: Mathematical & Computational Biology
Topic: Biophysics and computational biology

HairSplitter: haplotype assembly from long, noisy reads

Corresponding author(s): Faure, Roland (roland.faure@irisa.fr)

10.24072/pcjournal.481 - Peer Community Journal, Volume 4 (2024), article no. e96.


Motivation: Long-read assemblers struggle to distinguish closely related viral or bacterial strains and often collapse similar strains into a single sequence. This limitation hampers metagenome analysis, because different strains may carry crucial functional differences.

Results: We introduce HairSplitter, a new tool designed to recover strains from a partially or totally collapsed assembly and long reads. The method uses a custom variant-calling procedure that copes with error-prone long reads, and a new read-binning algorithm that recovers an a priori unknown number of strains. On noisy long reads, HairSplitter recovers more strains than state-of-the-art tools while running faster, for both viruses and bacteria.

Availability: HairSplitter is freely available on GitHub at https://github.com/RolandFaure/Hairsplitter (https://doi.org/10.5281/zenodo.13753481).
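The abstract does not describe the binning algorithm itself, so the following is only a generic sketch of how reads could be grouped without fixing the number of groups in advance. It uses a simple, deterministic label-propagation scheme on a read-similarity graph and is not HairSplitter's implementation. The read encoding (one allele per variant position, `None` where the read does not cover the position), the function names `similarity` and `bin_reads`, and the `min_similarity` threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def similarity(a, b):
    """Fraction of variant positions covered by both reads where the alleles agree."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def bin_reads(reads, min_similarity=0.8, max_rounds=20):
    """Group reads by label propagation; the number of groups is not set in advance."""
    names = sorted(reads)

    # Weighted graph: connect two reads when their alleles mostly agree.
    weights = defaultdict(dict)
    for u, v in combinations(names, 2):
        s = similarity(reads[u], reads[v])
        if s >= min_similarity:
            weights[u][v] = s
            weights[v][u] = s

    # Every read starts in its own group.
    label = {name: i for i, name in enumerate(names)}

    for _ in range(max_rounds):
        changed = False
        for u in names:
            if not weights[u]:
                continue  # isolated read keeps its own label
            score = defaultdict(float)
            for v, w in weights[u].items():
                score[label[v]] += w
            # The strongest neighbouring label wins; ties go to the smallest label.
            best = min(score, key=lambda l: (-score[l], l))
            if best != label[u]:
                label[u] = best
                changed = True
        if not changed:
            break

    groups = defaultdict(list)
    for name in names:
        groups[label[name]].append(name)
    return sorted(groups.values())


# Hypothetical toy input: six reads observed at four variant positions.
reads = {
    "r1": (0, 0, 1, 1),
    "r2": (0, 0, 1, 1),
    "r3": (0, 0, 1, None),
    "r4": (1, 1, 0, 0),
    "r5": (1, 1, 0, 0),
    "r6": (1, None, 0, 0),
}
groups = bin_reads(reads)
print(groups)  # [['r1', 'r2', 'r3'], ['r4', 'r5', 'r6']]
```

In this toy input the two groups emerge from the data alone: no group count is passed to `bin_reads`, and a read with no sufficiently similar neighbour would simply remain in a group of its own.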

DOI: 10.24072/pcjournal.481
Type: Research article
Keywords: Metagenomes, Metaviromes, Haplotyping, Genome assembly, Strain separation

Faure, Roland 1, 2; Lavenier, Dominique 1; Flot, Jean-François 2, 3

1 Univ. Rennes, INRIA RBA, CNRS UMR 6074, Rennes, France
2 Service Evolution Biologique et Ecologie, Université libre de Bruxelles (ULB), Brussels, Belgium
3 Interuniversity Institute of Bioinformatics in Brussels -- (IB)2, Brussels, Belgium
License: CC-BY 4.0
Copyrights: The authors retain unrestricted copyrights and publishing rights
@article{10_24072_pcjournal_481,
     author = {Faure, Roland and Lavenier, Dominique and Flot, Jean-Fran\c{c}ois},
     title = {HairSplitter: haplotype assembly from long, noisy reads},
     journal = {Peer Community Journal},
     eid = {e96},
     publisher = {Peer Community In},
     volume = {4},
     year = {2024},
     doi = {10.24072/pcjournal.481},
     language = {en},
     url = {https://peercommunityjournal.org/articles/10.24072/pcjournal.481/}
}
Faure, Roland; Lavenier, Dominique; Flot, Jean-François. HairSplitter: haplotype assembly from long, noisy reads. Peer Community Journal, Volume 4 (2024), article no. e96. doi: 10.24072/pcjournal.481. https://peercommunityjournal.org/articles/10.24072/pcjournal.481/

PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.mcb.100307

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.

