Section: Evolutionary Biology
Topic: Biophysics and computational biology, Evolution, Genetics/genomics

Faster model-based estimation of ancestry proportions

Corresponding author(s): Meisner, Jonas (jonas.meisner@sund.ku.dk)

10.24072/pcjournal.503 - Peer Community Journal, Volume 4 (2024), article no. e115.

Peer reviewed and recommended by PCI

Ancestry estimation from genotype data in unrelated individuals has become an essential tool in population and medical genetics to understand demographic population histories and to model or correct for population structure. The ADMIXTURE software is a widely used model-based approach to account for population stratification; however, it struggles with convergence issues and does not scale to modern human datasets or the large number of variants in whole-genome sequencing data. Likelihood-free approaches optimize a least-squares objective and have gained popularity in recent years due to their scalability. However, this scalability comes at the cost of accuracy in the ancestry estimates in more complex admixture scenarios. We present a new model-based approach, fastmixture, which adopts aspects of likelihood-free approaches for parameter initialization, followed by a mini-batch expectation-maximization procedure to model the standard likelihood. In a simulation study, we demonstrate that the model-based approaches of fastmixture and ADMIXTURE are significantly more accurate than recent likelihood-free approaches. We further show that fastmixture runs approximately 30× faster than ADMIXTURE on both simulated data and empirical data from the 1000 Genomes Project, such that our model-based approach scales to much larger sample sizes than previously possible.
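
To make the "mini-batch expectation-maximization procedure to model the standard likelihood" concrete, the following is a minimal NumPy sketch of the classic EM update for the binomial admixture model, applied to mini-batches of SNPs. It is an illustration under stated assumptions, not the fastmixture implementation: the function names (log_likelihood, em_step, minibatch_epoch), the choice to form batches over SNPs, and the clipping constant eps are all hypothetical, and the likelihood-free initialization and any acceleration scheme are omitted. Here G is an N x M genotype matrix with entries in {0, 1, 2}, Q is the N x K matrix of ancestry proportions, and P is the M x K matrix of ancestral allele frequencies.

```python
import numpy as np


def log_likelihood(G, Q, P, eps=1e-6):
    """Binomial admixture log-likelihood, up to an additive constant."""
    # H[i, j] is the expected allele frequency of individual i at SNP j.
    H = np.clip(Q @ P.T, eps, 1.0 - eps)
    return float(np.sum(G * np.log(H) + (2.0 - G) * np.log(1.0 - H)))


def em_step(G, Q, P, eps=1e-6):
    """One EM update of Q (N x K) and P (M x K) given genotypes G (N x M)."""
    H = np.clip(Q @ P.T, eps, 1.0 - eps)
    A = G / H                  # weights for copies of the counted allele
    B = (2.0 - G) / (1.0 - H)  # weights for copies of the other allele
    # Expected allele counts attributed to each ancestral source.
    P_a = P * (A.T @ Q)
    P_b = (1.0 - P) * (B.T @ Q)
    Q_a = Q * (A @ P)
    Q_b = Q * (B @ (1.0 - P))
    # Keep frequencies strictly inside (0, 1) and rows of Q on the simplex.
    P_new = np.clip(P_a / (P_a + P_b), eps, 1.0 - eps)
    Q_new = Q_a + Q_b
    Q_new /= Q_new.sum(axis=1, keepdims=True)
    return Q_new, P_new


def minibatch_epoch(G, Q, P, n_batches, rng):
    """One pass over random SNP batches (batching over SNPs is an assumption)."""
    P = P.copy()  # avoid modifying the caller's array in place
    order = rng.permutation(G.shape[1])
    for batch in np.array_split(order, n_batches):
        # Each batch updates Q and only its own rows of P.
        Q, P[batch] = em_step(G[:, batch], Q, P[batch])
    return Q, P
```

In this sketch a full-data call to em_step never decreases the log-likelihood, which is the usual EM guarantee. The mini-batch pass gives up that guarantee in exchange for more frequent updates of Q per sweep through the data.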

DOI: 10.24072/pcjournal.503
Type: Research article
Keywords: Ancestry estimation, population structure, population genetics, evolutionary genetics, bioinformatics

Santander, Cindy G. 1; Refoyo Martinez, Alba 2; Meisner, Jonas 3, 4

1 Department of Biology, University of Copenhagen, Denmark
2 Center for Health Data Science, University of Copenhagen, Denmark
3 Mental Health Centre Copenhagen, Copenhagen University Hospital, Denmark
4 Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Denmark
License: CC-BY 4.0
Copyrights: The authors retain unrestricted copyrights and publishing rights
@article{10_24072_pcjournal_503,
     author = {Santander, Cindy G. and Refoyo Martinez, Alba and Meisner, Jonas},
     title = {Faster model-based estimation of ancestry proportions},
     journal = {Peer Community Journal},
     eid = {e115},
     publisher = {Peer Community In},
     volume = {4},
     year = {2024},
     doi = {10.24072/pcjournal.503},
     language = {en},
     url = {https://peercommunityjournal.org/articles/10.24072/pcjournal.503/}
}
Santander, Cindy G.; Refoyo Martinez, Alba; Meisner, Jonas. Faster model-based estimation of ancestry proportions. Peer Community Journal, Volume 4 (2024), article no. e115. doi: 10.24072/pcjournal.503. https://peercommunityjournal.org/articles/10.24072/pcjournal.503/

PCI peer reviews and recommendation, and links to data, scripts, code and supplementary information: 10.24072/pci.evolbiol.100838

Conflict of interest of the recommender and peer reviewers:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article.
