Populations in statistical genetics
What are they, and how can we infer them from whole genome data?
Daniel Lawson
Heilbronn Institute, University of Bristol
www.paintmychromosomes.com
Work with:
January 2014, Cambridge

Overview
• Some views of ‘population’
• A statistical definition
• Generative approaches
• Inference from sparse weakly linked data (STRUCTURE)
• Inference from dense linked data (fineSTRUCTURE)
• Selected results

What is a population in general?
Like species, populations are elusive objects.
• Definitions are a function of knowledge and application
• Some definitions need not be biologically meaningful
  • Example: Individuals found on the same island
  • Example: Sample location due to a clustered sampling procedure
  • Example: Disease status for case/control studies
• In statistics: we generalize from samples to a population
• What these share: individuals are equivalent under some measure
But do populations really exist? Does it matter, if they are useful?

Diversion: some fuzzy definitions of species
Are definitions of species analogous to definitions of populations?
• Typological species: Organisms that share the same set of phenotypes
• Ecological species: Organisms that compete for the same environment
• Phylogenetic species: Organisms with a single common ancestor
• Biological species: Organisms that can share DNA via sexual reproduction
[Figure: (Western/Eastern) Meadowlark, http://evolution.berkeley.edu]

What is a population? Motivation from genetics
We are interested in non-subjective definitions of populations.
• Individuals have equivalent genetic ancestry
  • This depends on the data available ...
  • ... and the model used
• Knowledge driven definition
• Requires defining equivalent!

The birds and the bees (vs the plants and bacteria)
• Sex dramatically affects genetic transmission
• Different population concept required
Sexual species
• Generalise the Biological Species concept
• Random mating within a population
• Try to detect deviations from random mating to identify populations
Asexual species
• Generalise the Ecological Species concept
• Neutral competition within a population
• Try to detect deviations from neutrality to identify populations
• Not examined further in this talk

Top down approaches
Start from a theoretical generative model of reproduction. For example:
• Individuals move between populations via migration
• Discrete generations (for simplicity)
• Each generation randomly chooses parent(s) within a population
Population structure: individuals migrate between populations but mate randomly within each

Ancestry Process
[Figure: ancestry process, with time running to the present.]

Populations as Demes
[Figure: demes linked by migration. Ammerman and Cavalli-Sforza, 1984, The Neolithic Transition and the Genetics of Populations in Europe.]

Diversion: Principal Components Analysis
• PCA is widely used in genetics, and helpful for visualisation
• It does not make any explicit modelling assumptions... BUT to interpret the output, you do!
• ‘Just’ rotating the data
• Similar individuals tend to be close
• Differences shared by many individuals tend to appear first
• Each component describes a different direction of variation
• If we are lucky, these correspond to real shared drift events
• There is no unique interpretation of PCA (see McVean 2009)
• See Lawson & Falush 2012 for details.
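
The ‘just rotating the data’ view of PCA can be sketched in a few lines. This is a minimal illustration, not the procedure used in the talk: the function names `genotype_pca` and `simulate_two_populations` are hypothetical, the genotypes are simulated 0/1/2 allele counts, the drift size and sample sizes are arbitrary assumptions, and the columns are only mean-centred (genetics software often also rescales each SNP by its frequency).

```python
import numpy as np


def genotype_pca(G, n_components=2):
    """PCA of an (individuals x SNPs) genotype matrix via SVD.

    Assumption: columns are mean-centred only, with no per-SNP rescaling.
    Returns the PC scores and the fraction of variance each PC explains.
    """
    G = np.asarray(G, dtype=float)
    centred = G - G.mean(axis=0)
    # The right singular vectors define the rotation; U * S are the scores.
    U, S, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


def simulate_two_populations(n_per_pop=20, n_snps=1000, drift=0.1, seed=0):
    """Toy data: two populations whose SNP frequencies drifted independently
    from a shared ancestral frequency (all parameter values are illustrative)."""
    rng = np.random.default_rng(seed)
    p0 = rng.uniform(0.2, 0.8, size=n_snps)            # ancestral frequencies
    freqs = p0 + rng.normal(0.0, drift, size=(2, n_snps))
    freqs = np.clip(freqs, 0.01, 0.99)                 # keep valid probabilities
    labels = np.repeat([0, 1], n_per_pop)
    G = rng.binomial(2, freqs[labels])                 # diploid allele counts
    return G, labels


G, labels = simulate_two_populations()
scores, explained = genotype_pca(G)
```

In this toy setting the difference shared by many individuals (the drift separating the two populations) should appear on the first component, while interpreting that axis as a ‘drift event’ still relies on the model used to simulate the data.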

Example
[Figure: four panels. A) Historical Scenario (axis: time in the past to present, with marks d1 and d2; populations Pop A, Pop B1, Pop B2). B) Spatial Scenario (populations in space, with migration). C) PCA of A and D) PCA of B (axes PC1 and PC2; labels A, B1, B2, f(d1), f(d2)).]

Bottom up approaches
Start from a definition of population as ‘equivalent’ individuals
• Within a population, individuals are randomly mating
  • Small samples of large populations: individuals are approximately independent
  • Smaller populations: relationships must be accounted for
• What does random mating mean for the data?
• How are populations related?

Top down vs Bottom up
[Figure: three panels, A) Generative Model, B) Statistical Model, C) Full Statistical Model (axes: Genetic Position, Time, Density).]
• A) Generative approaches are best in theory, if we can make the model match reality
• But, hard to use in practice - how to do inference?
• B) Bottom up approaches are approximate - might lose power
• C) But can be refined until they are close to the generating process

Ancestry Process - Ancestral Recombination Graph
• Take limit N → ∞ with N/T constant
• Recombination rate ρ (for tract length)
• Ignore unbranching ancestors
[Figure: ancestral recombination graph, with time running to the present.]

Structure model
Pritchard, Stephens & Donnelly 2000.
• Populations are large and well mixed
• SNPs $D_{il}$ are unlinked*
• Loci have some ancestral frequency $p_{0l} \sim P(\cdot)$
• Population k has frequency $p_{kl}$ drifted from ancestral $p_{0l}$**:
    $p_{kl} \sim \mathrm{Dirichlet}(p_{0l})$
• Individual i is in population k if $Q_{ik} = 1$, assigned by
    $\prod_{l} P(D_{il} \mid p_{Q_{ik},l})$
• Individuals in the same population are exchangeable with respect to the SNP frequencies
* Solutions to this have been explored (computationally inconvenient)
** Valid approximation, originally derived by Wright

Single SNP with populations
[Figure: an ancestral population with some SNP distribution, a mutation, and present populations reached by ‘independent drift’ of the SNP distribution.]

Scaling STRUCTURE
The STRUCTURE approach has a parameter for every SNP. But:
• Assuming that drift is weak, $p_{0l} = E(D_{\cdot l})$ and:
    $p(p_{kl}) \sim N(p_{0l},\, p_{0l}(1 - p_{0l}))$
• Probability SNP l is shared not by chance:
    $X_{ijl} = D_{il} D_{jl} / p_{0l} + (1 - D_{il})(1 - D_{jl}) / (1 - p_{0l})$
• Invoke the Central Limit Theorem:
    $X_{ij} = \sum_{l=1}^{L} X_{ijl} \sim N(\mu_{ij}, \sigma_{ij}^2)$
  This is the Coancestry Matrix. It is a sufficient statistic for $p(D \mid Q)$
• $\mu$ and $\sigma$ are known and the same for all individuals in a population
• Exchangeability again!
Now let's think about linkage...
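
The unlinked coancestry statistic from the Scaling STRUCTURE slide can be computed directly from a 0/1 data matrix. The sketch below is illustrative only: the function name `unlinked_coancestry` is hypothetical, the data are assumed to be haploid 0/1 alleles with individuals in rows and SNPs in columns, $p_{0l}$ is estimated by the sample mean of each column, and dropping monomorphic SNPs (where the formula would divide by zero) is an assumption of this example rather than something stated in the talk.

```python
import numpy as np


def unlinked_coancestry(D):
    """Sum over SNPs of X_ijl = D_il*D_jl/p0_l + (1-D_il)*(1-D_jl)/(1-p0_l).

    D is an (individuals x SNPs) array of 0/1 alleles.
    Returns the (individuals x individuals) matrix X_ij.
    """
    D = np.asarray(D, dtype=float)
    p0 = D.mean(axis=0)                       # estimate of p_0l = E(D_.l)
    keep = (p0 > 0) & (p0 < 1)                # assumption: skip monomorphic SNPs
    D, p0 = D[:, keep], p0[keep]
    shared_ones = (D / p0) @ D.T              # sum_l D_il * D_jl / p0_l
    shared_zeros = ((1 - D) / (1 - p0)) @ (1 - D).T
    return shared_ones + shared_zeros


# Tiny worked example: 4 individuals, 3 SNPs, every SNP at frequency 0.5.
D_example = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
])
X_example = unlinked_coancestry(D_example)
```

Rare shared alleles contribute more than common ones because each match is weighted by the inverse of its frequency, which is the sense in which the statistic measures sharing ‘not by chance’.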

Ancestral Recombination Graph
[Figure: ancestral recombination graph, with time running towards the present. Hein, Schierup and Wiuf, ‘Gene Genealogies, Variation and Evolution’, OUP 2005.]

ChromoPainter
[Figure]

ChromoPainter Hidden Markov Model
[Figure: rows labelled Reference Panel, Reconstruction and Data.]

FineSTRUCTURE model
• Each population has a characteristic rate $P_{ab}$ of sharing ‘chunks’ with each other population
• Individuals are again exchangeable within populations
• Each recombination event has probability $\hat{P}_{ab} = P_{ab} / \hat{n}_b$ when coming from an individual in population b into population a
• Dirichlet Process prior on the parameters $P_{a\cdot}$
  • We integrate out P, leaving no population level parameters
  • We can put a meaningful prior on the variation of P between populations, and the number of populations
  • In practice these details don’t matter much

Example: Africa HGDP
[Figure: HGDP African populations (BiakaPygmy, MbutiPygmy, San, BantuSouthAfrica, BantuKenya, Yoruba, Mandenka) on both axes; scale from 100 to 1000; node values between 0.84 and 0.99.]

Ancient Genomes: fennoscandia.blogspot.co.uk
Ajv70 Gotland hunter gatherer
fennoscandia.blogspot.co.uk/2013/11/ajv70-and-modern-european-variation-ii.html
444k SNPs

History of populations
• We have assumed that we observe individuals from real populations
• Populations differ by genetic drift
• This works, even if there is historical migration, provided that the mixture fractions are equal (exchangeability)
• Real individuals are related by a combination of drift and admixture
[Figure: ‘Ancestral population graph’ of populations over time, with a bottleneck, splits and migrations.]

Admixture
[Figure: A) Discrete Admixture and B) Continuous Admixture (axis: time in the past to present; populations Pop A, Pop B1, Pop B2, Pop AB1). C) PCA of A and D) PCA of B (axes PC1 and PC2; labels A, B1, B2, AB1).]
Admixture describes mixture without drift.

Admixture
[Figure: density against genetic location for A) Non-admixed populations, B) Mixing parameter 0.1, C) Mixing parameter 1, D) Mixing parameter 5.]
Continuous admixture is a problem.

Admixture models
• STRUCTURE can infer ‘pure’ populations from admixed populations
• Population assignment $Q_{ik}$ sums to 1 over k, without requiring a single non-zero element
• Interpretation: Observed individuals are mixtures of pure populations, without drift
How can we tell apart:
• Large drift, mixed by admixture?
• Small drift without admixture?
• Solution: SNPs have fixed in some populations, $p_{kl} = 0$ (or 1)
• FineSTRUCTURE cannot use this description, as we’ve integrated out these details

Drift model
• See each drift event as independent
• Population assignment $Q_{ik}$ now takes any value
  • ‘Amount’ of each drift event retained by individual i
• Reconstructs the coancestry matrix
• Requires a strong prior to obtain a unique solution
[Figure: Pop AB1 and its drift components.]

Projects
• Siberian cold adaptation (Alexia & Toomas, in review)
• GlobeTrotter (Hellenthal et al: very accurate admixture dating using ChromoPainter, to appear in Science)
• Peopling of the British Isles (in review)
• UK10K, ALSPAC (4K whole genomes, use in genome-wide association studies)
• Highly recombining asexuals (fungus, bacteria)
Model improvements:
• Relatedness
• Admixture/history of populations
• Complete recoding for usability
• Efficient computation (fastFineSTRUCTURE)

See www.paintmychromosomes.com
• Lawson, Hellenthal, Myers & Falush, ‘Inference of population structure using dense haplotype data’, 2012. PLoS Genetics.
• Lawson & Falush, ‘Similarity matrices and clustering algorithms for population identification using genetic data’, 2012. ARHG.
• Lawson 2014, ‘Populations in statistical genetics modelling and inference’, in ‘Populations in the Human Sciences’, Eds. Kreager, Capelli, Ulijaszek & Winney.

Relatedness
What if individuals within a population are related?

Relatedness
[Figure: pedigree with grandparents, parents and cousins, showing relatedness through IBD tracts.]
• Samples are cousins
• This is very easy to tell from their tract length distribution
• Excluding these tracts, we sample from the population distribution of chunks
• Multiple ordering model in progress

Choice of measure
[Figure: six panels, one per similarity measure (IBS, COV, ESU, FHS, IBD, CPL), each over populations A, B1, B2, C1, C2 with its own value range.]

Choice of measure
[Figure: between / within population distance against number of regions (0 to 100) for the measures IBD, CPL, ESU, FHS, COV and IBS.]