Populations in statistical genetics

Populations in statistical genetics
What are they, and how can we infer them from whole genome
data?
Daniel Lawson
Heilbronn Institute, University of Bristol
www.paintmychromosomes.com
Work with:
January 2014, Cambridge
1 / 35
Overview
I
Some views of ‘population’
I
A statistical definition
I
Generative approaches
I
Inference from sparse weakly linked data (STRUCTURE)
I
Inference from dense linked data (fineSTRUCTURE)
I
Selected results
2 / 35
What is a population in general?
Like species, populations are elusive objects.
I
Definitions are a function of knowledge and application
I
Some definitions - Need not be biologically meaningful
I
Example: Individuals found on same island
I
Example: Sample location due to clustered sampling
procedure
I
Example: Disease status for case/control studies
I
In statistics: we generalize from samples to a population
I
Share: individuals are equivalent under some measure
But do populations really exist? Does it matter if they are useful?
3 / 35
Diversion: some fuzzy definitions of species
Are definitions of species analogous to definitions of populations?
I
Typological species: Organisms that share the same set of
phenotypes
I
Ecological species: Organisms that compete for the same
environment
I
Phylogenetic species: Organisms with a single common
ancestor
I
Biological species: Organisms that can share DNA via
sexual reproduction
(Western/Eastern) Meadow Lark: http://evolution.berkeley.edu
4 / 35
What is a population? Motivation from genetics
We are interested in non-subjective definitions of populations.
I
Individuals have equivalent genetic ancestry
I
This depends on the data available ...
I
... and the model used
I
Knowledge driven definition
I
Requires defining equivalent!
5 / 35
The birds and the bees (vs the plants and bacteria)
I
Sex dramatically affects genetic transmission
I
Different population concept required
Sexual species
I
I
I
I
I
Generalise the Biological Species concept
Random mating within a population
Try to detect deviations from random mating to identify
populations
Asexual species
I
I
I
I
Generalise the Ecological Species concept
Neutral competition within a population
Try to detect deviations from neutrality to identify populations
Not examined further in this talk
6 / 35
Top down approaches
Start from a theoretical generative model of reproduction.
For example:
I
Individuals move between populations via migration
I
Discrete generations (for simplicity)
I
Each generation randomly chooses parent(s) within a
population
Population structure: individuals migrate between populations but
mate randomly within each
7 / 35
Ancestry Process
Time
Present
8 / 35
Populations as Demes
Demes
Migration
Ammerman and Cavalli-Sforza, 1984 The Neolithic Transition and the Genetics of Populations in Europe.
9 / 35
Diversion: Principal Components Analysis
I
PCA is widely used in genetics, and helpful for visualisation
I
It does not make any explicit modelling assumptions...
BUT to interpret the output, you do!
I
I
I
I
I
I
I
‘Just’ rotating the data
Similar individuals tend to be close
Differences shared by many individuals tend to appear first
Each component describes a different direction of variation
If we are lucky, these correspond to real shared drift events
There is no unique interpretation of PCA (see McVean 2009)
See Lawson & Falush 2012 for details.
10 / 35
Example
A) Historical Scenario
B) Spatial Scenario
Populations
Time in the past
d1
Migration
d2
Present
Pop A
Pop B1
Pop B2
Space
C) PCA of A
D) PCA of B
PC2
PC2
0.2
0.2
●
●
●●●
●
● ●●
B2
●
●
●
●
f(d2)
●
●
● ●
−0.2
●●
●
●
●
●
●
●
0
−0.1
●
●
0.2
0.1
●
A
f(d1)
PC1
0.1
●
0.2
0
PC1
−0.1
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
−0.2
●● ● ●
●●●
●●
−0.2
B1
−0.2
11 / 35
Bottom up approaches
Start from a definition of population as ‘equivalent’ individuals
I
Within a population, individuals are randomly mating
I
Small samples of large populations: individuals are
approximately independent
I
Smaller populations: relationships must be accounted for
I
What does random mating mean for the data?
I
How are populations related?
12 / 35
Top down vs Bottom up
C) Full Statistical Model
Time
0.6
0.8
0.0
0.6
0.9 1.0
Genetic Postition
0.0
0.6
0.9 1.0
Genetic Postition
0.0
0.2
0.4
Density
0.6
0.4
0.2
0.0
Time
0.8
1.0
B) Statistical Model
1.0
A) Generative Model
0.0
0.6
0.9 1.0
Genetic Postition
I
A) Generative approaches are best in theory, if we can make
the model match reality
I
But, hard to use in practice - how to do inference?
I
B) Bottom up approaches are approximate - might lose power
I
C) But can be refined until they are close to the generating
process
13 / 35
Ancestry Process - Ancestral Recombination Graph
Time
• Take limit N → ∞ with
N/T constant
• Recombination rate ρ
(for tract length)
• Ignore unbranching
ancestors
Present
14 / 35
Structure model
Pritchard, Stephens & Donnelly 2000.
I Populations are large and well mixed
I SNPs Dil are unlinked*
I loci have some ancestral frequency
p0l ∼ P(·)
I
Population k has frequency pkl drifted from ancestral p0l **
pkl ∼ Dirichlet(p0l )
I
Individual i is in population k if Qik = 1, assigned by
Y
P(Dil |pQik ,l )
l
I
Individuals in the same population are exchangeable with
respect to the SNP frequencies
* Solutions to this have been explored (computationally inconvenient)
** Valid approximation, originally derived by Wright
15 / 35
Single SNP with populations
Ancestral
Population
Mutation
Some SNP distribution
Present populations
'Independent drift' of SNP distribution
16 / 35
Scaling STRUCTURE
STRUCTURE approach has a parameter for every SNP. But:
I Assuming that drift is weak, p0l = E (D·l ) and:
p(pkl ) ∼ N (p0l , p0l (1 − p0l ))
I
Probability SNP l is shared not by chance:
Xijl = Dil Djl /p0l + (1 − Dil )(1 − Djl )/(1 − p0l )
I
Invoke the Central limit theorem:
Xij =
L
X
Xijl ∼ N(µij , σij2 )
l=1
This is the Coancestry Matrix
It is a sufficient statistic for p(D|Q)
I µ and σ known and same for all individuals in a population
I Exchangability again!
Now lets think about linkage...
I
I
17 / 35
Ancestral Recombination Graph
Time
towards
present
Hein, Schierup and Wiuf 'Gene Genealogies,
Variation and Evolution', OUP 2005
18 / 35
ChromoPainter
19 / 35
ChromoPainter Hidden Markov Model
Reference Panel
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
Reconstruction
x
x
x
x
x
x
x
x
x
Data
x
x
20 / 35
FineSTRUCTURE model
I
Each population has a characteristic rate Pab of sharing
‘chunks’ with each other population
I
Individuals are again exchangeable within populations
I
Each recombination event has probability P̂ab = Pab /n̂b when
coming from an individual in population b into population a
I
Dirichlet Process prior on the parameters Pa·
I
We integrate out P, leaving no population level parameters
I
We can put a meaningful prior on the variation of P between
populations, and the number of populations
I
In practice these details don’t matter much
21 / 35
0.84
0.90
0.84
0.93
0.99
0.86
0.86
0.99
0.86
0.86
Example: Africa HGDP
1000
BiakaPygmy
900
800
MbutiPygmy
700
San
BantuSouthAfrica
BantuSouthAfrica
600
BantuKenya
500
400
Yoruba
300
200
Mandenka
BiakaPygmy
MbutiPygmy
San
BantuSouthAfrica
BantuKenya
BantuSouthAfrica
Yoruba
Mandenka
100
22 / 35
Ancient Genomes: fennoscandia.blogspot.co.uk
Ajv70 Gotland hunter gatherer
fennoscandia.blogspot.co.uk/2013/11/ajv70-and-modern-european-variation-ii.html
444k SNPs
23 / 35
History of populations
Time
I
We have assumed that we observe
individuals from real populations
I
Populations differ by genetic drift
I
This works, even if there is historical
migration, provided that the mixture
fractions are equal (exchangeability)
I
Real individuals are related by a
combination of drift and admixture
I
Bottleneck
Split
Split
Migration
Migration
‘Ancestral population graph’
Populations
24 / 35
Admixture
B) Continuous Admixture
Time in the past
Time in the past
A) Discrete Admixture
Pop AB1
Present
Pop AB1
Present
Pop A
Pop B1
Pop B2
C) PCA of A
●
●
●
●
● ● ●
● ●
●
●●
−0.2
●
PC2
B2
0.1
●
−0.1
0.1
x
−0.1
0.2
A
Pop B2
0.2
PC1
B1
−0.2
●
●
●
●
●
●●
PC2
B2
0.1
● ●
●
●● ●
●
● ●
●
●
●●●
AB1
Pop B1
C) PCA of B
0.2
A
●
●
Pop A
−0.1
0.1
●
●
AB1
−0.2
−0.1
0.2
PC1
B1
−0.2
Admixture describes mixture without drift.
25 / 35
Admixture
B) Mixing parameter 0.1
0.15
0.10
Density
−10
−5
0
5
10
−10
−5
0
5
Genetic Location
Genetic Location
C) Mixing parameter 1
D) Mixing parameter 5
10
0.08
0.04
0.00
Density
0.12
0.00 0.02 0.04 0.06 0.08
Density
0.05
0.00
0.10
0.00
Density
0.20
A) Non−admixed populations
−10
−5
0
5
10
Genetic Location
−10
−5
0
5
10
Genetic Location
Continuous admixture is a problem.
26 / 35
Admixture models
I
STRUCTURE can infer ‘pure’ populations from admixed
populations
I
Population assignment Qik sums to 1 without requiring a
single element
I
Interpretation: Observed individuals are mixtures of pure
populations, without drift
How can we tell apart:
I
I
I
Large drift, mixed by admixture?
small drift without admixture?
I
Solution: SNPs have fixed in some populations pik = 0 (or 1)
I
FineSTRUCTURE cannot use this description, as we’ve
integrated out these details
27 / 35
Drift model
I
See each drift event as independent
I
Population assignment Qik now takes any
value
I
‘Amount’ of each drift event retained by
individual i
I
Reconstructs the coancestry matrix
I
Requires a strong prior to obtain a unique
solution
Pop AB1
Drift Components
28 / 35
29 / 35
Projects
I
Siberian cold adaptation (Alexia & Toomas, in review)
I
GlobeTrotter (Hellenthal et al: very accurate admixture dating
using ChromoPainter, to appear in Science)
I
Peopling of the British Isles (in review)
I
UK10K, ALSPAC (4K whole genomes, use in genome-wide
association studies)
I
Highly recombining asexuals (fungus, bacteria)
Model improvements:
I
I
I
I
I
Relatedness
Admixture/history of populations
Complete recoding for usability
Efficient computation (fastFineSTRUCTURE)
30 / 35
See www.paintmychromosomes.com
I
Lawson, Hellenthal, Myers & Falush, ‘Inference of population
structure using dense haplotype data’, 2012. PLoS Genetics.
I
Lawson & Falush ‘Similarity matrices and clustering algorithms for
population identifcation using genetic data’, 2012. ARHG.
I
Lawson 2014 ‘Populations in statistical genetics modelling and
inference’, in ‘Populations in the Human Sciences’, Eds. Kreager,
Capelli, Ulijaszek & Winney.
31 / 35
Relatedness
What if individuals within a population are related?
32 / 35
Relatedness
Grandparents
Parents
Cousins
I
I
I
I
Relatedness
IBD
Samples are cousins
This is very easy to tell from their tract length distribution
Excluding these tracts, we sample from the population
distribution of chunks
Multiple ordering model in progress
33 / 35
Choice of measure
IBS Measure
COV Measure
range = ( 0.832 , 0.843 )
C2
5.54
ESU Measure
range = ( −5246 , 3766 )
6.52
C2
3.31
4.57
C1
3.6
1.65
C1
1.83
−0.29
B2
−2.24
A
B1
B2
C1
C2
0.36
B1
−1.11
−2.58
A
B1
B2
C1
C2
3.6
C1
2.01
B1
−0.97
−1.69
A
−3.32
−2.4
A
B2
C2
3.84
0.43
−1.16
C1
2.49
−2.75
C2
−3.54
C1
C2
range = ( 1774 , 2131 )
B2
1.14
B1
−0.22
3.78
3.04
C1
2.3
A
−1.57
1.56
B2
0.82
B1
−0.65
0.46
0.08
−0.9
A
B1
B2
C1
−3.12
4.52
C2
1.81
−1.95
A
B2
3.17
−0.37
B1
B1
CPL Measure
1.22
C1
0.46
−0.26
4.52
2.81
B2
1.18
B2
IBD Measure
range = ( 26290 , 164376 )
4.39
C2
B1
1.89
−1.84
A
−3.21
FHS Measure
range = ( 366346 , 434550 )
A
2.61
C1
−0.37
−1.27
A
3.33
1.1
0.68
B1
4.04
C2
2.57
2.62
B2
range = ( −0.0339 , 0.0201 )
4.04
C2
−2.25
−1.39
A
−2.13
A
B1
B2
C1
C2
−2.87
34 / 35
Between / Within Population Distance
Choice of measure
3.5
3.0
●
●
●
●
●
● ●
● ●
●
1.0
●
●
●
●
●
●
●
IBD
●
●
●
●
●
2.0
1.5
CPL
●
●
2.5
●
●
●
●
●
●
ESU
FHS
COV
IBS
0.5
0
20
40
60
80
100
Number of regions
35 / 35