Detecting past changes in population size using haplotype homozygosity

Rencontres Jeunes Statisticiens, Porquerolles, 4 April 2017
Coralie Merle, Jean-Michel Marin, François Rousset and Raphaël Leblois
Université de Montpellier, IMAG & INRA, CBGP
Infer population sizes using genetic data
• $\theta$: the demographic parameters of interest,
• $x$: the observed genetic data,
• $h \in H$: an unobserved gene tree (latent variable).
\[ L(x \mid \theta) = \int_{h \in H} f(x, h \mid \theta) \, dh \]
Figure: two demographic scenarios, an exponentially contracting population size and a piecewise constant population size. Axes: population size (levels $N_{e0}$ and $f \times N_{e0}$) versus generations before present (change at time $t$).
Limited number of well-chosen loci, assumed independent
Independent loci
\[ L(x \mid \theta) = \int_{h \in H} f(x, h \mid \theta) \, dh = \prod_{i \in \{\text{indep. loci}\}} \int_{h_i \in H} f(x_i, h_i \mid \theta) \, dh_i \]
Figure: $h_i$: a possible history of the sampled gene copies at a given locus.
Main barrier: intensive computation
Huge number of loci, model the recombination
\[ L(x \mid \theta) = \int_{h \in H} f(x, h \mid \theta) \, dh \]
Figure: $h$: an Ancestral Recombination Graph (Griffiths and Tavaré [1994]).
Main barrier: very intensive computation, hence the need to develop a new inference method, not based on a likelihood.
Outline
1 Motivations
2 Demographic inference using haplotype homozygosity
    Empirical $\widehat{HH}$ and theoretical $HH_{th}$
    Parameter inference
3 Penalized model choice
    Model choice penalty with sensitivity weights
    Sensitivity Analysis
4 Numerical results
    Simulated data sets
    Holstein data set
Next Generation Sequencing techniques: more and more data
Classical inference methods: suitable for polymorphism data sets consisting of a few (non-recombining) loci.
Take advantage of the information carried by genetic recombination.
Consider the dependency between the genealogies of adjacent positions in the genome: the Linkage Disequilibrium (LD).
Haplotype structure based methods: demographic history inference based on the lengths of identical segments between two haplotypes.
Figure: pairwise haplotype alignment.
Haplotype Homozygosity: HH(n) is the probability that n adjacent positions, drawn at random in the whole genome sequence, are homozygous between the two haplotypes.
Previous methods: inference of the historical changes in the effective population size, with a risk of overfitting.
Aim of this work: a model choice procedure between demographic models of different complexity.
• Model choice criterion based on the comparison of the observed $\widehat{HH}$ and the theoretical $HH_{th}$ haplotype homozygosity.
• Nested demographic models, hence a penalized model choice criterion.
• Penalization relying on the computation of Sobol's sensitivity indices, related to the complexity of each model.
$M_0$: Constant population size
$\theta_0 = (N_{e0})$
$M_1$: Contracting population size
$\theta_1 = (N_{e0}, t, f)$
Figure: population size versus generations before present for each model. Under $M_0$ the size stays at $N_{e0}$; under $M_1$ the plot shows the levels $N_{e0}$ and $f \times N_{e0}$ with a change at time $t$.
Empirical $\widehat{HH}$ and theoretical $HH_{th}$
Empirical $\widehat{HH}$, computed from the observed data
$\widehat{HH}(n)$: the proportion of segments of at least n identical adjacent positions between the two haplotypes.
• Choose a number $W(n)$ of windows of n adjacent positions to explore.
• Draw the windows of n adjacent markers uniformly at random in the genome.
• Check whether each window contains a polymorphism between the two haplotypes.
Finally, $\widehat{HH}(n)$ is the observed proportion of homozygous segments among all randomly visited segments:
\[ \widehat{HH}(n) = \frac{1}{W(n)} \sum_{w=1}^{W(n)} \mathbf{1}_{H}(w), \]
where $\mathbf{1}_{H}(w)$ equals 1 when window $w$ is homozygous and 0 otherwise.
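The window-sampling steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `empirical_hh`, the representation of haplotypes as equal-length strings, and the toy sequences are all assumptions made for the example.

```python
import random


def empirical_hh(hap_a, hap_b, n, n_windows, rng):
    """Proportion of randomly drawn windows of n adjacent positions
    on which the two haplotypes are identical (no polymorphism)."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    last_start = len(hap_a) - n
    if last_start < 0:
        raise ValueError("window longer than the haplotypes")
    homozygous = 0
    for _ in range(n_windows):
        # Uniform draw of the window start (both bounds inclusive).
        start = rng.randint(0, last_start)
        if hap_a[start:start + n] == hap_b[start:start + n]:
            homozygous += 1
    return homozygous / n_windows


# Hypothetical toy data: two haplotypes differing at a single position.
rng = random.Random(0)
hap_a = "A" * 1000
hap_b = "A" * 500 + "C" + "A" * 499
for n in (1, 50, 200):
    print(n, empirical_hh(hap_a, hap_b, n, 2000, rng))
```

Longer windows are more likely to cover the polymorphic position, so the estimate decreases with n, which is the qualitative shape of the HH curves shown later in the talk.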
Theoretical $HH_{th}$ of MacLeod et al. [2009]
For a given value of $\theta$, $n \mapsto HH_{th}(\theta, n)$ is a time-consuming black-box function.
Assumption of MacLeod et al. [2009]
The possibility of more than one recombination per segment within one generation is ignored.
• Short segments, that is, small values of n: negligible consequences.
• Longer segments, that is, a larger number n of adjacent positions: the approximation moves farther and farther from reality. On a long segment, more than one recombination may occur in practice.
• The impact is even greater when the population size is large, since the probability of a coalescence event is then lower relative to the probability of recombination events.
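The last point can be illustrated with a rough order-of-magnitude comparison. The sketch below assumes a diploid Wright-Fisher population, where two haplotypes coalesce in a given generation with probability 1/(2 Ne), and a hypothetical recombination rate `c` per pair of adjacent positions per generation. The value of `c` and the function names are illustrative assumptions, not taken from the talk.

```python
def coalescence_prob(ne):
    """Per-generation coalescence probability of two haplotypes
    in a diploid Wright-Fisher population of effective size ne."""
    return 1.0 / (2.0 * ne)


def recombination_prob(n, c):
    """Probability of at least one recombination in one generation
    within a segment of n positions (n - 1 intervals), assuming
    independent crossovers at rate c per interval."""
    return 1.0 - (1.0 - c) ** (n - 1)


def recombination_to_coalescence_ratio(ne, n, c):
    """How much more likely recombination is than coalescence."""
    return recombination_prob(n, c) / coalescence_prob(ne)


c = 1e-8  # hypothetical rate, chosen only for illustration
for ne in (500, 5000):
    for n in (1000, 20000):
        ratio = recombination_to_coalescence_ratio(ne, n, c)
        print(f"Ne0={ne:5d}  n={n:6d}  ratio={ratio:.3f}")
```

For a fixed segment length the ratio grows linearly with the population size, so recombination events weigh more heavily against coalescence in large populations.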
Figure ($N_{e0} = 500$): HH versus segment length (n), for n from 0 to 20000. Legend: 100 HH(500); Mean of 100 HH(500); HHth.
Conclusion: Small population size, short segments: negligible consequences.
Figure ($N_{e0} = 5000$): HH versus segment length (n), for n from 0 to 20000. Legend: 100 HH(5000); Mean of 100 HH(5000); HHth.
Conclusion: Larger population size, longer segments: small bias.
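Parameter inference
We propose to estimate $\theta_0$ and $\theta_1$ by:
\[ \hat\theta_0 \in \arg\min_{\theta_0} \sum_{n=1}^{n_{pos}} \left( \frac{HH_{th}(\theta_0, n) - \widehat{HH}(n)}{\widehat{HH}(n)} \right)^2, \]
\[ \hat\theta_1 \in \arg\min_{\theta_1} \sum_{n=1}^{n_{pos}} \left( \frac{HH_{th}(\theta_1, n) - \widehat{HH}(n)}{\widehat{HH}(n)} \right)^2. \]
Evaluation of $n \mapsto HH_{th}(\theta_j, n)$ is time-consuming.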
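A minimal sketch of this relative least-squares fit under the constant-size model is given below. The function `toy_hh_th` is a hypothetical stand-in for the black-box predictor (it is not the MacLeod et al. formula), the `rate` value is an arbitrary assumption, and a plain grid search replaces whatever numerical optimizer is used in practice.

```python
def toy_hh_th(ne0, n, rate=1e-8):
    """Hypothetical stand-in for HH_th under M0 (NOT the MacLeod et al.
    predictor): chance that two lineages coalesce before any event of
    per-position rate `rate` hits a segment of n positions."""
    return 1.0 / (1.0 + 4.0 * ne0 * rate * n)


def rel_mse(hh_th, hh_obs):
    """Sum over n of the squared relative error (HH_th - HH_obs) / HH_obs."""
    return sum(((t - o) / o) ** 2 for t, o in zip(hh_th, hh_obs))


def fit_m0(hh_obs, lengths, candidates):
    """Grid-search estimate of Ne0 minimizing the relative MSE."""
    return min(
        candidates,
        key=lambda ne0: rel_mse([toy_hh_th(ne0, n) for n in lengths], hh_obs),
    )


# Noise-free toy "observations" generated with Ne0 = 500.
lengths = list(range(1000, 20001, 1000))
hh_obs = [toy_hh_th(500, n) for n in lengths]
print(fit_m0(hh_obs, lengths, range(100, 2001, 50)))
```

Each candidate costs one full evaluation of the predictor over all segment lengths, which is why a time-consuming $HH_{th}$ makes the minimization expensive.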
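Model choice penalty with sensitivity weights
Algorithm 1: Model Choice procedure
1. Computation of $\widehat{HH}$ on the data;
2. Estimation of the parameters under $M_0$ and $M_1$, numerically solving:
\[ \hat\theta_0 = \arg\min_{\theta_0} \mathrm{Rel\_MSE}(M_0) = \arg\min_{\theta_0} \sum_{n \in I} \left( \frac{HH_{th}(\theta_0, n) - \widehat{HH}(n)}{\widehat{HH}(n)} \right)^2, \]
\[ \hat\theta_1 = \arg\min_{\theta_1} \mathrm{Rel\_MSE}(M_1) = \arg\min_{\theta_1} \sum_{n \in I} \left( \frac{HH_{th}(\theta_1, n) - \widehat{HH}(n)}{\widehat{HH}(n)} \right)^2; \]
3. Model choice according to:
\[ \arg\min_{j \in \{0,1\}} \mathrm{Pena\_Rel\_MSE}(M_j) = \arg\min_{j \in \{0,1\}} \sum_{n \in I} w_j(n) \left( \frac{HH_{th}(\hat\theta_j, n) - \widehat{HH}(n)}{\widehat{HH}(n)} \right)^2, \]
where $w_j(n)$, $j \in \{0, 1\}$, are the sensitivity weights.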
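Step 3 of the procedure can be sketched as below, assuming the weights multiply the squared relative errors term by term. All numbers are made up for illustration, and the function names are hypothetical.

```python
def pena_rel_mse(hh_th, hh_obs, weights):
    """Weighted sum over n of squared relative errors."""
    return sum(
        w * ((t - o) / o) ** 2 for t, o, w in zip(hh_th, hh_obs, weights)
    )


def choose_model(hh_obs, fitted, weights):
    """Return the model with the smallest penalized criterion.

    fitted:  dict model name -> HH_th values at the estimated parameters
    weights: dict model name -> sensitivity weights w_j(n)
    """
    scores = {
        model: pena_rel_mse(fitted[model], hh_obs, weights[model])
        for model in fitted
    }
    return min(scores, key=scores.get), scores


# Made-up values at three segment lengths.
hh_obs = [0.95, 0.90, 0.85]
fitted = {"M0": [0.96, 0.90, 0.84], "M1": [0.955, 0.90, 0.845]}
weights = {"M0": [0.2, 0.2, 0.2], "M1": [1.0, 1.0, 1.0]}
best, scores = choose_model(hh_obs, fitted, weights)
print(best, scores)
```

In this toy case M1 fits slightly better in unweighted terms, but the down-weighted criterion of the simpler model is smaller, so M0 is retained. This is the intended protection against overfitting.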
Sensitivity Analysis
Sensitivity weights
Variance-based sensitivity analysis decomposes the output variance of a function into fractions that can be attributed to each input parameter through Sobol's sensitivity indices.
• Consider the input parameters $\theta = (\theta^{(1)}, \theta^{(2)}, \ldots)$ as random variables.
• $\theta \mapsto HH_{th}(\theta, n)$: the function to be evaluated.
• Apply the sensitivity analysis to $HH_{th}(\cdot, n)$ for any $n \in \{1, \ldots, n_{pos}\}$.
Sobol's sensitivity index of order one of the parameter $\theta^{(i)}$:
\[ S^{(1)}_{\theta^{(i)}}(n) = \frac{\mathrm{Var}_{\theta^{(i)}}\left( E_{\theta^{(\sim i)}}\left( HH_{th}(\theta, n) \mid \theta^{(i)} \right) \right)}{\mathrm{Var}\left( HH_{th}(\theta, n) \right)}. \]
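One common way to estimate such a first-order index is the pick-freeze Monte Carlo scheme, sketched below. The talk does not say which estimator was used, so this choice is an assumption, as are the independent Uniform(0, 1) inputs and the linear toy function standing in for $HH_{th}(\cdot, n)$.

```python
import random


def first_order_sobol(func, dim, index, n_samples, rng):
    """Pick-freeze Monte Carlo estimate of the first-order Sobol index
    of input `index`, for independent Uniform(0, 1) inputs."""
    y, y_frozen = [], []
    for _ in range(n_samples):
        a = [rng.random() for _ in range(dim)]
        b = [rng.random() for _ in range(dim)]
        b[index] = a[index]  # freeze the input of interest, resample the rest
        y.append(func(a))
        y_frozen.append(func(b))
    mean_y = sum(y) / n_samples
    mean_f = sum(y_frozen) / n_samples
    cov = sum((u - mean_y) * (v - mean_f) for u, v in zip(y, y_frozen)) / n_samples
    var = sum((u - mean_y) ** 2 for u in y) / n_samples
    return cov / var


def toy(x):
    """Toy function whose exact first-order indices are 0.2 and 0.8."""
    return x[0] + 2.0 * x[1]


rng = random.Random(42)
for i in range(2):
    print(i, first_order_sobol(toy, 2, i, 20000, rng))
```

Every sample requires two evaluations of the function, so with an expensive black box the cost of the sensitivity analysis is driven by the number of samples and of segment lengths n considered.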
Sobol's sensitivity indices
$\hat S^{(1)}_{\theta^{(i)}}(n)$: estimate of Sobol's sensitivity index of order one of the parameter $\theta^{(i)}$, computed for n adjacent markers under the more complex model.
Figure: first-order Sobol index estimates (from 0.0 to 1.0) versus segment length (n), for n from 0 to 20000. Legend: $\hat S_{N_{e0}}$; $\hat S_t$; $\hat S_f$; $\hat S_{N_{e0}} + \hat S_t + \hat S_f$.
Sensitivity weights
\[ w_0(n) = \left( \frac{\hat S_{N_{e0}}(n)}{\hat S_{N_{e0}}(n) + \hat S_t(n) + \hat S_f(n)} \right)^2, \qquad w_1(n) = 1. \]
Figure: weights $w_0(n)$ and $w_1(n)$ (from 0.0 to 1.0) versus segment length (n), for n from 0 to 20000.
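As a small worked example of the weight formula, the sketch below computes $w_0(n)$ from first-order Sobol estimates. The index values are invented for illustration and are not read off the figure.

```python
def sensitivity_weight_m0(s_ne0, s_t, s_f):
    """Weight w0(n): squared share of the first-order indices
    carried by Ne0, the only parameter of the simpler model M0."""
    return (s_ne0 / (s_ne0 + s_t + s_f)) ** 2


# Hypothetical Sobol estimates (S_Ne0, S_t, S_f) at three segment lengths.
estimates = {
    1000: (0.90, 0.05, 0.05),
    10000: (0.50, 0.30, 0.15),
    20000: (0.30, 0.30, 0.20),
}
for n, (s_ne0, s_t, s_f) in estimates.items():
    print(n, round(sensitivity_weight_m0(s_ne0, s_t, s_f), 4))
```

The weight stays close to 1 where $N_{e0}$ explains most of the variability of $HH_{th}$ and shrinks where $t$ and $f$ matter, so the errors of $M_0$ are discounted exactly where the extra parameters of $M_1$ have the most influence.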
Simulated data sets
Nested models: detecting a contraction
Figure: HH versus segment length (n), for n from 0 to 15000. Legend: Observed HH, (500, 1500, 10); HHth(θ0 = 874.3); HHth(θ1 = 525.8, 1619, 9.5).
Rel_MSE(M0) = 0.04491, Rel_MSE(M1) = 0.003796
Pena_rel_MSE(M0) = 0.03678, Pena_rel_MSE(M1) = 0.003796
Nested models: avoiding overfitting
Figure: HH versus segment length (n), for n from 0 to 15000. Legend: Observed HH, (500); HHth(θ0 = 496); HHth(θ1 = 496.1, 5001, 0.86).
Rel_MSE(M0) = 0.002121, Rel_MSE(M1) = 0.002119
Pena_rel_MSE(M0) = 0.001878, Pena_rel_MSE(M1) = 0.002119
Table: Model choice procedure applied on 100 simulated data sets. True Positive Rate (TPR) obtained with the penalized relative MSE criterion.

Demographic model        TPR (Pen. rel. MSE)   est. Ne0   CI0.95(Ne0)     est. t   CI0.95(t)      est. f   CI0.95(f)
θ0 = 500                 0.98                  499        [495, 502]      -        -              -        -
θ0 = 5000                0.82                  4894       [4866, 4923]    -        -              -        -
θ1 = (500, 1500, 10)     0.96                  542        [526, 557]      1886     [1767, 2005]   12.0     [11.6, 12.3]
Holstein data set
Table: Model choice procedure applied on the Holstein data set. Contraction Detection Rate (CDR).

CDR (Pen. rel. MSE)   est. Ne0   CI0.95(Ne0)     est. t   CI0.95(t)      est. f   CI0.95(f)
0.89                  6983       [6877, 7089]    7652     [6877, 8428]   4.25     [3.83, 4.67]
Conclusions and perspectives
• Conclusions
  - Accurate estimation of the population demographic history simulated under $M_0$ and $M_1$ using haplotype homozygosity.
  - Penalized model choice criterion with sensitivity weights: reliable detection of constant and contracting population size.
• Perspectives
  - Account for the possibility of more than one recombination event per segment within one generation in the theoretical $HH_{th}$ calculation.
  - Extend the model choice procedure to more complex models: piecewise constant population size with more changes, exponentially growing or declining population sizes.
Thank you for your attention.
References
R. C. Griffiths and S. Tavaré. Ancestral inference in population genetics. Statistical Science, 9:307–319, 1994.
I. MacLeod, T. Meuwissen, B. Hayes, and M. Goddard. A novel predictor of multilocus haplotype homozygosity: comparison with existing predictors. Genetics Research, 91(6):413–426, 2009.