An Algorithm for Computing the Gene Tree
Probability under the Multispecies Coalescent
and its Application in the Inference of
Population Tree
Yufeng Wu
Dept. of Computer Science & Engineering
University of Connecticut, USA
ISMB 2016, July 11, 2016
Coalescent Theory: Introduction
[Figure: population gene genealogy; labels: allele, sampled allele, coalescence; axes: generation, time]
Wright-Fisher model:
• Non-overlapping generations
• Constant population size
• Random mating
Coalescent: trace backward in time from sampled lineages.
Coalescence: two sampled lineages find a common ancestor.
Gene genealogy: determined by the coalescent process.
Stochastic: the time at which each coalescence occurs is random.
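The backward-in-time process can be sketched with a short simulation. This is a minimal sketch, assuming the standard continuous-time (Kingman) approximation of the Wright-Fisher model, with time measured in coalescent units; the function name and the seed are illustrative and not from the talk.

```python
import random

def simulate_coalescence_times(n, rng):
    """Return the cumulative times (coalescent units, an assumed scale) of the
    n-1 coalescent events for n sampled lineages, tracing backward in time."""
    times = []
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2.0      # one unit of rate per pair of lineages
        t += rng.expovariate(rate)    # random waiting time to the next coalescence
        times.append(t)
        k -= 1                        # two lineages merge into one
    return times

rng = random.Random(2016)
times = simulate_coalescence_times(5, rng)
print(times)
```

Each run with a different seed gives different event times, which is the stochasticity referred to above.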
Coalescent in Multiple Populations
“Outline” tree is the species (population) tree: evolutionary history of populations.
“Embedded” tree is the gene tree: evolutionary history of individual alleles.
[Figure: gene tree embedded in a species tree; labels: internal branch length T, coalescence]
Coalescent: tracing lineages backward in time to a common ancestor in a population.
Multispecies coalescent: each extant/ancestral population runs a separate coalescent; together these determine the gene genealogy with multiple populations.
Gene tree/species tree discordance: gene trees for the same species tree can have different topologies due to stochastic coalescence.
Stochasticity: inherent in the coalescent process (a gene tree has a certain probability of being observed under the coalescent process).
• Larger T: B and C are more likely to coalesce within T
• Smaller T: B and C may not coalesce within T (which may lead to a different gene tree)
Multispecies coalescent: allows computation of the probability of gene genealogies from multiple species (populations)
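The effect of T can be made concrete with a small calculation. This is a sketch under the assumption that T is measured in coalescent units, so that two lineages in one population coalesce at rate 1; the function name is illustrative.

```python
import math

def prob_coalesce_within(T):
    """Probability that two lineages (here B and C) coalesce within a branch
    of length T. Assumes T is in coalescent units (pairwise rate 1)."""
    return 1.0 - math.exp(-T)

for T in (0.1, 1.0, 5.0):
    print(f"T = {T}: P(B and C coalesce within T) = {prob_coalesce_within(T):.3f}")
```

The probability rises from near 0 for short branches toward 1 for long branches, matching the two bullets above.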
Gene Tree Probability
For a species tree, any gene tree topology can arise, but with different probabilities.
For a species tree Ts (with branch lengths) and a gene tree topology Tg:
Gene tree probability P(Tg|Ts): probability of observing a binary gene
tree topology Tg for species tree Ts under coalescent theory.
• The larger P(Tg|Ts) is, the more likely Tg will be observed.
• Has multiple applications in population and evolutionary genomics.
• e.g. Infer Ts from multiple gene genealogies Tg by maximum likelihood
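To illustrate the maximum likelihood use case, here is a minimal sketch for the smallest non-trivial case: a three-taxon species tree (A,(B,C)) with internal branch length T, one allele per species, and independent gene trees. It relies on the well-known three-taxon probabilities (matching topology: 1 - (2/3)e^(-T); each non-matching topology: (1/3)e^(-T)). The function names and the counts 70/30 are hypothetical, and T is assumed to be in coalescent units.

```python
import math

def topology_probs(T):
    """Probability of the gene tree topology matching (A,(B,C)) and of each
    of the two non-matching topologies, for internal branch length T."""
    other = math.exp(-T) / 3.0
    return 1.0 - 2.0 * other, other

def log_likelihood(T, n_match, n_other):
    """Log-likelihood of T given counts of matching and non-matching gene
    trees (gene trees assumed independent)."""
    p_match, p_other = topology_probs(T)
    return n_match * math.log(p_match) + n_other * math.log(p_other)

def ml_branch_length(n_match, n_other):
    """Closed-form maximizer of log_likelihood over T >= 0."""
    frac_other = n_other / (n_match + n_other)
    if frac_other == 0.0:
        return math.inf               # no discordance observed
    return max(0.0, -math.log(1.5 * frac_other))

T_hat = ml_branch_length(n_match=70, n_other=30)
print(T_hat)
```

For larger trees no such closed form is available, which is why an efficient way to evaluate P(Tg|Ts) matters.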
For small gene trees and species trees, calculation by hand may be
feasible (e.g. Hudson 1983; Takahata and Nei 1985; Rosenberg 2002),
usually by listing all possible gene trees and species trees with,
say, five species.
For larger trees, an algorithm is needed. (Nowadays, gene trees and
species trees can have hundreds of taxa/alleles)
Key: efficient computation of the gene tree probability.
An algorithm for Gene Tree Probability (Degnan and Salter, 2005)
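Coalescent history: specifies on which species tree branch each coalescent event occurs.
[Figure: two coalescent histories (History 1, History 2) of the same gene tree on lineages a, b, c, in a species tree with internal branch length T]
• a and b coalesce within T
• Then coalescence with c above T.
Note: a and b can instead coalesce above T, which gives a different history of the same gene tree.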
Degnan and Salter: P(Tg|Ts) = Σ_H P(Tg, H|Ts), where H ranges over the coalescent histories of Tg.
Why enumerate H? For each species tree branch, H specifies that some u gene lineages coalesce into v lineages.
Classic result in coalescent theory (Tavaré 1984; Watterson 1984; Takahata and Nei 1985): puv(T), the probability that u (unlabeled) lineages coalesce to v lineages within time T, has a closed form.
[Figure: u lineages entering a branch of length T and leaving as v lineages]
P(Tg, H|Ts): product of puv(T) over all species tree branches (independence assumed)
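For reference, puv(T) can be evaluated directly. The sketch below uses the closed form as it is commonly stated in this literature, with T assumed to be in coalescent units; the function name p_uv and the special handling of an infinitely long (root) branch are choices made here, not taken from the talk.

```python
import math

def p_uv(u, v, T):
    """Probability that u lineages coalesce to exactly v lineages within
    time T (coalescent units), for 1 <= v <= u."""
    if T == math.inf:
        # On an infinitely long branch all lineages eventually coalesce to one.
        return 1.0 if v == 1 else 0.0
    total = 0.0
    for k in range(v, u + 1):
        term = math.exp(-k * (k - 1) * T / 2.0)
        term *= (2 * k - 1) * (-1) ** (k - v)
        term /= math.factorial(v) * math.factorial(k - v) * (v + k - 1)
        for y in range(k):
            term *= (v + y) * (u - y) / (u + y)
        total += term
    return total

print(p_uv(2, 1, 1.0), p_uv(2, 2, 1.0))
```

For u = 2 this reduces to p21(T) = 1 - e^(-T) and p22(T) = e^(-T), and for any u the values over v = 1..u sum to 1.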
An algorithm for Gene Tree Probability (Degnan and Salter, 2005)
Recall: puv(T) is the probability that u (unlabeled) lineages coalesce to v
lineages within time T.
For a fixed coalescent history H:
P(Tg, H|Ts) = p21(T1) * p22(T2) * p31(∞) * C, where C is a combinatorial factor.
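The sum over coalescent histories can be worked through on a smaller case than the one on the slide. This is a sketch assuming a species tree ((A,B),C) with internal branch length T in coalescent units, one allele per species, and gene tree ((a,b),c), which has exactly two coalescent histories. The combinatorial factor 1/3 is the chance that, among three lineages at the root, the first pair to coalesce is (a,b). Names are illustrative.

```python
import math

def history_probs(T):
    """Return P(Tg, H | Ts) for the two coalescent histories of gene tree
    ((a,b),c) in species tree ((A,B),C) with internal branch length T."""
    p21 = 1.0 - math.exp(-T)   # a and b: 2 lineages -> 1 within T
    p22 = math.exp(-T)         # a and b: 2 lineages stay 2 within T
    p_root = 1.0               # all lineages coalesce on the infinite root branch
    # History 1: a and b coalesce within T, then coalesce with c at the root.
    h1 = p21 * p_root
    # History 2: a, b and c all reach the root; 1 of the 3 possible first
    # pairs, namely (a,b), is consistent with the gene tree.
    h2 = p22 * p_root * (1.0 / 3.0)
    return h1, h2

h1, h2 = history_probs(1.0)
print(h1, h2, h1 + h2)
```

The total h1 + h2 equals 1 - (2/3)e^(-T), the known probability of the matching topology for three taxa.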
Main challenge: need to consider all possible coalescent histories.
No known polynomial-time algorithms for gene tree probability computation.
STELLS algorithm (Wu, Evolution, 2012): another algorithm for gene tree
probability. Faster than Degnan and Salter’s, but still slow for large trees
(exponential time in general).
Population genetics: large number of gene
alleles (small number of populations)
Phylogenetics study: Hundreds of species; 1 or several alleles per species
Population genetics study: 1000 Genomes Project: 26 populations; haplotypes
from 2504 individuals
Question: can gene tree probability be computed efficiently for
many gene alleles when the number of populations is small?
Application: inference of demographic history
This talk: polynomial-time algorithm for computing the gene tree
probability when the number of populations is fixed to a constant.
Basic idea: merge multiple coalescent histories into a compact
coalescent history.
Compact Coalescent History (CCH)
Upper lineage count (ULC): number (not specific lineages as in coalescent history) of gene lineages at the top of a population tree branch.
[Figure: two coalescent histories (history 1, history 2) of the same gene tree with internal nodes c, d, e, f, g, h; population A has alleles a1, a2, a3, a4 and population B has alleles b1, b2, b3; both histories have ULC1=3, ULC2=1, ULC3=1]
CCH: (3, 1, 1)
Compact coalescent history (CCH): ordered list of ULCs, one for each
population tree branch.
Two different coalescent histories can lead to the same CCH.
The number of CCHs is smaller than the number of coalescent histories.
Gene tree probability: can be computed from CCHs (details omitted).
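How two histories collapse to one CCH can be sketched as follows. The setup is hypothetical and only modeled on the two-population example: branch 1 is population A (4 alleles), branch 2 is population B (3 alleles), branch 3 is the root. The assignment of gene tree nodes c to h to branches is an assumption made for illustration, not read off the figure.

```python
# Hypothetical population tree: branches 1 (A) and 2 (B) join at root branch 3.
ALLELES = {1: 4, 2: 3}        # sampled alleles entering each leaf branch
CHILDREN = {3: (1, 2)}        # child branches of each internal branch
BRANCH_ORDER = (1, 2, 3)      # children listed before their parent

def compact_history(history):
    """history maps each gene tree coalescent node to the population tree
    branch where it occurs. Returns the tuple of upper lineage counts."""
    ulc = {}
    for branch in BRANCH_ORDER:
        entering = ALLELES.get(branch, 0) + sum(ulc[c] for c in CHILDREN.get(branch, ()))
        events = sum(1 for b in history.values() if b == branch)
        ulc[branch] = entering - events   # each coalescence removes one lineage
    return tuple(ulc[b] for b in BRANCH_ORDER)

# Two assumed histories that differ only in which of c, d occurs inside A.
history_1 = {"c": 1, "d": 3, "e": 2, "f": 2, "g": 3, "h": 3}
history_2 = {"c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 3}
print(compact_history(history_1), compact_history(history_2))
```

Both histories give (3, 1, 1): the CCH records only how many lineages leave each branch, not which ones.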
Number of Compact Coalescent Histories is Polynomially
Bounded for a Constant Number of Populations
n: number of gene lineages
m: number of population tree branches
|CCH| = m; when the number of populations is constant, m is constant.
[Figure: population tree with five branches whose ULCs are c1, c2, c3, c4, c5; c1, c2, c3, c4 ≤ n and c5 = 1 at the root]
CCH: (c1,c2,c3,c4,1)
CCH: length-m vector of integers. At each position, the value ranges from 1 to n.
The number of CCHs ≤ n^m: polynomial in n when m is constant.
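The counting argument can be checked by brute force for small inputs. This is a sketch only: it enumerates every length-m vector with entries in 1..n and root entry 1, which is a superset of the valid CCHs (it ignores the constraints imposed by the gene tree and population tree). Function names are illustrative.

```python
from itertools import product

def candidate_cchs(n, m):
    """Yield all length-m integer vectors with entries in 1..n whose last
    (root) entry is 1. Every valid CCH is among these candidates."""
    for prefix in product(range(1, n + 1), repeat=m - 1):
        yield prefix + (1,)

def cch_upper_bound(n, m):
    """The bound n^m on the number of CCHs."""
    return n ** m

m = 3  # e.g. two populations plus the root branch
for n in (4, 8, 16):
    count = sum(1 for _ in candidate_cchs(n, m))
    print(n, count, cch_upper_bound(n, m))
```

With m fixed, doubling n multiplies the bound by a constant factor (2^m), which is polynomial rather than exponential growth.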
[Figure: run time in seconds (log scale, 1 to 100000) versus allele number per population (1, 2, 5, 10, 15, 20); series: New, STELLS]
Run time (seconds) for computing the gene tree probability for 500 gene
trees using the new algorithm and the original STELLS: two populations.
Application: Inference of Population Trees Using
Pairwise Distance (also see Wu, Bioinformatics, 2015)
Assume: infinite sites model and no intra-locus recombination.
Pipeline: population haplotypes → gene genealogies → pairwise population distance → neighbor joining → population tree.
• Estimate the population divergence distance D of two populations: maximum likelihood estimate (gene tree probability algorithm).
[Figure: haplotypes of two populations A and B over sites 1-4 (a1 AAAA, a2 CAGA, b1 CTGA, b2 AAAC); gene genealogies with mutations 1-4 marked on branches; example population tree on A, B, C; numeric annotations 1.1, 0.9 and 1.0]
Pairwise population distance (example):
    A    B    C
A   -    2.0  1.0
B        -    2.0
C             -
Compare w/ TreeMix (Pickrell and Pritchard, 2012):
• TreeMix doesn’t consider linkage disequilibrium (LD).
• Our approach is more accurate (but slower) than TreeMix.
20 alleles for each of 8 populations. Population tree height of 0.1.
• Average Robinson-Foulds error for our method: 0.11
• TreeMix: 0.18
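For readers unfamiliar with the error measure, the following is a minimal sketch of a Robinson-Foulds style distance for rooted trees, counting the clusters (leaf sets under internal nodes) found in one tree but not the other. The nested-tuple tree representation and function names are assumptions made here, and the normalization behind the averages reported above is not specified in the slides.

```python
def clusters(tree):
    """Return the set of leaf sets under the internal nodes of a rooted tree
    given as nested tuples of leaf-name strings."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found

def rf_distance(tree_a, tree_b):
    """Number of clusters present in exactly one of the two trees."""
    return len(clusters(tree_a) ^ clusters(tree_b))

true_tree = (("A", "C"), "B")
inferred_tree = (("A", "B"), "C")
print(rf_distance(true_tree, true_tree), rf_distance(true_tree, inferred_tree))
```

Identical trees have distance 0; the two three-population trees above differ in one cluster each, giving distance 2.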
Partly supported by U.S. National Science Foundation grants IIS-0953563 and IIS-1447711.