Making the most of DArT data for phylogenetic inference

Barbara Holland & Michael Woodhams (Maths & Physics)
Vincent Moulton (Computational Biology)
Dorothy Steane (Plant Science)
Generating DArTs
Diversity Array Technologies
1. Collect DNA from reference individuals
2. Digest with one 6bp rare cutter (CTGCAG) and one 4bp frequent cutter (TCGA)
3. Only fragments with two rare ends are amplified and retained
4. Create a microarray with these fragments (~2-3% of the genome)
5. Analyse phylogenetic samples by digesting them with the same cutters and running them against the microarray (DNA-DNA hybridisation).
Each fragment is scored 1 (present) or 0 (absent)*
* This is in math fantasy land – in real life you also get ?s
Properties of DArT data

• Data are binary (fragments are present or absent, 1/0)
• A random set of fragments from across the genome
• Fragments are much more likely to be lost in parallel than gained in parallel
• Data exhibit an ascertainment bias: we can observe only the fragments on the chip. These fragments were derived from a small set of reference taxa.
The model

• Fragment evolution can be modeled as a stochastic Dollo process, i.e. each fragment is gained once but potentially lost many times
• Parallel gains are forbidden
• Fragments are lost at a constant rate r (memoryless)
• Chance of loss over time t is 1-exp(-rt)
Hamming Horror
Hamming distance D = (n10+n01)/(n11+n00+n10+n01)
[Figure: tree descending from the reference taxon Ref to taxa B, C and D, with a 16-fragment presence/absence pattern drawn at each node. All 16 fragments are present at Ref.]
D(Ref,B) = 12/16 = (12+0)/(4+0+12+0)
D(C,D) = 2/16 = (1+1)/(1+13+1+1)
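The two distances above can be checked with a few lines of Python. This is a minimal sketch: the patterns for Ref and B follow the counts on the slide, while the patterns for C and D are assumptions chosen only to reproduce n11=1, n00=13, n10=1, n01=1. The function names `pattern_counts` and `hamming` are illustrative.

```python
def pattern_counts(x, y):
    """Count the four presence/absence patterns between two 0/1 strings."""
    counts = {"n11": 0, "n10": 0, "n01": 0, "n00": 0}
    for a, b in zip(x, y):
        counts["n" + a + b] += 1
    return counts


def hamming(x, y):
    """Hamming distance D = (n10 + n01) / (n11 + n00 + n10 + n01)."""
    c = pattern_counts(x, y)
    return (c["n10"] + c["n01"]) / sum(c.values())


ref = "1111" * 4                  # all 16 fragments present at the reference
b = "1111" + "0000" * 3           # 4 fragments still present at B
c = "1100" + "0000" * 3           # assumed pattern for C
d = "1000" + "1000" + "0000" * 2  # assumed pattern for D

print(hamming(ref, b))  # 0.75  (= 12/16)
print(hamming(c, d))    # 0.125 (= 2/16)
```

C and D share only one of their two fragments each, yet the 13 shared absences make them look far closer than Ref and B.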
Hamming simulation
[Figure: underlying tree used in simulation]
[Figure: tree based on Hamming distances, using A as the reference taxon]
A distance correction is required
[Figure: tree with reference taxon R and taxa A and B]
• Let n00 be the number of fragments absent at both A and B
• Let n01 be the number of fragments absent at A and present at B
• Let n10 be the number of fragments present at A and absent at B
• Let n11 be the number of fragments present at both A and B
Michael Woodhams's key observation was that, due to the Dollo nature of the process, any fragment that is present at the reference taxon R and at taxon A must also be present at the internal node X.
[Figure: tree with reference taxon R and taxa A and B; X is the internal node where the path from R meets the path between A and B]
Recall, chance of survival over time t is exp(-rt)
d(X,B) = -log[probability fragment survives from X to B]
Anything present at A is known to be present at X
=> d(X,B) = -log[n11/(n11+n10)]
Likewise, anything present at B is known to be present at X
=> d(A,X) = -log[n11/(n11+n01)]
d(A,B) = d(A,X) + d(X,B)
= -log[n11/(n11+n01)] - log[n11/(n11+n10)]
= log[(1+n01/n11)(1+n10/n11)]
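The derivation above translates directly into code. The following is a minimal sketch, not the authors' implementation: `dart_distance` is an illustrative name, and the `base` parameter is an assumption added because the worked multiple-reference example in these slides (n10=6, n01=2, n11=2 giving d(A,B)=3) corresponds to base-2 logs.

```python
import math


def dart_distance(n11, n10, n01, base=math.e):
    """DArT distance d(A,B) = -log[n11/(n11+n01)] - log[n11/(n11+n10)].

    n00 does not appear in the formula, so it is not an argument.
    """
    if n11 <= 0:
        raise ValueError("n11 must be positive for the distance to be defined")
    d_ax = -math.log(n11 / (n11 + n01))  # A to internal node X
    d_xb = -math.log(n11 / (n11 + n10))  # X to B
    return (d_ax + d_xb) / math.log(base)


# Hypothetical counts, for illustration only
n11, n10, n01 = 80, 15, 5
d = dart_distance(n11, n10, n01)

# The two-term form and the product form of the derivation agree
assert math.isclose(d, math.log((1 + n01 / n11) * (1 + n10 / n11)))
print(round(d, 4))
```

Calling `dart_distance(2, 6, 2, base=2)` gives the value 3 quoted in the multiple-reference example.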
A zoo of distances

• Hamming: dH = (n01+n10)/(n11+n10+n01+n00)
• Log Det: dLD = log[det[D]] - 0.5 Σk (log(Ck) + log(Rk))
• Jaccard: dJ = (n01+n10)/(n11+n10+n01)
• Log Jaccard: dLJ = -log(1-dJ) = -log[n11/(n11+n10+n01)]
• HS: dHS = -log[2n11/(2n11+n10+n01)]
• Nei Li: F = 2n11/(2n11+n10+n01); F = Q^2/(2-Q); dNL = -log(Q)
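Most of these distances depend only on the four pattern counts, so they are easy to compare side by side. The sketch below is illustrative: `distance_zoo` is a made-up name, the Log Det distance is omitted because D, Ck and Rk are not defined on the slide, and the Nei Li value of Q is obtained by solving F = Q^2/(2-Q) as a quadratic (a step not shown on the slide). It assumes n11 > 0.

```python
import math


def distance_zoo(n11, n10, n01, n00):
    """Return several of the listed distances for one pair of taxa."""
    mismatches = n10 + n01
    d_hamming = mismatches / (n11 + n10 + n01 + n00)
    d_jaccard = mismatches / (n11 + n10 + n01)
    d_log_jaccard = -math.log(n11 / (n11 + n10 + n01))
    f = 2 * n11 / (2 * n11 + n10 + n01)
    d_hs = -math.log(f)
    # Nei Li: F = Q^2 / (2 - Q)  <=>  Q^2 + F*Q - 2F = 0; take the positive root
    q = (-f + math.sqrt(f * f + 8 * f)) / 2
    d_nei_li = -math.log(q)
    # DArT distance, for comparison
    d_dart = math.log((1 + n01 / n11) * (1 + n10 / n11))
    return {
        "hamming": d_hamming,
        "jaccard": d_jaccard,
        "log_jaccard": d_log_jaccard,
        "hs": d_hs,
        "nei_li": d_nei_li,
        "dart": d_dart,
    }


# Hypothetical counts, for illustration only
for name, value in distance_zoo(n11=2, n10=6, n01=2, n00=6).items():
    print(f"{name:12s} {value:.4f}")
```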
Simulations
Random (Yule) topology, edge lengths chosen from a uniform distribution, 0.05 < l < 0.40
Yule tree, subject to minimum edge length 0.01
Simulation details
• Choose an arbitrary node at which to start the process. At this node, the number of DArT fragments is taken from a Poisson distribution with mean M. (We use the result from HS 2004 that a stochastic Dollo process is independent of the root.)
• Propagate outward from the start point along tree edges, so that each new node acquires some new DArT fragments and inherits some of those from its parent.
• If the edge length is l, then the probability of a given fragment present in the parent still being present at the end of the edge is exp(-l).
• The number of new fragments in the child but not the parent is Poisson distributed with mean (1-exp(-l))M.
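The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the talk: the tree is a hypothetical adjacency dict, `simulate_dollo`, `poisson` and `ascertain` are made-up names, and the `ascertain` step (keeping only fragments present in a reference taxon) is an addition reflecting the ascertainment bias described earlier.

```python
import math
import random
from collections import deque


def poisson(rng, mean):
    """Sample a Poisson variate by counting unit-rate arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_dollo(tree, start, mean_fragments, seed=None):
    """Simulate fragment presence on a tree.

    tree: adjacency dict, tree[u][v] = length of edge u-v (both directions).
    Returns {node: set of fragment ids present at that node}.
    """
    rng = random.Random(seed)
    next_id = poisson(rng, mean_fragments)
    present = {start: set(range(next_id))}
    queue = deque([start])
    while queue:
        parent = queue.popleft()
        for child, length in tree[parent].items():
            if child in present:
                continue  # already visited: this is the edge we arrived by
            keep = math.exp(-length)  # survival probability along the edge
            inherited = {f for f in present[parent] if rng.random() < keep}
            n_new = poisson(rng, (1 - keep) * mean_fragments)
            gained = set(range(next_id, next_id + n_new))
            next_id += n_new
            present[child] = inherited | gained
            queue.append(child)
    return present


def ascertain(present, references):
    """Keep only fragments seen in at least one reference taxon (the chip)."""
    chip = set().union(*(present[r] for r in references))
    return {node: frags & chip for node, frags in present.items()}


# Hypothetical 4-taxon tree with internal nodes X and Y
tree = {
    "R": {"X": 0.1}, "A": {"X": 0.2},
    "X": {"R": 0.1, "A": 0.2, "Y": 0.1},
    "Y": {"X": 0.1, "B": 0.3, "C": 0.2},
    "B": {"Y": 0.3}, "C": {"Y": 0.2},
}
full = simulate_dollo(tree, start="X", mean_fragments=500, seed=1)
observed = ascertain(full, references=["R"])
print({taxon: len(observed[taxon]) for taxon in "RABC"})
```

Giving every gained fragment a fresh id enforces the Dollo condition: a fragment can arise only once, so any fragment shared by two taxa is present along the whole path between them.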
Simulations
Selection of Reference Taxa
[Figure: example trees illustrating each scheme]
• One ref, included
• One ref, excluded
• Two refs, included
• Two refs, excluded
• All taxa are refs
Simulations (distance matrix -> tree by FastME)
[Figure: all taxa are references, 9 taxa.]
[Figure: single reference, excluded, 9 taxa.]
[Figure: single reference, included, 9 taxa.]
Multiple References
[Figure: tree with reference taxa R and S, taxa A and B, and labels r and s, with a fragment presence/absence pattern drawn at each node]
If R were the only reference, we'd only see the coloured sites.
n10=6, n01=2, n11=2, d(A,B) = -log(2/4) - log(2/8) = 3 (logs to base 2)
With R and S as references
n10=7, n01=7, n11=3, d(A,B) = -2log(3/10) = 3.474 (logs to base 2)
Generalizing the DArT Distance
• The DArT distance does less well when there is more than one reference taxon.
• Define dRDa(A,B;R) = DArT distance between A and B calculated only from sites that are 1 at R.
• Then dGD(A,B) is a weighted average:
dGD(A,B) = (ΣR dRDa(A,B;R) √nR) / (ΣR √nR)
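The weighted average above might be computed as in the sketch below. Two assumptions are made that the slide does not state: nR is taken to be the number of sites that are 1 at R, and any R for which A and B share no fragments (n11 = 0) is skipped. The function names and the toy data are hypothetical.

```python
import math


def dart_distance_given(a, b, r):
    """DArT distance between a and b using only sites that are 1 at r.

    Returns (distance, n_r); distance is None when n11 is zero.
    """
    n11 = n10 = n01 = n_r = 0
    for x, y, z in zip(a, b, r):
        if not z:
            continue  # site is absent at r, so it is ignored
        n_r += 1
        if x and y:
            n11 += 1
        elif x:
            n10 += 1
        elif y:
            n01 += 1
    if n11 == 0:
        return None, n_r
    return math.log((1 + n01 / n11) * (1 + n10 / n11)), n_r


def generalized_dart(a, b, conditioning_taxa):
    """Average of d(A,B;R) over R, weighted by sqrt(n_R)."""
    numerator = denominator = 0.0
    for r in conditioning_taxa:
        d, n_r = dart_distance_given(a, b, r)
        if d is None:
            continue  # assumption: skip R with no shared fragments
        weight = math.sqrt(n_r)
        numerator += weight * d
        denominator += weight
    if denominator == 0:
        raise ValueError("no taxon R gives a defined distance")
    return numerator / denominator


# Toy 0/1 data, for illustration only
R = [1, 1, 1, 1, 0, 0]
S = [0, 0, 1, 1, 1, 1]
A = [1, 1, 1, 0, 1, 0]
B = [1, 0, 1, 1, 1, 1]
print(round(generalized_dart(A, B, [R, S]), 4))
```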
Partitioned DArT distances
(under construction)
• When the reference taxa are known (typically the case)
• And it's also known which fragments come from which reference taxon (not always the case)
• You can define a partitioned DArT distance that takes a weighted average of the DArT distance for each partition.
Simulations
All taxa are references, 9 taxa.
Simulations
DArT, Generalized DArT and HS trees (FastME)
94 eucalypt taxa
8 reference taxa
Norwich
• Why does the Generalised DArT distance perform so well when the reference taxa are included and so poorly when they are not?
Single reference
Pattern probabilities can be computed by rooting the tree at the reference taxon and then only considering loss of fragments.
[Figure: tree rooted at reference R with taxa A, B, C and D; edge loss probabilities pr, pa, pbcd, pb, pcd, pc, pd]
E.g. the probability of seeing the pattern R=1, A=0, B=0, C=1, D=1 is
(1-pr)pa(1-pbcd)pb(1-pcd)(1-pc)(1-pd)
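The example probability above is a product with one factor per edge: (1-p) where the fragment survives the edge and p where it is lost. A minimal sketch follows, assuming every edge has loss probability 0.01 as on the following slides; the edge names and `pattern_probability` are illustrative. It only covers patterns, like this one, where the outcome on every edge is determined by the observed states.

```python
import math


def pattern_probability(loss, survived):
    """Multiply (1 - p) for each edge survived and p for each edge lost."""
    prob = 1.0
    for edge, p in loss.items():
        prob *= (1 - p) if survived[edge] else p
    return prob


edges = ("r", "a", "bcd", "b", "cd", "c", "d")
loss = {edge: 0.01 for edge in edges}  # assumed value for every edge

# Pattern R=1, A=0, B=0, C=1, D=1: lost on the edges to A and B only
survived = {"r": True, "a": False, "bcd": True, "b": False,
            "cd": True, "c": True, "d": True}

prob = pattern_probability(loss, survived)
assert math.isclose(prob, 0.99 ** 5 * 0.01 ** 2)
print(prob)
```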
Reference unknown
Set all edge probabilities to 0.01
[Figure: the same tree, with taxa R, A, B, C and D and edge probabilities pr, pa, pbcd, pb, pcd, pc, pd]

Assumed reference   R       B       D
n01/n11             0.010   0.010   0.010
n10/n11             0.031   0.020   0.010
D(A,C)              0.040   0.030   0.020

D(A,C) = log[(1+n01/n11)(1+n10/n11)]
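The D(A,C) values above can be reproduced by enumerating leaf patterns. The sketch below rests on an assumed reading of the slide: the chip is built from R (so every observed fragment is present at R), and each column restricts the calculation to fragments that are also present at the named taxon. The tree encoding and function names are hypothetical.

```python
import math
from itertools import product


def subtree_prob(node, present, pattern):
    """P(leaf states below node | fragment state at node), losses only."""
    if isinstance(node, str):  # a leaf, identified by its taxon name
        return 1.0 if pattern[node] == int(present) else 0.0
    total = 1.0
    for loss, child in node:
        if present:
            total *= ((1 - loss) * subtree_prob(child, True, pattern)
                      + loss * subtree_prob(child, False, pattern))
        else:
            total *= subtree_prob(child, False, pattern)  # no regain (Dollo)
    return total


p = 0.01                       # loss probability on every edge
Z = [(p, "C"), (p, "D")]       # edges pc, pd
Y = [(p, "B"), (p, Z)]         # edges pb, pcd
X = [(p, "A"), (p, Y)]         # edges pa, pbcd
root = [(p, X)]                # edge pr, just below the reference R


def distance_a_c(condition_on):
    """D(A,C) from fragments present at R and at `condition_on`."""
    n11 = n10 = n01 = 0.0
    for states in product((0, 1), repeat=4):
        pattern = dict(zip("ABCD", states))
        pattern["R"] = 1
        if pattern[condition_on] != 1:
            continue
        prob = subtree_prob(root, True, pattern)
        if pattern["A"] and pattern["C"]:
            n11 += prob
        elif pattern["A"]:
            n10 += prob
        elif pattern["C"]:
            n01 += prob
    return math.log((1 + n01 / n11) * (1 + n10 / n11))


for taxon in "RBD":
    print(taxon, f"{distance_a_c(taxon):.3f}")  # R 0.040, B 0.030, D 0.020
```

Under this reading only the R column recovers the full four-edge path between A and C, which is -4 log(0.99), about 0.040.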
Multiple reference taxa
In the multiple reference setting you also have to consider gain of fragments down any edge that is above a reference taxon.
[Figure: tree with reference taxa R and S and taxa A, B and C; edge probabilities pr, pa, pbcs, pb, pcs, pc, ps]
E.g. the probability of seeing the pattern R=0, A=0, B=0, C=1, S=1 has 4 terms:
prpa(1-pbcs)pb(1-pcs)(1-pc)(1-ps) +
pbcspb(1-pcs)(1-pc)(1-ps) +
pcs(1-pc)(1-ps) +
ps
* need to renormalise probabilities
Set all edge probabilities to 0.01
[Figure: tree with reference taxa R and S and taxa A and B; edge probabilities pr, pa, pbs, pb, ps]

Reference   R       S       R or S
n01/n11     0.010   0.020   0.020
n10/n11     0.020   0.010   0.020
D(A,B)      0.030   0.030   0.040

D(A,B) = log[(1+n01/n11)(1+n10/n11)]

Reference   R        S        R or S
n00         0.0102   0.0102   0.0203
n01         0.0097   0.0195   0.0196
n10         0.0195   0.0097   0.0196
n11         0.9606   0.9606   0.9702
Total       1.0000   1.0000   1.0297

* need to renormalise probabilities
Future ideas
• The small examples we worked through in Norwich suggest two new ideas to be tested by simulation
• In the case of unknown references, compute D(X,Y|R) for each R and take the max.
• In the case of known references, a modification to the Generalised DArT that only averages over the references
Future work - hybridisation
Links to other people's work
• Gene content evolution with HGT, aka controlling ancestral genome obesity (Tal Dagan, Bill Martin)
• Language evolution with borrowing (Geoff Nicholls, Russell Gray)
BIG Thanks to Torsten and Shiju
http://www.maths.utas.edu.au/phylomania/phylomania2011.htm