Lecture 3. Phylogeny methods: Distance methods

Joe Felsenstein
Department of Genome Sciences and Department of Biology
Lecture 3. Phylogeny methods: Distance methods – p.1/25
Distance methods
These have been attractive, particularly to mathematical scientists who
love geometry. This has had both good and bad effects.
1. Take the sequences in all pairs.
2. For each pair compute a distance. (As we will see, this is best
thought of as the length of the 2-species tree for those species).
3. Try to find that tree which best fits the table of distances.
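Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming already-aligned sequences and using the raw fraction of differing sites as the distance, which is cruder than the two-species tree length recommended above. The function names and the toy sequences are hypothetical, made up for this example.

```python
from itertools import combinations

def fraction_different(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)

def distance_table(sequences):
    """Step 1 and step 2: take all pairs and compute a distance for each."""
    return {
        (x, y): fraction_different(sequences[x], sequences[y])
        for x, y in combinations(sequences, 2)
    }

# Hypothetical aligned sequences, invented for illustration only.
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTTCGTAC",
    "C": "ACGAACGTTC",
}
table = distance_table(toy)
for pair, dist in table.items():
    print(pair, dist)
```

Step 3, the search for the best-fitting tree, is what the rest of the lecture is about.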
A phylogeny with branch lengths
and the pairwise distances it predicts
[Figure: an unrooted tree with tips A, B, C, D, E and branch lengths 0.08, 0.10, 0.05, 0.06, 0.03, 0.05, 0.07]

     A     B     C     D     E
A    0     0.23  0.16  0.20  0.17
B    0.23  0     0.23  0.17  0.24
C    0.16  0.23  0     0.15  0.11
D    0.20  0.17  0.15  0     0.21
E    0.17  0.24  0.11  0.21  0
A phylogeny with branch lengths
[Figure: a tree with tips A, B, C, D, E whose seven branches are labeled v1 through v7]
Least squares trees
Least squares methods minimize
\[ Q = \sum_{i=1}^{n} \sum_{j \ne i} w_{ij} (D_{ij} - d_{ij})^2 \]
over all trees, using the distances $d_{ij}$ that they predict.
Cavalli-Sforza and Edwards suggested $w_{ij} = 1$; Fitch and
Margoliash suggested $w_{ij} = 1/D_{ij}^2$.
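The criterion Q can be evaluated directly for any one tree once its predicted distances are known. The sketch below is only an illustration: the function names are hypothetical and the 3-species observed and predicted distances are made-up numbers, not data from the lecture. As in the formula, each pair is counted twice (once as i,j and once as j,i).

```python
def least_squares_q(observed, predicted, weight):
    """Sum of w_ij * (D_ij - d_ij)^2 over all ordered pairs with i != j."""
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                w = weight(observed[i][j])
                total += w * (observed[i][j] - predicted[i][j]) ** 2
    return total

def unweighted(D):
    """Cavalli-Sforza and Edwards weighting: w_ij = 1."""
    return 1.0

def fitch_margoliash(D):
    """Fitch and Margoliash weighting: w_ij = 1 / D_ij^2."""
    return 1.0 / D ** 2

# Hypothetical observed (D) and tree-predicted (d) distances.
observed = [[0.0, 0.30, 0.50],
            [0.30, 0.0, 0.40],
            [0.50, 0.40, 0.0]]
predicted = [[0.0, 0.32, 0.48],
             [0.32, 0.0, 0.40],
             [0.48, 0.40, 0.0]]

q_cse = least_squares_q(observed, predicted, unweighted)
q_fm = least_squares_q(observed, predicted, fitch_margoliash)
print(q_cse, q_fm)
```

Under the Fitch-Margoliash weights the same absolute error counts for more on the short distance (0.30) than on the long one (0.50).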
Statistical assumptions of least squares trees
The implicit assumption is that the distances are (independently?) Normally
distributed with expectation $d_{ij}$ and variance proportional to $1/w_{ij}$:
\[ D_{ij} \sim N(d_{ij}, K/w_{ij}) \]
Thus the different weightings correspond to different assumptions about
the error in the distances. Also, there is assumed to be no covariance of
distances.
In fact, the distances will covary, since a change in an interior branch of
the tree increases (or decreases) all distances whose paths go through
that branch.
Matrix approach to fitting branch lengths
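If we stack the distances up into a column vector D,
\[ D^T = (D_{12}, D_{13}, D_{14}, D_{15}, D_{23}, D_{24}, D_{25}, D_{34}, D_{35}, D_{45}) \]
we can solve the least squares equation (obtained by taking
derivatives of the quadratic form Q):
\[ X^T D = X^T X v \]
\[
X = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}
\]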
where the “design matrix” X has 1’s whenever a given
branch lies on the path for the given distance.
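The normal equations can be solved with any linear solver. The sketch below uses a small hand-written Gaussian elimination so that it needs no libraries. Several things in it are assumptions made for illustration: the column order of X (tip branches of A, B, C, D, E first, then the interior branch next to C and E, then the interior branch next to B and D), the assignment of the branch lengths in `v_true` to those columns, and all function names. The distances are generated from `v_true`, so the solve should simply return it.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares_branch_lengths(X, D):
    """Solve X^T D = X^T X v for the branch-length vector v."""
    cols = list(zip(*X))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xtd = [sum(p * d for p, d in zip(ci, D)) for ci in cols]
    return solve(xtx, xtd)

# Assumed design matrix: one row per distance, one column per branch.
X = [
    [1, 1, 0, 0, 0, 0, 1],  # D12 (A,B)
    [1, 0, 1, 0, 0, 1, 0],  # D13 (A,C)
    [1, 0, 0, 1, 0, 0, 1],  # D14 (A,D)
    [1, 0, 0, 0, 1, 1, 0],  # D15 (A,E)
    [0, 1, 1, 0, 0, 1, 1],  # D23 (B,C)
    [0, 1, 0, 1, 0, 0, 0],  # D24 (B,D)
    [0, 1, 0, 0, 1, 1, 1],  # D25 (B,E)
    [0, 0, 1, 1, 0, 1, 1],  # D34 (C,D)
    [0, 0, 1, 0, 1, 0, 0],  # D35 (C,E)
    [0, 0, 0, 1, 1, 1, 1],  # D45 (D,E)
]

# Assumed branch lengths, used only to generate error-free distances.
v_true = [0.08, 0.10, 0.05, 0.07, 0.06, 0.03, 0.05]
D = [sum(x * v for x, v in zip(row, v_true)) for row in X]

v_hat = least_squares_branch_lengths(X, D)
print([round(v, 4) for v in v_hat])
```

With observed distances that contain error, the same call returns the unweighted least squares branch lengths for this topology; nothing here constrains them to be nonnegative.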
The Jukes-Cantor model for DNA
[Figure: the four bases A, G, C, and T, with every base changing to each of the other three at rate u/3]
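The distance for the Jukes-Cantor model
[Figure: expected differences per site plotted against branch length. The curve rises from 0 toward an asymptote of 0.75; a branch length of 0.7945 corresponds to 0.49 differences per site.]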
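The curve in this figure, and its inverse, can be written down in a few lines. This is a sketch assuming branch length is measured in expected substitutions per site, so that the expected fraction of differing sites is 3/4 (1 - exp(-4t/3)); the function names are hypothetical.

```python
import math

def jc_expected_differences(branch_length):
    """Expected fraction of sites that differ after the given branch length."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

def jc_distance(fraction_different):
    """Inverse of the curve: branch length implied by a fraction of differences."""
    if not 0.0 <= fraction_different < 0.75:
        raise ValueError("fraction of differences must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * fraction_different / 3.0)

print(jc_expected_differences(0.7945))   # close to 0.49
print(jc_distance(0.49))                 # close to 0.7945
print(jc_expected_differences(50.0))     # approaches the asymptote 0.75
```

The inverse is undefined once the observed fraction reaches 0.75, which is why the sketch rejects such values rather than returning an infinite distance.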
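If you don’t correct for “multiple hits”
[Figure. Left: the true tree, with branches of length 0.20 leading to A and to B and a branch of length 0.00 leading to C. Right: a tree fitting the uncorrected distances, with branches of length 0.155 leading to A and to B and a branch of length 0.0206 leading to C.]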
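The numbers on this slide can be reproduced under the Jukes-Cantor model. The sketch assumes the true tree has branches of 0.20 to A and to B and 0.00 to C, takes the uncorrected distance to be the expected fraction of differing sites, and fits the three-species tree exactly; the function names are hypothetical.

```python
import math

def jc_expected_differences(t):
    """Expected fraction of differing sites at branch length t (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def three_species_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths of the three-species tree that fits three distances exactly."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return a, b, c

# True path lengths on the assumed tree: A-B = 0.40, A-C = 0.20, B-C = 0.20.
uncorrected = [jc_expected_differences(t) for t in (0.40, 0.20, 0.20)]
a, b, c = three_species_branch_lengths(*uncorrected)
print(round(a, 4), round(b, 4), round(c, 4))
```

Because the longest distance (A to B) is underestimated proportionally more than the shorter ones, the fit invents a positive branch leading to C.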
Approximate variances for distances
under the Jukes-Cantor model
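Distance as a function of the fraction of nucleotide differences is
\[ \hat{t} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right) \]
The “delta method” approximates the variance of one as a
function of the variance of the other:
\[ \mathrm{Var}(\hat{t}) \simeq \left(\frac{\partial \hat{t}}{\partial D}\right)^2 \mathrm{Var}(D) \]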
Approximate variances, continued
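The variance of the fraction of nucleotide differences with n sites is the
binomial variance
\[ \mathrm{Var}(D) = D(1 - D)/n \]
and since
\[ \frac{\partial \hat{t}}{\partial D} = \frac{1}{1 - \frac{4}{3} D} \]
we get
\[ \mathrm{Var}(\hat{t}) \simeq \frac{D(1 - D)/n}{\left(1 - \frac{4}{3} D\right)^2} \]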
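The final formula is easy to evaluate numerically. In this sketch D is the observed fraction of differing sites and n the number of sites, as above; the function name is hypothetical.

```python
import math

def jc_distance_variance(D, n):
    """Delta-method variance of the Jukes-Cantor distance estimate."""
    if not 0.0 <= D < 0.75:
        raise ValueError("fraction of differences must lie in [0, 0.75)")
    return (D * (1.0 - D) / n) / (1.0 - 4.0 * D / 3.0) ** 2

# Standard deviations for 100 sites at a moderate and a large divergence.
for D in (0.3, 0.6):
    print(D, math.sqrt(jc_distance_variance(D, 100)))
```

The denominator shrinks toward zero as D approaches 0.75, so the approximate standard deviation grows without bound for very divergent sequences.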
Standard deviation of distance
as it increases with distance (given the JC model)
[Figure: standard deviation (vertical axis, 0 to 5) plotted against Jukes-Cantor distance (horizontal axis, 0 to 4), with two curves labeled “What Fitch-Margoliash assumes” and “JC model predicts”]
The UPGMA algorithm
1. Choose the smallest of the $D_{ij}$.
2. Make a new “tip” (ij).
3. Have i and j connected to this new tip, by a node whose “time” ago
in branch length units is $D_{ij}/2$.
4. Have the weight of the new tip be $w_{(ij)} = w_i + w_j$.
5. For each other tip k, aside from i and j, compute
\[ D_{(ij),k} = D_{k,(ij)} = \frac{w_i D_{ik} + w_j D_{jk}}{w_i + w_j} \]
6. Delete the rows and columns of the D matrix for i and j.
7. If only one row is left, stop; else return to step 1.
This can be done in $O(n^2)$ time if you save the minimum elements of each
row.
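The steps above can be sketched as follows, applied to Sarich's (1969) immunological distances from this lecture. This is a naive illustration that searches all pairs at every step, so it does not achieve the faster running time mentioned above; the function name and the labeling of clusters as nested parentheses are choices made for this example.

```python
from itertools import combinations

def upgma(names, matrix):
    """Cluster by UPGMA; return a list of (new cluster label, node height)."""
    labels = list(names)
    weight = {name: 1 for name in labels}      # w_i: number of tips in cluster i
    d = {}
    for a, row in zip(names, matrix):
        for b, value in zip(names, row):
            if a != b:
                d[a, b] = float(value)
    merges = []
    while len(labels) > 1:
        # Step 1: smallest remaining distance (naive search over all pairs).
        i, j = min(combinations(labels, 2), key=lambda pair: d[pair])
        new = f"({i},{j})"                      # Step 2: the new "tip"
        merges.append((new, d[i, j] / 2))       # Step 3: node height D_ij / 2
        for k in labels:                        # Step 5: weighted average
            if k not in (i, j):
                d[new, k] = d[k, new] = (
                    weight[i] * d[i, k] + weight[j] * d[j, k]
                ) / (weight[i] + weight[j])
        weight[new] = weight[i] + weight[j]     # Step 4
        labels = [k for k in labels if k not in (i, j)] + [new]   # Step 6
    return merges

names = ["dog", "bear", "raccoon", "weasel", "seal", "sea lion", "cat", "monkey"]
sarich = [
    [0, 32, 48, 51, 50, 48, 98, 148],
    [32, 0, 26, 34, 29, 33, 84, 136],
    [48, 26, 0, 42, 44, 44, 92, 152],
    [51, 34, 42, 0, 44, 38, 86, 142],
    [50, 29, 44, 44, 0, 24, 89, 142],
    [48, 33, 44, 38, 24, 0, 90, 142],
    [98, 84, 92, 86, 89, 90, 0, 148],
    [148, 136, 152, 142, 142, 142, 148, 0],
]

merges = upgma(names, sarich)
for label, height in merges:
    print(f"{height:9.4f}  {label}")
```

The first join is seal with sea lion at height 12, and the last is monkey with everything else at height 1010/14, about 72.14.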
Sarich’s (1969) immunological distances
            dog  bear  raccoon  weasel  seal  sea lion  cat  monkey
dog           0    32       48      51    50        48   98     148
bear         32     0       26      34    29        33   84     136
raccoon      48    26        0      42    44        44   92     152
weasel       51    34       42       0    44        38   86     142
seal         50    29       44      44     0        24   89     142
sea lion     48    33       44      38    24         0   90     142
cat          98    84       92      86    89        90    0     148
monkey      148   136      152     142   142       142  148       0
Sarich’s (1969) immunological distances
with the rows and columns corresponding to the smallest distance
highlighted and a box around the smallest distance.
[Table: the same distance matrix as on the previous slide. The smallest distance is 24, between seal and sea lion.]
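UPGMA tree for Sarich (1969) data
[Figure: UPGMA tree with tips dog, bear, raccoon, seal, sea lion, weasel, cat, monkey. In Newick form, with the branch lengths shown in the figure:
(((((((seal:12,sea lion:12):6.75,(bear:13,raccoon:13):5.75):1,weasel:19.75):3.15,dog:22.9):22.0166,cat:44.9166):27.22619,monkey:72.1428)]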
UPGMA misleads on a nonclocklike tree
True tree (in Newick form, with branch lengths):
((A:13,B:4):2,(C:4,D:10):2)

Distance matrix:
     A    B    C    D
A    0   17   21   27
B   17    0   12   18
C   21   12    0   14
D   27   18   14    0

UPGMA tree (in Newick form, with branch lengths):
(((B:6,C:6):2,D:8):2.833,A:10.833)

A nonclocklike tree, the distances from it, and the UPGMA
tree from those distances.
The distortion of the tree is due to “short-branch attraction”, in which B
and C, close to each other in the true tree, cluster first.
Neighbor-joining algorithm
1. For each tip, compute
\[ u_i = \sum_{j \ne i} D_{ij} / (n - 2) \]
2. Choose the i and j for which $D_{ij} - u_i - u_j$ is smallest.
3. Join items i and j. Compute the branch length from i to the new
node ($v_i$) and from j to the new node ($v_j$) as
\[ v_i = \frac{1}{2} D_{ij} + \frac{1}{2} (u_i - u_j) \]
\[ v_j = \frac{1}{2} D_{ij} + \frac{1}{2} (u_j - u_i) \]
4. Compute the distance between the new node (ij) and each other
tip k as
\[ D_{(ij),k} = (D_{ik} + D_{jk} - D_{ij}) / 2 \]
5. Delete tips i and j from the tables and replace them by the new
node, (ij), which is now treated as a tip.
6. If more than two nodes remain, go back to step 1. Otherwise
connect the two remaining nodes by a branch of length $D_{ij}$.
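The six steps can be sketched as follows. The example runs them on the four-species distance matrix from the nonclocklike UPGMA example earlier in this lecture; since those distances fit a tree exactly, the branch lengths that come out should be those of the true tree. The function name, the edge-list output, and the labeling of new nodes as nested parentheses are choices made for this illustration.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Return the tree as a list of (node, neighbor, branch length) edges."""
    labels = list(names)
    d = {}
    for a, row in zip(names, matrix):
        for b, value in zip(names, row):
            if a != b:
                d[a, b] = float(value)
    edges = []
    while len(labels) > 2:
        n = len(labels)
        # Step 1: u_i = sum of distances from i, divided by n - 2.
        u = {i: sum(d[i, j] for j in labels if j != i) / (n - 2) for i in labels}
        # Step 2: pair minimizing D_ij - u_i - u_j (first pair wins ties).
        i, j = min(combinations(labels, 2),
                   key=lambda p: d[p] - u[p[0]] - u[p[1]])
        new = f"({i},{j})"
        # Step 3: branch lengths from i and j to the new node.
        edges.append((i, new, 0.5 * d[i, j] + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * d[i, j] + 0.5 * (u[j] - u[i])))
        # Step 4: distances from the new node to every other tip.
        for k in labels:
            if k not in (i, j):
                d[new, k] = d[k, new] = (d[i, k] + d[j, k] - d[i, j]) / 2
        # Step 5: replace i and j by the new node.
        labels = [k for k in labels if k not in (i, j)] + [new]
    # Step 6: connect the last two nodes.
    a, b = labels
    edges.append((a, b, d[a, b]))
    return edges

names = ["A", "B", "C", "D"]
matrix = [
    [0, 17, 21, 27],
    [17, 0, 12, 18],
    [21, 12, 0, 14],
    [27, 18, 14, 0],
]
edges = neighbor_joining(names, matrix)
for node, neighbor, length in edges:
    print(node, neighbor, length)
```

On this matrix the sketch gives branches of 13, 4, 4, and 10 to A, B, C, and D, and an interior branch of 4, which is the true tree that UPGMA failed to recover.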
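Star decomposition search
[Figure: a star tree in which tips i and j are joined to a new node (ij) by branches of length v_i and v_j, while the remaining tips k stay attached to the star]
“Star decomposition” tree search method used in the Neighbor-Joining
method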
NJ tree for Sarich’s (1969) data
[Figure: unrooted tree with tips monkey, cat, weasel, sea lion, seal, bear, dog, raccoon. Branch lengths shown in the figure: 25.25, 6.875, 3.4375, 12.35, 11.65, 19.125, 1.75, 7.8125, 19.5625, 1.5625, 47.0833, 20.4375, 100.9166]
Neighbor-joining tree for the Sarich (1969) immunological distance data
References, page 1
Bryant, D., and P. Waddell. 1998. Rapid evaluation of least-squares and
minimum-evolution criteria on phylogenetic trees. Molecular Biology
and Evolution 15: 1346-1359. [quicker least squares distance trees]
Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a
likelihood-based approach to distance-based phylogeny reconstruction.
Molecular Biology and Evolution 17: 189-197. [A weighted version of NJ
which de-weights large distances appropriately]
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and
estimation procedures. Evolution 21: 550-570 (also published in
American Journal of Human Genetics 19: 233-257, 1967). [One of the first
least squares distance methods]
Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23 in Advances
in Cladistics. Proceedings of the first meeting of the Willi Hennig Society.,
ed. V. A. Funk and D. R. Brooks. New York Botanical Garden, Bronx.
[Criticism of distance methods]
Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85. [Reply to my 1984
paper]
Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157. [debate was
cut off after this]
References, page 2
Felsenstein, J. 1984. Distance methods for inferring phylogenies: a
justification. Evolution 38: 16-24. [Argument for statistical
interpretation of distance methods]
Felsenstein, J. 1986. Distance methods: reply to Farris. Cladistics 2: 130-143.
[reply to Farris 1985]
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland,
Massachusetts. [See chapter 11]
Fitch, W. M. and E. Margoliash. 1967. Construction of phylogenetic trees.
Science 155: 279-284. [One of the first least squares distance methods]
Rohlf, F. J. 1962. A numerical taxonomic study of the genus Aedes (Diptera:
Culicidae) with emphasis on the congruence of larval and adult
classifications. Ph.D. thesis, Department of Entomology, University of
Kansas. [UPGMA – one of two introductions of it]
References, page 3
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution 4:
406-425. [Neighbor-joining]
Sneath, P. H. A. 1962. The construction of taxonomic groups, pp. 289-332 in
Microbial Classification, eds. G. C. Ainsworth and P. H. A. Sneath.
Cambridge University Press, Cambridge. [UPGMA – one of two
introductions of it]
How it was done
This projection was produced
using the prosper style in LaTeX,
using LaTeX to make a .dvi file,
using dvips to turn this into a PostScript file,
using ps2pdf to make it into a PDF file, and
displaying the slides in Adobe Acrobat Reader.
Result: nice slides using freeware.