Castillo-Morales, A. (1973). Drawing an optimal tree from a distance matrix.

DRAWING AN OPTIMAL TREE
FROM A DISTANCE MATRIX
by
ALBERTO CASTILLO-MORALES
Institute of Statistics
Mimeograph Series No. 872
June 1973
TABLE OF CONTENTS (cont'd.)
8. LIST OF REFERENCES
9. APPENDICES
9.1 A Summary of Graph Theory Definitions
9.2 Minimization Procedure of the Sequential Approach
9.3 The Possible Outcomes of a Distance Matrix of Order Five
9.4 Summary of Computer Storage and Time Requirements of Programs Developed... Used
1. INTRODUCTION
The problem of finding the minimum length tree corresponding to a
given distance matrix has been studied in biology by Camin and Sokal
(1965), Cavalli-Sforza and Edwards (1965, 1967), Edwards and
Cavalli-Sforza (1963, 1964), Farris (1970, 1972), Horne (1967), Kluge
and Farris (1969), and Wagner (1967). Very seldom is a real data
distance matrix tree realizable, i.e., usually there is no tree
exactly fitting its entries.
Hakimi and Yau (1965) showed that if a distance matrix is tree
realizable, then it has a unique tree realization and such a tree is
its minimum length realization. Thus, the minimum length tree
corresponding to a given distance matrix D is given by the tree
realizable distance matrix H which estimates D. In this work, least
squares (Cavalli-Sforza and Edwards, 1967) is used as estimation
criterion.
The four points solution determines, without enumeration, the least
squares (optimal) tree. If $n \ge 5$, enumeration of unweighted trees
is necessary. Farris' (1972) method of adding a point at a time, and
least squares, leads to a sequential method for solving the general
case.
The fact that given an unweighted tree (i.e., its form) it is
possible to find its least squares weighted version, leads to the
consideration of a new approach for estimating the unweighted tree.
Simoes-Pereira (1966) gives the relation between a tree and its set
of $\binom{n}{4}$ four point subtrees. To avoid enumeration, a
criterion for estimating
2. REVIEW OF LITERATURE
2.1 Motivation of the Problem
Edwards and Cavalli-Sforza (1963) suggested the estimation of an
evolutionary tree by using "... the tree which invokes the minimum
net amount of evolution ..." Assuming a random process in time,
Edwards and Cavalli-Sforza (1964) used the maximum likelihood method
to estimate evolutionary trees. Their method implies the comparison
of the maximum likelihood of each tree form. Later (Cavalli-Sforza
and Edwards, 1965), they found that their method gives a solution if
and only if the likelihood surface is regular. Camin and Sokal (1965)
used the concept of "minimum number of evolutionary steps." They
worked under the assumption of discrete characters with known
direction of evolutionary trends.
Cavalli-Sforza and Edwards (1967) compared the maximum likelihood
method with the minimum evolution method and a new method: the
additive tree method. The additive tree method "... assumes that
distances along the tree are additive, thus implying independence of
evolution in all branches." The solution of this method is a
restricted least squares solution. Independently, Horne (1967)
suggested the least squares estimation method for an additive and
minimum evolutionary tree. She worked with discrete characters. In
both cases, either total enumeration or a preliminary estimate of the
unweighted tree (i.e., its form) is needed.
Wagner (1967) used the "ground plan/divergence" method to measure the
amount of evolutionary change using a graph giving the minimum amount
of evolutionary change. His work was generalized and stated in a
mathematical framework by Farris (1970), using the concept of minimum
length tree. Farris gave a method to compute minimum length trees
based on the use of Manhattan distances and the original "coordinate
values" of each individual. His method is similar to the "elementary
reducing cycle" (Hakimi and Yau, 1965). Farris (1972) also worked out
the case where the coordinate (character) values of each individual
are unknown. In both cases it is necessary to have an estimate of the
unweighted tree, and in the latter case an iterative optimization
procedure is necessary.
2.2 Review of graph theory literature
A summary of definitions used in this chapter can be
found in appendix 9.1.
Hakimi and Yau (1965) proved each of the following:
Theorem 2.2.1
Let D be a real symmetric matrix of order n. Then D is a (graph)
distance matrix if and only if the following three conditions hold:
1) $d_{ij} \ge 0$ for $i, j = 1, 2, \dots, n$
2) $d_{ii} = 0$ for $i = 1, 2, \dots, n$
3) $d_{ij} \le d_{ik} + d_{kj}$ for $i, j, k = 1, 2, \dots, n$,
where $d_{ij}$ is the $ij$th element of D.
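The three conditions of Theorem 2.2.1 can be checked mechanically. The following is a minimal sketch in Python, assuming D is given as a list of lists; the function name `is_distance_matrix` and the tolerance `tol` are illustrative choices, not part of the original text.

```python
from itertools import product


def is_distance_matrix(D, tol=1e-12):
    """Check symmetry and the three conditions of Theorem 2.2.1."""
    n = len(D)
    for i, j in product(range(n), repeat=2):
        if D[i][j] < -tol:                    # condition 1: d_ij >= 0
            return False
        if abs(D[i][j] - D[j][i]) > tol:      # D must be symmetric
            return False
    if any(abs(D[i][i]) > tol for i in range(n)):  # condition 2: d_ii = 0
        return False
    for i, j, k in product(range(n), repeat=3):
        if D[i][j] > D[i][k] + D[k][j] + tol:  # condition 3: triangle inequality
            return False
    return True
```

The triple loop costs on the order of n cubed operations, which is acceptable for the small matrices considered here.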
Theorem 2.2.2 (The elementary reducing cycle)
Let D be a distance matrix of order n, let S be any realization of D
(S may have more than n points) and let S have three lines $v_1v_2$,
$v_1v_3$, and $v_2v_3$ with weights $W_{12}$, $W_{13}$, and $W_{23}$,
respectively, such that
Then $S'$, a realization of D having smaller length than S, can be
found. To find $S'$ delete $v_1v_2$, $v_1v_3$, and $v_2v_3$ and add a
new point $v_0$ and the lines $v_1v_0$, $v_2v_0$, and $v_3v_0$,
defining $v_0$ by
$W_{01} = \tfrac{1}{2}(W_{12} + W_{13} - W_{23}),$
$W_{02} = \tfrac{1}{2}(W_{12} + W_{23} - W_{13}),$
and
$W_{03} = \tfrac{1}{2}(W_{13} + W_{23} - W_{12}).$
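As an illustration of the elementary reducing cycle, the sketch below computes the three new weights $W_{01} = \tfrac12(W_{12}+W_{13}-W_{23})$, $W_{02} = \tfrac12(W_{12}+W_{23}-W_{13})$, $W_{03} = \tfrac12(W_{13}+W_{23}-W_{12})$ from the weights of the triangle being removed. The function name `reduce_cycle` is hypothetical.

```python
def reduce_cycle(w12, w13, w23):
    """Weights of the lines v1v0, v2v0, v3v0 that replace the triangle v1v2v3."""
    w01 = 0.5 * (w12 + w13 - w23)
    w02 = 0.5 * (w12 + w23 - w13)
    w03 = 0.5 * (w13 + w23 - w12)
    return w01, w02, w03


# Example: a triangle with weights 3, 4, 5 becomes a star with weights 1, 2, 3.
# Pairwise distances are preserved (1+2=3, 1+3=4, 2+3=5) and the total
# length drops from 12 to 6.
print(reduce_cycle(3, 4, 5))
```

In general the star has half the total length of the triangle it replaces, which is why the realization becomes shorter.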
Theorem 2.2.3
If D is realizable as a tree, T, then T is the only cycle-less
realization of D.
Remark:
Uniqueness does not consider the addition of internal points lying on
two lines, i.e., given $p_i$, $p_j$ and $p_ip_j$ in the tree, add
$p_0$, $p_ip_0$, and $p_jp_0$, deleting $p_ip_j$. It would be
trivially done without changing the tree.
Theorem 2.2.4
If D has a tree realization T, then T is the optimum realization of D.
Simoes-Pereira (1966, 1969) stated and proved the
following two theorems.
Theorem 2.2.5
Let D be a distance matrix of order 4. Then D has a tree realization
if and only if for some permutation $p_1, p_2, p_3, p_4$ of the
external points $v_1, v_2, v_3, v_4$, the following system of six
linear equations in the five unknowns $y_1, y_2, y_3, y_4$, and $y_5$
and the distances, $d_{p_ip_j}$, between the external points
$y_1 + y_2 = d_{p_1p_2}$  (2.1)
$y_3 + y_4 = d_{p_3p_4}$  (2.2)
$y_5 + y_1 + y_3 = d_{p_1p_3}$  (2.3)
$y_5 + y_1 + y_4 = d_{p_1p_4}$  (2.4)
$y_5 + y_2 + y_3 = d_{p_2p_3}$  (2.5)
$y_5 + y_2 + y_4 = d_{p_2p_4}$  (2.6)
has a solution with $y_i \ge 0$; $i = 1, 2, 3, 4, 5$, where at most
two $y_i$ are zero, but (to avoid triviality) neither
$y_1 = y_2 = 0$, nor $y_3 = y_4 = 0$ are allowed.
Remarks:
1. The restrictions on the y's are necessary to avoid degeneration to
a three points case.
2. The possible trees are shown in Figure 2.1.
[Figure 2.1. Possible four point trees]
Theorem 2.2.6
Let D be a distance matrix of order n.
Then D has a
tree realization if and only if each of its 4x4 diagonal
submatrices has a tree realization.
Remark:
Let $A = \{i_1, i_2, i_3, i_4\} \subset \{1, 2, \dots, n\}$, then the
submatrix of D formed by the 4x4 elements whose subscripts are the
ordered pairs of members of A is a 4x4 diagonal submatrix of D.
3. THE FOUR POINTS CASE
3.1 Results for D Being Tree Realizable
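Theorem 2.2.5 gives a difficult way for checking if a distance matrix
of order 4 is tree realizable. An easy algorithm is given by the
following corollary.
Corollary 3.1.1
Let $D = ((d_{ij}))$ be a distance matrix of order 4. Consider
a) $d_{12} + d_{34} \le d_{13} + d_{24} = d_{14} + d_{23}$,
b) $d_{13} + d_{24} \le d_{12} + d_{34} = d_{14} + d_{23}$,
and
c) $d_{14} + d_{23} \le d_{12} + d_{34} = d_{13} + d_{24}$.
Then D has a tree realization if and only if either exactly one, or
all three, of a, b, and c hold.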
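Proof:
Assume that D has a tree realization, i.e., Theorem 2.2.5 holds. It
follows that the 6 equations of Theorem 2.2.5 have a solution for
$y_i \ge 0$; $i = 1, 2, 3, 4, 5$. Adding (2.1) and (2.2), (2.3) and
(2.6), and (2.4) and (2.5), we obtain
$d_{p_1p_2} + d_{p_3p_4} = y_1 + y_2 + y_3 + y_4,$
$d_{p_1p_3} + d_{p_2p_4} = y_1 + y_2 + y_3 + y_4 + 2y_5,$
and
$d_{p_1p_4} + d_{p_2p_3} = y_1 + y_2 + y_3 + y_4 + 2y_5.$
Therefore
$d_{p_1p_2} + d_{p_3p_4} \le d_{p_1p_3} + d_{p_2p_4} = d_{p_1p_4} + d_{p_2p_3}.$
To complete this part of the proof, note that there are 24
permutations of 4 points, but they reduce to the three possibilities
of Corollary 3.1.1. Also, note that if $y_5 > 0$ one of a, b, or c
holds, and if $y_5 = 0$ all three a, b, and c hold.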
Conversely, suppose that either a, b, or c of Corollary 3.1.1 holds.
Let $\{p_1, p_2, p_3, p_4\} = \{1, 2, 3, 4\}$ and suppose
$d_{p_1p_2} + d_{p_3p_4} \le d_{p_1p_3} + d_{p_2p_4} = d_{p_1p_4} + d_{p_2p_3}.$
Consider
$y_{p_1} = \tfrac12(d_{p_1p_2} + d_{p_1p_3} - d_{p_2p_3}) = \tfrac12(d_{p_1p_2} + d_{p_1p_4} - d_{p_2p_4})$  (3.1)
$y_{p_2} = \tfrac12(d_{p_1p_2} + d_{p_2p_3} - d_{p_1p_3}) = \tfrac12(d_{p_1p_2} + d_{p_2p_4} - d_{p_1p_4})$  (3.2)
$y_{p_3} = \tfrac12(d_{p_1p_3} + d_{p_3p_4} - d_{p_1p_4}) = \tfrac12(d_{p_2p_3} + d_{p_3p_4} - d_{p_2p_4})$  (3.3)
$y_{p_4} = \tfrac12(d_{p_1p_4} + d_{p_3p_4} - d_{p_1p_3}) = \tfrac12(d_{p_2p_4} + d_{p_3p_4} - d_{p_2p_3})$  (3.4)
and
$y_5 = \tfrac12(d_{p_1p_3} + d_{p_2p_4} - d_{p_1p_2} - d_{p_3p_4}).$  (3.5)
Since D is a distance matrix, $y_{p_i} \ge 0$; $i = 1, 2, 3$, and 4.
By hypothesis $y_5 \ge 0$.
Note that equations (2.1), (2.2), ..., (2.6) of Theorem 2.2.5 can be
obtained by adding the corresponding y's defined by (3.1), (3.2),
..., (3.5). Checking conditions of Theorem 2.2.5, note that if three
or more y's are zero, or $y_{p_1} = y_{p_2} = 0$, or
$y_{p_3} = y_{p_4} = 0$, a degenerate three (or less) points case is
obtained; otherwise Theorem 2.2.5 holds.
To finish, note that if $y_5 > 0$ only one of a, b, or c holds, but
if $y_5 = 0$, all three of a, b, and c hold.
Thus, if D is a distance matrix of order 4, Corollary 3.1.1 can be
used to check if D is tree realizable. If D is tree realizable, then
the elementary reducing cycle can be used to obtain its tree
realization. Alternatively, Corollary 3.1.1 gives the unweighted tree
(the form) of D, and its proof gives computing equations to obtain
its weights. This is summarized in the following result.
Result 3.1.2
Let D be a tree realizable distance matrix of order 4 with
$d_{p_1p_2} + d_{p_3p_4} \le d_{p_1p_3} + d_{p_2p_4} = d_{p_1p_4} + d_{p_2p_3}$
for some $\{p_1, p_2, p_3, p_4\} = \{1, 2, 3, 4\}$. Then the tree
realization of D is

    p1              p3
      \ y1      y3 /
       >---y5----<
      / y2      y4 \
    p2              p4

where the y's are given by (3.1), (3.2), ..., (3.5).
Identify the above tree by saying that $p_1$ and $p_2$, and also
$p_3$ and $p_4$, are together in the tree, and denote it by
$p_1{:}p_2$ and $p_3{:}p_4$.
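The check of Corollary 3.1.1 and the weights of Result 3.1.2 can be sketched in a few lines. The code below is only an illustration under stated assumptions: points are indexed 0 to 3, D is a list of lists that already is a distance matrix, the name `four_point_tree` and the tolerance `tol` are hypothetical, and the degenerate three point cases excluded by Theorem 2.2.5 are not tested for. The pairing with the smallest sum is taken as $p_1{:}p_2$ and $p_3{:}p_4$, and the weights follow (3.1) to (3.5).

```python
def four_point_tree(D, tol=1e-9):
    """Return (pairing, weights) if the 4x4 matrix D is tree realizable, else None."""
    # The three ways of splitting {0, 1, 2, 3} into two pairs.
    splits = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    sums = [D[a][b] + D[c][d] for (a, b), (c, d) in splits]
    lo, mid, hi = sorted(range(3), key=lambda s: sums[s])
    # Tree realizable only if the two largest sums coincide.
    if abs(sums[mid] - sums[hi]) > tol:
        return None
    (p1, p2), (p3, p4) = splits[lo]
    y1 = 0.5 * (D[p1][p2] + D[p1][p3] - D[p2][p3])                # (3.1)
    y2 = 0.5 * (D[p1][p2] + D[p2][p3] - D[p1][p3])                # (3.2)
    y3 = 0.5 * (D[p1][p3] + D[p3][p4] - D[p1][p4])                # (3.3)
    y4 = 0.5 * (D[p1][p4] + D[p3][p4] - D[p1][p3])                # (3.4)
    y5 = 0.5 * (D[p1][p3] + D[p2][p4] - D[p1][p2] - D[p3][p4])    # (3.5)
    return splits[lo], {p1: y1, p2: y2, p3: y3, p4: y4, "internal": y5}
```

For a matrix generated by a tree with external weights 1, 2, 3, 4 and internal weight 5, the function recovers those weights.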
Remark:
To simplify notation, a drawing of the form >-< with labeled end
points denotes a four points unweighted tree, with the understanding
that some of the lines may have weight zero and must be deleted. If
$y_5 = 0$, choose, arbitrarily, any of the tree with 1:2 and 3:4, the
tree with 1:3 and 2:4, or the tree with 1:4 and 2:3 to denote the
unweighted tree.
3.2 Estimation Criterion
If it is known that the distance matrix D has a tree realization,
then that tree is the minimum length realization of D and it is
unique (Theorems 2.2.3 and 2.2.4). Most likely, a practical problem
will start with a distance matrix which is not tree realizable. In
this case, it is necessary to have a criterion to choose a tree
realizable matrix which best estimates the given one.
Cavalli-Sforza and Edwards (1967), and Horne (1967) suggested the use
of the least squares estimation method to find the "... tree giving
the smaller distortion ..." Camin and Sokal (1965) and Farris (1970)
suggested the minimum length tree.
Knowing that the minimum length tree is the unique tree realization
of a tree realizable matrix, it remains to solve the problem of
choosing the tree realizable matrix to estimate a given distance
matrix D. Such a problem will be solved by using the least squares
criterion, and the resulting tree will be called an optimal tree
realization for D.
3.3 Estimation Method
Consider a distance matrix $D = (d_{ij})$ of order 4 which is not
tree realizable. The problem is to find a tree realizable distance
matrix $H = (h_{ij})$ of order 4 such that
$\sum_{i=1}^{4} \sum_{j=1}^{4} (d_{ij} - h_{ij})^2$ is a minimum.
Note that this is
either one or all three of a, b, and c hold
a) $k_1'h \le k_2'h = k_3'h$
b) $k_2'h \le k_1'h = k_3'h$
c) $k_3'h \le k_1'h = k_2'h$
and
2. $(h - d)'(h - d)$ is a minimum.
To find $h$ such that $k_2'h = k_3'h$ and $(h - d)'(h - d)$ is a
minimum the usual least squares procedure can be followed. Let
$\phi = (h - d)'(h - d) - 2\lambda (k_3 - k_2)'h$
and
$\partial\phi / \partial h = -2d + 2h - 2\lambda (k_3 - k_2) = 0.$
It follows that
$h = d - [(k_3 - k_2)'(k_3 - k_2)]^{-1} (k_3 - k_2)(k_3 - k_2)'d,$
i.e.,
$h = d - \tfrac14 (C_3 - C_2)(k_3 - k_2).$  (3.6)
and
$(h - d)'(h - d) = \tfrac14 (C_3 - C_2)^2 = Q.$
For any $E$ with $(k_3 - k_2)'E = 0$,
$(h + E - d)'(h + E - d) = Q + E'E + 2\{-\tfrac14 (C_3 - C_2)(k_3 - k_2)'\}E$
$= Q + E'E \ge Q.$
To finish, it is necessary to prove the following:
A. $k_1'h \le k_2'h$,
B. $h$ is a vector defining a distance matrix, H, and
C. $h$ gives the minimum sum of squares over the possible choices
1.a, 1.b, and 1.c.
Proof of A:
Note that $k_1'h = k_1'd = C_1$, then
$k_2'h = C_2 + \tfrac12 (C_3 - C_2) \ge k_2'd \ge k_1'd = k_1'h$
is clear from $(k_3 - k_2)$, having +1 in the places
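$h_{14} = d_{14} - \tfrac14 d_{14} - \tfrac14 d_{23} + \tfrac14 d_{13} + \tfrac14 d_{24}$
$= \tfrac34 d_{14} + \tfrac14 d_{13} + \tfrac14 d_{24} - \tfrac14 d_{23}$
$= \tfrac12 d_{14} + \tfrac14 (d_{24} + d_{14} + d_{13} - d_{23})$
$\ge \tfrac12 d_{14} + \tfrac14 (d_{12} + d_{13} - d_{23})$
$\ge \tfrac12 d_{14} \ge 0.$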
To show that $h_{ij} \le h_{ik} + h_{kj}$, it is necessary to work
each case separately.
b) Using $d_{13} \le d_{14} + d_{23} - d_{24}$, it follows that
1.
2.
c) Using $d_{24} \le d_{14} + d_{23} - d_{13}$, it follows that
1.
Proof of C:
To show that $h$, the solution for 1.a, gives the minimum sum of
squares $\tfrac14 (C_3 - C_2)^2$ over the possible choices 1.a, 1.b
and 1.c, first solve for 1.b following the above procedure to obtain
$h^*$, where:
$h^*_{12} = d_{12} + a'$,
$h^*_{13} = d_{13}$,
$h^*_{14} = d_{14} - a'$,
$h^*_{23} = d_{23} - a'$,
$h^*_{24} = d_{24}$,
$h^*_{34} = d_{34} + a'$,
and $a' = \tfrac14 (d_{14} + d_{23} - d_{12} - d_{34})$.
It has a sum of squares equal to $\tfrac14 (C_3 - C_1)^2$, which is
larger than $\tfrac14 (C_3 - C_2)^2$ unless
$d_{12} + d_{34} = d_{13} + d_{24}$. If
$d_{12} + d_{34} = d_{13} + d_{24}$, both solutions 1.a and 1.b have
the same sum of squares.
The proof of $k_2'h^* \le k_1'h^*$ works as the proof of A for $h$.
Similarly, the proof that $h^*$ defines a distance matrix works as
the proof of B for $h$.
To show that a solution $h^{**}$ for 1.c has a sum of squares larger
than the sum of squares obtained for 1.a, note that the constraints
given by 1.c,
i.e.,
are at least as strong as the constraints in
and these constraints are stronger than the constraints
given by either 1.a or 1.b.
The above can be summarized in the following result.
Result 3.3.1:
Let D be a distance matrix of order 4 which is not
tree realizable.
Then the tree realizable distance
matrix H, defined by the elements of (3.6), gives a least
squares estimate for D.
The tree realization of H is
called an optimal tree for D.
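The least squares estimate can be sketched numerically. The code below assumes, as in the derivation of this section, that the three pair sums are ordered $C_1 \le C_2 \le C_3$ and applies the form of (3.6), $h = d - \tfrac14 (C_3 - C_2)(k_3 - k_2)$: the two distances in the largest sum are decreased by $a = \tfrac14 (C_3 - C_2)$ and the two in the middle sum are increased by $a$. The function name `least_squares_four_point`, 0-based indexing, and the list-of-lists representation are assumptions made for illustration; the returned sum of squares is taken over the six pairs $i < j$.

```python
def least_squares_four_point(D):
    """Return (H, ss): a tree realizable estimate of the 4x4 matrix D and its sum of squares."""
    splits = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    sums = [D[a][b] + D[c][d] for (a, b), (c, d) in splits]
    lo, mid, hi = sorted(range(3), key=lambda s: sums[s])
    a = 0.25 * (sums[hi] - sums[mid])
    H = [row[:] for row in D]          # work on a copy, leave D untouched
    for i, j in splits[hi]:            # shrink the two distances of the largest sum
        H[i][j] -= a
        H[j][i] -= a
    for i, j in splits[mid]:           # grow the two distances of the middle sum
        H[i][j] += a
        H[j][i] += a
    return H, 4 * a * a                # four entries moved by a each


D = [[0, 3, 9, 12], [3, 0, 10, 11], [9, 10, 0, 7], [12, 11, 7, 0]]
H, ss = least_squares_four_point(D)
print(H, ss)
```

Ties between the two smallest sums are resolved here by position; the text treats such cases as separate solutions.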
Remark on Uniqueness:
a) there is a unique solution.
b) there are two different least squares solutions:
and
c) Both solutions $d^*$ and $d^{**}$ of b have equal length.
Proof:
Add equations 2.1, 2.2, ..., 2.6 for $d^*$ to obtain
Doing the same for $d^{**}$ yields
But (Equation 3.5 for $d^*$)
and similarly, Equation 3.5 for $d^{**}$
+
Y5'" = l(d
2 2 1,J..4
•
d*i~
2
.
2
2 3
d~H~
d**.2
l1 3
yields
. ) = Y5'
2 4
d~~*
2
2
At this point, it is convenient to introduce a concept which will be
used later.
Definition 3.3.2
Let D be a distance matrix of order 4, let
$\{i_1, i_2, i_3, i_4\} = \{1, 2, 3, 4\}$, and let
(3.7)
then (3.7) defines the optimal tree having $i_1$ and $i_2$ and also
$i_3$ and $i_4$ together; it is summarized by saying that D has the
optimal tree $i_1{:}i_2 - i_3{:}i_4$.
4. THE GENERAL CASE
Theorem 2.2.6 gives a set of necessary and sufficient conditions for
a distance matrix to be tree realizable. This and Result 3.3.1
suggest that a distance matrix can be estimated by simultaneously
solving the set of its $\binom{n}{4}$ four point subproblems.
Unfortunately, the resulting tree realizable distance matrix is not
the desired least squares solution. This is due to the fact that,
most likely, the resulting tree has very few internal points, giving
one (or several) radial arrangement(s) of lines.
A better solution can be obtained by using the same procedure, but
considering, for some four point subproblems, a solution different
from the optimal, i.e., changing the pairs of points which are
together. The choice of the subproblems to be changed to a
non-optimal solution is not easy. To try to avoid this problem, a
sequential approach has been developed.
4.1 A Sequential Approach
Let D be a distance matrix of order n, let
$A_1, A_2, \dots, A_{\binom{n}{4}}$ be the $\binom{n}{4}$ four point
sets out of $\{1, 2, \dots, n\}$, and let
$D_1, D_2, \dots, D_{\binom{n}{4}}$ be the corresponding order 4
diagonal submatrices of D. Compute the sum of squares for fitting the
optimal tree to each one of $D_1, D_2, \dots, D_{\binom{n}{4}}$, and
choose the one having the smallest sum of squares, say $D_i$, with
its estimate $G_i$. If there are two or more $D_j$'s having the same
smallest sum of squares, consider each one as a case and work each
case separately. If some $D_j$ has two or three optimal trees, work
with each one of them as a separate case. With this in mind, proceed
as having only one case.
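The choice of the starting four point set can be sketched as follows. The sketch relies on the four points result of Section 3.3, where the least squares fit of an order 4 matrix has sum of squares $\tfrac14 (C_3 - C_2)^2$ with $C_1 \le C_2 \le C_3$ the three pair sums. The function names, 0-based indices, and the tie rule (the first minimum found is returned, rather than working each tie as a separate case) are simplifying assumptions.

```python
from itertools import combinations


def four_point_ss(D, quad):
    """Sum of squares of the optimal four point tree for the points in quad."""
    i, j, k, l = quad
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return 0.25 * (sums[2] - sums[1]) ** 2


def best_starting_quadruple(D):
    """Four point set whose optimal tree has the smallest sum of squares."""
    n = len(D)
    return min(combinations(range(n), 4), key=lambda q: four_point_ss(D, q))
```

A full implementation of the sequential approach would also have to carry along every tied case, as the text requires.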
Clearly, the process starts with the least squares four points tree,
say $T_i = S_4$ with points $\{v_1, v_2, v_3, v_4\} = A_i$ and
distance matrix $G_i$. Given $S_4$, for each
$u_k \in \{1, 2, \dots, n\} - A_i$ find ${}_{S_4}SS_{u_k}$, the sum of
squares for adding $u_k$ to $S_4$. Choose $v_5$ such that
${}_{S_4}SS_{v_5} = \min_{u_k \in \{1, 2, \dots, n\} - A_i} {}_{S_4}SS_{u_k}.$
New point $v_n$. Attach to $T_{n-1}$ a line of weight $W$ to the
point $v_n$, to obtain an $n$ points tree $T_n$. The line to $v_n$
can be attached to any point $p$ of any line $B$ of $T_{n-1}$. In
accord with the least squares criterion, find $g_{v_iv_n}$;
$i = 1, \dots, n-1$; such that
${}_{T_{n-1}}SS_{v_n} = \min_{p \in B \subset T_{n-1},\ W \ge 0} \sum_{i=1}^{n-1} (g_{v_iv_n} - d_{v_iv_n})^2,$
where $g_{v_iv_n}$ is the distance between the points $v_i$ and $v_n$
in the tree $T_n$ given by $p$, $B$, and $W$, and
${}_{T_{n-1}}SS_{v_n}$ is the sum of squares for adding $v_n$ to
$T_{n-1}$. The minimization procedure is simple but has difficult
algebraic expression and is given in Appendix 9.2.
This is a fast method to obtain a weighted tree. If the process leads
to more than one case, say k cases, the results are k weighted trees.
The kth step of the above procedure yields a k points weighted tree
$S_k$ with distance matrix $G_k$. Let $D_k$ be the corresponding
submatrix of D. Consider the unweighted tree Y defined by $S_k$, then
it is possible to find the weighted tree $\bar{S}_k$ defined by Y and
having least squares fit with respect to $D_k$. The above sequential
least squares method, as well as any method based on adding a point
at a time, can be improved by using the latter idea. This is the
subject of the next chapter.
4.2 The Weights of the Tree Problem
This problem was discussed by Cavalli-Sforza and Edwards (1967) and
Horne (1967). Given a distance matrix D of order n and an unweighted
n points tree S, the object is to find the weighted tree T with form
(unweighted tree) S having least squares distance matrix G (with
elements $g_{ij}$) with respect to D.
Let the weights of T be given by $W_1, W_2, \dots, W_m$. Note that S
known allows us to write
$g_{ij} = \sum_{k \in P_{ij}} W_k$ for $\{i, j\} \subset \{1, 2, \dots, n\}$,
where $P_{ij}$ (the set of lines on the path joining $i$ and $j$) is
defined by inspection of S. Using this, find $W_k$,
$k = 1, 2, \dots, m$; such that
$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} ( d_{ij} - \sum_{k \in P_{ij}} W_k )^2 = \min, \quad W_k \ge 0,\ k = 1, 2, \dots, m.$
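Note that
$Q = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} ( d_{ij} - \sum_{k \in P_{ij}} W_k )^2$
is a convex function, i.e.,
$\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} ( d_{ij} - \sum_{k \in P_{ij}} \tfrac{W_k + U_k}{2} )^2 \le \tfrac12 \{ \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} ( d_{ij} - \sum_{k \in P_{ij}} W_k )^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} ( d_{ij} - \sum_{k \in P_{ij}} U_k )^2 \}$
for any $W_1, W_2, \dots, W_m$ and $U_1, U_2, \dots, U_m$. Indeed, for
each pair $i < j \subset \{1, 2, \dots, n\}$ the identity
$\tfrac12 \{ ( d_{ij} - \sum_{k \in P_{ij}} W_k )^2 + ( d_{ij} - \sum_{k \in P_{ij}} U_k )^2 \} = \{ d_{ij} - \sum_{k \in P_{ij}} \tfrac{W_k + U_k}{2} \}^2 + \tfrac14 \{ ( d_{ij} - \sum_{k \in P_{ij}} W_k ) - ( d_{ij} - \sum_{k \in P_{ij}} U_k ) \}^2$
holds; dropping the nonnegative last term gives
$\{ d_{ij} - \sum_{k \in P_{ij}} \tfrac{W_k + U_k}{2} \}^2 \le \tfrac12 \{ ( d_{ij} - \sum_{k \in P_{ij}} W_k )^2 + ( d_{ij} - \sum_{k \in P_{ij}} U_k )^2 \}.$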
Minimizing a convex function (Beale, 1955) subject to
linear constraints is accomplished when finding a local
minimum, and such minimum can be found by a modification
of the simplex method (Beale, 1955).
Actually, computations were done by iteratively solving
$\partial Q / \partial W_i = 0$; $i = 1, 2, \dots, m$; and
substituting either the resulting value for $W_i$ (if $W_i \ge 0$) or
zero.
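The iterative scheme just described can be sketched as a cyclic coordinate update. Setting $\partial Q / \partial W_k = 0$ with the other weights fixed gives $W_k$ as the mean, over the pairs whose path uses line $k$, of $d_{ij}$ minus the other weights on that path; a negative value is replaced by zero. The data layout (dictionaries keyed by pairs), the function name `fit_weights`, and the fixed number of sweeps are assumptions of this sketch, not details given in the text.

```python
def fit_weights(d, paths, m, sweeps=500):
    """Least squares line weights for a fixed unweighted tree.

    d:     {(i, j): observed distance}
    paths: {(i, j): set of line indices 0..m-1 on the path from i to j}
    """
    W = [0.0] * m
    for _ in range(sweeps):
        for k in range(m):
            pairs = [p for p in paths if k in paths[p]]
            if not pairs:
                continue
            # Residual of each pair when line k is left out of its path.
            resid = [d[p] - sum(W[l] for l in paths[p] if l != k) for p in pairs]
            # Solve dQ/dW_k = 0, then substitute zero if the result is negative.
            W[k] = max(0.0, sum(resid) / len(pairs))
    return W
```

For example, for the four point tree with lines 0 to 3 ending at points 0 to 3 and line 4 joining the two internal points, `paths` maps (0, 1) to {0, 1}, (0, 2) to {0, 4, 2}, and so on. A convergence test on successive sweeps would be a natural refinement of the fixed sweep count.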
4.3 The Unweighted Tree Problem
The least squares tree realization of a distance matrix can be found,
theoretically, by enumerating the unweighted trees, finding the least
squares version of each one, and choosing that with the smallest sum
of squares. Unfortunately, the number of unweighted trees is so large
(for n > 10) that the exact solution is impossible in practice.
In this section, the basic four points solution will be used to find
a criterion to estimate the unweighted tree. First, some preliminary
definitions and results are given.
Notation 4.3.1
Let D be a tree realizable distance matrix of order n with unweighted
tree T. Let $A \subset \{1, 2, \dots, n\}$ and let $T_1$ be the
subtree of T for A, then call $T_1$ the subtree of T for D containing
A, or more generally, when D and T are understood, a subtree
containing A.
Definition 4.3.2
Let $A_1, A_2 \subset \{1, 2, \dots, n\}$, let $T_1$ and $T_2$ be two
subtrees containing $A_1$ and $A_2$, respectively, and let $T_1'$ and
$T_2'$ be the subtrees of $T_1$ and $T_2$ containing $A_1 \cap A_2$.
Then $T_1$ and $T_2$ are said to be compatible if $T_1' = T_2'$;
otherwise, $T_1$ and $T_2$ are said to be incompatible.
Remark:
$T_1$ and $T_2$ are compatible if and only if there exists $T_3$
containing some $A_3$ with the property that
$A_1 \cup A_2 \subset A_3$ which is compatible with both $T_1$ and
$T_2$.
Proof:
First assume that $T_1$ and $T_2$ are compatible, then there exists
$T_1 \cup T_2 = T_3$ which is compatible with both $T_1$ and $T_2$.
Now assume that $T_3$ is a subtree of T containing
$A_3 \supset A_1 \cup A_2$ such that $T_3$ is compatible with both
$T_1$ and $T_2$. Note that $A_1 = A_3 \cap A_1$ and
$A_2 = A_3 \cap A_2$. By hypothesis, the subtrees $S_3$ of $T_3$ and
$S_1$ of $T_1$ for $A_1$ are equal, as are the subtrees $S_4$ of
$T_3$ and $S_2$ of $T_2$ for $A_2$. Since $A_1 \cap A_2 \subset A_1$
and $A_1 \cap A_2 \subset A_2$, the subtrees $S_5$ of $T_1$ and $S_6$
of $T_2$ containing $A_1 \cap A_2$ are equal, therefore $T_1$ and
$T_2$ are compatible.
Definition 4.3.3
Let $A_i \subset \{1, 2, \dots, n\}$; $i = 1, \dots, k$; and let $T_i$
be a subtree containing $A_i$; $i = 1, \dots, k$. Then
$T_1, T_2, \dots, T_k$ are said to be compatible if there exists a
subtree $T_0$ containing $\bigcup_{i=1}^{k} A_i$ which is compatible
with each one of $T_1, T_2, \dots, T_k$; otherwise, they are said to
be incompatible.
Definition 4.3.4
Let $A_i \subset \{1, 2, \dots, n\}$; $i = 0, 1, 2, \dots, k$; let
$T_i$ be a subtree containing $A_i$; $i = 0, 1, 2, \dots, k$; and let
$\bigcup_{i=1}^{k} A_i \subset A_0$. Then the number of
incompatibilities of $T_1, T_2, \dots, T_k$ with respect to $T_0$ is
the number of $T_i$'s being incompatible with $T_0$.
Remarks:
1. If D is a tree realizable distance matrix of order n, then its set
of $\binom{n}{4}$ four point optimal subtrees are compatible.
2. Suppose the set of $\binom{n}{4}$ four point subtrees of a
distance matrix D of order n are compatible, then
a) They determine an unweighted tree for D.
b) It does not follow that D is tree realizable.
4.4 The Number of Incompatibilities Criterion
Given D, a distance matrix of order n, the problem is to find a tree
T with distance matrix G whose elements are least squares estimates
for those of D. A searching method is necessary to avoid the
practically impossible work of total enumeration. The relation
between a tree and its set of four point subtrees, together with the
solution given to the four points case, suggest the use of the
$\binom{n}{4}$ four point solutions to solve an n points case.
If D is tree realizable, its $\binom{n}{4}$ four point trees are
compatible, and each four points set containing i and j gives the
same solution $d^*_{ij}$ ($= d_{ij}$) for the tree distance between i
and j. If D is not tree realizable, there is at least one pair i, j
which gives different solutions for different four point sets.
Moreover, the set of $\binom{n}{4}$ four point trees may be
incompatible (Definition 4.3.3).
To find an unweighted tree T whose least squares version is close to
the optimal realization of D, the number of incompatibilities among
the $\binom{n}{4}$ four points trees can be used as a criterion of
closeness of fit for the form of the tree. If
$T_1, T_2, \dots, T_{\binom{n}{4}}$ are the unweighted four point
optimal trees of D, then the number of incompatibilities of
$T_1, T_2, \dots, T_{\binom{n}{4}}$ with respect to T is the number
of unweighted four point optimal trees of D that must be changed to
obtain the compatible set defined by T. In accord with the four
points solution, the unweighted tree giving the minimum number of
incompatibilities with the data set of $\binom{n}{4}$ four point
optimal trees can be chosen.
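For a given candidate tree the count itself is easy to sketch. The code below assumes the candidate tree T is supplied through a tree realizable distance matrix G (for instance, T with unit weights), that every four point subtree of T has a single smallest pair sum, and that ties in D are resolved by taking the first smallest sum; these conventions and the function names are illustrative only. A four point optimal tree of D is counted as incompatible when the pairs that are together differ from those of T.

```python
from itertools import combinations


def optimal_split(D, quad):
    """Index (0, 1 or 2) of the pairing with the smallest sum for the points in quad."""
    i, j, k, l = quad
    sums = [D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]]
    return sums.index(min(sums))


def count_incompatibilities(D, G):
    """Number of four point optimal trees of D that differ from those of the tree behind G."""
    n = len(D)
    return sum(optimal_split(D, q) != optimal_split(G, q)
               for q in combinations(range(n), 4))
```

Minimizing this count over candidate trees still requires a search, which is the difficulty taken up in the remainder of the section.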
The definition of number of incompatibilities does not give an easy
way for finding the tree with minimum number of incompatibilities; in
fact, it suggests total enumeration.
The set $S_1 = \{T_{i_1}, T_{i_2}, \dots, T_{i_k}\}$, with maximum
number, k, of compatible four point optimal trees, defines a tree
with $m \le n$ points, where m may be very small with respect to n,
and/or k may be very small with respect to $\binom{n}{4}$. Finding
$S_1$ is easy for small n, but very difficult for large n
($n \ge 10$), where the number of possible combinations of $T_i$'s is
very large.
A summarization of the information contained in the set
$T_1, T_2, \dots, T_{\binom{n}{4}}$ of four point optimal trees is
given by the number of times that each pair
$\{i, j\} \subset \{1, 2, \dots, n\}$ appears together in
$T_1, T_2, \dots, T_{\binom{n}{4}}$. This concept will be defined and
developed to give a method for constructing unweighted trees.
Definition 4.4.1
Let D be a distance matrix of order $n \ge 4$, let
$A_1, A_2, \dots, A_{\binom{n}{4}}$ be the four point sets of
$\{1, 2, \dots, n\}$, and let $T_1, T_2, \dots, T_{\binom{n}{4}}$ be
the optimal trees for $A_1, A_2, \dots, A_{\binom{n}{4}}$,
respectively. For each pair $\{i, j\} \subset \{1, 2, \dots, n\}$
define its frequency, denoted by $F_{ij} = F_{ji}$, as the number of
four point optimal trees in which i and j are together.
Note:
If there are two or more optimal trees for one or more A's, define
the frequencies for each possibility and work with each separately.
With this in mind, the following discussion refers to one case.
Remark:
If D is of order 2 or 3 define its frequencies as zero.
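The frequencies of Definition 4.4.1 can be tabulated directly. The sketch below assumes 0-based indices and a list-of-lists matrix, takes the pairing with the smallest sum as the one whose pairs are together in the four point optimal tree, and, unlike the Note above, keeps only the first pairing when there is a tie. The function name `pair_frequencies` is hypothetical.

```python
from itertools import combinations


def pair_frequencies(D):
    """F[i][j] = number of four point optimal trees in which i and j are together."""
    n = len(D)
    F = [[0] * n for _ in range(n)]
    for i, j, k, l in combinations(range(n), 4):
        splits = [((i, j), (k, l)), ((i, k), (j, l)), ((i, l), (j, k))]
        best = min(splits, key=lambda s: D[s[0][0]][s[0][1]] + D[s[1][0]][s[1][1]])
        for a, b in best:
            F[a][b] += 1
            F[b][a] += 1
    return F
```

For matrices of order 2 or 3 the loop never runs, so all frequencies are zero, in agreement with the Remark.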
Fact 4.4.2
Let $n \ge 4$, and let D be a tree realizable distance matrix of
order n, then there are at least two disjoint pairs with frequency
$\binom{n-2}{2}$.
Proof:
Let n = 4, then by definition we have two disjoint pairs with
frequency $1 = \binom{n-2}{2}$.
Suppose that all tree realizable distance matrices of order n-1 have
at least two disjoint pairs $i_1, i_2$ and $i_3, i_4$ with frequency
$\binom{n-1-2}{2}$.
Let D be a tree realizable distance matrix of order n, and let G be
the n-1 order diagonal submatrix of D obtained by deleting its nth
row and column. G is a tree realizable distance matrix of order n-1.
By hypothesis, there are two pairs $i_1, i_2$ and $i_3, i_4$ having
frequency $\binom{n-2-1}{2}$. This implies that having chosen either
the pair $\{i_1, i_2\}$ or $\{i_3, i_4\}$, then any other pair
$\{i_k, i_\ell\}$ together with the chosen pair will form a tree of
the type $i_1{:}i_2 - i_k{:}i_\ell$ (or $i_3{:}i_4 - i_k{:}i_\ell$),
since there are exactly $\binom{n-3}{2}$ such pairs
$\{i_k, i_\ell\}$, i.e.,
$\{i_5, i_6\} \subset \{1, 2, \dots, n-1\} - \{i_1, i_2\}$ implies
$i_1{:}i_2 - i_5{:}i_6$, and
$\{i_5, i_6\} \subset \{1, 2, \dots, n-1\} - \{i_3, i_4\}$ implies
$i_3{:}i_4 - i_5{:}i_6$.
The following part of the proof for $i_1, i_2$ also applies to
$i_3, i_4$.
Let $i \in \{1, 2, \dots, n-1\} - \{i_1, i_2\}$ and consider the
optimal tree for $i_1$, $i_2$, $i$ and $n$.
a) If $i_1$ and $i_2$ are together in the optimal tree for $i_1$,
$i_2$, $i$ and $n$, i.e.,
(4.1)
then $j \in \{1, 2, \dots, n-1\} - \{i_1, i_2, i\}$ implies that
$i_1{:}i_2 - j{:}n$. Indeed, from Equation 4.1 obtain
and use
i.e.,
(4.2)
to obtain
Therefore
Since $i_1$, $i_2$, $j$, and $n$ form a tree, then
i.e., $i_1{:}i_2 - j{:}n$ for all
$j \in \{1, 2, \dots, n-1\} - \{i_1, i_2\}$.
Note that there are $\binom{n-3}{2} + n - 3 = \binom{n-2}{2}$ four
point trees containing $i_1$ and $i_2$, therefore
$F_{i_1 i_2} = \binom{n-2}{2}.$
b) If $i_1$ and $i_2$ are not together in the optimal tree for $i_1$,
$i_2$, $i$ and $n$ suppose $i_1{:}i - i_2{:}n$ (if
$i_1{:}n - i_2{:}i$, the proof is the same permuting $i$ and $n$),
then $i{:}j - i_2{:}n$ for all
$\{i, j\} \subset \{1, 2, \dots, n-1\} - \{i_2\}$. Indeed, it is known
that $i_1{:}i - i_2{:}n$ and $i_1$ and $i_2$ are not together in the
optimal tree for $i_1$, $i_2$, $i$ and $n$, i.e.,
(4.3)
From $i_1{:}i_2 - i{:}j$ and Equation 4.2 obtain
therefore,
Since $i_1$, $i_2$, $j$ and $n$ form a tree, it follows that
(4.4)
i.e., $i_1{:}j - i_2{:}n$ for all
$j \in \{1, 2, \dots, n-1\} - \{i_1, i_2\}$, with n-3 possibilities.
To finish, it is necessary to show that $j_1{:}j_2 - i_2{:}n$ for all
$\{j_1, j_2\} \subset \{1, 2, \dots, n-1\} - \{i_2\}$. It is known
that $i_1{:}i_2 - j_1{:}j_2$ for all
$\{j_1, j_2\} \subset \{1, 2, \dots, n-1\} - \{i_1, i_2\}$, i.e.,
(4.5)
for all $\{j_1, j_2\} \subset \{1, 2, \dots, n-1\} - \{i_1, i_2\}$,
and Equation 4.4 for $j_1$
(4.6)
From equations 4.5 and 4.6 obtain
(4.7)
and
(4.8)
respectively. Adding Equations 4.7 and 4.8 yields
Since $j_1$, $j_2$, $i_2$ and $n$ form a tree, it follows that
i.e., $j_1{:}j_2 - i_2{:}n$ for all
$\{j_1, j_2\} \subset \{1, 2, \dots, n-1\} - \{i_2\}$.
As above, note that there are $\binom{n-2}{2}$ four point trees
containing $i_2$ and $n$, therefore
$F_{i_2 n} = \binom{n-2}{2}.$
Definition 4.4.3
Let D be a tree realizable distance matrix with unweighted tree T and
let p be an interior point of a line of T. Construct a new tree $T^*$
by adding p to T. The subtrees obtained by breaking $T^*$ at p will
be called limbs of T (or D) with end point p. For each one of the
obtained limbs, the line in which p is lying will be called the end
line.
Remarks:
1. If L is a limb of D containing $i_1, i_2, \dots, i_k$, and D is
understood, then L will be named as the limb containing
$i_1, i_2, \dots, i_k$. Note that the end point
$p \notin \{i_1, i_2, \dots, i_k\}$.
2. Let D be a tree realizable distance matrix of order n with
frequencies $F_{ij}$; $\{i, j\} \subset \{1, 2, \dots, n\}$, and let L
be a limb of D containing the points $i_1, i_2, \dots, i_k$, then L
has the frequencies $F_{ij}$;
$\{i, j\} \subset \{i_1, i_2, \dots, i_k\}$.
3. Let L be a limb containing the points $i_1, i_2, \dots, i_k$, then
for each n such that
$\{i_1, i_2, \dots, i_k\} \subset \{1, 2, \dots, n\}$, L defines a set
of n points trees.
Example:
From the five point tree drawn with 1 and 2 at one end, 3 in the
middle, and 4 and 5 at the other end, obtain the limb L containing 1,
2 and 3, with end point a. For n = 6, L defines the set
$\{T_1, T_2, T_3\}$ of six point trees [tree drawings with end points
$a_1$, $a_2$, $a_3$]. Note that L can be obtained from $T_i$ by
breaking at $a_i$; i = 1, 2, 3.
Result 4.4.4
Let D be a tree realizable distance matrix of order n. Let $A_1$ with
$k_1$ elements and $A_2$ with $k_2$ elements be such that
$A_1, A_2 \subset \{1, 2, \dots, n\}$ and
$A_1 \cap A_2 = \emptyset$, and let $L_1$ and $L_2$ be two limbs of D
with end points $b_1$ and $b_2$ and frequencies $F_{ij}$;
$\{i, j\} \subset A_1$; and $F_{ij}$; $\{i, j\} \subset A_2$;
respectively. For each $i \in A_j$ compute the frequency $g_{ib_j}$
with respect to the tree $T_j$ for $A_j \cup b_j$; j = 1, 2. Define
$\binom{0}{2} = \binom{1}{2} = 0$ and let
$F^*_{ij} = \binom{n - k_1 - k_2}{2} + g_{ib_1} + g_{jb_2}$ if $i \in A_1$ and $j \in A_2$,
$F^*_{ij} = F_{ij}$ if either $\{i, j\} \subset A_1$ or $\{i, j\} \subset A_2$;
then $L_3$, obtained by joining the end points of $L_1$ and $L_2$ and
attaching the end line $e_3$ to the joining point, is a limb with
frequencies given by $F^*_{ij}$; $\{i, j\} \subset A_1 \cup A_2$.
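The cross-limb part of Result 4.4.4 is a one-line computation. The sketch below evaluates $F^*_{ij} = \binom{n - k_1 - k_2}{2} + g_{ib_1} + g_{jb_2}$ for $i \in A_1$, $j \in A_2$, taking the two $g$ values as already computed inputs; the function name and argument names are hypothetical. Python's `math.comb` returns 0 when the upper argument is 0 or 1, which matches the convention $\binom{0}{2} = \binom{1}{2} = 0$.

```python
from math import comb


def joined_frequency(n, k1, k2, g_i_b1, g_j_b2):
    """Frequency F*_ij of i in A1 and j in A2 after joining limbs L1 and L2."""
    return comb(n - k1 - k2, 2) + g_i_b1 + g_j_b2


# Joining two one point limbs in a problem of order 5: the g values are zero
# because the trees for A_j and b_j have fewer than four points.
print(joined_frequency(5, 1, 1, 0, 0))
```

The g values are zero whenever $A_j \cup b_j$ has fewer than four points, since frequencies of order 2 or 3 are defined as zero.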
Proof:
First, construct an arbitrary limb L by letting
$C = \{h_1, \dots, h_k\} = \{1, 2, \dots, n\} - (A_1 \cup A_2)$,
taking an arbitrary tree T, and considering the limb L for C obtained
by breaking T at c, where
$A_1 \cup A_2 = \{i_1, i_2, \dots, i_{k_1 + k_2}\}$. Note that both T
and L are arbitrary and have no relation with D.
Join $L_3$ with the above constructed limb L to obtain a tree $T_3$
for 1, 2, ..., n, then $L_3$ is a limb of $T_3$ containing the set
$A_1 \cup A_2$. As above, note that $T_3$ is used only to show that
$L_3$ is a limb, but $T_3$ does not satisfy D.
To show that $L_3$ has frequencies $F^*$ note that if either
$\{i, j\} \subset A_1$ or $\{i, j\} \subset A_2$, its frequency is
known and it is defined by the limb containing $\{i, j\}$, then
$F^*_{ij} = F_{ij}$.
If $i \in A_1$ and $j \in A_2$, they are together for each one of the
$\binom{n - k_1 - k_2}{2}$ possible pairs in L, and they are also
together for each pair for which i and $b_1$ are together in $T_1$
and j and $b_2$ are together in $T_2$, then
$F^*_{ij} = \binom{n - k_1 - k_2}{2} + g_{ib_1} + g_{jb_2}.$
Result 4.4.5
Let L be a limb of a tree realizable distance matrix D of order n,
let T be the tree realization of D, and let L contain the set
$\{i_1, i_2, \dots, i_k\}$, with end point a. Then,
1. L determines uniquely its frequencies and,
2. The frequencies of L determine uniquely L.
Proof:
1. Consider a limb $L_1$ with points $\dots, j_{h-1}, j_h$ and end
point b. Join L and $L_1$ to obtain a tree and count the frequencies
of L. They are unique and do not depend on the choice of $L_1$.
2. Given $F_{ij}$, $\{i, j\} \subset \{i_1, i_2, \dots, i_k\}$, obtain
L by using the following procedure.
A. Define the k one point limbs $L_j$; j = 1, ..., k; where
$L_j = i_j$; j = 1, ..., k.
B. For each pair $L_p$, $L_q$ of defined (known) limbs do the
following.
a) Join $L_p$ and $L_q$ to obtain $L_r$ and compute the frequencies
$F^*$ of $L_r$ in accord with Result 4.4.4.
b) If the computed frequencies $F^*$ of $L_r$ are equal to the given
frequencies F, then modify the set of defined limbs by deleting $L_p$
and $L_q$, and adding $L_r$.
c) If the computed frequencies $F^*$ of $L_r$ are not equal to the
given frequencies F, do not modify the set of defined limbs.
C. Repeat step B until obtaining a k points limb.
It is necessary to prove that a $k$ points limb $L'$ will be obtained, and $L' = L$.

First, note that by construction, if a limb of $D$ has two points, then its frequency is $\binom{n-2}{2}$. Suppose the contrary, i.e., suppose we have a two points $i$, $j$ limb $L^*$ of $D$ with $F_{ij} < \binom{n-2}{2}$ and let $T$ be the tree realization of $D$. Then there would have to be a four points $\{i, j, k, h\}$ subtree $T_1$ of $T$ with (without loss of generality) $i{:}k - j{:}h$ [diagram of $T_1$ not reproduced], otherwise $F_{ij}$ would be $\binom{n-2}{2}$. This subtree cannot be broken to leave $i$ and $j$ alone in a limb. Thus, a two points limb containing $i$ and $j$ cannot be obtained.
Also, note that if $S$, a limb of an $n$ order tree realizable distance matrix $H$, has more than two points, then there is at least a pair of points $i$, $j$ in $S$ having frequency $\binom{n-2}{2}$. Indeed, consider the tree $T_2$ formed by $S$ (with more than two points) and its end point $a$. There are at least two disjoint pairs of points which are always together, thus there is at least one pair not containing the end point $a$ with frequency $\binom{n-2}{2}$ with respect to $H$.
To show that a $k$ points limb $L'$ will be obtained, and $L' = L$, note that "breaking" $L$ at the point of attachment of the end line, two limbs $L^*$ and $L^{**}$ with $k^*$ and $k^{**}$; $k^* + k^{**} = k$ points are obtained. Use finite induction as follows: A one point or two points limb is obtained uniquely by following the procedure of part 2 of the proof. Suppose the problem can be solved for $k-1$ (and so for $k^*$ and $k^{**}$) points. The procedure will lead to a step where $L^*$ and $L^{**}$ are defined. Join $L^*$ and $L^{**}$ to obtain $L'$, a $k$ points limb with frequencies $F^*$. Clearly, by definition of $L^*$ and $L^{**}$, $L' = L$, and $F_{ij} = F^*_{ij}$ for all $\{i,j\} \subset L$.
Result 4.4.6
Let $D$ be a tree realizable distance matrix of order $n$, then its frequencies $F_{ij}$; $\{i,j\} \subset \{1, 2, \ldots, n\}$ determine uniquely its unweighted tree.
Proof:
Same as in the second part of the preceding result, but finish when obtaining a limb for $\{1, 2, \ldots, n\}$ and deleting its end line.
4.5 The Sum of Absolute Deviations of Frequencies Criterion

Part 2 of the proof of 4.4.5 and Result 4.4.6 give a method to construct unweighted trees. The sum of absolute deviations of frequencies is used to relate this method to the number of incompatibilities criterion.
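Let $F_{ij}$; $\{i,j\} \subset \{1, 2, \ldots, n\}$; be the data frequencies, and let $F^*_{ij}$; $\{i,j\} \subset \{1, 2, \ldots, n\}$ be the frequencies of an $n$ points tree $T$.

Let $A_1, A_2, \ldots, A_{\binom{n}{4}}$ be the $\binom{n}{4}$ four point sets of $\{1, 2, \ldots, n\}$, let $t_1, t_2, \ldots, t_{\binom{n}{4}}$ be the data trees for $A_1, A_2, \ldots, A_{\binom{n}{4}}$, and let $T_1, T_2, \ldots, T_{\binom{n}{4}}$ be the subtrees of $T$ containing $A_1, A_2, \ldots, A_{\binom{n}{4}}$.

Suppose $t_h$ is different from $T_h$, then the set $\{t_1, t_2, \ldots, t_{h-1}, T_h, t_{h+1}, \ldots, t_{\binom{n}{4}}\}$ has the frequencies $g_{ij}$; $\{i,j\} \subset \{1, 2, \ldots, n\}$; given by

$$g_{ij} = \begin{cases} F_{ij} & \text{if } \{i,j\} \not\subset A_h, \\ F_{ij} + {}_h z_{ij} & \text{if } \{i,j\} \subset A_h, \end{cases}$$

where ${}_h z_{ij}$ is such that

$${}_h z_{ij} = \begin{cases} -1 & \text{if } i{:}j \text{ in } t_h \text{ and } i \nmid j \text{ in } T_h, \\ 0 & \text{if either } i{:}j \text{ or } i \nmid j \text{ in both } t_h \text{ and } T_h, \text{ and} \\ +1 & \text{if } i \nmid j \text{ in } t_h \text{ and } i{:}j \text{ in } T_h, \end{cases}$$

where $i \nmid j$ denotes "$i{:}j$ is not true".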
The relation between $F_{ij}$ and $F^*_{ij}$ is given by

$$F^*_{ij} = F_{ij} + \sum_{A_h \supset \{i,j\}} {}_h z_{ij}.$$

If for all $\{i,j\} \subset \{1, 2, \ldots, n\}$

$$\Bigl| \sum_{A_h \supset \{i,j\}} {}_h z_{ij} \Bigr| = \sum_{A_h \supset \{i,j\}} \bigl| {}_h z_{ij} \bigr|,$$

then the number of incompatibilities of the data tree forms with $T$ would be

$$\tfrac{1}{4}\,\mathrm{SADF}_T = \tfrac{1}{4} \sum_{\{i,j\} \subset \{1,2,\ldots,n\}} \bigl| F_{ij} - F^*_{ij} \bigr|,$$

where $\mathrm{SADF}_T$ means sum of absolute deviations of frequencies with respect to $T$. (There are four pairs of points involved in each incompatibility.)
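The definition above can be illustrated with a short Python sketch. It computes the sum of absolute deviations of frequencies from two frequency tables and the lower bound SADF/4 on the number of incompatibilities. The function names `sadf` and `incompatibility_lower_bound`, and the representation of frequencies as dictionaries keyed by pairs (i, j) with i < j, are assumptions made for this illustration and are not taken from the original programs. The sample values are the data frequencies obtained by counting the five four point optimal trees of Example 5.3, and the frequencies of the tree that joins P1:P2 and P4:P5 through P3.

```python
from math import ceil

def sadf(data_freq, tree_freq):
    """Sum of absolute deviations of frequencies over all pairs {i, j}."""
    return sum(abs(data_freq[pair] - tree_freq[pair]) for pair in data_freq)

def incompatibility_lower_bound(sadf_value):
    """Each incompatibility involves four pairs, so at least SADF/4 are needed."""
    return ceil(sadf_value / 4)

# Data frequencies counted from the four point optimal trees of Example 5.3.
F = {(1, 2): 2, (1, 3): 2, (1, 4): 0, (1, 5): 0, (2, 3): 1,
     (2, 4): 0, (2, 5): 1, (3, 4): 1, (3, 5): 0, (4, 5): 3}

# Frequencies of the five point tree with P1, P2 on one side and P4, P5 on the other.
F_star = {(1, 2): 3, (1, 3): 1, (1, 4): 0, (1, 5): 0, (2, 3): 1,
          (2, 4): 0, (2, 5): 0, (3, 4): 1, (3, 5): 1, (4, 5): 3}

total = sadf(F, F_star)
bound = incompatibility_lower_bound(total)
print(total, bound)
```

The bound is attained only when the equality condition stated above holds for every pair; otherwise the number of incompatibilities is larger.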
Result 4.5.1
If $\mathrm{SADF}_T = 0$, then the set of $\binom{n}{4}$ data four point optimal trees are compatible with $T$.
Proof:
If $n = 4$, the result is trivially true. Suppose $n \geq 5$ and $\mathrm{SADF}_T = 0$, then for each four points subtree $\mathrm{SADF} = 0$, i.e., $t_h = T_h$; $h = 1, 2, \ldots, \binom{n}{4}$; the set of $\binom{n}{4}$ data four point optimal trees is compatible with $T$. End of proof.
Most likely

$$\Bigl| \sum_{A_h \supset \{i,j\}} {}_h z_{ij} \Bigr| < \sum_{A_h \supset \{i,j\}} \bigl| {}_h z_{ij} \bigr|$$

for some pair $\{i,j\} \subset \{1, 2, \ldots, n\}$. In this case, the number of incompatibilities is larger than $\tfrac{1}{4}\,\mathrm{SADF}_T$.

Let $T$ be the tree with minimum SADF, and suppose there are $k$ incompatibilities with respect to $T$. Let $T^*$ have $\mathrm{SADF}^*$, and suppose there are $k^*$ incompatibilities with respect to $T^*$. If $4k < \mathrm{SADF}^*$, then $k < k^*$, because $k^* \geq \tfrac{1}{4}\,\mathrm{SADF}^*$, so that searching for the tree with minimum number of incompatibilities is restricted to those trees having $\mathrm{SADF}^* \leq 4k$. The problem seems to be within computational possibilities, but since it would result in greatly increased cost only the minimum SADF tree is considered.
4.5.2 Searching Method for the Minimum SADF Tree

Given $F_{ij}$; $\{i,j\} \subset \{1, 2, \ldots, n\}$,
1. Define the set of limbs $S = \{L_1 = 1, L_2 = 2, \ldots, L_n = n\}$ with partial SADF: PSADF = 0.
2. For each disjoint pair $L_p, L_q \in S$ do the following:
   a) Join them to obtain $L_r$ with computed frequencies $F^*$.
   b) Compute $\mathrm{PSADF}_{L_r} = \sum_{\{i,j\} \subset L_r} |F_{ij} - F^*_{ij}|$.
3. Amongst the set of limbs obtained in 2 choose the one (or ones) with minimum PSADF and include it (them) in the set $S$ of defined limbs.
4. Repeat 2 and 3 until obtaining an $n$ points limb. Delete its end line to obtain the tree with minimum SADF.
This searching method is restricted by computer storage capacity. A bound for the number $n_j$ of $j$ point limbs, $j = 1, 2, \ldots, n$, to be considered was necessary.

A fast approximation can be obtained if step three of 4.5.2 is modified to keep only a set of disjointly defined limbs, i.e., for each limb included in the set of defined limbs, delete the two limbs used in its construction.

Once the unweighted tree is found, the weights can be computed using 4.2.
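The fast approximation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used in this work: a limb is represented as a point (an integer) or a pair of limbs, limb frequencies are counted directly from the definition (a pair {k, l} counts for {i, j} when some line of the limb separates i, j from k, l) rather than through the recursion of Result 4.4.4, and ties in PSADF are broken by taking the first pair found. The names `points`, `line_sets`, `limb_frequencies`, `psadf` and `greedy_min_sadf` are hypothetical.

```python
from itertools import combinations

def points(limb):
    """Set of points in a limb; a limb is a point (int) or a pair of limbs."""
    if isinstance(limb, int):
        return frozenset([limb])
    return points(limb[0]) | points(limb[1])

def line_sets(limb):
    """Point sets cut off by each line of the limb, its end line included."""
    if isinstance(limb, int):
        return [points(limb)]
    return line_sets(limb[0]) + line_sets(limb[1]) + [points(limb)]

def limb_frequencies(limb, n):
    """F*_ij for every pair in the limb, counted over pairs of the other n - 2 points."""
    cuts = line_sets(limb)
    freq = {}
    for i, j in combinations(sorted(points(limb)), 2):
        others = [p for p in range(1, n + 1) if p not in (i, j)]
        freq[(i, j)] = sum(
            any((i in c and j in c and k not in c and l not in c) or
                (k in c and l in c and i not in c and j not in c) for c in cuts)
            for k, l in combinations(others, 2))
    return freq

def psadf(limb, data_freq, n):
    """Partial sum of absolute deviations of frequencies of a limb."""
    return sum(abs(data_freq[pair] - f)
               for pair, f in limb_frequencies(limb, n).items())

def greedy_min_sadf(data_freq, n):
    """Keep disjoint limbs only: join the pair with minimum PSADF until one limb is left."""
    limbs = list(range(1, n + 1))
    while len(limbs) > 1:
        best = min(combinations(limbs, 2), key=lambda pq: psadf(pq, data_freq, n))
        limbs = [x for x in limbs if x != best[0] and x != best[1]] + [best]
    return limbs[0], psadf(limbs[0], data_freq, n)

# Data frequencies of Example 5.3.
F = {(1, 2): 2, (1, 3): 2, (1, 4): 0, (1, 5): 0, (2, 3): 1,
     (2, 4): 0, (2, 5): 1, (3, 4): 1, (3, 5): 0, (4, 5): 3}
tree, total = greedy_min_sadf(F, 5)
print(tree, total)
```

Because only one limb is kept at each step, this sketch returns a single tree and may miss other trees with the same SADF; the full method 4.5.2 keeps every limb with minimum PSADF.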
5. EXAMPLES

5.1 Computational Details of the Four Points Case

This example is used to clarify computational details of the four points case.

$$D = \begin{pmatrix} 0 & 12 & 13 & 17 \\ 12 & 0 & 14 & 20 \\ 13 & 14 & 0 & 15 \\ 17 & 20 & 15 & 0 \end{pmatrix}$$

Note that $d_{ij}$ is the distance between $P_i$ and $P_j$; $i, j = 1, 2, 3, 4$.

$$27 = d_{12} + d_{34} < 31 = d_{14} + d_{23} < 33 = d_{13} + d_{24},$$

$$a = \tfrac{1}{4}(33 - 31) = 0.5,$$

therefore, the estimate $D^*$ of $D$ is

$$D^* = \begin{pmatrix} 0 & 12.0 & 12.5 & 17.5 \\ 12.0 & 0 & 14.5 & 19.5 \\ 12.5 & 14.5 & 0 & 15.0 \\ 17.5 & 19.5 & 15.0 & 0 \end{pmatrix}$$
To obtain the weights of the tree use Equations (3.1) through (3.5) as follows:

$$Y_{P_1} = \tfrac{1}{2}\bigl(d^*_{P_1P_2} + d^*_{P_1P_3} - d^*_{P_2P_3}\bigr) = \tfrac{1}{2}(12.0 + 12.5 - 14.5) = 5.0,$$

$$Y_{P_2} = \tfrac{1}{2}\bigl(d^*_{P_1P_2} + d^*_{P_2P_3} - d^*_{P_1P_3}\bigr) = \tfrac{1}{2}(12.0 + 14.5 - 12.5) = 7.0.$$

Note that $Y_{P_1} + Y_{P_2} = d^*_{12}$ gives a shorter, alternative computation of $Y_{P_2} = d^*_{12} - Y_{P_1} = 12.0 - 5.0 = 7.0$.

$$Y_{P_3} = \tfrac{1}{2}\bigl(d^*_{P_1P_3} + d^*_{P_3P_4} - d^*_{P_1P_4}\bigr) = \tfrac{1}{2}(12.5 + 15.0 - 17.5) = 5.0,$$

$$Y_{P_4} = \tfrac{1}{2}\bigl(d^*_{P_1P_4} + d^*_{P_3P_4} - d^*_{P_1P_3}\bigr) = \tfrac{1}{2}(17.5 + 15.0 - 12.5) = 10.0,$$

or

$$Y_{P_4} = d^*_{P_3P_4} - Y_{P_3} = 15.0 - 5.0 = 10.0,$$

$$Y_5 = \tfrac{1}{2}\bigl(d^*_{P_1P_3} + d^*_{P_2P_4} - d^*_{P_1P_2} - d^*_{P_3P_4}\bigr) = \tfrac{1}{2}(12.5 + 19.5 - 12.0 - 15.0) = 2.5.$$

The optimal tree is

[tree diagram not reproduced]

The sum of squares of deviations is $4a^2 = 1$ and the length of the tree is 29.5. Notice that the lines of the tree have been drawn with their length proportional to their weight.
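The computations of this example can be retraced with the following Python sketch. It is an illustration only, written under the assumption that all resulting weights are non-negative, so that the estimate is obtained by equalizing the two largest pair sums; it does not reproduce Equations (3.1) through (3.5) themselves, which are not shown in this section. The function names `four_point_fit` and `four_point_weights` are hypothetical, and points are numbered from 0 inside the code.

```python
def four_point_fit(d):
    """Return (split, a, estimate) for a 4 x 4 symmetric distance matrix d."""
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    # Rank the three pairings by their sum of distances; the smallest defines the tree.
    ranked = sorted(pairings, key=lambda pr: sum(d[i][j] for i, j in pr))
    split, middle, largest = ranked
    a = (sum(d[i][j] for i, j in largest) - sum(d[i][j] for i, j in middle)) / 4
    est = [[float(x) for x in row] for row in d]
    for i, j in middle:          # raise the middle pair sum by 2a
        est[i][j] += a
        est[j][i] += a
    for i, j in largest:         # lower the largest pair sum by 2a
        est[i][j] -= a
        est[j][i] -= a
    return split, a, est

def four_point_weights(est, split):
    """Weights of the four end lines and of the inner line for the tree i:j - k:l."""
    (i, j), (k, l) = split
    return {
        i: (est[i][j] + est[i][k] - est[j][k]) / 2,
        j: (est[i][j] + est[j][k] - est[i][k]) / 2,
        k: (est[i][k] + est[k][l] - est[i][l]) / 2,
        l: (est[i][l] + est[k][l] - est[i][k]) / 2,
        "inner": (est[i][k] + est[j][l] - est[i][j] - est[k][l]) / 2,
    }

D = [[0, 12, 13, 17], [12, 0, 14, 20], [13, 14, 0, 15], [17, 20, 15, 0]]
split, a, D_star = four_point_fit(D)
y = four_point_weights(D_star, split)
length = sum(y.values())
sum_of_squares = 4 * a ** 2
print(split, a, y, length, sum_of_squares)
```

When the two smallest pair sums are equal, `sorted` keeps the first pairing as the split; the other pairing then gives a second, equally good estimate.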
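5.2 A Four Points Example with Two Least Squares Estimates

This example is used to illustrate the possibility of obtaining two least squares estimates.

$$D = \begin{pmatrix} 0 & 2 & 3 & 5 \\ 2 & 0 & 3 & 3 \\ 3 & 3 & 0 & 4 \\ 5 & 3 & 4 & 0 \end{pmatrix}$$

$a = 0.5$ and $SS = 4 \times 0.5^2 = 1.0$.

$$D^*_1 = \begin{pmatrix} 0 & 2.0 & 3.5 & 4.5 \\ 2.0 & 0 & 2.5 & 3.5 \\ 3.5 & 2.5 & 0 & 4.0 \\ 4.5 & 3.5 & 4.0 & 0 \end{pmatrix}$$

and

[tree diagram not reproduced]

with length 6.5.

Solving for $P_1{:}P_3 - P_2{:}P_4$ yields

[second estimate and tree diagram not reproduced]

with length 6.5.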
5.3 Computational Details of a 5 Points Case

This example is used to give computational details.

$$D = \begin{pmatrix} 0 & 70 & 78 & 68 & 80 \\ 70 & 0 & 66 & 74 & 62 \\ 78 & 66 & 0 & 60 & 82 \\ 68 & 74 & 60 & 0 & 50 \\ 80 & 62 & 82 & 50 & 0 \end{pmatrix}$$

As before, $d_{ij}$ gives the distance between $P_i$ and $P_j$; $i, j = 1, 2, 3, 4, 5$.

The optimal unweighted four point trees are given by

[five tree diagrams not reproduced]

i.e., $P_1{:}P_2 - P_3{:}P_4$, $P_1{:}P_3 - P_2{:}P_5$, $P_1{:}P_2 - P_4{:}P_5$, $P_1{:}P_3 - P_4{:}P_5$, and $P_2{:}P_3 - P_4{:}P_5$.
a) The sequential approach.

To choose the optimal four points tree to which a point will be added, compute the sum of squares of deviations of each optimal four points tree:

Optimal tree                  Sum of squares of deviations
P1:P2 - P3:P4                 (1/4)(152 - 134)^2 = 81
P1:P3 - P2:P5                 (1/4)(152 - 146)^2 = 9
P1:P2 - P4:P5                 (1/4)(154 - 130)^2 = 144
P1:P3 - P4:P5                 (1/4)(150 - 140)^2 = 25
P2:P3 - P4:P5                 (1/4)(156 - 122)^2 = 289
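The table above can be recomputed with the following Python sketch, given as an illustration under the assumption that the sum of squares of each four point case is one quarter of the squared difference between its two largest pair sums, as in the entries of the table. The function name `optimal_four_point_trees` is hypothetical; points are numbered from 1 in the output.

```python
from itertools import combinations

def optimal_four_point_trees(d):
    """Map each four point subset to (optimal split, sum of squares of deviations)."""
    n = len(d)
    result = {}
    for a, b, c, e in combinations(range(n), 4):
        pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
        # Each entry is (pair sum, split written with points numbered from 1).
        sums = sorted((d[p][q] + d[r][s], ((p + 1, q + 1), (r + 1, s + 1)))
                      for (p, q), (r, s) in pairings)
        smallest, middle, largest = sums
        result[(a + 1, b + 1, c + 1, e + 1)] = (
            smallest[1], (largest[0] - middle[0]) ** 2 / 4)
    return result

D = [[0, 70, 78, 68, 80],
     [70, 0, 66, 74, 62],
     [78, 66, 0, 60, 82],
     [68, 74, 60, 0, 50],
     [80, 62, 82, 50, 0]]

trees = optimal_four_point_trees(D)
start = min(trees, key=lambda q: trees[q][1])
for quartet, (split, ss) in trees.items():
    print(quartet, split, ss)
print("start with", start)
```

The four point set with the smallest sum of squares, here {P1, P2, P3, P5}, is the one to which the remaining point is added.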
Choose $P_1{:}P_3 - P_2{:}P_5$ and compute its weights to obtain

[tree diagram not reproduced; one weight, 39.5, is legible]

Adding $P_4$ in accord with Appendix 9.2 yields

[tree diagram not reproduced]

with total sum of squares of deviations 594.2 and length 165.8.
b) Estimating the unweighted tree.

Count the data frequencies: $F_{12} = 2$, $F_{13} = 2$, $F_{14} = 0$, $F_{15} = 0$, $F_{23} = 1$, $F_{24} = 0$, $F_{25} = 1$, $F_{34} = 1$, $F_{35} = 0$ and $F_{45} = 3$.

There are two trees with minimum sums of absolute deviations of frequencies; they are $T_1$ and $T_2$ [tree diagrams not reproduced].

Each one has a sum of absolute deviations of frequencies of 4, and both $T_1$ and $T_2$ are found by using the searching method 4.5.2.
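Counting the data frequencies is a simple tally, sketched below in Python for illustration. Each four point optimal tree i:j - k:h adds one to the frequency of the pair {i, j} and one to the frequency of the pair {k, h}. The function name `data_frequencies` and the encoding of a tree as a pair of pairs are assumptions of this sketch; the five splits are the optimal four point trees listed at the start of this example.

```python
from itertools import combinations

def data_frequencies(splits, n):
    """Count, for every pair of points, the four point trees in which they are together."""
    freq = {pair: 0 for pair in combinations(range(1, n + 1), 2)}
    for split in splits:
        for pair in split:
            freq[tuple(sorted(pair))] += 1
    return freq

# Optimal four point trees of Example 5.3, written as ((i, j), (k, h)) for i:j - k:h.
splits = [((1, 2), (3, 4)), ((1, 3), (2, 5)), ((1, 2), (4, 5)),
          ((1, 3), (4, 5)), ((2, 3), (4, 5))]

F = data_frequencies(splits, 5)
for pair, count in F.items():
    print(pair, count)
```

Since every four point tree contributes to exactly two pairs, the frequencies always add up to twice the number of four point sets, here 10.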
The logic of the method is as follows:

Each one of the two point limbs joining $P_i$ and $P_j$; $\{i,j\} \subset \{1, 2, 3, 4, 5\}$; has a computed frequency $F^*_{ij} = 3$. So, the first time through step 2, the following partial sums of absolute deviations of frequencies (PSADF) are obtained, among others:

PSADF (joining $L_3$ and $L_4$) = $|F_{34} - F^*_{34}| = |1 - 3| = 2$,
PSADF (joining $L_3$ and $L_5$) = $|F_{35} - F^*_{35}| = |0 - 3| = 3$,
PSADF (joining $L_4$ and $L_5$) = $|F_{45} - F^*_{45}| = |3 - 3| = 0$.

Thus, the first time through step 3, the limb $L_6$ joining $P_4$ and $P_5$ is chosen and $S$ is changed to $S_1 = S \cup \{L_6\}$.

The second time through step 2 yields

PSADF (joining $L_1$ and $L_6$) = 2,
PSADF (joining $L_2$ and $L_6$) = 1, and
PSADF (joining $L_3$ and $L_6$) = 1,

where the limb $L$ obtained by joining $P_3$ and $L_6$ has frequencies $F^*_{P_4P_5} = 3$ and $F^*_{P_3P_4} = F^*_{P_3P_5} = 1$. These frequencies are computed using 4.4.4 for joining $P_3$ and $L_6$.

For better understanding, the above frequencies can be computed by considering a five points tree $T$ [diagram not reproduced; it contains $P_4$, $P_5$, $P_3$ and two further points $u$, $v$] from which $L$ can be obtained.

Following the searching method 4.5.2, the second time through step 3, $L_7$, $L_8$, $L_9$ and $L_{10}$ [diagrams not reproduced] are chosen and $S_1$ is changed to $S_2$.

The third time through steps 2 and 3, the PSADF (joining $L_i$ and $L_j$); $i = 1, \ldots, 9$, $j = i+1, \ldots, 10$ are computed and all PSADF are compared to choose the limbs with PSADF equal to 2 (Figure 5.1). The set of defined limbs, $S_2$, is changed to $S_3$ by including all limbs with PSADF = 2.

The fourth time through steps 2 and 3, the set of defined limbs, $S_3$, is changed to $S_4 = S_3 \cup \{L_{25}, L_{26}\}$, where $L_{25}$ and $L_{26}$ [diagrams not reproduced] have PSADF equal to 2.

The fifth time through steps 2 and 3, the limbs $L_{27}, L_{28}, \ldots, L_{34}$ (Figure 5.2) are included in the set of defined limbs. Each of $L_{27}, L_{28}, \ldots, L_{34}$ has PSADF = 3. The sixth time through steps 2 and 3 the set of defined limbs is changed by including more limbs with PSADF = 3 [diagrams not reproduced; the last of them is $L_{38}$].

Finally, the seventh time through steps 2 and 3, the limbs with PSADF = 4 are included in the set of defined limbs. Some limbs with PSADF = 4 have five points, and the results are obtained.
[Diagrams of limbs $L_{11}$ through $L_{24}$ not reproduced.]

Figure 5.1. Limbs with PSADF = 2 chosen the third time through steps 2 and 3 of method 4.5.2 in example 5.3.
Note that the set of 5 four point optimal trees is incompatible. There is one incompatibility with either of the solutions.

The computations to obtain the least squares version of $T_1$ and $T_2$ are done in accord with Section 4.2.
[Diagrams of limbs $L_{27}$ through $L_{36}$ not reproduced.]

Figure 5.2. Limbs with PSADF = 3 chosen the fifth time through steps 2 and 3 of method 4.5.2 in example 5.3.
The results are $T^*_1$ and $T^*_2$ [tree diagrams not reproduced], with sum of squares 317.3 and 319.1, and length 167.0 and 167.3 respectively.

Note that $T^*_1$ gives smaller sum of squares, and $T^*_2$ has a zero weight, yielding a version of $T_1$.
5.4 A Five Points Example with 5 Trees Having Minimum Sum of Absolute Deviations of Frequencies

This example shows an extreme case where there are five trees with the same minimum sum of absolute deviations of frequencies. It is very rare that this problem occurs for larger numbers of points.

Suppose $F_{12} = 2$, $F_{13} = 0$, $F_{14} = 2$, $F_{15} = 0$, $F_{23} = 0$, $F_{24} = 0$, $F_{25} = 2$, $F_{34} = 2$, $F_{35} = 2$ and $F_{45} = 0$, then the five trees ($T_1$, $T_2$ and three more; diagrams not reproduced) all have the same SADF = 8. They each also have the same number of incompatibilities.
Remarks:
a)
The frequencies of this example were obtained
from a distance matrix;
b)
Estimating the least squares version of each tree
allows the choosing of the tree with smaller sum
of squares of deviations.
5.5 A Compatible Five Points Case

The distance matrix for five of the six carnivores analyzed by Farris (1972) was chosen as an example of a compatible set of four point optimal trees.

$$D = \begin{pmatrix} 0 & 48 & 50 & 48 & 98 \\ 48 & 0 & 44 & 44 & 92 \\ 50 & 44 & 0 & 24 & 89 \\ 48 & 44 & 24 & 0 & 90 \\ 98 & 92 & 89 & 90 & 0 \end{pmatrix}$$

Farris' (1972) method yields

$$D^+ = \begin{pmatrix} 0 & 48 & 50 & 50 & 98 \\ 48 & 0 & 44 & 44 & 92 \\ 50 & 44 & 0 & 24 & 90 \\ 50 & 44 & 24 & 0 & 90 \\ 98 & 92 & 90 & 90 & 0 \end{pmatrix}$$

and $T^+$ [tree diagram not reproduced] with length = 152.

The sequential method (4.1) gives

[tree diagram not reproduced]

with sum of squares of deviations equal to 4.6 and length equal to 151.3.
The optimal estimate is given by

$$D^* = \begin{pmatrix} 0.0 & 48.0 & 49.3 & 49.0 & 97.6 \\ 48.0 & 0.0 & 44.0 & 43.6 & 92.3 \\ 49.3 & 44.0 & 0.0 & 24.0 & 89.6 \\ 49.0 & 43.6 & 24.0 & 0.0 & 89.3 \\ 97.6 & 92.3 & 89.6 & 89.3 & 0.0 \end{pmatrix}$$

and $T^*$ [tree diagram not reproduced] with sum of squares 2.6 and length 151.5. The unweighted tree was found using 4.5.2 and its weights were computed using 4.2.

Note that the four point optimal trees are compatible and $T^*$ is the least squares version of both $T^+$ and $T_1$.
5.6 Direct Solution to the Unweighted Tree Problem in the Five Points Case

The five points case is small enough to allow complete enumeration. This example is included to give some feeling of the concepts introduced in this dissertation.

Without regard to the possibility of zero weights, there are 15 different five point (labeled) trees. They can be found by rearranging the points of the tree $T$ in which $P_1$ and $P_2$ are together, $P_4$ and $P_5$ are together, and $P_3$ is attached between the two pairs [tree diagram not reproduced].

Note that the frequencies of $T$ are $F^*_{12} = F^*_{45} = 3$, $F^*_{13} = F^*_{23} = F^*_{34} = F^*_{35} = 1$ and $F^*_{14} = F^*_{15} = F^*_{24} = F^*_{25} = 0$.
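The complete enumeration can be sketched in Python as follows. This is an illustrative sketch, not the program used in this work: a five point tree is encoded by the hypothetical triple (pair, middle point, pair), and its frequencies follow the pattern just given for $T$ (3 for each of the two pairs, 1 for pairs containing the middle point, 0 otherwise). The names `five_point_trees`, `tree_frequencies` and `minimum_sadf_trees` are assumptions of this sketch.

```python
from itertools import combinations

def five_point_trees(pts=(1, 2, 3, 4, 5)):
    """All 15 labeled five point trees as (pair, middle point, pair)."""
    trees = []
    for mid in pts:
        a, b, c, e = [p for p in pts if p != mid]
        for left, right in (((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))):
            trees.append((left, mid, right))
    return trees

def tree_frequencies(tree):
    """Frequencies F*_ij of a five point tree."""
    left, mid, right = tree
    freq = {}
    for pair in combinations(sorted(left + (mid,) + right), 2):
        if pair in (left, right):
            freq[pair] = 3      # the pair is together in all three four point subtrees
        elif mid in pair:
            freq[pair] = 1      # together only when the partner pair completes the subtree
        else:
            freq[pair] = 0      # points from different pairs are never together
    return freq

def minimum_sadf_trees(data_freq):
    """Return the minimum SADF and every five point tree attaining it."""
    scored = [(sum(abs(data_freq[p] - f) for p, f in tree_frequencies(t).items()), t)
              for t in five_point_trees()]
    best = min(score for score, _ in scored)
    return best, [t for score, t in scored if score == best]

# Data frequencies of Example 5.3.
F = {(1, 2): 2, (1, 3): 2, (1, 4): 0, (1, 5): 0, (2, 3): 1,
     (2, 4): 0, (2, 5): 1, (3, 4): 1, (3, 5): 0, (4, 5): 3}
best, solutions = minimum_sadf_trees(F)
print(best, solutions)
```

For the data of Example 5.3 the enumeration returns the two trees with SADF equal to 4 that the searching method 4.5.2 finds.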
Five four point optimal trees are obtained from the data distance matrix. Assuming that the four point estimates are unique, there are three basic data forms (Appendix 9.3), denoted (1), (2) and (3) [the listing of the three forms is not recoverable from the source].

Note that (1) is the compatible case defining a single tree [diagram not reproduced], (2) has one incompatibility with each one of two trees [diagrams not reproduced], and (3) has two incompatibilities with each one of five trees [diagrams not reproduced].

Note that the worst case (3) presents five possible fitting trees. The best case gives a unique compatible tree.
The frequencies corresponding to (1), (2) and (3) are

(1) $F_{i_1 i_2} = F_{i_4 i_5} = 3$, $F_{i_1 i_3} = F_{i_2 i_3} = F_{i_3 i_4} = F_{i_3 i_5} = 1$ and $F_{i_1 i_4} = F_{i_1 i_5} = F_{i_2 i_4} = F_{i_2 i_5} = 0$,

(2) [the non-zero frequencies of this form are not recoverable from the source] and $F_{i_1 i_4} = F_{i_1 i_5} = F_{i_2 i_3} = F_{i_2 i_5} = 0$,

(3) $F_{i_1 i_2} = F_{i_1 i_4} = F_{i_2 i_5} = F_{i_3 i_4} = F_{i_3 i_5} = 2$ and $F_{i_2 i_3} = F_{i_4 i_5} = F_{i_1 i_3} = F_{i_1 i_5} = F_{i_2 i_4} = 0$.
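The unweighted tree for (1) is identified by $F_{i_1 i_2} = F_{i_4 i_5} = 3$, i.e., the two pairs with frequency 3 define the tree.

Case (2) has two solutions. Both solutions share one pair with frequency 3; one solution is given by a second pair with frequency 3 and the other by a different second pair with frequency 3 (the subscripts are not recoverable from the source).

The unweighted trees for (3) are found by considering the five pairs of frequencies $F_{pq}$, $F_{rs}$ such that $F_{pq} = F_{rs} = 2$ and $\{p,q\} \cap \{r,s\} = \emptyset$. For each possible case consider the tree with $F^*_{pq} = F^*_{rs} = 3$.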
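The search for the five pairs of case (3) can be sketched in Python as follows. The sketch assumes, as a reading of the paragraph above, that each pair of disjoint frequencies equal to 2 supplies the two outer pairs of a candidate tree, with the remaining point in the middle. The function name `candidate_trees_case3` is hypothetical, and the frequencies are those of form (3) with $i_k = k$.

```python
from itertools import combinations

def candidate_trees_case3(freq):
    """Candidate trees (pair, middle point, pair) from disjoint pairs with frequency 2."""
    all_points = set().union(*freq)          # every point that appears in some pair
    twos = [pair for pair, f in freq.items() if f == 2]
    trees = []
    for p, q in combinations(twos, 2):
        if not set(p) & set(q):              # the two pairs must be disjoint
            middle = (all_points - set(p) - set(q)).pop()
            trees.append((p, middle, q))
    return trees

# Frequencies of form (3), taking i_1, ..., i_5 as 1, ..., 5.
F3 = {(1, 2): 2, (1, 3): 0, (1, 4): 2, (1, 5): 0, (2, 3): 0,
      (2, 4): 0, (2, 5): 2, (3, 4): 2, (3, 5): 2, (4, 5): 0}

for tree in candidate_trees_case3(F3):
    print(tree)
```

Each of the five points appears exactly once as the middle point, which is why the worst case presents five possible fitting trees.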
Note that allowing zero weights in the estimate, case (2) is solved by

[tree diagram not reproduced]

and case (3) is solved by

[tree diagram not reproduced]

with 0 incompatibilities.

Once the unweighted trees are found, the weights are computed using Section 4.2 and the minimum sum of squares of deviations tree is chosen.
5.7 Analysis of 25 Human Populations

Table 5.1 shows 25 human populations and Table 5.2 shows their distance matrix, based upon the first three principal components of the transformed $A_1$, $A_2$, B, O, M and N gene frequencies and Rh chromosome frequencies used by Goodman (1972).

The sequential method of Section 4.1 gives a tree whose least squares version (Figure 5.3) has sum of squares of deviations 127 and length 19.2. It has 5,808 incompatibilities.

The searching method 4.5.2 gives a tree whose least squares version (Figure 5.4) has sum of squares of deviations equal to 34.8 and length 16.7. It has 2,946 incompatibilities and a sum of absolute deviations of frequencies equal to 7,728.
Table 5.1. List of the 25 Human Populations in Example 5.7.

 1. French               10. Icelander            19. Spanish
 2. Czech                11. Eskimo               20. Norwegian
 3. German               12. Br. Col. Indians     21. Mexican Indians
 4. Basque               13. Blood Indians        22. Egyptian
 5. South Chinese        14. Braz. Indians        23. U.S. Chinese
 6. Ainu                 15. Bantu                24. U.S. Japanese
 7. Australian Abor.     16. N. African Arabs     25. Navaho
 8. New Guinea Native    17. S. Asian Indians
 9. Maori                18. U.S. Negro
Table 5.2.
0.00
0.21
0.119
0."5
1.73
2.01
2.69
3.50
1.77
0.27
2.10
i)
=I
2."1
3.11
3.21
2.29
0.37
0.6"
1.56
0.18
0.20
2.65
0.69
1.64
1.75
3.13
0.21
0.00
0.56
0.66
1.70
1.85
2.81
3.52
1.77
0.27
2.00
2 .....
2.9"
3.28
2.119
0.1l9
0.51
1.77
0.32
0.14
2.70
0.86
1.59
1.64
3.21
0.119
0.56
0.00
0.67
2.12
2.39
2.98
3.96
2.0"
0.66
2.15
2.32
3.03
3.29
2.12
0.82
0.91
1.51
0.114
0.51
2.82
1.03
2.06
2.13
2.94
0.115
0.66
0.67
0.00
1.77
2.32
2.37
3.39
1. 75
0.57
2.30
2.35
3.119
3.011
1. 92
0.111
0.97
1.15
0.42
0.6"
2.52
0.40
1.73'
1.95
2.99
Distance Matrix for the 25 Human Populations in Example 5.7.
1.73 2.01 2.69
1.70 '1,85 2.81
2.12 2.39 2.98
1.77 2.32' 2.37
0.00 1.51 1.93
1.51 0.00 3.30
1. 93 3.30 0.00
2.27 2.90 2.10
0.69 2.02 -1.91
1."9 1.811 2.57
1.75 2.19 3.25
2.31 3.35 2.85
3.3" 2.8" 5.07
2.115 3.8" 2.27
3.119 11.22 3.011
1.60 1.96 2.113
1.36 1.59 2.7"
2.68 3.43 2.43
1.69 2.08 2.62
1.81 1.94 2.86
1.66 3.07 1.84
1.71 2.26 2.18
0.27 1.26 2.11
0.66 0.93 2.58
3.31 1;.39 3.37
3.50
3.52
3.96
3.39
2.27
2.90
2.10
0.00
2.77
3.36
3.98
1.77
1. 77
2.011
1. 75
0.69
2.02
1.91
2.77
0.00
1.52
1.37
0.27
0.27
0.66
0.57
1. .. 9
1.8"
2.57
3.36
1.52
0.00
1.86
".31
5."6
3.81
1.6"
3.20
1. 8..
3.32
1.73
1.39
2.57
1.66
1.89
1.09
1.82
0.90
1.11
2.65
2.2"
2.99
3.02
2.117
".61
3.18
3."1
3.88
3.53
3.58
3.15
3.09
2.28
2.60
5.13
0."5
0.41
1. 71
0.25
0.38
2.44
0.78
1.41
1.51
3.04
2.10
2.00
2.15
2.30
1.75
2.19
3.25
3.98
1.37
1.86
0.00
1.,.9
1.9"
2.39
3.89
2.29
1.53
3.24
1.99
2.13
f.92
2.54
1.80
1.59
2.54
2.111
2.1111
2.32
2.35
2.31
3.35
2.85
".31
1.6..
2.2"
1. ..9
0.00
3.13
1. ..1
3.211
2.57
2.16
2.78
2.23
2.54
1.42
2.63
2.48
2.55
1.11
3.11
2.911
3.03
3.21
3.28
3.29
3."9
3.3"
2.8"
5.07
3.0"
2 ...5
5."6
3.20
2.99
1.911
3.13
0.00
11.27
5.03
3.110
2.62
4."8
3.08
3.02
3.86
3.77
3.26
2.87
3.94
2.29
2.119
2.12
1. 92
3.8"
2.27
3.81
1.8"
3.02
2.39
3."9
".22
3.0"
".61
3.32
2."7
3.89
1."1
".27
0.00
3.77
3.22
2.99
3.30
3.05
3.38
0.'80
3.16
2.70
2.92
1.79
3.2"
5.03
3.77
0.00
2.27
2.811
0.82
2.25
2.41
3.60
1.99
3.52
3.82
3.22
0.37
0.119
0.82
0."1
1.60
1.96
2."3
3.18
1.73
0."5
2.29
2.57
3.110
3.22
2.27
0.00
0.82
1.49
0.47
0.48
2.62
0: 110
1.50
1.69
3.30
0.611
0.51
0.91
0.97
1.36
1.59
2.711
3."1
1.39
0."1
1.53
2.16
2.62
2.992.8"
0.82
0.00
2.11
0.61
0.64
2.37
1.17
1.27
1.25
3.05
1.56
1.77
1.51
1.15
2.68
3."3
2 ... 3
3.88
2.57
1.71
3.2"
2.78
".1l8
3.30
0.82
1. ..9
2.11
0.00
1.52
1.71
2.99
1.18
2.71
3.01
3.02
0.18
0.32
0.1111
0.112
1. 69
2.08
2.62
3.53
1.66
0.25
1.99
2.23
3.08
3.05
2.25
0.117
0.61
1.52
0.00
0.35
2.51
0.74
1.63
i. 75
2.95
0.20
0.111
0.51
0.611
1. 81
1.9"
2.86
3.58
1. 89
0.38
2.13
2.5"
3.02
3.38
2."1
0.118
0.64
1.71
0.35
0.00
2.81
0.84
1.70
1.77
3.27
2.65
2.70
2.82
2.52
1.66
3.07
1. 8..
3.15
1.09
2 .....
1.92
1."2
3.86
0.80
3.60
2.62
2.37
2.99
2.51
2.81
0.00
2.60
1.92
2.15
2.19
0.69
0.86
1.03
0."0
1.71
2.26
2.18
3.09
1.82
0.78
2.5"
2,63
3.77
3.16
1.99
0."0
1.17
1.18
0.74
0.84
2.60
0.00
1.66
1.93
3.28
1.611
1. 59
2.06
1.73
0.27
1.26
2.11
2.2a
0.90
1."1
1. 80
2."8
3.26
2.70
3.52
1.50
1.27
2.71
1.63
1.70
1.92
1.66
0.00
0.48
3.48
1.75
1.611
2.13
1.95
0.66
0.93
2.58
2.60
1.11
1. 51
1.59
2.55
2.87
2.92
3.82
1.69
1.25
3.01
1.75
1.77
2.15
1.93
0.48
0.00
3.61
3.13
3.21
2.9"
2.99
3.31
".39
3.37
5.13
2.65
3.0"
2.5"
1.11
3.9"
1.79
3.22
3.30
3.05
3.02
2.95
3.27
2.19
3.28
3.48
3.61
0.00
--...J
o
•
[Tree diagram not reproduced. Scale: 1 cm = 0.2.]

Figure 5.3. Least squares version of the tree obtained by the sequential method for 25 Human Races.
[Tree diagram not reproduced. Scale: 1 cm = 0.2.]

Figure 5.4. Least squares version of the tree obtained with method 4.5.2 for 25 Human Races.
5.8 Analysis of 26 Mexican Races of Maize

On the basis of the paper by Goodman and Paterniani (1969) the following seven characters were chosen:
(1) Row number
(2) Ear diameter/length
(3) Kernel length/ear length
(4) Kernel thickness/length
(5) Rachis diameter (cm.)
(6) Kernel width (mm.)
(7) Cob diameter (cm.)

In addition ear and kernel shape were scored on a 1 (= pointed) to 3 (= non-pointed or cylindrical) basis. Endosperm type (pop, flint, flour, dent) and color (white, yellow, other) were scored on a percentage basis. Altitude (meters), latitude, longitude and the absolute value of latitude were also used for a total of 20 characters. A square root transformation was used for row number, $\log_e$ for characters (2), (3) and (4), and $\arcsin\sqrt{x}$ for the frequencies. Data from 218 Latin American races were used to calculate the among race means correlation matrix. From the means and the seven standardized principal components with eigenvalues greater than 1.0, distances among all the races were calculated. These results are a slight modification of those used in Goodman (1972). The submatrix corresponding to 26 Mexican races (Tables 5.3, 5.4) was chosen for this example.
Figure 5.5 shows the least squares version of the tree obtained by the sequential method. It has sum of squares of deviations 93.2 and length 31.1. It has 5,486 incompatibilities with the data. The least squares version of the tree found by using method 4.5.2 (Figure 5.6) has sum of squares of deviations 82.1, length 29.8 and 5,502 incompatibilities with the data. It has sum of absolute deviations of frequencies equal to 12,988. This tree has smaller sum of squares of deviations and length, but larger number of incompatibilities than the least squares version of the sequential approach (Figure 5.5). To have a more efficient use of the available computer storage, sets of races were chosen on the basis of their frequency of appearance on the set of defined limbs. For each set of races chosen, a limb was fed into the computer, modifying the program to ignore other limbs for the same set of races. These limbs were modified by trial and error until obtaining the tree of Figure 5.7. It has sum of squares of deviations equal to 77.4, length of 28.7, sum of absolute deviations of frequencies of 12,348, and 4,987 incompatibilities with the data set.
Table 5.3. List of the 26 Mexican Races of Maize in Example 5.8.

 1. Palomero Tolucano     10. Conico               19. Pepitilla
 2. Arrocillo Amarillo    11. Reventador           20. Olotillo
 3. Chapalote             12. Tabloncillo          21. Tuxpeno
 4. Nal-Tel               13. Tabloncillo Perla    22. Vandeno
 5. Cacahuacintle         14. Tehua                23. Chalqueno
 6. H. de 8               15. Comiteco             24. Celaya
 7. H. de 8 Occidente     16. Jala                 25. Conico Norteno
 8. Oloton                17. Zapalote Chico       26. Bolita
 9. Maiz Dulce            18. Zapalote Grande
0.00
3.37
3.37
0.00
3.119
5.08
4.36
0.00
4.36
4.44
2.18
11.78
4.27
5.93
5.13
Distance Matrix for the 26 Mexican Races of Maize in Example 5.8.
5.110
11.28
2.08
11.49
2.811
2.115
2.29
3.97
2.311
3.95
11.95
3.98
2.18
4.78
5.13
2.60
4.27
3.99
3.95
3.98
2.26
3.43
3.07
2.37
3.86
0.00
3.56
3.38
2.25
2.84
2.29
2.45
3.99
2.115
3.03
2.98
3.25
3.69
11.36
2.10
3.711
2.55
3.45
3.97
11.113
4.115
2.311
3.72
4.02
3.10
3.112
3.98
2.511
5.65
5.115
3.06
3.32
5.26
2.87
2.01
2.59
11.95
3.110
2.33
2.34
5.52
3.38
11.97
3.26
1. 311
1. 75
5.03
11.15
3.35
3.15
2.75
3.31
2.68
11.32
4.17
4.17
11.15
11.30
3.99
3.70
3.18
3.08
2.67
4.113
11.01
11.28
3.98
11.611
3.67
2.63
3.53
2.62
3.70
11.71
4.11
11.33
2.86
11.06
3.25
4.09
5.93
5.110
11.95
4.28
2.08
4.119
I
5.08
3.49
Table 5.11.
2.60
0.00
3.86
11.21
4.44
D =
•
5.21
5.37
11.25
2.73
5.39
3.86
11.30
3.111
3.99
4.21
3.56
0.00
2.12
3.49
2.26
3.07
3.38
2.12
3.113
2.37
0.00
2.95
2.95
0.00
2.25
3.49
2.115
2.98
3.69
2.10
2.55
0.00
2.68
3.115
2.89
3.99
3.03
3.25
4.36
3.711
3.45
2.68
0.00
3.70
3.119
3.67
3.10
3.98
3.26
5.21
4.113
5.37
11.115
5.52
3.72
3.42
2.511
1. 311
11.02
3.38
5.65
5.4&
4.97
3.32
5.26
11.95
3.112
2.87
3.110
3.45
3.70
2.01
2.33
3.06
1.75
2.59
2.34
2.89
3.119
3.33
3.67
3.83
11.62
0.00
3.15
3.15
0.00
0.92
4.27
3.20
0.92
0.00
5.31
11.27
11.110
11.110
0.00
2.63
2.52
2.115
3.21
2.08
2.99
11.11
3.82
2.15
4.110
3.112
3.33
3.83
11.62
3.20
5.31
1. 99
2.07
3.60
3.110
2.1111
3.18
2.88
3.22
3.00
2.86
2.70
2.92
4.27
2.70
2.97
3.112
3.75
11.01
2.112
1. 79
3.56
3.99
2.99
2.115
3.00
11.110
2.79
3.76
3.72
3.011
1. 67
3.71
2.08
2.511
2.23
5.17
11.19
11.30
3.65
11.17
11.115
1. 88
3.86
2.85
2.07
3.011
3.82
3.02
3.25
3.33
3.72
2.67
11.112
11.118
11.08
3.62
3.18
2.86
3.51
11.46
3.36
3.99
3.35
3.21
2.62
2.611
2.90
1. 56
11.10
11.21
2.711
3.112
2.95
3.59
3.611
4.07
3.116
2.811
2.56
2.79
2.81
4.111
3.39
2.911
11.30
3.62
2.65
2.27
2.37
1.12
3.112
4.11
1. 7 q
2.76
2.88
3.33
4.53
3.67
11.36
3.26
2.59
3.15
3.03
2.39
2.115
2.611
11.22
5.03
11.15
4.32
4.17
3.35
3.15
2.75
3.31
11.15
3.86
11.30
3.99
3.18
3.70
2.67
3.08
3.18
3.22
2.68
1.99
2.07
3.60
3.110
2.44
2.63
2.115
0.00
2.08
2.57
1. 75
3.72
2.86
2.99
3.112
2.112
2.52
11.17
11.25
4.30
2.73
3.41
11.43
11.01
4.64
11.28
2.62
2.63
3.70
3.53
4.45
2. ail
3.00
2.70
2.115
3.56
2.92
2.70
3.00
3.99
4.27
3.75
1. 79
4.71
11.11
4.19
1. 88
2.07
3.86
2.85
3.72
3.011
3.71
3.33
11.30
4.17
3.011
3.82
3.02
3.25
4.33
11.06
11.42
11.08
3.18
3.51
3.36
2.67
3.59
11.07
2.57
1.511,
1. 75
11.110
3.72
5.17
2.08
0.00
3.07
2.611
2.12
1.23
2.59
1. 511
1. 23
0.00
1. 811
2.211
2.57
2.25
1.18
1. 58
1. 811
2.211
0.00
2.60
3.35
2.60
0.00
3.35
3.70
2.07
2.69
3.70
1. 80
2.07
2.69
0.00
1. 80
2.211
2.80
0.00
0.95
2.211
1.55
2.80
3.87
1. 33
2.82
2.33
1.18
2.07
1.113
2.83
11.07
1.18
2.25
1.58
2.96
3.29
2.12
2.91
0.96
2.25
1.117
2.08
3.87
3.87
2.117
3.311
1. 86
3.62
1. 311
2.51
0.50
2.011
2.01
2.16
3.51
2.23
1. 211
1. 83
1. 86
2.28
2.03
2.82
3.21
2.611
2.08
2.23
3.21
2.59
2.57
4.48
3.62
2.86
4.116
3.99
1. 56
11.21
3.112
2.95
2.96
2.15
2.86
3.25
3.35
2.62
2.90
3.72 11.10
2.511" 2.711
2.08
3.82
3.07
2.611
3.10
2.35
11.01
2.79
2.99
2.97
1. 67
11.110
3.76
4.11
5.39
3.98
3.67
3.65
3.62
2.75
2.39
3.29
4.09
2.76
11.53
3.611
11.07
3.116
2.81
11.111
3.39
3.67
11.36
2.84
2.56
2.79
2.94
11.30
2.88
2.65
2.37
3.112
1. 73
1. 86
3.87
3.62
3.33
3.26
2.59
3.15
2.27
1.12
4.11
3.03
2.39
2.115
2.611
11.22
3.311
3.62
2.01
2.16
3.87
2.91
2.25
2.117
3.10
3.51
2.35
1. 311
2.08
0.50
2.51
2.011
2.23
1.24
1. 83
1. 55
2.28
2.03
3.87
1. 86
2.33
2.07
1.18
3.62
2.82
2.75
1. 33
2.82
1.113
2.83
2.39
0.00
1. 90
0.00
1.119
2.09
2.13
0.83
1. 89
2.13
2.09
0.83
2.13
0.00
2.08
2.08
0.00
2.13
1. 38
2.13
0.96
1.117
0.95
1. 90
1.119
2.13
1. 89
1. 38
0.00
-....I
(j'\
•
[Tree diagram not reproduced. Scale: 1 cm = 0.2.]

Figure 5.5. Least squares version of the tree obtained by the sequential method for 26 Maize Races.

[Tree diagram not reproduced. Scale: 1 cm = 0.2.]

Figure 5.6. Least squares version of the tree obtained by method 4.5.2 for the 26 Maize Races.

[Tree diagram not reproduced. Scale: 1 cm = 0.2.]

Figure 5.7. Least squares version of the tree obtained by fixing the circled tips in method 4.5.2 for the 26 Maize Races.
6. DISCUSSION

6.1 Estimation Criteria

In biology, distances between populations (points) are not usually measured directly. Usually, they are obtained (Goodman, 1972) from "... a set of vectors ($x_i$; $i = 1, \ldots, N$), each of which contains $n$ character means for one of the $N$ taxa or populations being studied (such means may be frequencies) and an $n \times n$ covariance matrix which may be derived from the above mentioned vectors, or from variations of individuals around the taxa means."

Choosing the appropriate characters and the transformations and distance to be used, as well as their biological and geometrical meaning, is very important. These problems have been discussed elsewhere (Cavalli-Sforza and Edwards, 1967; Farris, 1972; Goodman, 1972; Goodman and Paterniani, 1969) and were not considered in this dissertation. Little attention has been paid to the distribution of the sample distance matrix and the difficulty of its derivation discourages its study. The estimation criteria were used, in this dissertation, as a measure of distortion. Least squares was chosen as the measure of distortion (3.2). Unfortunately, the least squares solution of the four points case does not solve any practical problem and consideration of other criteria to solve the general case is necessary.
Adding a point at a time by choosing at each step the point with the smallest sum of squares of deviations yields the sequential approach (Section 4.1). This procedure is the least squares version of Farris' (1972) method. The methods have different criteria and comparison is difficult, but Farris' data was included in example 5.5, where the sequential approach gives very slightly smaller length and smaller sum of squares of deviations.

The result of the sequential approach of Section 4.1 can be improved by computing its least squares version using method 4.2. Unlike the least squares solutions of Cavalli-Sforza and Edwards (1967) and Horne (1967), Section 4.2 gives a method that does not allow negative weights.

The number of incompatibilities concept, though difficult to work with, seems to have strong intuitive meaning. No research worker would like to have a tree representation of his distance matrix with a "large" number of incompatibilities (with respect to the data), where "large" remains to be defined. The tree with minimum sum of absolute deviations of frequencies is used as an approximation to the tree with minimum number of incompatibilities.
6.2
•
The Four Points Case
Theorem 2.2.6 gives meaning to the solution of the
•
82
four points case.
Solving larger cases lS equivalent to
solving a set of four point cases.
The non-uniqueness of the least squares solution
merits consideration.
result
The solution is unique unless
3.3.1), in this case, there are two equally good
solutions.
The research worker must reVlew both results
and choose the tree supported by his experlence.
When working an> 4 points problem, there are (~)
four point cases.
A consideration of each non-unique
solution may be impractical, and a decision to choose one
of the solutions must often be made beforehand.
In solving examples 5.7 and 5.8, ties were broken by
increasing the precision in the computations of the
distances. The breaking of equalities may be done by
using some random device such as tossing a coin, but the
effect of such a choice has not been studied. The choice
is more important when the n points tree is not "well
defined," i.e., when there are several n point trees close
to the minimum number of incompatibilities.
The preceding discussion also applies to the possibility
of a data submatrix of order four having a tree realization
with three different choices of notation, i.e., a tree of
the form given in the Remark to Result 3.1.2.
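To give an idea of how quickly the set of four point cases grows, the following is a minimal Python sketch that enumerates the four point subsets of an n points problem, together with the three possible pairings i:j - k:ℓ of each subset. The function names are hypothetical and are not taken from the programs described in this work.

```python
from itertools import combinations
from math import comb


def four_point_cases(n):
    """Return an iterator over the four point subsets of the points 1..n."""
    return combinations(range(1, n + 1), 4)


def pairings(quad):
    """Return the three ways of splitting a four point subset into two pairs."""
    i, j, k, l = quad
    return [((i, j), (k, l)), ((i, k), (j, l)), ((i, l), (j, k))]


if __name__ == "__main__":
    for n in (5, 10, 30):
        count = sum(1 for _ in four_point_cases(n))
        # The count agrees with the binomial coefficient C(n, 4).
        print(n, count, comb(n, 4))
```

Each subset has only three candidate pairings, so a rule fixed beforehand for breaking ties can be applied uniformly to every subset.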
6.3 The General Case
Two methods to estimate the unweighted tree are offered:
a) The sequential approach (Section 4.1), and
b) The searching method (4.5.2) based on the sum of
absolute deviations of frequencies.
After finding the unweighted tree, the method of Section 4.2
must be used to obtain its weights. As can be seen in
example 5.4, when there is more than one unweighted tree,
a choice based on the sum of squares of deviations can be
made.
The five points case was chosen to illustrate method
4.5.2 (Example 5.3). The number of times through steps 2
and 3 and the number of defined limbs increase as the
number of incompatibilities increases. A similar response
can be expected for larger numbers of points.
In example 5.8, the data set gives many different
choices for size 2, 3, 4 and 5 limbs, and the method 4.5.2
program exhausts the computer storage while working on the
tips of the tree. Fixing the tips of the estimate was
necessary to obtain an estimate with a smaller number of
incompatibilities than the sequential approach estimate.
Example 5.6 shows that for n=5, the minimum number of
incompatibilities gives an idea of the general structure of
the data set. A similar development for n=6 was
impractical. It can be noted that for n > 5 point cases,
the minimum number of frequencies that define a tree
increases and the largest data set frequencies mainly give
information about the tips of the tree. The minimum
number of incompatibilities cannot be used for n > 5, but
the sums of squares of deviations of the estimates in
examples 5.7 and 5.8 decrease as the numbers of
incompatibilities (of the estimate with respect to the
data set) decrease. Considering that there must be a
maximum number of incompatibilities smaller than
$\binom{n}{4}$, the best estimates of both examples 5.7 and
5.8 seem to have many incompatibilities, 23.3% and 33.3%
with respect to $\binom{n}{4}$. A study of the maximum
possible number of incompatibilities is necessary.
All computations were done on the Triangle
Universities Computation Center IBM System 370/165.
Programs were developed to work n < 30 point cases. The
program for method 4.5.2 searches N < 400 limbs. A
summary of computer storage and time requirements of
programs developed is in Appendix 9.4. It can be seen
that using 500 k (=512,000) bytes, the sequential method
program can work $n \le 169$ point cases, the program for
computing the least squares weights can work $n \le 63$ point
cases and the searching method 4.5.2 can search N < 675
limbs for an n < 30 points case.
7. SUMMARY
The objective of this work was to solve the problem
of drawing an optimal tree from a distance matrix. In
accord with previous works (Cavalli-Sforza and Edwards,
1967, and Horne, 1967), least squares was chosen as the
optimality criterion.
The four points case was solved, but there was not a
straightforward generalization. Least squares was used to
give a fast sequential method of adding a point at a time.
Instead of decomposing the problem by considering
first some points and then another point, it is possible
to decompose the problem by finding the unweighted tree
(the tree form) and then estimating its least squares
weights by using a restricted least squares method.
Considering an n points problem as a set of $\binom{n}{4}$
four point subproblems made possible the development of the
concept of the minimum number of incompatibilities. To find
a practical algorithm to compute the unweighted tree, a
related criterion, the minimum sum of absolute deviations
of frequencies, was used.
Examples were done on the Triangle Universities
Computation Center IBM System 370/165. The programs
developed work for n < 30.
8. LIST OF REFERENCES
Beale, E. M. L. 1955. On minimizing a convex function subject to linear inequalities. J. Roy. Statist. Soc. (B), 17:173-184.
Camin, J. H. and R. R. Sokal 1965. A method for deducing branching sequences in phylogeny. Evolution, 19:311-326.
Cavalli-Sforza, L. L. and A. W. F. Edwards 1965. Estimation procedures for evolutionary branching processes. Bull. Inst. Int. Statist., 41:803-807.
Cavalli-Sforza, L. L. and A. W. F. Edwards 1967. Phylogenetic analysis. Models and estimation procedures. Evolution, 21:550-570.
Edwards, A. W. F. and L. L. Cavalli-Sforza 1963. The reconstruction of evolution. Heredity, 18:553.
Edwards, A. W. F. and L. L. Cavalli-Sforza 1964. Reconstruction of evolutionary trees. Systematics Association, Pub. No. 6:67-76.
Farris, S. J. 1970. Methods for computing Wagner trees. Syst. Zool., 19:83-92.
Farris, S. J. 1972. Estimating phylogenetic trees from distance matrices. The American Naturalist, 106:645-668.
Goodman, M. M. 1972. Distance analysis in biology. Syst. Zool., 21:174-186.
Goodman, M. M. and E. Paterniani 1969. The races of maize: III. Choices of appropriate characters for racial classification. Economic Botany, 23:265-273.
Hakimi, S. L. and S. S. Yau 1965. Distance matrix of a graph and its realizability. Quart. App. Math., 22:305-317.
Harary, F. and R. Z. Norman 1953. Graph theory as a mathematical model in Social Science. The University of Michigan. Ann Arbor, Michigan, 1953.
Horne, S. L. 1967. Comparison of primate catalase tryptic peptide and implications for the study of molecular evolution. Evolution, 21:771-786.
Kluge, A. G. and S. J. Farris 1969. Quantitative phyletics and the evolution of Anurans. Syst. Zool., 18:1-32.
Simoes-Pereira, J. M. S. 1966. Some results on the tree realization of a distance matrix. In P. Rosenstiehl (ed.) Theory of graphs. International Symposium, Rome, 1966. Gordon and Breach, New York (1967), and Dunod, Paris (1967). pp. 383-388.
Simoes-Pereira, J. M. S. 1969. A note on the tree realizability of a distance matrix. Journal of Combinatorial Theory, 6:303-310.
Wagner, W. H., Jr. 1967. The construction of a classification. In: Systematic Biology. Proceedings of an International Conference. Publication 1692, National Academy of Science. Washington, D.C. pp. 67-90.
9. APPENDICES
9.1 A summary of graph theory definitions used.
With suitable changes, the following ten definitions
are based on Harary and Norman (1953).
Definition 9.1.1
A graph of n points $v_1, v_2, \ldots, v_n$ consists of these n
points together with a subset $e_1, e_2, \ldots, e_m$ of all lines
joining pairs of these points.
Definition 9.1.2
A path from $v_i$ to $v_j$ in the graph G is a collection
of lines of G of the form $v_i v_{i+1}, v_{i+1} v_{i+2}, \ldots, v_{i+k-1} v_{i+k}, v_{i+k} v_j$,
where $v_i, v_{i+1}, \ldots, v_{i+k}, v_j$ are all different
points of G and $v_h v_\ell$ denotes the line in the graph joining
$v_h$ and $v_\ell$.
Definition 9.1.3
A graph is connected if there exists a path between
every pair of its points.
Definition 9.1.4
A cycle of a graph G is a collection of lines of G of
the form $v_i v_{i+1}, v_{i+1} v_{i+2}, \ldots, v_{i+k-1} v_{i+k}, v_{i+k} v_i$, where
$v_i, v_{i+1}, \ldots, v_{i+k}$ are all different points of G and
$v_i v_{i+1}, \ldots, v_{i+k} v_i$ are all lines in the graph.
Definition 9.1.5
A tree is a connected graph having no cycles.
Definition 9.1.6
A weighted graph is a graph with a nonnegative weight
attached to each of its lines. Denote by $W_{ij}$ the weight
of the line $v_i v_j$.
Definition 9.1.7
The length of a path of a weighted graph is the sum
of the weights of its lines.
Definition 9.1.8
The length of a graph is the sum of the weights of
its lines.
Definition 9.1.9
The distance between two points in a weighted and
connected graph is given by the length of the shortest
path joining them.
Definition 9.1.10
The distance matrix of a connected graph of n points
is the matrix $D = (d_{ij})$ of order n, where $d_{ij}$ = distance
between $v_i$ and $v_j$.
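Definitions 9.1.9 and 9.1.10 can be illustrated with a minimal Python sketch that computes the distance matrix of a small weighted, connected graph. It uses the standard all-pairs shortest path recursion (Floyd-Warshall), which is an illustrative choice and not a method described in this work; the function name and the example weights are hypothetical.

```python
import math


def distance_matrix(num_points, weighted_lines):
    """Distance matrix of a weighted graph (Definitions 9.1.9 and 9.1.10).

    weighted_lines maps a pair of point indices (i, j) to the nonnegative
    weight of the line joining them. Points are numbered 0..num_points-1.
    """
    d = [[0.0 if i == j else math.inf for j in range(num_points)]
         for i in range(num_points)]
    for (i, j), weight in weighted_lines.items():
        d[i][j] = d[j][i] = min(d[i][j], weight)
    # Allow each point k in turn as an intermediate point of a path.
    for k in range(num_points):
        for i in range(num_points):
            for j in range(num_points):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


if __name__ == "__main__":
    # A tree with three external points (0, 1, 2) joined to one internal point (3).
    lines = {(0, 3): 1.0, (1, 3): 2.0, (2, 3): 3.0}
    for row in distance_matrix(4, lines):
        print(row)
```

In the example, the rows and columns of the three external points form the distance matrix that the tree realizes in the sense of Definition 9.1.11.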
The next three definitions are based on Hakimi and
Yau (1965).
Definition 9.1.11
A (graph) realization of a distance matrix D of order
n is a graph G with $m \ge n$ points, such that n of its
points have distance matrix D. The n points of G giving D
are called external points, and the remaining points of G
are called internal points.
Remark:
In the above context, it will be understood that an n
points tree is an n external points tree.
Definition 9.1.12
An optimum realization of a distance matrix D is a
realization having minimum length.
Definition 9.1.13
A line $v_i v_j$ in a connected and weighted graph G is
redundant if there exists $v_k$ in G such that $v_i \ne v_k \ne v_j$ and
$W_{ij} > d_{ik} + d_{kj}$, where $W_{ij}$ is the weight of the line $v_i v_j$
and $d_{ik}$ ($d_{kj}$) is the distance between $v_i$ and $v_k$ ($v_k$ and
$v_j$).
9.2 Minimization procedure of the sequential approach.
Let $T_{n-1}$ be an n-1 points $(v_1, \ldots, v_{n-1})$ weighted tree
with distance matrix G. Denote the points of $T_{n-1}$ which
are not in $\{v_1, \ldots, v_{n-1}\}$ by $v_{n+1}, v_{n+2}, \ldots, v_m$ (the latter
are internal points).
Given the distances $d_{v_i v_n}$, $i = 1, \ldots, n-1$; the problem
is to find a new set of distances $h_{v_i v_n}$, $i = 1, \ldots, n-1$;
such that
$$\sum_{i=1}^{n-1} (h_{v_i v_n} - d_{v_i v_n})^2 = \min_{\substack{p \text{ a point of any line } B \text{ of } T_{n-1} \\ \text{and } W \ge 0}} \; \sum_{i=1}^{n-1} (g_{v_i v_n} - d_{v_i v_n})^2,$$
where $g_{v_i v_n}$ is the distance between the points $v_i$ and $v_n$
in the tree $T_n$, formed by adding a line of weight W at the
point p of the line B of $T_{n-1}$.
In order to put
$$\sum_{i=1}^{n-1} (g_{v_i v_n} - d_{v_i v_n})^2$$
as a function of p, B, and W, do the following:
Let B be any fixed line of $T_{n-1}$ with end points $v_r$,
$v_s$, $r < s$ and weight x, then fix p as the point of B at a
distance a from $v_r$.
There are two possibilities, either $r \le n-1$ or $r > n$.
If $r \le n-1$, then for each $i = 1, 2, \ldots, n-1$,
$$g_{v_i v_n} = \begin{cases} a + W & \text{if } i = r \\ d_{v_i v_r} - a + W & \text{if } i \ne r, \end{cases}$$
and if $r > n$, then for each $i = 1, 2, \ldots, n-1$,
$$g_{v_i v_n} = \begin{cases} d_{v_i v_r} + a + W & \text{if } d_{v_i v_r} < d_{v_i v_s} \\ d_{v_i v_r} - a + W & \text{if } d_{v_i v_r} > d_{v_i v_s}. \end{cases}$$
Substituting the value of each $g_{v_i v_n}$, it follows that
$$\sum_{i=1}^{n-1} (g_{v_i v_n} - d_{v_i v_n})^2 = C_1 + C_2 a + C_3 W + C_4 aW + C_5 (a^2 + W^2),$$
where if $r \le n-1$
$$C_1 = d_{v_r v_n}^2 + \sum_{k<n,\, k \ne r} (d_{v_k v_n} - d_{v_k v_r})^2,$$
$$C_2 = -2 \Big\{ d_{v_r v_n} - \sum_{k<n,\, k \ne r} (d_{v_k v_n} - d_{v_k v_r}) \Big\},$$
$$C_3 = -2 \Big\{ d_{v_r v_n} + \sum_{k<n,\, k \ne r} (d_{v_k v_n} - d_{v_k v_r}) \Big\},$$
$$C_4 = -2(n-1) + 4,$$
and
$$C_5 = n-1;$$
and if $r > n$
$$C_1 = \sum_{v_k \in S_1} (d_{v_k v_n} - d_{v_k v_r})^2 + \sum_{v_k \in S_2} (d_{v_k v_n} - d_{v_k v_r})^2,$$
$$C_2 = -2 \Big\{ \sum_{v_k \in S_1} (d_{v_k v_n} - d_{v_k v_r}) - \sum_{v_k \in S_2} (d_{v_k v_n} - d_{v_k v_r}) \Big\},$$
$$C_3 = -2 \Big\{ \sum_{v_k \in S_1} (d_{v_k v_n} - d_{v_k v_r}) + \sum_{v_k \in S_2} (d_{v_k v_n} - d_{v_k v_r}) \Big\},$$
$$C_4 = 2(s_1 - s_2),$$
and
$$C_5 = n-1;$$
where $S_1$ has $s_1$ elements, $S_2$ has $s_2$ elements, $S_1 \cup S_2 = \{v_1, \ldots, v_{n-1}\}$
and $v_i \in S_1$ if and only if $d_{v_i v_r} < d_{v_i v_s}$.
Now, for the line B of $T_{n-1}$, the problem is to
minimize
$$Q = C_1 + C_2 a + C_3 W + C_4 aW + C_5 (a^2 + W^2)$$
with the restriction that
$$0 \le a \le x \quad \text{and} \quad 0 \le W \le Z,$$
where Z is fixed arbitrarily as the maximum entry of D.
To find the minimum, check all critical points
(i.e., points with $\partial Q / \partial a = \partial Q / \partial W = 0$) within
$0 < a < x$, $0 < W < Z$ and the minimum of Q on the boundary.
This is easily accomplished by using the computer.
Then do all the above for each line of $T_{n-1}$ and
choose the line giving the minimum Q.
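The procedure for a single line B can be sketched in Python as follows. This is a minimal sketch and not the FORTRAN IV program of this work: the function names and the example tree are hypothetical. It builds the coefficients from the sets $S_1$ and $S_2$ only, on the assumption that an external end point $v_r$ can be treated as the single member of $S_1$ (its distance to itself is 0), which gives the same coefficients as the $r \le n-1$ formulas.

```python
def q_coefficients(d_new, d_r, d_s):
    """Coefficients C1..C5 of Q(a, W) for one line B with end points v_r, v_s.

    d_new[k]: data distance between v_k and the new point v_n.
    d_r[k], d_s[k]: tree distances from v_k to v_r and to v_s.
    """
    # e_k = d(v_k, v_n) - d(v_k, v_r), split by the side of B on which v_k lies.
    e1 = [dn - dr for dn, dr, ds in zip(d_new, d_r, d_s) if dr < ds]   # S1
    e2 = [dn - dr for dn, dr, ds in zip(d_new, d_r, d_s) if dr >= ds]  # S2
    c1 = sum(e * e for e in e1) + sum(e * e for e in e2)
    c2 = -2.0 * (sum(e1) - sum(e2))
    c3 = -2.0 * (sum(e1) + sum(e2))
    c4 = 2.0 * (len(e1) - len(e2))
    c5 = float(len(e1) + len(e2))
    return c1, c2, c3, c4, c5


def minimize_q(coeffs, x, z):
    """Minimize Q over 0 <= a <= x, 0 <= W <= z; return (a, W, Q)."""
    c1, c2, c3, c4, c5 = coeffs

    def q(a, w):
        return c1 + c2 * a + c3 * w + c4 * a * w + c5 * (a * a + w * w)

    candidates = [(0.0, 0.0), (0.0, z), (x, 0.0), (x, z)]  # corners of the box
    # Interior critical point: solve dQ/da = dQ/dW = 0 (a 2 x 2 linear system).
    det = 4.0 * c5 * c5 - c4 * c4
    if det != 0.0:
        a = (c4 * c3 - 2.0 * c5 * c2) / det
        w = (c4 * c2 - 2.0 * c5 * c3) / det
        if 0.0 < a < x and 0.0 < w < z:
            candidates.append((a, w))
    # Boundary: on each edge Q is a parabola in the free variable (c5 > 0).
    for a0 in (0.0, x):
        w = -(c3 + c4 * a0) / (2.0 * c5)
        if 0.0 <= w <= z:
            candidates.append((a0, w))
    for w0 in (0.0, z):
        a = -(c2 + c4 * w0) / (2.0 * c5)
        if 0.0 <= a <= x:
            candidates.append((a, w0))
    a_best, w_best = min(candidates, key=lambda p: q(*p))
    return a_best, w_best, q(a_best, w_best)


if __name__ == "__main__":
    # Hypothetical tree: external points v1, v2, v3 joined to one internal
    # point by lines of weight 1, 2, 3. B joins v1 (= v_r) to the internal point.
    d_r = [0.0, 3.0, 4.0]     # tree distances to v_r
    d_s = [1.0, 2.0, 3.0]     # tree distances to v_s
    d_new = [2.0, 4.0, 5.0]   # data distances to the new point
    print(minimize_q(q_coefficients(d_new, d_r, d_s), x=1.0, z=5.0))
```

In the example the data distances are exactly those of a tree with the new point attached at a = 0.5 by a line of weight W = 1.5, so the minimum of Q is zero. Repeating the call for every line of the tree and keeping the smallest Q gives the step described above.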
The problem can be generalized to the case where it
can be chosen among several possible points to be added to
the tree. Choose the point giving the minimum Q.
9.3 The Possible Outcomes of a Distance Matrix of Order
Five.
Let D be a distance matrix of order 5, let $D_i$;
$i = 1, 2, 3, 4, 5$ be the submatrices of D corresponding to
the subscripts $A_1 = \{1,2,3,4\}$, $A_2 = \{1,2,3,5\}$, $A_3 = \{1,2,4,5\}$,
$A_4 = \{1,3,4,5\}$ and $A_5 = \{2,3,4,5\}$ respectively, and let
$D_i$ have a unique optimal estimate tree $T_i$; $i = 1,2,3,4,5$.
It is clear, by counting, that $0 \le F_{ij} \le 3$;
$\{i<j\} \subset \{1,2,3,4,5\}$, and
$$\sum_{i=1}^{4} \sum_{j=i+1}^{5} F_{ij} = 10.$$
Note that
$$\max_{\{i<j\} \subset \{1,2,3,4,5\}} F_{ij} \ge 2.$$
Indeed, assuming the contrary forces $F_{ij} = 1$; $\{i<j\} \subset \{1,2,3,4,5\}$,
i.e., i:j - k:ℓ, i:k - j:h, i:h - j:ℓ, i:ℓ - k:h and
j:k - ℓ:h; $\{i,j,k,\ell,h\} = \{1,2,3,4,5\}$.
It follows that relations (9.1), (9.2), (9.3), (9.4) and (9.5) hold.
From 9.2, 9.3, 9.4 and 9.5 obtain respectively
$$d_{ij} > d_{ik} + d_{jh} - d_{kh}, \tag{9.6}$$
$$d_{ij} > d_{ih} + d_{j\ell} - d_{h\ell}, \tag{9.7}$$
$$d_{k\ell} > d_{i\ell} + d_{kh} - d_{ih} \tag{9.8}$$
and
$$d_{k\ell} > d_{jk} + d_{h\ell} - d_{jh}. \tag{9.9}$$
Add 9.6, 9.7, 9.8 and 9.9 to obtain
$$2(d_{ij} + d_{k\ell}) > d_{ik} + d_{j\ell} + d_{i\ell} + d_{jk},$$
in contradiction with 9.1.
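The counting statements above can be checked with a minimal Python sketch. It assumes that $F_{ij}$ counts the trees $T_1, \ldots, T_5$ in which the points i and j are paired (each tree of the form i:j - k:ℓ adds one to $F_{ij}$ and one to $F_{k\ell}$); that reading, and the function names, are assumptions made for the illustration.

```python
from itertools import combinations, product

POINTS = (1, 2, 3, 4, 5)
QUADS = list(combinations(POINTS, 4))  # the subscript sets A_1, ..., A_5


def pairings(quad):
    """Return the three possible trees i:j - k:l of a four point subset."""
    a, b, c, d = quad
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def frequencies(trees):
    """Count, for every pair of points, the trees in which the pair occurs."""
    freq = {pair: 0 for pair in combinations(POINTS, 2)}
    for left, right in trees:
        freq[left] += 1
        freq[right] += 1
    return freq


# Every combinatorial outcome: one tree chosen for each four point subset.
ALL_OUTCOMES = list(product(*(pairings(q) for q in QUADS)))

if __name__ == "__main__":
    fixed = [t for t in ALL_OUTCOMES
             if t[0] == ((1, 2), (3, 4)) and t[1] == ((1, 2), (3, 5))]
    print(len(ALL_OUTCOMES), len(fixed))
    print(all(sum(frequencies(t).values()) == 10 for t in ALL_OUTCOMES))
```

The enumeration is purely combinatorial, so it still contains outcomes with every $F_{ij} = 1$; those are excluded only by the argument on the distances given above.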
Considering fixed i:j - k:ℓ and i:j - k:h, there are
27 possible outcomes. They are listed in Table 9.1.
Table 9.1 Outcomes of a Five Points Case With i:j-k:ℓ and i:j-k:h Fixed.
[Table body: frequencies F of the ten pairs of points (rows) for each of the 27 outcomes (columns 1 to 27); entries not legible.]
The outcome $F_{ij} = F_{\ell h} = 2$, $F_{ik} = F_{i\ell} = F_{jk} = F_{jh} = F_{k\ell} = F_{kh} = 1$
and $F_{ih} = F_{j\ell} = 0$ (column 10, Table 9.1)
is impossible. Indeed, from i:j - k:h, i:ℓ - j:h and
i:k - ℓ:h obtain relations (9.10), (9.11) and (9.12).
Adding 9.10 and 9.11 and reducing terms yields a relation
in contradiction with (9.12). Note that outcomes in
columns 11, 16, 19, 21, and 22 of Table 9.1 are
permutations of the above impossible outcome.
The outcome $F_{ij} = F_{i\ell} = F_{kh} = 2$, $F_{jk} = F_{jh} = F_{k\ell} = F_{\ell h} = 1$
and $F_{ik} = F_{ih} = F_{j\ell} = 0$ (column 13 of Table 9.1) is
impossible. From i:j - k:ℓ, i:ℓ - j:h, and j:k - ℓ:h
obtain respectively relations (9.13), (9.14) and (9.15).
Adding 9.13 and 9.14 and reducing yields
$$d_{k\ell} + d_{jh} < d_{jk} + d_{\ell h}$$
in contradiction with 9.15.
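The step can be written out as a short worked derivation. The sketch below assumes that a tree of the form i:j - k:ℓ implies that $d_{ij} + d_{k\ell}$ is smaller than each of the other two pair sums, which is the form of relations 9.6 to 9.8. Under that assumption, one possible reading of the relations for the trees i:j - k:ℓ, i:ℓ - j:h and j:k - ℓ:h is the following (the labels (a), (b), (c) are used only here).

```latex
\begin{align*}
d_{ij} + d_{k\ell} &< d_{i\ell} + d_{jk}   && \text{(a), from } i{:}j - k{:}\ell \\
d_{i\ell} + d_{jh} &< d_{ij} + d_{\ell h}  && \text{(b), from } i{:}\ell - j{:}h \\
d_{jk} + d_{\ell h} &< d_{jh} + d_{k\ell}  && \text{(c), from } j{:}k - \ell{:}h \\[4pt]
\text{(a)} + \text{(b)}:\quad
d_{k\ell} + d_{jh} &< d_{jk} + d_{\ell h}  && \text{after cancelling } d_{ij} + d_{i\ell}
\end{align*}
```

The last line is the reverse of (c), so the three trees cannot occur together.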
Note that outcomes in columns 12, 17, 20, 24 and 25
of Table 9.1 are permutations of the above impossible
outcome.
All other outcomes in Table 9.1 are permutations
of the three outcomes considered in Example
..., i.e.,
(1) $F_{ij}$ ... and ...,
(2) $F_{ij} = 3$, $F_{kh} = \ldots = F_{\ell h} = 1$ and ...,
(3) $F_{ij} = 2$ and ...
9.4 Summary of Computer Storage and Time Requirements of
Programs Developed.
Programs were developed using FORTRAN IV language.
To give some idea of the characteristics of the programs
developed, approximate compilation and computing times are
given. Both the storage and time requirements may be
improved by a programmer.
The program for the sequential approach (4.1)
requires the storage of $\tfrac{1}{2}\{n(n-1)\}$ input distances,
$\tfrac{1}{2}\{(2n-2)(2n-3)\}$ tree distances and
$\tfrac{1}{2}\{(2n-2)(2n-3)\}$ variables defining the tree and its
weights. This gives a total of approximately
$\tfrac{1}{2}(9n^2 - 21n)$ real variables. Considering that
a real variable needs 4 bytes (computer's storage units),
it is necessary to have, approximately, $18n^2 - 42n$ bytes.
Compilation time was 5 seconds and a 25 points case needed
15 seconds to be solved (including compilation).
The program for computing the weights requires the
storage of $n(n-1)$ distances (input and output), $2n-3$
weights and
$(2n-3) + \tfrac{1}{2}\{(2n-3)(2n-4)\} + \tfrac{1}{2}\{n(n-1)(2n-3)\}$ integer
variables defining the function to be minimized.
Approximately $n^2 + n$ real variables and $\tfrac{1}{2}(2n^3 - n^2 - 6n)$ integer
variables are necessary. Considering that an integer
variable uses 2 bytes, approximately $2n^3 + 3n^2 - 2n$ bytes are
necessary. Compilation time was 9 seconds, and 20 seconds
were necessary to run a 25 points example (including
compilation).
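The two byte counts above can be checked against the limits quoted in Section 6.3 for 500 k (=512,000) bytes. The following is a minimal Python sketch with hypothetical function names; it only evaluates the approximate formulas given in this appendix and ignores the storage of the programs themselves.

```python
def sequential_bytes(n):
    """Approximate storage, in bytes, of the sequential approach program."""
    return 18 * n * n - 42 * n


def weights_bytes(n):
    """Approximate storage, in bytes, of the least squares weights program."""
    return 2 * n**3 + 3 * n**2 - 2 * n


def largest_case(bytes_needed, available):
    """Largest n whose approximate storage fits in the available bytes."""
    n = 1
    while bytes_needed(n + 1) <= available:
        n += 1
    return n


if __name__ == "__main__":
    available = 512_000  # "500 k" bytes, as stated in Section 6.3
    print(largest_case(sequential_bytes, available))
    print(largest_case(weights_bytes, available))
```

With 512,000 bytes the two formulas give the limits n = 169 and n = 63 stated in Section 6.3.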
The program for the searching method 4.5.2 requires
the storage of $\tfrac{1}{2}\{n(n-1)\}$ frequencies, $\tfrac{1}{2}\{N(N-1)\}$ partial
sums of absolute deviations of frequencies and 61N labels
identifying the N limbs and their points, thus $n^2 - n + N^2 + 121N$
bytes are necessary. Compilation time was 11 seconds and
70 seconds were necessary to compute a 25 points case
searching 390 limbs.
The programs for computing the frequencies and the
number of incompatibilities of the data set with respect
to a given tree have no special requirements.
The above programs are not assembled and preparing
inputs is tedious.
