A New Theory of Phylogeny Inference Through

A New Theory of Phylogeny Inference Through Construction of
Multidimensional Vector Space
Yasuhiro Kitazoe,* Yukio Kurihara,† Yuichi Narita,* Yoshiyasu Okuhara,*
Akira Tominaga,‡§ and Tomohiko Suzuki\
*Center of Medical Information Science, †Department of Information Science, ‡Department of Medical Biology, and
§Medical Research Center, Kochi Medical School, Nankoku, Kochi, Japan; and \Department of Biology, Faculty of Science,
Kochi University, Akebono, Kochi, Japan
Here, a new theory of molecular phylogeny is developed in a multidimensional vector space (MVS). The molecular
evolution is represented as a successive splitting of branch vectors in the MVS. The end points of these vectors are
the extant species and indicate the specific directions reflected by their individual histories of evolution in the past.
This representation makes it possible to infer the phylogeny (evolutionary histories) from the spatial positions of
the end points. Search vectors are introduced to draw out the groups of species distributed around them. These
groups are classified according to the nearby order of branches with them. A law of physics is applied to determine
the species positions in the MVS. The species are regarded as the particles moving in time according to the equation
of motion, finally falling into the lowest-energy state in spite of their randomly distributed initial condition. This
falling into the ground state results in the construction of an MVS in which the relative distances between two
particles are equal to the substitution distances. The species positions are obtained prior to the phylogeny inference.
Therefore, as the number of species increases, the species vectors can be more specific in an MVS of a larger size,
such that the vector analysis gives a more stable and reliable topology. The efficacy of the present method was
examined by using computer simulations of molecular evolution in which all the branch- and end-point sequences
of the trees are known in advance. In the phylogeny inference from the end points with 100 multiple data sets, the
present method consistently reconstructed the correct topologies, in contrast to standard methods. In applications to
185 vertebrates in the a-hemoglobin, the vector analysis drew out the two lineage groups of birds and mammals.
A core member of the mammalian radiation appeared at the base of the mammalian lineage. Squamates were isolated
from the bird lineage to compose the outgroup, while the other living reptilians were directly coupled with birds
without forming any sister groups. This result is in contrast to the morphological phylogeny and is also different
from those of recent molecular analyses.
Introduction
Molecular studies have provided new findings on
evolutionary phylogeny (Goodman 1963; Sarich and
Wilson 1967; Gemmell and Westerman 1994; Janke et
al. 1996; Springer et al. 1997; Kumar and Hedges 1998;
Hedges and Poling 1999). Phylogeny inference has been
done by methods directly using the substitution distances between the species (Sokal and Michener 1958; Eck
and Dayhoff 1966; Fitch and Margoliash 1967; Sokal
and Rohlf 1981; Saitou and Nei 1987) and by those
taking explicit account of nucleotide or amino acid substitutions (Felsenstein 1993; Adachi and Hasegawa
1996; Strimmer and von Haeseler 1996; Swofford
1998). The topologies obtained by these methods are
often changed on the basis of the numbers and kinds of
selected species, including the outgroup. There appear
to be strange topologies in spite of high branch resolutions (Cao et al. 1998a, 1998b; also shown in this paper). Furthermore, the branch resolutions tend to be lower with an increasing number of species, since the number of topological variations increases rapidly. Hence,
the whole structure of the vertebrate phylogeny remains
unclear (Novacek 1992; Hedges 1994; Cao et al. 1998a).
In this respect, the development of a new methodology
Key words: phylogeny inference, multidimensional space, equation of motion, vector analysis, hemoglobin, vertebrates.
Address for correspondence and reprints: Yasuhiro Kitazoe, Center of Medical Information Science, Kochi Medical School, Nankoku,
Kochi 783-8505, Japan. E-mail: [email protected].
Mol. Biol. Evol. 18(5):812–828. 2001
q 2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
812
is expected to provide an understanding of molecular
evolution.
There exists an original space of vectors whose
components are the amino acid sequences of the respective species. However, this space is too abstract for
a quantitative analysis of the tree structure, since these
components are not digital. If this space can be transformed into the multidimensional vector space (MVS)
of Euclidean space without decreasing the degree of
freedom of the variables, the tree structure can be directly analyzed by using geometrical quantities such as
the distances and angles between the vectors.
In this paper, we first show how the evolutionary
process is represented in the MVS. To make lucid our
basic idea for this representation, we consider the simplest model of substitutions (hereinafter called the single-substitution model [SSM]), in which the amino acids
of any sites can be changed at most once over all the
evolutionary processes, and the substitution distances
between two species are approximated by the number
of different amino acid sites between two sequences.
Then, the processes are represented as the developments
of branch vectors along the orthogonal coordinate axes
in the MVS, and the end points of these vectors correspond to the extant species. The species vectors indicate
the specific directions in the MVS, respectively.
Next, to infer the tree structure from the species
vectors of the end points, we introduce search vectors
to draw out the groups of species, which are distributed
around these vectors. A search vector may be a onespecies vector or a center-of-gravity vector (CGV) of
Phylogeny Inference in Multidimensional Space
many species. The drawn-out species stand in line in the
nearby order of the topological relationship with the
search vectors if the branch vectors are orthogonal. The
branching patterns of the tree are analyzed by a scatter
diagram which expresses the relationship among the
groups of species drawn out by two search vectors. It is
shown that the diagram consists of the three parts: the
left branch, the right branch, and the central line. The
first two include the drawn-out species which are distributed parallel to the x- and y-axes, respectively, while
the third represents the other species (outgroup) of the
outside branches, which are distributed around the origin
of coordinates or along the central line of y 5 x.
The present theory is meaningless if the spatial positions of the respective species are not obtained. In this
paper, their positions in the MVS are determined so that
their relative distances may reproduce the substitution
distances between species. This can be easily done in
simple cases of three and four species. As the number I
of species and the dimensional size N increase, however,
it is difficult to determine the positions by statistical
methods of minimizing or maximizing a quantity, since
the number of variables is huge (if I 5 150 and N 5
100, the number is I 3 N 5 15,000). To remove this
difficulty, we introduce the dynamic method of timedependently solving the equation of motion for the
many-body system in physics (David and James 1988).
Here, the species are regarded as particles moving under
their mutual interactions. By utilizing the law of physics
that the system finally falls into the ground state, irrespective of the initial random distributions of the particles, we can precisely determine the positions of the
species in the MVS. Here, the interactions are defined
as the sums of two-body potentials, which are expressed
so as to be minimum at the points equal to the substitution distances between the two related species. The
solution of the equation of motion always converges on
the ground state by the (I 2 1)-dimensional size. However, when I is larger than the number L of variable sites
in amino acid sequences of these species, it is found that
the equation of motion can be solved with good accuracy by the L-dimensional size in place of (I 2 1). Then,
the number of variables in the MVS becomes substantially equal to that in the original abstract space. This
means a substantial transformation of the original space
into the MVS.
The validity of the present method was examined
by using computer simulations of molecular evolution
in which any branch patterns could be created by using
the Monte Carlo method (Hammersley and Handscomb
1964) with the weight of the transition probability matrix. Then, it was possible to directly compare the present method with standard ones such as neighbor joining
(NJ), maximum likelihood (ML), and parsimony (PA),
since all the branch- and end-point sequences were
known in advance. The predominance of the present
method was demonstrated by using two examples of
typical branch patterns in which 100 multiple data sets
with a variety of branch lengths were prepared by repeating the simulations.
813
FIG. 1.—Three-dimensional description of the elementary process
of evolution. An elementary process (a) of splitting a branch into two
is expressed in the single-substitution model (SSM) by the three branch
vectors (b) along the x-, y-, and z-axes of the three-dimensional space.
A succession of such elementary processes forms an evolutionary tree,
which can be represented in the multidimensional vector space (MVS).
The present method was applied to the vector analysis of 185 vertebrate species in the a-hemoglobin, and
the results were compared with those of the standard
methods. The standard methods gave erroneous topologies in the relationship among Artiodactyla, Perissodactyla, and Chiroptera in spite of very high branch resolutions which became lower as the number of species
increased. On the other hand, the present approach gave
the correct topology, and a larger number of species
brought a more stable solution. Encouraged by this success, we challenged recent analyses of topics regarding
the reptilian phylogeny. Our results showed that squamates were clearly isolated from the lineage of birds,
while the other extant reptilians belonged to the lineage
of birds without forming any sister groups. This is in
contrast to the traditional phylogeny (Romer 1966; Benton 1997) and is also different from recent reports (Gorr,
Mable, and Kleinschmidt 1998; Hedges and Poling
1999).
Theory
Representation of the Evolutionary Process in the
MVS
The minimum unit of evolution is the elementary
process of figure 1a, in which point 1 is split into points
3 and 4 via point 2. Here, we define di,j as the substitution distance between the ith and jth points. The SSM
814
Kitazoe et al.
FIG. 2.—Evolutionary tree for the presentation of the present theory. The present theory is presented using this figure without loss of
generality. The figure consists of the upper and lower wings, which
have the seven branch points and eight end points, respectively.
gives the relations d1,3 5 d1,2 1 d2,3, d1,4 5 d1,2 1 d2,4,
and d3,4 5 d2,3 1 d2,4. In this paper, we also define the
spatial distance Ri,j 5 zRi 2 Rjz in the MVS by the
square root of di,j. Then, these relations are expressed
as follows: R21,3 5 R21,2 1 R22,3, R21,4 5 R21,2 1 R22,4, and
R23,4 5 R22,3 1 R22,4, which represent Euclidean distances. Thus, the tree structure of figure 1a is represented
by three vectors, R1,2, R2,3, and R2,4, along the x-, y-,
and z-axes in the three-dimensional space (fig. 1b).
When point 3 is further split into two branches, the tree
is composed of five vectors along the orthogonal coordinate axes in the five-dimensional space. In this
way, the evolution in the SSM is represented as the
development of branch vectors along the orthogonal
coordinate axes of the MVS, and the extant species
correspond to the end points of these developments.
The orthogonal relation between the branch vectors
is very useful in the structural analysis of the tree. Without loss of generality, we consider the topology of figure
2, which consists of the branch points (points 1–15) and
the endpoints (points 16–31). Here, point 1 is the root
of the tree. The SSM gives a number of orthogonal relations between the branch vectors Ri,j 5 Ri 2 Rj. For
example, we have Ri,1 ⊥ Rj,1 for the two groups of points
(i 5 2, 4, 5, 8–11, 16–23) and (j 5 3, 6, 7, 12–15, 24–
31), and also Ri,2 ⊥ Rj,2 for the two groups of points (i
5 4, 8, 9, 16–19) and (j 5 5, 10, 11, 20–23). However,
for example, the vectors Ri,1 (i 5 6, 12, 13, 24–27) are
not orthogonal to the vectors Rj,1 (j 5 7, 14, 15, 28–
31), since they include the common vector R1,3.
When we specify a point s as one of the search
vectors, the angle ui,s between the vectors Rs and Ri is
given by the relation Ui,s 5 Ri·Rs 5 SNn51 Xi,nXs,n 5
RiRs cos ui,s. Here, the vectors Ri are measured from the
position of the root (point 1), and Xi,n and Xs,n denote
the nth dimensional components of Ri and Rs, respectively. Then, we have the quantity
Zi,s 5 (Ri cos ui,s)2,
(1)
which represents the square of the projection component
of the ith point vector Ri to Rs. By specifying another
point, t, we obtain the scatter diagram for Wi,s and Wi,t,
which give the x and y values, respectively.
A variety of pairs may be selected for the search
points (s, t). For the first pair, s 5 31 and t 5 16, the
SSM gives the following orthogonal relations: R15,1 ⊥
R15,s, R15,1 ⊥ R15,30, and R15,s ⊥ R15,30, since points 1,
15, 30, and 31 correspond to points 1, 2, 3, and 4, respectively, in figure 1b. We have Zs,s 5
R2s,1 and Z15,s 5 Z30,s. Hence, we obtain Zi,s 5 C1 for the
point (i 5 31), Zi,s 5 C2 for the points (i 5 15, 30), Zi,s
5 C3 for the points (i 5 7, 14, 28, 29), Zi,s 5 C4 for
the points (i 5 3, 6, 12, 13, 24–27), and Zi,t 5 0 for the
points (i 5 3, 6, 7, 12–15, 24–31), where Cj (C1 . C2
. C3 . C4) are constants. In this way, the four groups
of points stand in line on the x-axis according to the
nearby order of branches from the search point (s 5 31).
Similarly, in the upper branches of figure 2, the point (i
5 16) gives the largest value of Zi,t on the y-axis, the
points (i 5 8, 17) form the second group, the points (i
5 4, 9, 18, 19) form the third group, and the points (i
5 2, 5, 10, 11, 20–23) form the fourth group.
For the second example, the pair (s 5 31, t 5 24),
we have the orthogonal relations Zi,s 5 Zi,t 5 0 for the
points (i 5 2, 4, 5, 8–11, 16–23), Zi,t 5 C5 for the point
(i 5 24), Zi,t 5 C6 for the points (i 5 12, 25), Zi,t 5 C7
for the points (i 5 6, 13, 26, 27), and Zi,s 5 C8 for the
points (i 5 3, 7, 14, 15, 28–31), since Ri,3 ⊥ Rj,3 for the
two groups of points (i 5 6, 12, 13, 24–27) and (j 5 7,
14, 15, 28–31). Here, C5, C6, C7, and C8 are constants
(C5 . C6 . C7 . C8). Here, in the x-direction, the point
(i 5 31) gives the largest value of Zi,s, the points (i 5
15, 30) form the second group, the points (i 5 7, 14,
28, 29) form the third group, and the points (i 5 3, 6,
12, 13, 24–27) form the fourth group. In the y-direction,
point 24 gives the largest value of Zi,t, the points (i 5
12, 25) form the second group, the points (i 5 6, 13,
26, 27) form the third group, and the points (i 5 3, 7,
14, 15, 28–31) form the fourth group. In this way, the
scatter diagram for Zi,s and Zi,t gives the groupings of
the points which correspond to the nearby orders of
branches viewed from the search points (s and t). Also,
the regression lines along these plots are orthogonal to
each other. However, actual evolutionary processes may
include more or less substitutions of higher orders than
those in the SSM. Then, the two lines cannot be completely parallel to the x- and y-axes, respectively.
A computer simulation of evolution is a useful tool
to check the validity of the vector analysis in terms of
Zi,s and Zi,t, since all the sequences of the 31 points of
figure 2 are known in advance. The simulation created
these sequences on the basis of the Markov process
without applying the SSM. An amino acid sequence
(141 sites) of the first point (the root) was given by
generating white random numbers for the 20 amino acids. We tried to substitute an amino acid at each site of
the root sequence by generating a random number with
the weight of the transition probability matrix between
amino acids. Then, most of the sites remained unchanged, since the diagonal parts of the matrix were
Phylogeny Inference in Multidimensional Space
much larger than the nondiagonal ones. The sequences
of the next two points (2, 3) were fixed after amino acid
substitutions of 10 sites in the root sequence, respectively. This procedure was continued to the end points
(points 16–31). In this simulation, the Dayhoff model
was used for both the transition probability matrix and
the substitution distances between the sequences of the
respective points (Dayhoff, Schwartz, and Orcutt 1978).
The substitution distances between points 1–31 were
calculated in a way equivalent to that of the software
package Protdist (Felsenstein 1993), and used as the input data for the equation of motion.
Positions Ri of the 31 points of figure 2 were precisely determined by solving the equation of motion in
the 30 dimensions and used for the vector analysis. Figure 3a shows the scatter diagram for Zi,s and Zi,t (s 5
31, t 5 16). The scatter plots were clearly classified into
the eight groups according to the branching orders from
the specific points (s 5 31, t 5 16), and the regression
lines along the plots were almost orthogonal to each
other on the x- and y-axes, although the SSM was not
applied and the end points were rather far from the root
(point 1) (d ø 35). Figure 3b shows the scatter diagram
for Zi,s and Zi,t (s 5 31, t 5 24). Here, the points of the
upper branches of figure 2 were distributed around the
origin of coordinates to compose the outgroup, since
their vectors were almost orthogonal to the vectors Rs
and Rt. Those of the lower branches were distributed in
a pattern quite similar to that shown in figure 3a and
were classified into the seven groups. In this way, a
successive change of the search vectors s and t completely resolved the branch pattern of figure 2.
Vector Analysis for Phylogeny Inference
Unfortunately, the sequences of branch points 1–
15 in figure 2 are not known in actual evolution. To
confirm the validity of the vector analysis, we have to
reproduce the topology of this figure by blinding the
branch points. Using the distance matrix for the end
points (points 16–31) as the input data, we determine
their positions Ri by solving the equation of motion in
15 dimensions.
Prior to the inference, we determine the origin of
coordinates using the positions of the end points. The
molecular clock is assumed as the first approximation
for this position. We look for the position Rc of the
spherical center which is equidistant from all the end
points. Let us consider the quantity S defined by the
equation
S5
O {(R 2 R )
i .j
i
c
2
2 (R j 2 Rc ) 2}2 .
(2)
The central position is determined by minimizing the S
value on the basis of the least-squares method; ]S/]Xc,n
5 0 (n 5 1, 2, . . . , N) with linear coupled equations
for Xc,n. Here, Xc,n denotes the nth dimensional component of Rc. The values Xc,n can be obtained by solving
the matrix equation. As a result, we can check how the
end points are spherically distributed around the center,
815
FIG. 3.—Vector analysis of the evolutionary tree of figure 2. The
positions Ri of the 31 points of figure 2 were precisely determined in
the MVS of 30 dimensions. The vectors R31, R16, and R24 of the points
(i 5 31, 16, 24) were first chosen as the search vectors to analyze the
tree structure. a and b, scatter diagrams for Zi,31 and Zi,16 and for Zi,31
and Zi,24 (defined by eq. 1), respectively. In a, with the search points
(i 5 31, 16), the points of the upper and lower wings of figure 2 were
almost distributed on the y- and x-axes, respectively. The groups of
points were clearly classified in the nearby orders of branches from
the search points (i 5 31, 16). In b, with the search points (i 5 31,
24), the branch points (points 3, 6, 12) were further resolved, and the
regression lines along the plots were also orthogonal to each other.
Here, the points of the upper wing were distributed around the origin
of coordinates as the outgroup, since their vectors were almost orthogonal to the search vectors R24 and R31. The successive changes of the
search vectors could resolve the branches of the tree. Next, to infer
the phylogeny, we determined the vectors Ri of the end points (points
16–31) by solving the equation of motion of only these points. We
also determined position Rc of the spherical center by using equation
(2). The obtained grouping patterns of the end points in the vector
analysis were consistent with those of this figure.
since the distance between the center and the ith point
is given by Ri,c 5 {SNn51 (Xi,n 2 Xc,n)2}1/2.
Once the root position is determined, the tree structure can be analyzed in terms of the scattering diagram
for Zi,s and Zi,t by using only the end points. By adopting
R16 and R31 as the search vectors, we obtain figure 3a
without points 2–15. Then, on the basis of the orthogonal theory presented in the previous section, we know
that points 16–23 and points 24–31 groups that are independent from each other. In the group of points 16–
23, points 20–23 are first separated from the evolution
816
Kitazoe et al.
process toward the point 16, points 18 and 19 are separated next, and point 17 is finally separated. On the
other hand, in the group of points 24–31, points 24–27
are first separated from the evolution process toward
point 31, points 28 and 29 are separated next, and the
point 30 is finally separated. As a result, we obtain the
tree inserted in figure 3a. Next, we further resolve points
24–27 by taking R24 as the search vector instead of R16.
We get figure 3b without points 2–15. Then, we know
that the points 26 and 27 are first separated from the
evolution process toward point 24, and point 25 is separated next. As a result, we get a more detailed tree,
presented in the insert in figure 3b. By continuing such
a procedure, we can obtain the full tree of 16 points.
The vector analysis can be performed without any
mathematical approximations. However, the branch
lengths of the tree are obliged to be evaluated modeldependently. This can be easily done by using the SSM
as follows: In figure 2, for example, we have the orthogonal relations Ri,4 ⊥ Rj,4 for the points [i 5 (16,
17), j 5 (18, 19)], [i 5 (1), j 5 (16, 17)], and [i 5 (1),
2
2
2 1
j 5 (18, 19)]. We have R216,4 1 R18,4
5 R16,18
, R1,4
2
2
2
2
2
R16,4 5 R16,1, and R1,4 1 R18,4 5 R18,1. As a result, the
branch lengths are given by R21,4 5 (R216,1 1 R218,1 2
2
2
2
2
2 )/2, and R2
R16,18
)/2, R16,4
5 (R16,1
1 R16,18
2 R18,1
18,4 5
(R218,1 1 R216,18 2 R216,1)/2. We finally note that the scalar
product Ui,s 5 Ri·Rs 5 SNn51 Xi,nXs,n 5 RiRscos ui,s of
the search vector Rs and vector Ri gives the distance
from the root to the branch point between the two points
(s and i). The scatter diagram may be expressed in terms
of Ui,s and Ui,t in place of Zi,s and Zi,t.
Determination of Species Positions Ri in the MVS
To obtain species positions Ri in the MVS, we have
to determine a large number of parameters of Xi,n (i 5
1, 2, . . . , I; n 5 1, 2, . . . , N), which are the vector
components of Ri. We introduce a new method of timedependently solving the equation of motion for the
many-body system. For this purpose, the species are regarded as the particles moving according to the equation
of motion. They are initially deposited at random within
a volume of the MVS. With time, they continue to move
toward lower energy states under the influences of both
a viscosity effect and their mutual interactions. They
gradually lose their kinetic energies due to the viscosity
effect and finally stop falling into the ground state of
the system, irrespective of the initial conditions given
using random numbers. This falling is one of the most
fundamental laws of physics. Therefore, it is quite easy
to confirm whether the equation of motion has been correctly solved or not, since the ground state is explicitly
known in advance. The interactions are defined as the
sum of two-body attractive potentials, which are minimized at the points equal to the square roots of the substitution distances between the two species. The ground
state corresponds to the situation in which any pairs of
the particles are positioned at the points equal to the
square roots of the substitution distances.
The initial positions Ri and velocities Vi of the particles are randomly distributed within the volumes of R0
and V0 by the equations
Xi,n 5 (2.0F 2 1.0)R0
and
Vi,n 5 (2.0F 2 1.0)V0 ,
(3)
for i 5 (1, 2, . . . , I) and n 5 (1, 2, . . . , N). Here, V0
and R0 are constants, F is a random number between 0
and 1, and Vi,n and Xi,n are the nth dimensional components of Vi and Ri, respectively. The particles are allowed to move in time according to the kinematics of
damped equations of motion (David and James 1988):
dXi,n /dT 5 ]E/]Vi,n 2 m]E/]Xi,n ,
(4)
dVi,n /dT 5 2]E/]Xi,n 2 (m/M)]E/]Vi,n ,
(5)
Vi,n 5 dXi,n /dT,
(6)
which are the differential equations about time T with
the N dimensions (n 5 1, 2, . . . , N) and the I species
(i 5 1, 2, . . . , I). Here, m is the viscosity to damp the
motion of the particles. The total energy E of the system
is given by
E5
O V /2 1 O P ,
2
i
i
i .j
i,j
(7)
which consists of the sum of the kinetic energy of the
first term and the potential energy of the second term.
We introduce the two-body potential between the ith and
jth particles, defined by
Pi,j 5 P0exp[2(Ri,j 2 Di,j)2/A2],
(8)
which minimizes at the point of Ri,j 5 Di,j. Here, P0 and
A are constants, Rij 5 {SNn51 (Xi,n 2 Xj,n)2}1/2, and the
quantity Di,j is defined as the square root of the substitution distance di,j between the two sequences of the ith
and jth species.
Equations (4)–(8) are expressed in no dimensions
without physical units such as meters, grams, or seconds. If we set m 5 0, these equations are reduced to
the well-known ‘‘canonical equation,’’ where the total
energy E of the system is conserved time-independently.
The Runge-Kutta method is applied for the numerical
solution, where we set the constant values as follows:
V0 5 0.3, M 5 900, P0 5 20.5, m 5 8, and A 5 20.
R0 denotes the range of the initial distribution of the
particles and is defined as half of the maximum value
of Di,j. P0 is the depth of the potential which confines
the particles within the interaction range. The value of
m may be adjusted to be small such that the particles
will gradually lose their kinetic energies in time, and m
5 8 was found to be suitable. A time mesh to integrate
the differential equations about time is taken as DT 5
0.3, which is sufficiently small to keep the calculations
accurate. The size of dimensions, N, is set equal to (I
2 1) if I is smaller than the number of variable sites
within the given species, L. If I . L, we may put N 5
L, since the average error of Ri,j versus Di,j is then negligible. The numerical calculations start with a high-energy state given by a set of the initial distribution of the
particles, and they stop at the ground state, whose ap-
Phylogeny Inference in Multidimensional Space
817
Table 1
Percentages of Correct Solutions
Figure 4a . . . . . . .
Figure 4b . . . . . . .
FIG. 4.—Computer simulations of molecular evolution. The present method was compared with the standard ones using the two
branch patterns of a and b. The initial point (point 1) is given by the
consensus sequence of eutherians. A series of sequences of the following evolutionary process are determined by using the Monte Carlo
method with the weight of the transition probability matrix. The branch
points 2, 3, and 4 in a are the positions of the 50, 70, and 70 operations
for the amino acid transitions, respectively. Those in b are the positions
of the 40, 60, and 80 operations, respectively. The end points (points
5–9) correspond to the 100 operations. By repeating the simulations,
100 multiple data sets with a variety of branch lengths were prepared
for a statistical comparison with the other methods, and the sequences
of the end points were used for the phylogeny inference.
pearance verifies the correct solution of equations (4)–
(6). This can be easily confirmed in practice, since the
ground state is given by E0 5 P0I(I 2 1)/2, as known
from equations (7) and (8). In numerical calculations
using the interaction potential of equation (8), we do not
have any local minimum of the total energy. The calculations were performed by a Fortran program with
double precision.
Comparison with Other Methods by Computer
Simulations of Molecular Evolution
The efficacy of the present theory was demonstrated
by using computer simulations of molecular evolutions in
which all of the branch- and end-point sequences were
known in advance. Once an initial sequence of amino acids
is given, a series of sequences of the following evolutionary process are determined by using Monte Carlo method
(Hammersley and Handscomb 1964) with the weight of
the transition probability matrix t(p, q) for the Dayhoff
model. The amino acid transition at each site is determined
by generating a random number F between 0 and 1: if a
Maximum
Likelihood
Neighbor
Joining
Present
Method
68
82
90
89
100
100
given F satisfies the condition T(p, q 2 1) , F # T(p,
q), the transition from the pth amino acid to the qth one
occurs. Here, T(p, q) 5 Sqr51 T(p, r)/S20
r51 T(p, r), T(p, 0)
5 0, and T(p, p) means no transition. An operation of the
transitions is performed from the first site to the last, so
that the first sequence is created. The second sequence is
created by the next operation. The (i 1 1)-th sequence is
given by using the ith sequence. Two sequences are created at each branch point. In this way, we can simulate
any branch patterns of molecular evolution based on the
stochastic (Markov) process. The number of operations is
proportional to evolutionary time.
To compare the present method with the standard
ones, we formed the branch patterns of figure 4a and b.
Here, the end points (points 5–9) correspond to the 100
operations. Branch points 2, 3, and 4 in figure 4a are
the positions of the 50, 70, and 70 operations, respectively. Those in figure 4b are the positions of the 40,
60, and 80 operations, respectively. To make the simulation realistic, we assumed that the initial point (point
1) was the consensus sequence of eutherians. We also
assumed that the 64 sites could be substituted, since the
other 77 sites were known to almost freeze in eutherians.
By repeating the simulation, we produced the 100
multiple data sets (100 evolutionary events) in both topologies of figure 4a and b and tried to reconstruct the
original branch patterns by using only the sequences of
the end points, where point 5 was taken as the outgroup.
Table 1 shows the percentages of the correct solutions
for ML, NJ, and the present method. The ML and NJ
analyses were performed using the software packages
PUZZLE (Strimmer and von Haeseler 1996) and PHYLIP (Felsenstein 1993), respectively. The NJ method
gave better results than the ML method. A clear correlation of branch resolution between the two methods did
not exist due to their different algorithms for phylogeny
inference. Figures 5 and 6 show typical examples in
which the two methods gave erroneous topologies in
same data sets, corresponding to figure 4a and b, respectively. The ML method could not reconstruct the
original topology in spite of the 100% confidence (fig.
5a). This suggests that the ML value cannot necessarily
be the criterion for the correct topology. The results of
table 1 depend on the numbers of operations in figure
4. For example, if the numbers 20 and 30 in figure 4a
are replaced by 30 and 20, respectively, the results of
the standard methods are improved.
Next, we follow the procedure of the present method. First, we obtained the MVS of the four dimensions
by using the distance matrix for the end points (points
5–9). Then, we obtained the spherical center by using
equation (2). However, the obtained radius was too short
due to the small number of end points. According to the
818
Kitazoe et al.
FIG. 5.—Results of the standard methods for the branch pattern
of figure 4a. The figure gives a typical example (table 2) of data sets
in which the maximum-likelihood (a) and neighbor-joining (b) methods could not reconstruct the original topology of figure 4a in spite of
high branch resolutions. Here, point 5 was taken as the outgroup.
SSM, the minimum value of this radius is half of the
average distance between the outgroup (5) and the other
end points. This value was basically used as the spherical center (origin of coordinates) but multiplied by the
factor 1.3 for a deviation from the SSM. We again
solved the equation of motion of the five dimensions
which took into account this center and the end points
(points 5–9) and constructed the MVSs for all of the
data sets. As seen in table 1, we obtained an excellent
result of the vector analysis over all the data sets. Here,
we give full details of the two examples in which the
standard methods gave erroneous topologies (figs. 5 and
6). By examining every pair of end points (points 6–9)
as the search vectors, we selected the best pair of them
which fulfilled the criterion of orthogonal relations such
as those in figure 3b. Figure 7a gives the vector analysis
using the data set in the case of figure 5. Here, the search
vectors of points 6 and 8 revealed the existence of the
two sister groups of (6, 7) and (8, 9). These groupings
were confirmed by using points 8 and 9 as search vectors (see fig. 7b), since the other points (6 and 7) were
positioned outside of the orthogonal relation (solid lines
in Fig. 7b) formed by points 8 and 9. Next, figure 8
gives the vector analysis using the data set in the case
of figure 6. Using points 7 and 8 as search vectors, we
FIG. 6.—Results of the standard methods for the branch pattern
of figure 4b. The figure gives a typical example of data sets in which
the maximum-likelihood (a) and neighbor-joining (b) methods could
not reconstruct the original topology of figure 4b. Here, point 5 was
taken as the outgroup.
found that point 9 was coupled to point 8 while point 6
was positioned outside of the orthogonal relation (solid
lines) formed by points 7–9 (fig. 8a). This branch pattern was confirmed using points 8 and 9 as search vectors, since then points 6 and 7 were positioned outside
of the orthogonal relation formed by points 8 and 9, and
point 7 had larger vector components to points 8 and 9
than point 6 (fig. 8b).
Finally, we evaluated the branch lengths by using
the scalar product Ui,j of the two vectors of the ith and
jth end points. Here, we note the relation Ui,j 5
2 2 R2 )/2. In the topology of figure 4a, length
(R21,i 1 R1,j
i,j
(1-2) was given by the average value of Ui,j (i 5 6, 7;
j 5 8, 9). Lengths (2-3) and (2-4) were given by subtracting length (1-2) from lengths (1-3) and (1-4), respectively. Lengths (3-i) (i 5 6, 7) were given by subtracting length (1-3) from lengths (1-i), respectively.
Lengths (4-i) (i 5 8, 9) were given in the same way.
Figure 9a expresses the branch lengths in the case of
figure 7, which are compared with the true ones (fig. 9b)
given by using the branch- and end-point sequences.
Figure 9c gives the result of the ML method, and a
similar result was also obtained with the NJ method. In
figure 9c, the branch point between points 5 and 6 was
taken as the origin of coordinates, since the position of
the root was unknown. The obtained branch lengths
Phylogeny Inference in Multidimensional Space
FIG. 7.—Scatter diagram for the branch pattern of figure 4a. The
figure gives the vector analysis of the data set used in figure 5. The
analysis was done using the MVS of the five dimensions which include
both the end points (points 5–9) and a spherical center defined as the
origin of coordinates. The search vectors (points 6 and 8) drew out
points 7 and 9, respectively (a). By changing the search vectors to the
points 8 and 9, the points 6 and 7 were positioned outside of the
orthogonal relation (solid lines of b) formed by these vectors. In this
way, the original topology of figure 4a was confirmed.
were very different from both the present ones and the
true ones. However, such a comparison for the different
topologies may be meaningless. The values along the
branches in figure 9 denote the point-to-point distances.
In this paper, the distances between point 1 and the end
points were assumed to be constant for simplicity. The
branch lengths in figure 9a can be revised by readjusting
these distances so as to be proportional to those between
the outgroup (point 5) and the other points (points 6–
9). Table 2 gives the original sequences of point 1 and
the end points (points 6–9).
Applications to Vertebrates in the a-Hemoglobin
The present theory was applied to a few examples
of vertebrates in the a-hemoglobin, in which the largest
number of species have been registered in the database.
The sequences of 255 vertebrate species were obtained
from the Protein Identification Resources (PIR) database. First, we selected only species with 141 and 142
sites, whose sequences were readjusted so as to be 141
819
FIG. 8.—Scatter diagram for the branch pattern of figure 4b. The
figure gives the vector analysis of the data set used in figure 6. The
analysis was done using the MVS of the five dimensions which include
both the end points (points 5–9) and a spherical center defined as the
origin of coordinates. The search vectors (points 7 and 8) drew out
point 9 (a). By changing the search vectors to points 8 and 9, the
points 6 and 7 were positioned outside of the orthogonal relation (solid
lines of b) formed by these vectors, and point 7 had larger vector
components to points 8 and 9 than point 6. In this way, the original
topology of figure 4b was confirmed.
for the alignment analysis. Second, we excluded species
with the same sequences. As a result, we analyzed 185
species from fish to primates. The substitution distances
were calculated in a way equivalent to that of the package Protdist (Felsenstein 1991) with the Dayhoff model,
since calculation errors were found in this package. The
present results were insensitive to the use of other models, such as the Dayhoff-F, the JTT (Jones, Taylor, and
Thornton 1992), and the JTT-F models (Cao et al. 1994).
For the search vectors in the vector analysis, we
introduced the CGV Rs to express the representative direction of a group (s) of species in the MVS. It was
defined by Rs 5 S9i Ri/Is, where the prime sign of S9i
denotes the summation about the species of the group s
and Is is their total number. Consequently, the scatter
diagram for the two groups s and t was expressed in
terms of the quantities Zi,s and Zi,t, which are the projection components of the species vectors Ri to the two
CGVs of Rs and Rt, respectively.
820
Kitazoe et al.
Relationship Among Artiodactyla, Perissodactyla, and
Chiroptera
FIG. 9.—Branch lengths for the branch pattern of figure 4a. The
branch lengths were evaluated using the data set analyzed in figures 5
and 7. A distance between two branch (or end) points was assumed to
be the sum of the branch lengths connecting these two points. The
result (a) is compared with the true tree (b) given by using the branchand end-point sequences. The values along the branches denote the
point-to-point distances. For the maximum-likelihood method (c), the
branch point between points 5 and 6 was taken as the origin of coordinates, since the position of the root was unknown. The obtained
branch lengths were very different from both the present ones and the
true ones due to the erroneous topology (fig. 5a).
First, the present theory was applied to a typical
problem in which it is difficult to get the correct branch
patterns by using standard methods such as the ML, NJ,
and PA methods. We picked up eight species of Artiodactyla and Perissodactyla and used a rat as the outgroup. It is well known that the consensus topology is
[rat, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}] (Novacek 1992; Benton 1997). The standard
methods gave an erroneous pattern, [Tylopoda, {Ruminantia, (Ceratomorpha, Hippomorpha)}], with high
branch resolutions (see fig. 10). Especially, the ML
method gave this strange pattern with 100% confidence
(fig. 10a), although this method gave the correct pattern
with 100% confidence in the case of the b-hemoglobin
(data not shown). Additional inclusions of other orders
resulted in another problem with the topology. As shown
in figure 11, by adding Chiroptera, Tylopoda still made
the first clade of the tree in a way similar to that in
figure 10, and furthermore, Chiroptera were included inside of the ungulate branches. Here, the branch resolutions were rather high for the NJ method (fig. 11b),
while they were very low for the ML method (fig. 11a).
It is well known that Chiroptera are located outside the
ungulate branches, since it is rather close to Primates
(Novacek 1992; Benton 1997).
In the present approach, the positions Ri of 27 species (their accession numbers are listed in the Supplementary Material section) of figure 11 were precisely
determined by solving the equation of motion in the 26
dimensions. The position Rc of the spherical center was
also determined by minimizing the S value of equation
(2) and defined as the origin of coordinates. We looked
at the CGVs of Artiodactyla and Perrisodactyla as the
search vectors and investigated how the species were
distributed around these two vectors. As a result, the
scatter diagram for Zi,a and Zi,p showed a clear pattern
of orthogonal relation between the two groups of Artiodactyla and Perissodactyla (fig. 12a). This is analogous
to the simulation of figure 3b and results in the branch
pattern [Chiroptera, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}]. Here, the rat and Chiroptera
were distributed as the outgroup with small values of
Zi,a and Zi,p, since their vectors were directed at large
angles from the CGVs of Artiodactyla and Perissodactyla. Figure 12b gives the result of a full inclusion of
113 eutherians stored in the PIR database. This figure
shows that the additional inclusion of other species produces a more stable pattern of the scatter diagram.
It is here noted that the orthogonal relation of figure
12a and b is unique for the branch pattern [Chiroptera,
{(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}]. The results (fig. 10) of the standard methods
supported a grouping of Ceratomorpha, Hippomorpha,
and Ruminantia. However, if the topology of figure 10
were true, the angles between Tylopada and Perissodactyla would become larger than those between Ruminantia and Perissodactyla. As a result, an orthogonal relation such as that in figure 12a and b could not be sat-
Phylogeny Inference in Multidimensional Space
821
Table 2
The Amino Acid Sequences Used in the Analyses of Figures 5, 7, and 9
Point
Point
Point
Point
Point
Point
1........................
5........................
6........................
7........................
8........................
9........................
VLSPADKTNVKAAWGKVGGHAGEYGAEALERMFLSF
.......NKA.DV...C.SDL.DFNG.....L..N.
...N...DKT.V.......Q.TA.QS...D....R.
...A...HK..ET.....A...L.QT.....A..A.
...AS..H...SS.K.P...G.A.L...........
...ET..H....N.K...A...G.L....G......
Point
Point
Point
Point
Point
Point
1........................
5........................
6........................
7........................
8........................
9........................
PTTKTYFPHFDLSHGSAQVKAHGKKVADALTNAVGH
.K.....A...VEQ..N..ES..E..PG...Q.HT.
............G.....LIV..A..PA..SS.PT.
............A...G.LIL..T..GI..KS.P..
.A................LII..T..QN..KS.D..
.......G.V...P....LVL..T..EN..KS.Q..
Point
Point
Point
Point
Point
Point
1........................
5........................
6........................
7........................
8........................
9........................
LDDLPGALSALSDLHAHKLRVDPVNFKLLSHCLLVT
VHEH.SS..S...................T......
.SH...T..I......D............T....A.
IEH...S..I......A..........K.N..I...
.NH...N..L......T............N....Q.
..HV..N..L......N..........M...M..G.
Point
Point
Point
Point
Point
Point
1........................
5........................
6........................
7........................
8........................
9........................
LACHHPAEFTPAVHASLDFKLASVSTVLTSKYR
K....NSN...GI.........A.G........
....PSRR...GS...T.....L..........
I...QSRW...GP........DL..........
..G..GNW...GP.........L.Y........
..G..NPW...GP.........L.K........
isfied. The uniqueness was also confirmed by showing
the nonorthogonal relation in the other groupings (Ceratomorpha 1 Ruminantia and Hippomorpha 1 Tylopoda) for the CGVs (fig. 12c).
Drawing Out of Lineage of Mammals and Birds
The positions Ri of 185 vertebrates were determined in the 110-dimensional size in which the average
error of Ri,j versus Di,j was sufficiently small, i.e., 0.08%.
The spherical center Rc of vertebrates was obtained by
using equation (2) and defined as the origin of coordinates, from which the species vectors Ri were measured.
The CGVs Rm and Rb of mammals and birds were calculated and regarded as their representative directions in
the MVS, respectively. Figure 13 shows the angular distribution of the vectors Ri around the CGV Rm. Here,
the x-axis denotes the projection components Wi,m 5
(Zi,m)1/2 of the species vectors Ri to the vector Rm. The
angles show roughly a chronological order of evolution
of the species, since the branching events of the tree at
earlier stages of evolution yield larger angles. The distribution was well fitted to the dotted curve (given by
W 5 R cos(u) with a constant radius R 5 8.5, which is
the average value of zRiz.
To obtain the scatter diagram for Zi,m and Zi,b, we
shifted the center position toward the directions of the
CGVs of mammals and birds by using the position Qc
given by a proportional relation,
Gm 5 C 3 (Rm 2 Rc ) 1 Rc ,
FIG. 10.—Relationship among the suborders of Artiodactyla and
Perissodactyla by standard methods. The standard methods of maximum likelihood (a) and neighbor joining (b) gave the strange pattern
[Tylopoda, {Ruminantia, (Ceratomorpha, Hippomorpha)}] in spite of
high branch resolutions, especially with the 100% confidence in a.
Here, eight species of Artiodactyla and Perrisodactyla in the a-hemoglobin sequences were selected, and a rat was used as the outgroup.
Gb 5 C 3 (Rb 2 Rc ) 1 Rc ,
Qc 5 (Gm 1 Gb )/2,
and
(9)
with a constant C. Here, the species vectors Ri were
redefined as measured from the new position Qc. Here,
822
Kitazoe et al.
FIG. 11.—Relationship among Artiodactyla, Perissodactyla, and Chiroptera by standard methods. The maximum-likelihood and neighborjoining methods were applied to the analysis of the 19 species of Ungulates and the 7 species of Chiroptera, which were full inclusions from
the PIR database. The relationship among the suborders of Artiodactyla and Perissodactyla remained erroneous in a pattern similar to that in
figure 10. Furthermore, Chiroptera were included inside of the ungulate branches. When the number of species was increased, the branch
resolutions became lower.
the C value may be adjusted for the first few species of
fishes so as to be placed near the new origin of coordinates, since the vectors of these fishes become orthogonal to the CGVs of Rm and Rb. However, the scatter
diagram patterns of the drawn-out species were insensitive to the C values (C 5 0 2 0.35).
Figure 14 gives the scatter diagram (with C 5 0.2)
consisting of the three parts: the central line, the left
branch, and the right branch. This figure showed a clear
splitting into three groups of the central line and the left
and right branches. The central line included fishes, amphibians, and squamates to form the outgroup. Fishes
began with a coelacanth, a carp, and a dragonfish and
ended with a goldfish. The central line included fishes,
newts, frogs, and squamates. The left branch represented
the lineage of birds. Crocodilians, turtles, and tuataras
belonged to this lineage. In this way, the extant reptilians were clearly classified into the two groups of squamates and others. On the other hand, the right branch
represented the mammalian lineage, which began with
Phylogeny Inference in Multidimensional Space
823
Monotremata (a platypus and an echidna) and Marsupialia (an opossum) at the base of the branch. They were
followed by a guinea pig, a hedgehog, a manatee, and
a whale, which are considered to be the core member
species of the mammalian radiation. Primates were located at the uppermost part of the right branch.
From a morphological viewpoint, the origin of
Monotremata has been regarded as much older than that
of the opossum. In the present analysis, however, they
had similar values of Zi,m in spite of their mutual long
distances. This result is consistent with the proposal of
Gemmell and Westerman (1994) that they diversified simultaneously (Cao et al. 1994; Janke et al. 1994). It is
interesting to see that Monotremata shifted to the direction (upward) of birds compared with the opossum.
There appeared to be a vacuum at the base of the
two branches of birds and mammals. This is regarded
as the result of the extinction of reptilians. Finally, it is
noted that the line along the plots of the bird lineage is
not orthogonal to that of the mammalian lineage (see
the solid lines of fig. 14). This suggests that substitutions
of higher orders than the SSM may exist in the sequences between birds and mammals. The effect of these substitutions may be related to the well known underestimation of the molecular clock between mammals and
birds (Zuckerkandl and Pauling 1965). This problem
will be discussed in detail elsewhere.
Relationship Among Birds, Squamates, and Other
Extant Reptilians
FIG. 12.—Relationship among Artiodactyla, Perissodactyla, and
Chiroptera by the present method. a shows the result of the vector
analysis using the same species as in figure 11. Looking at the two
directions of the center-of-gravity vectors (CGVs) of Artiodactyla and
Perisodactyla as the search vectors, we obtained the scatter diagram
for Zi,a and Zi,p. The plots represented a clear orthogonal relation between these two orders: they were similar to those of the simulation
of figure 3b and resulted in the topology [Chiroptera, {(Tylopoda,
Ruminantia), (Ceratomorpha, Hippomorpha)}], since other groupings
of the search vectors did not give the orthogonal relation (c). Here, the
rat and Chiroptera gave small x and y values, since their vectors were
directed at large angles from the CGVs of Artiodactyla and Perissodactyla. It is here noted that the rat and Chiroptera are also directed at
both large angles and long distances from each other, although they
were located close together in this figure. The stability of the topology
The phylogeny of reptilians has been one of the
most important problems in vertebrate evolution. From
a morphological point of view, the number of temporal
openings in the skull has been used as an important clue
to the classification of species (Romer 1966; Benton
1997): mammals have a single opening, the turtle lacks
a temporal opening, and the other extant reptilians have
two openings. In this way, birds and crocodilians formed
one group, tuataras and squamates formed another
group, and the turtle was placed at the base of the tree
(fig. 15a). On the other hand, the concept of a sister
group of birds and mammals was proposed (Gardiner
1982; Løvtrup 1985) and was supported by molecular
analyses using the sequences of the myoglobin and bhemoglobin (Goodman, Miyamoto, and Czelusniak
1987; Bishop and Friday 1988). In recent molecular
analyses (Gorr, Mable, and Kleinschmidt 1998; Hedges
and Poling 1999), however, the turtle was joined with
crocodilians, and squamates were placed at the base of
the reptilian tree. If these results are accurate, the traditional classification based on the number of temporal
openings becomes meaningless for phylogeny, and the
morphological and paleontological evidence then remains unclear.
←
was examined by a full inclusion of 113 eutherians stored in the PIR
database. As shown in b, the branch pattern of the suborders of Artiodactyla and Perissodactyla was consistent with that of a.
824
Kitazoe et al.
FIG. 13.—Angular distribution of 185 vertebrates around the center-of-gravity vector (CGV) of mammals. The positions Ri of 185 vertebrate
species were determined by solving the equation of motion of 110 dimensions. The x-axis denotes the projection components Wi,m 5 (Zi,m)1/2
of the species vectors Ri to the CGV Rm of mammals. The y-axis denotes the angles (u) between Ri and Rm. The angles give roughly a
chronological order of evolution of the species, since evolutionary events at earlier stages give larger angles in the MVS. The scatter plots were
well fitted to the dotted curve given by W 5 R cos(u) with the constant radius R 5 7.5, which is the average value of zRiz.
In the present approach, squamates were clearly
isolated from the other extant reptilians, whose vectors
were directed at small angles from the CGV of birds.
On the other hand, the vectors of squamates, amphibians, and fishes were directed at large angles from the
CGVs of birds and mammals so that they may have had
small values of Zi,m and Zi,b, although they were also
directed at large angles from each other. Figure 14
shows that the first coupling to birds is the tuatara, the
second is the turtle, and the third is crocodilians (their
accession numbers are listed in the Supplementary Material section). Then, we have the topology of figure
15d, in contrast to the morphological view (fig. 15a).
This figure was also compared with the recent results of
figure 15b (Gorr, Mable, and Kleinschmidt 1998) and c
(Hedges and Poling 1999) by the ML method, which
shows a sister group between crocodilians and turtles.
The present analysis did not give this subgroup (fig.
15d), since birds were always and strongly drawn out
by taking tuataras, turtles, or crocodilians in place of
birds as the search vector. This situation was made clearer by the scatter diagram for the CGVs of crocodilians
and birds (fig. 16): the turtle was coupled more strongly
with birds than crocodilians, in a way quite similar to
that shown in figure 3b. Here, a large value (C 5 0.35)
was used in order to enlarge the relationship among
crocodilians, turtles, tuataras, and birds.
Discussion
Molecular evolution has traditionally been studied
with cluster models (Sokal and Michener 1958; Eck and
Dayhoff 1966; Fitch and Margoliash 1967; Sokal and
Rohlf 1981; Saitou and Nei 1987). These models directly infer phylogeny by using only the substitution distances of the scalar. In the present approach, however,
the phylogeny is inferred by constructing the MVS,
which includes the information of not only the scalar,
but also the vector. The ML method is very interesting
in that amino acid substitutions are explicitly taken into
account in the algorithm of phylogeny inference. It is
expected that the ML method can give better results of
the inference than the NJ method, since the former includes a much higher degree of freedom of variables
than the latter. In practical applications with the use of
computer simulations, however, the latter was better
than the former. In the ML method, all of the possible
alignments at the respective branches are summed up in
calculations of the ML value. The number of this summation increases rapidly at branches toward the past
(root), although each of these branches occupied one
sequence in the actual process of the evolutionary history. In the ML method, it is implicitly assumed that the
amino acid sequences of the extant species at the present
time are a realization of the most probable (likely)
course among the evolutionary processes based on the
stochastic process. Events of deviations from this course
may exist in practice. It is not necessarily self-evident
whether they correspond to only the parallel and convergent evolutions pointed out by Cao et al. (1998a).
On the other hand, in the present approach, the ML
method is applied only to get the substitution distances,
which are used as the input data to construct the MVS.
Phylogeny Inference in Multidimensional Space
825
FIG. 14.—Scatter diagram for mammals and birds in 185 vertebrates. The x and y values denote the projection components Zi,m and Zi,b of
the species vectors Ri to the center-of-gravity vectors (CGVs) of mammals and birds, respectively. The scatter plots were clearly decomposed
into the three parts, i.e., the left branch of birds, the right branch of mammals, and the central part of the others. The extant reptilians were split
into the two branches: squamates belonged to the central part, while the others belonged to the lineage of birds. The central part included fishes,
amphibians, and squamates. On the other hand, the right branch began with Monotremata (a platypus and an echidna) and Marsupialia (an
opossum) at the base of the mammalian lineage. They were followed by a guinea pig, a hedgehog, a manatee, and a whale, which are considered
to be the core member species of the mammalian radiation. Primates were found on the uppermost part of the right branch. There appeared a
vacuum at the base of the two branches of birds and mammals. This is regarded as the result of the extinction of reptilians. The species of the
central part took small values of Zi,m and Zi,b, since their vectors were directed at large angles from the CGVs of birds and mammals. The
species vectors of different classes such as fishes and amphibians were directed at large angles from each other, although these species were
closely located in the plots of this figure, since we looked only at the two directions of mammals and birds.
FIG. 15.—Relationship among birds, squamates, and other extant
reptilians. a, The morphological and paleontological view. b and c,
Recent results (Gorr, Mable, and Kleinschmidt 1998; Hedges and Poling 1999) with use of the maximum-likelihood method, respectively.
b and c show a Turtle-crocodilian sister group, different from a. The
present result (d) is basically similar to them, but did not form the
subgroup within reptilians, since all of tuataras, turtles, and crocodilians were strongly coupled with birds.
The MVS was shown to be useful for phylogeny
inference by the vector analysis based on the orthogonal
relation, in contrast to the multidimensional space
(Grantham et al. 1981) which is directly expressed in
terms of the amino acid or codon frequencies. The solution of the kinetic equation for the MVS always converges on the ground state by the N 5 (I 2 1)-dimensional size. However, when I was larger than L, we
found that the equation of motion could be solved with
good accuracy by the L-dimensional size in place of (I
2 1). In fact, the L values for 113 eutherians and 185
vertebrates were 80 and 110, respectively, and the former values were much smaller than the latter. In this
way, the degree of freedom of variables in the MVS
became equal to that in the original space of vectors
having the amino acid sequences as their components.
This means that the MVS is a transformation of the
original space, although the components of the vectors
are quite different from each other. In this way, we have
to analyze a large number of species compared with that
of variable sites in order to get a reliable topology. In
826
Kitazoe et al.
Table 3
The Distance Matrix for Suborders of Ungulates
Rat . . . . . . .
Alpaca . . . .
Bovine . . . .
Horse . . . . .
Tapir . . . . .
FIG. 16.— Relationship among birds, tuataras, turtles, and crocodilians. To confirm the present topology of figure 15d, we examined
the scatter diagram for the CGVs of crocodilians and birds. The turtle
was more strongly coupled with birds than crocodilians, so turtles and
crocodilians did not form a subgroup, in contrast to recent results
(Gorr, Mable, and Kleinschmidt 1998; Hedges and Poling 1999).
fact, in the case of 27 species of ungulates, Chiroptera,
and rat of figure 12a, a part of the outside species (Chiroptera) slightly fluctuated upward on the y-axis (toward
the CGV of Perissodactyla). This fluctuation may be
connected with reason that the size of dimensions (26)
was smaller than the number of variable sites (35), since
the results (figs. 12b, 14, and 16) with the larger numbers of species did not bring such a fluctuation.
The standard methods gave a strange topology in
the relationship among the four suborders of Perissodactyla and Artiodactyla in spite of high branch resolutions (fig. 10). Table 3 gives the distance matrix for a
tapir, a horse, a bovine, an alpaca, and a rat. In this
matrix, it is natural that the distance between the tapir
and the horse (10.1) is the shortest. However, the distance between the bovine and the alpaca of the same
order (17.3) is larger than that between the bovine and
horse of different orders (14.1). Such a complicated relationship among these four species seems phenomenologically to be a main reason for the strange topology
of figure 10. As another possible reason, the number of
eight species sampled in figure 10 might be too small
to obtain the correct topology. To explain this reason,
let us consider a simple case of three species with the
grouping {1, (2, 3)}. Then, we have a possibility of
getting the opposite coupling, {(1, 2), 3}, by additionally
including another species (4) which is located intermediate between points 1 and 2. It was therefore expected
that the correct topology might be obtained by increasing the number of species. However, the situation became worse, since Chiroptera were furthermore included
in the inside of the ungulate branches and the branch
resolutions became very low (fig. 11). Another example
of the strange topology was reported in the ML analysis
of the relationship among primates, ferungulates, and
rodents with use of the ND1 mitochondrial protein by
Rat
Alpaca
Bovine
Horse
Tapir
0.0
25.4
29.2
27.1
30.1
25.4
0.0
17.3
17.9
17.2
29.2
17.3
0.0
14.1
18.1
27.1
17.9
14.1
0.0
10.1
30.1
17.2
18.1
10.1
0.0
Cao et al. (1998a), who suggested the possibility of convergent or parallel evolution. We investigated this relationship in the a-hemoglobin by using the ML and NJ
methods, which gave the same strange pattern, {ungulates, (rodents, primates)} with the ND1, although the
branch resolutions were very poor. Therefore, phenomena similar to the ND1 may exist in the a-hemoglobin.
The present approach clearly produced the normal
pattern for Artiodactyla and Perissodactyla (fig. 12). The
scatter diagrams did not change their pattern with the
full inclusion of 113 eutherians stored in the PIR database. This absence of change is due to a feature of the
present approach in which the positions of species can
be precisely determined prior to and separate from the
vector analysis for the phylogeny inference. With an increasing number of species, the vectors occupy more
specific points of the MVS for the sake of a high-density
relationship among the relative distances, and then the
vector analysis can be also done more precisely. Finally,
the strange topology given by the standard methods did
not appear in the present vector analysis for rodents,
ungulates, and primates, since rodents were found to
form the first clade in the eutherian phylogeny, as seen
in figure 14.
In this paper, we tried to obtain statistical confidence by increasing the number of species. Another approach is to fluctuate the distances themselves. In the
present examples of applications, however, the groupings of species in the scatter diagrams were very clear.
When a grouping pattern is complicated, a more quantitative analysis will be necessary for phylogeny inference. This problem will be discussed elsewhere.
Conclusions
The evolutionary process was represented in the
MVS as the developments of branch vectors whose end
points corresponded to the extant species. A new algorithm was proposed to precisely determine the positions
of species by solving the equation of motion. A vector
analysis was presented to infer the phylogeny from these
positions. A validity of the analysis was demonstrated
using computer simulations of molecular evolution. The
present method consistently reconstructed the correct topologies, in contrast to standard methods. The efficacy
of the method was shown through applications to typical
examples of vertebrates in which the standard methods
give strange topologies or low values of the branch resolution. Furthermore, the application to recent topics
gave a new branch pattern for the reptilian phylogeny.
A feature of the present approach is that the MVS is
constructed prior to and separate from the vector anal-
Phylogeny Inference in Multidimensional Space
ysis for the phylogeny inference. As the number of species increases, the vectors Ri occupy more specific positions in the MVS individually. Consequently, the vector analysis can be done more precisely, since it is then
possible to substantially transform the original abstract
space into the MVS without decreasing the degree of
freedom of variables. The efficacy of the present theory
will be clarified by applications to many fields in molecular evolution where it remains difficult to infer
phylogeny.
Supplementary Material
The species and accession numbers used in figures
11, 12, and 16 are as follows: myotis bat (A25357),
pallid bat (A29702), tomb bat (S28934), Australian bat
(S20277), California bat (A26640), Brazilian bat
(A29391), Indian bat (A29392), Indian rhinoceros
(A26543), white rhinoceros (HARNW), Brazilian tapir
(HATPI), mountain zebra (HAHOZ), common zebra
(HAHOZC), donkey (HAHOD), kulan (HAHOK), horse
(HAHO), moose (HAEKN), reindeer (S13481), sheep
(S31853), goat (HAGT), bovine (HABO), yak
(HAYA1), gayal (HABOG), alpaca (C25478), llama
(HALL), vicuna (A25478), camel (HACMA), rat
(HART1), crocodile (HAAK), alligator (HAAQ), caiman (HACQ), tuatara (HATJA), turtle (HATTAP).
Acknowledgments
We are grateful to Prof. Hiroshi Furutani (Kyoto University of Education), Prof. Takeshi Agatsuma and Dr.
Teruaki Watabe (Kochi Medical School), and Naruya
Saitou (National Institute of Genetics, Japan) for useful
discussions.
LITERATURE CITED
ADACHI, J., and M. HASEGAWA. 1996. MOLPHY version 2.3:
programs for molecular phylogenetics based on maximum
likelihood. Comput. Sci. Monogr. 28:1–150.
BENTON, M. J. 1997. Vertebrate paleontology. 2nd edition.
Chapman and Hall, New York.
BISHOP, M. J., and A. E. FRIDAY. 1988. Estimating the interrelationships of tetrapod groups on the basis of molecular
sequence data. Pp. 33–58 in M. J. BENTON, ed. The phylogeny and classification of the tetrapods. Clarendon Press,
Oxford.
CAO, Y., J. ADACHI, A. JANKE, S. PÄÄBO, and M. HASEGAWA.
1994. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins:
instability of a tree based on a single gene. J. Mol. Evol.
39:519–527.
CAO, Y., A. JANKE, P. J. WADELL, M. WESTERMAN, O. TAKENAKA, S. MURATA, N. OKADA, S. PÄÄBO, and M. HASEGAWA. 1998a. Conflict among individual mitochondrial
proteins in resolving the phylogeny of eutherian orders. J.
Mol. Evol. 47:307–314.
CAO, Y., P. J. WADDEL, N. OKADA, and M. HASEGAWA. 1998b.
The complete mitochondrial DNA sequence of the shark
Mustelus manazo: evaluating rooting contradictions to living bony vertebrates. Mol. Biol. Evol. 15:1637–1646.
DAVID, H. B., and N. G. JAMES. 1988. Quasiparticle model for
nuclear dynamics studies: ground-state properties. Phys.
Rev. C 38:1870–1878.
827
DAYHOFF, M. O., R. M. SCHWARTZ, and B. C. ORCUTT. 1978.
A model of evolutionary change in proteins. Pp. 345–352
in M. O. DAYHOFF, ed. Atlas of protein sequence and structure, Vol. 5. National Biomedical Research Foundation,
Washington, DC.
ECK, R. V., and M. O. DAYHOFF. 1966. Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Spring, Md.
FELSENSTEIN, J. 1993. PHYLIP (phylogeny inference package).
Version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle.
FITCH, W. M., and E. MARGOLIASH. 1967. Construction of phylogenetic trees. Science 155:279–284.
GARDINER, B. G. 1982. Tetrapod classification. Zool. Linn.
Soc. 74:207–232.
GEMMELL, N. J., and M. WESTERMAN. 1994. Phylogenetic relationships within the class Mammalia: a study using mitochondrial 12S RNA sequences. J. Mamm. Evol. 2:3–23.
GOODMAN, M. 1963. Serological analysis of the systematics of
recent hominoids. Hum. Biol. 35:377–436.
GOODMAN, M., M. M. MIYAMOTO, and J. CZELUSNIAK. 1987.
Pattern and process in vertebrate phylogeny revealed by coevolution of molecules and morphologies. Pp. 141–176 in
C. PATTERSON, ed. Molecules and morphology in evolution:
conflict or compromise? Cambridge University Press, Cambridge, England.
GORR, T. A., B. K. MABLE, and T. KLEINSCHMIDT. 1998. Phylogenetic analysis of reptilians: trees, rates, and divergences.
J. Mol. Evol. 47:471–485.
GRANTHAM, R., C. GAUTIER, M. JACOBZONE, and R. MERCIER.
1981. Codon catalog usage is a genome strategy modulated
for gene expressivity. Nucleic Acids Res. 9:43r–74r.
HAMMERSLEY, J. M., and D. C. HANDSCOMB. 1964. Monte
Carlo methods. Methuen, London.
HEDGES, S. B. 1994. Molecular evidence for the origin of birds.
Proc. Natl. Acad. Sci. USA 91:2621–2614.
HEDGES, S. B., and L. L. POLING. 1999. A molecular phylogeny of reptiles. Science 283:998–1001.
JANKE, A., G. FELDMAIER-FUCHS, W. K. THOMAS, A. VON HAESELER, and S. PÄÄBO. 1994. The marsupial mitochondrial
genome and the evolution of placental mammals. Genetics
137:243–256.
JANKE, A., N. J. GEMMELL, G. FELDMAIER-FUCHS, A. VON
HAESELER, and S. PÄÄBO. 1996. The complete mitochondrial genome of a monotreme, the platypus (Ornithorhynchus anatinus). J. Mol. Evol. 42:153–159.
JONES, D. T., W. R. TAYLOR, and J. M. THORNTON. 1992. The
rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8: 275–282.
KUMAR, S., and S. B. HEDGES. 1998. A molecular timescale
for vertebrate evolution. Nature 392:917–920.
LøVTRUP, S. 1985. On the classification of the taxon Tetrapoda.
Syst. Zool. 4:463–470.
NOVACEK, M. J. 1992. Mammalian phylogeny: shaking the
tree. Nature 356:121–125.
ROMER, A. S. 1966. Vertebrate paleontology. University of
Chicago Press, Chicago.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method:
a new method for reconstructing phylogenetic trees. Mol.
Biol. Evol. 4:406–425.
SARICH, V. M., and A. C. WILSON. 1967. Immunological time
scale for hominid evolution. Science 158:1200–1203.
SOKAL, R. R., and C. D. MICHENER. 1958. A statistical method
for evaluating systematic relationship. Univ. Kans. Sci.
Bull. 28:1409–1438.
SOKAL, R. R., and F. J. ROHLF. 1981. Biometry. 2nd edition.
Freeman, San Francisco.
828
Kitazoe et al.
SPRINGER, M. S., G. C. GREGORY, O. MADSEN, W. W. DEJONG,
V. G. WADDELL, H. M. AMRINE, and M. J. STANHOPE. 1997.
Endemic African mammals shake the phylogenetic tree. Nature 388:61–64.
STRIMMER, K., and A. VON HAESELER. 1996. A quartet maximum-likelihood method for reconstructing tree structure.
Mol. Biol. Evol. 13:964–969.
SWOFFORD, D. L. 1998. PAUP. Version 4. Sinauer, Sunderland,
Mass.
ZUKERKANDL, E., and L. PAULING. 1965. Evolutionary divergence and convergence in proteins. Pp. 97–166 in V. BRYSON and H. J. VOGEL, eds. Evolving genes and proteins.
Academic Press, New York.
FUMIO TAJIMA, reviewing editor
Accepted January 16, 2001