Amphioxus Mitochondrial DNA, Chordate Phylogeny

Syst. Biol. 47(1):61-76, 1998
Amphioxus Mitochondrial DNA, Chordate Phylogeny, and the Limits
of Inference Based on Comparisons of Sequences
GAVIN J. P. NAYLOR (1)
AND
WESLEY M. BROWN (2)
(1) Department of Zoology and Genetics, Iowa State University, Ames, Iowa 50011, USA;
E-mail: gnaylor@iastate.edu
(2) Department of Biology, University of Michigan, Ann Arbor, Michigan 48109-1048, USA
Abstract. Analyses of both the nucleotide and amino acid sequences derived from all 13 mitochondrial protein-encoding genes (12,234 bp) of 19 metazoan species, including that of the lancelet Branchiostoma floridae ("amphioxus"), fail to yield the widely accepted phylogeny for chordates and, within chordates, for vertebrates. Given the breadth and the compelling nature of the data supporting that phylogeny, relationships supported by the mitochondrial sequence comparisons are almost certainly incorrect, despite their being supported by equally weighted parsimony, distance, and maximum-likelihood analyses. The incorrect groupings probably result in part from convergent base-compositional similarities among some of the taxa, similarities that are strong enough to overwhelm the historical signal. Comparisons among very distantly related taxa are likely to be particularly susceptible to such artifacts, because the historical signal is already greatly attenuated. Empirical results underscore the need for approaches to phylogenetic inference that go beyond simple site-by-site comparison of aligned sequences. This study and others indicate that, once a sequence sample of reasonable size has been obtained, accurate phylogenetic estimation may be better served by incorporating knowledge of molecular structures and processes into inference models and by seeking additional higher order characters embedded in those sequences, than by gathering ever larger sequence samples from the same organisms in the hope that the historical signal will eventually prevail. [Amphioxus; chordate phylogeny; homoplasy; mtDNA; molecular systematics; phylogenetic inference.]
The practice of inferring evolutionary trees from DNA sequences has flourished in recent years, its credibility bolstered by the observation that phylogenies of well-studied groups are usually supported by sequence data. When the sequence of a particular gene or other well-defined DNA segment yields an inference congruent with an accepted relationship for a particular group, there is a tendency to regard that segment as reliable for phylogenetic inference and to use it to determine phylogenies for taxa whose relationships are unknown (Graybeal, 1994; Cho et al., 1995). However, from the beginning of such studies it was recognized that any DNA segment can only be useful over a limited divergence range; outside that range the historical signal would be either too undeveloped or too attenuated to be reliable. Furthermore, with an increase in the number of such studies it also became apparent that the useful range varied among different taxa. Thus, there are instances in which sequence data provide accurate assessments for some relationships, and erroneous ones for others (Felsenstein, 1978; Hillis, 1991; Kim, 1996; Philippe et al., 1994). The latter occur whenever the embedded historical signal is overturned by a stronger, homoplasious signal among the DNA sequences.
Various methods are used for phylogenetic reconstruction, each implying a different model of evolutionary change and emphasizing different aspects of the observed character-state covariation among taxa. It is common practice to regard a phylogeny that is supported by several different methods as correct, and especially so when statistical methods for evaluating the strength of support [e.g., bootstrapping (Felsenstein, 1985) and decay indices (Bremer, 1988; Donoghue et al., 1992)] are compelling. This stems from a tacit assumption that an incorrect phylogeny, even if it is the best fit to the available data, will not receive significant statistical support when the result itself is evaluated. That assumption is incorrect. Statistical evaluations merely assess the strength of the signal used to order the data hierarchically (Swofford et al., 1996). Thus, if there is a hierarchical signal in the data that arises from a nonhistorical source and if that signal is sufficiently strong, it can overwhelm not only a weaker historical signal, but also a statistical evaluation of the result.
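As a minimal sketch of what a bootstrap evaluation resamples, the following draws alignment columns with replacement, which is the resampling step of the bootstrap (Felsenstein, 1985). The tree-building step that would be applied to each pseudoreplicate is omitted, and the function name `bootstrap_columns` and the toy alignment are illustrative assumptions rather than anything used in this study.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one bootstrap pseudoreplicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Columns (sites) are drawn with replacement, so the pseudoreplicate has
    the same number of sites as the original alignment.
    """
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks)
            for taxon, seq in alignment.items()}

# Hypothetical toy alignment (not data from the study).
toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
replicate = bootstrap_columns(toy, random.Random(0))
```

Every pseudoreplicate is built from the same pool of columns, so a nonhistorical signal carried by many columns is resampled as faithfully as a historical one. High bootstrap support therefore measures the strength of the signal, not its source.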
The "total evidence" approach to phylogenetic reconstruction (Eernisse and Kluge, 1993) is currently among the most widely applied. It "uses character congruence to find the best fitting hypothesis for an unpartitioned set of synapomorphies, which is ideally all of the relevant available data" (Eernisse and Kluge, 1993). In its purest implementation the approach weights all characters equally in order to dispense with any need to identify different classes of information. Proponents maintain that partitioning evidence into classes is artificial, "because there is little reason to believe such categories are mind-independent categories with discoverable boundaries" (Eernisse and Kluge, 1993). This somewhat narrow perspective has gained a following because it obviates a need to specify explicit (and often poorly known) processes about the way in which traits evolve. Advocates of the method espouse a view that homoplasy (character-state covariation among taxa due to influences other than shared history) will be randomly distributed with respect to taxa, and that hierarchically structured historical signal (character-state covariation among taxa due to shared history) will overshadow the homoplasy if enough data are collected (see Farris, 1983). In keeping with this view, it is assumed that any incorrect inferences will be due to stochastic error associated with an insufficiently large sample size of characters, and that they will disappear as more data are collected.
The sample size of sites required to ensure that the historical signal overturns the homoplasy depends to a large extent on the strength of the historical signal in a data set and on the grain size of the homoplasy. If the homoplasy is dispersed in a fine-grained fashion (that is, is distributed in small "packets," each of which suggests a different nonhistorical association), it will likely appear randomly distributed at relatively small sample sizes. If homoplasy is dispersed in a coarse-grained fashion, so that several sites suggest the same nonhistorical grouping, then larger sample sizes (i.e., more sequence) will be required before patterns of homoplasy appear randomly distributed. In essence, the sample size of sites at which the randomness of homoplasy becomes apparent is dictated by the grain size of the homoplasy. Thus, even if homoplasy were randomly distributed among taxa at the level of the entire genome (an assumption that has not been empirically tested), it would appear to be highly nonrandomly distributed within a particular sample of sites if its grain size were coarse and the sample of sites insufficiently large. Note, in the current context, that the term "grain size" has no spatial connotation. When we refer to homoplasy as "coarse-grained," we mean only that several sites within a fragment imply the same nonhistorical grouping; the misleading sites that collectively constitute a "packet" need not be spatially contiguous along the sequence.
The premise that homoplasy is randomly distributed or unstructured within data sets underlies the phylogenetically meaningful interpretations of bootstrapping, decay indices, and successive weighting (Farris, 1969). Groupings assessed as unreliable (i.e., those with little character support) are assumed to be due to chance, while those assessed as reliable are assumed to be so due to shared history. Unfortunately, if homoplasy is nonrandomly distributed or if it shows "systematic error" (Swofford et al., 1996), then analyses will not only yield erroneous phylogenetic inferences, but many of the tests designed to evaluate the reliability of their constituent nodes will lead to falsely confident assessments.
Given the appeal of the "total evidence," equally weighted parsimony approach, it would be useful to evaluate the extent to which its required assumption for random distribution of homoplasy is actually met by molecular data sets. When phylogeny is known, nonrandom distribution of homoplasy can be inferred when the data set strongly supports an incorrect tree. The strength of departure from randomness can be assessed by evaluating the level of bootstrap support, or the decay index for the incorrect groups, or by subjecting the data to a Templeton (1983) test.

TABLE 1. Species used and their GenBank accession numbers.

Species name                    Common name       GenBank accession number
Mus musculus                    Mouse             J01420
Rattus norvegicus               Rat               X14848
Bos taurus                      Cow               J01394
Balaenoptera physalus           Fin-back whale    X61145
Balaenoptera musculus           Blue whale        X72204
Didelphis virginiana            Opossum           Z29573
Gallus gallus                   Chicken           X52392
Xenopus laevis                  Frog              M10217, X01600, X01601, X02890
Cyprinus carpio                 Carp              X61010
Oncorhynchus mykiss             Trout             L29771
Petromyzon marinus              Lamprey           U11880
Branchiostoma floridae          Lancelet          AF035164-AF035176
Paracentrotus lividus           Sea urchin 1      J04815
Strongylocentrotus purpuratus   Sea urchin 2      X12631
Drosophila yakuba               Fruit fly         X03240
Cepaea nemoralis                Snail             U23045
Anopheles gambiae               Mosquito          L20934
Ascaris suum                    Nematode 1        X54253
Caenorhabditis elegans          Nematode 2        X54252

Although no phylogeny is known with certainty, a number are very well supported, perhaps the best known being that for echinoderms plus chordates (see Maisey, 1986, 1988; Gauthier et al., 1988, and references therein). Complete mitochondrial genomes have been sequenced for representatives of several vertebrate classes, two echinoderm classes, and a number of outgroups. We have recently sequenced a mitochondrial DNA (mtDNA) from the lancelet Branchiostoma floridae ("amphioxus"), a species of Cephalochordata, the immediate sister taxon to the Craniata. Thus, complete mtDNA sequences are now available from representatives of most key lineages in vertebrate evolution. Phylogenetic inferences derived from comparisons of these sequences can be contrasted with the accepted phylogeny, providing an opportunity to examine the distribution patterns of homoplasy in mtDNA sequences of this group.
FIGURE 1. The expected pattern of phylogenetic relationships for 19 taxa. Branch lengths depicted are estimates from the fossil record (Benton, 1993). They reflect the earliest fossil occurrence assignable to the stem lineage of the extant form. In the case of the frog, the earliest fossil occurrence for the anuran stem group was used rather than the first fossil assignable to the amphibian grade.
MATERIALS AND METHODS
We assembled complete mitochondrial sequences for 19 taxa (Table 1) whose phylogenetic relationships are noncontroversial (Fig. 1). The protein-encoding regions were aligned at the amino acid level using Clustal W (Thompson et al., 1994) and were checked for higher order structural concordance using the codon-coloring feature of Aligner (Eernisse, 1995). The resulting data set, consisting of 19 aligned 12,234-bp sequences, was subjected to a series of phylogenetic analyses using PAUP*4.0 version 53 (written by David Swofford). Snail, fruit fly, mosquito, and two nematode species were used as a collective outgroup for all analyses.
TABLE 2. Inference errors resulting from bootstrap analysis of each gene individually and all genes combined. Gene columns: ATP6, ATP8, CO1, CO2, CO3, CYTB, ND1, ND2, ND3, ND4, ND4L, ND5, ND6, All. Panels: all nucleotide substitutions (AGCT), transversions (RY), and amino acids. Inferred groupings listed as rows: (frog, fish); (chicken, fish); (chicken, frog); (frog, chicken, fish); (chicken, frog, fish, lamprey); (frog (fish, amniotes)); (opossum (frog, fish, amniotes)); (rodents (whales, cow, opossum)); (lamprey (lancelet (echinoderms, vertebrates))); (lancelet (echinoderms, vertebrates)); (lancelet (flies, echinoderms, vertebrates)). [Table body: X marks by gene and grouping; cell positions not recoverable.]
FIGURE 2. MPTs resulting from equally weighted analysis of the combined data set, for nucleotides (top), transversions only (center), and amino acids (bottom). Bootstrap support percentages are shown at each node. Tree length and RI are shown to the right of each topology. Note that two MPTs result for the amino acid analysis. The topology depicted is the strict consensus of the two MPTs.
We examined trees derived from equally weighted parsimony analysis for each of the 13 protein-encoding genes, both individually and in combination. Analyses were conducted at three levels: using all nucleotides; using transversions only; and using amino acid sequences. The degree of support for each node was evaluated using the bootstrap method of Felsenstein (1985).
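One common way to restrict an analysis to transversions is to recode nucleotides as purines (R) and pyrimidines (Y), so that transitions become invisible. The sketch below assumes that the transversion analyses correspond to this kind of RY recoding, as the RY label in Table 2 suggests; the function names and the two short sequences are hypothetical.

```python
# Purines (A, G) -> R; pyrimidines (C, T) -> Y. Differences between recoded
# sequences are transversions only; transitions (A<->G, C<->T) disappear.
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

def ry_recode(seq):
    """Recode a nucleotide sequence to purine/pyrimidine (R/Y) states.

    Any other symbol (gap, ambiguity code) is passed through unchanged.
    """
    return "".join(RY.get(base, base) for base in seq.upper())

def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical sequences: two transitions and one transversion apart.
s1, s2 = "ACGTAG", "GCTTAA"
all_diffs = count_differences(s1, s2)
tv_diffs = count_differences(ry_recode(s1), ry_recode(s2))
```

Here `all_diffs` counts every substitution, whereas `tv_diffs` counts only the purine/pyrimidine change at the third position.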
The nucleotide sequence data were also subjected to distance analyses using Jukes-Cantor (JC) (1969), Kimura two-parameter (K2P) (1980), Hasegawa-Kishino-Yano (HKY) (1985), and general time-reversible (GTR) (Lanave et al., 1984; Tavaré, 1986; Rodríguez et al., 1990) distances. Each of the four distances was used in conjunction with four different models of among-site rate variation (ASRV) (Sullivan et al., 1995, 1996; Yang, 1996): (a) no rate variation; (b) a proportion of sites assumed to be invariant, I, with the remainder having equal rates (Fitch and Margoliash, 1967); (c) rate variation following a discrete approximation to a gamma distribution, Γ (Yang, 1994); and (d) a proportion of sites assumed to be invariant, with the remainder following a discrete approximation to a gamma distribution, I + Γ (Gu et al., 1995; Sullivan et al., submitted). In all, 16 (4 × 4) different models were investigated. Parameter values for each of the 16 models were obtained by fitting the expected tree to the data and optimizing values for that tree under maximum likelihood. Maximum-likelihood tests evaluating the fit of each of the models to the expected tree were carried out; the results are shown in the Appendix. Heuristic searches were conducted using maximum likelihood for the same 16 model conditions just described.
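The two simplest of the four distances have closed-form expressions, and a short sketch may help to show what each one corrects for. The code below uses the standard JC and K2P formulas without any among-site rate variation; the HKY and GTR distances and the I, Γ, and I + Γ corrections are not shown. The function names are illustrative assumptions and this is not the PAUP* implementation.

```python
import math

PURINES = {"A", "G"}

def proportions(a, b):
    """Return (P, Q): proportions of transitions and transversions between
    two aligned sequences. Sites where either sequence has a symbol other
    than A, C, G, or T are skipped. Assumes at least one comparable site."""
    ts = tv = n = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        n += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1   # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1   # purine<->pyrimidine
    return ts / n, tv / n

def jc_distance(a, b):
    """Jukes-Cantor distance: d = -3/4 ln(1 - 4p/3), p = P + Q.
    Raises ValueError when the sequences are too divergent (p >= 3/4)."""
    P, Q = proportions(a, b)
    return -0.75 * math.log(1 - 4 * (P + Q) / 3)

def k2p_distance(a, b):
    """Kimura two-parameter distance:
    d = -1/2 ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    P, Q = proportions(a, b)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

Both corrections grow without bound as the observed differences approach saturation, which is one way to see why the historical signal becomes hard to recover among very distantly related taxa.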
We fitted the expected tree to the data and measured the phylogenetic informativeness of each of the 12,234 sites for that tree using the retention index (RI; Archie, 1989; Farris, 1989). We measured base composition and its variation (deviation from stationarity) among the 19 taxa for the subset of sites with a perfect fit to the expected tree (those with RI = 1.0), and contrasted the values with those obtained for the entire population of sites. Principal-component analyses of nucleotide base composition and amino acid composition were plotted to provide a graphic representation of overall compositional similarities among taxa. Maximum-likelihood tests evaluating the fit of the 16 models to the expected tree were carried out for the subset of sites with an RI of 1.0 and are contrasted with similar analyses using all 12,234 sites (see Appendix). Kishino-Hasegawa (1989) tests contrasting the topology of the expected tree with that of the most parsimonious tree (MPT) yielded by the complete nucleotide data set are shown for different models in the Appendix.
FIGURE 3. Comparison of base-compositional analysis of the 1207 sites with a perfect fit (RI = 1.0) to the expected tree and of the entire population of sites. (a) Expected tree (left); MPT yielded by equally weighted analysis of the entire data set (right). (b) Corresponding base-compositional profiles for each data set. Note the more balanced distribution of the four nucleotides in the subset of sites with a perfect fit (RI = 1.0) to the expected tree. (c) Deviation from stationarity among ingroup taxa was assessed using a chi-squared test. Base-compositional differences are significantly different from random expectation for the entire population of sites (P < 0.0000001), but are not significantly different for the subset of sites with a perfect fit to the expected tree. These tests are intended only as coarse heuristics and do not account for phylogenetic structure (Swofford, 1997).
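The per-site retention index used in the Methods can be illustrated with a small sketch. It assumes the usual definition RI = (g - s)/(g - m), with s counted by Fitch parsimony on a fixed, rooted, fully bifurcating tree written as nested 2-tuples. The tree, taxon labels, and character states are hypothetical, and this is not the PAUP* implementation.

```python
from collections import Counter

def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary
    tree given as nested 2-tuples of taxon names (Fitch parsimony)."""
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        steps += 1                         # children disagree: one change
        return left | right

    state_set(tree)
    return steps

def retention_index(tree, states):
    """RI = (g - s) / (g - m) for one character.

    s: steps on the given tree; m: minimum steps on any tree (number of
    states - 1); g: maximum steps on any tree (number of taxa - count of
    the commonest state). Returns None when g == m, i.e. for characters
    that are parsimony-uninformative.
    """
    counts = Counter(states.values())
    m = len(counts) - 1
    g = len(states) - max(counts.values())
    if g == m:
        return None
    return (g - fitch_steps(tree, states)) / (g - m)

# Hypothetical five-taxon tree and two characters.
toy_tree = ((("A", "B"), ("C", "D")), "E")
perfect = {"A": "G", "B": "G", "C": "T", "D": "T", "E": "T"}      # fits the tree
conflicting = {"A": "G", "B": "T", "C": "G", "D": "T", "E": "T"}  # groups A with C
```

A site such as `perfect`, whose states change once on the branch uniting A and B, has RI = 1.0, whereas `conflicting` requires as many changes as on an unresolved tree and has RI = 0.0.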
Each site in the alignment was classified according to gene, codon position, amino acid (the modal amino acid across taxa in the alignment), chemical property, charge, and relative hydrophobicity of the modal amino acid for that site in the alignment. An analysis of variance assessing the effect of each of these six factors on phylogenetic informativeness (RI for the expected tree) was then carried out.
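Two of the six classification factors, codon position and modal amino acid, can be derived mechanically from an alignment, as the following sketch shows. It assumes that each gene's nucleotide alignment starts in frame and that gaps occur in whole codons; the function names, record layout, and example columns are hypothetical, and the chemical property, charge, and hydrophobicity assignments are not reproduced.

```python
from collections import Counter

def codon_position(site_index):
    """Codon position (1, 2, or 3) of a 0-based nucleotide index, assuming
    the alignment starts in frame at index 0."""
    return site_index % 3 + 1

def modal_amino_acid(column):
    """Most frequent residue in one amino acid alignment column.
    Gaps ('-') are ignored; ties go to the residue encountered first."""
    counts = Counter(res for res in column if res != "-")
    return counts.most_common(1)[0][0] if counts else None

def classify_sites(gene, aa_columns):
    """Return one record per nucleotide site of a gene.

    aa_columns: list of strings, one per codon, each holding the residues
    observed across taxa at that codon.
    """
    records = []
    for codon_index, column in enumerate(aa_columns):
        modal = modal_amino_acid(column)
        for offset in range(3):
            site = 3 * codon_index + offset
            records.append({"gene": gene,
                            "site": site,
                            "codon_position": codon_position(site),
                            "modal_aa": modal})
    return records

# Hypothetical two-codon example for a gene labeled CO1.
records = classify_sites("CO1", ["MMMML", "PPP-P"])
```

Records of this kind can then be grouped by any factor, for example to select first and second codon positions of sites whose modal amino acid is proline.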
RESULTS
In the equally weighted parsimony analyses, none of the genes, either individually or in combination, yielded the expected tree, nor were any of the fully resolved trees resulting from bootstrap resampling of the individual genes consistent with the expected tree. There was considerable consistency among genes in the pattern of inferred errors (Table 2). When all substitutions were analyzed, 10 of the 13 genes indicated Branchiostoma to be the sister taxon to a (vertebrate + echinoderm) clade, 4 genes (ATP6, CO1, ND4L, and ND6) indicated chicken to be the sister taxon to fishes, and 5 genes (ATP6, CO1, CO3, ND5, and ND6) indicated a monophyletic (frog, fish, chicken) clade. These repeated error patterns imply that homoplasy is highly nonrandomly distributed. The bootstrap consensus trees from the transversion analyses were less resolved and had fewer conflicts with the accepted tree; however, CO1, CO2, and ND2 indicated Branchiostoma to be outside a (vertebrate + echinoderm) clade, and CO1 and ND5 indicated a (frog, fish, chicken) clade. There was less consistency in the pattern of errors with the amino acid sequences; nevertheless, some of the same inference errors seen in the nucleotide analyses resurfaced. When all genes were combined, the MPT and the corresponding bootstrap consensus differed from the expected tree at all three (nucleotide, transversion, and amino acid) levels of analysis, and the incorrect groupings often had high levels of bootstrap support (see Fig. 2).
FIGURE 4. Compositional similarity among taxa for nucleotide bases and amino acid residues. Principal-component plots (PC1 vs. PC2) allow immediate identification of compositional similarity among the 19 taxa. The first two components account for 98% of the variation in nucleotide base composition and 67% of the variation in amino acid composition. All analyses are based on the correlation matrix. Solid circles represent mammals, open circles nonmammalian vertebrates, and solid triangles invertebrates. Number-species correspondences: 1 = fin-back whale, 2 = blue whale, 3 = cow, 4 = rat, 5 = mouse, 6 = opossum, 7 = chicken, 8 = frog, 9 = trout, 10 = carp, 11 = lamprey, 12 = lancelet, 13 = sea urchin 1, 14 = sea urchin 2, 15 = mosquito, 16 = fruit fly, 17 = snail, 18 = nematode 1, 19 = nematode 2. Note that in both plots the two sea urchin taxa (13 and 14) are closer to the vertebrate taxa than is the lancelet (12).
Distance and maximum-likelihood analyses of the complete nucleotide data set failed to yield the expected topology for any one of the 16 model/ASRV combinations tested. These analyses, like the parsimony analysis (Fig. 2), all placed Branchiostoma outside echinoderms and the frog, fish, and chicken in a clade of their own. Although no single analysis yielded the expected tree, the more parameter-rich models (HKY and GTR with ASRV) yielded trees that were not significantly different from the expected tree when subjected to Kishino-Hasegawa (1989) tests (see Appendix). This suggests an improved fit between model and the data for the parameter-rich models.
That the entire protein-encoding portion of the mtDNA, a total of 12,234 sites, yields an inference that is both incorrect and supported by high bootstrap values in an equally weighted parsimony analysis is sobering. The fact that distance and maximum-likelihood analyses of the data under a variety of models (in which rate matrix parameters were optimized by first fitting the expected tree to the data set) also fail to yield the expected tree supports our original supposition that homoplasy is nonrandomly distributed within this large sample of sites. It is possible that most of the structured or misleading homoplasy is concentrated within a few genes. Indeed, we found that when we subjected a combined data set comprising amino acid sequences from ND1, ND4, CO1, CO2, CO3, and CYTB (2,302 amino acid sites) to an equally weighted parsimony bootstrap analysis, the expected tree resulted with 100% bootstrap support for all but three nodes. However, the utility of this finding is questionable, since the genes yielding correct results might vary among data sets and thus not be determinable a priori.
Collective Properties of Sites with a Perfect Fit to the Expected Tree
When the sequence data were fitted to the topology of the expected tree, we identified 1,207 phylogenetically informative sites with a perfect fit to that tree (i.e., 1,207 sites with an RI of 1.0). Base composition for this subset of sites was less skewed and showed no significant deviation from stationarity for the ingroup taxa, in marked contrast to the situation observed for the entire population of sites (Fig. 3). Moreover, in principal-component plots of nucleotide base composition and amino acid composition (Fig. 4), the vertebrate taxa have profiles that are clearly more similar to those of the two echinoderms than to that of Branchiostoma. These results are consistent with the prediction, made on the basis of simulations (Saccone et al., 1989, 1990, 1993; Steel et al., 1993; Lockhart et al., 1994; Steel, 1994; Pesole et al., 1995), that base-compositional deviations from stationarity can result in hierarchically structured homoplasy and, consequently, lead to incorrect phylogenetic inference.
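The test of deviation from stationarity can be sketched as a chi-squared test of homogeneity on a taxa-by-base contingency table. As the Figure 3 legend notes, such a test is only a coarse heuristic that ignores phylogenetic structure. The function name and the toy sequences are illustrative assumptions, and the conversion of the statistic to a P value is left out.

```python
def composition_chi_square(sequences):
    """Chi-squared statistic and degrees of freedom for homogeneity of base
    composition across taxa.

    sequences: dict mapping taxon name -> nucleotide sequence.
    Expected counts come from the row (taxon) and column (base) totals.
    The degrees of freedom assume every taxon has at least one counted base
    and that all four bases occur somewhere in the data.
    """
    bases = "ACGT"
    table = [[seq.upper().count(b) for b in bases]
             for seq in sequences.values()]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for row, r_tot in zip(table, row_totals):
        for observed, c_tot in zip(row, col_totals):
            expected = r_tot * c_tot / total
            if expected > 0:               # skip bases absent from all taxa
                chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(bases) - 1)
    return chi2, df

# Hypothetical examples: identical and completely different compositions.
same = composition_chi_square({"t1": "ACGT", "t2": "ACGTACGT"})
different = composition_chi_square({"t1": "AAAA", "t2": "CCCC"})
```

Identical compositions give a statistic of zero regardless of sequence length, while taxa that share no bases give a large value relative to the degrees of freedom.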
We emphasize, however, that the base-compositional differences do not completely account for the inference errors in this data set. The base-composition plots for the subset of sites with a perfect fit to the expected tree, although markedly different from those for the entire data set, also show the vertebrate taxa to have profiles that are more similar to those of the two echinoderms than to that of Branchiostoma. There is thus no simple additive correspondence between base-compositional bias and the inferred phylogeny. Furthermore, erroneous inferences are not ameliorated by LogDet neighbor-joining analysis, a procedure demonstrated through simulation to retrieve correct phylogenies in the face of nonstationary base compositions when sites are independent (Steel et al., 1993; Lockhart et al., 1994; Steel, 1994). This is the case even when a proportion of sites are assumed to be invariant to accommodate bias due to among-site rate heterogeneity (Waddell, 1995; Swofford et al., 1996).
FIGURE 5. Tests of the association between functional characteristics and the phylogenetic informativeness of a site when the combined data set is fitted to the expected tree. The degree of informativeness was assessed using RI. Analysis of variance indicates that all six factors (gene, codon position, amino acid, chemical property, charge, and hydrophobicity) have highly significant effects on RI (log-transformed). The relative effects of the different levels of each factor are plotted against log RI (ordinate) as response sample means. Bars correspond to one standard error.
FIGURE 6. MPTs based on functional subsets of amino acids. Bootstrap support percentages are shown at each node. (a) Strict consensus of two MPTs resulting from the analysis of first and second codon positions for sites whose modal amino acid was proline or cysteine. (b) Single MPT resulting from analysis of first and second codon positions for sites whose modal amino acid was proline, cysteine, methionine, glutamine, and asparagine (the imino, sulfur, and amide side-chain groups, respectively).
Relationship Between Function and Phylogenetic Informativeness
Analysis of variance indicated that all six factors tested (gene, codon position, amino acid, chemical properties, charge, and relative hydrophobicity) have highly significant effects (P < 0.0005) on the phylogenetic informativeness (RI) of a site (Fig. 5). This result reflects the importance of these properties to molecular structure and function. Significant interaction terms were found among some of the properties. For example, first positions had markedly higher RIs for hydrophilic than for hydrophobic sites, an association not seen at second or third positions. An interaction was also seen between gene and codon position (P < 0.005). Effect tests for this interaction revealed that third position sites had significantly higher RIs (P < 0.05) in ATP8 and ND4L than in other genes, suggesting that third codon position constraints may differ among genes.
Based on these analyses, we were able to identify classes of sites that yielded the expected tree when subjected to parsimony analysis. The greatest overall support resulted from an analysis of first and second codon positions of sites modally coding for proline, cysteine, methionine, glutamine, and asparagine. Parsimony analysis of the first two sites of all codons in positions modally coding for proline and cysteine yielded an incompletely resolved bootstrap consensus tree that was compatible with the expected vertebrate tree and had 65% bootstrap support for a monophyletic Chordata (cephalochordates + vertebrates; Fig. 6a). When the first two sites of all codons in positions modally coding for methionine, glutamine, and asparagine were added to this analysis the expected tree was obtained in fully resolved form, with strengthened (85%) bootstrap support for a monophyletic Chordata (Fig. 6b). Although there is an undeniable element of circularity involved in using the expected tree to determine sites that are informative, it is interesting and probably significant that those we identified are associated with conservative molecular motifs that are frequently important for protein structure and function.
By contrast, analysis shows sites modally coding for the rapidly evolving hydrophobic amino acids leucine, isoleucine, and valine (Fig. 5) to have especially poor fits to the expected tree. Although poor fits are generally thought to be associated with saturated sites that have lost their signal, our analysis suggests something more problematic for phylogenetic inference: These sites have not only lost their historical signal, but contain a nonrandom signal that is misleading. Interestingly, a Templeton test indicates that the MPT (Fig. 2) is significantly (P < 0.0001) different from the expected tree (Fig. 1) when all 12,234 sites are included in the analysis, but not significantly different (P = 0.94) when isoleucine, leucine, valine, and third position sites are excluded. Similar results are seen with Kishino-Hasegawa (1989) tests. Details are presented in the Appendix.
It is possible that further work may show some of the patterns identified here to be more widespread. At present, however, we regard them as specific to this study and, at best, applicable only to studies using sequences from these same genes among metazoan taxa over a comparable range of divergence. Had we analyzed this same set of taxa using sequences from a different set of genes (e.g., genes for monomeric enzymes of the cytosol), different classes of informative sites might have been obtained, and a comparison of these same genes from more recently diverged taxa would almost certainly yield a different suite of informative sites. We also acknowledge that a denser sampling of echinoderm and chordate taxa for the same set of genes would likely change (and possibly improve) the phylogenetic estimate based on the entire data set (Lecointre et al., 1993; Hillis, 1996; Kim, 1996).
CONCLUSIONS
The assumption that historical signal will prevail if enough sites are sampled is widely held among evolutionary and systematic biologists. It is explicitly championed by the "total evidence parsimony" school and is often implicit in the work of those who embrace evolutionary models (Churchill et al., 1992; Huelsenbeck and Hillis, 1993). For example, Cummings et al. (1995) attempted to determine a sequence-sampling strategy that would approximate inferences yielded by entire mtDNAs, believing that the inferences yielded by the entire sequence would be more "reliable" than would any particular subsample. Russo et al. (1996), in evaluating the performance of different phylogenetic inference methods, stated: "The most important factor in constructing reliable phylogenetic trees seems to be the number of amino acids or nucleotides used." Results presented in the current study demonstrate that there are circumstances in which this is simply not the case. Despite a very large sample (12,234 protein-coding sites, the maximum obtainable from metazoan mtDNA), an erroneous yet robust topology resulted, one contradicted by a wealth of other data. Clearly, the models underlying inference methods, whether implicit as is the case for parsimony or explicit as is the case for distance and maximum-likelihood models, are not accommodating the processes that have shaped the data. In the present data set, several methods actually converge on an incorrect topology as more sequence is added. These results are consistent with predictions based on simulations by Huelsenbeck and Hillis (1993). More data are
better than fewer data only when the inference model accommodates, in an unbiased way, the evolutionary forces that have shaped character-state distributions. Any disparities (biases) that exist between a model (implied or explicit) and the evolutionary process will be magnified with increasing amounts of data.
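The point about bias and sample size can be illustrated with a deliberately simple numerical sketch. It is not a phylogenetic model: it assumes, hypothetically, that a misspecified model shifts some per-site summary by a fixed amount (`bias`) while sampling noise shrinks with the square root of the number of sites. The function name and all numbers other than the 12,234-site total are illustrative assumptions.

```python
import math


def bias_in_standard_errors(bias, per_site_sd, n_sites):
    """Express a fixed model bias in units of the standard error at n_sites.

    Assumption (hypothetical): the standard error of the summary falls as
    per_site_sd / sqrt(n_sites), whereas the bias does not depend on n_sites.
    """
    standard_error = per_site_sd / math.sqrt(n_sites)
    return abs(bias) / standard_error


# The same small bias looks negligible with few sites and decisive with many.
for n in (100, 1_000, 12_234):
    print(n, round(bias_in_standard_errors(0.02, 1.0, n), 2))
```

Under these assumptions the ratio grows without bound as sites are added, which is one way of reading the statement that a biased model becomes more confidently wrong as the data set grows.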
This study provides an empirical demonstration that further sequencing does not automatically lead to an improved phylogenetic estimate. Once sequences from a few genes have been obtained, we believe that time and effort would be better spent investigating how knowledge of the structures and functions of those sequences and the products they encode can be integrated and incorporated into phylogenetic inference methods, rather than adding more sequence data. In stating this, it is not our intent to discourage sequencing efforts, but to emphasize that it is useful to incorporate knowledge about what a sequence does as well as about what it is into the inference models we use. Evolutionary biologists rarely analyze information contained in sequence data beyond an aggregate pooling of information derived from individual nucleotide sites, even though such information is available for many of the sequences that are routinely used for phylogenetic inference. The structural and functional attributes of a particular gene product persist and can often be followed long after the historical signal in the underlying individual sequence elements has been lost. It is becoming increasingly possible to empirically assess character-state change probabilities for sites associated with such structural and functional attributes. Once these have been estimated for a particular gene, they can be incorporated into methods of inference in much the same way as has been done with estimates of relative rates of transitions and transversions. Comparisons that make use of such information may ultimately provide the key to resolving phylogenetic questions, such as those involving relationships among deeply diverged groups, that are unresolvable by analysis of the individual sequence elements themselves.
ACKNOWLEDGMENTS
We are grateful to Stan Blum, Susan Brown, Tim Collins, Elizabeth Knurek, Fred Kraus, Christian Pazmandi, Chris Simon, Una Smith, Jack Sullivan, and Dave Swofford for critical comments. This work was supported by National Science Foundation grant DEB-9220640 to W.M.B. and by a Sloan Postdoctoral Fellowship to G.J.P.N.
REFERENCES
ARCHIE, J. W. 1989. Homoplasy excess ratios: New indices for measuring levels of homoplasy in phylogenetic systematics and a critique of the consistency index. Syst. Zool. 38:253–269.
BENTON, M. J. 1993. The fossil record 2. Chapman and Hall, London.
BREMER, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42:795–803.
CHO, S., A. MITCHELL, J. C. REGIER, C. MITTER, R. W. POOLE, T. P. FRIEDLANDER, AND S. ZHAO. 1995. A highly conserved nuclear gene for low-level phylogenetics: Elongation factor-1α recovers morphology-based tree for heliothine moths. Mol. Biol. Evol. 12:650–656.
CHURCHILL, G. A., A. VON HAESSLER, AND W. C. NAVIDI. 1992. Sample size for a phylogenetic inference. Mol. Biol. Evol. 9:753–769.
CUMMINGS, M. P., S. P. OTTO, AND J. WAKELEY. 1995. Sampling properties of DNA sequence data in phylogenetic analysis. Mol. Biol. Evol. 12:814–822.
DONOGHUE, M. J., R. G. OLMSTEAD, J. F. SMITH, AND J. D. PALMER. 1992. Phylogenetic relationships of Dipsacales based on rbcL sequences. Ann. Missouri Bot. Garden 79:333–345.
EERNISSE, D. J., AND A. KLUGE. 1993. Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules and morphology. Mol. Biol. Evol. 10:1170–1195.
EERNISSE, D. J. 1995. DNA Stacks: HyperCard software utilities for molecular systematists, version 1.1. Published electronically. Available at ftp://ftp.biology.indiana.edu.
FARRIS, J. S. 1969. A successive approximation approach to character weighting. Syst. Zool. 18:374–385.
FARRIS, J. S. 1983. The logical basis of phylogenetic analysis. Pages 7–36 in Advances in cladistics, Volume II (N. I. Platnick and V. A. Funk, eds.). Columbia Press, New York.
FARRIS, J. S. 1989. The retention index and the rescaled consistency index. Cladistics 5:417–419.
FELSENSTEIN, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–416.
FELSENSTEIN, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783–791.
FITCH, W. M., AND E. MARGOLIASH. 1967. A method for estimating the number of invariant amino acid positions in a gene using cytochrome c as a model case. Biochem. Genet. 1:65–71.
GAUTHIER, J., A. G. KLUGE, AND T. ROWE. 1988. Amniote phylogeny and the importance of fossils. Cladistics 4:105–209.
GOLDMAN, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.
GRAYBEAL, A. 1994. Evaluating the phylogenetic utility of genes: A search for genes informative about deep divergences among vertebrates. Syst. Biol. 43:174–193.
GU, X., Y.-X. FU, AND W.-H. LI. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546–557.
HASEGAWA, M., H. KISHINO, AND T. YANO. 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 21:160–174.
HILLIS, D. M. 1991. Discriminating between phylogenetic signal and random noise in DNA sequences. Pages 278–294 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
HILLIS, D. M. 1996. Inferring complex phylogenies. Nature 383:130–131.
HUELSENBECK, J. P., AND D. M. HILLIS. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264.
JUKES, T. H., AND C. R. CANTOR. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. N. Munro, ed.). Academic Press, New York.
KIM, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363–374.
KIMURA, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
KISHINO, H., AND M. HASEGAWA. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order of the Hominoidea. J. Mol. Evol. 29:170–179.
LANAVE, C., G. PREPARATA, C. SACCONE, AND G. SERIO. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86–93.
LECOINTRE, G., H. PHILIPPE, H. L. VAN LÊ, AND H. LE GUYADER. 1993. Species sampling has a major impact on phylogenetic inference. Mol. Phyl. Evol. 2:205–224.
LOCKHART, P. J., M. A. STEEL, M. D. HENDY, AND D. PENNY. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612.
MAISEY, J. G. 1986. Heads and tails: A chordate phylogeny. Cladistics 2:201–256.
MAISEY, J. G. 1988. Phylogeny of early vertebrate skeletal induction and ossification patterns. Evolutionary biology, Volume 22 (M. Hecht, B. Wallace, and G. T. Prance, eds.). Plenum, New York.
PESOLE, G., G. DELLISANTI, G. PREPARATA, AND C. SACCONE. 1995. The importance of base composition in the correct assessment of genetic distance. J. Mol. Evol. 41:1124–1127.
PHILIPPE, H., A. CHENUIL, AND A. ADOUTTE. 1994. Can the Cambrian explosion be inferred through molecular phylogeny? Development (suppl.):15–25.
RODRÍGUEZ, F., J. L. OLIVER, A. MARÍN, AND J. R. MEDINA. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485–501.
RUSSO, C. A. M., N. TAKEZAKI, AND M. NEI. 1996. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13:525–536.
SACCONE, C., G. PESOLE, AND G. PREPARATA. 1989. DNA microenvironments and the molecular clock. J. Mol. Evol. 29:407–411.
SACCONE, C., C. LANAVE, G. PESOLE, AND G. PREPARATA. 1990. Influence of base composition on quantitative estimates of gene evolution. Methods Enzymol. 183:570–583.
SACCONE, C., C. LANAVE, AND G. PESOLE. 1993. Time and biosequences. J. Mol. Evol. 37:154–159.
STEEL, M. A., P. J. LOCKHART, AND D. PENNY. 1993. Confidence in evolutionary trees from biological sequence data. Nature 364:440–442.
STEEL, M. A. 1994. Recovering a tree from the leaf colorations it generates under a Markov model. Appl. Math. Lett. 7:19–23.
SULLIVAN, J., K. E. HOLSINGER, AND C. SIMON. 1995. Among-site rate variation and phylogenetic analysis of 12S rRNA data in sigmodontine rodents. Mol. Biol. Evol. 12:988–1001.
SULLIVAN, J., K. E. HOLSINGER, AND C. SIMON. 1996. The effect of topology on estimates of among site rate variation. J. Mol. Evol. 42:308–312.
SULLIVAN, J., D. L. SWOFFORD, AND G. J. P. NAYLOR. Uncertainty in estimating parameters of mixed-distribution models of rate heterogeneity. (submitted to Syst. Biol.)
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics, 2nd edition (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.
TAVARÉ, S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Lec. Math. Life Sci. 17:57–86.
TEMPLETON, A. R. 1983. Convergent evolution and non-parametric inferences from restriction fragment and DNA sequence data. Pages 151–179 in Statistical analysis of DNA sequence data (B. Weir, ed.). Marcel Dekker, New York.
THOMPSON, J. D., D. G. HIGGINS, AND T. J. GIBSON. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
WADDELL, P. J. 1995. Statistical methods of phylogenetic analysis, including Hadamard conjugations, LogDet transforms and maximum likelihood. Ph.D. Dissertation, Massey Univ., New Zealand.
YANG, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306–314.
YANG, Z. 1996. Among site rate variation and its impact on phylogenetic analyses. TREE 11:367–372.
Received 13 March 1997; accepted 31 July 1997
Associate Editor: C. Simon
APPENDIX
MODELS OF SUBSTITUTION
Four different substitution models of increasing complexity were evaluated. The simplest, the Jukes–Cantor (1969) model, assumes both an even base composition and an equal probability of change for all six transformation types. The Kimura (1980) two-parameter model assumes equal base frequencies but allows a transition:transversion ratio to be specified. The Hasegawa–Kishino–Yano (1985) model allows for an uneven base composition and a transition:transversion ratio. The general time-reversible model (Lanave et al., 1984; Tavaré, 1986; Rodríguez et al., 1990) allows for an uneven base composition and separate probabilities of change for each of the six possible transformation types. None of the four models accommodates deviation from stationarity in either base composition or substitution dynamics.
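The nesting of these four models can be sketched in code. The following is a minimal illustration, not the implementation used in this study: it assumes the common parameterization in which each off-diagonal rate is a symmetric exchangeability multiplied by the frequency of the target base, and it expresses the transition:transversion bias as a single rate ratio `kappa`. All function and variable names are hypothetical.

```python
from itertools import combinations

BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}  # purine<->purine, pyrimidine<->pyrimidine


def gtr_rate_matrix(freqs, exch):
    """Build a time-reversible rate matrix with q[i][j] = exch[{i, j}] * freqs[j].

    freqs: dict mapping each base to its equilibrium frequency (sums to 1).
    exch:  dict mapping frozenset({i, j}) to a symmetric exchangeability.
    """
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i][j] = exch[frozenset((i, j))] * freqs[j]
        # Diagonal entries make every row sum to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    return q


def uniform_freqs():
    return {b: 0.25 for b in BASES}


def ti_tv_exch(kappa):
    """Exchangeabilities with one transition:transversion rate ratio (assumed form)."""
    return {frozenset(p): (kappa if frozenset(p) in TRANSITIONS else 1.0)
            for p in combinations(BASES, 2)}


def jc69():
    # Even base composition, one rate for all six transformation types.
    return gtr_rate_matrix(uniform_freqs(), ti_tv_exch(1.0))


def k2p(kappa):
    # Even base composition, transitions and transversions differ.
    return gtr_rate_matrix(uniform_freqs(), ti_tv_exch(kappa))


def hky85(freqs, kappa):
    # Uneven base composition plus a transition:transversion bias.
    return gtr_rate_matrix(freqs, ti_tv_exch(kappa))
```

Written this way, JC, K2P, and HKY85 are all special cases of `gtr_rate_matrix`, which is why the simpler models can be compared with the richer ones by likelihood-ratio tests.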
Four among-site rate-heterogeneity models were evaluated for each of these substitution models: (a) equal rates; (b) a proportion of sites assumed to be invariant among taxa, the remainder assumed to evolve at equal rates (I; Fitch and Margoliash, 1967); (c) rates assumed to follow a discrete approximation of the gamma distribution (Γ; Yang, 1994); (d) a proportion of sites assumed to be invariant, the remainder to follow a discrete approximation of the gamma distribution (I + Γ; Gu et al., 1995). Thus, 16 (4 × 4) substitution/among-site rate-variation combinations were evaluated.
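The 4 × 4 design can be enumerated directly. The sketch below assumes the usual way of counting free parameters (three free base frequencies; one transition:transversion ratio for K2P and HKY85; five relative rates for GTR; one parameter each for I and Γ). These counts are an assumption of the example, although they reproduce the df column of Table 3 when each model is compared with GTR + I + Γ. The label "G" stands for Γ.

```python
from itertools import product

# Assumed free-parameter counts for each substitution model.
SUBSTITUTION_PARAMS = {"JC": 0, "K2P": 1, "HKY85": 4, "GTR": 8}
# Assumed free-parameter counts for each among-site rate model ("G" = gamma).
RATE_PARAMS = {"equal": 0, "I": 1, "G": 1, "I+G": 2}


def model_grid():
    """Return {(substitution, rate): (free_params, df_vs_richest)} for all 16 pairs."""
    richest = max(SUBSTITUTION_PARAMS.values()) + max(RATE_PARAMS.values())
    grid = {}
    for (sub, k_sub), (rate, k_rate) in product(SUBSTITUTION_PARAMS.items(),
                                                RATE_PARAMS.items()):
        k = k_sub + k_rate
        grid[(sub, rate)] = (k, richest - k)
    return grid


for (sub, rate), (k, df) in model_grid().items():
    print(f"{sub:6s} {rate:5s} free parameters = {k:2d}  df vs GTR+I+G = {df:2d}")
```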
Swofford et al. (1996) have pointed out the trade-offs between the consistency provided by a model's complexity and its sensitivity to random error. In general, it is desirable to use the simplest effective model to explain observations, that is, to choose a model that has enough parameters to explain the data satisfactorily, but not so many that statistical power is compromised.

TABLE 3. Likelihood-ratio test values for different substitution models.

Model            df    -log likelihood    X²
JC               10    190,018.657        30,277.3712
JC + I            9    184,834.253        19,908.5647
JC + Γ            9    181,487.388        13,214.8342
JC + I + Γ        8    181,340.654        12,921.3662
K2P               9    189,228.248        28,696.5546
K2P + I           8    183,988.963        18,217.9833
K2P + Γ           8    180,223.396        10,686.8493
K2P + I + Γ       7    180,109.353        10,458.7644
HKY85             6    186,023.715        22,287.4874
HKY85 + I         5    180,474.223        11,188.5048
HKY85 + Γ         5    175,238.975        718.0086
HKY85 + I + Γ     4    175,160.793        561.64454
GTR               2    184,936.447        20,112.9514
GTR + I           1    179,573.355        9,386.76752
GTR + Γ           1    174,980.197        200.45272
GTR + I + Γ       0    174,879.971        0
Which Model Best Explains the Data?
In order to identify the most appropriate substitution model, the -ln likelihood scores for the expected tree were compared for the 16 different models. A likelihood-ratio test statistic was computed and contrasted with a chi-squared approximation of the null distribution (Goldman, 1993). Results (Table 3) indicate that all models fit the data significantly worse (P < 0.01) than does the parameter-rich GTR + I + Γ model. This should not be interpreted to mean that the most parameter-rich model will find the expected tree (in fact it does not), but rather that simpler models will fare even more poorly.
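The arithmetic behind the likelihood-ratio comparison can be checked with a short sketch. It uses four of the -ln likelihood scores and df values reported in Table 3, takes the statistic as twice the difference in -ln likelihood between a simpler model and GTR + I + Γ, and compares it with the standard upper 1% critical values of the chi-squared distribution. The names are hypothetical, "G" stands for Γ, and the lookup table replaces an exact P-value calculation for brevity.

```python
# -ln likelihood of the richest model (GTR + I + G), from Table 3.
NEG_LNL_RICHEST = 174879.971

# (-ln likelihood, df relative to GTR + I + G) for a few models, from Table 3.
SCORES = {
    "JC": (190018.657, 10),
    "K2P+I+G": (180109.353, 7),
    "HKY85+I+G": (175160.793, 4),
    "GTR+G": (174980.197, 1),
}

# Upper 1% critical values of the chi-squared distribution for df = 1..10.
CHI2_CRIT_01 = {1: 6.635, 2: 9.210, 3: 11.345, 4: 13.277, 5: 15.086,
                6: 16.812, 7: 18.475, 8: 20.090, 9: 21.666, 10: 23.209}


def lrt_statistic(neg_lnl_simple, neg_lnl_rich):
    """Twice the difference in -ln likelihood between nested models."""
    return 2.0 * (neg_lnl_simple - neg_lnl_rich)


def rejects_at_01(model):
    """True if the simpler model fits significantly worse at the 1% level."""
    neg_lnl, df = SCORES[model]
    return lrt_statistic(neg_lnl, NEG_LNL_RICHEST) > CHI2_CRIT_01[df]


for name, (neg_lnl, df) in SCORES.items():
    stat = lrt_statistic(neg_lnl, NEG_LNL_RICHEST)
    print(f"{name:10s} X2 = {stat:12.3f}  df = {df:2d}  rejected: {rejects_at_01(name)}")
```

Even the closest competitor listed here, GTR + Γ, exceeds its critical value by a wide margin, which is the pattern summarized in the text.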
A similar test was carried out for the subset of 1,207 sites maximally informative for the expected tree under parsimony (those with RI = 1.0). The results were comparable to those obtained for the entire data set, insofar as all models fitted the data significantly less well than did the parameter-rich GTR + I + Γ model. However, X² values were much lower for the maximally informative subset of data, with values ranging from 17.3 to 40.3 (cf. 200.5–30,277.4). This is probably because the maximally informative sites can be more easily reconciled to the expected tree with simple models. (These sites collectively exhibit a more even base composition and less deviation from stationarity and thus do not require extra parameters to accommodate base-compositional unevenness and among-site rate variation.)
Under Which Substitution Models Is the Expected Tree Significantly Different From the Most-Parsimonious Tree?
Kishino–Hasegawa (1989) tests were conducted to contrast the likelihood score for the MPT resulting from equally weighted parsimony (Fig. 2) with that for the expected tree (Fig. 1) under the 16 models for all 12,234 sites. In no case (Table 4) did the expected tree fit the data significantly better. For the simpler models (JC and K2P), the MPT had a significantly better score than the expected tree. However, as more parameter-rich models, accommodating base composition (HKY85, GTR) and among-site rate heterogeneity (I + Γ), were used, the differences in the -ln likelihood scores diminished in significance. Results show that both rate heterogeneity and base composition must be incorporated before the MPT and the expected tree are no longer significantly different.
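For readers unfamiliar with the comparison, the general shape of a Kishino–Hasegawa-style test can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the tables: it assumes per-site log-likelihoods are available for both trees, estimates the variance of the summed difference from the per-site differences, and uses a two-tailed normal approximation. The function name and the toy input values are hypothetical.

```python
import math


def kh_test(site_lnl_a, site_lnl_b):
    """Compare two trees from per-site log-likelihoods (normal approximation).

    Returns (delta, z, p): the summed log-likelihood difference (A minus B),
    its standardized value, and a two-tailed P value.
    """
    if len(site_lnl_a) != len(site_lnl_b) or len(site_lnl_a) < 2:
        raise ValueError("need two equal-length lists with at least two sites")
    n = len(site_lnl_a)
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    delta = sum(diffs)
    mean = delta / n
    # Variance of the sum, estimated from the spread of per-site differences.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0.0:
        raise ValueError("per-site differences are constant; variance is zero")
    z = delta / math.sqrt(var_delta)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal tail area
    return delta, z, p


# Toy per-site log-likelihoods (hypothetical values, five sites only).
tree_a = [-1.0, -2.0, -1.5, -2.5, -1.2]
tree_b = [-1.4, -1.9, -1.8, -2.4, -1.9]
print(kh_test(tree_a, tree_b))
```

A positive `delta` favors the first tree; the test asks whether that advantage is large relative to its site-to-site variability.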
TABLE 4. Likelihood scores and P values from Kishino–Hasegawa (1989) tests for the analysis of the entire data set. The most likely score is marked with an asterisk. 1 = Expected tree; 2 = most parsimonious tree.

                 JC                K2P               HKY85             GTR
Equal rates
  1              190,018.6565      189,228.2482      186,023.7146      184,936.447
  2              189,706.5707*     188,966.6047*     185,874.1456*     184,809.982*
  P              < 0.0001          < 0.0001          0.0004            0.0020
I
  1              184,834.253       183,988.9626      180,474.2233      179,573.3547
  2              184,611.3826*     183,811.3409*     180,403.5445*     179,525.1419*
  P              < 0.0001          < 0.0001          0.31              0.1265
Γ
  1              181,487.388       180,223.3956      175,238.975*      174,980.1973*
  2              181,34[?].7965*   180,146.4203*     175,252.093       175,003.3906
  P              < 0.0001          0.012             0.6165            0.3796
I + Γ
  1              181,340.654       180,109.353       175,160.793*      174,879.9709*
  2              181,204.7859*     180,030.772*      175,175.196       174,903.3324
  P              < 0.0001          0.0089            0.5739            0.3642

Kishino–Hasegawa tests were also carried out for a 5,566-bp subset of the data from which third codon positions and sites modally coding for isoleucine, leucine, and valine were excluded (Table 5). Under all models, the expected tree had a better score than the MPT topology from Figure 2. However, the difference in score did not become significant until both gamma-distributed rate heterogeneity and base composition were incorporated. We note that although the expected tree has a more likely score than the topology depicted in Figure 2 for this subset of the data, other tree topologies (that are neither the expected tree nor the tree shown in Fig. 2) have still better scores. Indeed, the MPT for this particular subset of sites is different from that resulting from an analysis of all 12,234 sites.

TABLE 5. Likelihood scores and P values from Kishino–Hasegawa (1989) tests for this subset of the data (third position, isoleucine, leucine, and valine sites excluded). The most likely score is marked with an asterisk. 1 = Expected tree; 2 = most parsimonious tree.

                 JC              K2P             HKY85           GTR
Equal rates
  1              59,463.524*     59,425.251*     59,340.893*     59,114.897*
  2              59,485.268      59,451.074      59,375.084      59,143.418
  P              0.5123          0.4367          0.2987          0.3870
I
  1              57,696.94*      57,650.961*     57,553.331*     57,314.134*
  2              57,711.336      57,669.307      57,579.049      57,335.5039
  P              0.5716          0.4715          0.3064          0.3998
Γ
  1              56,455.274*     56,394.168*     56,235.43*      56,105.484*
  2              56,483.425      56,425.445      56,271.685      56,142.104
  P              0.1645          0.1213          0.0686          0.0706
I + Γ
  1              56,454.932*     56,393.699*     56,235.09*      56,104.476*
  2              56,482.845      56,424.717      56,271.145      56,140.645
  P              0.1668          0.1231          0.0693          0.0728

Kishino–Hasegawa tests were carried out for a second subset of the data: those 1,207 sites maximally informative for the expected tree under parsimony (i.e., those with RI = 1.0). In this case the expected tree (Table 6) has a significantly better score than does the MPT (Fig. 2), in all 16 cases. Inclusion of extra parameters to accommodate among-site rate heterogeneity and base-compositional differences has no effect on the level of significance between the two trees tested, because sites with a perfect fit do not show appreciable among-site rate variation or uneven base composition.

TABLE 6. Likelihood scores and P values from Kishino–Hasegawa (1989) tests for the subset of the data using the 1,207 sites that are maximally informative for the expected tree under parsimony. The most likely score is marked with an asterisk. 1 = Expected tree; 2 = most parsimonious tree. Substitution models: JC, K2P, HKY85, GTR; rate models: equal rates, I, Γ, I + Γ. [Cell positions are not recoverable from the source layout; the 16 entries are listed in source order.]

  1               2               P
  9,221.3168*     9,275.9519      0.0001
  9,026.8563*     9,071.4046      0.0003
  9,023.3549*     9,068.3934      0.0003
  9,026.8563*     9,071.4046      0.0003
  9,023.3549*     9,068.3934      0.0003
  9,045.1688*     9,081.1777      0.0015
  9,023.3549*     9,068.3934      0.0003
  9,040.2982*     9,076.6480      0.0014
  9,015.3612*     9,059.9317      0.0003
  9,015.3612*     9,059.9317      0.0003
  9,015.3612*     9,059.9317      0.0003
  9,030.3767*     9,066.1219      0.0015
  9,006.7110*     9,052.3787      0.0003
  9,006.7110*     9,052.3787      0.0003
  9,006.7110*     9,052.3787      0.0003
  9,023.4482*     9,060.3618      0.0013

TABLE 7. Results of the Templeton tests comparing the expected tree with the most parsimonious tree. The shorter of two trees is marked with an asterisk.

Data subset                                 Length of         Length of Fig. 2     P value
                                            expected tree     topology (MPT)
Complete data (12,234 bp)                   46,058            45,734*              < 0.0001
No I, L, V, or 3rd positions (5,566 bp)     12,464            12,462*              0.9433
RI = 1.0 sites only (1,207 bp)              1,945*            1,996                < 0.0001
The results of the Templeton tests parallel those seen for the Kishino–Hasegawa (1989) tests (Table 7).
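The logic of a Templeton-style comparison can be sketched as a Wilcoxon signed-rank test on per-site step counts. The example below is an illustration under assumptions, not the procedure used for Table 7: it takes hypothetical per-site parsimony step counts for two trees, discards sites with identical counts, ranks the absolute differences (averaging tied ranks), and applies a normal approximation that is appropriate only when many sites differ. All names and input values are hypothetical.

```python
import math


def templeton_test(steps_tree_a, steps_tree_b):
    """Wilcoxon signed-rank comparison of per-site step counts for two trees.

    Returns (w, z, p): the signed-rank sum (positive when tree A needs more
    steps), its standardized value, and a two-tailed normal-approximation P.
    """
    diffs = [a - b for a, b in zip(steps_tree_a, steps_tree_b) if a != b]
    if not diffs:
        raise ValueError("the trees require identical step counts at every site")
    # Rank absolute differences in ascending order, averaging ranks over ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    # Under the null hypothesis each sign is equally likely, so Var(w) = sum(rank^2).
    z = w / math.sqrt(sum(r * r for r in ranks))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return w, z, p


# Hypothetical step counts at six sites for two competing trees.
print(templeton_test([2, 1, 3, 1, 2, 2], [1, 1, 1, 1, 1, 1]))
```

With only a handful of differing sites, exact signed-rank tables would be preferable to the normal approximation used here.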