Syst. Biol. 47(1):61-76, 1998

Amphioxus Mitochondrial DNA, Chordate Phylogeny, and the Limits of Inference Based on Comparisons of Sequences

GAVIN J. P. NAYLOR (1) AND WESLEY M. BROWN (2)

(1) Department of Zoology and Genetics, Iowa State University, Ames, Iowa 50011, USA; E-mail: gnaylor@iastate.edu
(2) Department of Biology, University of Michigan, Ann Arbor, Michigan 48109-1048, USA

Abstract. Analyses of both the nucleotide and amino acid sequences derived from all 13 mitochondrial protein-encoding genes (12,234 bp) of 19 metazoan species, including that of the lancelet Branchiostoma floridae ("amphioxus"), fail to yield the widely accepted phylogeny for chordates and, within chordates, for vertebrates. Given the breadth and the compelling nature of the data supporting that phylogeny, relationships supported by the mitochondrial sequence comparisons are almost certainly incorrect, despite their being supported by equally weighted parsimony, distance, and maximum-likelihood analyses. The incorrect groupings probably result in part from convergent base-compositional similarities among some of the taxa, similarities that are strong enough to overwhelm the historical signal. Comparisons among very distantly related taxa are likely to be particularly susceptible to such artifacts, because the historical signal is already greatly attenuated. Empirical results underscore the need for approaches to phylogenetic inference that go beyond simple site-by-site comparison of aligned sequences.
This study and others indicate that, once a sequence sample of reasonable size has been obtained, accurate phylogenetic estimation may be better served by incorporating knowledge of molecular structures and processes into inference models and by seeking additional higher order characters embedded in those sequences, than by gathering ever larger sequence samples from the same organisms in the hope that the historical signal will eventually prevail. [Amphioxus; chordate phylogeny; homoplasy; mtDNA; molecular systematics; phylogenetic inference.]

The practice of inferring evolutionary trees from DNA sequences has flourished in recent years, its credibility bolstered by the observation that phylogenies of well-studied groups are usually supported by sequence data. When the sequence of a particular gene or other well-defined DNA segment yields an inference congruent with an accepted relationship for a particular group, there is a tendency to regard that segment as reliable for phylogenetic inference and to use it to determine phylogenies for taxa whose relationships are unknown (Graybeal, 1994; Cho et al., 1995). However, from the beginning of such studies it was recognized that any DNA segment can only be useful over a limited divergence range; outside that range the historical signal would be either too undeveloped or too attenuated to be reliable. Furthermore, with an increase in the number of such studies it also became apparent that the useful range varied among different taxa. Thus, there are instances in which sequence data provide accurate assessments for some relationships, and erroneous ones for others (Felsenstein, 1978; Hillis, 1991; Kim, 1996; Philippe et al., 1994). The latter occur whenever the embedded historical signal is overturned by a stronger, homoplasious signal among the DNA sequences.
Various methods are used for phylogenetic reconstruction, each implying a different model of evolutionary change and emphasizing different aspects of the observed character-state covariation among taxa. It is common practice to regard a phylogeny that is supported by several different methods as correct, and especially so when statistical methods for evaluating the strength of support [e.g., bootstrapping (Felsenstein, 1985) and decay indices (Bremer, 1988; Donoghue et al., 1992)] are compelling. This stems from a tacit assumption that an incorrect phylogeny, even if it is the best fit to the available data, will not receive significant statistical support when the result itself is evaluated. That assumption is incorrect. Statistical evaluations merely assess the strength of the signal used to order the data hierarchically (Swofford et al., 1996). Thus, if there is a hierarchical signal in the data that arises from a nonhistorical source and if that signal is sufficiently strong, it can overwhelm not only a weaker historical signal, but also a statistical evaluation of the result.

The "total evidence" approach to phylogenetic reconstruction (Eernisse and Kluge, 1993) is currently among the most widely applied. It "uses character congruence to find the best fitting hypothesis for an unpartitioned set of synapomorphies, which is ideally all of the relevant available data" (Eernisse and Kluge, 1993). In its purest implementation the approach weights all characters equally in order to dispense with any need to identify different classes of information. Proponents maintain that partitioning evidence into classes is artificial, "because there is little reason to believe such categories are mind-independent categories with discoverable boundaries" (Eernisse and Kluge, 1993).
This somewhat narrow perspective has gained a following because it obviates a need to specify explicit (and often poorly known) processes about the way in which traits evolve. Advocates of the method espouse a view that homoplasy (character-state covariation among taxa due to influences other than shared history) will be randomly distributed with respect to taxa, and that hierarchically structured historical signal (character-state covariation among taxa due to shared history) will overshadow the homoplasy if enough data are collected (see Farris, 1983). In keeping with this view, it is assumed that any incorrect inferences will be due to stochastic error associated with an insufficiently large sample size of characters, and that they will disappear as more data are collected.

The sample size of sites required to ensure that the historical signal overturns the homoplasy depends to a large extent on the strength of the historical signal in a data set and on the grain size of the homoplasy. If the homoplasy is dispersed in a fine-grained fashion (that is, distributed in small "packets," each of which suggests a different nonhistorical association), it will likely appear randomly distributed at relatively small sample sizes. If homoplasy is dispersed in a coarse-grained fashion, so that several sites suggest the same nonhistorical grouping, then larger sample sizes (i.e., more sequence) will be required before patterns of homoplasy appear randomly distributed. In essence, the sample size of sites at which the randomness of homoplasy becomes apparent is dictated by the grain size of the homoplasy.
Thus, even if homoplasy were randomly distributed among taxa at the level of the entire genome (an assumption that has not been empirically tested), it would appear to be highly nonrandomly distributed within a particular sample of sites if its grain size were coarse and the sample of sites insufficiently large. Note, in the current context, that the term "grain size" has no spatial connotation. When we refer to homoplasy as "coarse-grained," we mean only that several sites within a fragment imply the same nonhistorical grouping; the misleading sites that collectively constitute a "packet" need not be spatially contiguous along the sequence.

The premise that homoplasy is randomly distributed or unstructured within data sets underlies the phylogenetically meaningful interpretations of bootstrapping, decay indices, and successive weighting (Farris, 1969). Groupings assessed as unreliable (i.e., those with little character support) are assumed to be due to chance, while those assessed as reliable are assumed to be so due to shared history. Unfortunately, if homoplasy is nonrandomly distributed or if it shows "systematic error" (Swofford et al., 1996), then analyses will not only yield erroneous phylogenetic inferences, but many of the tests designed to evaluate the reliability of their constituent nodes will lead to falsely confident assessments.

Given the appeal of the "total evidence," equally weighted parsimony approach, it would be useful to evaluate the extent to which its required assumption for random distribution of homoplasy is actually met by molecular data sets. When phylogeny is known, nonrandom distribution of homoplasy can be inferred when the data set strongly supports an incorrect tree. The strength of departure from randomness can be assessed by evaluating the level of bootstrap support, or the decay index for the incorrect groups, or by subjecting the data to a Templeton (1983) test.

Although no phylogeny is known with certainty, a number are very well supported, perhaps the best known being that for echinoderms plus chordates (see Maisey, 1986, 1988; Gauthier et al., 1988, and references therein). Complete mitochondrial genomes have been sequenced for representatives of several vertebrate classes, two echinoderm classes, and a number of outgroups. We have recently sequenced a mitochondrial DNA (mtDNA) from the lancelet Branchiostoma floridae ("amphioxus"), a species of Cephalochordata, the immediate sister taxon to the Craniata. Thus, complete mtDNA sequences are now available from representatives of most key lineages in vertebrate evolution. Phylogenetic inferences derived from comparisons of these sequences can be contrasted with the accepted phylogeny, providing an opportunity to examine the distribution patterns of homoplasy in mtDNA sequences of this group.

MATERIALS AND METHODS

We assembled complete mitochondrial sequences for 19 taxa (Table 1) whose phylogenetic relationships are noncontroversial (Fig. 1). The protein-encoding regions were aligned at the amino acid level using Clustal W (Thompson et al., 1994) and were checked for higher order structural concordance using the codon-coloring feature of Aligner (Eernisse, 1995). The resulting data set, consisting of 19 aligned 12,234-bp sequences, was subjected to a series of phylogenetic analyses using PAUP*4.0 version 53 (written by David Swofford). Snail, fruit fly, mosquito, and two nematode species were used as a collective outgroup for all analyses.

TABLE 1. Species used and their GenBank accession numbers.

Species name                     Common name      GenBank accession number
Mus musculus                     Mouse            J01420
Rattus norvegicus                Rat              X14848
Bos taurus                       Cow              J01394
Balaenoptera physalus            Fin-back whale   X61145
Balaenoptera musculus            Blue whale       X72204
Didelphis virginiana             Opossum          Z29573
Gallus gallus                    Chicken          X52392
Xenopus laevis                   Frog             M10217, X01600, X01601, X02890
Cyprinus carpio                  Carp             X61010
Oncorhynchus mykiss              Trout            L29771
Petromyzon marinus               Lamprey          U11880
Branchiostoma floridae           Lancelet         AF035164-AF035176
Paracentrotus lividus            Sea urchin 1     J04815
Strongylocentrotus purpuratus    Sea urchin 2     X12631
Drosophila yakuba                Fruit fly        X03240
Cepaea nemoralis                 Snail            U23045
Anopheles gambiae                Mosquito         L20934
Ascaris suum                     Nematode 1       X54253
Caenorhabditis elegans           Nematode 2       X54252

FIGURE 1. The expected pattern of phylogenetic relationships for 19 taxa. Branch lengths depicted are estimates from the fossil record (Benton, 1993). They reflect the earliest fossil occurrence assignable to the stem lineage of the extant form. In the case of the frog, the earliest fossil occurrence for the anuran stem group was used rather than the first fossil assignable to the amphibian grade.

We examined trees derived from equally weighted parsimony analysis for each of the 13 protein-encoding genes, both individually and in combination. Analyses were conducted at three levels: using all nucleotides; using transversions only; and using amino acid sequences. The degree of support for each node was evaluated using the bootstrap method of Felsenstein (1985).

TABLE 2. Inference errors resulting from bootstrap analysis of each gene individually and all genes combined. [Table body not recoverable from the source. Columns: ATP6, ATP8, CO1, CO2, CO3, CYTB, ND1, ND2, ND3, ND4, ND4L, ND5, ND6, and all genes combined. Row blocks: all nucleotide substitutions; transversions (RY coding: AG vs. CT); amino acids. Inferred groupings scored: (frog, fish); (chicken, fish); (chicken, frog); (frog, chicken, fish); (chicken, frog, fish, lamprey); (frog (fish, amniotes)); (opossum (frog, fish, amniotes)); (rodents (whales, cow, opossum)); (lamprey (lancelet (echinoderms, vertebrates))); (lancelet (echinoderms, vertebrates)); (lancelet (flies, echinoderms, vertebrates)).]

FIGURE 2. Most parsimonious trees (MPTs) resulting from equally weighted analysis of the combined data set, for nucleotides (top), transversions only (center), and amino acids (bottom). Bootstrap support percentages are shown at each node. Tree length and RI are shown to the right of each topology. Note that two MPTs result for the amino acid analysis. The topology depicted is the strict consensus of the two MPTs.

The nucleotide sequence data were also subjected to distance analyses using Jukes-Cantor (JC) (1969), Kimura two-parameter (K2P) (1980), Hasegawa-Kishino-Yano (HKY) (1985), and general time-reversible (GTR) (Lanave et al., 1984; Tavaré, 1986; Rodríguez et al., 1990) distances.
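As a concrete illustration of the two simplest distances named above, the following minimal sketch computes Jukes-Cantor and Kimura two-parameter distances for one pair of aligned sequences. It is not the procedure used in this study (the analyses were run in PAUP* and also included HKY, GTR, and rate-variation models); the function names and toy sequences are hypothetical, and the choice to skip sites with gaps or ambiguity codes is an assumption of the sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq_a, seq_b):
    """Return (sites compared, transitions, transversions) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n = ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: gaps and ambiguity codes are ignored
        n += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return n, ts, tv


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3), p = proportion of differing sites."""
    n, ts, tv = count_differences(seq_a, seq_b)
    p = (ts + tv) / n
    # math.log raises ValueError when p >= 0.75 (distance undefined at saturation).
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance: d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    n, ts, tv = count_differences(seq_a, seq_b)
    P, Q = ts / n, tv / n  # transition and transversion proportions
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


# Hypothetical toy alignment: one transition (A/G) and one transversion (A/C) in 10 sites.
seq_x = "AAAAAAAAAA"
seq_y = "GAAAAAAAAC"
print(jc69_distance(seq_x, seq_y), k2p_distance(seq_x, seq_y))
```

Both corrections assume equal base frequencies, which is one reason they are the least parameter-rich of the four distances listed.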
Each of the four distances was used in conjunction with four different models of among-site rate variation (ASRV) (Sullivan et al., 1995, 1996; Yang, 1996): (a) no rate variation; (b) a proportion of sites assumed to be invariant, I, with the remainder having equal rates (Fitch and Margoliash, 1967); (c) rate variation following a discrete approximation to a gamma distribution, Γ (Yang, 1994); and (d) a proportion of sites assumed to be invariant, with the remainder following a discrete approximation to a gamma distribution, I + Γ (Gu et al., 1995; Sullivan et al., submitted). In all, 16 (4 × 4) different models were investigated. Parameter values for each of the 16 models were obtained by fitting the expected tree to the data and optimizing values for that tree under maximum likelihood. Maximum-likelihood tests evaluating the fit of each of the models to the expected tree were carried out; the results are shown in the Appendix. Heuristic searches were conducted using maximum likelihood for the same 16 model conditions just described.

We fitted the expected tree to the data and measured the phylogenetic informativeness of each of the 12,234 sites for that tree using the retention index (RI; Archie, 1989; Farris, 1989). We measured base composition and its variation (deviation from stationarity) among the 19 taxa for the subset of sites with a perfect fit to the expected tree (those with RI = 1.0), and contrasted the values with those obtained for the entire population of sites. Principal-component analyses of nucleotide base composition and amino acid composition were plotted to provide a graphic representation of overall compositional similarities among taxa. Maximum-likelihood tests evaluating the fit of the 16 models to the expected tree were carried out for the subset of sites with an RI of 1.0 and are contrasted with similar analyses using all 12,234 sites (see Appendix). Kishino-Hasegawa (1989) tests contrasting the topology of the expected tree with that of the most parsimonious tree (MPT) yielded by the complete nucleotide data set are shown for different models in the Appendix.

Each site in the alignment was classified according to gene, codon position, amino acid (the modal amino acid across taxa in the alignment), chemical property, charge, and relative hydrophobicity of the modal amino acid for that site in the alignment. An analysis of variance assessing the effect of each of these six factors on phylogenetic informativeness (RI for the expected tree) was then carried out.

FIGURE 3. Comparison of base-compositional analysis of the 1,207 sites with a perfect fit (RI = 1.0) to the expected tree and of the entire population of sites. (a) Expected tree (left); MPT yielded by equally weighted analysis of the entire data set (right). (b) Corresponding base-compositional profiles for each data set. Note the more balanced distribution of the four nucleotides in the subset of sites with a perfect fit (RI = 1.0) to the expected tree. (c) Deviation from stationarity among ingroup taxa was assessed using a chi-squared test. Base-compositional differences are significantly different from random expectation for the entire population of sites (P < 0.0000001), but are not significantly different for the subset of sites with a perfect fit to the expected tree. These tests are intended only as coarse heuristics and do not account for phylogenetic structure (Swofford, 1997).

FIGURE 4. Compositional similarity among taxa for nucleotide bases and amino acid residues. Principal-component plots (PC1 vs. PC2) allow immediate identification of compositional similarity among the 19 taxa. The first two components account for 98% of the variation in nucleotide base composition and 67% of the variation in amino acid composition. All analyses are based on the correlation matrix. Solid circles represent mammals, open circles nonmammalian vertebrates, and solid triangles invertebrates. Number-species correspondences: 1 = fin-back whale, 2 = blue whale, 3 = cow, 4 = rat, 5 = mouse, 6 = opossum, 7 = chicken, 8 = frog, 9 = trout, 10 = carp, 11 = lamprey, 12 = lancelet, 13 = sea urchin 1, 14 = sea urchin 2, 15 = mosquito, 16 = fruit fly, 17 = snail, 18 = nematode 1, 19 = nematode 2. Note that in both plots the two sea urchin taxa (13 and 14) are closer to the vertebrate taxa than is the lancelet (12).

RESULTS

In the equally weighted parsimony analyses, none of the genes, either individually or in combination, yielded the expected tree, nor were any of the fully resolved trees resulting from bootstrap resampling of the individual genes consistent with the expected tree. There was considerable consistency among genes in the pattern of inferred errors (Table 2). When all substitutions were analyzed, 10 of the 13 genes indicated Branchiostoma to be the sister taxon to a (vertebrate + echinoderm) clade, 4 genes (ATP6, CO1, ND4L, and ND6) indicated chicken to be the sister taxon to fishes, and 5 genes (ATP6, CO1, CO3, ND5, and ND6) indicated a monophyletic (frog, fish, chicken) clade. These repeated error patterns imply that homoplasy is highly nonrandomly distributed. The bootstrap consensus trees from the transversion analyses were less resolved and had fewer conflicts with the accepted tree; however, CO1, CO2, and ND2 indicated Branchiostoma to be outside a (vertebrate + echinoderm) clade, and CO1 and ND5 indicated a (frog, fish, chicken) clade. There was less consistency in the pattern of errors with the amino acid sequences; nevertheless, some of the same inference errors seen in the nucleotide analyses resurfaced.

When all genes were combined, the MPT and the corresponding bootstrap consensus differed from the expected tree at all three (nucleotide, transversion, and amino acid) levels of analysis, and the incorrect groupings often had high levels of bootstrap support (see Fig. 2). Distance and maximum-likelihood analyses of the complete nucleotide data set failed to yield the expected topology for any one of the 16 model/ASRV combinations tested. These analyses, like the parsimony analysis (Fig. 2), all placed Branchiostoma outside echinoderms and the frog, fish, and chicken in a clade of their own. Although no single analysis yielded the expected tree, the more parameter-rich models (HKY and GTR with ASRV) yielded trees that were not significantly different from the expected tree when subjected to Kishino-Hasegawa (1989) tests (see Appendix). This suggests an improved fit between model and the data for the parameter-rich models.

That the entire protein-encoding portion of the mtDNA, a total of 12,234 sites, yields an inference that is both incorrect and supported by high bootstrap values in an equally weighted parsimony analysis is sobering. The fact that distance and maximum-likelihood analyses of the data under a variety of models (in which rate matrix parameters were optimized by first fitting the expected tree to the data set) also fail to yield the expected tree supports our original supposition that homoplasy is nonrandomly distributed within this large sample of sites.
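The per-site retention index used in the Materials and Methods to score fit to the expected tree can be sketched as follows. This is a minimal illustration only, assuming unordered character states, a rooted, fully binary tree written as nested tuples, and Fitch parsimony for counting steps; the function names and the four-taxon example are hypothetical and are not taken from the study. For one site, RI = (g - s)/(g - m), where s is the number of steps required on the tree, m is the minimum conceivable number of steps (number of states minus one), and g is the maximum (number of taxa minus the count of the most common state).

```python
from collections import Counter


def fitch_steps(tree, states):
    """Minimum number of state changes for one site on a rooted binary tree.

    tree   -- nested 2-tuples whose leaves are taxon names (strings)
    states -- dict mapping each taxon name to its character state at this site
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):          # leaf: state set is the observed state
            return {states[node]}
        left, right = node                 # assumption: every internal node is binary
        a, b = visit(left), visit(right)
        if a & b:                          # shared state: no change needed here
            return a & b
        steps += 1                         # disjoint sets: one change is implied
        return a | b

    visit(tree)
    return steps


def retention_index(tree, states):
    """Retention index of one site on a fixed tree, or None if the site is uninformative."""
    counts = Counter(states.values())
    m = len(counts) - 1                    # minimum possible steps
    g = len(states) - max(counts.values()) # maximum possible steps
    if g == m:
        return None                        # RI is undefined (e.g., constant or singleton sites)
    s = fitch_steps(tree, states)
    return (g - s) / (g - m)


# Hypothetical four-taxon example.
example_tree = (("t1", "t2"), ("t3", "t4"))
perfect_site = {"t1": "A", "t2": "A", "t3": "G", "t4": "G"}      # fits the tree exactly
conflicting_site = {"t1": "A", "t2": "G", "t3": "A", "t4": "G"}  # fully homoplasious
print(retention_index(example_tree, perfect_site))
print(retention_index(example_tree, conflicting_site))
```

Sites for which this function returns 1.0 correspond to the "perfect fit" subset referred to in the text; applying it across an alignment would require one `states` dictionary per column.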
It is possible that most of the structured or misleading homoplasy is concentrated within a few genes. Indeed, we found that when we subjected a combined data set comprising amino acid sequences from ND1, ND4, CO1, CO2, CO3, and CYTB (2,302 amino acid sites) to an equally weighted parsimony bootstrap analysis, the expected tree resulted with 100% bootstrap support for all but three nodes. However, the utility of this finding is questionable, since the genes yielding correct results might vary among data sets and thus not be determinable a priori.

Collective Properties of Sites with a Perfect Fit to the Expected Tree

When the sequence data were fitted to the topology of the expected tree, we identified 1,207 phylogenetically informative sites with a perfect fit to that tree (i.e., 1,207 sites with an RI of 1.0). Base composition for this subset of sites was less skewed and showed no significant deviation from stationarity for the ingroup taxa, in marked contrast to the situation observed for the entire population of sites (Fig. 3). Moreover, in principal-component plots of nucleotide base composition and amino acid composition (Fig. 4), the vertebrate taxa have profiles that are clearly more similar to those of the two echinoderms than to that of Branchiostoma. These results are consistent with the prediction, made on the basis of simulations (Saccone et al., 1989, 1990, 1993; Steel et al., 1993; Lockhart et al., 1994; Steel, 1994; Pesole et al., 1995), that base-compositional deviations from stationarity can result in hierarchically structured homoplasy and, consequently, lead to incorrect phylogenetic inference.

We emphasize, however, that the base-compositional differences do not completely account for the inference errors in this data set. The base-composition plots for the subset of sites with a perfect fit to the expected tree, although markedly different from those for the entire data set, also show the vertebrate taxa to have profiles that are more similar to those of the two echinoderms than to that of Branchiostoma. There is thus no simple additive correspondence between base-compositional bias and the inferred phylogeny. Furthermore, erroneous inferences are not ameliorated by LogDet neighbor-joining analysis, a procedure demonstrated through simulation to retrieve correct phylogenies in the face of nonstationary base compositions when sites are independent (Steel et al., 1993; Lockhart et al., 1994; Steel, 1994). This is the case even when a proportion of sites are assumed to be invariant to accommodate bias due to among-site rate heterogeneity (Waddell, 1995; Swofford et al., 1996).

FIGURE 5. Tests of the association between functional characteristics and the phylogenetic informativeness of a site when the combined data set is fitted to the expected tree. The degree of informativeness was assessed using RI. Analysis of variance indicates that all six factors (gene, codon position, amino acid, chemical property, charge, and hydrophobicity) have highly significant effects on RI (log-transformed). The relative effects of the different levels of each factor are plotted against log RI (ordinate) as response sample means. Bars correspond to one standard error.

FIGURE 6. MPTs based on functional subsets of amino acids. Bootstrap support percentages are shown at each node. (a) Strict consensus of two MPTs resulting from the analysis of first and second codon positions for sites whose modal amino acid was proline or cysteine. (b) Single MPT resulting from analysis of first and second codon positions for sites whose modal amino acid was proline, cysteine, methionine, glutamine, or asparagine (the imino, sulfur, and amide side-chain groups, respectively).

Relationship Between Function and Phylogenetic Informativeness

Analysis of variance indicated that all six factors tested (gene, codon position, amino acid, chemical properties, charge, and relative hydrophobicity) have highly significant effects (P < 0.0005) on the phylogenetic informativeness (RI) of a site (Fig. 5). This result reflects the importance of these properties to molecular structure and function. Significant interaction terms were found among some of the properties. For example, first positions had markedly higher RIs for hydrophilic than for hydrophobic sites, an association not seen at second or third positions. An interaction was also seen between gene and codon position (P < 0.005). Effect tests for this interaction revealed that third position sites had significantly higher RIs (P < 0.05) in ATP8 and ND4L than in other genes, suggesting that third codon position constraints may differ among genes.

Based on these analyses, we were able to identify classes of sites that yielded the expected tree when subjected to parsimony analysis. The greatest overall support resulted from an analysis of first and second codon positions of sites modally coding for proline, cysteine, methionine, glutamine, and asparagine. Parsimony analysis of the first two sites of all codons in positions modally coding for proline and cysteine yielded an incompletely resolved bootstrap consensus tree that was compatible with the expected vertebrate tree and had 65% bootstrap support for a monophyletic Chordata (cephalochordates + vertebrates; Fig. 6a).
When the first two sites of all codons in positions modally coding for methionine, glutamine, and asparagine were added to this analysis, the expected tree was obtained in fully resolved form, with strengthened (85%) bootstrap support for a monophyletic Chordata (Fig. 6b). Although there is an undeniable element of circularity involved in using the expected tree to determine sites that are informative, it is interesting and probably significant that those we identified are associated with conservative molecular motifs that are frequently important for protein structure and function. By contrast, analysis shows sites modally coding for the rapidly evolving hydrophobic amino acids leucine, isoleucine, and valine (Fig. 5) to have especially poor fits to the expected tree. Although poor fits are generally thought to be associated with saturated sites that have lost their signal, our analysis suggests something more problematic for phylogenetic inference: These sites have not only lost their historical signal, but contain a nonrandom signal that is misleading. Interestingly, a Templeton test indicates that the MPT (Fig. 2) is significantly (P < 0.0001) different from the expected tree (Fig. 1) when all 12,234 sites are included in the analysis, but not significantly different (P = 0.94) when isoleucine, leucine, valine, and third position sites are excluded. Similar results are seen with Kishino-Hasegawa (1989) tests. Details are presented in the Appendix.

It is possible that further work may show some of the patterns identified here to be more widespread. At present, however, we regard them as specific to this study and, at best, applicable only to studies using sequences from these same genes among metazoan taxa over a comparable range of divergence.
Had we analyzed this same set of taxa using sequences from a different set of genes (e.g., genes for monomeric enzymes of the cytosol), different classes of informative sites might have been obtained, and a comparison of these same genes from more recently diverged taxa would almost certainly yield a different suite of informative sites. We also acknowledge that a denser sampling of echinoderm and chordate taxa for the same set of genes would likely change (and possibly improve) the phylogenetic estimate based on the entire data set (Lecointre et al., 1993; Hillis, 1996; Kim, 1996).

CONCLUSIONS

The assumption that historical signal will prevail if enough sites are sampled is widely held among evolutionary and systematic biologists. It is explicitly championed by the "total evidence parsimony" school and is often implicit in the work of those who embrace evolutionary models (Churchill et al., 1992; Huelsenbeck and Hillis, 1993). For example, Cummings et al. (1995) attempted to determine a sequence-sampling strategy that would approximate inferences yielded by entire mtDNAs, believing that the inferences yielded by the entire sequence would be more "reliable" than would any particular subsample. Russo et al. (1996), in evaluating the performance of different phylogenetic inference methods, stated: "The most important factor in constructing reliable phylogenetic trees seems to be the number of amino acids or nucleotides used." Results presented in the current study demonstrate that there are circumstances in which this is simply not the case. Despite a very large sample (12,234 protein-coding sites, the maximum obtainable from metazoan mtDNA), an erroneous yet robust topology resulted, one contradicted by a wealth of other data.
Clearly, the models underlying inference methods, whether implicit as is the case for parsimony or explicit as is the case for distance and maximum-likelihood models, are not accommodating the processes that have shaped the data. In the present data set, several methods actually converge on an incorrect topology as more sequence is added. These results are consistent with predictions based on simulations by Huelsenbeck and Hillis (1993). More data are better than fewer data only when the inference model accommodates, in an unbiased way, the evolutionary forces that have shaped character-state distributions. Any disparities (biases) that exist between a model (implied or explicit) and the evolutionary process will be magnified with increasing amounts of data.

This study provides an empirical demonstration that further sequencing does not automatically lead to an improved phylogenetic estimate. Once sequences from a few genes have been obtained, we believe that time and effort would be better spent investigating how knowledge of the structures and functions of those sequences and the products they encode can be integrated and incorporated into phylogenetic inference methods than adding more sequence data. In stating this, it is not our intent to discourage sequencing efforts, but to emphasize that it is useful to incorporate knowledge about what a sequence does as well as about what it is into the inference models we use. Evolutionary biologists rarely analyze information contained in sequence data beyond an aggregate pooling of information derived from individual nucleotide sites, even though such information is available for many of the sequences that are routinely used for phylogenetic inference.
The structural and functional attributes of a particular gene product persist and can often be followed long after the historical signal in the underlying individual sequence elements has been lost. It is becoming increasingly possible to empirically assess character-state change probabilities for sites associated with such structural and functional attributes. Once these have been estimated for a particular gene, they can be incorporated into methods of inference in much the same way as has been done with estimates of relative rates of transitions and transversions. Comparisons that make use of such information may ultimately provide the key to resolving phylogenetic questions, such as those involving relationships among deeply diverged groups, that are unresolvable by analysis of the individual sequence elements themselves.

ACKNOWLEDGMENTS

We are grateful to Stan Blum, Susan Brown, Tim Collins, Elizabeth Knurek, Fred Kraus, Christian Pazmandi, Chris Simon, Una Smith, Jack Sullivan, and Dave Swofford for critical comments. This work was supported by National Science Foundation grant DEB 9220640 to W.M.B. and by a Sloan Postdoctoral Fellowship to G.J.P.N.

REFERENCES

Archie, J. W. 1989. Homoplasy excess ratios: New indices for measuring levels of homoplasy in phylogenetic systematics and a critique of the consistency index. Syst. Zool. 38:253–269.
Benton, M. J. 1993. The fossil record 2. Chapman and Hall, London.
Bremer, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42:795–803.
Cho, S., A. Mitchell, J. C. Regier, C. Mitter, R. W. Poole, T. P. Friedlander, and S. Zhao. 1995. A highly conserved nuclear gene for low-level phylogenetics: Elongation factor-1α recovers morphology-based tree for heliothine moths. Mol. Biol. Evol. 12:650–656.
Churchill, G. A., A. von Haeseler, and W. C. Navidi. 1992. Sample size for a phylogenetic inference. Mol. Biol. Evol. 9:753–769.
Cummings, M. P., S. P. Otto, and J. Wakeley. 1995. Sampling properties of DNA sequence data in phylogenetic analysis. Mol. Biol. Evol. 12:814–822.
Donoghue, M. J., R. G. Olmstead, J. F. Smith, and J. D. Palmer. 1992. Phylogenetic relationships of Dipsacales based on rbcL sequences. Ann. Missouri Bot. Garden 79:333–345.
Eernisse, D. J., and A. Kluge. 1993. Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules and morphology. Mol. Biol. Evol. 10:1170–1195.
Eernisse, D. J. 1995. DNA Stacks: HyperCard software utilities for molecular systematists, version 1.1. Published electronically. Available at ftp://ftp.biology.indiana.edu.
Farris, J. S. 1969. A successive approximation approach to character weighting. Syst. Zool. 18:374–385.
Farris, J. S. 1983. The logical basis of phylogenetic analysis. Pages 7–36 in Advances in cladistics, Volume II (N. I. Platnick and V. A. Funk, eds.). Columbia Press, New York.
Farris, J. S. 1989. The retention index and the rescaled consistency index. Cladistics 5:417–419.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–416.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783–791.
Fitch, W. M., and E. Margoliash. 1967. A method for estimating the number of invariant amino acid positions in a gene using cytochrome c as a model case. Biochem. Genet. 1:65–71.
Gauthier, J., A. G. Kluge, and T. Rowe. 1988. Amniote phylogeny and the importance of fossils. Cladistics 4:105–209.
Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.
Graybeal, A. 1994. Evaluating the phylogenetic utility of genes: A search for genes informative about deep divergences among vertebrates. Syst. Biol. 43:174–193.
Gu, X., Y.-X. Fu, and W.-H. Li. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546–557.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 21:160–174.
Hillis, D. M. 1991. Discriminating between phylogenetic signal and random noise in DNA sequences. Pages 278–294 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130–131.
Huelsenbeck, J. P., and D. M. Hillis. 1993. Success of phylogenetic methods in the four-taxon case. Syst. Biol. 42:247–264.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pages 21–132 in Mammalian protein metabolism (H. N. Munro, ed.). Academic Press, New York.
Kim, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363–374.
Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order of the Hominoidea. J. Mol. Evol. 29:170–179.
Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86–93.
Lecointre, G., H. Philippe, H. L. Van Lê, and H. Le Guyader. 1993. Species sampling has a major impact on phylogenetic inference. Mol. Phyl. Evol. 2:205–224.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612.
Maisey, J. G. 1986. Heads and tails: A chordate phylogeny. Cladistics 2:201–256.
Maisey, J. G. 1988. Phylogeny of early vertebrate skeletal induction and ossification patterns. Evolutionary biology, Volume 22 (M. Hecht, B. Wallace, and G. T. Prance, eds.). Plenum, New York.
Pesole, G., G. Dellisanti, G. Preparata, and C. Saccone. 1995. The importance of base composition in the correct assessment of genetic distance. J. Mol. Evol. 41:1124–1127.
Philippe, H., A. Chenuil, and A. Adoutte. 1994. Can the Cambrian explosion be inferred through molecular phylogeny? Development (suppl.):15–25.
Rodríguez, F., J. L. Oliver, A. Marín, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485–501.
Russo, C. A. M., N. Takezaki, and M. Nei. 1996. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13:525–536.
Saccone, C., G. Pesole, and G. Preparata. 1989. DNA microenvironments and the molecular clock. J. Mol. Evol. 29:407–411.
Saccone, C., C. Lanave, G. Pesole, and G. Preparata. 1990. Influence of base composition on quantitative estimates of gene evolution. Methods Enzymol. 183:570–583.
Saccone, C., C. Lanave, and G. Pesole. 1993. Time and biosequences. J. Mol. Evol. 37:154–159.
Steel, M. A., P. J. Lockhart, and D. Penny. 1993. Confidence in evolutionary trees from biological sequence data. Nature 364:440–442.
Steel, M. A. 1994. Recovering a tree from the leaf colorations it generates under a Markov model. Appl. Math. Lett. 7:19–23.
Sullivan, J., K. E. Holsinger, and C. Simon. 1995. Among-site rate variation and phylogenetic analysis of 12S rRNA data in sigmodontine rodents. Mol. Biol. Evol. 12:988–1001.
Sullivan, J., K. E. Holsinger, and C. Simon. 1996. The effect of topology on estimates of among site rate variation. J. Mol. Evol. 42:308–312.
Sullivan, J., D. L. Swofford, and G. J. P. Naylor. Uncertainty in estimating parameters of mixed-distribution models of rate heterogeneity. (submitted to Syst. Biol.)
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics, 2nd edition (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.
Tavaré, S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Lec. Math. Life Sci. 17:57–86.
Templeton, A. R. 1983. Convergent evolution and non-parametric inferences from restriction fragment and DNA sequence data. Pages 151–179 in Statistical analysis of DNA sequence data (B. Weir, ed.). Marcel Dekker, New York.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Waddell, P. J. 1995. Statistical methods of phylogenetic analysis, including Hadamard conjugations, LogDet transforms and maximum likelihood. Ph.D. Dissertation, Massey Univ., New Zealand.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306–314.
Yang, Z. 1996. Among site rate variation and its impact on phylogenetic analyses. TREE 11:367–372.

Received 13 March 1997; accepted 31 July 1997
Associate Editor: C. Simon

APPENDIX

MODELS OF SUBSTITUTION

Four different substitution models of increasing complexity were evaluated. The simplest, the Jukes–Cantor (1969) model, assumes both an even base composition and an equal probability of change for all six transformation types. The Kimura (1980) two-parameter model assumes equal base frequencies but allows a transition:transversion ratio to be specified. The Hasegawa–Kishino–Yano (1985) model allows for an uneven base composition and a transition:transversion ratio. The general time-reversible model (Lanave et al., 1984; Tavaré, 1986; Rodríguez et al., 1990) allows for an uneven base composition and separate probabilities of change for each of the six possible transformation types. None of the four models accommodates deviation from stationarity in either base composition or substitution dynamics.
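The nesting of the four substitution models described above can be illustrated with a short Python sketch that builds each instantaneous rate matrix from a set of exchangeabilities and base frequencies. The function names, the parameterization q_ij = r_ij π_j, and the rescaling to one expected substitution per unit time are assumptions made for this example; they follow a common textbook formulation and are not necessarily what the software used for the analyses implements.

```python
BASES = "ACGT"
TRANSITION_PAIRS = {"AG", "CT"}  # purine<->purine and pyrimidine<->pyrimidine

# Free parameters of each substitution model (branch lengths excluded).
FREE_PARAMS = {"JC": 0, "K2P": 1, "HKY85": 4, "GTR": 8}


def rate_matrix(exchange, freqs):
    """Return a 4x4 time-reversible rate matrix with q_ij = r_ij * pi_j."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = exchange["".join(sorted(x + y))] * freqs[j]
        q[i][i] = -sum(q[i])  # each row must sum to zero
    # Rescale so that the expected rate is one substitution per unit time.
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(4))
    return [[v / mean_rate for v in row] for row in q]


def exchangeabilities(kappa=1.0):
    """Exchange rates: transitions weighted by kappa, transversions by 1."""
    pairs = ["AC", "AG", "AT", "CG", "CT", "GT"]
    return {p: (kappa if p in TRANSITION_PAIRS else 1.0) for p in pairs}


def jc69():
    """Even base composition, one rate for all six transformation types."""
    return rate_matrix(exchangeabilities(), [0.25] * 4)


def k2p(kappa):
    """Even base composition, transition:transversion rate ratio kappa."""
    return rate_matrix(exchangeabilities(kappa), [0.25] * 4)


def hky85(kappa, freqs):
    """Uneven base composition plus a transition:transversion rate ratio."""
    return rate_matrix(exchangeabilities(kappa), freqs)


def gtr(exchange, freqs):
    """Uneven base composition and a separate rate for each of six pairs."""
    return rate_matrix(exchange, freqs)
```

Under these assumptions, `hky85(4.0, [0.1, 0.2, 0.3, 0.4])` returns a matrix whose rows sum to zero and that satisfies detailed balance (π_i q_ij = π_j q_ji). The parameter counts in `FREE_PARAMS`, plus one parameter each for I and Γ, appear consistent with the degrees of freedom listed in Table 3 (e.g., 10 − 4 − 1 = 5 for HKY85 + I).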
Four among-site rate-heterogeneity models were evaluated for each of the foregoing substitution models: (a) equal rates; (b) a proportion of sites assumed to be invariant among taxa, the remainder assumed to evolve at equal rates (I; Fitch and Margoliash, 1967); (c) rates assumed to follow a discrete approximation of the gamma distribution (Γ; Yang, 1994); (d) a proportion of sites assumed to be invariant, the remainder to follow a discrete approximation of the gamma distribution (I + Γ; Gu et al., 1995). Thus, 16 (4 × 4) combinations of substitution model and among-site rate-variation model were evaluated. Swofford et al. (1996) have pointed out the tradeoffs between the consistency provided by a model's complexity and its sensitivity to random error. In general, it is desirable to use the simplest effective model to explain observations, that is, to choose a model that has enough parameters to explain the data satisfactorily, but not so many that statistical power is compromised.

TABLE 3. Likelihood-ratio test values for different substitution models.

Model            df    −log likelihood    χ²
JC               10    190,018.657        30,277.3712
JC + I            9    184,834.253        19,908.5647
JC + Γ            9    181,487.388        13,214.8342
JC + I + Γ        8    181,340.654        12,921.3662
K2P               9    189,228.248        28,696.5546
K2P + I           8    183,988.963        18,217.9833
K2P + Γ           8    180,223.396        10,686.8493
K2P + I + Γ       7    180,109.353        10,458.7644
HKY85             6    186,023.715        22,287.4874
HKY85 + I         5    180,474.223        11,188.5048
HKY85 + Γ         5    175,238.975           718.0086
HKY85 + I + Γ     4    175,160.793           561.6445
GTR               2    184,936.447        20,112.9514
GTR + I           1    179,573.355         9,386.7675
GTR + Γ           1    174,980.197           200.4527
GTR + I + Γ       0    174,879.971             0

Which Model Best Explains the Data?

In order to identify the most appropriate substitution model, the −ln likelihood scores for the expected tree were compared for the 16 different models.
A likelihood-ratio test statistic was computed and contrasted with a chi-squared approximation of the null distribution (Goldman, 1993). Results (Table 3) indicate that all models fit the data significantly worse (P < 0.01) than does the parameter-rich GTR + I + Γ model. This should not be interpreted to mean that the most parameter-rich model will find the expected tree (in fact it does not), but rather that simpler models will fare even more poorly. A similar test was carried out for the subset of 1,207 sites maximally informative for the expected tree under parsimony (those with RI = 1.0). The results were comparable to those obtained for the entire data set, insofar as all models fitted the data significantly less well than did the parameter-rich GTR + I + Γ model. However, χ² values were much lower for the maximally informative subset of data, with values ranging from 17.3 to 40.3 (cf. 200.5–30,277.4). This is probably because the maximally informative sites can be more easily reconciled to the expected tree with simple models. (These sites collectively exhibit a more even base composition and less deviation from stationarity and thus do not require extra parameters to accommodate base-compositional unevenness and among-site rate variation.)

Under Which Substitution Models Is the Expected Tree Significantly Different From the Most-Parsimonious Tree?

Kishino–Hasegawa (1989) tests were conducted to contrast the likelihood score for the MPT resulting from equally weighted parsimony (Fig. 2) with that for the expected tree (Fig. 1) under the 16 models for all 12,234 sites. In no case (Table 4) did the expected tree fit the data significantly better. For the simpler models (JC and K2P), the MPT had a significantly better score than the expected tree.
However, as more parameter-rich models, accommodating base composition (HKY85, GTR) and among-site rate heterogeneity (I + Γ), were used, the differences in the −ln likelihood scores diminished in significance. Results show that both rate heterogeneity and base composition must be incorporated before the MPT and the expected tree are no longer significantly different.

TABLE 4. Likelihood scores and P values from Kishino–Hasegawa (1989) tests for the analysis of the entire data set. The most likely score is marked with an asterisk. 1 = Expected tree; 2 = most parsimonious tree.

                   JC               K2P              HKY85            GTR
Equal rates  1 =   190,018.6565     189,228.2482     186,023.7146     184,936.447
             2 =   189,706.5707*    188,966.6047*    185,874.1456*    184,809.982*
                   P < 0.0001       P < 0.0001       P = 0.0004       P = 0.0020
I            1 =   184,834.253      183,988.9626     180,474.2233     179,573.3547
             2 =   184,611.3826*    183,811.3409*    180,403.5445*    179,525.1419*
                   P < 0.0001       P < 0.0001       P = 0.31         P = 0.1265
Γ            1 =   181,487.388      180,223.3956     175,238.975*     174,980.1973*
             2 =   181,34?.7965*    180,146.4203*    175,252.093      175,003.3906
                   P < 0.0001       P = 0.012        P = 0.6165       P = 0.3796
I + Γ        1 =   181,340.654      180,109.353      175,160.793*     174,879.9709*
             2 =   181,204.7859*    180,030.772*     175,175.196      174,903.3324
                   P < 0.0001       P = 0.0089       P = 0.5739       P = 0.3642

Kishino–Hasegawa tests were also carried out for a 5,566-bp subset of the data from which third codon positions and sites modally coding for isoleucine, leucine, and valine were excluded (Table 5). Under all models, the expected tree had a better score than the MPT topology from Figure 2. However, the difference in score did not become significant until both gamma-distributed rate heterogeneity and base composition were incorporated. We note that although the expected tree has a more likely score than the topology depicted in Figure 2 for this subset of the data, other tree topologies (that are neither the expected tree nor the tree shown in Fig. 2) have still better scores. Indeed, the MPT for this particular subset of sites is different from that resulting from an analysis of all 12,234 sites.

TABLE 5. Likelihood scores and P values from Kishino–Hasegawa (1989) tests for the subset of the data with third position, isoleucine, leucine, and valine sites excluded. The most likely score is marked with an asterisk. 1 = Expected tree; 2 = most parsimonious tree.

                   JC               K2P              HKY85            GTR
Equal rates  1 =   59,463.524*      59,425.251*      59,340.893*      59,114.897*
             2 =   59,485.268       59,451.074       59,375.084       59,143.418
                   P = 0.5123       P = 0.4367       P = 0.2987       P = 0.3870
I            1 =   57,696.94*       57,650.961*      57,553.331*      57,314.134*
             2 =   57,711.336       57,669.307       57,579.049       57,335.5039
                   P = 0.5716       P = 0.4715       P = 0.3064       P = 0.3998
Γ            1 =   56,455.274*      56,394.168*      56,235.43*       56,105.484*
             2 =   56,483.425       56,425.445       56,271.685       56,142.104
                   P = 0.1645       P = 0.1213       P = 0.0686       P = 0.0706
I + Γ        1 =   56,454.932*      56,393.699*      56,235.09*       56,104.476*
             2 =   56,482.845       56,424.717       56,271.145       56,140.645
                   P = 0.1668       P = 0.1231       P = 0.0693       P = 0.0728

TABLE 6. Likelihood scores and P values from Kishino–Hasegawa (1989) tests for the subset of the data using the 1,207 sites that are maximally informative for the expected tree under parsimony. The most likely score is marked with an asterisk. 1 = Expected tree; 2 = most parsimonious tree.

                   JC               K2P              HKY85            GTR
Equal rates  1 =   9,221.3168*      9,023.3549*      9,015.3612*      9,006.7110*
             2 =   9,275.9519       9,068.3934       9,059.9317       9,052.3787
                   P = 0.0001       P = 0.0003       P = 0.0003       P = 0.0003
I            1 =   9,026.8563*      9,023.3549*      9,015.3612*      9,006.7110*
             2 =   9,071.4046       9,068.3934       9,059.9317       9,052.3787
                   P = 0.0003       P = 0.0003       P = 0.0003       P = 0.0003
Γ            1 =   9,026.8563*      9,023.3549*      9,015.3612*      9,006.7110*
             2 =   9,071.4046       9,068.3934       9,059.9317       9,052.3787
                   P = 0.0003       P = 0.0003       P = 0.0003       P = 0.0003
I + Γ        1 =   9,045.1688*      9,040.2982*      9,030.3767*      9,023.4482*
             2 =   9,081.1777       9,076.6480       9,066.1219       9,060.3618
                   P = 0.0015       P = 0.0014       P = 0.0015       P = 0.0013

TABLE 7. Results of the Templeton tests comparing the expected tree with the most parsimonious tree. The shorter of the two trees is marked with an asterisk.

Data subset                                 Length of expected tree    Length of Fig. 2 topology (MPT)    P value
Complete data (12,234 bp)                   46,058                     45,734*                            < 0.0001
No I, L, V, or 3rd positions (5,566 bp)     12,464                     12,462*                            0.9433
RI = 1.0 sites only (1,207 bp)              1,945*                     1,996                              < 0.0001

Kishino–Hasegawa tests were carried out for a second subset of the data: those 1,207 sites maximally informative for the expected tree under parsimony (i.e., those with RI = 1.0).
In this case the expected tree (Table 6) has a significantly better score than does the MPT (Fig. 2), in all 16 cases. Inclusion of extra parameters to accommodate among-site rate heterogeneity and base-compositional differences has no effect on the level of significance between the two trees tested, because sites with a perfect fit do not show appreciable among-site rate variation or uneven base composition. The results of the Templeton tests parallel those seen for the Kishino–Hasegawa (1989) tests (Table 7).
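The likelihood-ratio statistics reported in Table 3 can be recomputed from the −ln likelihood scores as twice the difference between each simpler model and GTR + I + Γ. The following Python sketch does so, using the scores and degrees of freedom as transcribed from Table 3 and comparing each statistic with the 1% critical value of the chi-squared distribution. The function and variable names are hypothetical, "G" stands for Γ in the model labels, and the critical values are standard tabulated ones, not values taken from the study.

```python
# (df, -ln likelihood) for the expected tree, transcribed from Table 3.
# "G" in the labels stands for the discrete gamma model.
TABLE3 = {
    "JC": (10, 190018.657), "JC+I": (9, 184834.253),
    "JC+G": (9, 181487.388), "JC+I+G": (8, 181340.654),
    "K2P": (9, 189228.248), "K2P+I": (8, 183988.963),
    "K2P+G": (8, 180223.396), "K2P+I+G": (7, 180109.353),
    "HKY85": (6, 186023.715), "HKY85+I": (5, 180474.223),
    "HKY85+G": (5, 175238.975), "HKY85+I+G": (4, 175160.793),
    "GTR": (2, 184936.447), "GTR+I": (1, 179573.355),
    "GTR+G": (1, 174980.197), "GTR+I+G": (0, 174879.971),
}
REFERENCE = "GTR+I+G"  # the most parameter-rich model

# Standard upper 1% critical values of the chi-squared distribution by df.
CHI2_CRITICAL_01 = {
    1: 6.635, 2: 9.210, 3: 11.345, 4: 13.277, 5: 15.086,
    6: 16.812, 7: 18.475, 8: 20.090, 9: 21.666, 10: 23.209,
}


def lrt_statistic(model, reference=REFERENCE):
    """Twice the difference in -ln likelihood between model and reference."""
    return 2.0 * (TABLE3[model][1] - TABLE3[reference][1])


def rejected_at_1_percent(model):
    """True if the simpler model fits significantly worse at P < 0.01."""
    df = TABLE3[model][0]
    return lrt_statistic(model) > CHI2_CRITICAL_01[df]


if __name__ == "__main__":
    for name in TABLE3:
        if name != REFERENCE:
            print(name, round(lrt_statistic(name), 3), rejected_at_1_percent(name))
```

With the rounded scores from Table 3, the recomputed statistics agree with the tabulated χ² values to within rounding, and every simpler model exceeds its 1% critical value, in line with the statement that all models fit significantly worse than GTR + I + Γ.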