Syst. Biol. 47(1):3–8, 1998

Taxonomic Sampling, Phylogenetic Accuracy, and Investigator Bias

DAVID M. HILLIS

Department of Zoology and Institute of Cellular and Molecular Biology, University of Texas, Austin, Texas 78712, USA; E-mail: hillis@phylo.zo.utexas.edu

In this issue of Systematic Biology, a series of authors use several different approaches to examine the effects of taxonomic sampling on phylogenetic analysis. This topic is receiving increasing attention, in part because recent studies have reached a confusing diversity of conclusions about the effects of taxonomic sampling. For instance, contrast the conclusions reached in two recent papers on this topic:

  If the evolutionary question of interest does not require a large number of taxa, it seems best to use fewer taxa because larger trees are more likely to contain inconsistent branches. (Kim, 1996:372)

  Including large numbers of taxa in an analysis may be the best way to ensure phylogenetic accuracy. (Hillis, 1996:131)

These recommendations, taken at face value, appear to be in direct conflict with regard to advice on taxon sampling. The papers in this issue extend these studies and modify these recommendations on the basis of analyses of real data sets (Soltis et al., 1998; Poe, 1998), simulations (Graybeal, 1998), and theoretical considerations (Kim, 1998). One conclusion from reading these papers is that whether increased taxonomic sampling helps or hinders the process of accurate phylogenetic estimation depends to a great extent on how accuracy is evaluated and what is meant by "taxonomic sampling."

LOCAL VERSUS GLOBAL EFFECTS OF TAXONOMIC SAMPLING

Much of the apparent disagreement among authors on the effects of taxonomic sampling stems from the different evaluation criteria being evaluated. Kim (1998) discusses several of the criteria, and emphasizes the differences between evaluating efficiency versus consistency in phylogenetic analysis. Although this difference is important, a greater difference occurs depending upon whether investigators choose to evaluate the phylogenetic performance on a branch-by-branch basis or on a tree-by-tree basis. For every taxon added to an analysis, we are also attempting to estimate an additional internal branch. Thus, the problem gets more complex as we add taxa, and there are more places in the tree where problems with inconsistency may arise. This led Kim (1996) to his recommendation just quoted, and some authors (e.g., Graur et al., 1996) regularly heed this advice by reducing taxonomic problems to the simplest possible four-taxon trees. Under this strategy, a systematist samples all possible quartets of taxa that involve the internal branch of interest, and tabulates the number of times each of the three possible trees is supported. For instance, Graur et al. (1996) evaluated the relationships of rabbits by evaluating all possible quartets of taxa selected from rabbits, primates, other mammals, and an outgroup (a marsupial, monotreme, or reptile). None of the quartets supported the traditional group Glires (rabbits plus rodents), which they took as evidence that rabbits and rodents are not closely related.

One problem with reducing a phylogenetic analysis to its simplest possible form is that four-taxon trees can be very difficult to estimate correctly if rates of evolution are high (e.g., Hillis et al., 1994). Kim (1996:372) concluded "that to be 95% confident of avoiding inconsistency problems, the expected number of changes over the entire tree for a given character must be less than one out of four." However, much higher rates of character evolution are acceptable (and even desirable) if the tree is densely sampled. To demonstrate this point, I modified the simulation of the 228-taxon tree from Hillis (1996) by increasing the expected amount of change along all the branches by 10-fold (Fig. 1). This tree is based on a phylogenetic estimate of angiosperm relationships (Soltis et al., 1997), and as such it represents an approximation of the topology of the kind of tree that systematists are actually attempting to estimate. The characters are evolving according to a Kimura two-parameter model of evolution, with a 2:1 transition:transversion ratio, and rate heterogeneity among sites (modeled with a gamma distribution with the shape parameter α = 0.5). Under these conditions, the average character is changing 23.6 times across the tree, and because of the rate heterogeneity among sites, some characters change many more times. At these high rates of evolution, many of the terminal sequences are so dissimilar that no biologist would recognize them as homologous. Nonetheless, the tree is accurately reconstructed with just a few thousand nucleotides, and many of the branches require fewer data to reconstruct than at lower rates of evolution (Fig. 2).

FIGURE 1. A model tree based on the phylogenetic analysis of angiosperm diversity by Soltis et al. (1997). In the original simulation (Hillis, 1996), rates of divergence were based on the observed rates among the angiosperms; in this case (simulation one), the scale bar represents 2% divergence. In the present paper, the simulation was repeated (simulation two), but evolutionary rates were increased so that the expected divergence was 10-fold greater (the scale bar represents 20% divergence in this case).

Suppose we are interested in a particular internal branch in the tree (marked with an arrow in Fig. 3).
This branch is correctly estimated in the full tree if all the taxa are included. If we sample a quartet of taxa to examine this same branch (e.g., as in Fig. 3), then the branch will be inconsistently estimated for almost every possible quartet. For the quartet of taxa shown in Figure 3, the probability that a single nucleotide will be misinformative about the relationships of the four taxa under the parsimony criterion is approximately 0.4. The probability that a single nucleotide will be informative about the relationships of the four taxa under the parsimony criterion is approximately 0.006. Thus, one would expect to converge on the wrong solution for these four taxa with great speed under these conditions; only a few nucleotides would need to be sequenced to guarantee finding the wrong solution. In contrast, if all the taxa are included in the analysis, then the branch is correctly reconstructed with a few thousand nucleotides. Clearly, at least some phylogenetic problems require intensive and extensive taxonomic sampling.

FIGURE 2. Performance of parsimony in estimating the 228-taxon tree shown in Figure 1. "Percent of tree correct" is based on the partition metric (Robinson and Foulds, 1981; Penny and Hendy, 1985). All internal branches in the tree are correctly estimated with 5,000 nucleotides, for either simulation.

TAXONOMIC SAMPLING SCHEMES

There are many different ways that systematists might select taxa for analysis. In many cases, the taxa selected will be based on availability. In other cases, it might be possible to select taxa according to a sampling strategy. Let us assume that a sampling strategy is possible, and imagine that a systematist is interested in analyzing the phylogeny of a large and diverse group, such as angiosperms.
We will also assume that preliminary data are available for 20 species. The systematist now has time and money to add 200 more species to the analysis, so some strategy for taxonomic sampling is necessary. Consider five of the many possible strategies:

1. Add the 200 additional taxa randomly from living organisms (e.g., the systematist would sample randomly from the tree of life).
2. Choose taxa randomly within the monophyletic group of interest (in this example, the systematist would randomly sample 200 additional angiosperms).
3. Select taxa within the monophyletic group of interest that will represent the overall diversity of the group. For example, the systematist might select two divergent representatives from each of 100 different families of angiosperms, purposefully chosen to best represent angiosperm diversity.
4. Select taxa within the monophyletic group of interest that are expected (based on current taxonomy or previous phylogenetic studies) to subdivide long branches in the initial tree.
5. Add (and delete) taxa until the a priori biases of the systematist are supported. I call this last strategy the Theriot Effect after the tongue-in-cheek practices of Theriot et al. (1995:4): "We added or discarded characters [taxa] until we achieved the results we believed, then stopped."

Although this range of options may seem extreme, they reflect the range of studies that have been conducted on the topic of "taxonomic sampling." I expect few practicing systematists would purposefully choose sampling strategies 1, 2, or 5. The first strategy would ensure the inclusion of very long branches in the tree, and genes that were evolving at an appropriate rate for elucidating the phylogenetic relationships among angiosperms would likely be saturated for changes among the other taxa.
Adding additional taxa would not reduce the branch lengths in the tree, and the additional taxa would be highly unlikely to help resolve angiosperm phylogeny. The second strategy might seem more likely, but I doubt any systematist would choose this approach either. If he or she did, a large percentage of the added taxa would be composites and orchids, and most of the families of angiosperms would be unrepresented. The dangers of the Theriot Effect (strategy 5) should be clear, and hopefully this strategy would not be selected.

I would expect the typical plant systematist to choose something similar to the third sampling strategy, or, if he or she was explicitly adding taxa to reduce problems with long-branch attraction, the systematist might choose the fourth option. It is likely that he or she would choose some combination of strategies 3 and 4.

Kim's (1996) study is most relevant to sampling strategy 1, or adding increasingly distantly-related taxa to the analysis.

FIGURE 3. Correct phylogenetic estimation of a small internal branch (indicated by the arrow) is strongly dependent on taxonomic sampling. Under the conditions simulated (10 times the observed rates of evolution for angiosperms), if only the four taxa highlighted in bold were sampled, then a misinformative character (one that would support one of the two wrong trees for four taxa) is approximately 67 times more likely than an informative character (one that supports the correct tree). The wrong tree would be estimated with virtual certainty if more than a few characters were collected. However, the branch in question is reconstructed correctly in the analysis of all the taxa. The vast majority of other quartets of taxa defined by this (and many other) internal branches show the same effect.
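
The factor of roughly 67 quoted for the Figure 3 quartet follows directly from the two per-site probabilities given in the text (approximately 0.4 for a misinformative site and 0.006 for an informative one). The short Python sketch below simply reproduces that arithmetic. The function names are illustrative, and the even split of misinformative sites between the two wrong trees is an assumption made only for illustration; the text does not say how the 0.4 is divided.

```python
# Per-site probabilities quoted in the text for the Figure 3 quartet.
P_MISINFORMATIVE = 0.4   # site supports one of the two wrong four-taxon trees
P_INFORMATIVE = 0.006    # site supports the correct four-taxon tree


def odds_against_correct_tree(p_mis: float, p_inf: float) -> float:
    """Ratio of misinformative to informative sites."""
    return p_mis / p_inf


def expected_site_counts(n_sites: int, p_mis: float, p_inf: float) -> dict:
    """Expected number of each kind of site among n_sites sampled sites.

    Sites that support none of the three possible trees are counted as
    parsimony-uninformative.
    """
    return {
        "misinformative": n_sites * p_mis,
        "informative": n_sites * p_inf,
        "uninformative": n_sites * (1.0 - p_mis - p_inf),
    }


if __name__ == "__main__":
    ratio = odds_against_correct_tree(P_MISINFORMATIVE, P_INFORMATIVE)
    print(f"misinformative : informative = {ratio:.1f} : 1")

    # Assumption for illustration only: misinformative sites are divided
    # evenly between the two wrong trees.
    print(f"each wrong tree : correct tree = {ratio / 2:.1f} : 1")

    for n_sites in (10, 100, 1000):
        print(n_sites, expected_site_counts(n_sites, P_MISINFORMATIVE, P_INFORMATIVE))
```

With these figures, a sample of 1,000 sites would be expected to contain about 400 misinformative sites against about 6 informative ones, which is why collecting more characters for the quartet alone cannot rescue the estimate.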
In his principal simulation, Kim (1996) evaluated a sampling scheme in which taxa are added without reducing the average branch length of taxa in the tree. Namely, he randomly selected a tree relating t taxa, to which he randomly assigned branch lengths from an exponential distribution. To examine the effects of taxonomic sampling, he held the average branch length in the tree constant while changing the number of taxa included in the analysis. In the real world, this sampling scheme could only be approximated by adding successively more distantly related taxa (i.e., outside the original group of interest), so that the addition of taxa did not reduce the length of the average branch in the tree. Kim's (1996) simulation suggests that systematists are correct to avoid this strategy.

Kim (1998) conducted new simulations to evaluate strategy 2, of randomly adding taxa from the group of interest to the analysis. Under these conditions, he found that addition of taxa can either increase or decrease the difference in parsimony scores between the model tree and its nearest-neighbor trees. Although this measure does not directly assess the accuracy of phylogenetic estimates, it does suggest that it is better to add some taxa than others. Which are the best taxa to add? Not surprisingly, the taxa that break up long branches (and thereby make the trees less star-like) are the best ones to add. This adds support for strategy 4, or the purposeful division of long branches in the tree.

Yang and Goldman (1997) also recently reported a set of simulations in which taxa were randomly selected for analysis from the group of interest (see also Purvis and Quicke, 1997). They found that the percentage of taxa sampled from a clade had a greater effect on phylogenetic accuracy than did the absolute number of taxa sampled.
This is expected under random sampling of taxa if the speciation and extinction rates are held constant through time in the modeled tree. Under these conditions, the estimated tree for 20 taxa sampled from a model tree of 1,000 taxa will be nearly star-like (very small internal branches with long peripheral branches), whereas the estimated tree for 20 taxa sampled from a model tree of 20 taxa will have many relatively large internal branches. Once again, this suggests that investigator control of the addition of taxa can have a highly beneficial effect on phylogenetic analyses.

Graybeal (1998) evaluated strategy 4, namely, purposefully breaking up long branches in the tree by judicious addition of taxa. This follows the recommendations of most recent authors on the subject of taxon sampling (e.g., Hendy and Penny, 1989; Swofford et al., 1996). She found that addition of such taxa is not only strongly beneficial, but under many conditions accuracy of the phylogenetic estimate improves with the addition of taxa even if the total number of characters examined remains unchanged. In other words, given a limited amount of time and money for phylogenetic analysis, one can sometimes improve the accuracy of the phylogenetic estimate by collecting fewer data for more taxa. Obviously, there are limits to this effect, but Graybeal's (1998) results highlight just how beneficial judicious taxon sampling can be.

What are the effects of taxon sampling as practiced by real systematists? Obviously, this will vary from case to case, but the studies by Soltis et al. (1998) and Poe (1998) provide some insight. The apparent tractability of the real angiosperm tree sampled by Soltis et al. (1998) indicates that systematists have chosen taxa well.
The study of empirical data sets by Poe (1998) indicates that for clades with small numbers of taxa, incomplete sampling is not likely to be a serious problem. This reinforces the idea that the percentage of included taxa in a clade may be a more important consideration than the total number of included taxa. However, more empirical studies are needed to examine the effects of sampling few taxa from a clade of many taxa; the angiosperm data set of Soltis et al. (1998) appears to be ideal for this purpose.

The papers in this issue are useful for identifying the range of outcomes of taxonomic sampling schemes, from the very bad (e.g., strategy 1: randomly adding taxa from the tree of life) to the very good (e.g., strategy 4: adding taxa to break up long branches). Random sampling of taxa from a group of interest (strategy 2) can be effective or not, depending on the details of the true tree. However, it is obviously not the best strategy, nor is it the strategy likely to be used by most systematists. Careful addition of taxa to ensure coverage of the group of interest and to purposefully break up long branches (a combination of strategies 3 and 4) seems to be optimal. In some cases, deletion of problematic taxa (e.g., taxa with abnormally high rates of evolution) may also be warranted.

Unfortunately, purposeful addition and deletion of taxa allows the possibility of consciously or unconsciously biasing the results (the dreaded Theriot Effect). This problem would be easy to overcome through use of a simple method, namely, the blinding of taxon names during analysis. If taxa are to be selected for inclusion or exclusion after an initial analysis, this should be done without the a priori knowledge of the investigator of the effects on the analysis of the additions or deletions.
Thus, all decisions about inclusion or exclusion of taxa would be based only on information about the tree itself, thus avoiding the possibility of an investigator selecting taxa on the basis of how closely the results match his or her preconceived notions of relationship. Blinding of taxon names should be a standard feature of programs for phylogenetic analysis.

Although there is still much disagreement about the expected effects of taxonomic sampling in phylogenetic analysis, there are a few conclusions that seem to be uncontroversial. First, at least some large, very complex trees are far easier to estimate than most systematists would have guessed. Second, some small trees (e.g., quartets) are among the hardest possible phylogenetic trees to estimate correctly. Third, inclusion of many taxa in a densely sampled tree permits more effective use of rapidly evolving characters than in a poorly sampled tree. Fourth, judicious addition of taxa can move some phylogenetic problems from the virtually impossible to the tractable. Fifth, addition of taxa does not always make problems easier; adding highly divergent taxa, for instance, can make phylogenetic estimation harder. Sixth, taxonomic sampling, as practiced by systematists, typically does not involve random sampling of taxa, nor is this expected to be a particularly effective strategy. Finally, given the role of a systematist in selecting taxa for inclusion or exclusion in an analysis, and given the possibility of thereby biasing the results of the analysis, systematists should use blinding of taxon names during the decision-making process.

It is clear that taxonomic sampling can have important consequences for phylogenetic analysis.
Therefore, systematists should give careful consideration to how they decide which taxa to add to an analysis, and should describe their sampling strategy. Theorists should evaluate competing sampling strategies, and emphasize realistic sampling strategies rather than invent new sampling strategies that no systematist could or would use. Perhaps then we can begin to formulate more practical advice on the subject of how to best sample taxa to estimate relationships within the tree of life.

REFERENCES

GRAUR, D., L. DURET, AND M. GOUY. 1996. Phylogenetic position of the order Lagomorpha (rabbits, hares, and allies). Nature 379:333–335.
GRAYBEAL, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47:9–17.
HENDY, M. D., AND D. PENNY. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297–309.
HILLIS, D. M. 1996. Inferring complex phylogenies. Nature 383:130.
HILLIS, D. M., J. P. HUELSENBECK, AND D. L. SWOFFORD. 1994. Hobgoblin of phylogenetics? Nature 369:363–364.
KIM, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363–374.
KIM, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47:43–60.
PENNY, D., AND M. D. HENDY. 1985. The use of tree comparison metrics. Syst. Zool. 34:75–82.
POE, S. 1998. Sensitivity of phylogeny estimation to taxonomic sampling. Syst. Biol. 47:18–31.
PURVIS, A., AND D. L. J. QUICKE. 1997. Are big trees indeed easy? Reply from A. Purvis and D. L. J. Quicke. TREE 12:357–358.
ROBINSON, D. F., AND L. R. FOULDS. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.
SOLTIS, D. E., P. S. SOLTIS, M. E. MORT, M. W. CHASE, V. SAVOLAINEN, S. B. HOOT, AND C. M. MORTON. 1998. Inferring complex phylogenies using parsimony: An empirical approach using three large DNA data sets for angiosperms. Syst. Biol. 47:32–42.
SOLTIS, D. E., P. S. SOLTIS, D. L. NICKRENT, L. A. JOHNSON, W. J. HAHN, S. B. HOOT, J. A. SWEERE, R. K. KUZOFF, K. A. KRON, M. W. CHASE, S. M. SWENSEN, E. A. ZIMMER, S.-M. CHAW, L. J. GILLESPIE, W. J. KRESS, AND K. J. SYTSMA. 1997. Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Ann. Missouri Bot. Gard. 84:1–49.
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic inference. Pages 407–514 in Molecular systematics, 2nd edition (D. M. Hillis, C. Moritz, and B. K. Mable, eds.). Sinauer, Sunderland, Massachusetts.
THERIOT, E. C., A. E. BOGAN, AND E. E. SPAMER. 1995. The taxonomy of Barney: Evidence of convergence in hominid evolution. Ann. Improb. Res. 1:3–7.
YANG, Z., AND N. GOLDMAN. 1997. Are big trees indeed easy? TREE 12:357.

Received 10 November 1997; accepted 20 November 1997
Associate Editor: D. Cannatella
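
The blinding of taxon names recommended in the text is simple to prototype. The Python sketch below shows one possible way to do it and is not a feature of any existing phylogenetics program. The function names, the code format (T001, T002, and so on), and the toy alignment are all hypothetical. Taxon names are replaced by opaque codes before analysis, decisions about inclusion or exclusion are made on the codes alone, and the key is consulted only after those decisions are final.

```python
import random


def make_blinding_key(taxon_names, seed=None):
    """Assign each taxon name an opaque code (e.g., 'T003') in random order."""
    names = list(taxon_names)
    if len(set(names)) != len(names):
        raise ValueError("taxon names must be unique")
    rng = random.Random(seed)
    codes = [f"T{i:03d}" for i in range(1, len(names) + 1)]
    rng.shuffle(codes)  # break any link between input order and code number
    return dict(zip(names, codes))


def blind_alignment(alignment, key):
    """Return a copy of the alignment keyed by code and sorted by code,
    so that row order gives no hint about taxon identity."""
    blinded = {key[name]: seq for name, seq in alignment.items()}
    return dict(sorted(blinded.items()))


def unblind(codes, key):
    """Translate codes back to taxon names once decisions are final."""
    reverse = {code: name for name, code in key.items()}
    return [reverse[code] for code in codes]


if __name__ == "__main__":
    # Hypothetical toy data; the sequences are placeholders, not real data.
    toy_alignment = {
        "rabbit": "ACGTAC",
        "mouse": "ACGAAC",
        "human": "ACGGAC",
        "opossum": "TCGAAT",
    }
    key = make_blinding_key(toy_alignment, seed=1998)
    blinded = blind_alignment(toy_alignment, key)
    for code, seq in blinded.items():
        print(code, seq)

    # The investigator chooses which coded taxa to keep, then unblinds.
    kept_codes = sorted(blinded)[:3]
    print(unblind(kept_codes, key))
```

Keeping the key in a separate file, or with a colleague, until the taxon set is fixed would be one way to make sure that the choice of taxa cannot be influenced by how closely the result matches preconceived relationships.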