Taxonomic Sampling, Phylogenetic Accuracy, and Investigator Bias

Syst. Biol. 47(1):3-8, 1998
DAVID M. HILLIS
Department of Zoology and Institute of Cellular and Molecular Biology,
University of Texas, Austin, Texas 78712, USA; E-mail: hillis@phylo.zo.utexas.edu

In this issue of Systematic Biology, a series of authors use several different approaches
to examine the effects of taxonomic sampling on phylogenetic analysis. This topic is
receiving increasing attention, in part because recent studies have reached a confusing
diversity of conclusions about the effects of taxonomic sampling. For instance, contrast
the conclusions reached in two recent papers on this topic:

    If the evolutionary question of interest does not require a large number of taxa, it
    seems best to use fewer taxa because larger trees are more likely to contain
    inconsistent branches. (Kim, 1996:372)

    Including large numbers of taxa in an analysis may be the best way to ensure
    phylogenetic accuracy. (Hillis, 1996:131)

These recommendations, taken at face value, appear to be in direct conflict with regard to
advice on taxon sampling. The papers in this issue extend these studies and modify these
recommendations on the basis of analyses of real data sets (Soltis et al., 1998; Poe, 1998),
simulations (Graybeal, 1998), and theoretical considerations (Kim, 1998). One conclusion
from reading these papers is that whether increased taxonomic sampling helps or hinders the
process of accurate phylogenetic estimation depends to a great extent on how accuracy is
evaluated and what is meant by "taxonomic sampling."

LOCAL VERSUS GLOBAL EFFECTS OF TAXONOMIC SAMPLING

Much of the apparent disagreement among authors on the effects of taxonomic sampling stems
from the different evaluation criteria being evaluated. Kim (1998) discusses several of the
criteria, and emphasizes the differences between evaluating efficiency versus consistency
in phylogenetic analysis. Although this difference is important, a greater difference
occurs depending upon whether investigators choose to evaluate the phylogenetic performance
on a branch-by-branch basis or on a tree-by-tree basis. For every taxon added to an
analysis, we are also attempting to estimate an additional internal branch. Thus, the
problem gets more complex as we add taxa, and there are more places in the tree where
problems with inconsistency may arise. This led Kim (1996) to his recommendation just
quoted, and some authors (e.g., Graur et al., 1996) regularly heed this advice by reducing
taxonomic problems to the simplest possible four-taxon trees. Under this strategy, a
systematist samples all possible quartets of taxa that involve the internal branch of
interest, and tabulates the number of times each of the three possible trees is supported.
For instance, Graur et al. (1996) evaluated the relationships of rabbits by evaluating all
possible quartets of taxa selected from rabbits, primates, other mammals, and an outgroup
(a marsupial, monotreme, or reptile). None of the quartets supported the traditional group
Glires (rabbits plus rodents), which they took as evidence that rabbits and rodents are not
closely related.

FIGURE 1. A model tree based on the phylogenetic analysis of angiosperm diversity by Soltis
et al. (1997). In the original simulation (Hillis, 1996), rates of divergence were based on
the observed rates among the angiosperms; in this case (simulation one), the scale bar
represents 2% divergence. In the present paper, the simulation was repeated (simulation
two), but evolutionary rates were increased so that the expected divergence was 10-fold
greater (the scale bar represents 20% divergence in this case).

One problem with reducing a phylogenetic analysis to its simplest possible form is that
four-taxon trees can be very difficult to estimate correctly if rates of evolution are high
(e.g., Hillis et al., 1994). Kim (1996:372) concluded "that to be 95% confident of avoiding
inconsistency problems, the expected number of changes over the entire tree for a given
character must be less than one out of four." However, much higher rates of character
evolution are acceptable (and even desirable) if the tree is densely sampled. To
demonstrate this point, I modified the simulation of the 228-taxon tree from Hillis (1996)
by increasing the expected amount of change along all the branches by 10-fold (Fig. 1).
This tree is based on a phylogenetic estimate of angiosperm relationships (Soltis et al.,
1997), and as such it represents an approximation of the topology of the kind of tree that
systematists are actually attempting to estimate. The characters are evolving according to
a Kimura two-parameter model of evolution, with a 2:1 transition:transversion ratio, and
rate heterogeneity among sites (modeled with a gamma distribution with the shape parameter
α = 0.5). Under these conditions, the average character is changing 23.6 times across the
tree, and because of the rate heterogeneity among sites, some characters change many more
times. At these high rates of evolution, many of the terminal sequences are so dissimilar
that no biologist would recognize them as homologous. Nonetheless, the tree is accurately
reconstructed with just a few thousand nucleotides, and many of the branches require fewer
data to reconstruct than at lower rates of evolution (Fig. 2).
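
The substitution process described above can be illustrated with a minimal sketch. It
assumes that the 2:1 transition:transversion ratio means the transition rate is twice the
rate of each type of transversion (kappa = 2), which is only one possible reading of that
phrase, and it draws per-site rates from a gamma distribution with shape 0.5 and mean 1.
The function names are hypothetical, and this is not the program used for the simulations
reported here.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def k2p_probs(d, kappa=2.0):
    """Kimura two-parameter change probabilities along one branch.

    d is the branch length in expected substitutions per site.
    kappa is the assumed ratio of the transition rate to the rate of
    each individual transversion (an assumption, see the lead-in).
    Returns (P_same, P_transition, P_each_transversion).
    """
    beta_t = d / (kappa + 2.0)      # d = (alpha + 2 * beta) * t
    alpha_t = kappa * beta_t
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.25 - 0.25 * e1
    return p_same, p_ts, p_tv


def draw_site_rate(rng, shape=0.5):
    """Relative rate for one site: gamma distributed with mean 1."""
    return rng.gammavariate(shape, 1.0 / shape)


def evolve_base(base, d, site_rate, rng, kappa=2.0):
    """Evolve a single nucleotide along a branch of length d."""
    p_same, p_ts, _ = k2p_probs(d * site_rate, kappa)
    u = rng.random()
    if u < p_same:
        return base
    if u < p_same + p_ts:
        return TRANSITION[base]
    return rng.choice(TRANSVERSIONS[base])


if __name__ == "__main__":
    rng = random.Random(1998)
    rates = [draw_site_rate(rng) for _ in range(10)]
    print([evolve_base("A", 0.2, r, rng) for r in rates])
```

With a shape parameter below 1, most sites receive rates well under the mean while a few
receive very high rates, which is why some characters change many more times than the
average.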

FIGURE 2. Performance of parsimony in estimating the 228-taxon tree shown in Figure 1.
"Percent of tree correct" is based on the partition metric (Robinson and Foulds, 1981;
Penny and Hendy, 1985). All internal branches in the tree are correctly estimated with
5,000 nucleotides, for either simulation.

Suppose we are interested in a particular internal branch in the tree (marked with an arrow
in Fig. 3). This branch is correctly estimated in the full tree if all the taxa are
included. If we sample a quartet of taxa to examine this same branch (e.g., as in Fig. 3),
then the branch will be inconsistently estimated for almost every possible quartet. For the
quartet of taxa shown in Figure 3, the probability that a single nucleotide will be
misinformative about the relationships of the four taxa under the parsimony criterion is
approximately 0.4. The probability that a single nucleotide will be informative about the
relationships of the four taxa under the parsimony criterion is approximately 0.006. Thus,
one would expect to converge on the wrong solution for these four taxa with great speed
under these conditions; only a few nucleotides would need to be sequenced to guarantee
finding the wrong solution. In contrast, if all the taxa are included in the analysis, then
the branch is correctly reconstructed with a few thousand nucleotides. Clearly, at least
some phylogenetic problems require intensive and extensive taxonomic sampling.
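
The arithmetic behind this contrast can be made explicit. The sketch below uses only the
two approximate probabilities quoted in the text and assumes, purely for illustration, that
sites evolve independently; the names are hypothetical.

```python
# Approximate per-site probabilities quoted in the text for the Figure 3 quartet.
P_MISINFORMATIVE = 0.4    # site supports one of the two wrong quartet trees
P_INFORMATIVE = 0.006     # site supports the correct quartet tree


def expected_site_counts(n_sites):
    """Expected (informative, misinformative) site counts among n_sites,
    assuming independent sites (an illustrative simplification)."""
    return n_sites * P_INFORMATIVE, n_sites * P_MISINFORMATIVE


def misinformative_per_informative():
    """Expected number of misinformative sites per informative site."""
    return P_MISINFORMATIVE / P_INFORMATIVE


def share_of_signal_that_is_correct():
    """Fraction of tree-supporting sites that support the correct tree."""
    return P_INFORMATIVE / (P_INFORMATIVE + P_MISINFORMATIVE)


if __name__ == "__main__":
    good, bad = expected_site_counts(1000)
    print(f"per 1,000 sites: about {good:.0f} informative, {bad:.0f} misinformative")
    print(f"ratio: about {misinformative_per_informative():.0f} to 1")
    print(f"correct share of signal: {share_of_signal_that_is_correct():.1%}")
```

The ratio works out to roughly 67 to 1, the figure given in the caption of Figure 3, so
additional sequence data for the quartet mostly adds support for a wrong tree.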

TAXONOMIC SAMPLING SCHEMES

There are many different ways that systematists might select taxa for analysis. In many
cases, the taxa selected will be based on availability. In other cases, it might be
possible to select taxa according to a sampling strategy. Let us assume that a sampling
strategy is possible, and imagine that a systematist is interested in analyzing the
phylogeny of a large and diverse group, such as angiosperms. We will also assume that
preliminary data are available for 20 species. The systematist now has time and money to
add 200 more species to the analysis, so some strategy for taxonomic sampling is necessary.
Consider five of the many possible strategies:
1. Add the 200 additional taxa randomly from living organisms (e.g., the systematist would
   sample randomly from the tree of life).
2. Choose taxa randomly within the monophyletic group of interest (in this example, the
   systematist would randomly sample 200 additional angiosperms).
3. Select taxa within the monophyletic group of interest that will represent the overall
   diversity of the group. For example, the systematist might select two divergent
   representatives from each of 100 different families of angiosperms, purposefully chosen
   to best represent angiosperm diversity.
4. Select taxa within the monophyletic group of interest that are expected (based on
   current taxonomy or previous phylogenetic studies) to subdivide long branches in the
   initial tree.
5. Add (and delete) taxa until the a priori biases of the systematist are supported. I call
   this last strategy the Theriot Effect after the tongue-in-cheek practices of Theriot et
   al. (1995:4): "We added or discarded characters [taxa] until we achieved the results we
   believed, then stopped."
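
Strategies 2 and 3 can be contrasted with a toy sketch. The family sizes below are invented
for illustration (a few very large families and many small ones) and are not real
angiosperm counts. The strategy 3 function draws its representatives arbitrarily, whereas
a systematist would choose divergent representatives on purpose.

```python
import random


def make_toy_group(n_families=100, n_big=2, big_size=20000, small_size=100):
    """Hypothetical clade: a few huge families and many small ones."""
    return {
        f"family_{i:03d}": (big_size if i < n_big else small_size)
        for i in range(n_families)
    }


def strategy_2(sizes, n, rng):
    """Strategy 2: draw n species uniformly at random from the whole group."""
    species = [(fam, i) for fam, size in sizes.items() for i in range(size)]
    return rng.sample(species, n)


def strategy_3(sizes, per_family, rng):
    """Strategy 3 (simplified): take a fixed number of species per family."""
    picks = []
    for fam, size in sizes.items():
        picks.extend((fam, i) for i in rng.sample(range(size), per_family))
    return picks


def families_covered(picks):
    """Number of distinct families represented in a sample."""
    return len({fam for fam, _ in picks})


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = make_toy_group()
    print("strategy 2 covers", families_covered(strategy_2(sizes, 200, rng)), "families")
    print("strategy 3 covers", families_covered(strategy_3(sizes, 2, rng)), "families")
```

Under these made-up sizes, most of the 200 randomly drawn species fall in the two large
families, while the stratified sample covers every family by construction.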

Although this range of options may seem extreme, they reflect the range of studies that
have been conducted on the topic of "taxonomic sampling." I expect few practicing
systematists would purposefully choose sampling strategies 1, 2, or 5. The first strategy
would ensure the inclusion of very long branches in the tree, and genes that were evolving
at an appropriate rate for elucidating the phylogenetic relationships among angiosperms
would likely be saturated for changes among the other taxa. Adding additional taxa would
not reduce the branch lengths in the tree, and the additional taxa would be highly unlikely
to help resolve angiosperm phylogeny. The second strategy might seem more likely, but I
doubt any systematist would choose this approach either. If he or she did, a large
percentage of the added taxa would be composites and orchids, and most of the families of
angiosperms would be unrepresented. The dangers of the Theriot Effect (strategy 5) should
be clear, and hopefully this strategy would not be selected. I would expect the typical
plant systematist to choose something similar to the third sampling strategy, or, if he or
she was explicitly adding taxa to reduce problems with long-branch attraction, the
systematist might choose the fourth option. It is likely that he or she would choose some
combination of strategies 3 and 4.

FIGURE 3. Correct phylogenetic estimation of a small internal branch (indicated by the
arrow) is strongly dependent on taxonomic sampling. Under the conditions simulated (10
times the observed rates of evolution for angiosperms), if only the four taxa highlighted
in bold were sampled, then a misinformative character (one that would support one of the
two wrong trees for four taxa) is approximately 67 times more likely than an informative
character (one that supports the correct tree). The wrong tree would be estimated with
virtual certainty if more than a few characters were collected. However, the branch in
question is reconstructed correctly in the analysis of all the taxa. The vast majority of
other quartets of taxa defined by this (and many other) internal branches show the same
effect.
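
The quartet approach discussed in connection with Figure 3 (sampling one taxon from each of
the four subtrees around an internal branch and asking which of the three possible trees is
favored) can be sketched as follows. For four taxa, the only parsimony-informative sites
are those in which two taxa share one state and the other two share a different state, so
counting those patterns is enough to rank the three trees. The data layout and function
names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def quartet_site_support(s1, s2, s3, s4):
    """Count parsimony-informative sites favoring each of the 3 quartet trees.

    "12|34" groups taxa 1 and 2 against taxa 3 and 4, and so on.
    """
    counts = Counter({"12|34": 0, "13|24": 0, "14|23": 0})
    for a, b, c, d in zip(s1, s2, s3, s4):
        if a == b and c == d and a != c:
            counts["12|34"] += 1
        elif a == c and b == d and a != b:
            counts["13|24"] += 1
        elif a == d and b == c and a != b:
            counts["14|23"] += 1
    return counts


def tabulate_quartets(seqs, g1, g2, g3, g4):
    """Tally the favored tree over every quartet with one taxon per group.

    seqs maps taxon name to aligned sequence; g1..g4 are lists of taxon
    names, with g1 and g2 on one side of the branch of interest.
    """
    tally = Counter()
    for t1, t2, t3, t4 in product(g1, g2, g3, g4):
        support = quartet_site_support(seqs[t1], seqs[t2], seqs[t3], seqs[t4])
        best = max(support.values())
        winners = [tree for tree, n in support.items() if n == best]
        tally[winners[0] if len(winners) == 1 else "tie"] += 1
    return tally


if __name__ == "__main__":
    toy = {"a": "AACCA", "b": "AACCG", "c": "GGTTA", "d": "GGTTG"}
    print(tabulate_quartets(toy, ["a"], ["b"], ["c"], ["d"]))
```

If the groups are arranged so that "12|34" is the tree containing the branch of interest,
the tally shows how many quartets recover that branch and how many favor each alternative.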

Kim's (1996) study is most relevant to sampling strategy 1, or adding increasingly
distantly related taxa to the analysis. In his principal simulation, Kim (1996) evaluated a
sampling scheme in which taxa are added without reducing the average branch length of taxa
in the tree. Namely, he randomly selected a tree relating t taxa, to which he randomly
assigned branch lengths from an exponential distribution. To examine the effects of
taxonomic sampling, he held the average branch length in the tree constant while changing
the number of taxa included in the analysis. In the real world, this sampling scheme could
only be approximated by adding successively more distantly related taxa (i.e., outside the
original group of interest), so that the addition of taxa did not reduce the length of the
average branch in the tree. Kim's (1996) simulation suggests that systematists are correct
to avoid this strategy.
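
A setup of this general kind can be sketched in a few lines. The code below is not Kim's
procedure: the way the random tree is built (attaching each new taxon to a randomly chosen
existing branch) and all names are assumptions made for illustration. It shows only the
key property discussed here, that the mean branch length stays the same however many taxa
are included.

```python
import random


def random_unrooted_tree(t, rng):
    """Edges of a random unrooted binary tree on tips 0..t-1.

    Internal nodes are numbered from t upward. Each new tip is attached
    to a randomly chosen existing edge (an assumed construction).
    """
    if t < 3:
        raise ValueError("need at least three taxa")
    center = t
    edges = [(0, center), (1, center), (2, center)]
    next_internal = t + 1
    for tip in range(3, t):
        u, v = edges.pop(rng.randrange(len(edges)))
        w = next_internal
        next_internal += 1
        edges.extend([(u, w), (w, v), (tip, w)])
    return edges


def assign_exponential_lengths(edges, mean_length, rng):
    """Draw each branch length from an exponential with the given mean."""
    return {edge: rng.expovariate(1.0 / mean_length) for edge in edges}


if __name__ == "__main__":
    rng = random.Random(1996)
    for t in (4, 20, 100):
        edges = random_unrooted_tree(t, rng)
        lengths = assign_exponential_lengths(edges, 0.1, rng)
        mean = sum(lengths.values()) / len(lengths)
        print(t, "taxa,", len(edges), "branches, mean length", round(mean, 3))
```

Because the number of branches grows as 2t - 3 while the expected length of each branch is
fixed, the total amount of change in the tree grows with every added taxon, unlike the
situation in which new taxa subdivide existing branches.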

Kim (1998) conducted new simulations to evaluate strategy 2, of randomly adding taxa from
the group of interest to the analysis. Under these conditions, he found that addition of
taxa can either increase or decrease the difference in parsimony scores between the model
tree and its nearest-neighbor trees. Although this measure does not directly assess the
accuracy of phylogenetic estimates, it does suggest that it is better to add some taxa than
others. Which are the best taxa to add? Not surprisingly, the taxa that break up long
branches (and thereby make the trees less star-like) are the best ones to add. This adds
support for strategy 4, or the purposeful division of long branches in the tree.

Yang and Goldman (1997) also recently reported a set of simulations in which taxa were
randomly selected for analysis from the group of interest (see also Purvis and Quicke,
1997). They found that the percentage of taxa sampled from a clade had a greater effect on
phylogenetic accuracy than did the absolute number of taxa sampled. This is expected under
random sampling of taxa if the speciation and extinction rates are held constant through
time in the modeled tree. Under these conditions, the estimated tree for 20 taxa sampled
from a model tree of 1,000 taxa will be nearly star-like (very small internal branches with
long peripheral branches), whereas the estimated tree for 20 taxa sampled from a model tree
of 20 taxa will have many relatively large internal branches. Once again, this suggests
that investigator control of the addition of taxa can have a highly beneficial effect on
phylogenetic analyses.

Graybeal (1998) evaluated strategy 4, namely, purposefully breaking up long branches in the
tree by judicious addition of taxa. This follows the recommendations of most recent authors
on the subject of taxon sampling (e.g., Hendy and Penny, 1989; Swofford et al., 1996). She
found that addition of such taxa is not only strongly beneficial, but under many conditions
accuracy of the phylogenetic estimate improves with the addition of taxa even if the total
number of characters examined remains unchanged. In other words, given a limited amount of
time and money for phylogenetic analysis, one can sometimes improve the accuracy of the
phylogenetic estimate by collecting fewer data for more taxa. Obviously, there are limits
to this effect, but Graybeal's (1998) results highlight just how beneficial judicious taxon
sampling can be.

What are the effects of taxon sampling as practiced by real systematists? Obviously, this
will vary from case to case, but the studies by Soltis et al. (1998) and Poe (1998) provide
some insight. The apparent tractability of the real angiosperm tree sampled by Soltis et
al. (1998) indicates that systematists have chosen taxa well. The study of empirical data
sets by Poe (1998) indicates that for clades with small numbers of taxa, incomplete
sampling is not likely to be a serious problem. This reinforces the idea that the
percentage of included taxa in a clade may be a more important consideration than the total
number of included taxa. However, more empirical studies are needed to examine the effects
of sampling few taxa from a clade of many taxa; the angiosperm data set of Soltis et al.
(1998) appears to be ideal for this purpose.

The papers in this issue are useful for identifying the range of outcomes of taxonomic
sampling schemes, from the very bad (e.g., strategy 1: randomly adding taxa from the tree
of life) to the very good (e.g., strategy 4: adding taxa to break up long branches). Random
sampling of taxa from a group of interest (strategy 2) can be effective or not, depending
on the details of the true tree. However, it is obviously not the best strategy, nor is it
the strategy likely to be used by most systematists. Careful addition of taxa to ensure
coverage of the group of interest and to purposefully break up long branches (a combination
of strategies 3 and 4) seems to be optimal. In some cases, deletion of problematic taxa
(e.g., taxa with abnormally high rates of evolution) may also be warranted. Unfortunately,
purposeful addition and deletion of taxa allows the possibility of consciously or
unconsciously biasing the results (the dreaded Theriot Effect). This problem would be easy
to overcome through use of a simple method, namely, the blinding of taxon names during
analysis. If taxa are to be selected for inclusion or exclusion after an initial analysis,
this should be done without the a priori knowledge of the investigator of the effects on
the analysis of the additions or deletions. Thus, all decisions about inclusion or
exclusion of taxa would be based only on information about the tree itself, thus avoiding
the possibility of an investigator selecting taxa on the basis of how closely the results
match his or her preconceived notions of relationship. Blinding of taxon names should be a
standard feature of programs for phylogenetic analysis.
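
Blinding of taxon names is simple to implement. The following minimal sketch shows one way
it might be done; the code format, function names, and example taxa are all hypothetical,
and it does not describe a feature of any existing phylogenetics program.

```python
import random


def make_blinding_key(names, rng):
    """Map each taxon name to an opaque code in random order."""
    names = list(names)
    if len(set(names)) != len(names):
        raise ValueError("taxon names must be unique")
    codes = [f"T{i:04d}" for i in range(1, len(names) + 1)]
    rng.shuffle(codes)
    return dict(zip(names, codes))


def blind_alignment(alignment, key):
    """Return a copy of the alignment with names replaced by codes."""
    return {key[name]: seq for name, seq in alignment.items()}


def unblind(codes, key):
    """Translate codes back to taxon names once decisions are final."""
    reverse = {code: name for name, code in key.items()}
    return [reverse[code] for code in codes]


if __name__ == "__main__":
    rng = random.Random()
    alignment = {"taxon_one": "ACGT", "taxon_two": "ACGA", "taxon_three": "TCGA"}
    key = make_blinding_key(alignment, rng)
    blinded = blind_alignment(alignment, key)
    print(sorted(blinded))          # only codes are visible during analysis
    print(unblind(sorted(blinded), key))
```

The key would be stored apart from the analysis, so that decisions to add or delete taxa
are made from the coded tree alone and names are restored only afterward.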

Although there is still much disagreement about the expected effects of taxonomic sampling
in phylogenetic analysis, there are a few conclusions that seem to be uncontroversial.
First, at least some large, very complex trees are far easier to estimate than most
systematists would have guessed. Second, some small trees (e.g., quartets) are among the
hardest possible phylogenetic trees to estimate correctly. Third, inclusion of many taxa in
a densely sampled tree permits more effective use of rapidly evolving characters than in a
poorly sampled tree. Fourth, judicious addition of taxa can move some phylogenetic problems
from the virtually impossible to the tractable. Fifth, addition of taxa does not always
make problems easier; adding highly divergent taxa, for instance, can make phylogenetic
estimation harder. Sixth, taxonomic sampling, as practiced by systematists, typically does
not involve random sampling of taxa, nor is this expected to be a particularly effective
strategy. Finally, given the role of a systematist in selecting taxa for inclusion or
exclusion in an analysis, and given the possibility of thereby biasing the results of the
analysis, systematists should use blinding of taxon names during the decision-making
process.

It is clear that taxonomic sampling can have important consequences for phylogenetic
analysis. Therefore, systematists should give careful consideration to how they decide
which taxa to add to an analysis, and should describe their sampling strategy. Theorists
should evaluate competing sampling strategies, and emphasize realistic sampling strategies
rather than invent new sampling strategies that no systematist could or would use. Perhaps
then we can begin to formulate more practical advice on the subject of how to best sample
taxa to estimate relationships within the tree of life.

REFERENCES

GRAUR, D., L. DURET, AND M. GOUY. 1996. Phylogenetic position of the order Lagomorpha
(rabbits, hares, and allies). Nature 379:333-335.
GRAYBEAL, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic
problem? Syst. Biol. 47:9-17.
HENDY, M. D., AND D. PENNY. 1989. A framework for the quantitative study of evolutionary
trees. Syst. Zool. 38:297-309.
HILLIS, D. M. 1996. Inferring complex phylogenies. Nature 383:130.
HILLIS, D. M., J. P. HUELSENBECK, AND D. L. SWOFFORD. 1994. Hobgoblin of phylogenetics?
Nature 369:363-364.
KIM, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch
lengths and increasing numbers of taxa. Syst. Biol. 45:363-374.
KIM, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic
estimators. Syst. Biol. 47:43-60.
PENNY, D., AND M. D. HENDY. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82.
POE, S. 1998. Sensitivity of phylogeny estimation to taxonomic sampling. Syst. Biol.
47:18-31.
PURVIS, A., AND D. L. J. QUICKE. 1997. Are big trees indeed easy? Reply from A. Purvis and
D. L. J. Quicke. TREE 12:357-358.
ROBINSON, D. F., AND L. R. FOULDS. 1981. Comparison of phylogenetic trees. Math. Biosci.
53:131-147.
SOLTIS, D. E., P. S. SOLTIS, M. E. MORT, M. W. CHASE, V. SAVOLAINEN, S. B. HOOT, AND C. M.
MORTON. 1998. Inferring complex phylogenies using parsimony: An empirical approach using
three large DNA data sets for angiosperms. Syst. Biol. 47:32-42.
SOLTIS, D. E., P. S. SOLTIS, D. L. NICKRENT, L. A. JOHNSON, W. J. HAHN, S. B. HOOT, J. A.
SWEERE, R. K. KUZOFF, K. A. KRON, M. W. CHASE, S. M. SWENSEN, E. A. ZIMMER, S.-M. CHAW,
L. J. GILLESPIE, W. J. KRESS, AND K. J. SYTSMA. 1997. Angiosperm phylogeny inferred from
18S ribosomal DNA sequences. Ann. Missouri Bot. Gard. 84:1-49.
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, AND D. M. HILLIS. 1996. Phylogenetic
inference. Pages 407-514 in Molecular systematics, 2nd edition (D. M. Hillis, C. Moritz,
and B. K. Mable, eds.). Sinauer, Sunderland, Massachusetts.
THERIOT, E. C., A. E. BOGAN, AND E. E. SPAMER. 1995. The taxonomy of Barney: Evidence of
convergence in hominid evolution. Ann. Improb. Res. 1:3-7.
YANG, Z., AND N. GOLDMAN. 1997. Are big trees indeed easy? TREE 12:357.

Received 10 November 1997; accepted 20 November 1997
Associate Editor: D. Cannatella