Syst. Biol. 50(4):557–564, 2001

Ancestral State Estimation and Taxon Sampling Density

BENJAMIN A. SALISBURY (1) AND JUNHYONG KIM (2)

(1) Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut 06520-8106, USA; E-mail: [email protected]
(2) Department of Ecology and Evolutionary Biology, Department of Molecular, Cellular, and Developmental Biology, and Department of Statistics, Yale University, New Haven, Connecticut 06520, USA; E-mail: [email protected]

Abstract. A set of experiments based on simulation and analysis found that using the parsimony algorithm for ancestral state estimation can benefit from increased sampling of terminal taxa. Estimation at the base of small clades showed strong sensitivity to tree topology and number of descendent tips. These effects were largely driven by the creation and negation of ambiguity across a topology. Root state and internal state estimation showed similar behavior. We conclude that increased taxon sampling density is generally advisable, and attention to topological effects may be advisable in evaluating the confidence placed in state estimation. We also explore the factors affecting ancestral state estimation and conjecture that as taxa are added to a tree, the total amount of information for root state estimation depends on the tree topology and distance to root state of added taxa. For a pure-birth model tree, we conjecture that the addition of N taxa increases root state information in proportion to log(N). [Parsimony; state estimation; taxon sampling; tree topology.]

The challenge of estimating ancestral character states has recently received attention from many authors, including those of the seven-paper symposium published in Systematic Biology 48, no. 3 (Cunningham, 1999; Martins, 1999; Mooers and Schluter, 1999; Omland, 1999; Pagel, 1999; Ree and Donoghue, 1999; Schultz and Churchill, 1999).
Relative to our understanding of phylogeny estimation, methods of ancestral state estimation are somewhat poorly characterized; this disparity is perhaps due to the logical priority of the former step in inferring evolutionary history. Most papers on state estimation theory have revolved around the merits and liabilities of competing analytical methods, such as parsimony and various flavors of maximum likelihood. In this note we explore a single issue: how the accuracy of estimating ancestral states depends on the density of taxon sampling. We consider only parsimony estimation because it is the most commonly applied method and its algorithm is amenable to analytical treatment.

Frumhoff and Reeve (1994) and a subsequent paper by Schultz et al. (1996) considered evolution and root state estimation on a “null model” phylogenetic star tree, analyzing the effects of asymmetry of character change rates and other sources of correlated homoplasy. Under this simple tree model, the probability of successfully estimating the ancestral state increases with added taxa unless the asymmetry of character evolution makes the root state less likely to be observed at each tip than is some other state. Of course, by assuming a star phylogeny, those estimates were effectively nonphylogenetic, derived simply from the plurality state among the observations.

Zhang and Nei (1997) used character simulation on a few fully branched model trees (up to 10 tips) to estimate probabilities of correctly estimating ancestral states at internal nodes by using parsimony, maximum likelihood, and a hybrid distance–maximum likelihood method. Under parsimony, they found that having more taxa usually improved the proportion of correct state estimates, with some unexplained exceptions. However, their experiments considered only small changes in taxon sampling (differences of 1 or 2 tips).
Steel and Charleston (1995) asked what happens to the probability of correct root state estimation by parsimony when tree size is increased by adding large numbers of taxa. Their investigation was prompted by recognizing that when taxa are added to the tree, the information level for root state estimation increases, but the total amount of evolution in the tree (thus erasure of the root state information) also increases. Therefore, which of the two “forces,” effecting information increase and decrease, respectively, would win out was not clear. In Steel and Charleston's model, taxa were added to a fully balanced tree by doubling the numbers at each time step and preserving the balanced tree structure. Furthermore, they assumed that each added branch had a fixed probability, $p$, of state change such that the time depth of the tree increased with each doubling of the taxa. From this model of taxon sampling they obtained the result that the probability of correct root state estimate, $P_c$, goes to

$$P_c = \frac{1}{2}\left(1 - 2x + \frac{\sqrt{(1 - 6x)(1 - 2x)}}{1 - 2p}\right), \quad \text{where } x = \frac{p}{1 - 2p}, \tag{1}$$

when $p < 1/8$ and goes to 1/3 when $p > 1/8$. In this taxon sampling model, the probability of correct root state estimate is a decreasing function of number of taxa. However, this model is rather unrealistic. In the usual empirical cases where we might be interested in the root state estimate of a fixed clade, the expected amount of evolution per lineage would not increase with increased taxon sampling unless we accidentally sampled outside the clade or by chance included a very deviant subclade with high rates of evolution. Therefore, in this paper, we revisit the problem of asking what happens to root state estimation probabilities when a more realistic taxon sampling model is applied. We also extend this research to estimation of internal node states.

METHODS

We began by creating trees that would each represent an entire, fully sampled clade.
For simplicity, we generated trees using a pure birth (i.e., no extinction) Markovian speciation model (also known as a Yule model); all lineages were equally likely to speciate and the speciation rate was constant over time. To build the trees, we used the conditioned sampling approach (Ross, 1996). We conditioned on the number of tips equaling 512 (i.e., $2^9$) over a unit time interval. The speciation rate was set to ln(512/2), for which 512 tips are expected after one unit time under the pure birth model after an initial speciation event. With these settings we hoped to generate trees that were not drastically different from those encountered in natural investigations.

Subsamples of different sizes (details below) were taken from the parent trees such that smaller subsamples were always nested within the larger ones. Our results would thus show the effect of adding more information to an existing set of observations. Taxa were chosen equiprobably from the initial 512. The subsamples were always required to span the root of the tree.

We used exact calculations to determine the probabilities of correct, incorrect, and ambiguous ($P_c$, $P_i$, and $P_a$, respectively) estimates of the root states of characters evolved on these trees. The calculations were enabled by algorithms derived independently by Maddison (1995) and Kim (1996). Here we assumed (1) binary characters, (2) time homogeneous rates of evolution, and (3) symmetric (equal) change rates between the two states. We also assumed that tree topologies were correctly estimated before estimation of ancestral states.

We extended the previous algorithm to calculate the conditional probabilities of correctly, incorrectly, and ambiguously reconstructing the internal node states for our trees. This was accomplished by effectively rerooting a tree as a trichotomy at each node.
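The exact root calculation described above can be illustrated with a small sketch. This is not the Maddison (1995) or Kim (1996) algorithm itself, only a minimal dynamic program in the same spirit: for each node it tracks the probability of every possible Fitch (parsimony) state set given the true state at that node. The tree encoding (`LEAF`, nested pairs of `(child, branch_length)`), the function names, and the transition probability $\frac{1}{2} + \frac{1}{2}e^{-2rt}$ (which treats $r$ as the rate of change events in a symmetric two-state model) are all assumptions made for illustration.

```python
import math

LEAF = None  # hypothetical encoding: a tip is None, an internal node is
             # ((left_child, left_length), (right_child, right_length))


def p_same(r, t):
    """P(state unchanged over a branch of length t).

    Assumption: symmetric two-state model in which r is the rate of
    change events; another parametrization would rescale r.
    """
    return 0.5 + 0.5 * math.exp(-2.0 * r * t)


def fitch_given_state(node, r):
    """Return d[s][F] = P(Fitch set at node is F | true node state is s)."""
    if node is LEAF:
        # A tip is observed, so its Fitch set is its own state.
        return {s: {frozenset([s]): 1.0} for s in (0, 1)}
    messages = []
    for child, t in node:
        below = fitch_given_state(child, r)
        q = p_same(r, t)
        msg = {}
        for s in (0, 1):
            acc = {}
            for c in (0, 1):  # sum over the child's true state
                w = q if c == s else 1.0 - q
                for fset, pr in below[c].items():
                    acc[fset] = acc.get(fset, 0.0) + w * pr
            msg[s] = acc
        messages.append(msg)
    left, right = messages
    out = {}
    for s in (0, 1):
        acc = {}
        for fl, pl in left[s].items():
            for fr, pr in right[s].items():
                # Fitch rule: intersection if non-empty, otherwise union.
                fset = (fl & fr) or (fl | fr)
                acc[fset] = acc.get(fset, 0.0) + pl * pr
        out[s] = acc
    return out


def root_probabilities(tree, r):
    """Return (P_c, P_i, P_a) at the root, taking state 0 as the truth."""
    dist = fitch_given_state(tree, r)[0]
    pc = dist.get(frozenset([0]), 0.0)
    pi = dist.get(frozenset([1]), 0.0)
    pa = dist.get(frozenset([0, 1]), 0.0)
    return pc, pi, pa


cherry = ((LEAF, 0.5), (LEAF, 0.5))
balanced4 = ((cherry, 0.5), (cherry, 0.5))
print(root_probabilities(balanced4, 2.0))
```

Because the two subtrees are independent given the state at their common ancestor, the three probabilities always sum to 1, and for a two-tip tree they reduce to $q^2$, $(1-q)^2$, and $2q(1-q)$ with $q$ the per-branch probability of no net change.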
EXPERIMENTS AND RESULTS

Root State Estimation

Our first experiments were designed to reflect a commonly encountered endeavor: estimating ancestral character states at the root of a particular clade (e.g., orchid habit; Frohlich, 1987). The researcher must decide how thoroughly to sample taxa from among the observable, extant members of the clade. To address this situation, we examined how the probability of correct root state estimation varies in relation to subsampling different numbers of taxa from a larger clade.

We first examined how the probability, $P_c$, of correctly estimating root ancestral states responds to large changes in sample size. For each of 100 replicate 512-tip trees, we examined nested subsamples of size $N$ = 16, 32, 64, 128, 256, and 512 tips. For each tree, we began with a subsample of 16 equiprobably chosen tips that together spanned the root of the tree. Each larger sample was generated by adding more taxa to the preceding smaller sample, as might be done in an empirical study.

We calculated $P_c$ for a homogeneous instantaneous rate of character change, $r$, for each tree and subtree created in the above fashion. This analysis was conducted for character change rates of 0.5, 1.0, and 2.0. Because the total time depth of the tree was a unit interval, the characters were expected to have 1, 2, or 4 changes over a path connecting any pair of tips that spanned the root of the tree. For a pure birth model of speciation with speciation rate $\lambda$, the total branch length over the entire tree, to which the expected number of character steps is proportional, has the expectation (derived from Ross, 1996):

$$Z = 2t + (N - 2)\,\frac{1 - e^{-\lambda t} - \lambda t e^{-\lambda t}}{\lambda\left(1 - e^{-\lambda t}\right)} \tag{2}$$

Therefore, with time $t = 1$ and $r$ = 0.5, 1.0, and 2.0, the total expected numbers of character changes over the whole tree are roughly bounded by 46, 92, and 184, respectively.

Figure 1 depicts the mean values of the probability of correct root state estimate, $P_c$, for the 100 trees under each set of conditions.
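The expected totals quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes equation 2 has the form $Z = 2t + (N-2)\,\bigl(1 - e^{-\lambda t} - \lambda t e^{-\lambda t}\bigr)/\bigl(\lambda(1 - e^{-\lambda t})\bigr)$, with $\lambda = \ln(512/2)$ as stated in the Methods; the function name `expected_tree_length` is hypothetical.

```python
import math


def expected_tree_length(n_tips, lam, t=1.0):
    """Expected total branch length of a pure-birth tree (assumed form of eq. 2).

    The two lineages leaving the root each span the full depth t; each of
    the remaining n_tips - 2 lineages contributes the expected time between
    its origin and the present, conditional on n_tips tips at time t.
    """
    decay = math.exp(-lam * t)
    per_lineage = (1.0 - decay - lam * t * decay) / (lam * (1.0 - decay))
    return 2.0 * t + (n_tips - 2) * per_lineage


lam = math.log(512 / 2)  # speciation rate used in the Methods
z = expected_tree_length(512, lam, 1.0)
for r in (0.5, 1.0, 2.0):
    # Expected number of changes is the rate times the total branch length.
    print(f"r = {r}: about {r * z:.1f} expected changes over the tree")
```

Under these assumptions the total branch length comes out just under 92 time units, so the three rates give values close to the 46, 92, and 184 changes cited in the text.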
There is an invariant trend of increasing accuracy with increased sample sizes. Variation masked by averaging of $P_c$ over the sample is hinted at by the standard deviation bars. Details are discussed further later.

Secondarily, we considered the effects of sampling at very small clade sizes. Because the parsimony state estimation algorithm is strictly a function of the tree topology, it can display aberrant behavior. For example, in a comb-shaped tree, the most “basal” lineages are longer (i.e., more error prone) than the rest yet exert an overwhelming influence regarding estimation of the root state. Such tree topologies and behavior can be especially common and pronounced when the number of taxa is small.

Using the same 512-tip tree generation as above and an analogous sampling strategy, we calculated the root ancestral estimation probabilities for every value of $N$ from 2 to 8 given $r$ = 0.5, 1.0, and 2.0. Figure 2 shows the results for $r = 2$, which were comparable with, though more extreme than, results for the other two values of $r$. Probabilities for each value of $N$ were averaged over the 100 samples. The previously observed tendency for $P_c$ to increase with sample size (Fig. 1) is still apparent. However, for the smallest values of $N$, the exact number of tips is a strong determinant of $P_c$, as is evident in the clear oscillatory variation. $P_i$ oscillates synchronously with $P_c$, whereas $P_a$ oscillates out of phase with the others.

FIGURE 1. Mean probabilities, $P_c$, of correctly estimating the root state of a binary character evolving at three rates ($r$) on subsamples of 512-tip, pure-birth model trees. The bars around each mean indicate ± 1 SD based on a sample of 100 trees.

FIGURE 2. Behavior of $P_i$, $P_a$, and $P_c$ for subsample trees with very few tips. Details as in Figure 1. Error bars are shown only for $P_c$.

Internal Node State Estimation
Our second set of experiments was designed to complement the above work by focusing on the estimation of ancestral states throughout a phylogeny. We generated 100 trees as above and subsampled $N$ = 16, 32, 64, 128, 256, and 512 tips, again requiring that the root be spanned. We calculated the conditional probabilities $P_c$, $P_i$, and $P_a$ for every internal node above the root. For each node, we also noted the number of descendent tips and the temporal distance from the root (from 0 to 1). This experimental design treats each internal node state (a random variable) as if it were the root state parameter. It does not assess the joint probability of correct estimates at all internal nodes, only the marginal states at each node.

Figure 3 depicts second-order local regressions (loess fit; Venables and Ripley, 1997) of distance from root and $P_c$ for the nodes of each subsample size. The clear result is that, at any depth in the tree, adding more terminal taxa to the study (not necessarily within the subtended clade) tends to increase estimation success at an internal node. Furthermore, the deeper the node is in the tree, the more important taxon sampling density becomes.

FIGURE 3. Probability of correctly estimating an internal node state as a function of its position. Characters are modeled as binary with symmetric change rate $r = 2.0$. Time depth is the position of the internal node relative to the terminal taxa and the root node (the present = 0; the root position = 1). $P_c$ is the conditional probability of correctly reconstructing the internal node state. The curves are for different sizes of samples of the original tree. From top to bottom the sample sizes are 512, 256, 128, 64, 32, and 16, respectively. The curves were obtained by a second-order local regression of the internal node positions and their probability of correct state estimate.
DISCUSSION

The primary implication of our results is that the probability of correctly estimating the ancestral states of characters can be increased by adding more taxa to an analysis. This holds for both internal states and root states. Despite variability, the pattern of increased root state $P_c$ with increased taxon sampling held almost universally in our first experiment. When the taxon density was doubled, only 2.3% of the cases resulted in a decreased $P_c$. Furthermore, the magnitudes of those decreases tended to be slight and they occurred primarily for $r = 0.5$, where sample size had little effect in either direction because of the conservative pace of character evolution. Mean improvement for root state $P_c$ varied by size; Figure 1 shows the diminishing return of sample doubling. Diminishing returns on taxon sampling investment were less evident for high $r$, as seen in Figure 2 and the bottom curve of Figure 1.

The positive association between sampling density and accuracy seen in Figure 1 is an average tendency that hides some interesting variation. Rather than being normally distributed around the means, $P_c$ is distinctly bimodal, especially at low taxon sizes. The one restriction we placed on our subsampling was that the root had to be spanned; that is, the two branches distinguished by the root node had to be represented by at least one taxon each. In a repeat of the first experiment, if we include an additional requirement that the two halves of each tree subsample be represented by at least two tips each, the bimodality effectively disappears (data not shown). The original bimodality can be attributed to the presence of trees that were excluded by the new criterion: trees in which a monotypic lineage forms the sister group to the rest of the taxa (a split of 1 and $N - 1$).
In those trees, the monotypic branch contributes an observed rather than estimated state to the estimation process at the root; although this branch is the longest branch in the tree (and therefore most error prone), the state observation at its tip has more influence than any other tip state observation because it contributes directly to the root estimation and is valued equally with the estimate at the root of the sister clade. Trees with a 2:14 split have higher $P_c$ and lower $P_a$ and $P_i$ on average than the 1:15 trees.

A different kind of variability was evident for small taxon samples. At low $N$, parity is a key determinant of root state estimation success (Fig. 2). When an odd number of tips is used, ambiguity is more likely to be negated at the root. We do not believe that these results argue for intentionally sampling even or odd numbers of taxa. Rather, the differential results reflect parity-dependent topological sampling and parsimony’s topology-dependent estimation effects. However, the findings do suggest that if a study finds an unambiguous root estimate when $N$ is low and odd, the estimate should be viewed suspiciously as a possible artifact of parsimony when there is any ambiguity at nodes near to the root.

We also considered the role of parity for internal state estimation. When we plotted (not shown) $P_c$ for the internal nodes of 16-taxon samples separately according to number of descendent taxa, we found that parity again made a large difference: Especially when near the root, $P_c$ values were relatively higher for nodes with an even number of descendents. Parity, however, is only a crude indicator of the likely extent of ambiguity. Figure 4 shows the internal $P_a$ values for every four-descendent node in 16-tip subsamples of 500 random 512-tip trees. The points are marked according to whether the clade is topologically symmetrical (2:2) or asymmetrical (1:3). $P_a$ is distinctly greater for the balanced clades. Similarly, $P_c$ and $P_i$ also show bimodality. Clearly, topological considerations may be important when assessing the success of parsimony estimates of ancestral states.

FIGURE 4. Probability of ambiguously estimating the state of an internal node with four descendent terminal taxa as a function of its position and topology. Characters are modeled as binary with symmetric change rate $r = 2.0$. $P_a$ is the conditional probability of ambiguously reconstructing the internal node state. The points shown represent all four-tip clades from a set of 500 16-tip, root-spanning subtrees. The points are labeled according to whether the clade is symmetrical (2:2) or asymmetrical (1:3).

Another aspect of parsimony estimates that deserves mention is the difference between root and internal state estimation. In a fully resolved tree, root estimates derive from two subestimates (left and right), whereas internal estimates are based on left, right, and ancestral subestimates. This difference dramatically affects how much ambiguity is expected. Notice that the $P_c$ values at time depth = 1 in Figure 3 are distinctly greater than the corresponding $P_c$ values in Figure 1 (the $r = 2$ curve). This discrepancy appears to largely reflect the shift in ambiguity; for example, when $r = 2$ and $N = 16$, $P_a = 0.32$ for the root, whereas $P_a = 0.23$ for internal nodes near the root (time depth ≥ 0.99).

Effective Information of Added Taxa

It is useful to consider a simpler case to analyze the general factors affecting root state estimation. Suppose we have a single lineage with the ancestor at time $t = 0$ and a descendent at time $t = \tau$ and binary state characters with symmetric probability of change. If the character state, $X$, of the descendent is 0, then our best guess at the ancestor state is also 0, because we have no other information at hand.
If we model the character evolution process as a continuous-time Markov model, the probability that the descendent will be identical to the ancestor is $\frac{1}{2} + \frac{1}{2}e^{-rt}$. Thus $P_c$ is a decreasing function of the order of $O(e^{-rt})$, where $r$ is the rate constant. As expected, the quality of information about root state degrades the further away in time we sample the descendent state.

Suppose now we have $N$ descendent lineages and these are arranged as a star topology. Then we have $N$ independent lines of evidence about the ancestral state. Let $f_0$ and $f_1$ be the frequencies of state 0 and state 1 observed over the $N$ descendent taxa. Then the maximum parsimony estimate of the ancestral state is the most common state (as is the maximum likelihood estimate). In the next steps we assume state 0 is the true ancestral state without loss of generality. As we add independent lineages, $\mathrm{Prob}\{f_0 > f_1\}$ goes to 1 as $N$ goes to infinity (even if the length of each lineage varies) as long as Prob{descendent state is different from ancestor state} < 1/2, consistent with the results of Frumhoff and Reeve (1994). Thus, when we have a star topology and the probability of difference between the ancestor and descendent is bounded below 1/2, we always converge to the true estimate as we add taxa. Let Prob{descendent state is different from ancestor state} be $p = \frac{1}{2} - \frac{1}{2}e^{-rt}$; then $f_0$ and $f_1$ follow a binomial distribution. To compute $\mathrm{Prob}\{f_0 > f_1\}$, we can use the normal approximation to compute the Z-score first.

$$Z \sim \frac{N(1 - p) - N/2}{\sqrt{Np(1 - p)}} = \frac{\sqrt{N}}{\sqrt{e^{2rt} - 1}} \approx O\left(\sqrt{N}\,e^{-rt}\right). \tag{3}$$

Therefore, the standard Z-score increases as a square root of $N$ and decreases as a negative exponential function of time. If $N$ is large, we can use an approximation to the cumulative distribution of the normal distribution (Rohatgi, 1976):

$$\mathrm{Prob}\{Z > x\} \approx \frac{1}{x\sqrt{2\pi}}\,e^{-x^2/2} \quad (x \to \infty) \tag{4}$$

and substituting (3) gives us the scaling relationship:

$$\mathrm{Prob}\{\text{correct root state estimate}\} \approx 1 - \frac{1}{\sqrt{2\pi N}}\exp\left(rt - \frac{N}{2}e^{-2rt}\right) \tag{5}$$

When the lineages have more treelike structure, we have dependencies in the lineages and therefore less total information as well as less potentially misleading information. For example, if we have 99 lineages forming a very recently diversified clade with a very long stem and a single sister lineage, even if we have 99 lineages with state 0, we would be wary of declaring the ancestral state as 0 if the monotypic sister was state 1. Tree-dependent estimates, such as parsimony estimates or maximum likelihood estimates, attempt to resolve this dependency problem by incorporating the tree structure into the state estimates.

What about “less information” provided by the tree itself? In the star topology case, each lineage provides independent information about the root state, leading to the scaling relationship given in equation 5. In a “normal” tree the dependencies reduce the effective number of informative lineages. Here we conjecture that the number of informative lineages in a Yule tree (or similar model trees with constant expected branch length) scales as log(N). This gives us a conjectured scaling relationship for Yule trees in terms of Z-scores:

$$Z \sim \frac{\sqrt{\log N}}{\sqrt{e^{2rt} - 1}} \approx O\left(\sqrt{\log N}\,e^{-rt}\right) \tag{6}$$

The scaling relationship given by equation 6 fits our data well (not shown). It is also supported by comparing our results with those of Steel and Charleston (1995). Because the total amount of time in our trees is given approximately by equation 2 and because there are $2N - 2$ branches in any tree, the average branch length of our trees (equation 2 divided by $2N - 2$) rapidly asymptotes to a constant length as the number of lineages is increased. Therefore, contrary to the intuition of “breaking up” branches, average branch length of a pure-birth tree stays constant regardless of the number of taxa; it does not decrease to 0. Steel and Charleston’s trees also have constant average branch length.
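The branch length argument can be illustrated numerically. The sketch below assumes equation 2 has the form $Z = 2t + (N-2)\,\bigl(1 - e^{-\lambda t} - \lambda t e^{-\lambda t}\bigr)/\bigl(\lambda(1 - e^{-\lambda t})\bigr)$ and, as a further assumption not spelled out in the text, that the speciation rate $\lambda$ and depth $t$ are held fixed while $N$ grows. The function names are hypothetical.

```python
import math


def per_lineage_length(lam, t=1.0):
    """Expected length contributed by each non-root lineage (assumed eq. 2 term)."""
    decay = math.exp(-lam * t)
    return (1.0 - decay - lam * t * decay) / (lam * (1.0 - decay))


def mean_branch_length(n_tips, lam, t=1.0):
    """Expected total length divided by the 2N - 2 branches of a binary tree."""
    total = 2.0 * t + (n_tips - 2) * per_lineage_length(lam, t)
    return total / (2 * n_tips - 2)


def limiting_mean_branch_length(lam, t=1.0):
    """Large-N limit of mean_branch_length for fixed lam and t."""
    return per_lineage_length(lam, t) / 2.0


lam = math.log(512 / 2)  # speciation rate from the Methods, held fixed here
for n in (16, 32, 64, 128, 256, 512):
    print(n, round(mean_branch_length(n, lam), 4))
print("limit", round(limiting_mean_branch_length(lam), 4))
```

In this sketch the mean branch length falls quickly toward a positive constant rather than toward 0 as tips are added, which is the property the pure-birth trees share with Steel and Charleston's trees.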
However, the total time from the tips to the root is a fixed constant in our trees, whereas it scales as a function of log(N) in Steel and Charleston’s trees. Letting $rt = \log N$ and substituting this into equation 6 shows that we expect Z-scores to go to 0 as $N$ increases for Steel and Charleston’s trees. Therefore, we would expect the probability of correct root state estimate to decrease with $N$, consistent with their results.

Conditions and Conclusions

As listed in the Methods section, we made several assumptions in this study, of which two are particularly noteworthy. First, tree topology is known (i.e., correctly estimated) before ancestral character states are estimated. The significance of this assumption is perhaps not as great as it might seem; Zhang and Nei (1997) found that errors in topology estimation had negligible effects on state estimation at nodes that were not in the immediate neighborhood of a topology error. From our analysis, we can see that the main effect of the tree topology estimate is to correct for the dependence structure in the data. Then, as long as the estimated trees are not wildly deviant from the true trees (i.e., as long as they capture the rough dependence structure of the trees), ancestral state estimates apparently would not be greatly affected. However, a more thorough investigation of this problem should still be attempted.

A second critical assumption of this study was that taxon sampling is random with respect to taxon identity and phylogeny. In practice, taxon sampling will depend on the availability of specimens and data, researchers’ opinions on what constitutes an appropriate sampling scheme, and the objectives of the study beyond the estimation of ancestral states of particular characters. At worst, or perhaps best, taxa may be chosen with specific regard to a character of interest.
It is quite possible to sample taxa in a pathological manner that will decrease the probability of correctly estimating an ancestral state. However, from our analysis we expect the results here to be generally robust to any particular taxon sampling scheme, as long as the total depth of the tree (as measured by expected number of changes) does not significantly increase as a result of the added taxa.

The findings of this paper present clear evidence that increased taxon sampling, as a general practice, can be helpful in estimating ancestral character states. This pattern appears unaffected by rate of character change as long as the total depth of the tree does not increase with increased taxa. Sampling taxa more densely, especially when the sample would otherwise be sparse, appears to be a reliable way to improve ancestral character state estimates.

We also demonstrated some peculiar properties of the parsimony algorithm. Because parsimony has the effect of considering all branches to be equally prone to character change and because it deals in absolute state assignment and nongraded ambiguity, topology can have a large influence on estimation success.

Finally, we note that all our results are with respect to marginal states at a particular node, root or otherwise. The joint estimate at all nodes is a complicated problem. For an $N$-taxon tree, we have $N$ state observations that we are hoping to use to deduce $N - 1$ unobserved states. The difficulty of such a problem is evident. Ancestral state estimation is crucial to phylogenetic biology, but far less attention has been paid to the problem than to the problem of tree topology estimation. Many open questions remain for future studies.

ACKNOWLEDGMENTS

We are grateful to Dick Olmstead, David Ackerly, and an anonymous reviewer for their useful comments, especially the encouragement to expand our research to include internal node estimation. This work was supported in part by NSF grant DEB-9806570 to J.K. B.A.S.
was also supported through a Forest B.H. and Elizabeth D.W. Brown Postdoctoral Fellowship. This paper is dedicated to F. James Rohlf on his 65th birthday.

REFERENCES

CUNNINGHAM, C. W. 1999. Some limitations of ancestral character-state reconstruction when testing evolutionary hypotheses. Syst. Biol. 48:665–674.
FROHLICH, M. W. 1987. Common-is-primitive: A partial validation by tree counting. Syst. Bot. 12:217–237.
FRUMHOFF, P. C., and H. K. REEVE. 1994. Using phylogenies to test hypotheses of adaptation: A critique of some current proposals. Evolution 48:172–180.
KIM, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363–374.
MADDISON, W. P. 1995. Calculating the probability distributions of ancestral states reconstructed by parsimony on phylogenetic trees. Syst. Biol. 44:474–481.
MARTINS, E. P. 1999. Estimation of ancestral states of continuous characters: A computer simulation study. Syst. Biol. 48:642–650.
MOOERS, A. Ø., and D. SCHLUTER. 1999. Reconstructing ancestor states with maximum likelihood: Support for one- and two-rate models. Syst. Biol. 48:623–633.
OMLAND, K. E. 1999. The assumptions and challenges of ancestral state reconstructions. Syst. Biol. 48:604–611.
PAGEL, M. 1999. The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies. Syst. Biol. 48:612–622.
REE, R. H., and M. J. DONOGHUE. 1999. Inferring rates of change in flower symmetry in asterid angiosperms. Syst. Biol. 48:633–641.
ROHATGI, V. K. 1976. An introduction to probability theory and mathematical statistics. Wiley & Sons, New York.
ROSS, S. M. 1996. Stochastic Processes, 2nd edition. Wiley & Sons, New York.
SCHULTZ, T. R., and G. A. CHURCHILL. 1999. The role of subjectivity in reconstructing ancestral character states: A Bayesian approach to unknown rates, states, and transformation asymmetries. Syst. Biol. 48:651–664.
SCHULTZ, T. R., R. B. COCROFT, and G. A. CHURCHILL. 1996. The reconstruction of ancestral character states. Evolution 50:504–511.
STEEL, M., and M. CHARLESTON. 1995. Five surprising properties of parsimoniously colored trees. Bull. Math. Biol. 57:367–375.
VENABLES, W. N., and B. D. RIPLEY. 1997. Modern Applied Statistics with S-Plus. Springer-Verlag, New York.
ZHANG, J., and M. NEI. 1997. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 44:S139–S146.

Received 4 May 2000; accepted 17 July 2000. Associate Editor: R. Olmstead