Syst. Biol. 55(2):245-258, 2006
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500481473

How Can Third Codon Positions Outperform First and Second Codon Positions in Phylogenetic Inference? An Empirical Example from the Seed Plants

MARK P. SIMMONS,1 LI-BING ZHANG,1,2 COLLEEN T. WEBB,1 AND AARON REEVES1,3

1 Department of Biology, Colorado State University, Fort Collins, Colorado 80523-1878, USA; E-mail: [email protected] (M.P.S.)
2 Current Address: Department of Integrative Biology, Brigham Young University, Provo, Utah 84602, USA
3 Current Address: Animal Population Health Institute, Colorado State University, Fort Collins, Colorado 80526-8117, USA

Abstract.—Greater phylogenetic signal is often found in parsimony-based analyses of third codon positions of protein-coding genes relative to their corresponding first and second codon positions, even for early-derived ("basal") clades. We used the Soltis et al. (2000; Bot. J. Linn. Soc. 133:381-461) data matrix of atpB and rbcL from 567 seed plants to quantify how each of six factors (observed character-state space, frequencies of observed character states, substitution probabilities among nucleotides, rate heterogeneity among sites, overall rate of evolution, and number of parsimony-informative characters) contributed to this phenomenon. Each of these six factors was estimated from the original data matrix for parsimony-informative third codon positions considered separately from first and second codon positions combined. One of the most parsimonious trees found was used as the constraint topology; branch lengths were estimated using likelihood-based distances, and characters were simulated on this tree. Differential frequencies of observed character states were found to be the most limiting of the factors simulated for all three codon positions.
Differential frequencies of observed character states and differential substitution probabilities among states were relatively advantageous for first and second codon positions. In contrast, differential numbers of observed character states, differential rate heterogeneity among sites, the greater number of parsimony-informative characters, and the higher overall rate of evolution were relatively advantageous for third codon positions. The amount of possible synapomorphy was predictive of the overall success of resolution. [Amount of possible synapomorphy; character-state frequencies; character-state space; codon positions; phylogenetic signal; rate heterogeneity.] In protein-coding genes, greater phylogenetic signal is often found in parsimony-based analyses of third codon positions relative to their corresponding first and second codon positions, even for early-derived ("basal") clades (e.g., Manhart, 1994; Lewis et al., 1997; Bjorklund, 1999; Kallersjo et al., 1998, 1999; Wenzel and Siddall, 1999; Campbell et al., 2000; Sennblad and Bremer, 2000; Simmons et al., 2002). Note that this pattern is by no means universal (e.g., Phillips and Penny, 2003; Simmons and Miya, 2004). The greater phylogenetic signal often found in third codon positions could be considered surprising given their faster rate of evolution, which can result in multiple hits along individual branches that can obscure synapomorphies and result in long-branch attraction (Felsenstein, 1978). Six factors that could contribute to this phenomenon are as follows. First, greater observed character-state space (i.e., the number of alternative states that a character may take) for third codon positions allows for having a higher rate of evolution without a linear increase in homoplasy (Naylor et al., 1995; Simmons et al., 2004b; Steel and Penny, 2005). Second, differential character-state frequencies among the states represented may affect the amount of homoplasy. 
For example, all else equal, less homoplasy and fewer unobserved substitutions may be expected for a character with 25% frequencies of A, C, G, and T than for a character with 49% A and T yet only 1% G and C. Third, differential substitution probabilities among character states, as is generally the case with transitions and transversions, also affect the amount of homoplasy and frequency of unobserved substitutions. Indeed, the second and third factors are tightly linked. Fourth, differential rate heterogeneity among sites may allow for faster-evolving characters to resolve recent divergences while more slowly evolving characters resolve the ancient divergences (Hillis, 1987). Fifth, all else equal, more parsimony-informative third codon-position characters decrease the potential for stochastic errors and increase branch-support values. Sixth, differences in the overall rate of evolution of the sampled characters affect the ability to infer phylogenetic relationships accurately at a given level of divergence. The fifth and sixth factors are closely linked, though the potential for long-branch attraction is primarily a function of the sixth factor. In this study, the Soltis et al. (2000) data matrix of two protein-coding plastid genes (atpB and rbcL) from 567 seed plants was used to quantify how each of six factors contributed to, or detracted from, this phenomenon. As described by Simmons et al. (2002), the 920 parsimonyinformative third codon positions from atpB and rbcL outperformed the 663 parsimony-informative first and second codon positions together for all three measures of phylogenetic signal that were used (resolution, branch support, and congruence with independent evidence). Third positions resolved 2.3 times the number of clades as first and second positions together, and, on average, resolved 113% larger clades than first and second positions. 
Of the 60 clades with >95% jackknife support on the 18S rDNA jackknife tree (the third gene sampled by Soltis et al. [2000]), 29.3% more were resolved by third positions. Of the 60 clades resolved by both first and second positions together as well as third positions analyzed separately, the clades had 14% higher average jackknife support with third positions. Third positions outperformed first and second positions in spite of an average of 2 and 2.5 times more observed substitutions than first and second codon positions, respectively, for the parsimony-informative characters. Similar results have been reported for rbcL in green plants by Lewis et al. (1997) and Kallersjo et al. (1999).

We used simulations to quantify how much each of the six factors affected the relative performance of the first and second positions combined and the third positions from Soltis et al. (2000). Each of the six factors was estimated from the original data matrix for parsimony-informative first and second positions combined and for parsimony-informative third positions. One of the most parsimonious trees found by Soltis et al. (2000) was used as the constraint topology, and branch lengths were estimated on this tree using likelihood-based distances. Each of these six factors was simulated independently of one another, as well as in all possible combinations. Performance of phylogenetic inference was measured by subtracting the number of clades incorrectly resolved from the number of clades correctly resolved in parsimony-based jackknife trees.

TABLE 1. Model parameters estimated for each data partition.a

Codons          G-T   C-T   C-G   A-T   A-G   A-C   piA   piC   piG   piT   Alpha
1st & 2nd       1     2.77  0.99  0.89  1.98  1.43  0.27  0.27  0.22  0.24  0.67
3rd             1     3.81  0.92  0.19  4.36  0.99  0.33  0.17  0.16  0.34  1.30
1st, 2nd, 3rd   1     3.30  1.14  0.28  3.47  1.07  0.32  0.20  0.16  0.32  0.85

a Rounded to the nearest hundredth. The parameters used in the simulations were rounded to the nearest millionth.
MATERIALS AND METHODS

After removal of 29 positions from the 5' end of rbcL and 58 positions from the 3' end of atpB (following the original authors; with one additional third codon position removed from the 5' end of rbcL following Simmons et al. [2002:80]), the Soltis et al. (2000) data matrix includes 1398 nucleotide characters representing 466 codons from rbcL (of which 788 are parsimony-informative) and 1470 nucleotide characters representing 490 codons from atpB (of which 795 are parsimony-informative) for 567 seed plants. Of the 1583 parsimony-informative nucleotide characters, 663 (of a possible 1912) are from first and second codon positions and 920 (of a possible 956) are from third codon positions.

Simulations

The general time-reversible (GTR) model with rate heterogeneity among sites following a gamma distribution (Yang, 1993) was chosen for the simulations. The invariant-sites parameter was not used because parsimony-uninformative sites were eliminated. Model parameters were estimated using Bayesian MCMC (Rannala and Yang, 1996; Yang and Rannala, 1997) with MrBayes 3.0b4 (Huelsenbeck and Ronquist, 2001) separately for (1) parsimony-informative first and second positions together, (2) parsimony-informative third positions only, and (3) parsimony-informative characters from all three codon positions. Parsimony-uninformative characters were eliminated because their inclusion would have altered the alpha parameter for the gamma distribution. Given that only the parsimony-informative characters are of interest to this study (because they are the characters being used in tree construction through parsimony; see Olmstead et al., 1998), it was considered appropriate to estimate model parameters from them exclusively.

For each of the three partitions from which model parameters were estimated, two independent analyses were run, with four chains per analysis, trees sampled every 100 generations, and a minimum of 4.6 million generations run (10 million for the first and second positions) per analysis. The analyses for the first and second positions asymptotically approached stationarity, reaching roughly the same -log likelihood. Neither the analyses for the third positions nor for all three positions reached the same stationarity within 4.6+ million generations. Model parameters were taken from the maximum posterior probability (MAP) tree (Rannala and Yang, 1996) sampled across both analyses for each partition (Table 1). Note that it is unlikely that the actual MAP trees were sampled in this number of generations for a data matrix of this size; to ensure doing so would be computationally intractable (Goloboff and Pol, 2005). This approach to estimating model parameters is based on the premise that model estimation is relatively insensitive to the tree topology used (Yang et al., 1995; Posada and Crandall, 2001).

The model parameters estimated for all three positions together were then used to estimate branch lengths, with one of the most parsimonious trees found by Soltis et al. (2000) as the constraint topology, in which all 565 clades were constrained as a fully dichotomous tree. Likelihood-based distances were calculated on this constraint topology in PAUP* 4.0b10 (Swofford, 2001) using neighbor-joining. Eight rate categories were used for the gamma distribution, and negative branch lengths were set to absolute branch lengths.

Matrices were simulated using the Evolver program within the PAML suite (Yang, 1997). The "MCbase.dat" parameter file was used to simulate the nucleotide characters. The most parsimonious tree topology with branch lengths determined using likelihood-based distances was used to simulate the characters. Twenty replicate matrices were simulated for each set of model parameters (see below).

Overall tree lengths per character used for the simulations were determined using two procedures. The goal of these two procedures was to estimate the average rate of evolution at the parsimony-informative first and second positions separately from that for the parsimony-informative third positions.

The primary procedure entailed adding up across the entire tree branch lengths that were estimated using likelihood-based distances from parsimony-informative first and second positions together, as well as those for parsimony-informative third positions only. The estimated overall tree length for first and second positions was 14.30562, and 36.94496 for third positions. This suggested that parsimony-informative third positions evolved on average 2.6 times faster than parsimony-informative first and second positions combined.

The secondary procedure entailed using average genetic distances among terminals (using the likelihood-based distances) as computed by PAUP*. This procedure was considered inferior to the primary procedure because it estimates distances without regard to the phylogeny, whereas the primary procedure took the inferred phylogenetic relationships into account. The average estimated distance between terminals for first and second positions was 0.089933, and 0.319677 for third positions. This indicated that parsimony-informative third positions evolved on average 3.5546 times the rate of parsimony-informative first and second positions combined.
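The two relative rates quoted above follow directly from the reported tree lengths and mean distances. The short sketch below only reproduces that arithmetic; the constant and function names are illustrative and are not taken from the original study.

```python
# Values reported in the text for parsimony-informative characters.
TREE_LENGTH_12 = 14.30562   # first and second positions, primary procedure
TREE_LENGTH_3 = 36.94496    # third positions, primary procedure
MEAN_DIST_12 = 0.089933     # first and second positions, secondary procedure
MEAN_DIST_3 = 0.319677      # third positions, secondary procedure


def rate_ratio(third: float, first_second: float) -> float:
    """Relative rate of third positions versus first and second positions."""
    return third / first_second


primary_ratio = rate_ratio(TREE_LENGTH_3, TREE_LENGTH_12)
secondary_ratio = rate_ratio(MEAN_DIST_3, MEAN_DIST_12)
print(f"primary: {primary_ratio:.2f}, secondary: {secondary_ratio:.4f}")
```

The primary ratio rounds to the 2.6 given in the text, and the secondary ratio agrees with the reported 3.5546.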
Five separate overall tree lengths per character were examined to bracket the actual overall rate of evolution of the different positions from Soltis et al. (2000): 10.39356, 14.30562, 25.62529, 36.94496, and 50.85076. The second and fourth rates represented the average overall rates for first and second positions combined and third positions, respectively, estimated using the primary procedure. The first rate represents 1/3.5546 the fourth rate, and the fifth rate represents 3.5546 times the second rate. The third rate was selected as intermediate between the second and fourth rates.

Model Parameters

To quantify how each of the six factors affected the relative performance of first and second positions combined relative to third positions, each of these factors needed to be considered independently of one another. By then progressively combining the different factors together with one another, we could examine how the factors interact with one another (e.g., do they cancel each other out, are their contributions additive, or are they more than the sum of their parts?).

Two initial simulations were performed using a Jukes-Cantor (1969) model (all four nucleotides represented, each represented in equal frequency, and without rate heterogeneity between nucleotides or among sites) for 663 characters and 920 characters, respectively. This simulation served as the baseline to compare whether each of the factors had a positive or negative effect on phylogenetic inference from first and second positions combined or third positions, respectively.

Differential character-state space was simulated independently of the other three model factors (differential frequencies of observed character states, differential substitution probabilities among nucleotides, and rate heterogeneity among sites; see Table 3) using the Felsenstein (1981) model. To decrease the potential for sequencing errors (see Kellogg and Juliano [1997]) artificially inflating the observed character-state space, all nucleotides represented in only one of the 567 terminals for a given character were re-scored as missing data before calculating the observed character-state space. Of the 663 parsimony-informative first and second positions, 65% (428) had two observed nucleotides, 23% (155) had three observed nucleotides, and 12% (80) had four observed nucleotides. Of the 920 parsimony-informative third positions, 33% (304) had two observed nucleotides, 25% (233) had three observed nucleotides, and 42% (383) had four observed nucleotides. Separate simulations were then performed using the Felsenstein (1981) model for each of the three possible numbers of observed character states, in which all of the states included were represented at equal frequencies. For two-state characters, for example, the character-state frequencies were set at 50% for two nucleotides and 0% for the other two nucleotides. The numbers of two-, three-, and four-state characters were simulated proportionally to their observed frequencies and then concatenated together. To examine the effect of differential character-state space independently of the greater number of parsimony-informative third positions, the simulations were also performed for third positions by multiplying the number of characters in each grouping by 0.720652 and rounding to whole numbers. This served to decrease the number of simulated third-codon-position characters from 920 to 663, which is identical to the number of parsimony-informative first- and second-position characters.

TABLE 2. Tenth percentile blocks for each nucleotide from first and second positions together separately from third positions.a

Third positions   First and second positions
T T A C A G C G
Percentile
90%   98.4   97.71  61.33  97.01  74.91  99.1  99     98.58
80%   96.26  96.8   17.1   92.26  91.74  10.7  95.42  22.56
70%   12.34  13.52  1.94   8.3    64.24  4.3   88.93  6.13
60%   0.62   3.4    2.02   1.72   14.72  2.5   48.4   1.2
50%   0.5    0.4    0.7    2.7    1.6    14.3  2.1    0.8
40%   0.4    0      0.4    1.1    3.56   1.4   0.4    0.9
30%   0      0      0      0      0.6    0.9   0.9    0.4
20%   0      0      0      0      0      0     0.4    0
10%   0      0      0      0      0      0     0      0

a Blocks that were grouped together are indicated by boxes.

To simulate differential frequencies (percentages) of observed character states independently of the observed character-state space, the differential ratios of nucleotide percentages needed to be maintained while having all four nucleotides represented. First, however, the characters needed to be partitioned into approximately homogeneous groups with respect to nucleotide percentages. MEGA 2.1 (Kumar et al., 2001) was used to calculate the percentage of each of the four nucleotides represented in each character from the transposed data matrices. Note that when doing so, MEGA automatically discards all polymorphic entries, which was considered appropriate here. The percentages for each character were loaded into Microsoft Excel, whereupon all percentages representing singleton character states were eliminated (see above), and the remaining percentages were recalculated to add up to 100%. Individual characters were grouped into blocks of characters representing each 10th percentile for each nucleotide (Table 2). For example, 9.5% (63) of the 663 first and second positions had a 98.4% or greater percentage of thymine represented among the 567 taxa sampled. Blocks representing 0% of a character state were grouped together. For example, the lowermost four blocks (first 40%) of the first and second positions all had 0% thymine and were consequently grouped together (Table 2). Also, blocks for which the uppermost and lowermost percentages were within 10% of one another were grouped together.
For example, the two uppermost blocks (above 80%) for third positions had 97.71% to 99.9% thymine and 95.42% to 97.70% thymine, respectively, and were grouped together (Table 2). A total of 48 separate character block patterns were thereby delimited for first and second positions, and 77 block patterns for third positions (available as an Excel file at http://systematicbiology.org/). For example, one third codon position character block pattern had 1.53% A, 5.7% G, 56.23% T, and 36.54% C. Only a subset of the possible block patterns was realized, in part because many of the possible patterns would have been mutually exclusive percentile classes for the four nucleotides (e.g., one cannot have an average of both 75% A and 75% T at third codon positions). Each character block pattern was simulated independently of the others using the Felsenstein (1981) model in Evolver, and the simulated character block patterns were concatenated together into a single NEXUS file using CONCAT (available at http://www.biology.colostate.edu/Research/). This methodology was performed identically for first and second positions combined independently of third positions. Maintaining the differential ratios of nucleotide percentages was straightforward for character block patterns with two nucleotides represented when transforming them to having all four nucleotides represented. The percentage of the two nucleotides represented was halved and then copied to the two unobserved nucleotides. For example, 60% A, 40% G, 0% T, 0% C would be changed to: 30% A, 20% G, 30% T, 20% C. For consistency, the same relative percentages within purines and pyrimidines were maintained when making these changes. When only one purine and one pyrimidine were sampled for a given character, the discrepancy in percentage within purines and within pyrimidines was maintained to the degree possible. For example, 60% A, 0% G, 40% T, 0% C would be changed to 30% A, 20% G, 20% T, 30% C. 
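The two-state to four-state transformation described above can be sketched as follows. This is a minimal illustration consistent with the two worked examples in the text, not the authors' Excel procedure. The function name is hypothetical, and the pairing of A with T and G with C for the case of two observed purines (or two observed pyrimidines) is an assumption inferred from the first example.

```python
PURINES = ("A", "G")
PYRIMIDINES = ("T", "C")
# Assumed pairing when both observed states belong to the same class:
# the halved share of A is copied to T and that of G to C (and vice versa).
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}


def expand_two_states(freqs: dict[str, float]) -> dict[str, float]:
    """Spread a two-state frequency pattern over all four nucleotides."""
    observed = [n for n in "AGTC" if freqs.get(n, 0) > 0]
    if len(observed) != 2:
        raise ValueError("expected exactly two observed nucleotides")
    halved = {n: freqs[n] / 2 for n in observed}
    result = dict(halved)
    first, second = observed
    if (first in PURINES) == (second in PURINES):
        # Same class: copy each halved share to its assumed partner.
        for n in observed:
            result[PARTNER[n]] = halved[n]
    else:
        # One purine and one pyrimidine: each unobserved state takes the
        # halved share of the observed state in the other class, so the
        # discrepancy within purines and within pyrimidines is preserved.
        purine = first if first in PURINES else second
        pyrimidine = second if purine == first else first
        other_purine = next(n for n in PURINES if n != purine)
        other_pyrimidine = next(n for n in PYRIMIDINES if n != pyrimidine)
        result[other_purine] = halved[pyrimidine]
        result[other_pyrimidine] = halved[purine]
    return result


same_class = expand_two_states({"A": 60, "G": 40, "T": 0, "C": 0})
mixed_class = expand_two_states({"A": 60, "G": 0, "T": 40, "C": 0})
```

Under these assumptions, `same_class` and `mixed_class` reproduce the two examples given in the text (30% A, 20% G, 30% T, 20% C and 30% A, 20% G, 20% T, 30% C, respectively).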
For character block patterns with three character states, an ad hoc method was applied in an attempt to maintain the discrepancy in nucleotide percentages. This involved changing from three states to two states, and then following the procedure outlined above. To change from three states to two states, the state with the intermediate percentage was averaged first with the low-percentage state and then with the high-percentage state. The percentage of one of the two resultant character states is x/(x + y), where the average of the percentages of the most highly represented state and the intermediate state is x, and the average of the percentages of the least represented state and the intermediate state is y. The percentage of the other resultant character state is y/(x + y). For example, 50% A, 40% G, 10% T, 0% C would be changed to: 64.3% A, 35.7% G, 0% T, 0% C. Following the procedure outlined above for two states, this would then be changed to: 32.15% A, 32.15% G, 17.85% T, 17.85% C. The two states that were originally in highest percentage (adenine and guanine) were maintained at the higher percentage in the resultant four-state character.

Ideally, character block patterns with all four states would not have to be modified. Due to the grouping procedure used, many characters lacking some states were often grouped together with other characters in which the state was represented at a minute percentage. As a result, there were almost always four states represented in each group, even though two or three states were often represented below 1% each. In these cases, states represented at <1% were grouped together to make two- or three-state characters, whereupon the procedures described above were followed. When two or three other states were represented at >1%, those state(s) represented at <1% were combined with the lowest-percentage state represented at >1%. For example, 60% A, 39% G, 0.5% T, 0.5% C would be changed to 60% A, 40% G, 0% T, 0% C.
Following the procedure outlined above for two states, this would then be changed to: 30% A, 20% G, 30% T, 20% C. When all four states were represented at >1%, no modifications were made. An Excel file detailing all changes made is available as supplemental data at http://systematicbiology.org/. To examine differential percentages of observed character states independently of the greater number of parsimony-informative third positions, the simulations were also performed for third positions by multiplying the number of characters in each grouping by 0.720652 and rounding to whole numbers.

Differential substitution probabilities among nucleotides were simulated independently of the other three model factors using the GTR model with all four nucleotides represented in equal frequency. The substitution probabilities for first and second positions combined used in Evolver were G-T: 0.505105; C-T: 1.400678; C-G: 0.501771; A-T: 0.45084; A-G: 1; A-C: 0.724495. The substitution probabilities for third positions used in Evolver were: G-T: 0.229471; C-T: 0.87462; C-G: 0.21174; A-T: 0.043822; A-G: 1; A-C: 0.227507. (Note that Evolver sets the A-G rate to 1, whereas MrBayes sets the G-T rate to 1; hence the differences relative to Table 1.) To examine differential substitution probabilities among nucleotides independently of the greater number of parsimony-informative third positions, the simulations were also performed for third positions using only 663 characters.

Rate heterogeneity among sites was simulated independently of the other three model factors using the gamma distribution with the Jukes-Cantor (1969) model. Alpha was set at 0.668668 for first and second positions combined, and 1.299032 for third positions (indicating greater rate heterogeneity among sites for first and second positions). Twenty categories were used for the discrete gamma distribution.
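The note above about Evolver fixing the A-G rate at 1 while MrBayes fixes the G-T rate at 1 amounts to dividing every rate by the A-G value. The sketch below illustrates this with the rounded first- and second-position values from Table 1; because the study used parameters rounded to the nearest millionth, the result only approximates the Evolver values quoted in the text. The function and variable names are illustrative.

```python
# MrBayes-style relative rates for first and second positions (Table 1),
# scaled so that G-T = 1 and rounded to the nearest hundredth.
MRBAYES_RATES_12 = {
    "G-T": 1.0, "C-T": 2.77, "C-G": 0.99,
    "A-T": 0.89, "A-G": 1.98, "A-C": 1.43,
}


def rescale_to_reference(rates: dict[str, float], reference: str = "A-G") -> dict[str, float]:
    """Rescale relative substitution rates so that `reference` equals 1."""
    ref = rates[reference]
    return {pair: value / ref for pair, value in rates.items()}


evolver_like = rescale_to_reference(MRBAYES_RATES_12)
for pair, value in evolver_like.items():
    print(f"{pair}: {value:.3f}")
```

The rescaled values agree with the reported Evolver inputs (for example, G-T 0.505105 and C-T 1.400678) to within the rounding of Table 1.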
To examine rate heterogeneity among sites independently of the greater number of parsimony-informative third positions, the simulations were also performed for third positions using only 663 characters.

Each of the six pairwise combinations of parameters was then examined, followed by the four triplets, and all four parameters together (Table 3). For the three combinations wherein differential state space and substitution probabilities among nucleotides were examined together, blocks of characters with different pairs or triplets of states represented were simulated independently of one another following Table 4. Each of these simulations was also performed using only 663 third-position characters, as described above.

TABLE 3. Model parameters that were incorporated in each of the 15 simulations that were performed for all five rates of evolution examined (indicated by Xs). Columns are simulation numbers 1 to 15; rows are state space, frequencies of observed states, substitution probabilities among states, and rate heterogeneity among sites.

Phylogenetic Analyses

A total of 235 sets of data matrices were simulated (1 pair of baseline simulations + 15 simulations as outlined in Table 3, each including different simulations for first and second positions combined, all third positions, and 663 third positions; 5 tree lengths), with each set consisting of 20 replicate data matrices. A total of 4700 jackknife analyses were performed (235 sets of data matrices; 20 replicates per set), with each analysis composed of 1000 jackknife replicates for a grand total of 4.7 million parsimony-based tree-bisection-reconnection (TBR) tree searches. All characters were equally weighted following Soltis et al. (2000) and Simmons et al. (2002).
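The simulation counts given above can be checked by enumerating the factor combinations. The following sketch is only a bookkeeping aid; the variable names and the short factor labels are illustrative and do not come from the original study.

```python
from itertools import combinations

# The four model factors that were combined in the simulations.
FACTORS = (
    "state space",
    "state frequencies",
    "substitution probabilities",
    "rate heterogeneity",
)

# Singles, pairs, triplets, and all four factors together.
simulations = [combo for k in range(1, 5) for combo in combinations(FACTORS, k)]

PARTITIONS = 3          # 1st + 2nd positions, all 920 3rd positions, 663 3rd positions
BASELINE_PAIR = 2       # Jukes-Cantor baselines for 663 and 920 characters
TREE_LENGTHS = 5        # overall rates of evolution examined
REPLICATES = 20         # replicate matrices per set
JACKKNIFE_REPS = 1000   # jackknife replicates per analysis

matrix_sets = (len(simulations) * PARTITIONS + BASELINE_PAIR) * TREE_LENGTHS
analyses = matrix_sets * REPLICATES
tree_searches = analyses * JACKKNIFE_REPS
print(len(simulations), matrix_sets, analyses, tree_searches)
```

The enumeration yields 15 factor combinations, 235 sets of matrices, 4700 jackknife analyses, and 4.7 million tree searches, matching the totals stated in the text.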
Because of the computational demands, each jackknife replicate was limited to a single TBR tree search with random sequence addition and a single tree held. Parsimony jackknife analyses (Farris et al., 1996) were conducted using PAUP* with the removal probability set to approximately e^-1 (37%), and "jac" resampling emulated (such that the deletion probability is applied to each character individually rather than to an overall percentage of characters; see Freudenstein et al., 2004). Note that jackknife values obtained using the e^-1 deletion probability are generally higher than bootstrap values (Mort et al., 2000; Davis et al., 2004; Felsenstein, 2004). Jackknife trees were independently calculated for all clades with >50% support, >70% support, and >95% support.

TABLE 4. Numbers of parsimony-informative characters for which each combination of nucleotides was represented for first and second positions independently of third positions.

Codons      AG    AC   AT   TC    TG   CG   ACG   TAG   TCA   TCG   AGCT
1st & 2nd   114   69   35   117   39   54   71    23    33    28    80
3rd         132   5    5    156   6    0    68    29    50    86    383

PEST version 2.2 (Zujko-Miller and Miller, 2003) was used to determine the number of clades correctly and incorrectly resolved in the jackknife trees relative to the reference trees for each matrix. Three reference trees were used: (1) the tree topology on which the characters were simulated (all 565 clades); (2) the simulated tree topology in which only the 96 clades that included >16 terminals were resolved; and (3) the simulated tree topology in which only the 410 clades that included <7 terminals were resolved (see below). Unless otherwise noted, the relative performance of phylogenetic inference was assessed using the overall success of resolution (the number of clades correctly resolved minus the number of clades incorrectly resolved).
For fully resolved trees, this scales linearly to the Robinson-Foulds distance (Robinson and Foulds, 1981; Penny and Hendy, 1985), which would range from 0 to 1130 for our trees. This approach may appear to treat incorrect resolution equally to correct resolution (e.g., the overall success for 110 clades correctly resolved and 100 clades incorrectly resolved could receive the same score as 10 clades correctly resolved and the remaining clades unresolved), whereas many systematists would be much more concerned about well-supported, incorrect resolution. However, long-branch attraction between two distantly related terminals would result in many clades being scored as incorrectly resolved. This effect was considered to be a sufficient extra penalty for incorrect resolution. We examined the overall success of resolution in three ways, each using results from 50%, 70%, and 95% jackknife trees. First, we considered overall success for the entire tree of 565 clades. In this case, the maximum score was 565, for all clades correctly resolved, and the worst possible score was -565, in which all clades from the reference tree would be contradicted. Second, we restricted our attention to the larger clades (in this case, the 96 clades that included >16 terminals). All else equal, these clades represent the early-derived (or "basal") lineages. Here, the maximum score was 96 and the minimum score was —96. Third, we only examined smaller clades (in this case, the 410 clades that included <7 terminals). All else equal, these clades represent the recently derived (or "distal") lineages. In this case, the best possible score was 410 and the worst possible score was —410. The maximum possible number of steps minus the minimum possible number of steps for each matrix (as determined by PAUP*) was used as a measure of the "amount of possible synapomorphy" (Farris, 1989:418; see also Simmons et al., 2004a). 
As such, the amount of possible synapomorphy for parsimony-uninformative characters is zero. The maximum and minimum possible number of steps are determined strictly by reference to the data matrix, not to any particular tree. The minimum number of steps for each character, summed across all characters, would only be the same as the most parsimonious tree length if there was no character conflict (i.e., CI = 1). For example, for the Soltis et al. (2000) matrix of 567 terminals in which 7 terminals have an adenine at a given nucleotide position and the other 560 terminals have a guanine at that position, the amount of possible synapomorphy would be six (maximum = 7, minimum = 1). The amount of possible synapomorphy for an entire data matrix (as used here) is the sum of the amount of possible synapomorphy from all characters. In this study, we were interested in determining whether the amount of possible synapomorphy would be predictive of the overall success of resolution for each matrix.

TABLE 5. Response variables, independent variables, and the number of characters used for the 3rd position characters for each of the multiple regression analyses performed. R2 values indicate the amount of variability in the data explained by the model for each analysis at 50% and 95% cutoffs for the entire tree, just the smaller clades, and just the larger clades.

R2 A1 Model Response variable Overall success A2 Incorrectly resolved No. of 3rd position characters 50% 95% Sml. Large Ent. Sml. Large 920 0.96 0.96 0.90 0.93 0.93 0.84 920 0.62 0.69 0.27 0.26 0.31 0.03 663 0.94 0.94 0.87 0.91 0.92 0.75 Position Rate 920 0.83 0.80 0.67 0.77 0.82 0.75 0.81 0.77 0.67 0.75 0.79 0.89 0.70 0.69 0.55 0.66 0.70 0.02a 0.90 0.88 0.82 0.88 0.87 0.96 0.88 0.88 0.84 0.86 0.87 0.98 0.90 0.79 0.62 0.88 0.78 0.29 Position Rate 920 Position No. of factors Rate Position x No. of factors Position Factor Rate1 Factor x Rate1 Position Amount of synapomorphy Position x Amount of synapomorphy 920 0.16 0.27 0.01a 0.02a 0.31 0.32 0.54 0.25 0.28 0.01a 0.06 0.40 0.53 0.54 0.07 0.09 0.05 0.06 0.01a 0.06 0.51 0.08 0.13 0.12 0.16 0.07 0.24 0.63 0.05 0.15 0.16 0.12 0.05 0.26 0.62 0.00a 0.00a 0.04a 0.04a 0.02a 0.02a 0.60 663 0.96 0.96 0.88 0.98 0.99 0.81 920 0.97 0.97 0.91 0.96 0.96 0.91 Independent variables Position Factor Rate Position x Factor Position Factor Ent. Rate B Overall success C1 D Overall success Baseline State space State frequency GTR model Rate heterogeneity Four-way interaction Incorrectly resolved Baseline State space State frequency GTR model Rate heterogeneity Four-way interaction Overall success E Overall success F Overall success C2 Position x Factor Position Factor Rate Position x Factor

a Overall model not significant at P = 0.01 level.

Statistical Analyses

In order to determine how each of the six factors affected the relative performance of first and second positions relative to third positions, several different multiple regression models were implemented in JMP IN (SAS Institute, Table 5). For each regression model, the response variable was either the overall success of resolution or the number of incorrectly resolved clades (Table 5). All independent variables were treated as fixed effects.
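The amount of possible synapomorphy defined earlier in this section can be illustrated with a small sketch. It uses the standard bounds for an unordered character: the minimum number of steps is the number of observed states minus one, and the maximum is the number of scored terminals minus the count of the most frequent state. The function names are hypothetical, and ignoring any symbol other than A, C, G, and T (treating it as missing data) is an assumption of this sketch.

```python
from collections import Counter


def possible_synapomorphy(column: str) -> int:
    """Maximum minus minimum possible steps for one unordered character."""
    # Assumption: anything other than A, C, G, T is treated as missing data.
    counts = Counter(state for state in column if state in "ACGT")
    if not counts:
        return 0
    scored = sum(counts.values())
    max_steps = scored - max(counts.values())
    min_steps = len(counts) - 1
    return max_steps - min_steps


def matrix_possible_synapomorphy(columns: list[str]) -> int:
    """Sum of the per-character amounts across a data matrix."""
    return sum(possible_synapomorphy(column) for column in columns)


# Worked example from the text: 7 terminals with A and 560 with G.
example = possible_synapomorphy("A" * 7 + "G" * 560)
# A parsimony-uninformative character (one terminal differs) scores zero.
singleton = possible_synapomorphy("A" + "G" * 566)
```

For the worked example in the text the sketch gives a maximum of 7, a minimum of 1, and therefore an amount of six, while the singleton character contributes zero.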
The independent variables used in the different analyses were (1) position, a nominal categorical variable indicating codon position (0 = 1st and 2nd positions, 1 = 3rd positions); (2) factor, a nominal categorical variable indicating baseline, state space, frequency of states, GTR model, rate heterogeneity, and all two-way, three-way, and four-way combinations of nonbaseline factors; (3) rate, the rate of evolution; (4) number of factors, the number of factors included in the simulation model (1, 2, 3, or 4); and (5) rate1, an ordered categorical variable for the rate of evolution (0 = 14.30562, 1 = 36.94496) (Table 5). The specific combinations of independent variables and interactions included in each model are shown in Table 5. Several versions of models C1 and C2 (Table 5) were performed, one for each of the single factors, including baseline, and one for the four-way interaction. Models B and E (Table 5) are similar regression models except that model B contains the variable rate, which is essentially continuous, and model E contains the variable rate1, which is discrete and only includes the second and fourth rates. The second and fourth rates represented the average overall rates for first and second positions combined and third positions, respectively, estimated using the primary procedure. Model E allows us to compare these rates statistically using contrasts. The data were analyzed separately for the 50% and 95% cutoffs for the entire tree, the smaller clades by themselves, and the larger clades by themselves. This was done because of inherent correlations among the data (for cutoff and clade) that cannot easily be taken into account in the regression models without affecting significance levels.
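The levels of the categorical variable factor described above (baseline, each single heterogeneous parameter, and every two-way, three-way, and four-way combination) can be enumerated as follows. This is only an illustrative sketch: the labels and the function name are hypothetical. Coding the baseline as zero factors is also an assumption, since the text lists only 1 to 4 for the number-of-factors variable.

```python
from itertools import combinations

# The four heterogeneous model parameters named in the text.
FACTORS = ("state space", "state frequency", "GTR model", "rate heterogeneity")


def factor_levels(factors=FACTORS):
    """Return (label, number_of_factors) for every simulation condition."""
    levels = [("baseline", 0)]  # assumption: baseline carries no factors
    for k in range(1, len(factors) + 1):
        for combo in combinations(factors, k):
            levels.append((" + ".join(combo), k))
    return levels


levels = factor_levels()
# 1 baseline + 4 single + 6 two-way + 4 three-way + 1 four-way = 16 levels
print(len(levels))
for label, n_factors in levels:
    print(n_factors, label)
```

Enumerating the levels this way makes it easy to check that every combination is simulated once and to derive the number-of-factors variable from the same list.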
For all of our analyses, residuals were normally distributed, and no high-leverage points or outliers were observed, indicating that multiple regression on the untransformed data was appropriate. Least-squares mean estimates of the categorical independent variables were obtained in addition to the parameter estimates, and independent contrasts on the least-squares means were performed where needed. Groups of analyses with multiple tests were Bonferroni-corrected in order to control for spurious significant results that could be caused by the large numbers of comparisons.

RESULTS AND DISCUSSION
The results were generally not qualitatively different when using the 50%, 70%, or 95% jackknife trees for our analyses. Unless otherwise noted, the relative performance of the simulated characters using the overall success of resolution was assessed using both the 50% and 95% jackknife trees. The overall success of resolution and the number of clades incorrectly resolved on the 50% jackknife trees (which showed a greater spread than the 70% and 95% jackknife trees and were therefore easier to graph) are presented in Figures 1 and 2 for each of the four heterogeneous model parameters simulated independently of one another. Excel files of the raw data and figures for the average number of clades correctly resolved, the average number of clades incorrectly resolved, and the average overall success of resolution using the 50%, 70%, and 95% cutoffs are available as supplemental data at http://systematicbiology.org/.

Our results from regression model A1 show that, taken across all five rates of evolution examined together, incorporation of the four heterogeneous model parameters examined (observed character-state space, frequencies of observed character states, substitution probabilities among nucleotides, and rate heterogeneity among sites) all had a negative effect on phylogenetic inference relative to the baseline Jukes-Cantor model (Fig. 1). This was found for both first and second positions together as well as for third positions, across the entire tree of 565 clades, for the larger clades by themselves, and for the smaller clades by themselves (Table 6). In all cases, the most severely limiting factor was the differential frequencies of observed character states, followed by rate heterogeneity among sites, observed character-state space, and the differential substitution probabilities among nucleotide character states (Table 6).

TABLE 6. The least-squared means of overall success of resolution scaled to 1, with 3rd positions calculated for 920 characters.
                        50% Cutoff            95% Cutoff
                        1st & 2nd   3rd       1st & 2nd   3rd
Overall
  Baseline              .74         .80       .49         .58
  State space           .67         .77^a     .42         .543^a
  State frequencies     .60         .66       .38^a       .46
  Model                 .73^a       .77       .48^a       .544
  Rate heterogeneity    .65         .75^a     .39         .52^a
Larger clades
  Baseline              .60         .70       .29         .39
  State space           .51         .65^a     .20         .311^a
  State frequencies     .38         .46       .15^a       .23
  Model                 .58^a       .66^a     .28         .35
  Rate heterogeneity    .47^a       .61^a     .17^a       .306^a
Smaller clades
  Baseline              .78         .83       .54         .63
  State space           .72         .799^a    .48         .5975^a
  State frequencies     .66         .72       .445^a      .53
  Model                 .77^a       .804      .53^a       .5976
  Rate heterogeneity    .70         .793^a    .453        .58^a
^a Contrast relative to the partition with the next highest least-squared mean not significant at the 0.05 level after Bonferroni correction.

The finding that differential frequencies of observed character states and lower observed character-state space are disadvantageous for phylogenetic inference corroborates the results from Simmons et al.'s (2004b) simulations. In contrast, our finding that rate heterogeneity among sites was disadvantageous is contradictory. Note that the use of the gamma distribution (by itself or when simulated together with other heterogeneous model parameters) often resulted in both constant and variable but parsimony-uninformative characters for both first and second positions together as well as third codon positions across all five overall tree lengths per character simulated. In these cases, the same expected overall number of changes still occurred for each tree length per character simulated, but they were concentrated in a subset of the available characters. This resulted in, on average, a faster rate of evolution per parsimony-informative character for those characters simulated using rate heterogeneity relative to those simulated without it.

Characters evolving at a faster rate would be expected to have a higher chance of having multiple hits along individual branches as well as more cases of ambiguous optimization, both of which would lead to reduced resolution and support for correctly resolved clades. In this particular empirically based simulation study, those negative effects were not sufficiently outweighed by the positive effects of the many more slowly evolving characters. As such, although rate heterogeneity among characters may generally be advantageous for phylogenetic inference (Hillis, 1987), our study indicates that it is not always beneficial when the overall number of character-state changes is held constant. This result reinforces the importance of conducting empirically based simulations (e.g., Hillis, 1996) to supplement those performed using simplistic tree topologies and branch lengths (e.g., Simmons et al., 2004b).

FIGURE 1. The average overall success of resolution (number of clades correctly resolved minus the number of clades incorrectly resolved) for jackknife trees using the 50% cutoff, across all five average numbers of steps per parsimony-informative (PI) character, for each of the four heterogeneous model parameters (character-state space, character-state frequencies, rate heterogeneity among nucleotide states [model], and rate heterogeneity among sites), independently of one another. The baselines differ only in the number of characters sampled (663 for 1st & 2nd positions; 920 for 3rd positions). (a) Measured across the entire tree of 565 clades for all 1st & 2nd positions relative to all 920 3rd positions. (b) Measured across the entire tree for all 1st & 2nd positions relative to 663 3rd positions. (c) Measured for the 96 larger clades for all 1st & 2nd positions relative to all 920 3rd positions. (d) Measured for the larger clades for all 1st & 2nd positions relative to 663 3rd positions. (e) Measured for the 410 smaller clades for all 1st & 2nd positions relative to all 920 3rd positions. (f) Measured for the smaller clades for all 1st & 2nd positions relative to 663 3rd positions. [x-axis in each panel: average steps / PI character; series: baseline, state space, state frequency, model, and rate heterogeneity, each for 1+2 and for 3rd positions.]

FIGURE 2. The average number of clades incorrectly resolved for jackknife trees using the 50% cutoff, across all five average numbers of steps per parsimony-informative (PI) character, for each of the four heterogeneous model parameters (character-state space, character-state frequencies, rate heterogeneity among nucleotide states [model], and rate heterogeneity among sites), independently of one another. The baselines differ only in the number of characters sampled (663 for 1st & 2nd positions; 920 for 3rd positions). (a) Measured across the entire tree of 565 clades for all 1st & 2nd positions relative to all 920 3rd positions. (b) Measured across the entire tree for all 1st & 2nd positions relative to 663 3rd positions. (c) Measured for the 96 larger clades for all 1st & 2nd positions relative to all 920 3rd positions. (d) Measured for the larger clades for all 1st & 2nd positions relative to 663 3rd positions. (e) Measured for the 410 smaller clades for all 1st & 2nd positions relative to all 920 3rd positions. (f) Measured for the smaller clades for all 1st & 2nd positions relative to 663 3rd positions. [x-axis in each panel: average steps / PI character; series: baseline, state space, state frequency, model, and rate heterogeneity, each for 1+2 and for 3rd positions.]

TABLE 7. The least-squared means of overall success of resolution scaled to 1, with 3rd positions calculated for 663 characters.
                        50% Cutoff            95% Cutoff
                        1st & 2nd   3rd       1st & 2nd   3rd
Overall
  State space           .67         .70       .42         .448
  State frequencies     .60         .50       .38^c       .34
  Model                 .73^b,c     .70       .48^c       .450
  Rate heterogeneity    .65         .69       .39         .435
  Overall rate^a        .69         .80       .38         .59
Larger clades
  State space           .51^b       .56       .20         .25
  State frequencies     .38         .13       .15^c       .08
  Model                 .58^b,c     .54       .28         .24
  Rate heterogeneity    .47^b,c     .53       .17^c       .22
  Overall rate^a        .54         .68       .19         .38
Smaller clades
  State space           .72^b       .74       .48         .505
  State frequencies     .66         .61       .445^c      .42
  Model                 .77^b,c     .75       .53^c       .510
  Rate heterogeneity    .70         .72       .453        .48
  Overall rate^a        .73         .83       .44         .65
^a Calculated for 1st and 2nd positions at the second rate of evolution (14.30562) and for 3rd positions at the fourth rate of evolution (36.94496) using regression model E. All other results from regression model B.
^b Contrast of 1st and 2nd positions versus 3rd positions not significant at the 0.05 level after Bonferroni correction.
^c Contrast relative to the partition with the next highest least-squared mean not significant at the 0.05 level after Bonferroni correction.

Results from regression models B and E show that, taken across all five rates of evolution examined together, two of the heterogeneous model parameters examined (frequencies of observed character states and substitution probabilities among nucleotide character states) favored first and second positions, whereas observed character-state space and rate heterogeneity among sites favored third positions (Fig. 1).
This was determined by testing for significant differences in the overall success of resolution when incorporating each heterogeneous factor into the simulation model independently of one another for all parsimony-informative first and second positions relative to the same number (663) of parsimony-informative third positions. The differences were significant in all cases (across the entire tree of 565 clades, as well as when only examining the larger or smaller clades independently of one another; Table 7) when applied to the 95% jackknife trees, and in 6 of the 12 cases for the 50% jackknife trees.

Number of Parsimony-Informative Characters
The greater number of parsimony-informative third positions provided a significant increase in the overall success of resolution when comparing the baseline Jukes-Cantor model between first and second positions (663 characters) and third positions (920 characters) in all cases (Table 6). Likewise, the faster overall rate of evolution for third positions (fourth rate: 36.94496 steps per parsimony-informative character) was found to be a significant advantage relative to the slower overall rate of evolution for first and second positions (second rate: 14.30562 steps per parsimony-informative character; Table 7).

Overall Rate of Evolution
Results from regression model A1 show that, taken across all five rates of evolution examined, increasing the rate of evolution invariably improved the overall success of resolution (when significantly different from zero) at both the 50% and 95% cutoffs, across the entire tree of 565 clades, as well as when only examining the larger or smaller clades independently of one another (Fig. 1, Table 8). This result indicates that, taken across the tree as a whole, the taxon sampling used by Soltis et al.
(2000) was sufficiently dense so as to largely prevent saturation (i.e., multiple hits along an individual branch; see Wenzel and Siddall, 1999) at third positions from overwhelming phylogenetic signal (Hillis, 1996, 1998; Soltis et al., 2004; Albert, 2005). Results from regression model A2 indicate that, in some cases, increasing the rate of evolution led to fewer incorrectly resolved clades (Figure 2, Table 8), presumably reducing the number of incorrectly resolved clades that were weakly supported due to stochastic effects. This explanation is consistent with the reduction in incorrect resolution at higher rates of evolution being more commonly observed on the 50%, rather than the 95%, jackknife trees (Table 8). More generally, however, increasing the rate of evolution led to more incorrectly resolved clades, although the relationship was not strong. Either way, the change in the number of incorrectly resolved clades was often not significantly different from zero after the Bonferroni correction (Table 8).

TABLE 8. Parameter estimate for the rate of evolution, with 3rd positions calculated for all 920 characters.
Overall success Overall Across all Baseline State space State frequencies Model Rate heterogeneity Space + frequency + model + rate heterogeneity Larger clades Across all Baseline State space State frequencies Model Rate heterogeneity Space + frequency + model + rate heterogeneity Smaller clades Across all Baseline State space State frequencies Model Rate heterogeneity Space + frequency + model + rate heterogeneity Incorrectly resolved 50% 95% 50%
1.30 2.38 2.00 2.20 2.21 1.82 0.18"
2.21 4.32 3.43 2.96 4.01 2.94 0.47
0.03" -0.07" 0.03" -0.07" -0.01" -0.06" 0.05°
0.34 0.55 0.51 0.64 0.55 0.49 0.13"
0.26 0.76 0.53 0.36 0.68 0.39 -0.02"
0.05 0.05" -0.05" 0.04 0.01" -0.16"
0.84 1.52 1.27 1.30 1.41 1.15 0.10"
-o.or
1.74 3.08 2.55 2.96 2.90 2.26 0.47
0.06 -0.10 - 5 x 10"3" 5 x 10"3" -0.06" -0.06" 0.19"
95% 0.02 0.01 0.02 0.02" 0.02 0.01" 0.02"
- 2 x 10"3" 1 x 10-"" 4 x 10-4" -o.or 2 x 10- " 3 7 x 10"4" - 2 x 10"3"
0.02 5 x 10"3" 0.02 0.03 0.01 0.01" 0.02"
"Not significantly different from zero at the 0.05 level after Bonferroni correction.

TABLE 9. Slope of the regression of overall success on the number of heterogeneous model parameters, with 3rd positions calculated for all 920 characters.
                   50% Cutoff            95% Cutoff
                   1st & 2nd   3rd       1st & 2nd   3rd
Overall            -96.91      -81.48    -65.01^a    -61.25^a
Larger clades      -18.77      -21.21    -7.14       -9.81
Smaller clades     -66.40      -49.16    -51.44      -43.28
^a Not significantly different from one another at the 0.05 level after Bonferroni correction.

Increasing Number of Heterogeneous Parameters
Results from regression model D show that, taken across all five rates of evolution examined, increasing the number of heterogeneous model parameters incorporated into the simulations was significantly more disadvantageous for first and second positions than it was for third positions, as measured by the slope of the regression of overall success on the number of heterogeneous model parameters incorporated in the simulations. This occurred when the overall success of resolution was measured across the entire tree using the 50% jackknife cutoff (not significant at the 95% cutoff) and when restricting attention to the smaller clades (using both jackknife cutoffs; Table 9).
In contrast, increasing the number of heterogeneous model parameters was significantly more disadvantageous for third positions than it was for first and second positions for the larger clades (using both jackknife cutoffs; Table 9). Whereas third positions were more robust to incorporation of their heterogeneity for resolving the smaller clades, the first and second positions were more robust to incorporation of their heterogeneity for resolving larger clades, suggesting that the different heterogeneous model parameters examined have significantly different effects on our ability to infer larger clades and smaller clades. However, there is also the confounding effect of the greater number of parsimony-informative third positions (920 versus 663). When the third positions were restricted to the same number of characters as the first and second positions (663 versus 663), their slope changed from -81.48 to -94.59 using the 50% jackknife cutoff while examining clades across the entire tree. This slope is not significantly different from the slope for first and second positions (-96.91; Table 9), indicating that the significant difference is primarily caused by the greater number of parsimony-informative third-position characters.

FIGURE 3. The amount of possible synapomorphy across all five average numbers of steps per parsimony-informative character for each of the four heterogeneous model parameters (character-state space, character-state frequencies, rate heterogeneity among nucleotide states [model], and rate heterogeneity among sites) independently of one another, measured across the entire tree of 565 clades for all 1st & 2nd positions relative to all 920 3rd positions. [x-axis: average steps / PI character; series: baseline, state space, state frequency, model, and rate heterogeneity, each for 1+2 and for 3rd positions.]
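The slope comparison discussed above (overall success regressed on the number of heterogeneous model parameters, separately for each partition) can be illustrated with an ordinary least-squares slope. This sketch is a simplification and not the authors' model D, which also includes rate and a position-by-number-of-factors interaction and was fitted in JMP IN. The input numbers below are made up purely to show the calculation and are not results from the study.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical illustration only: overall success of resolution for
# simulations that incorporate 1, 2, 3, or 4 heterogeneous parameters.
n_factors = [1, 2, 3, 4]
success_first_second = [400, 310, 220, 130]  # made-up values
success_third = [420, 345, 270, 195]         # made-up values

slope_12 = ols_slope(n_factors, success_first_second)
slope_3 = ols_slope(n_factors, success_third)
# A more negative slope means that adding heterogeneous parameters
# costs that partition more resolved clades per added parameter.
print(slope_12, slope_3)
```

In a full analysis the two slopes would be compared with a contrast on the interaction term rather than by eye, as in the regression models listed in Table 5.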
Amount of Possible Synapomorphy
Results from regression model F show that, taken across all five rates of evolution examined, the amount of possible synapomorphy was predictive of the overall success of resolution at both the 50% and 95% cutoffs, across the entire tree of 565 clades, as well as when only examining the larger or smaller clades independently of one another. This is indicated by the significant amount-of-possible-synapomorphy parameter estimate (Table 10, all results significant at the Bonferroni-corrected P = 0.01 level). The amount of possible synapomorphy was predictive of the overall success of resolution for both the first and second positions and the third positions, independently of one another, as indicated by the significantly positive slopes of the regression lines (Table 10). The slopes in Table 10, although significant in all cases, are very shallow because of the dramatic differences in scale between the overall success of resolution and the amount of possible synapomorphy. For example, for the results presented in Figure 3, the average overall success of resolution ranged from 275 to 492, whereas the average amount of possible synapomorphy ranged from 15,785 to 101,647.

TABLE 10. Parameter estimate for the amount of possible synapomorphy (APS) and slope of the regression of overall success on the amount of possible synapomorphy with 3rd positions calculated for all 920 characters.
95% Cutoff 50% Cutoff Overall Larger clades Smaller clades APS parameter estimate 1st & 2nd Slope 3rd Slope APS parameter estimate 1st & 2nd Slope 3rd Slope
0.0038 0.0008 0.0025 0.0050 0.0010 0.0034 0.0027 0.0007 0.0016 0.0008 0.0004 0.0033 0.0010 0.0004 0.0039 0.0007 0.0004 0.0027

CONCLUSIONS
Several assumptions are inherent in this sort of study wherein models are used to simulate empirical data (as with parametric bootstrapping [Saitou and Nei, 1986], for instance). First, the model used (GTR+Γ) was assumed to sufficiently capture the complexity of the empirical characters being simulated. All parametric models are simplifications of the process of molecular evolution (Penny et al., 1992). One possible way to account for third positions outperforming first and second positions at deeper clades that was not simulated in this study involves the covarion process (Fitch and Markowitz, 1970). The covarion process may be operating more rapidly at third positions relative to the first and second positions. If so, this would be advantageous for the third positions (Penny et al., 2001). Second, within the confines of the GTR+Γ model, all parsimony-informative first and second positions were assumed to evolve in a homogeneous manner, as were the third positions, across all lineages sampled. This type of assumption is inherent to parametric phylogenetic inference. Most of the rate variation in the plastid genome appears to be attributable to replacement substitutions (Gaut and Clegg, 1993; Muse and Gaut, 1997) rather than silent substitutions, and at first and second codon positions rather than third positions (Ané et al., 2005). As such, this assumption is likely to be more severely violated for first and second positions than for third positions. Third, the rate parameters in the GTR model and the shape of the gamma distribution were assumed to have been accurately estimated by MrBayes and to apply to all lineages. This is unlikely given that the actual MAP trees were probably not sampled. To ensure doing so would be computationally intractable for matrices of the sizes used (Goloboff and Pol, 2005). Fourth, the manner in which characters were simulated assumes that the character-state space remains constant across all lineages sampled. This assumption is unrealistic as predicted by the covarion theory. Furthermore, the degeneracy of first and third codon positions would vary depending on which amino acid the codon specified at any given time in each lineage. Fifth, the two genes simulated, atpB and rbcL, were assumed to have evolved in a homogeneous manner within the lineages sampled. There is some support for this assumption in that these two genes were not found to be evolving within lineages at heterogeneous rates relative to one another with respect to either silent or replacement substitution rates (Muse and Gaut, 1997) and were found to evolve at similar overall rates among vascular plants (P. Soltis et al., 2002) and seed plants (Bell et al., 2005). Sixth, because the most parsimonious tree that characters were simulated onto was calculated using characters from 18S nuclear rDNA in addition to atpB and rbcL, it was assumed that 18S nuclear rDNA and the plastid genome (from which atpB and rbcL were sampled) have the same history among the lineages sampled, following Soltis et al. (2000). This is a reasonable assumption for the taxa sampled because lineage sorting and introgression (Doyle, 1992) are generally only expected to potentially confound phylogenetic inference when sampling closely related eukaryotic taxa. Although a reduced (232) taxon-sampling dataset indicated significant character-based incongruence (Farris et al., 1995) between rbcL and 18S nuclear rDNA, the two gene trees were generally topologically congruent (Soltis et al., 1997).

Despite these limitations, we believe that this type of simulation study is an important step towards understanding the behavior of empirical characters. With these limitations in mind, the greater phylogenetic signal observed at third codon positions of atpB and rbcL relative to their corresponding first and second codon positions in the Soltis et al. (2000) data matrix is attributable to their greater observed character-state space, lower rate heterogeneity among sites, higher overall rate of evolution, and greater number of parsimony-informative characters. In contrast, differential frequencies of observed character states and differential substitution probabilities among states were relative advantages of first and second positions. Incorporation of all four heterogeneous model parameters examined had a negative effect on phylogenetic inference relative to the baseline Jukes-Cantor model for all three codon positions. The most severely limiting factor was the differential frequencies of observed character states, followed by rate heterogeneity among sites, observed character-state space, and the differential substitution probabilities among nucleotide character states. These results were obtained when the entire tree of 565 clades was examined, as well as when attention was restricted to only the larger or smaller clades independently of one another.

Rate heterogeneity among sites was inferred to be disadvantageous for the Soltis et al. (2000) data matrix. Although rate heterogeneity among characters is normally cited as advantageous for phylogenetic inference (e.g., Pennington, 1996; Barker, 2004), this advantage may only generally occur in empirical studies when rate heterogeneity is paired with a higher overall rate of evolution among the sampled characters, not when the overall rate of evolution is kept constant, as was done in this study. Other empirically based simulations need to be conducted to test how general this and our other results are by using different clades, different genes, and also considering alternative methods of phylogenetic inference.

ACKNOWLEDGMENTS
We thank Victor Albert, Rod Page, Pam Soltis, and an anonymous reviewer for helpful suggestions that improved the manuscript; Damon Little for sending most-parsimonious trees found for the Soltis et al. (2000) matrix; Pat Reeves for help running the Bayesian analyses; Mike Antolin, Donovan Bailey, Joe von Fischer, Melissa Islam, Kurt Pickett, Chris Randle, Pat Reeves, and Ali Schultz for helpful discussions.

REFERENCES
Albert, V. A. 2005. Parsimony and phylogenetics in the genomic age. Pages 1-11 in Parsimony, phylogeny, and genomics (V. A. Albert, ed.). Oxford University Press, Oxford.
Ané, C., J. G. Burleigh, M. M. McMahon, and M. J. Sanderson. 2005. Covarion structure in plastid genome evolution: A new statistical test. Mol. Biol. Evol. 22:914-924.
Barker, F. K. 2004. Monophyly and relationships of wrens (Aves: Troglodytidae): A congruence analysis of heterogeneous mitochondrial and nuclear DNA sequence data. Mol. Phylogenet. Evol. 31:486-504.
Bell, C. D., D. E. Soltis, and P. S. Soltis. 2005. The age of the angiosperms: A molecular timescale without a clock. Evolution 59:1245-1258.
Bjorklund, M. 1999. Are third positions really that bad? A test using vertebrate cytochrome b. Cladistics 15:191-197.
Campbell, D. L., A. V. Z. Brower, and N. E. Pierce. 2000. Molecular evolution of the Wingless gene and its implications for the phylogenetic placement of the butterfly family Riodinidae (Lepidoptera: Papilionoidea). Mol. Biol. Evol. 17:684-696.
Davis, J. I., D. W. Stevenson, G. Petersen, O. Seberg, L. M. Campbell, J. V. Freudenstein, D. H. Goldman, C. R. Hardy, F. A. Michelangeli, M. P. Simmons, and C. D. Specht. 2004. A phylogeny of the monocots, as inferred from rbcL and atpA sequence variation, and a comparison of methods for calculating jackknife and bootstrap values. Syst. Bot. 29:467-510.
Doyle, J. J. 1992. Gene trees and species trees: Molecular systematics as one-character taxonomy. Syst. Bot. 17:144-163.
Farris, J. S. 1989. The retention index and the rescaled consistency index. Cladistics 5:417-419.
Farris, J. S., V. A. Albert, M. Kallersjo, D. Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99-124.
Farris, J. S., M. Kallersjo, A. G. Kluge, and C. Bult. 1995. Testing significance of incongruence. Cladistics 10:315-319.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368-376.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579-593.
Freudenstein, J. V., C. van den Berg, D. H. Goldman, P. J. Kores, M. Molvray, and M. W. Chase. 2004. An expanded plastid DNA phylogeny of Orchidaceae and analysis of jackknife branch support strategy. Am. J. Bot. 91:149-157.
Gaut, B. S., S. V. Muse, and M. T. Clegg. 1993. Relative rates of nucleotide substitution in the chloroplast genome. Mol. Phylogenet. Evol. 2:89-96.
Goloboff, P. A., and D. Pol. 2005. Parsimony and Bayesian phylogenetics. Pages 148-159 in Parsimony, phylogeny, and genomics (V. A. Albert, ed.). Oxford University Press, Oxford.
Hillis, D. M. 1987. Molecular versus morphological approaches. Ann. Rev. Ecol. Syst. 18:23-42.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
Hillis, D. M. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:3-8.
Huelsenbeck, J. P., and F. Ronquist. 2001. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pages 21-132 in Mammalian protein metabolism, volume 3 (H. N. Munro, ed.). Academic Press, New York.
Kallersjo, M., V. A. Albert, and J. S. Farris. 1999. Homoplasy increases phylogenetic structure. Cladistics 15:91-93.
Kallersjo, M., J. S. Farris, M. W. Chase, B. Bremer, M. F. Fay, C. J. Humphries, G. Petersen, O. Seberg, and K. Bremer. 1998. Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants, and flowering plants. Plant Syst. Evol. 213:259-287.
Kellogg, E. A., and N. D. Juliano. 1997. The structure and function of RuBisCo and their implications for systematic studies. Am. J. Bot. 84:413-428.
Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA2: Molecular Evolutionary Genetics Analysis software. Bioinformatics 17:1244-1245.
Lewis, L. A., B. D. Mishler, and R. Vilgalys. 1997. Phylogenetic relationships of the liverworts (Hepaticae), a basal embryophyte lineage, inferred from nucleotide sequence data of the chloroplast gene rbcL. Mol. Phylogenet. Evol. 7:377-393.
Manhart, J. R. 1994. Phylogenetic analysis of green plant rbcL sequences. Mol. Phylogenet. Evol. 3:114-127.
Mort, M. E., P. S. Soltis, D. E. Soltis, and M. L. Mabry. 2000. Comparison of three methods of estimating internal support on phylogenetic trees. Syst. Biol. 49:160-171.
Muse, S. V., and B. S. Gaut. 1997. Comparing patterns of nucleotide substitution rates among chloroplast loci using the relative rate test. Genetics 146:393-399.
Naylor, G. J. P., T. M. Collins, and W. M. Brown. 1995. Hydrophobicity and phylogeny. Nature 373:565-566.
Olmstead, R. G., P. A. Reeves, and A. C. Yen. 1998. Patterns of sequence evolution and implications for parsimony analysis of chloroplast DNA. Pages 164-187 in Molecular systematics of plants II: DNA sequencing (D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds.). Kluwer Academic Publishers, Boston.
Pennington, R. T. 1996. Molecular and morphological data provide phylogenetic resolution at different hierarchical levels in Andira. Syst. Biol. 45:496-515.
Penny, D., and M. D. Hendy. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82.
Penny, D., M. D. Hendy, and M. A. Steel. 1992. Progress with methods for constructing evolutionary trees. Trends Ecol. Evol. 7:73-79.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: The covarion model of molecular evolution. J. Mol. Evol. 53:711-723.
Phillips, M. J., and D. Penny. 2003. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol. Phylogenet. Evol. 28:171-185.
Posada, D., and K. A. Crandall. 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50:580-601.
Rannala, B., and Z. Yang. 1996. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. 43:304-311.
Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131-147.
Saitou, N., and M. Nei. 1986. The number of nucleotides required to determine the branching order of three species, with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24:189-204.
Sennblad, B., and B. Bremer. 2000. Is there a justification for differential a priori weighting in coding sequences? A case study from rbcL and Apocynaceae s.l. Syst. Biol. 49:101-113.
Simmons, M. P., T. G. Carr, and K. O'Neill. 2004a. Relative character-state space, amount of potential phylogenetic information, and heterogeneity of nucleotide and amino acid characters. Mol. Phylogenet. Evol. 32:913-926.
Simmons, M. P., and M. Miya. 2004. Efficiently resolving the basal clades of a phylogenetic tree using Bayesian and parsimony approaches: A case study using mitogenomic data from 100 higher teleost fishes. Mol. Phylogenet. Evol. 31:351-362.
Simmons, M. P., H. Ochoterena, and J. V. Freudenstein. 2002. Amino acid vs. nucleotide characters: Challenging preconceived notions. Mol. Phylogenet. Evol. 24:78-90.
Simmons, M. P., A. Reeves, and J. I. Davis. 2004b.
Character state space mony, phylogeny, and genomics (V. A. Albert, ed.). Oxford Univerversus rate of evolution for phylogenetic inference. Cladistics 20:191sity Press, Oxford. 204. Swofford, D. L. 2001. PAUP*: Phylogenetic analysis using parSoltis, D. E., V. A. Albert, V. Savolainen, K. Hilu, Y.-L. Qiu, M. W. Chase, simony (*and other methods). Sinauer Associates, Sunderland, J. S. Farris, S. Stefanovic, D. W. Rice, J. D. Palmer, and P. S. Soltis. Massachusetts. 2004. Genome-scale data, angiosperm relationships, and "ending Wenzel, J. W., and M. E. Siddall. 1999. Noise. Cladistics 15:51-64. incongruence": A cautionary tale in phylogenetics. Trends Plant Sci. Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from 9:477-483. DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401. Soltis, D. E., C. Hibsch-Jetter, P. S. Soltis, M. W. Chase, and J. S. Farris. 1997. Molecular phylogenetic relationships among angiosperms: An Yang, Z. 1997. PAML: A program package for phylogenetic analysis by overview based on rbcL and 18S rDNA sequences. Pages 157-178 in maximum likelihood. CABIOS 13:555-556. Evolution and diversification of land plants (K. Iwatsuki and P. H. Yang, Z., N. Goldman, and A. Friday. 1995. Maximum likelihood trees Raven, eds.). Springer, Tokyo. from DNA sequences: A peculiar statistical estimation problem. Syst. Biol. 44:384-399. Soltis, D. E., P. S. Soltis, M. W. Chase, M. E. Mort, D. C. Albach, M. Zanis, V. Savolainen, W. H. Hahn, S. B. Hoot, M. F. Fay, M. Axtell, Yang, Z., and B. Rannala. 1997. Bayesian phylogenetic inference using S. M. Swensen, K. C. Nixon, and J. S. Farris. 2000. Angiosperm phyDNA sequences: A Markov Chain Monte Carlo method. Mol. Biol. logeny inferred from a combined data set of 18S rDNA, rbcL, and Evol. 14:717-724. atpB sequences. Bot. J. Linn. Soc. 133:381-161. Zujko-Miller, C, and J. A. Miller. 2003. PEST: Precision estiSoltis, P. S., D. E. Soltis, V. Savolainen, P. R. Crane, and T. G. 
Barraclough. mated by sampling traits, http://www.gwu.edu/~clade/spiders/ pestDocs.htm. Program distributed by the authors. 2002. Rate heterogeneity among lineages of tracheophytes: Integration of molecular and fossil data and evidence for molecular living First submitted 28 April 2005; reviews returned 2 September 2005; fossils. Proc. Natl. Acad. Sci. USA 99:4430-4435. final acceptance 14 October 2005 Steel, M., and D. Penny. 2005. Maximum parsimony and the phylogenetic information in multistate characters. Pages 163-178 in Parsi- Associate Editor: Pam Soltis