How Can Third Codon Positions Outperform First

Syst. Biol. 55(2):245-258,2006
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500481473
How Can Third Codon Positions Outperform First and Second Codon Positions
in Phylogenetic Inference? An Empirical Example from the Seed Plants
MARK P. SIMMONS,1 LI-BING ZHANG,1,2 COLLEEN T. WEBB,1 AND AARON REEVES1,3
1Department of Biology, Colorado State University, Fort Collins, Colorado 80523-1878, USA; E-mail: [email protected] (M.P.S.)
2Current Address: Department of Integrative Biology, Brigham Young University, Provo, Utah 84602, USA
3Current Address: Animal Population Health Institute, Colorado State University, Fort Collins, Colorado 80526-8117, USA
Abstract.—Greater phylogenetic signal is often found in parsimony-based analyses of third codon positions of protein-coding
genes relative to their corresponding first and second codon positions, even for early-derived ("basal") clades. We used the
Soltis et al. (2000; Bot. J. Linn. Soc. 133:381-461) data matrix of atpB and rbcL from 567 seed plants to quantify how each of
six factors (observed character-state space, frequencies of observed character states, substitution probabilities among nucleotides, rate heterogeneity among sites, overall rate of evolution, and number of parsimony-informative characters) contributed to this phenomenon. Each of these six factors was estimated from the original data matrix for parsimony-informative
third codon positions considered separately from first and second codon positions combined. One of the most parsimonious trees found was used as the constraint topology; branch lengths were estimated using likelihood-based distances,
and characters were simulated on this tree. Differential frequencies of observed character states were found to be the most
limiting of the factors simulated for all three codon positions. Differential frequencies of observed character states and
differential substitution probabilities among states were relatively advantageous for first and second codon positions. In
contrast, differential numbers of observed character states, differential rate heterogeneity among sites, the greater number
of parsimony-informative characters, and the higher overall rate of evolution were relatively advantageous for third codon
positions. The amount of possible synapomorphy was predictive of the overall success of resolution. [Amount of possible
synapomorphy; character-state frequencies; character-state space; codon positions; phylogenetic signal; rate heterogeneity.]
In protein-coding genes, greater phylogenetic signal is
often found in parsimony-based analyses of third codon
positions relative to their corresponding first and second codon positions, even for early-derived ("basal")
clades (e.g., Manhart, 1994; Lewis et al., 1997; Bjorklund,
1999; Kallersjo et al., 1998, 1999; Wenzel and Siddall,
1999; Campbell et al., 2000; Sennblad and Bremer, 2000;
Simmons et al., 2002). Note that this pattern is by no
means universal (e.g., Phillips and Penny, 2003; Simmons
and Miya, 2004). The greater phylogenetic signal often
found in third codon positions could be considered surprising given their faster rate of evolution, which can result in multiple hits along individual branches that can
obscure synapomorphies and result in long-branch attraction (Felsenstein, 1978).
Six factors that could contribute to this phenomenon
are as follows. First, greater observed character-state
space (i.e., the number of alternative states that a character may take) for third codon positions allows for having
a higher rate of evolution without a linear increase in homoplasy (Naylor et al., 1995; Simmons et al., 2004b; Steel
and Penny, 2005). Second, differential character-state frequencies among the states represented may affect the
amount of homoplasy. For example, all else equal, less
homoplasy and fewer unobserved substitutions may be
expected for a character with 25% frequencies of A, C,
G, and T than for a character with 49% A and T yet only
1% G and C. Third, differential substitution probabilities among character states, as is generally the case with
transitions and transversions, also affect the amount of
homoplasy and frequency of unobserved substitutions.
Indeed, the second and third factors are tightly linked.
Fourth, differential rate heterogeneity among sites may
allow for faster-evolving characters to resolve recent divergences while more slowly evolving characters resolve
the ancient divergences (Hillis, 1987). Fifth, all else equal,
more parsimony-informative third codon-position characters decrease the potential for stochastic errors and increase branch-support values. Sixth, differences in the
overall rate of evolution of the sampled characters affect
the ability to infer phylogenetic relationships accurately
at a given level of divergence. The fifth and sixth factors
are closely linked, though the potential for long-branch
attraction is primarily a function of the sixth factor.
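The second factor can be illustrated with a rough calculation. The sketch below assumes, purely for illustration, that independent substitutions draw their resulting state at random from the character-state frequencies, so that the chance of two such draws coinciding is the sum of the squared frequencies. The function name is hypothetical, and this is not the simulation procedure used in the study.

```python
def coincidence_probability(freqs):
    """Probability that two independent draws from `freqs` give the same state."""
    total = sum(freqs)
    return sum((f / total) ** 2 for f in freqs)

# The two frequency profiles used as the example in the text (A, C, G, T).
even = coincidence_probability([25, 25, 25, 25])
skewed = coincidence_probability([49, 1, 1, 49])

print(round(even, 4), round(skewed, 4))  # 0.25 0.4804
```

Under this simplification, coincident states are nearly twice as likely for the skewed profile, which is consistent with the expectation of more homoplasy for such characters.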
In this study, the Soltis et al. (2000) data matrix of two
protein-coding plastid genes (atpB and rbcL) from 567
seed plants was used to quantify how each of six factors
contributed to, or detracted from, this phenomenon. As
described by Simmons et al. (2002), the 920 parsimony-informative third codon positions from atpB and rbcL
outperformed the 663 parsimony-informative first and
second codon positions together for all three measures
of phylogenetic signal that were used (resolution, branch
support, and congruence with independent evidence).
Third positions resolved 2.3 times the number of clades
as first and second positions together, and, on average,
resolved 113% larger clades than first and second positions. Of the 60 clades with >95% jackknife support on
the 18S rDNA jackknife tree (the third gene sampled by
Soltis et al. [2000]), 29.3% more were resolved by third
positions. Of the 60 clades resolved by both first and
second positions together as well as third positions analyzed separately, the clades had 14% higher average
jackknife support with third positions. Third positions
outperformed first and second positions in spite of an average of 2 and 2.5 times more observed substitutions than
first and second codon positions, respectively, for the
parsimony-informative characters. Similar results have
been reported for rbcL in green plants by Lewis et al.
(1997) and Kallersjo et al. (1999).
We used simulations to quantify how much each of
the six factors affected the relative performance of the
first and second positions combined and the third positions from Soltis et al. (2000). Each of the six factors was
estimated from the original data matrix for parsimony-informative first and second positions combined and for
parsimony-informative third positions. One of the most
parsimonious trees found by Soltis et al. (2000) was used
as the constraint topology, and branch lengths were estimated on this tree using likelihood-based distances. Each
of these six factors was simulated independently of one
another, as well as in all possible combinations. Performance of phylogenetic inference was measured by subtracting the number of clades incorrectly resolved from
the number of clades correctly resolved in parsimony-based jackknife trees.
TABLE 1. Model parameters estimated for each data partition.a
Codons         G-T   C-T   C-G   A-T   A-G   A-C   piA   piC   piG   piT   Alpha
1st & 2nd      1     2.77  0.99  0.89  1.98  1.43  0.27  0.27  0.22  0.24  0.67
3rd            1     3.81  0.92  0.19  4.36  0.99  0.33  0.17  0.16  0.34  1.30
1st, 2nd, 3rd  1     3.30  1.14  0.28  3.47  1.07  0.32  0.20  0.16  0.32  0.85
a Rounded to the nearest hundredth. The parameters used in the simulations were rounded to the nearest millionth.
MATERIALS AND METHODS
After removal of 29 positions from the 5' end of rbcL and 58 positions from the 3' end of atpB (following the original authors; with one additional third codon position removed from the 5' end of rbcL following Simmons et al. [2002:80]), the Soltis et al. (2000) data matrix includes 1398 nucleotide characters representing 466 codons from rbcL (of which 788 are parsimony-informative) and 1470 nucleotide characters representing 490 codons from atpB (of which 795 are parsimony-informative) for 567 seed plants. Of the 1583 parsimony-informative nucleotide characters, 663 (of a possible 1912) are from first and second codon positions and 920 (of a possible 956) are from third codon positions.
Simulations
The general time-reversible (GTR) model with rate heterogeneity among sites following a gamma distribution (Yang, 1993) was chosen for the simulations. The invariant-sites parameter was not used because parsimony-uninformative sites were eliminated. Model parameters were estimated using Bayesian MCMC (Rannala and Yang, 1996; Yang and Rannala, 1997) with MrBayes 3.0b4 (Huelsenbeck and Ronquist, 2001) separately for (1) parsimony-informative first and second positions together, (2) parsimony-informative third positions only, and (3) parsimony-informative characters from all three codon positions. Parsimony-uninformative characters were eliminated because their inclusion would have altered the alpha parameter for the gamma distribution. Given that only the parsimony-informative characters are of interest to this study (because they are the characters being used in tree construction through parsimony; see Olmstead et al., 1998), it was considered appropriate to estimate model parameters from them exclusively.
For each of the three partitions from which model parameters were estimated, two independent analyses were run, with four chains per analysis, trees sampled every 100 generations, and a minimum of 4.6 million generations run (10 million for the first and second positions) per analysis. The analyses for the first and second positions asymptotically approached stationarity, reaching roughly the same -log likelihood. Neither the analyses for the third positions nor for all three positions reached the same stationarity within 4.6+ million generations. Model parameters were taken from the maximum posterior probability (MAP) tree (Rannala and Yang, 1996) sampled across both analyses for each partition (Table 1). Note that it is unlikely that the actual MAP trees were sampled in this number of generations for a data matrix of this size; to ensure doing so would be computationally intractable (Goloboff and Pol, 2005). This approach to estimating model parameters is based on the premise that model estimation is relatively insensitive to the tree topology used (Yang et al., 1995; Posada and Crandall, 2001).
The model parameters estimated for all three positions together were then used to estimate branch lengths, with one of the most parsimonious trees found by Soltis et al. (2000) as the constraint topology, in which all 565 clades were constrained as a fully dichotomous tree. Likelihood-based distances were calculated on this constraint topology in PAUP* 4.0b10 (Swofford, 2001) using neighbor-joining. Eight rate categories were used for the gamma distribution, and negative branch lengths were set to absolute branch lengths.
Matrices were simulated using the Evolver program within the PAML suite (Yang, 1997). The "MCbase.dat" parameter file was used to simulate the nucleotide characters. The most parsimonious tree topology with branch lengths determined using likelihood-based distances was used to simulate the characters. Twenty replicate matrices were simulated for each set of model parameters (see below).
Overall tree lengths per character used for the simulations were determined using two procedures. The goal of these two procedures was to estimate the average rate of evolution at the parsimony-informative first and second positions separately from that for the parsimony-informative third positions.
The primary procedure entailed adding up across the entire tree branch lengths that were estimated using likelihood-based distances from parsimony-informative first and second positions together, as well as those for parsimony-informative third positions only. The estimated overall tree length for first and second positions was 14.30562, and 36.94496 for third positions. This suggested that parsimony-informative third positions evolved on average 2.6 times faster than parsimony-informative first and second positions combined.
The secondary procedure entailed using average genetic distances among terminals (using the
likelihood-based distances) as computed by PAUP*. This
procedure was considered inferior to the primary procedure because it estimates distances without regard to
the phylogeny, whereas the primary procedure took the
inferred phylogenetic relationships into account. The average estimated distance between terminals for first and
second positions was 0.089933, and 0.319677 for third
positions. This indicated that parsimony-informative
third positions evolved on average 3.5546 times the
rate of parsimony-informative first and second positions
combined.
Five separate overall tree lengths per character were
examined to bracket the actual overall rate of evolution of
the different positions from Soltis et al. (2000): 10.39356,
14.30562, 25.62529, 36.94496, and 50.85076. The second
and fourth rates represented the average overall rates for
first and second positions combined and third positions,
respectively, estimated using the primary procedure. The
first rate represents 1/3.5546 the fourth rate, and the fifth
rate represents 3.5546 times the second rate. The third
rate was selected as intermediate between the second
and fourth rates.
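The arithmetic behind the five tree lengths can be retraced as follows. This is a minimal sketch using only the figures reported above; the variable names are illustrative, and taking the third rate as the arithmetic mean of the second and fourth is an assumption that happens to reproduce the reported value.

```python
# Values reported in the text.
first_second_length = 14.30562    # primary procedure, first and second positions
third_length = 36.94496           # primary procedure, third positions
mean_dist_first_second = 0.089933  # secondary procedure
mean_dist_third = 0.319677         # secondary procedure

primary_ratio = third_length / first_second_length         # about 2.6
secondary_ratio = mean_dist_third / mean_dist_first_second  # about 3.5546

tree_lengths = [
    third_length / secondary_ratio,            # first rate
    first_second_length,                       # second rate
    (first_second_length + third_length) / 2,  # third rate (assumed midpoint)
    third_length,                              # fourth rate
    first_second_length * secondary_ratio,     # fifth rate
]

for length in tree_lengths:
    print(round(length, 3))
```

Small differences in the last decimal places relative to the reported values are expected, because the ratio is recomputed here from rounded distances.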
Model Parameters
To quantify how each of the six factors affected the relative performance of first and second positions combined relative to third positions, each of these factors needed to be considered independently of one another. By then progressively combining the different factors together with one another, we could examine how the factors interact with one another (e.g., do they cancel each other out, are their contributions additive, or are they more than the sum of their parts?).
Two initial simulations were performed using a Jukes-Cantor (1969) model (all four nucleotides represented, each represented in equal frequency, and without rate heterogeneity between nucleotides or among sites) for 663 characters and 920 characters, respectively. This simulation served as the baseline to compare whether each of the factors had a positive or negative effect on phylogenetic inference from first and second positions combined or third positions, respectively.
Differential character-state space was simulated independently of the other three model factors (differential frequencies of observed character states, differential substitution probabilities among nucleotides, and rate heterogeneity among sites; see Table 3) using the Felsenstein (1981) model. To decrease the potential for sequencing errors (see Kellogg and Juliano [1997]) artificially inflating the observed character-state space, all nucleotides represented in only one of the 567 terminals for a given character were re-scored as missing data before calculating the observed character-state space. Of the 663 parsimony-informative first and second positions, 65% (428) had two observed nucleotides, 23% (155) had three observed nucleotides, and 12% (80) had four observed nucleotides. Of the 920 parsimony-informative third positions, 33% (304) had two observed nucleotides, 25% (233) had three observed nucleotides, and 42% (383) had four observed nucleotides. Separate simulations were then performed using the Felsenstein (1981) model for each of the three possible numbers of observed character states, in which all of the states included were represented at equal frequencies. For two-state characters, for example, the character-state frequencies were set at 50% for two nucleotides and 0% for the other two nucleotides. The numbers of two-, three-, and four-state characters were simulated proportionally to their observed frequencies and then concatenated together.
To examine the effect of differential character-state space independently of the greater number of parsimony-informative third positions, the simulations were also performed for third positions by multiplying the number of characters in each grouping by 0.720652 and rounding to whole numbers. This served to decrease the number of simulated third-codon-position characters from 920 to 663, which is identical to the number of parsimony-informative first- and second-position characters.
TABLE 2. Tenth percentile blocks for each nucleotide from first and second positions together separately from third positions.a
Columns: Percentile; T, A, C, and G for first and second positions; T, A, C, and G for third positions.
Percentile  Values (%), in source order
90%         98.4   97.71  61.33  97.01  74.91  99.1   99     98.58
80%         96.26  96.8   17.1   92.26  91.74  10.7   95.42  22.56
70%         12.34  13.52  1.94   8.3    64.24  4.3    88.93  6.13
60%         0.62   3.4    2.02   1.72   14.72  2.5    48.4   1.2
50%         0.5    0.4    0.7    2.7    1.6    14.3   2.1    0.8
40%         0.4    0      0.4    1.1    3.56   1.4    0.4    0.9
30%         0      0      0      0      0.6    0.9    0.9    0.4
20%         0      0      0      0      0      0      0.4    0
10%         0      0      0      0      0      0      0      0
a Blocks that were grouped together are indicated by boxes.
To simulate differential frequencies (percentages) of observed character states independently of the observed character-state space, the differential ratios of nucleotide percentages needed to be maintained while having all four nucleotides represented. First, however, the characters needed to be partitioned into approximately homogeneous groups with respect to nucleotide percentages. MEGA 2.1 (Kumar et al., 2001) was used to calculate the percentage of each of the four nucleotides represented in each character from the transposed data matrices. Note that when doing so, MEGA automatically discards all polymorphic entries, which was considered appropriate here. The percentages for each character were loaded into Microsoft Excel, whereupon all percentages representing singleton character states were eliminated (see above), and the remaining percentages were recalculated to add up to 100%. Individual characters were grouped into blocks of characters representing each 10th percentile for each nucleotide (Table 2). For example, 9.5% (63) of the 663 first and second positions had a 98.4% or greater percentage of thymine represented among the 567 taxa sampled. Blocks representing 0% of a character state were grouped together. For example, the lowermost four blocks (first 40%) of the first and second
positions all had 0% thymine and were consequently
grouped together (Table 2). Also, blocks for which the uppermost and lowermost percentages were within 10% of
one another were grouped together. For example, the two
uppermost blocks (above 80%) for third positions had
97.71% to 99.9% thymine and 95.42% to 97.70% thymine,
respectively, and were grouped together (Table 2).
A total of 48 separate character block patterns were
thereby delimited for first and second positions, and 77
block patterns for third positions (available as an Excel
file at http://systematicbiology.org/). For example, one
third codon position character block pattern had 1.53%
A, 5.7% G, 56.23% T, and 36.54% C. Only a subset of
the possible block patterns was realized, in part because
many of the possible patterns would have been mutually exclusive percentile classes for the four nucleotides
(e.g., one cannot have an average of both 75% A and
75% T at third codon positions). Each character block
pattern was simulated independently of the others using the Felsenstein (1981) model in Evolver, and the
simulated character block patterns were concatenated
together into a single NEXUS file using CONCAT (available at http://www.biology.colostate.edu/Research/).
This methodology was performed identically for first
and second positions combined independently of third
positions.
Maintaining the differential ratios of nucleotide
percentages was straightforward for character block
patterns with two nucleotides represented when transforming them to having all four nucleotides represented.
The percentage of the two nucleotides represented was
halved and then copied to the two unobserved nucleotides. For example, 60% A, 40% G, 0% T, 0% C would
be changed to: 30% A, 20% G, 30% T, 20% C. For consistency, the same relative percentages within purines
and pyrimidines were maintained when making these
changes. When only one purine and one pyrimidine were
sampled for a given character, the discrepancy in percentage within purines and within pyrimidines was maintained to the degree possible. For example, 60% A, 0% G,
40% T, 0% C would be changed to 30% A, 20% G, 20% T,
30% C.
For character block patterns with three character
states, an ad hoc method was applied in an attempt
to maintain the discrepancy in nucleotide percentages.
This involved changing from three states to two states,
and then following the procedure outlined above. To
change from three states to two states, the state with
the intermediate percentage was averaged first with the
low-percentage state and then with the high-percentage
state. The percentage of one of the two resultant character
states is x/(x + y), where the average of the percentages
of the most highly represented state and the intermediate state is x, and the average of the percentages of
the least represented state and the intermediate state is
y. The percentage of the other resultant character state is
y/(x + y). For example, 50% A, 40% G, 10% T, 0% C would
be changed to: 64.3% A, 35.7% G, 0% T, 0% C. Following
the procedure outlined above for two states, this would
then be changed to: 32.15% A, 32.15% G, 17.85% T, 17.85%
C. The two states that were originally in highest percentage (adenine and guanine) were maintained at the higher
percentage in the resultant four-state character.
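The three-state to two-state step can be written out directly. The following is a minimal sketch of the x/(x + y) rule described above; the function name is hypothetical, and the subsequent expansion to four states is not covered.

```python
def collapse_three_to_two(percentages):
    """Collapse three observed state percentages into two using x/(x + y).

    x is the average of the highest and intermediate percentages,
    y is the average of the lowest and intermediate percentages.
    """
    low, mid, high = sorted(percentages)
    x = (high + mid) / 2
    y = (low + mid) / 2
    total = x + y
    return 100 * x / total, 100 * y / total

# Worked example from the text: 50% A, 40% G, 10% T.
high_pct, low_pct = collapse_three_to_two([50, 40, 10])
print(round(high_pct, 1), round(low_pct, 1))  # 64.3 35.7
```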
Ideally, character block patterns with all four states
would not have to be modified. Due to the grouping
procedure used, many characters lacking some states
were often grouped together with other characters in
which the state was represented at a minute percentage. As a result, there were almost always four states
represented in each group, even though two or three
states were often represented below 1% each. In these
cases, states represented at <1% were grouped together
to make two- or three-state characters, whereupon the
procedures described above were followed. When two or
three other states were represented at >1%, those state(s)
represented at <1% were combined with the lowest-percentage state represented at >1%. For example, 60%
A, 39% G, 0.5% T, 0.5% C would be changed to 60% A,
40% G, 0% T, 0% C. Following the procedure outlined
above for two states, this would then be changed to: 30%
A, 20% G, 30% T, 20% C. When all four states were represented at >1%, no modifications were made.
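The folding of minute percentages can be sketched as below. The function name and the dictionary representation are illustrative, and treating a state at exactly 1% as retained is an assumption, since the text only specifies the cases below and above 1%.

```python
def fold_minor_states(pcts, threshold=1.0):
    """Add percentages below `threshold` to the lowest state at or above it."""
    major = {state: p for state, p in pcts.items() if p >= threshold}
    minor_total = sum(p for p in pcts.values() if p < threshold)
    folded = {state: 0.0 for state in pcts}
    folded.update(major)
    if major and minor_total:
        lowest = min(major, key=major.get)  # lowest-percentage retained state
        folded[lowest] += minor_total
    return folded

# Worked example from the text.
example = fold_minor_states({"A": 60, "G": 39, "T": 0.5, "C": 0.5})
print(example)
```

A character with all four states above the threshold passes through unchanged, which matches the final sentence of the paragraph above.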
An Excel file detailing all changes made is available
as supplemental data at http://systematicbiology.org/.
To examine differential percentages of observed character states independently of the greater number of
parsimony-informative third positions, the simulations
were also performed for third positions by multiplying
the number of characters in each grouping by 0.720652
and rounding to whole numbers.
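The downscaling factor is simply 663/920. As a sketch, it is applied here to the three state-space groupings reported earlier for third positions (304, 233, and 383 characters), because the per-block counts for the frequency groupings are given only in the supplemental file; the variable names are illustrative.

```python
scale = 663 / 920  # reported as 0.720652

# Third-position characters with two, three, and four observed nucleotides.
third_position_counts = {2: 304, 3: 233, 4: 383}

scaled = {states: round(n * scale) for states, n in third_position_counts.items()}
print(round(scale, 6), scaled, sum(scaled.values()))
```

For these three groupings the rounded counts sum to exactly 663. With many small groupings, rounding each one independently need not reproduce the target total exactly.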
Differential substitution probabilities among nucleotides were simulated independently of the other
three model factors using the GTR model with all four
nucleotides represented in equal frequency. The substitution probabilities for first and second positions combined
used in Evolver were G-T: 0.505105; C-T: 1.400678; C-G:
0.501771; A-T: 0.45084; A-G: 1; A-C: 0.724495. The substitution probabilities for third positions used in Evolver
were: G-T: 0.229471; C-T: 0.87462; C-G: 0.21174; A-T:
0.043822; A-G: 1; A-C: 0.227507. (Note that Evolver sets
the A-G rate to 1, whereas MrBayes sets the G-T rate to
1; hence the differences relative to Table 1.) To examine
differential substitution probabilities among nucleotides
independently of the greater number of parsimony-informative third positions, the simulations were also
performed for third positions using only 663 characters.
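The change of reference rate between the two programs amounts to dividing every rate by the A-G rate. The sketch below applies this to the rounded values in Table 1; the dictionary layout and function name are illustrative, and the results agree with the Evolver values above only to about two or three decimal places because Table 1 is rounded to the nearest hundredth.

```python
# Relative rates from Table 1 (G-T fixed at 1).
TABLE1_RATES = {
    "1st & 2nd": {"G-T": 1.0, "C-T": 2.77, "C-G": 0.99,
                  "A-T": 0.89, "A-G": 1.98, "A-C": 1.43},
    "3rd": {"G-T": 1.0, "C-T": 3.81, "C-G": 0.92,
            "A-T": 0.19, "A-G": 4.36, "A-C": 0.99},
}

def rescale_to_ag(rates):
    """Re-express relative rates so that the A-G rate equals 1."""
    ag = rates["A-G"]
    return {pair: value / ag for pair, value in rates.items()}

rescaled = {name: rescale_to_ag(r) for name, r in TABLE1_RATES.items()}
for name, rates in rescaled.items():
    print(name, {pair: round(v, 3) for pair, v in rates.items()})
```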
Rate heterogeneity among sites was simulated independently of the other three model factors using
the gamma distribution with the Jukes-Cantor (1969)
model. Alpha was set at 0.668668 for first and second
positions combined, and 1.299032 for third positions
(indicating greater rate heterogeneity among sites for
first and second positions). Twenty categories were used
for the discrete gamma distribution. To examine rate
heterogeneity among sites independently of the greater
number of parsimony-informative third positions, the
simulations were also performed for third positions
using only 663 characters.
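Why the smaller alpha implies greater rate heterogeneity follows from a standard property of the gamma distribution: when the mean rate is fixed at 1, the variance of the rates is 1/alpha. The sketch below assumes that mean-one parameterization; the function name is illustrative.

```python
import math

def rate_cv(alpha):
    """Coefficient of variation of gamma-distributed rates with mean 1."""
    return 1 / math.sqrt(alpha)

cv_first_second = rate_cv(0.668668)
cv_third = rate_cv(1.299032)
print(round(cv_first_second, 2), round(cv_third, 2))
```

The larger coefficient of variation for first and second positions reflects a more uneven distribution of rates among sites.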
Each of the six pairwise combinations of parameters was then examined, followed by the four triplets, and all four parameters together (Table 3). For the three combinations wherein differential state space and substitution probabilities among nucleotides were examined together, blocks of characters with different pairs or triplets of states represented were simulated independently of one another following Table 4. Each of these simulations was also performed using only 663 third-position characters, as described above.
TABLE 3. Model parameters that were incorporated in each of the 15 simulations that were performed for all five rates of evolution examined (indicated by Xs).
Columns: Simulation number (1 to 15); State space; Frequencies of observed states; Substitution probabilities among states; Rate heterogeneity among sites.
Phylogenetic Analyses
A total of 235 sets of data matrices were simulated
(1 pair of baseline simulations +15 simulations as outlined in Table 3, each including different simulations for
first and second positions combined, all third positions,
and 663 third positions; 5 tree lengths), with each set
consisting of 20 replicate data matrices. A total of 4700
jackknife analyses were performed (235 sets of data matrices; 20 replicates per set), with each analysis composed
of 1000 jackknife replicates for a grand total of 4.7 million
parsimony-based tree-bisection-reconnection (TBR) tree
searches. All characters were equally weighted following
Soltis et al. (2000) and Simmons et al. (2002).
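The totals above can be checked with a short calculation. The sketch assumes that the baseline pair (two matrices sets per tree length) and the 15 factor simulations (three partitions each) were all repeated at each of the five tree lengths, which is the reading that reproduces the reported 235 sets; the variable names are illustrative.

```python
factor_simulations = 15   # Table 3
partitions = 3            # 1st+2nd, all 920 third, 663 third positions
baseline_sets = 2         # one pair of baseline simulations
tree_lengths = 5
replicates_per_set = 20
jackknife_replicates = 1000

matrix_sets = (factor_simulations * partitions + baseline_sets) * tree_lengths
jackknife_analyses = matrix_sets * replicates_per_set
tbr_searches = jackknife_analyses * jackknife_replicates

print(matrix_sets, jackknife_analyses, tbr_searches)  # 235 4700 4700000
```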
Because of the computational demands, each jackknife replicate was limited to a single TBR tree search with random sequence addition and a single tree held. Parsimony jackknife analyses (Farris et al., 1996) were conducted using PAUP* with the removal probability set to approximately e^-1 (37%), and "jac" resampling emulated (such that the deletion probability is applied to each character individually rather than to an overall percentage of characters; see Freudenstein et al., 2004). Note that jackknife values obtained using the e^-1 deletion probability are generally higher than bootstrap values (Mort et al., 2000; Davis et al., 2004; Felsenstein, 2004). Jackknife trees were independently calculated for all clades with >50% support, >70% support, and >95% support.
TABLE 4. Numbers of parsimony-informative characters for which each combination of nucleotides was represented for first and second positions independently of third positions.
Codons     AG   AC  AT  TC   TG  CG  ACG  TAG  TCA  TCG  AGCT
1st & 2nd  114  69  35  117  39  54  71   23   33   28   80
3rd        132  5   5   156  6   0   68   29   50   86   383
PEST version 2.2 (Zujko-Miller and Miller, 2003) was
used to determine the number of clades correctly and
incorrectly resolved in the jackknife trees relative to the
reference trees for each matrix. Three reference trees were
used: (1) the tree topology on which the characters were
simulated (all 565 clades); (2) the simulated tree topology
in which only the 96 clades that included >16 terminals
were resolved; and (3) the simulated tree topology in
which only the 410 clades that included <7 terminals
were resolved (see below).
Unless otherwise noted, the relative performance of
phylogenetic inference was assessed using the overall
success of resolution (the number of clades correctly
resolved minus the number of clades incorrectly resolved). For fully resolved trees, this scales linearly to
the Robinson-Foulds distance (Robinson and Foulds,
1981; Penny and Hendy, 1985), which would range from
0 to 1130 for our trees. This approach may appear to
treat incorrect resolution equally to correct resolution
(e.g., the overall success for 110 clades correctly resolved
and 100 clades incorrectly resolved could receive the
same score as 10 clades correctly resolved and the remaining clades unresolved), whereas many systematists
would be much more concerned about well-supported,
incorrect resolution. However, long-branch attraction between two distantly related terminals would result in
many clades being scored as incorrectly resolved. This
effect was considered to be a sufficient extra penalty for
incorrect resolution.
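The scoring and its relation to the Robinson-Foulds distance can be made concrete. This is a minimal sketch with hypothetical function names; the Robinson-Foulds helper assumes that both the inferred and the reference trees are fully resolved with 565 clades each, so that every clade not correctly resolved is incorrectly resolved.

```python
def overall_success(correct, incorrect):
    """Clades correctly resolved minus clades incorrectly resolved."""
    return correct - incorrect

def rf_distance_fully_resolved(correct, total_clades=565):
    """Robinson-Foulds distance when both trees are fully resolved."""
    return 2 * (total_clades - correct)

# The two scenarios contrasted in the text receive the same score.
print(overall_success(110, 100), overall_success(10, 0))

# For fully resolved trees, success = 2 * correct - 565 = 565 - RF.
for correct in (0, 300, 565):
    success = overall_success(correct, 565 - correct)
    print(correct, success, rf_distance_fully_resolved(correct))
```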
We examined the overall success of resolution in three
ways, each using results from 50%, 70%, and 95% jackknife trees. First, we considered overall success for the
entire tree of 565 clades. In this case, the maximum score
was 565, for all clades correctly resolved, and the worst
possible score was -565, in which all clades from the reference tree would be contradicted. Second, we restricted
our attention to the larger clades (in this case, the 96
clades that included >16 terminals). All else equal, these
clades represent the early-derived (or "basal") lineages.
Here, the maximum score was 96 and the minimum score
was -96. Third, we only examined smaller clades (in
this case, the 410 clades that included <7 terminals). All
else equal, these clades represent the recently derived (or
"distal") lineages. In this case, the best possible score was
410 and the worst possible score was -410.
The maximum possible number of steps minus the
minimum possible number of steps for each matrix (as
determined by PAUP*) was used as a measure of the
"amount of possible synapomorphy" (Farris, 1989:418;
see also Simmons et al., 2004a). As such, the amount
of possible synapomorphy for parsimony-uninformative
characters is zero. The maximum and minimum possible number of steps are determined strictly by reference
to the data matrix, not to any particular tree. The minimum number of steps for each character, summed across
250
VOL. 55
SYSTEMATIC BIOLOGY
TABLE 5. Response variables, independent variables, and the number of characters used for the 3rd position characters for each of the multiple
regression analyses performed. R2 values indicate the amount of variability in the data explained by the model for each analysis at 50% and 95%
cutoffs for the entire tree, just the smaller clades, and just the larger clades.
R2
Al
Model
Response variable
Overall success
A2
Incorrectly resolved
No. of 3rd
position
characters
50%
95%
Sml.
Large
Ent.
Sml.
Large
920
0.96
0.96
0.90
0.93
0.93
0.84
920
0.62
0.69
0.27
0.26
0.31
0.03
663
0.94
0.94
0.87
0.91
0.92
0.75
Position
Rate
920
0.83
0.80
0.67
0.77
0.82
0.75
0.81
0.77
0.67
0.75
0.79
0.89
0.70
0.69
0.55
0.66
0.70
0.02°
0.90
0.88
0.82
0.88
0.87
0.96
0.88
0.88
0.84
0.86
0.87
0.98
0.90
0.79
0.62
0.88
0.78
0.29
Position
Rate
920
Position
No. of factors
Rate
Position x No. of factors
Position
Factor
Ratel
Factor x Ratel
Position
Amount of synapomorphy
Position x Amount of synapomorphy
920
0.16
0.27
0.01"
0.02"
0.31
0.32
0.54
0.25
0.28
0.01"
0.06
0.40
0.53
0.54
0.07
0.09
0.05
0.06
0.01"
0.06
0.51
0.08
0.13
0.12
0.16
0.07
0.24
0.63
0.05
0.15
0.16
0.12
0.05
0.26
0.62
0.00"
0.00"
0.04"
0.04"
0.02"
0.02"
0.60
663
0.96
0.96
0.88
0.98
0.99
0.81
920
0.97
0.97
0.91
0.96
0.96
0.91
Independent variables
Position
Factor
Rate
Position x Factor
Position
Factor
Ent.
Rate
B
Overall success
Cl
D
Overall success
Baseline
State space
State frequency
GTR model
Rate heterogeneity
Four-way interaction
Incorrectly resolved
Baseline
State space
State frequency
GTR model
Rate heterogeneity
Four-way interaction
Overall success
E
Overall success
F
Overall success
C2
Position x Factor
Position
Factor
Rate
Position x Factor
'Overall model not significant at P = 0.01 level.
all characters, would only be the same as the most parsimonious tree length if there were no character conflict (i.e., CI = 1). For example, if 7 of the 567 terminals in the Soltis et al. (2000) matrix have an adenine at a given nucleotide position and the other 560 terminals have a guanine at that position, the amount of possible synapomorphy for that character would be six (maximum = 7, minimum = 1). The amount of possible synapomorphy for an entire data matrix (as used here) is the sum of the amount of possible synapomorphy from all characters.
In this study, we were interested in determining whether the amount of possible synapomorphy would be predictive of the overall success of resolution for each matrix.
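The worked example above can be reproduced with a short sketch. It assumes unordered (Fitch) characters with no missing or ambiguous data, for which the minimum number of steps is the number of observed states minus one and the maximum is the number of terminals minus the count of the most frequent state. The function names are hypothetical; in the study these values were determined by PAUP*.

```python
from collections import Counter


def possible_synapomorphy(states):
    """Maximum minus minimum possible steps for one unordered character."""
    counts = Counter(states)
    n_terminals = sum(counts.values())
    min_steps = len(counts) - 1                      # one step per additional state
    max_steps = n_terminals - max(counts.values())   # every minority terminal changes independently
    return max_steps - min_steps


def matrix_possible_synapomorphy(columns):
    """Sum of the per-character values across a data matrix."""
    return sum(possible_synapomorphy(col) for col in columns)


# The example from the text: 7 terminals with adenine, 560 with guanine.
column = "A" * 7 + "G" * 560
print(possible_synapomorphy(column))  # 6
```

Constant characters and characters in which a single terminal differs both give zero, matching the statement that parsimony-uninformative characters contribute no possible synapomorphy.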
Statistical Analyses
In order to determine how each of the six factors affected the performance of first and second positions relative to third positions, several different multiple regression models were implemented in JMP IN (SAS Institute, Table 5). For each regression model, the response variable was either the overall success of resolution or the number of incorrectly resolved clades (Table 5). All independent variables were treated as fixed effects. The independent variables used in the different analyses were (1) position, a nominal categorical variable indicating codon position, 0 = 1st and 2nd positions, 1 = 3rd positions; (2) factor, a nominal categorical variable indicating baseline, state space, frequency of states, GTR model, rate heterogeneity, and all two-way, three-way, and four-way combinations of nonbaseline factors; (3) rate, the rate of evolution; (4) number of factors, the number of factors included in the simulation model (1, 2, 3, or 4); and (5) ratel, an ordered categorical variable for the rate of evolution, 0 = 14.30562, 1 = 36.94496 (Table 5). The specific combinations of independent variables and interactions included in each model are shown in Table 5. Several versions of the models C1 and C2 (Table 5) were performed, one for each of the single factors, including baseline, and one for the four-way interaction. Models B and E (Table 5) are similar regression models except that model B contains the variable rate, which is essentially continuous, and model E contains the variable ratel, which is discrete and only includes the second and fourth rates. The second and fourth rates represented the average overall rates for first and second positions combined and third positions, respectively, estimated using the primary procedure. Model E allows us to compare these rates statistically using contrasts.
The data were analyzed separately for the 50% and
95% cutoffs for the entire tree, the smaller clades by themselves, and the larger clades by themselves. This was
done because of inherent correlations among the data (for
cutoff and clade) that cannot easily be taken into account
in the regression models without affecting significance
levels. For all of our analyses, residuals were normally
distributed, and no high-leverage points or outliers were
observed, indicating that multiple regression on the untransformed data was appropriate. Least-squares mean
estimates of the categorical independent variables were
obtained in addition to the parameter estimates, and independent contrasts on the least-squares means were
performed where needed. Groups of analyses with multiple tests were Bonferroni-corrected in order to control
for spurious significant results that could be caused by
the large numbers of comparisons.
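As an illustration of the Bonferroni correction mentioned above, the following minimal sketch compares each P value against the nominal level divided by the number of tests in the group. The function name and the P values are hypothetical and are not results from this study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return one True/False flag per test after Bonferroni correction."""
    n_tests = len(p_values)
    threshold = alpha / n_tests  # per-test significance level
    return [p <= threshold for p in p_values]


# Hypothetical group of four tests; the corrected threshold is 0.05 / 4 = 0.0125.
flags = bonferroni_significant([0.001, 0.02, 0.04, 0.0004])
print(flags)  # [True, False, False, True]
```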
TABLE 6. The least-squared means of overall success of resolution scaled to 1, with 3rd positions calculated for 920 characters.
                            50% Cutoff             95% Cutoff
                         1st & 2nd    3rd       1st & 2nd    3rd
Overall
  Baseline                 .74        .80         .49        .58
  State space              .67        .77a        .42        .543a
  State frequencies        .60        .66         .38a       .46
  Model                    .73a       .77         .48a       .544
  Rate heterogeneity       .65        .75a        .39        .52a
Larger clades
  Baseline                 .60        .70         .29        .39
  State space              .51        .65a        .20        .311a
  State frequencies        .38        .46         .15a       .23
  Model                    .58a       .66a        .28        .35
  Rate heterogeneity       .47a       .61a        .17a       .306a
Smaller clades
  Baseline                 .78        .83         .54        .63
  State space              .72        .799a       .48        .5975a
  State frequencies        .66        .72         .445a      .53
  Model                    .77a       .804        .53a       .5976
  Rate heterogeneity       .70        .793a       .453       .58a
a Contrast relative to the partition with the next highest least-squared mean not significant at the 0.05 level after Bonferroni correction.
RESULTS AND DISCUSSION
The results were generally not qualitatively different when using the 50%, 70%, or 95% jackknife trees for our analyses. Unless otherwise noted, the relative performance of the simulated characters using the overall success of resolution was assessed using both the 50% and 95% jackknife trees. The overall success of resolution and the number of clades incorrectly resolved on the 50% jackknife trees (which showed a greater spread than the 70% and 95% jackknife trees and were therefore easier to graph) are presented in Figures 1 and 2 for each of the four heterogeneous model parameters simulated independently of one another. Excel files of the raw data and figures for the average number of clades correctly resolved, the average number of clades incorrectly resolved, and the average overall success of resolution using the 50%, 70%, and 95% cutoffs are available as supplemental data at http://systematicbiology.org/.
Our results from regression model A1 show that taken across all five rates of evolution examined together, incorporation of the four heterogeneous model parameters examined (observed character-state space, frequencies of observed character states, substitution probabilities among nucleotides, and rate heterogeneity among sites) all had a negative effect on phylogenetic inference relative to the baseline Jukes-Cantor model (Fig. 1). This was found for both first and second positions together as well as for third positions, across the entire tree of 565 clades, for the larger clades by themselves, and for the smaller clades by themselves (Table 6). In all cases, the most severely limiting factor was the differential frequencies of observed character states, followed by rate heterogeneity among sites, observed character-state space, and the differential substitution probabilities among nucleotide character states (Table 5).
The finding that differential frequencies of observed character states and lower observed character-state space are disadvantageous for phylogenetic inference corroborates the results from Simmons et al.'s (2004b) simulations. In contrast, our finding that rate heterogeneity among sites was disadvantageous is contradictory. Note that the use of the gamma distribution (by itself or when simulated together with other heterogeneous model parameters) often resulted in both constant and variable but parsimony-uninformative characters for both first and second positions together as well as third codon positions across all five overall tree lengths per character simulated. In these cases, the same expected overall number of changes still occurred for each tree length per character simulated, but they were concentrated in a subset of the available characters. This resulted in, on average, a faster rate of evolution per parsimony-informative character for those characters simulated using rate heterogeneity relative to those simulated without it.
Characters evolving at a faster rate would be expected to have a higher chance of having multiple hits along individual branches as well as more cases of ambiguous optimization, both of which would lead to reduced resolution and support for correctly resolved clades. In this particular empirically based simulation study, those negative effects were not sufficiently outweighed by the positive effects of the many more slowly evolving characters. As such, although rate heterogeneity among characters may generally be advantageous for phylogenetic inference (Hillis, 1987), our study indicates that it is not always beneficial when the overall number of character-state
[Figure 1 plot: panels a-f (entire tree, larger clades, and smaller clades, each with 920 or 663 3rd positions); x-axis: average steps / PI character; series: 1+2 baseline, 1+2 state space, 1+2 state frequency, 1+2 model, 1+2 rate heterogeneity, 3rd baseline, 3rd state space, 3rd state frequency, 3rd model, 3rd rate heterogeneity.]
FIGURE 1. The average overall success of resolution (number of clades correctly resolved minus the number of clades incorrectly resolved) for jackknife trees using the 50% cutoff, across all five average numbers of steps per parsimony-informative (PI) character, for each of the four heterogeneous model parameters (character-state space, character-state frequencies, rate heterogeneity among nucleotide states [model], and rate heterogeneity among sites), independently of one another. The baselines differ only in the number of characters sampled (663 for 1st & 2nd positions; 920 for 3rd positions). (a) Measured across the entire tree of 565 clades for all 1st & 2nd positions relative to all 920 3rd positions. (b) Measured across the entire tree for all 1st & 2nd positions relative to 663 3rd positions. (c) Measured for the 96 larger clades for all 1st & 2nd positions relative to all 920 3rd positions. (d) Measured for the larger clades for all 1st & 2nd positions relative to 663 3rd positions. (e) Measured for the 410 smaller clades for all 1st & 2nd positions relative to all 920 3rd positions. (f) Measured for the smaller clades for all 1st & 2nd positions relative to 663 3rd positions.
[Figure 2 plot: panels a-f (entire tree, larger clades, and smaller clades, each with 920 or 663 3rd positions); x-axis: average steps / PI character; series: 1+2 baseline, 1+2 state space, 1+2 state frequency, 1+2 model, 1+2 rate heterogeneity, 3rd baseline, 3rd state space, 3rd state frequency, 3rd model, 3rd rate heterogeneity.]
FIGURE 2. The average number of clades incorrectly resolved for jackknife trees using the 50% cutoff, across all five average numbers of steps per parsimony-informative (PI) character, for each of the four heterogeneous model parameters (character-state space, character-state frequencies, rate heterogeneity among nucleotide states [model], and rate heterogeneity among sites), independently of one another. The baselines differ only in the number of characters sampled (663 for 1st & 2nd positions; 920 for 3rd positions). (a) Measured across the entire tree of 565 clades for all 1st & 2nd positions relative to all 920 3rd positions. (b) Measured across the entire tree for all 1st & 2nd positions relative to 663 3rd positions. (c) Measured for the 96 larger clades for all 1st & 2nd positions relative to all 920 3rd positions. (d) Measured for the larger clades for all 1st & 2nd positions relative to 663 3rd positions. (e) Measured for the 410 smaller clades for all 1st & 2nd positions relative to all 920 3rd positions. (f) Measured for the smaller clades for all 1st & 2nd positions relative to 663 3rd positions.
TABLE 7. The least-squared means of overall success of resolution
scaled to 1, with 3rd positions calculated for 663 characters.
95% Cutoff
50% Cutoff
Overall
State space
State frequencies
Model
Rate heterogeneity
Overall rate"
Larger clades
State space
State frequencies
Model
Rate heterogeneity
Overall rate"
Smaller clades
State space
State frequencies
Model
Rate heterogeneity
Overall rate"
1st & 2nd
3rd
1st & 2nd
3rd
.67
.60
.73b-c
.65
.69
.70
.50
.70
.69
.80
.42
.448
.34
.450
.435
.59
.51"
.38
.58"'c
.56
.13
.54
.53
.68
.20
.74
.61
.75
.72
.83
.48
A?b.c
.54
.72b
.66
77b,c
.70
.73
.38C
.48C
.39
.38
.15C
.28
.17C
.19
.445 C
.53C
.453
.44
.25
.08
.24
.22
.38
.505
.42
.510
.48
.65
" Calculated for 1st and 2nd positions at the second rate of evolution (14.30562)
and for 3rd positions at the fourth rate of evolution (36.94496) using regression
model E. All other results from regression model B.
b
Contrast of 1st and 2nd positions versus 3rd positions not significant at the
0.05 level after Bonferroni correction.
c
Contrast relative to the partition with the next highest least-squared mean
not significant at the 0.05 level after Bonferroni correction.
changes is held constant. This result reinforces the importance of conducting empirically based simulations (e.g.,
Hillis, 1996) to supplement those performed using simplistic tree topologies and branch lengths (e.g., Simmons
et al., 2004b).
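The concentration of changes in a subset of characters under gamma-distributed rates can be illustrated with a minimal sketch. It draws per-site relative rates from a gamma distribution with mean 1, so the expected total number of changes is unchanged, and then compares the mean rate across all sites with the mean rate among the sites that are not nearly invariant. The shape value (0.5), the cutoff (0.1), and the function name are arbitrary illustrative assumptions and are not estimates from this study.

```python
import random


def gamma_site_rates(n_sites, shape, seed=0):
    """Draw relative rates with mean 1 (scale = 1 / shape)."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


rates = gamma_site_rates(20000, shape=0.5)
mean_all = sum(rates) / len(rates)

# Treat sites with a very low relative rate as effectively uninformative.
slow_cutoff = 0.1
remaining = [r for r in rates if r >= slow_cutoff]
fraction_slow = 1 - len(remaining) / len(rates)
mean_remaining = sum(remaining) / len(remaining)

print(round(mean_all, 2), round(fraction_slow, 2), round(mean_remaining, 2))
```

With these settings roughly a quarter of the sites fall below the cutoff, so the remaining sites evolve faster on average than they would under a single shared rate, even though the mean rate across all sites stays near 1.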
Results from regression models B and E show that,
taken across all five rates of evolution examined together,
two of the heterogeneous model parameters examined
(frequencies of observed character states and substitution probabilities among nucleotide character states)
favored first and second positions, whereas observed
character-state space and rate heterogeneity among sites
favored third positions (Fig. 1). This was determined
by testing for significant differences in the overall success of resolution when incorporating each heterogeneous factor into the simulation model independently
of one another for all parsimony-informative first and
second positions relative to the same number (663) of
parsimony-informative third positions. The differences
were significant in all cases (across the entire tree of 565
clades, as well as when only examining the larger or
smaller clades independently of one another; Table 7)
when applied to the 95% jackknife trees, and in 6 of the
12 cases for the 50% jackknife trees.
Number of Parsimony-Informative Characters
The greater number of parsimony-informative third
positions provided a significant increase in the overall success of resolution when comparing the baseline
Jukes-Cantor model between first and second positions
(663 characters) and third positions (920 characters) in
all cases (Table 6). Likewise, the faster overall rate of
evolution for third positions (fourth rate: 36.94496 steps
per parsimony-informative character) was found to be
a significant advantage relative to the slower overall
rate of evolution for first and second positions (second
rate: 14.30562 steps per parsimony-informative character; Table 7).
Overall Rate of Evolution
Results from regression model A1 show that, taken
across all five rates of evolution examined, increasing the
rate of evolution invariably improved the overall success
of resolution (when significantly different from zero) at
both the 50% and 95% cutoffs, across the entire tree of
565 clades, as well as when only examining the larger
or smaller clades independently of one another (Fig. 1,
Table 8). This result indicates that, taken across the tree
as a whole, the taxon sampling used by Soltis et al. (2000)
was sufficiently dense so as to largely prevent saturation
(i.e., multiple hits along an individual branch; see Wenzel
and Siddall, 1999) at third positions from overwhelming
phylogenetic signal (Hillis, 1996, 1998; Soltis et al., 2004;
Albert, 2005).
Results from regression model A2 indicate that, in
some cases, increasing the rate of evolution led to
fewer incorrectly resolved clades (Figure 2, Table 8),
TABLE 8. Parameter estimate for the rate of evolution, with 3rd
positions calculated for all 920 characters.
Overall success
Overall
Across all
Baseline
State space
State frequencies
Model
Rate heterogeneity
Space + frequency +
model + rate
heterogeneity
Larger clades
Across all
Baseline
State space
State frequencies
Model
Rate heterogeneity
Space + frequency +
model + rate
heterogeneity
Smaller clades
Across all
Baseline
State space
State frequencies
Model
Rate heterogeneity
Space + frequency +
model + rate
heterogeneity
Incorrectly resolved
50%
95%
50%
1.30
2.38
2.00
2.20
2.21
1.82
0.18"
2.21
4.32
3.43
2.96
4.01
2.94
0.47
0.03"
-0.07"
0.03"
-0.07"
-0.01"
-0.06"
0.05°
0.34
0.55
0.51
0.64
0.55
0.49
0.13"
0.26
0.76
0.53
0.36
0.68
0.39
-0.02"
0.05
0.05"
-0.05"
0.04
0.01"
-0.16"
0.84
1.52
1.27
1.30
1.41
1.15
0.10"
-o.or
1.74
3.08
2.55
2.96
2.90
2.26
0.47
0.06
-0.10
- 5 x 10"3"
5 x 10"3"
-0.06"
-0.06"
0.19"
95%
0.02
0.01
0.02
0.02"
0.02
0.01"
0.02"
- 2 x 10"3"
1 x 10-""
4 x 10-4"
-o.or
2 x 10- "
3
7 x 10"4"
- 2 x 10"3"
0.02
5 x 10"3"
0.02
0.03
0.01
0.01"
0.02"
"Not significantly different from zero at the 0.05 level after Bonferroni
correction.
TABLE 9. Slope of the regression of overall success on the number of heterogeneous model parameters, with 3rd positions calculated for all 920 characters.
                      50% Cutoff              95% Cutoff
                   1st & 2nd     3rd       1st & 2nd     3rd
Overall             -96.91     -81.48      -65.01a     -61.25a
Larger clades       -18.77     -21.21       -7.14       -9.81
Smaller clades      -66.40     -49.16      -51.44      -43.28
a Not significantly different from one another at the 0.05 level after Bonferroni correction.
presumably reducing the number of incorrectly resolved
clades that were weakly supported due to stochastic effects. This explanation is consistent with the reduction
in incorrect resolution at higher rates of evolution being
more commonly observed on the 50%, rather than the
95%, jackknife trees (Table 8). More generally, however,
increasing the rate of evolution led to more incorrectly resolved clades, although the relationship was not strong.
Either way, the change in the number of incorrectly resolved clades was often not significantly different from
zero after the Bonferroni correction (Table 8).
Increasing Number of Heterogeneous Parameters
Results from regression model D show that, taken
across all five rates of evolution examined, increasing
the number of heterogeneous model parameters incorporated into the simulations was significantly more disadvantageous for first and second positions than it was
for third positions, as measured by the slope of the regression of overall success on the number of heterogeneous
model parameters incorporated in the simulations. This
occurred when the overall success of resolution was measured across the entire tree using the 50% jackknife cutoff
(not significant at the 95% cutoff) and when restricting attention to the smaller clades (using both jackknife cutoffs;
Table 9). In contrast, increasing the number of heterogeneous model parameters was significantly more disadvantageous for third positions than it was for first and
second positions for the larger clades (using both jackknife cutoffs; Table 9).
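The slope used as the measure here can be illustrated with a minimal least-squares sketch. It fits a simple regression of overall success on the number of heterogeneous model parameters only, whereas model D also includes position, rate, and an interaction term. The function name and the data values are made up for illustration and are not taken from Table 9.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical values: overall success falls as more parameters are included.
n_factors = [1, 2, 3, 4]
success = [400.0, 300.0, 200.0, 100.0]
slope = least_squares_slope(n_factors, success)
print(slope)  # -100.0
```

A more negative slope indicates that each additional heterogeneous parameter costs more resolved clades, which is how the partitions are compared in Table 9.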
Whereas third positions were more robust to incorporation of their heterogeneity for resolving the smaller clades, the first and second positions were more robust to incorporation of their heterogeneity for resolving larger clades, suggesting that the different heterogeneous model parameters examined have significantly different effects on our ability to infer larger clades and smaller clades. However, there is also the confounding effect of the greater number of parsimony-informative third positions (920 versus 663). When the third positions were restricted to the same number of characters as the first and second positions (663 versus 663), their slope changed from -81.48 to -94.59 using the 50% jackknife cutoff while examining clades across the entire tree. This slope is not significantly different from the slope for first and second positions (-96.91; Table 9), indicating that the significant difference is primarily caused by the greater number of parsimony-informative third-position characters.
[Figure 3 plot: entire tree; 920 3rd; x-axis: average steps / PI character; series: 1+2 baseline, 1+2 state space, 1+2 state frequency, 1+2 model, 1+2 rate heterogeneity, 3rd baseline, 3rd state space, 3rd state frequency, 3rd model, 3rd rate heterogeneity.]
FIGURE 3. The amount of possible synapomorphy across all five average numbers of steps per parsimony-informative character for each of the four heterogeneous model parameters (character-state space, character-state frequencies, rate heterogeneity among nucleotide states [model], and rate heterogeneity among sites) independently of one another, measured across the entire tree of 565 clades for all 1st & 2nd positions relative to all 920 3rd positions.
Amount of Possible Synapomorphy
Results from regression model F show that, taken
across all five rates of evolution examined, the amount
of possible synapomorphy was predictive of the overall success of resolution at both the 50% and 95% cutoffs, across the entire tree of 565 clades, as well as
when only examining the larger or smaller clades independently of one another. This is indicated by the
significant amount-of-possible-synapomorphy parameter estimate (Table 10, all results significant at the
Bonferroni-corrected P = 0.01 level). The amount of possible synapomorphy was predictive of the overall success
TABLE 10. Parameter estimate for the amount of possible synapomorphy (APS) and slope of the regression of overall success on the amount
of possible synapomorphy with 3rd positions calculated for all 920 characters.
95% Cutoff
50% Cutoff
Overall
Larger clades
Smaller clades
APS parameter
estimate
1st & 2nd
Slope
3rd Slope
APS parameter
estimate
1st & 2nd
Slope
3rd Slope
0.0038
0.0008
0.0025
0.0050
0.0010
0.0034
0.0027
0.0007
0.0016
0.0008
0.0004
0.0033
0.0010
0.0004
0.0039
0.0007
0.0004
0.0027
of resolution for both the first and second positions and the third positions, independently of one another, as indicated by the significantly positive slopes of the regression lines (Table 10). The slopes in Table 10, although significant in all cases, are very shallow because of the dramatic differences in scale between the overall success of resolution and the amount of possible synapomorphy. For example, for the results presented in Figure 3, the average overall success of resolution ranged from 275 to 492, whereas the average amount of possible synapomorphy ranged from 15,785 to 101,647.
CONCLUSIONS
Several assumptions are inherent in this sort of study wherein models are used to simulate empirical data (as with parametric bootstrapping [Saitou and Nei, 1986], for instance). First, the model used (GTR+Γ) was assumed to sufficiently capture the complexity of the empirical characters being simulated. All parametric models are simplifications of the process of molecular evolution (Penny et al., 1992). One possible way to account for third positions outperforming first and second positions at deeper clades that was not simulated in this study involves the covarion process (Fitch and Markowitz, 1970). The covarion process may be operating more rapidly at third positions relative to the first and second positions. If so, this would be advantageous for the third positions (Penny et al., 2001). Second, within the confines of the GTR+Γ model, all parsimony-informative first and second positions were assumed to evolve in a homogeneous manner, as were the third positions, across all lineages sampled. This type of assumption is inherent to parametric phylogenetic inference. Most of the rate variation in the plastid genome appears to be attributable to replacement substitutions (Gaut and Clegg, 1993; Muse and Gaut, 1997) rather than silent substitutions, and at first and second codon positions rather than third positions (Ané et al., 2005). As such, this assumption is likely to be more severely violated for first and second positions than for third positions. Third, the rate parameters in the GTR model and the shape of the gamma distribution were assumed to have been accurately estimated by MrBayes and to apply to all lineages. This is unlikely given that the actual MAP trees were probably not sampled. To ensure doing so would be computationally intractable for matrices of the sizes used (Goloboff and Pol, 2005). Fourth, the manner in which characters were simulated assumes that the character-state space remains constant across all lineages sampled. This assumption is unrealistic as predicted by the covarion theory. Furthermore, the degeneracy of first and third codon positions would vary depending on which amino acid the codon specified at any given time in each lineage. Fifth, the two genes simulated, atpB and rbcL, were assumed to have evolved in a homogeneous manner within the lineages sampled. There is some support for this assumption in that these two genes were not found to be evolving within lineages at heterogeneous rates relative to one another with respect to either silent or replacement substitution rates (Muse and Gaut, 1997) and were found to evolve at similar overall rates among vascular plants (P. Soltis et al., 2002) and seed plants (Bell et al., 2005). Sixth, because the most parsimonious tree that characters were simulated onto was calculated using characters from 18S nuclear rDNA in addition to atpB and rbcL, it was assumed that 18S nuclear rDNA and the plastid genome (from which atpB and rbcL were sampled) have the same history among the lineages sampled, following Soltis et al. (2000). This is a reasonable assumption for the taxa sampled because lineage sorting and introgression (Doyle, 1992) are generally only expected to potentially confound phylogenetic inference when sampling closely related eukaryotic taxa. Although a reduced (232) taxon-sampling dataset indicated significant character-based incongruence (Farris et al., 1995) between rbcL and 18S nuclear rDNA, the two gene trees were generally topologically congruent (Soltis et al., 1997).
Despite these limitations, we believe that this type of simulation study is an important step towards understanding the behavior of empirical characters. With these limitations in mind, the greater phylogenetic signal observed at third codon positions of atpB and rbcL relative to their corresponding first and second codon positions in the Soltis et al. (2000) data matrix is attributable to their greater observed character-state space, lower rate heterogeneity among sites, higher overall rate of evolution, and greater number of parsimony-informative characters. In contrast, differential frequencies of observed character states and differential substitution probabilities among states were relative advantages of first and second positions. Incorporation of all four heterogeneous model parameters examined had a negative effect on phylogenetic inference relative to the baseline Jukes-Cantor model for all three codon positions. The most severely limiting factor was the differential frequencies of observed character states, followed by rate heterogeneity among sites, observed character-state space, and the differential substitution probabilities among nucleotide character states. These results were obtained when the entire tree of 565 clades was examined, as well as when attention was restricted to only the larger or smaller clades independently of one another.
Rate heterogeneity among sites was inferred to be disadvantageous for the Soltis et al. (2000) data matrix. Although rate heterogeneity among characters is normally cited as advantageous for phylogenetic inference (e.g., Pennington, 1996; Barker, 2004), this advantage may only generally occur in empirical studies when rate heterogeneity is paired with a higher overall rate of evolution among the sampled characters, not when the overall rate of evolution is kept constant, as was done in this study. Other empirically based simulations need to be conducted to test how general this and our other results are by using different clades, different genes, and also considering alternative methods of phylogenetic inference.
REFERENCES
Albert, V. A. 2005. Parsimony and phylogenetics in the genomic age. Pages 1-11 in Parsimony, phylogeny, and genomics (V. A. Albert, ed.). Oxford University Press, Oxford.
Ané, C., J. G. Burleigh, M. M. McMahon, and M. J. Sanderson. 2005. Covarion structure in plastid genome evolution: A new statistical test. Mol. Biol. Evol. 22:914-924.
Barker, F. K. 2004. Monophyly and relationships of wrens (Aves: Troglodytidae): A congruence analysis of heterogeneous mitochondrial and nuclear DNA sequence data. Mol. Phylogenet. Evol. 31:486-504.
Bell, C. D., D. E. Soltis, and P. S. Soltis. 2005. The age of the angiosperms: A molecular timescale without a clock. Evolution 59:1245-1258.
Björklund, M. 1999. Are third positions really that bad? A test using vertebrate cytochrome b. Cladistics 15:191-197.
Campbell, D. L., A. V. Z. Brower, and N. E. Pierce. 2000. Molecular evolution of the Wingless gene and its implications for the phylogenetic placement of the butterfly family Riodinidae (Lepidoptera: Papilionoidea). Mol. Biol. Evol. 17:684-696.
Davis, J. I., D. W. Stevenson, G. Petersen, O. Seberg, L. M. Campbell, J. V. Freudenstein, D. H. Goldman, C. R. Hardy, F. A. Michelangeli, M. P. Simmons, and C. D. Specht. 2004. A phylogeny of the monocots, as inferred from rbcL and atpA sequence variation, and a comparison of methods for calculating jackknife and bootstrap values. Syst. Bot. 29:467-510.
Doyle, J. J. 1992. Gene trees and species trees: Molecular systematics as one-character taxonomy. Syst. Bot. 17:144-163.
Farris, J. S. 1989. The retention index and the rescaled consistency index. Cladistics 5:417-419.
Farris, J. S., V. A. Albert, M. Källersjö, D. Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99-124.
Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1995. Testing significance of incongruence. Cladistics 10:315-319.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368-376.
Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579-593.
Källersjö, M., V. A. Albert, and J. S. Farris. 1999. Homoplasy increases phylogenetic structure. Cladistics 15:91-93.
Källersjö, M., J. S. Farris, M. W. Chase, B. Bremer, M. F. Fay, C. J. Humphries, G. Petersen, O. Seberg, and K. Bremer. 1998. Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants, and flowering plants. Plant Syst. Evol. 213:259-287.
Kellogg, E. A., and N. D. Juliano. 1997. The structure and function of RuBisCo and their implications for systematic studies. Am. J. Bot. 84:413-428.
Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA2: Molecular Evolutionary Genetics Analysis software. Bioinformatics 17:1244-1245.
Lewis, L. A., B. D. Mishler, and R. Vilgalys. 1997. Phylogenetic relationships of the liverworts (Hepaticeae), a basal embryophyte lineage, inferred from nucleotide sequence data of the chloroplast gene rbcL. Mol. Phylogenet. Evol. 7:377-393.
Manhart, J. R. 1994. Phylogenetic analysis of green plant rbcL sequences. Mol. Phylogenet. Evol. 3:114-127.
Mort, M. E., P. S. Soltis, D. E. Soltis, and M. L. Mabry. 2000. Comparison of three methods of estimating internal support on phylogenetic trees. Syst. Biol. 49:160-171.
Muse, S. V., and B. S. Gaut. 1997. Comparing patterns of nucleotide substitution rates among chloroplast loci using the relative rate test. Genetics 146:393-399.
Naylor, G. J. P., T. M. Collins, and W. M. Brown. 1995. Hydrophobicity and phylogeny. Nature 373:565-566.
Olmstead, R. G., P. A. Reeves, and A. C. Yen. 1998. Patterns of sequence evolution and implications for parsimony analysis of chloroplast DNA. Pages 164-187 in Molecular systematics of plants II: DNA sequencing (D. S. Soltis, P. S. Soltis, and J. J. Doyle, eds.). Kluwer Academic Publishers, Boston.
Pennington, R. T. 1996. Molecular and morphological data provide phylogenetic resolution at different hierarchical levels in Andira. Syst. Biol. 45:496-515.
Penny, D., and M. D. Hendy. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82.
Penny, D., M. D. Hendy, and M. A. Steel. 1992. Progress with methods for constructing evolutionary trees. Trends Ecol. Evol. 7:73-79.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: The covarion model of molecular evolution. J. Mol. Evol. 53:711-723.
Phillips, M. J., and D. Penny. 2003. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol. Phylogenet. Evol. 28:171-185.
Posada, D., and K. A. Crandall. 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50:580-601.
Rannala, B., and Z. Yang. 1996. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. 43:304-311.
Freudenstein, J. V., C. van den Berg, D. H. Goldman, P. J. Kores, M. Robinson, D. F., and L. R. Foulds. 1981. Comparison of phylogenetic
trees. Math. Biosci. 53:131-147.
Molvray, and M. W. Chase. 2004. An expanded plastid DNA phylogeny of Orchidaceae and analysis of jackknife branch support strat- Saitou, N., and M. Nei. 1986. The number of nucleotides required to
determine the branching order of three species, with special reference
egy. Am. ]. Bot. 91:149-157.
Gaut, B. S., S. V. Muse, and M. T. Clegg. 1993. Relative rates of nucleotide to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24:189204.
substitution in the chloroplast genome. Mol. Phylogenet. Evol. 2:89Sennblad, B., and B. Bremer. 2000. Is there a justification for differential
96.
a priori weighting in coding sequences? A case study from rbcL and
Goloboff, P. A., and D. Pol. 2005. Parsimony and Bayesian phylogeApocynaceae s.l. Syst. Biol. 49:101-113.
netics. Pages 148-159 in Parsimony, phylogeny, and genomics (V. A.
Simmons, M. P., T. G. Carr, and K. O'Neill. 2004a. Relative characterAlbert, ed.). Oxford University Press, Oxford.
state space, amount of potential phylogenetic information, and hetHillis, D. M. 1987. Molecular versus morphological approaches. Ann.
erogeneity of nucleotide and amino acid characters. Mol. Phylogenet.
Rev. Ecol. Syst. 18:23-42.
Evol. 32:913-926.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
Hillis, D. M. 1998. Taxonomic sampling, phylogenetic accuracy, and Simmons, M. P., and M. Miya. 2004. Efficiently resolving the basal
clades of a phylogenetic tree using Bayesian and parsimony apinvestigator bias. Syst. Biol. 47:3-8.
proaches: A case study using mitogenomic data from 100 higher
Huelsenbeck, J. P., and F. Ronquist. 2001. MrBayes: Bayesian inference
teleost fishes. Mol. Phylogenet. Evol. 31:351-362.
of phylogenetic trees. Bioinformatics 17:754-755.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Simmons, M. P., H. Ochoterena, and J. V. Freudenstein. 2002. Amino
acid vs. nucleotide characters: Challenging preconceived notions.
Pages 21-132 in Mammalian protein metabolism, volume 3 (H. N.
Mol. Phylogenet. Evol. 24:78-90.
Munro, ed.). Academic Press, New York.
ACKNOWLEDGMENTS
We thank Victor Albert, Rod Page, Pam Soltis, and an anonymous
reviewer for helpful suggestions that improved the manuscript; Damon
Little for sending most-parsimonious trees found for the Soltis et al.
(2000) matrix; Pat Reeves for help running the Bayesian analyses; Mike
Antolin, Donovan Bailey, Joe von Fischer, Melissa Islam, Kurt Pickett,
Chris Randle, Pat Reeves, and Ali Schultz for helpful discussions.
Simmons, M. P., A. Reeves, and J. I. Davis. 2004b. Character state space versus rate of evolution for phylogenetic inference. Cladistics 20:191-204.
Soltis, D. E., V. A. Albert, V. Savolainen, K. Hilu, Y.-L. Qiu, M. W. Chase, J. S. Farris, S. Stefanovic, D. W. Rice, J. D. Palmer, and P. S. Soltis. 2004. Genome-scale data, angiosperm relationships, and "ending incongruence": A cautionary tale in phylogenetics. Trends Plant Sci. 9:477-483.
Soltis, D. E., C. Hibsch-Jetter, P. S. Soltis, M. W. Chase, and J. S. Farris. 1997. Molecular phylogenetic relationships among angiosperms: An overview based on rbcL and 18S rDNA sequences. Pages 157-178 in Evolution and diversification of land plants (K. Iwatsuki and P. H. Raven, eds.). Springer, Tokyo.
Soltis, D. E., P. S. Soltis, M. W. Chase, M. E. Mort, D. C. Albach, M. Zanis, V. Savolainen, W. H. Hahn, S. B. Hoot, M. F. Fay, M. Axtell, S. M. Swensen, K. C. Nixon, and J. S. Farris. 2000. Angiosperm phylogeny inferred from a combined data set of 18S rDNA, rbcL, and atpB sequences. Bot. J. Linn. Soc. 133:381-461.
Soltis, P. S., D. E. Soltis, V. Savolainen, P. R. Crane, and T. G. Barraclough. 2002. Rate heterogeneity among lineages of tracheophytes: Integration of molecular and fossil data and evidence for molecular living fossils. Proc. Natl. Acad. Sci. USA 99:4430-4435.
Steel, M., and D. Penny. 2005. Maximum parsimony and the phylogenetic information in multistate characters. Pages 163-178 in Parsimony, phylogeny, and genomics (V. A. Albert, ed.). Oxford University Press, Oxford.
Swofford, D. L. 2001. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Massachusetts.
Wenzel, J. W., and M. E. Siddall. 1999. Noise. Cladistics 15:51-64.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555-556.
Yang, Z., N. Goldman, and A. Friday. 1995. Maximum likelihood trees from DNA sequences: A peculiar statistical estimation problem. Syst. Biol. 44:384-399.
Yang, Z., and B. Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo method. Mol. Biol. Evol. 14:717-724.
Zujko-Miller, C., and J. A. Miller. 2003. PEST: Precision estimated by sampling traits, http://www.gwu.edu/~clade/spiders/pestDocs.htm. Program distributed by the authors.
First submitted 28 April 2005; reviews returned 2 September 2005; final acceptance 14 October 2005
Associate Editor: Pam Soltis