Characters, States, and Homology

Syst. Biol. 54(6):965–973, 2005
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500354654
Characters, States, and Homology
JOHN V. FREUDENSTEIN
Department of Evolution, Ecology and Organismal Biology, Ohio State University Herbarium, 1315 Kinnear Road, Columbus, Ohio 43212, USA;
E-mail: [email protected]
Characters are the fundamental units used to formalize
hypotheses of homology for all phylogenetic analyses,
meaning that the decision about how observations are
translated into characters is of paramount importance in
systematics. Clearly, the importance of characters also extends beyond systematics, being central in evolutionary
process studies (cf. Gould and Lewontin, 1979), physiology, and any branch of biology that is concerned with
the attributes of organisms. Therefore, it is important that
an internally consistent, nonarbitrary, yet flexible way of
viewing characters be available that can accommodate
any type of organismal aspect. It is beyond the scope of
this contribution to attempt to solve all problems with
character delimitation and coding, but one important
issue, the distinction between characters and states,
remains problematic and might be clarified by review and
consideration in the light of current thinking in systematics.
Although the idea of homologous structures among
taxa has a long history (cf. Panchen, 1994), the distinction between the terms character and character state was
not introduced until the middle of the 20th century. Mayr
(1942), for example, used the term character to denote
the particular attribute of an organism (e.g., red flowers, backbone, or five petals), not distinguishing between
character and state. It was with the rise of numerical approaches to taxonomy that the character/state distinction
became common. Maslin (1952) described a “chronocline” that relates a series of characters through time and
is equivalent to the current concept of transformation series. Michener and Sokal (1957) distinguished between
the character/state usage (which they employed) and the
practice of calling all attributes simply characters, but ascribed no conceptual implications to the difference. Cain
and Harrison (1958) did not use the term state, but did
assign different numerical values to characters. Sneath
and Sokal (1962) and Davis and Heywood (1963) made
a clear distinction between characters and states. Hennig
(1966) and Wiley (1980, 1981) also recognized this conceptual distinction, although they considered character
and state to be synonyms (equivalent to state) and, like
Maslin (1952), used transformation series for what many
now term characters.
Farris et al. (1970) provided clear definitions of character and character state based on Hennig (1966) for the
purposes of phylogenetic analysis:
A character (“transformation series” of Hennig) is a collection of mutually exclusive states (attributes; features; “characters,” “character
states,” or “stages of expression” of Hennig) which
a) have a fixed order of evolution such that
b) each state is derived directly from just one other state, and
c) there is a unique state from which every other state is eventually
derived.
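Read literally, conditions (a) through (c) describe a rooted tree of states: each state has exactly one immediate precursor, and a single state stands at the origin. The following minimal Python sketch checks a proposed set of states against that reading. The function name and the dictionary representation (state mapped to its precursor, with `None` marking the origin) are assumptions made for illustration and are not part of the Farris et al. definition.

```python
def is_farris_character(parent_of):
    """Check a state -> precursor map against conditions (a)-(c).

    parent_of maps each state to the single state it is derived from;
    the unique original state maps to None. (Hypothetical representation.)
    """
    roots = [state for state, parent in parent_of.items() if parent is None]
    if len(roots) != 1:
        return False  # condition (c): exactly one original state
    root = roots[0]
    for state in parent_of:
        seen = set()
        current = state
        # Follow precursors until the original state is reached.
        while current != root:
            if current in seen or current not in parent_of:
                return False  # a cycle, or a precursor outside the character
            seen.add(current)
            current = parent_of[current]
    return True


# A linear series 0 -> 1 -> 2 and a branched series both qualify.
linear = {"0": None, "1": "0", "2": "1"}
branched = {"a": None, "b": "a", "c": "a"}
# Two unrelated origins, or states derived only from each other, do not.
two_origins = {"a": None, "b": None}
circular = {"a": None, "b": "c", "c": "b"}
```

Condition (b) is enforced by the representation itself, because a dictionary can record only one precursor per state. Note that nothing in the check limits how much of a transformation series is included, which is the circumscription problem taken up in the text that follows.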
Pimentel and Riggins (1987) defined a character more
simply as “a feature of organisms that can be evaluated
as a variable with two or more mutually exclusive
and ordered states.” Many authors use characters that
correspond essentially to these definitions (except that a
priori state ordering is not usually required). However,
neither of these definitions circumscribes characters
well with respect to other such units because they
do not specify where a character “begins and ends.”
In fact, in the context of current definitions, Brower
(2000) noted that, “it is not possible . . . to know with
certainty where one character ends and the next begins.”
These definitions, for example, do not preclude various
portions of a single transformation series being called
different characters and therefore embody a significant
degree of arbitrariness.
The distinction between the notions of character and
state has itself at times been challenged as unnecessary
and arbitrary. Bock (1973) stated that “no distinction exists between characters and character states. The latter
are simply characteristics which may be homologous
with a more restrictive conditional phrase.” Platnick
(1978) agreed with this notion and later (1979) explained
his position more fully:
. . . all characters can be seen as modifications (or restrictions) of other
characters, and the grouping of character states within a character
can be seen as just arbitrarily delimiting clusters of separate characters that are increasingly more restricted in generality (i.e., that form
nested sets of increasingly modified versions of other characters).
Further indications of the perceived arbitrary nature of characters can be found. Eldredge and Cracraft
(1980) considered the terms character and state to indicate only “relative levels of similarity within a given
hierarchy.” Nelson and Platnick (1981) and Patterson
(1988) saw no distinction between characters and states.
Ghiselin (1984), from a philosophical perspective, suggested abandoning both terms and substituting “feature” for both. Pleijel (1995) defined characters and states
to be the columns and cell values, respectively, in data
matrices—a pragmatic but conceptually minimalist approach. Many others (e.g., Pimentel and Riggins, 1987;
Brower and Schawaroch, 1996; and Hawkins et al., 1997)
have argued in favor of the character/state distinction
based on its usefulness.
The crucial point explored here is not whether the
distinction between characters and states is useful, but
whether it is arbitrary. Platnick’s claim of arbitrariness
begs a justification for the use of both concepts beyond
simple convenience, as does the uncertainty in character
circumscription in the definitions of Farris et al. (1970)
and Pimentel and Riggins (1987). I argue here that there
is a conceptual justification for the distinction and that
it is to be found in our notions of the different homology relations that are commonly recognized for genetic
features—in short, characters correspond essentially to
paralogs and their states to their orthologs, and this distinction should be embraced as a paradigm for all data
types. Furthermore, there are practical coding implications that follow from the way that characters are viewed
and these need to be considered when empirical studies
are undertaken.
CHARACTERS—ONTOLOGY AND EPISTEMOLOGY
As with any system in which a theoretical framework
has real applications, it is important to distinguish the
conceptual basis for characters (telling us what a character is) from the practical operation of finding them and
to recognize that the resulting empirical units may correspond only imperfectly to the conceptual ideal. This
can be due to complexity in the empirical case that obscures mapping to the conceptual framework. The imperfect correspondence of the empirical to the conceptual
does not detract from the usefulness of the latter concept,
however. A parallel to the character situation exists with
species concepts (Frost and Kluge, 1994; Baum and Shaw,
1995), where we conceive of a conceptual unit (such as
the Evolutionary Species; Wiley, 1978) as well as methods that allow us to approximate such a unit in practice
(e.g., Phylogenetic Species of Nixon and Wheeler, 1990).
HOMOLOGY AND MOLECULAR CHARACTERS
It is helpful to focus first on molecular characters to
examine homology relations because the situation is at
least superficially more straightforward than with morphology due to the discrete and “simple” nature of the
characters. An important first point is that the way that
attributes of taxa originate with respect to each other
is the key to their homology relation. Fitch (1970) distinguished between orthologous and paralogous proteins,
the former representing variants of a protein in different
species, whereas the latter are proteins found in a single
individual that resulted from a gene duplication event.
Solignac et al. (1995) recognized that paralogy and orthology, as originally defined, do not cover all possible
shades of homology relations; they coined the term metalogous to refer to the relationship between paralogs that
have been separated by a speciation event (and therefore appear in different taxa). Koonin (2001) also proposed the same term for this situation. Sonnhammer and
Koonin (2002) subsequently coined the terms inparalog
and outparalog to denote the same distinction. Paralogy
is commonly used more broadly than its original definition and will be used here to include Solignac et al.’s
metalogous relation.
Gene duplication is well established as a mechanism
(perhaps the dominant one) for creating new loci that can
diverge and specialize in function (Ohno, 1970; Hughes,
1994; Zhang, 2003). The paralogy relation defines groups
of orthologs, because duplicated loci may be free to
diverge independently, resulting in mutually exclusive
transformation series of orthologs. The importance of
correctly distinguishing these patterns has been discussed frequently (e.g., Goodman et al., 1979; Sanderson
and Doyle, 1992; Zouine et al., 2002), because mistaking
the nature of gene relationships can lead to errors in the
reconstruction of taxon trees.
Two other mechanisms for generating new loci are fusion of previously separate elements (domain shuffling)
and acquisition of foreign genetic material (lateral transfer). Domain shuffling is a process by which gene segments that code for protein domains are combined to
yield new loci (Doolittle, 1995). This process ultimately
depends on duplication as well, because the raw materials for new genes are derived from partial copies of other,
perhaps still functional, genes. In this sense, the pattern
is a subset of paralogy, but essentially results in a “reverse paralogy” event, because instead of yielding new,
potentially independently changing DNA segments, two
or more segments are combined into a single unit.
Lateral transfer of genetic material among taxa is best
known from microbial genomes, but Won and Renner
(2003) described a case of transfer of a mitochondrial
intron among seed plant lineages, suggesting that increased scrutiny of genomes is likely to reveal additional
cases among multicellular organisms. The overall extent
of this process in the history of organisms is far from
known. When a DNA segment is transferred laterally
(i.e., nonhierarchically) between taxa, it yields a relationship to the homologous native segment that was termed
xenology by Gray and Fitch (1983). That particular homology relation is of less interest here than the simple fact
that foreign DNA can become part of a genome. With
either the introduction of foreign DNA or the fusion of
segments, a new segment is established that will behave
essentially as a paralog when compared to a previously
existing similar locus, in the sense that it can accumulate
its own orthologs as mutations occur.
Thus, new loci may arise by duplication, fusion, or insertion from a foreign source; these processes generate
a new unit that at least potentially has its own distinct
fate. The reification of loci as systematic characters ultimately depends only on their individualization from
other such segments, rather than the specific process by
which they achieved this independence. Individualization as used here is the acquisition of transformational
(as opposed to genomic or linkage) independence, meaning that two features can, at least potentially, change
independently. This is essentially the same as “quasi-independence,” named by Lewontin (1978) as a property necessary for features to be susceptible to adaptive
change and taken up by Stadler et al. (2001) and Wagner
and Stadler (2003). If a particular DNA segment is known
not to be independent of another (such as members of a
tandem repeat array that undergo concerted evolution),
it should not be called a distinct character, because independence is a basic requirement for systematic characters
(e.g., Cain and Harrison, 1958; Wheeler and Honeycutt,
1988; Schuh, 2000).
MORPHOLOGICAL CHARACTERS
Patterson (1988) described the parallels between morphological and molecular homology, but did not examine the range of morphological situations that exist, nor
did he make the conceptual connection between homology relation and the distinction between characters and
states. In fact, he explicitly rejected the use of character
states as distinct from characters.
If the key to character individualization is transformational independence, and characters come into being
from preexisting ones by duplication, fusion, or foreign
acquisition, then the homology paradigm for molecular
characters should also apply to any other characters that
arise through such processes. Müller and Wagner (1991)
reviewed the concept of morphological novelty and described a number of ways that apomorphies can arise that
correspond to the aforementioned processes, including
differentiation of repeated elements, synorganization of
elements, apparent de novo origin (“new elements”), and
change of shape and context.
The duplication of morphological structures (serial or
iterative homology) is equivalent to genetic duplication,
and like it leads to a paralogy relationship (Patterson,
1988). Animal segmentation is perhaps the classic case of
serial homology and one that is easy to recognize. However, Nelson (1994) argued that many more morphological features can have paralogous relationships. He cited
the example of mammalian hair and mammary glands,
which he concluded are related because they are both
epidermal derivatives and are connected to each other
by a series of other homologs “in a manner perhaps only
analogous to gene duplication.” He further concluded
that ultimately “all characters are homologous.” The fact
that new morphological features become individualized
from coexisting structures is sufficient to render their hierarchic pattern comparable to that expected with paralogous genes, because each individualized structure is at
least potentially free to change independently of others,
producing a group of orthologs (states).
Synorganization of morphological structures is
equivalent to fusion of genetic segments as in domain
shuffling. A good example is the column in orchids, the
result of fusion among styles and stamens (Rudall and
Bateman, 2002). Other examples are described in Müller
and Wagner (1991). In both morphological and molecular
situations, one might expect a range of degree to which
the synorganization is accomplished.
It is difficult to imagine that any morphological structures can “come from nowhere”; they must be the result of underlying mutations, shifts in developmental
patterns, or other modification of preexisting information. One view describes morphological innovation as
likely arising at least in part by the stabilization of epigenetic variation, followed by the “capture” and encoding of this innovation at the molecular level (Newman and Müller, 2001). There are many cases where the
origin of a morphological structure is unclear; Müller
and Wagner (1991) used the corpus callosum of the
brain as an example where no precursor is evident. Although novel morphological structures clearly are not
imported into an organism as foreign genes may be, the
resulting pattern can be the same—in both instances,
a new transformation series (character) is established,
the equivalent of a paralog, with its resulting orthologs
(states).
Changes in shape and context result only in modifications of previously existing structures and therefore do
not produce new transformation series (characters), only
new states (orthologs) in currently existing paralogs.
It is important to note that the application of the
concepts of paralogy and orthology, fusion, and foreign acquisition to phenotypic features such as physical
structures does not depend on or imply a direct connection between processes at the genetic level and at the
phenotypic level. In other words, paralogous morphological structures need not correspond in a one-to-one
way to underlying paralogous genes. The relationship
between genotype and morphological structure is complex and poorly understood for most characters (Shubin
and Marshall, 2000). Wagner (1989a, 1989b) stated that
“It seems implausible that continuity of gene lineages
alone could account for the homology of morphological features” and that “the continuity of descent is an
epiphenomenon of the continuity of gene lineages.” At
the level of individual genes, it is known that homologous loci may control either homologous or nonhomologous structures (Bolker and Raff, 1996). Because
duplicated genes only rarely give rise to new proteins
(Lynch and Conery, 2000), and only some of these will
contribute to structural features, it is not clear that gene
duplication can be implicated in a direct way to explain
the diversity of morphological features.
If, as is argued here, all characters can be viewed in
the same way with respect to the homology relations
that they exhibit, then a broadly applicable nonarbitrary
distinction between characters and states exists and the
extent of an individual character with respect to others is
clear. I propose the following definitions that incorporate
this reasoning:
Characters are individualized assemblages of features
(states) among taxa that are the result of duplications,
fusions, or foreign acquisitions (“novelties”) and whose
elements exhibit paralogous or equivalent nonorthologous relationships to other such assemblages.
Character states are mutually exclusive features
among taxa of a single paralog-equivalent assemblage
that exhibit orthologous relationships to each other.
Note that these definitions leave no room for ambiguity as to the sets of states that comprise a character and
do not permit portions of a single transformation series
to be called different characters.
OPERATIONAL ASPECTS
Although the concept of a character may be easily
stated, the empirical exercise of sorting orthologs into
their paralogs can be a difficult one and is an integral
part of the process of establishing homology hypotheses.
Indeed, there have been empirical methods proposed to
deal explicitly with accommodating molecular paralogs
in phylogenetic analysis. Goodman et al. (1979) and Page
(1994 and subsequent work) developed the method of
“reconciled trees” to fit gene trees (or, more broadly, character trees) into their species trees by minimizing (and
thus identifying) duplications and other events, whereas
Simmons et al. (2000) developed the “uninode” approach
for coding taxa that have experienced known duplications. Advocates of each of these methods have argued
its merits (Simmons and Freudenstein, 2002b; Cotton and
Page, 2003).
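The core idea behind reconciliation, identifying the duplications needed to fit a gene tree into a species tree, can be sketched in Python. The sketch below uses the widely used last-common-ancestor mapping: each gene-tree node is mapped to the smallest species-tree clade that contains all of its species, and a node is counted as a duplication when it maps to the same clade as one of its children. It is offered as an illustration under stated assumptions, not as the published procedure of Goodman et al. or Page; the nested-tuple tree format, the function names, and the gene and species labels are all hypothetical.

```python
def species_clusters(tree):
    """Collect the leaf set (cluster) of every node in a nested-tuple species tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def count_duplications(gene_tree, species_tree, species_of):
    """Count gene-tree nodes that map to the same species clade as a child."""
    all_clusters = species_clusters(species_tree)

    def lca(species):
        # Smallest species-tree cluster containing every species in the set.
        return min((c for c in all_clusters if species <= c), key=len)

    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(frozenset([species_of[node]]))
        child_maps = [walk(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:
            duplications += 1  # parent and child occupy the same clade
        return here

    walk(gene_tree)
    return duplications


# Hypothetical example: two paralogous copies (1 and 2) in species A, B, C.
species_tree = (("A", "B"), "C")
species_of = {"a1": "A", "b1": "B", "c1": "C", "a2": "A", "b2": "B", "c2": "C"}
one_copy = (("a1", "b1"), "c1")
two_copies = ((("a1", "b1"), "c1"), (("a2", "b2"), "c2"))
print(count_duplications(one_copy, species_tree, species_of))    # 0
print(count_duplications(two_copies, species_tree, species_of))  # 1
```

In the terms used here, the single inferred duplication in the second gene tree separates two paralogs (characters), each containing its own set of orthologs (states).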
De Pinna (1991) wrote that “the decision whether any
two or more attributes comprise a single transformation
series or two or more independent series is one of the
most basic, albeit still confusing issues in systematics.”
Patterson (1982, 1988) and de Pinna (1991), among others,
described three tests that can be used to distinguish homology relations—similarity, congruence, and conjunction. Similarity refers to perceived degree of sameness
among features, and as such is subjective, although criteria exist (e.g., Remane, 1950). Congruence depends on
topological relationship on a cladogram—whether a feature represents a synapomorphy or not. Conjunction asks
whether two features are present in the same organism.
Whether all of these are viewed equally as tests or whether similarity is viewed only as an initial criterion (de Pinna,
1991; Brower and Schawaroch, 1996) is not important for
this discussion. Orthologs pass all three tests as long as
states are mutually exclusive, whereas paralogs and their
equivalents fail conjunction, because a state from each of
two or more characters will be present in the same organism. Hence, conjunction can be used to distinguish features that are states of different characters from those that
are states of a single character. As an example, imagine
that there are four epidermal protrusion shapes that can
occur—rounded, pointed, linear, and squared. If only one
of these types is observed in each organism belonging to
the study taxa, one might hypothesize that these are all
states of the same character. However, if states co-occur in
an organism, then they cannot be states of the same character. If rounded and pointed are observed to co-occur
and linear and squared co-occur, then we hypothesize
that two characters are involved. However, we may still
not know the exact character circumscriptions—i.e., if
the two characters are (rounded, linear) and (pointed,
squared) or (rounded, squared) and (linear, pointed)
without some other type of information, such as topographic correspondence (specifying in sufficient detail
on which part of the organism particular protuberances
occur). Observing co-occurrence of any other pair (such
as rounded and linear) will answer that question, however, because the two states that co-occur cannot be part
of the same character.
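The protrusion example can be worked through mechanically. The Python sketch below treats every pair of shapes observed together in one organism as failing conjunction and then enumerates the ways of dividing the shapes between two characters that respect those failures. The function names are hypothetical, and the restriction to exactly two characters is an assumption taken from the example rather than a general property of the test.

```python
from itertools import combinations, product


def conflicts(observations):
    """Pairs of states seen together in one organism (they fail conjunction)."""
    pairs = set()
    for seen in observations:
        pairs.update(frozenset(p) for p in combinations(sorted(seen), 2))
    return pairs


def two_character_partitions(states, observations):
    """List every split of the states into two characters in which no two
    co-occurring states are assigned to the same character."""
    states = sorted(states)
    bad = conflicts(observations)
    results = []
    # Fix the first state in character 0 so mirror-image splits are not repeated.
    for rest in product((0, 1), repeat=len(states) - 1):
        labels = dict(zip(states, (0,) + rest))
        if any(labels[a] == labels[b] for a, b in bad):
            continue
        groups = [frozenset(s for s in states if labels[s] == k) for k in (0, 1)]
        results.append(frozenset(groups))
    return results


shapes = {"rounded", "pointed", "linear", "squared"}

# Rounded with pointed, and linear with squared: two circumscriptions remain.
first = two_character_partitions(shapes, [{"rounded", "pointed"},
                                          {"linear", "squared"}])
print(len(first))  # 2

# Adding one organism with rounded and linear protrusions settles the question.
second = two_character_partitions(shapes, [{"rounded", "pointed"},
                                           {"linear", "squared"},
                                           {"rounded", "linear"}])
print(len(second))  # 1: (linear, pointed) and (rounded, squared)
```

The enumeration reproduces the reasoning in the text: co-occurrence can only exclude circumscriptions, so ambiguity persists until enough pairs have been observed together or topographic information is added.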
Brower (2000) stated that, “Character state identity, but
not topographical identity, is tested by character congruence in cladistic analysis,” but this is not strictly true.
Congruence can be applied as a test for all characters but
it cannot distinguish between errors due to paralogy and
orthology; homoplasy will be detected but its basis will
not be clear. That is, any particular instance of homoplasy
could be explained either by the character state having
been derived independently in two taxa (falsifying only
character state identity) or by the state actually belonging
to another character (falsifying both topographic identity and character state identity). Although it is true that
we usually hold character hypotheses fixed when interpreting the pattern of character state changes that results
from a phylogenetic analysis (i.e., we do not change the
sequence alignment or shift morphological states among
characters) and attribute homoplasy to incorrect character state homology assessment, errors in assignment of
states to their characters must remain a possible explanation. That is why instances of homoplasy can prompt
us to reconsider homology hypotheses of either type in
the reciprocal illumination process.
One approach in which hypotheses of both character
and state truly are evaluated simultaneously is that of the DNA
alignment procedures developed by Wheeler (Wheeler
and Gladstein, 1994; Wheeler, 1996) in which alternative alignments are evaluated by parsimony (MALIGN)
or sequence transformations are created directly on the
trees they imply (POY). No attempt is made to distinguish hypotheses of the two types, however.
INDIVIDUALIZATION AND CODING LEVELS
Having discussed what characters are conceptually
and how to distinguish them operationally, I now turn
to practical aspects of their coding. Because not every
distinguishable feature need be individualized with respect to transformational independence, not every one
should be called a character. Particular hairs on a mammal would not be considered individualized units, for example, unless each truly could transform independently
of others. However, “hair” as a whole might be individualized. Whether any particular individualized morphological structure can in fact transform independently
of another is an empirical question, and one that exists
for genes as well (cf. concerted evolution; Zimmer et al.,
1980). One might question whether individualization is
a phenomenon that exists apart from our ability to recognize it, or whether it is simply a matter of our perception.
I argue that it is real, to the extent that there is some minimal feature that is potentially free to change apart from
other such units. Ultimately, this feature is the nucleotide.
This does not preclude individualization also among
phenotypic features based on the nucleotide, however.
Such phenotypic features would include amino acids,
secondary chemical compounds, morphological structures, etc. This diversity of levels exists because different
processes operate to cause change at different organizational levels and there is a resulting hierarchy of parts
within an organism as well as among taxa (Riedl, 1978).
Discovering the individualized unit at any particular
level is an empirical exercise—one that is attempted routinely in the delineation of characters. Wagner (1989b),
for example, argued for the importance of developmental
constraints in identifying minimal morphological units
of change.
The level of characters used in a phylogenetic analysis
is a matter of choice, because in many cases either reductive (minimal) characters or composite characters that
are based on them could be utilized (Wilkinson, 1995;
Simmons and Freudenstein, 2002a). This difference in
level should not be confused with the distinction between characters and states and is the reason that the
term “minimal level” does not appear in the character
and state definitions presented here—because it would
force the definition to the level of the nucleotide. Perhaps in that sense there remains arbitrariness in these
definitions, but if so it is a result of the multiple levels
of hierarchy that exist in and among living organisms.
Some degree of flexibility in these definitions is required
in order to capture the maximum amount of information
available.
Thus, for genetic data, the presence/absence of a particular locus may be scored, as may variants of each
locus. The loci themselves are paralogs or their equivalents and are characters, whereas the variants of each
are orthologs, or character states. Alternatively, individual base positions in an alignment may be scored. Base
positions have a paralogous relationship to each other to
the extent that increase in gene size is the result of duplications (e.g., slipped-strand mispairing) or insertions.
Hence, base positions are also characters (to the extent
that they are independent) and base identities at each position are states. The same is true for morphology, where
we might have the choice of coding the face, the nose, or
the nostril as the character. Whether both genomic and
phenotypic characters based on them should be coded
in the same analysis is a separate question (Agosti et al.,
1996; Freudenstein et al., 2003).
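For sequence data, the reading of base positions as characters and base identities as states corresponds directly to the columns of an alignment. The short Python sketch below makes that correspondence explicit. The taxon names and sequences are invented for illustration, the function name is hypothetical, and the sequences are assumed to be already aligned and of equal length.

```python
def columns_as_characters(alignment):
    """Turn an alignment (taxon -> aligned sequence) into one mapping per
    base position, each giving the state (base) scored for every taxon."""
    taxa = sorted(alignment)
    length = len(alignment[taxa[0]])
    return [{taxon: alignment[taxon][i] for taxon in taxa} for i in range(length)]


# Hypothetical three-taxon alignment of three base positions.
alignment = {"taxon1": "ACG", "taxon2": "ATG", "taxon3": "ATG"}
characters = columns_as_characters(alignment)

# The second position is one character; its observed states are C and T.
states_at_second_position = sorted(set(characters[1].values()))
print(states_at_second_position)  # ['C', 'T']
```

The sketch presupposes that each column is transformationally independent of the others, which is the same empirical assumption the text attaches to treating base positions as characters.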
HIERARCHIC CHARACTER PATTERNS AND IMPLICATIONS FOR CODING: EXAMPLES
Thus far we have a character concept that can be
applied at different levels within an organism. Once
characters and their states have been recognized, it is
necessary to code them for analysis, which means fitting these constructs into a form that is interpretable
by current phylogenetic analysis software, using what
are termed hereafter “matrix characters.” This is another
level at which some ambiguity occurs, because at least
in some cases there is no single optimal way to represent the homology statements for the software (see examples below). The recognition of characters as paralogs
or equivalents distinct from states as orthologs does not
preclude flexibility in coding the matrix characters—
the number of paralogs need not equal the number of
character columns, for example.
Even though the correspondence between at least
some homology hypotheses and matrix characters is imperfect, the conceptual view of a character does have
certain logical implications for the matrix character(s)
used to represent it. Platnick (1979) imagined “a great
chain of characters (or synapomorphies, or homologies)
stretching from those of complete generality, which are
true for all life, on to those true for only a single species.”
The logical approach to coding such variation in a matrix would be as a series of binary characters, as suggested by Platnick (1979) and Pleijel (1995). Coding in
this presence-absence way could easily be done for simple characters that correspond to paralogs as described
here, because they should all be observable in each organism. However, if there were variation within any of
the paralogs, those variants would be orthologs (creating a transformation series), and coding them as binary
characters would pose a problem. The problem is that
because only a single state for each character is observable in an organism, only it can be coded as present,
even though that state is part of the larger transformation series, any antecedent states of which are also truly
present (in modified form). Platnick (1979) criticized unordered multistate coding because it does not preserve
this hierarchic information among states, but unless one
is willing to make assumptions about ancestral states,
neither does binary coding. At least coding orthologs as
states of a nonadditive multistate character eliminates
some of the problems and biases encountered when attempting to code them as binary characters (Pimentel
and Riggins, 1987; Hawkins et al., 1997; Hawkins, 2000).
Those problems stem from the fact that transformation
from the presence state (“1”) of one binary character to
the presence state of another always requires two steps
(loss of one and gain of the other) and thus biases against
a direct transformation between those states, making independent gain of two features cost the same as transformation between them.
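The step-count argument above can be checked with a small sketch. The features X and Y, the "neither" ancestor, and the function name `steps` are hypothetical and chosen only for illustration; each column is treated as unordered (nonadditive), so any change in a column costs one step.

```python
def steps(a, b):
    """Count unordered steps between two coded rows: one step per differing column."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical coding of two features, X and Y, plus an ancestor that has neither.
binary = {"neither": (0, 0), "X": (1, 0), "Y": (0, 1)}   # one presence/absence column per feature
multistate = {"neither": (0,), "X": (1,), "Y": (2,)}     # one nonadditive multistate column

for name, coding in (("binary", binary), ("multistate", multistate)):
    direct = steps(coding["X"], coding["Y"])
    independent = (steps(coding["neither"], coding["X"])
                   + steps(coding["neither"], coding["Y"]))
    print(name, "direct X->Y:", direct, "independent gains:", independent)
```

Under the binary coding the direct transformation and the two independent gains both cost two steps, whereas under the multistate coding the direct transformation costs one step and the independent gains cost two.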
It is useful to consider some examples of characters and
states to see how these concepts relate to matrix coding.
Eldredge and Cracraft (1980:30) provided an example to
illustrate the arbitrariness of characters and states:
. . . the character “feathers” is a common similarity of all birds, although specific character-states of the character “feathers” (e.g., variation in color texture, and pattern) would be similarities common to
various groups of birds. At the same time, it is apparent that even
the character “feathers” could be considered a character-state, say
within the vertebrates, if the systematist were considering the “character” to be the vertebrate integument. . .
An illustration of this situation is provided in
Figure 1a, where symbols are used to denote features. In
this diagram, the triangle represents a precursor indumentum type (such as the chondrichthyan denticle) and
the square a scale, believed to be transformed at some
point among tetrapods into a feather, designated by a circle. Among feathers there are three distinct unspecified
types, designated by ticks on the circle.

FIGURE 1. A cladogram for five taxa (a) and corresponding data matrix (b), which shows three ways of encoding the variation.

Because birds have both scales and feathers (i.e., only some of their
scales are transformed into feathers), there is the potential that they represent two characters (because having
both would appear to mean that they fail the conjunction
test). However, if it is just a proportion of the original
scales that are transformed into feathers, then it becomes
a question of whether the feathers and remaining scales
are transforming independently. If they are, then two
characters would be warranted, because two sets of orthologs have been established. If not (as assumed for the
purpose of this example), then the variation comprises
a single set of orthologs, so a single nonadditive multistate character could be used (Fig. 1b). However, this
coding allows transformation between any two states in
one step, which may be undesirable if one wishes to emphasize the homology of feathers. Another well-known
option for this situation is to use two matrix characters,
one coding for each different basic type of indumentum
(denticle, scale, feather), and a second to code for the different types of feathers (and coding taxa with scales as
inapplicable for feather type). Although this approach
uses two columns in the matrix, it yields the same minimum number of apomorphies (four) as the multistate
option. It does incur the problems of inapplicable states,
however (Maddison, 1993). A third option would be to
use a binary character for each feature, but although this
approach also uses four apomorphies, it requires two
steps to go between any two states, making independent derivation of states cost the same as transformation
among them.
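The equal step counts reported for the first two coding options can be illustrated with a minimal Fitch-parsimony sketch. The tree, the taxon names A to E, and the state assignments below are assumptions made for illustration and are not taken from Figure 1; inapplicable cells are treated as missing data, which is the handling whose problems Maddison (1993) discusses.

```python
def fitch(tree, states):
    """Return (state_set, steps) for a binary nested-tuple tree under unordered parsimony."""
    if isinstance(tree, str):                 # leaf: look up its observed state set
        return states[tree], 0
    left, left_steps = fitch(tree[0], states)
    right, right_steps = fitch(tree[1], states)
    common = left & right
    if common:                                # children agree: no extra step
        return common, left_steps + right_steps
    return left | right, left_steps + right_steps + 1

# Hypothetical five-taxon tree: A has a denticle, B a scale, C-E three feather types.
tree = ("A", ("B", ("C", ("D", "E"))))

# Option 1: a single nonadditive multistate column.
multistate = {"A": {0}, "B": {1}, "C": {2}, "D": {3}, "E": {4}}

# Option 2: one column for indumentum type, one for feather type.
indumentum = {"A": {0}, "B": {1}, "C": {2}, "D": {2}, "E": {2}}
ANY = {0, 1, 2}                               # inapplicable, treated here as missing
feather_type = {"A": ANY, "B": ANY, "C": {0}, "D": {1}, "E": {2}}

multistate_steps = fitch(tree, multistate)[1]
two_column_steps = fitch(tree, indumentum)[1] + fitch(tree, feather_type)[1]
print(multistate_steps, two_column_steps)
```

With these assumed data both codings require four steps, in line with the count given in the text; the difference between them lies in which transformations are permitted and in the handling of inapplicable cells, not in tree length.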
Homology relationships among features can be more
complex than this simple example when paralogy is also
involved. Doyle and Davis (1998:106) illustrated a situation where a gene duplication occurs, resulting in a more
complicated pattern. Their diagram is adapted here in
Figure 2a to illustrate the different situations that can
arise, using symbols to represent characters and states
(which could be either molecular or morphological), in
a character phylogeny that is equivalent to a gene tree.
The symbols indicate similar units—those that have circles, regardless of shading, would all be recognized as
“the same thing,” whereas shading differences indicate
more minor variants. Three duplication events are depicted. Under the paradigm described here, a new character is created at each duplication event. Therefore, four
characters (paralogs) are depicted in Figure 2a. The corresponding taxon tree is shown in Figure 2b.
FIGURE 2. A character (gene) tree for five taxa (a), its corresponding taxon tree (b), and two possible matrices for the data (c, d). The diamonds
in (a) represent duplication events, symbols at tick marks represent transformations, and symbols at terminals are the states for those taxa. “D”
in the matrix indicates a duplication matrix character. The multistate matrix character designated by (N) is an alternative way of coding the
duplications.
Doyle and Davis (1998) made the point that although
the loci in taxa whose divergence followed the duplication are clearly paralogous to one another, each paralog must be viewed as orthologous to the feature in the
taxon preceding the duplication. In terms of hierarchic
relationship only, each of the two paralogs has exactly
the same relationship to the sister of the pair, regardless
of function. However, it may well be that one paralog
retains the function/morphology of the ancestral locus,
whereas the other acquires a new function/morphology.
Fitch (2000) suggested the term isortholog to designate
the member of the duplicated pair that retains the ancestral function. This relation can be seen in Figure 2a,
where the character “square” persists through the first
duplication event, at which point a new character (“circle”) is created by the duplication. The square in taxon
A is coded as an ortholog to the squares in taxa B to
E, and it is also an ortholog to the circle character. This
means that a particular feature can be a state in each
of two characters, which is not problematic because it
will always represent the plesiomorphic state in each of
them.
Is a new character created even when two paralogs
are indistinguishable? Yes; this is illustrated in Figure 2a
at duplication 3. There are no differences between the
paralogs that can be coded as character states. However, the duplication itself (the origination of the new
character), as evidenced by the presence of a pair of
identical structures, can be coded. This is similar to the
“uninode” approach of Simmons et al. (2000) for analyzing paralogous loci simultaneously with those of
preexisting unduplicated taxa, in which they coded a
separate character in the species tree analysis to mark a
duplication.
Hence, two types of information potentially can be
coded in a data set—the number of characters that are
present in the taxa and the variation that occurs within
each. When the number of characters present among
taxa is uniform (i.e., with no taxa having “absence”
states), such as with an invariant length DNA sequence
or a morphological dataset in which all characters are
present in all taxa, then one matrix character per paralog is sufficient to code the orthologs. If the number
of characters present among taxa varies, this information can be coded in different ways. One option is to
code a presence/absence matrix character for each paralog (representing duplication/other origination events)
and a separate (perhaps multistate) matrix character to
score orthologs for each paralog (Fig. 2c and as in Fig.
1 discussed above). In this case, the state for a paralog that is scored for taxa that preceded the duplication would be “inapplicable.” No matrix character was
coded for the “keystone” character in the example in Figure 2 because there was no ortholog variation for that
character.
An equivalent way of accounting for the character
number (paralog) variation would be to score a single
character to enumerate the paralogs—essentially a transformation of characters rather than states (Fig. 2d). An
example of this would be coding the number of vertebrae for a group. This is a meristic character, which
only enumerates the iterated units but does not code
any variation they might have. Additional matrix characters would be needed to code variation for individual
vertebrae.
Another option is to code both the presence/absence
information for a character and its ortholog variation in
a single matrix character (coding “absence” as a state;
Fig. 2e). In this case a matrix character needs to be used
specifically to score duplication 3, which was not accompanied by any ortholog changes. The number of state
changes is five for all of these options.
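As a rough sketch of how the three coding options just described differ in matrix layout, the following builds each kind of row from the same observations. The taxa, the paralog names `p1` and `p2`, the state values, and the function names are all hypothetical and do not reproduce Figure 2; "-" marks an inapplicable cell.

```python
# Hypothetical observations: the state of each paralog in each taxon, or None if absent.
taxa = {
    "A": {"p1": 0, "p2": None},
    "B": {"p1": 0, "p2": 0},
    "C": {"p1": 1, "p2": 1},
}
paralogs = ["p1", "p2"]

def presence_plus_orthologs(obs):
    """One presence/absence column plus one ortholog column per paralog."""
    row = []
    for p in paralogs:
        state = obs[p]
        row += ["0", "-"] if state is None else ["1", str(state)]
    return row

def paralog_count(obs):
    """A single meristic column that only enumerates the paralogs present."""
    return [str(sum(obs[p] is not None for p in paralogs))]

def absence_as_state(obs):
    """One column per paralog; state 0 means absent, observed states shift up by one."""
    return ["0" if obs[p] is None else str(obs[p] + 1) for p in paralogs]

for name, obs in taxa.items():
    print(name, presence_plus_orthologs(obs), paralog_count(obs), absence_as_state(obs))
```

As the text notes for the vertebra example, the count column records only how many paralogs are present, so ortholog variation would still need its own columns under that option.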
The important element that all of these coding options have in common is the recognition that character
originations and state transformations are distinct and
that both can be recorded and used as synapomorphies.
This is consistent with the idea that taxon phylogenies
are built from character phylogenies and that orthology
and paralogy-type relations should be distinguished and
used in the proper context.
In summary, there is a nonarbitrary basis for the distinction between characters and states at any level: it
is the difference between paralogs and orthologs, which
can be recognized empirically by application of the conjunction criterion. Use of a character definition based
on the paralogy/orthology difference results in circumscription of characters as whole transformation series.
This definition does not preclude flexibility in the matrix
coding of characters, however.
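A deliberately simple reading of the conjunction criterion can be sketched as a co-occurrence check: two features found together in one organism are candidates for separate characters (paralogs), whereas features that never co-occur remain candidates for states of one character (orthologs). The organisms, feature names, and function name below are hypothetical, and the check is only a first pass; as the scale and feather example shows, an apparent co-occurrence may still need interpretation.

```python
from itertools import combinations

def conjunction_failures(organisms):
    """Return the set of feature pairs that co-occur in at least one organism."""
    failures = set()
    for features in organisms.values():
        # Sort so each pair is stored in one canonical order.
        failures.update(combinations(sorted(features), 2))
    return failures

# Hypothetical observations: the set of features found in each organism.
organisms = {
    "taxon1": {"square"},
    "taxon2": {"square", "circle"},
    "taxon3": {"square", "shaded circle"},
}

failed = conjunction_failures(organisms)
print(("circle", "square") in failed)         # co-occur: candidate paralogs
print(("circle", "shaded circle") in failed)  # never co-occur: candidate orthologs
```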
ACKNOWLEDGEMENTS
The author thanks Allan Baker, Marymegan Daly, Jerrold Davis,
Roderic Page, Kurt Pickett, Daniel Potter, Chris Randle, Mark
Simmons, Peter Stevens, Günter Wagner, John Wenzel, and Mark
Wilkinson for discussion of these ideas and comments on the
manuscript.
REFERENCES
Agosti, D., D. Jacobs, and R. DeSalle. 1996. On combining protein sequences and nucleic acid sequences in phylogenetic analysis: The
homeobox protein case. Cladistics 12:65–82.
Baum, D. A., and K. L. Shaw. 1995. Genealogical perspectives on the
species problem. Pages 289–303 in Experimental and molecular approaches to plant biosystematics (P. C. Hoch and A. G. Stephenson,
eds.). Missouri Botanical Garden, St. Louis, Missouri.
Bock, W. J. 1973. Philosophical foundations of classical evolutionary
classification. Syst. Zool. 22:375–392.
Bolker, J. A., and R. A. Raff. 1996. Developmental genetics and traditional homology. BioEssays 18:489–494.
Brower, A. V. Z. 2000. Homology and the inference of systematic relationships: Some historical and philosophical perspectives. Pages
10–21 in Homology and systematics: Coding characters for phylogenetic analysis (R. Scotland and R. T. Pennington, eds.). Taylor and
Francis, London.
Brower, A. V. Z., and V. Schawaroch. 1996. Three steps of homology
assessment. Cladistics 12:265–272.
Cain, A. J., and G. A. Harrison. 1958. An analysis of the taxonomist’s
judgment of affinity. Proc. Zool. Soc. Lond. 131:85–98.
Cotton, J. A., and R. D. M. Page. 2003. Gene tree parsimony vs. uninode coding for phylogenetic reconstruction. Mol. Phylogenet. Evol.
29:298–308.
Davis, P. H., and V. H. Heywood. 1963. Principles of angiosperm taxonomy. Van Nostrand, Princeton, New Jersey.
de Pinna, M. C. C. 1991. Concepts and tests of homology in the cladistic
paradigm. Cladistics 7:367–394.
Doolittle, R. F. 1995. The multiplicity of domains in proteins. Annu.
Rev. Biochem. 64:287–314.
Doyle, J. J., and J. I. Davis. 1998. Homology in molecular phylogenetics:
a parsimony perspective. Pages 101–131 in Molecular systematics of
plants II: DNA sequencing (D. E. Soltis, P. S. Soltis, and J. J. Doyle,
eds.). Kluwer, Boston.
Eldredge, N., and J. Cracraft. 1980. Phylogenetic patterns and the evolutionary process. Columbia University Press, New York.
Farris, J. S., A. G. Kluge, and M. J. Eckardt. 1970. A numerical approach
to phylogenetic systematics. Syst. Zool. 19:172–191.
Fitch, W. M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113.
Fitch, W. M. 2000. Homology: a personal view on some of the problems.
Trends Genet. 16:227–231.
Freudenstein, J. V., K. M. Pickett, M. P. Simmons, and J. W. Wenzel.
2003. From basepairs to birdsongs: phylogenetic data in the age of
genomics. Cladistics 19:333–347.
Frost, D. R., and A. G. Kluge. 1994. A consideration of epistemology
in systematic biology, with special reference to species. Cladistics
10:259–294.
Ghiselin, M. T. 1984. “Definition,” “character,” and other equivocal
terms. Syst. Zool. 33:104–110.
Goodman, M., J. Czelusniak, G. W. Moore, A. E. Romero-Herrera, and
G. Matsuda. 1979. Fitting the gene lineage into its species lineage,
a parsimony strategy illustrated by cladograms constructed from
globin sequences. Syst. Zool. 28:132–163.
Gould, S. J., and R. C. Lewontin. 1979. The spandrels of San Marco and
the Panglossian paradigm. Proc. R. Soc. Lond. B Biol. Sci. 205:581–598.
Gray, G. S., and W. M. Fitch. 1983. Evolution of antibiotic resistance
genes: The DNA sequence of a kanamycin resistance gene from
Staphylococcus aureus. Mol. Biol. Evol. 1:57–66.
Hawkins, J. A. 2000. A survey of primary homology assessment: different botanists perceive and define characters in different ways. Pages
22–53 in Homology and systematics: Coding characters for phylogenetic analysis (R. Scotland and R. T. Pennington, eds.). Taylor and
Francis, London.
Hawkins, J. A., C. E. Hughes, and R. W. Scotland. 1997. Primary homology assessment, characters and character states. Cladistics 13:275–
283.
Hennig, W. 1966. Phylogenetic systematics. University of Illinois Press,
Urbana.
Hughes, A. L. 1994. The evolution of functionally novel proteins after
gene duplication. Proc. R. Soc. Lond. B Biol. Sci. 256:119–124.
Koonin, E. V. 2001. An apology for orthologs—or brave new memes.
Genome Biol. 2: comment 1005.1–1005.2.
Lewontin, R. C. 1978. Adaptation. Sci. Am. 239:213–230.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Maddison, W. P. 1993. Missing data versus missing characters in phylogenetic analysis. Syst. Biol. 42:576–581.
Maslin, T. P. 1952. Morphological criteria of phyletic relationships. Syst.
Zool. 1:49–70.
Mayr, E. 1942. Systematics and the origin of species. Columbia University Press, New York.
Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a
problem in classification. Evolution 11:130–162.
Müller, G. B., and G. P. Wagner. 1991. Novelty in evolution: Restructuring the concept. Annu. Rev. Ecol. Syst. 22:229–256.
Nelson, G. 1994. Homology and systematics. Pages 101–149 in Homology: The hierarchical basis of comparative biology (B. K. Hall, ed.).
Academic Press, San Diego.
Nelson, G., and N. I. Platnick. 1981. Systematics and biogeography: Cladistics and vicariance. Columbia University Press,
New York.
Newman, S. A., and G. B. Müller. 2001. Epigenetic mechanisms of character origination. Pages 559–579 in The character concept in evolutionary biology (G. P. Wagner, ed.). Academic Press, San Diego.
Nixon, K. C., and Q. D. Wheeler. 1990. An amplification of the phylogenetic species concept. Cladistics 6:211–223.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, New
York.
Page, R. D. M. 1994. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol.
43:58–77.
Panchen, A. L. 1994. Richard Owen and the concept of homology. Pages
21–62 in Homology: The hierarchical basis of comparative biology
(B. K. Hall, ed.). Academic Press, San Diego.
Patterson, C. 1982. Morphological characters and homology. Pages 21–
74 in Problems of phylogenetic reconstruction (K. A. Joysey and A. E. Friday, eds.).
Academic Press, London.
Patterson, C. 1988. Homology in classical and molecular biology. Mol.
Biol. Evol. 5:603–625.
Pimentel, R. A., and R. Riggins. 1987. The nature of cladistic data.
Cladistics 3:201–209.
Platnick, N. I. 1978. Classification, historical narratives, and hypotheses. Syst. Zool. 27:365–369.
Platnick, N. I. 1979. Philosophy and the transformation of cladistics.
Syst. Zool. 28:537–546.
Pleijel, F. 1995. On character coding for phylogeny reconstruction.
Cladistics 11:309–315.
Remane, A. 1952. Die Grundlagen des natürlichen Systems der vergleichenden Anatomie und der Phylogenetik. Geest and Portig, Leipzig.
Riedl, R. 1978. Order in living organisms: A systems analysis of evolution. John Wiley, Chichester.
Rudall, P. J., and R. M. Bateman. 2002. Roles of synorganisation, zygomorphy and heterotopy in floral evolution: The gynostemium and
labellum of orchids and other lilioid monocots. Biol. Rev. (Camb.)
77:403–441.
Sanderson, M. J., and J. J. Doyle. 1992. Reconstruction of organismal
phylogenies from multigene families: Paralogy, concerted evolution,
and homoplasy. Syst. Biol. 41:4–17.
Schuh, R. T. 2000. Biological systematics: Principles and applications.
Cornell University Press, Ithaca, New York.
Shubin, N. H., and C. R. Marshall. 2000. Fossils, genes, and the origin of novelty. Pages 324–340 in Deep time: Paleobiology’s perspective (D. H. Erwin and S. L. Wing, eds.). Paleobiology 26
(4, supplement).
Simmons, M. P., C. D. Bailey, and K. C. Nixon. 2000. Phylogeny reconstruction using duplicate genes. Mol. Biol. Evol. 17:469–473.
Simmons, M. P., and J. V. Freudenstein. 2002a. Artifacts of coding
amino acids and other composite characters for phylogenetic analysis. Cladistics 18:354–365.
Simmons, M. P., and J. V. Freudenstein. 2002b. Uninode coding vs
gene tree parsimony for phylogenetic reconstruction using duplicate genes. Mol. Phylogenet. Evol. 23:481–498.
Sneath, P. A., and R. R. Sokal. 1962. Numerical taxonomy. Nature
193:855–860.
Solignac, M., C. Periquet, D. Anxolabehere, and C. Petit. 1995.
Génétique et évolution. Hermann, Paris.
Sonnhammer, E. L. L., and E. V. Koonin. 2002. Orthology, paralogy and
proposed classification for paralog subtypes. Trends Genet. 18:619–
620.
Stadler, B. M. R., P. F. Stadler, G. P. Wagner, and W. Fontana. 2001.
The topology of the possible: Formal spaces underlying patterns of
evolutionary change. J. Theor. Biol. 213:241–274.
Wagner, G. P. 1989a. The biological homology concept. Annu. Rev. Ecol.
Syst. 20:51–69.
Wagner, G. P. 1989b. The origin of morphological characters and the
biological basis of homology. Evolution 43:1157–1171.
Wagner, G. P., and P. F. Stadler. 2003. Quasi-independence, homology
and the unity of type: A topological theory of characters. J. Theor.
Biol. 220:505–527.
Wheeler, W. C. 1996. Optimization alignment: The end of multiple sequence alignment in phylogenetics? Cladistics 12:1–10.
Wheeler, W. C., and D. S. Gladstein. 1994. MALIGN: A multiple sequence alignment program. J. Hered. 85:417–418.
Wheeler, W. C., and R. L. Honeycutt. 1988. Paired sequence difference in
ribosomal RNAs: Evolutionary and phylogenetic implications. Mol.
Biol. Evol. 5:90–96.
Wiley, E. O. 1978. The evolutionary species concept reconsidered. Syst.
Zool. 27:88–92.
Wiley, E. O. 1980. Phylogenetic systematics and vicariance biogeography. Syst. Bot. 5:194–220.
Wiley, E. O. 1981. Phylogenetics. John Wiley and Sons, New York.
Wilkinson, M. 1995. A comparison of two methods of character construction. Cladistics 11:297–308.
Won, H., and S. S. Renner. 2003. Horizontal gene transfer from flowering plants to Gnetum. Proc. Natl. Acad. Sci. USA 100:10824–
10829.
Zhang, J. 2003. Evolution by gene duplication: an update. Trends Ecol.
Evol. 18:292–298.
Zimmer, E. A., S. L. Martin, S. M. Beverley, Y. W. Kan, and A. C. Wilson.
1980. Rapid duplication and loss of genes coding for the α chains of
hemoglobin. Proc. Natl. Acad. Sci. USA 77:2158–2162.
Zouine, M., Q. Sculo, and B. Labedan. 2002. Correct assignment of homology is crucial when genomics meets molecular evolution. Comp.
Funct. Genomics 3:488–493.
First submitted 29 July 2004; reviews returned 2 February 2005;
final acceptance 7 June 2005
Associate Editor: Allan Baker
Syst. Biol. 54(6):973–983, 2005
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500354647
Underparameterized Model of Sequence Evolution Leads to Bias in the Estimation
of Diversification Rates from Molecular Phylogenies
LIAM J. REVELL,1 LUKE J. HARMON,1,2 AND RICHARD E. GLOR1,3
1 Department of Biology, Campus Box 1229, Washington University, St. Louis, Missouri 63130, USA; E-mail: [email protected] (L.J.R.)
2 Current address: Biodiversity Centre, University of British Columbia, 6270 University Blvd., Vancouver, British Columbia V6T 1Z4, Canada; E-mail: [email protected]
3 Current address: Center for Population Biology, University of California, One Shields Avenue, Davis, California 95616, USA; E-mail: [email protected]
Macroevolutionary inferences from molecular phylogenies are becoming increasingly common (see Harvey
et al., 1996; Mooers and Heard, 1997; Pagel, 1999;
Barraclough and Nee, 2001). Many methods in which
phylogenies are invoked for historical inference assume
that a molecular phylogeny is an errorless representation of the underlying phylogenetic history of the included taxa (but see Lutzoni et al., 2001; Huelsenbeck
et al., 2000; Huelsenbeck and Rannala, 2003). However,
molecular phylogenies are estimates of this history based
on a particular model of evolution; thus, there is some
error associated with their estimation (Huelsenbeck and
Kirkpatrick, 1996). Here we explore the effects of a particular type of error in phylogenetic branch-length estimation, that caused by assuming an underparameterized model of molecular evolution, on the γ-statistic of
Pybus and Harvey (2000), a statistic that tests for changes
in the rate of diversification through time. Although we
restrict our attention to the estimation of diversification
rates, our findings are germane to any macroevolutionary inferences relying on the accurate estimation of phylogenetic branch lengths such as molecular dating (e.g.,