2005 POINTS OF VIEW

and use of the incorrect Hastings ratio has a negligible effect on the clade posteriors derived from these data; the largest difference in clade posterior probability between the two runs was only 0.016.

ACKNOWLEDGEMENTS

We thank Marc Suchard, Jeff Thorne, Rod Page, and an anonymous reviewer for many helpful suggestions for improving the manuscript, and we thank Fredrik Ronquist for valuable discussions on this topic. We thank the National Science Foundation for financial support (MTH was supported by award DBI-0306047 and DLS, POL, and MTH were funded by EF 03-31495, part of the CIPRES project). BL acknowledges support from NIH grants R01 GM068950-01 and R01 GM069801-01.

REFERENCES

Drummond, A. J., and A. Rambaut. 2003. Bayesian Evolutionary Analysis Sampling Trees (BEAST), v1.0. Available from http://evolve.zoo.ox.ac.uk/beast/.
Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.
Green, P. J. 2003. Trans-dimensional Markov chain Monte Carlo. Pages 179–198 in Highly structured stochastic systems (P. J. Green, N. L. Hjort, and S. Richardson, Eds.). Oxford University Press, Oxford, UK.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109.
Huelsenbeck, J. P., and F. R. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.
Huelsenbeck, J. P., F. R. Ronquist, R. Nielsen, and J. P. Bollback. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:2310–2314.
Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750–759.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953.
Equation of state calculations by fast computing machines. J. Chem. Phys. 21:1087–1092.
Redelings, B. D., and M. A. Suchard. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54:401–418.
Simon, D., and B. Larget. 2001. Bayesian analysis in molecular biology and evolution (BAMBE), 2.03 beta edition. Department of Mathematics and Computer Science, Duquesne University.
Wilgenbusch, J. C., D. L. Warren, and D. L. Swofford. 2004. AWTY: A system for graphical exploration of MCMC convergence in Bayesian phylogenetic inference, v0.5. Available from http://ceb.csit.fsu.edu/awty.

First submitted 6 December 2004; reviews returned 4 February 2005; final acceptance 17 May 2005
Associate Editor: Jeff Thorne

Syst. Biol. 54(6):965–973, 2005
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500354654

Characters, States, and Homology

JOHN V. FREUDENSTEIN

Department of Evolution, Ecology and Organismal Biology, Ohio State University Herbarium, 1315 Kinnear Road, Columbus, Ohio 43212, USA; E-mail: [email protected]

Characters are the fundamental units used to formalize hypotheses of homology for all phylogenetic analyses, meaning that the decision about how observations are translated into characters is of paramount importance in systematics. Clearly, the importance of characters also extends beyond systematics, being central in evolutionary process studies (cf. Gould and Lewontin, 1979), physiology, and any branch of biology that is concerned with the attributes of organisms. Therefore, it is important that an internally consistent, nonarbitrary, yet flexible way of viewing characters be available that can accommodate any type of organismal aspect.
It is beyond the scope of this contribution to attempt to solve all problems with character delimitation and coding, but one important issue, the distinction between characters and states, remains problematic and might be clarified through review and consideration in the light of current thinking in systematics.

Although the idea of homologous structures among taxa has a long history (cf. Panchen, 1994), the distinction between the terms character and character state was not introduced until the middle of the 20th century. Mayr (1942), for example, used the term character to denote the particular attribute of an organism (e.g., red flowers, backbone, or five petals), not distinguishing between character and state. It was with the rise of numerical approaches to taxonomy that the character/state distinction became common. Maslin (1952) described a “chronocline” that relates a series of characters through time and is equivalent to the current concept of transformation series. Michener and Sokal (1957) distinguished between the character/state usage (which they employed) and the practice of calling all attributes simply characters, but ascribed no conceptual implications to the difference. Cain and Harrison (1958) did not use the term state, but did assign different numerical values to characters. Sneath and Sokal (1962) and Davis and Heywood (1963) made a clear distinction between characters and states. Hennig (1966) and Wiley (1980, 1981) also recognized this conceptual distinction, although they considered character and state to be synonyms (equivalent to state) and, like Maslin (1952), used transformation series for what many now term characters. Farris et al.
(1970) provided clear definitions of character and character state based on Hennig (1966) for the purposes of phylogenetic analysis: A character (“transformation series” of Hennig) is a collection of mutually exclusive states (attributes; features; “characters,” “character states,” or “stages of expression” of Hennig) which a) have a fixed order of evolution such that b) each state is derived directly from just one other state, and c) there is a unique state from which every other state is eventually derived. Pimentel and Riggins (1987) defined a character more simply as “a feature of organisms that can be evaluated as a variable with two or more mutually exclusive and ordered states.” Many authors use characters that correspond essentially to these definitions (except that a priori state ordering is not usually required). However, neither of these definitions circumscribes characters well with respect to other such units because they do not specify where a character “begins and ends.” In fact, in the context of current definitions, Brower (2000) noted that, “it is not possible . . . to know with certainty where one character ends and the next begins.” These definitions, for example, do not preclude various portions of a single transformation series being called different characters and therefore embody a significant degree of arbitrariness. The distinction between the notions of character and state has itself at times been challenged as unnecessary and arbitrary. Bock (1973) stated that “no distinction exists between characters and character states. The latter are simply characteristics which may be homologous with a more restrictive conditional phrase.” Platnick (1978) agreed with this notion and later (1979) explained his position more fully: . . . 
all characters can be seen as modifications (or restrictions) of other characters, and the grouping of character states within a character can be seen as just arbitrarily delimiting clusters of separate characters that are increasingly more restricted in generality (i.e., that form nested sets of increasingly modified versions of other characters).

Further indications of the perceived arbitrary nature of characters can be found. Eldredge and Cracraft (1980) considered the terms character and state to indicate only “relative levels of similarity within a given hierarchy.” Nelson and Platnick (1981) and Patterson (1988) saw no distinction between characters and states. Ghiselin (1984), from a philosophical perspective, suggested abandoning both terms and substituting “feature” for both. Pleijel (1995) defined characters and states to be the columns and cell values, respectively, in data matrices, a pragmatic but conceptually minimalist approach. Many others (e.g., Pimentel and Riggins, 1987; Brower and Schawaroch, 1996; and Hawkins et al., 1997) have argued in favor of the character/state distinction based on its usefulness.

The crucial point explored here is not whether the distinction between characters and states is useful, but whether it is arbitrary. Platnick’s claim of arbitrariness begs a justification for the use of both concepts beyond simple convenience, as does the uncertainty in character circumscription in the definitions of Farris et al. (1970) and Pimentel and Riggins (1987). I argue here that there is a conceptual justification for the distinction and that it is to be found in our notions of the different homology relations that are commonly recognized for genetic features. In short, characters correspond essentially to paralogs and their states to their orthologs, and this distinction should be embraced as a paradigm for all data types.
Furthermore, there are practical coding implications that follow from the way that characters are viewed and these need to be considered when empirical studies are undertaken.

CHARACTERS: ONTOLOGY AND EPISTEMOLOGY

As with any system in which a theoretical framework has real applications, it is important to distinguish the conceptual basis for characters (telling us what a character is) from the practical operation of finding them and to recognize that the resulting empirical units may correspond only imperfectly to the conceptual ideal. This can be due to complexity in the empirical case that obscures mapping to the conceptual framework. The imperfect correspondence of the empirical to the conceptual does not detract from the usefulness of the latter concept, however. A parallel to the character situation exists with species concepts (Frost and Kluge, 1994; Baum and Shaw, 1995), where we conceive of a conceptual unit (such as the Evolutionary Species; Wiley, 1978) as well as methods that allow us to approximate such a unit in practice (e.g., Phylogenetic Species of Nixon and Wheeler, 1990).

HOMOLOGY AND MOLECULAR CHARACTERS

It is helpful to focus first on molecular characters to examine homology relations because the situation is at least superficially more straightforward than with morphology due to the discrete and “simple” nature of the characters. An important first point is that the way that attributes of taxa originate with respect to each other is the key to their homology relation. Fitch (1970) distinguished between orthologous and paralogous proteins, the former representing variants of a protein in different species, whereas the latter are proteins found in a single individual that resulted from a gene duplication event. Solignac et al.
(1995) recognized that paralogy and orthology, as originally defined, do not cover all possible shades of homology relations; they coined the term metalogous to refer to the relationship between paralogs that have been separated by a speciation event (and therefore appear in different taxa). Koonin (2001) also proposed the same term for this situation. Sonnhammer and Koonin (2002) subsequently coined the terms inparalog and outparalog to denote the same distinction. Paralogy is commonly used more broadly than its original definition and will be used here to include Solignac et al.’s metalogous relation.

Gene duplication is well established as a mechanism (perhaps the dominant one) for creating new loci that can diverge and specialize in function (Ohno, 1970; Hughes, 1994; Zhang, 2003). The paralogy relation defines groups of orthologs, because duplicated loci may be free to diverge independently, resulting in mutually exclusive transformation series of orthologs. The importance of correctly distinguishing these patterns has been discussed frequently (e.g., Goodman et al., 1979; Sanderson and Doyle, 1992; Zouine et al., 2002), because mistaking the nature of gene relationships can lead to errors in the reconstruction of taxon trees.

Two other mechanisms for generating new loci are fusion of previously separate elements (domain shuffling) and acquisition of foreign genetic material (lateral transfer). Domain shuffling is a process by which gene segments that code for protein domains are combined to yield new loci (Doolittle, 1995). This process ultimately depends on duplication as well, because the raw materials for new genes are derived from partial copies of other, perhaps still functional, genes. In this sense, the pattern is a subset of paralogy, but essentially results in a “reverse paralogy” event, because instead of yielding new, potentially independently changing DNA segments, two or more segments are combined into a single unit.
Lateral transfer of genetic material among taxa is best known from microbial genomes, but Won and Renner (2003) described a case of transfer of a mitochondrial intron among seed plant lineages, suggesting that increased scrutiny of genomes is likely to reveal additional cases among multicellular organisms. The overall extent of this process in the history of organisms is far from known. When a DNA segment is transferred laterally (i.e., nonhierarchically) between taxa, it yields a relationship to the homologous native segment that was termed xenology by Gray and Fitch (1983). That particular homology relation is of less interest here than the simple fact that foreign DNA can become part of a genome. With either the introduction of foreign DNA or the fusion of segments, a new segment is established that will behave essentially as a paralog when compared to a previously existing similar locus, in the sense that it can accumulate its own orthologs as mutations occur.

Thus, new loci may arise by duplication, fusion, or insertion from a foreign source; these processes generate a new unit that at least potentially has its own distinct fate. The reification of loci as systematic characters ultimately depends only on their individualization from other such segments, rather than the specific process by which they achieved this independence. Individualization as used here is the acquisition of transformational (as opposed to genomic or linkage) independence, meaning that two features can, at least potentially, change independently. This is essentially the same as “quasi-independence,” named by Lewontin (1978) as a property necessary for features to be susceptible to adaptive change and taken up by Stadler et al. (2001) and Wagner and Stadler (2003).
If a particular DNA segment is known not to be independent of another (such as members of a tandem repeat array that undergo concerted evolution), it should not be called a distinct character, because independence is a basic requirement for systematic characters (e.g., Cain and Harrison, 1958; Wheeler and Honeycutt, 1988; Schuh, 2000).

MORPHOLOGICAL CHARACTERS

Patterson (1988) described the parallels between morphological and molecular homology, but did not examine the range of morphological situations that exist, nor did he make the conceptual connection between homology relation and the distinction between characters and states. In fact, he explicitly rejected the use of character states as distinct from characters. If the key to character individualization is transformational independence, and characters come into being from preexisting ones by duplication, fusion, or foreign acquisition, then the homology paradigm for molecular characters should also apply to any other characters that arise through such processes. Müller and Wagner (1991) reviewed the concept of morphological novelty and described a number of ways that apomorphies can arise that correspond to the aforementioned processes, including differentiation of repeated elements, synorganization of elements, apparent de novo origin (“new elements”), and change of shape and context.

The duplication of morphological structures (serial or iterative homology) is equivalent to genetic duplication, and like it leads to a paralogy relationship (Patterson, 1988). Animal segmentation is perhaps the classic case of serial homology and one that is easy to recognize. However, Nelson (1994) argued that many more morphological features can have paralogous relationships.
He cited the example of mammalian hair and mammary glands, which he concluded are related because they are both epidermal derivatives and are connected to each other by a series of other homologs “in a manner perhaps only analogous to gene duplication.” He further concluded that ultimately “all characters are homologous.” The fact that new morphological features become individualized from coexisting structures is sufficient to render their hierarchic pattern comparable to that expected with paralogous genes, because each individualized structure is at least potentially free to change independently of others, producing a group of orthologs (states).

Synorganization of morphological structures is equivalent to fusion of genetic segments as in domain shuffling. A good example is the column in orchids, the result of fusion among styles and stamens (Rudall and Bateman, 2002). Other examples are described in Müller and Wagner (1991). In both morphological and molecular situations, one might expect a range of degree to which the synorganization is accomplished.

It is difficult to imagine that any morphological structures can “come from nowhere”; they must be the result of underlying mutations, shifts in developmental patterns, or other modification of preexisting information. One view describes morphological innovation as likely arising at least in part by the stabilization of epigenetic variation, followed by the “capture” and encoding of this innovation at the molecular level (Newman and Müller, 2001). There are many cases where the origin of a morphological structure is unclear; Müller and Wagner (1991) used the corpus callosum of the brain as an example where no precursor is evident.
Although novel morphological structures clearly are not imported into an organism as foreign genes may be, the resulting pattern can be the same—in both instances, a new transformation series (character) is established, the equivalent of a paralog, with its resulting orthologs (states). Changes in shape and context result only in modifications of previously existing structures and therefore do not produce new transformation series (characters), only new states (orthologs) in currently existing paralogs. It is important to note that the application of the concepts of paralogy and orthology, fusion, and foreign acquisition to phenotypic features such as physical structures does not depend on or imply a direct connection between processes at the genetic level and at the phenotypic level. In other words, paralogous morphological structures need not correspond in a one-to-one way to underlying paralogous genes. The relationship between genotype and morphological structure is complex and poorly understood for most characters (Shubin and Marshall, 2000). Wagner (1989a, 1989b) stated that “It seems implausible that continuity of gene lineages alone could account for the homology of morphological features” and that “the continuity of descent is an epiphenomenon of the continuity of gene lineages.” At the level of individual genes, it is known that homologous loci may control either homologous or nonhomologous structures (Bolker and Raff, 1996). Because duplicated genes only rarely give rise to new proteins (Lynch and Conery, 2000), and only some of these will contribute to structural features, it is not clear that gene duplication can be implicated in a direct way to explain the diversity of morphological features. 
If, as is argued here, all characters can be viewed in the same way with respect to the homology relations that they exhibit, then a broadly applicable nonarbitrary distinction between characters and states exists and the extent of an individual character with respect to others is clear. I propose the following definitions that incorporate this reasoning:

Characters are individualized assemblages of features (states) among taxa that are the result of duplications, fusions, or foreign acquisitions (“novelties”) and whose elements exhibit paralogous or equivalent nonorthologous relationships to other such assemblages.

Character states are mutually exclusive features among taxa of a single paralog-equivalent assemblage that exhibit orthologous relationships to each other.

Note that these definitions leave no room for ambiguity as to the sets of states that comprise a character and do not permit portions of a single transformation series to be called different characters.

OPERATIONAL ASPECTS

Although the concept of a character may be easily stated, the empirical exercise of sorting orthologs into their paralogs can be a difficult one and is an integral part of the process of establishing homology hypotheses. Indeed, there have been empirical methods proposed to deal explicitly with accommodating molecular paralogs in phylogenetic analysis. Goodman et al. (1979) and Page (1994 and subsequent work) developed the method of “reconciled trees” to fit gene trees (or, more broadly, character trees) into their species trees by minimizing (and thus identifying) duplications and other events, whereas Simmons et al. (2000) developed the “uninode” approach for coding taxa that have experienced known duplications. Advocates of each of these methods have argued its merits (Simmons and Freudenstein, 2002b; Cotton and Page, 2003).
De Pinna (1991) wrote that “the decision whether any two or more attributes comprise a single transformation series or two or more independent series is one of the most basic, albeit still confusing issues in systematics.” Patterson (1982, 1988) and de Pinna (1991), among others, described three tests that can be used to distinguish homology relations: similarity, congruence, and conjunction. Similarity refers to perceived degree of sameness among features, and as such is subjective, although criteria exist (e.g., Remane, 1950). Congruence depends on topological relationship on a cladogram, that is, on whether a feature represents a synapomorphy or not. Conjunction asks whether two features are present in the same organism. Whether all of these are viewed equally as tests or whether similarity is viewed only as an initial criterion (de Pinna, 1991; Brower and Schawaroch, 1996) is not important for this discussion. Orthologs pass all three tests as long as states are mutually exclusive, whereas paralogs and their equivalents fail conjunction, because a state from each of two or more characters will be present in the same organism. Hence, conjunction can be used to distinguish features that are states of different characters from those that are states of a single character.

As an example, imagine that there are four epidermal protrusion shapes that can occur: rounded, pointed, linear, and squared. If only one of these types is observed in each organism belonging to the study taxa, one might hypothesize that these are all states of the same character. However, if states co-occur in an organism, then they cannot be states of the same character. If rounded and pointed are observed to co-occur and linear and squared co-occur, then we hypothesize that two characters are involved.
However, without some other type of information, such as topographic correspondence (specifying in sufficient detail on which part of the organism particular protuberances occur), we may still not know the exact character circumscriptions, i.e., whether the two characters are (rounded, linear) and (pointed, squared) or (rounded, squared) and (linear, pointed). Observing co-occurrence of any other pair (such as rounded and linear) will answer that question, however, because the two states that co-occur cannot be part of the same character.

Brower (2000) stated that, “Character state identity, but not topographical identity, is tested by character congruence in cladistic analysis,” but this is not strictly true. Congruence can be applied as a test for all characters but it cannot distinguish between errors due to paralogy and orthology; homoplasy will be detected but its basis will not be clear. That is, any particular instance of homoplasy could be explained either by the character state having been derived independently in two taxa (falsifying only character state identity) or by the state actually belonging to another character (falsifying both topographic identity and character state identity). Although it is true that we usually hold character hypotheses fixed when interpreting the pattern of character state changes that results from a phylogenetic analysis (i.e., we do not change the sequence alignment or shift morphological states among characters) and attribute homoplasy to incorrect character state homology assessment, errors in assignment of states to their characters must remain a possible explanation. That is why instances of homoplasy can prompt us to reconsider homology hypotheses of either type in the reciprocal illumination process.
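The conjunction reasoning in the epidermal protrusion example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the representation of each organism as a set of observed features are assumptions made here, not part of any published method. It lists the two-character circumscriptions that survive the conjunction test and shows how one further observed co-occurrence removes the ambiguity.

```python
from itertools import combinations

def cooccurring_pairs(organisms):
    """Collect every pair of features observed together in a single organism."""
    pairs = set()
    for features in organisms:
        for a, b in combinations(sorted(features), 2):
            pairs.add((a, b))
    return pairs

def passes_conjunction(character, pairs):
    """A candidate character passes if no two of its states co-occur."""
    return not any((a, b) in pairs
                   for a, b in combinations(sorted(character), 2))

def two_character_partitions(features, pairs):
    """List the splits of the features into two characters that both pass."""
    features = sorted(features)
    first, rest = features[0], features[1:]
    results = []
    # Fix the first feature on the left side so each split is listed once.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(features) - left
            if passes_conjunction(left, pairs) and passes_conjunction(right, pairs):
                results.append((sorted(left), sorted(right)))
    return results

# Hypothetical observations matching the example in the text.
shapes = {"rounded", "pointed", "linear", "squared"}
observed = [{"rounded", "pointed"}, {"linear", "squared"}]

ambiguous = two_character_partitions(shapes, cooccurring_pairs(observed))
resolved = two_character_partitions(
    shapes, cooccurring_pairs(observed + [{"rounded", "linear"}]))
```

With the first two observations, `ambiguous` holds both candidate circumscriptions; after rounded and linear are also seen together, `resolved` holds only (linear, pointed) and (rounded, squared). Topographic correspondence, the other source of information mentioned above, is not modeled here.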
One approach in which hypotheses of both character and state truly are evaluated simultaneously is provided by the DNA alignment procedures developed by Wheeler (Wheeler and Gladstein, 1994; Wheeler, 1996), in which alternative alignments are evaluated by parsimony (MALIGN) or sequence transformations are created directly on the trees they imply (POY). No attempt is made to distinguish hypotheses of the two types, however.

INDIVIDUALIZATION AND CODING LEVELS

Having discussed what characters are conceptually and how to distinguish them operationally, I now turn to practical aspects of their coding. Because not every distinguishable feature need be individualized with respect to transformational independence, not every one should be called a character. Particular hairs on a mammal would not be considered individualized units, for example, unless each truly could transform independently of others. However, “hair” as a whole might be individualized. Whether any particular individualized morphological structure can in fact transform independently of another is an empirical question, and one that exists for genes as well (cf. concerted evolution; Zimmer et al., 1980).

One might question whether individualization is a phenomenon that exists apart from our ability to recognize it, or whether it is simply a matter of our perception. I argue that it is real, to the extent that there is some minimal feature that is potentially free to change apart from other such units. Ultimately, this feature is the nucleotide. This does not preclude individualization also among phenotypic features based on the nucleotide, however. Such phenotypic features would include amino acids, secondary chemical compounds, morphological structures, etc. This diversity of levels exists because different processes operate to cause change at different organizational levels and there is a resulting hierarchy of parts within an organism as well as among taxa (Riedl, 1978).
Discovering the individualized unit at any particular level is an empirical exercise—one that is attempted routinely in the delineation of characters. Wagner (1989b), for example, argued for the importance of developmental constraints in identifying minimal morphological units of change. The level of characters used in a phylogenetic analysis is a matter of choice, because in many cases either reductive (minimal) characters or composite characters that are based on them could be utilized (Wilkinson, 1995; Simmons and Freudenstein, 2002a). This difference in level should not be confused with the distinction between characters and states and is the reason that the term “minimal level” does not appear in the character and state definitions presented here—because it would force the definition to the level of the nucleotide. Perhaps in that sense there remains arbitrariness in these definitions, but if so it is a result of the multiple levels of hierarchy that exist in and among living organisms. Some degree of flexibility in these definitions is required in order to capture the maximum amount of information available. Thus, for genetic data, the presence/absence of a particular locus may be scored, as may variants of each locus. The loci themselves are paralogs or their equivalents and are characters, whereas the variants of each are orthologs, or character states. Alternatively, individual base positions in an alignment may be scored. Base positions have a paralogous relationship to each other to the extent that increase in gene size is the result of duplications (e.g., slipped-strand mispairing) or insertions. Hence, base positions are also characters (to the extent that they are independent) and base identities at each position are states. The same is true for morphology, where we might have the choice of coding the face, the nose, or the nostril as the character. 
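The two coding levels described for genetic data can be sketched as follows. This is a minimal illustration under stated assumptions: the taxon names, the sequences, and the use of "-" as an inapplicable score are hypothetical and chosen only to show how the same observations yield different matrix columns at the locus level and at the base-position level.

```python
# Hypothetical aligned sequences for one locus; None marks a taxon
# in which the locus is absent.
ALIGNMENT = {
    "taxon_A": "ACGT",
    "taxon_B": "ACGA",
    "taxon_C": "ACGA",
    "taxon_D": None,
}

def locus_level(alignment):
    """Score the locus as a character: presence/absence plus its variant."""
    matrix = {}
    for taxon, seq in alignment.items():
        present = "1" if seq is not None else "0"
        variant = seq if seq is not None else "-"  # "-" = inapplicable (assumed)
        matrix[taxon] = (present, variant)
    return matrix

def position_level(alignment):
    """Score each aligned base position as a character; bases are the states."""
    length = max(len(seq) for seq in alignment.values() if seq is not None)
    return {
        taxon: tuple(seq) if seq is not None else ("-",) * length
        for taxon, seq in alignment.items()
    }

locus_matrix = locus_level(ALIGNMENT)
position_matrix = position_level(ALIGNMENT)
```

At the locus level each taxon receives two scores (presence and variant), whereas at the position level the same locus contributes one column per aligned site. Whether the positions are in fact independent, as the text requires of characters, is an empirical matter that this sketch does not address.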
Whether both genomic and phenotypic characters based on them should be coded in the same analysis is a separate question (Agosti et al., 1996; Freudenstein et al., 2003).

HIERARCHIC CHARACTER PATTERNS AND IMPLICATIONS FOR CODING: EXAMPLES

Thus far we have a character concept that can be applied at different levels within an organism. Once characters and their states have been recognized, it is necessary to code them for analysis, which means fitting these constructs into a form that is interpretable by current phylogenetic analysis software, using what are termed hereafter “matrix characters.” This is another level at which some ambiguity occurs, because at least in some cases there is no single optimal way to represent the homology statements for the software (see examples below). The recognition of characters as paralogs or equivalents distinct from states as orthologs does not preclude flexibility in coding the matrix characters; the number of paralogs need not equal the number of character columns, for example. Even though the correspondence between at least some homology hypotheses and matrix characters is imperfect, the conceptual view of a character does have certain logical implications for the matrix character(s) used to represent it.

Platnick (1979) imagined “a great chain of characters (or synapomorphies, or homologies) stretching from those of complete generality, which are true for all life, on to those true for only a single species.” The logical approach to coding such variation in a matrix would be as a series of binary characters, as suggested by Platnick (1979) and Pleijel (1995). Coding in this presence-absence way could easily be done for simple characters that correspond to paralogs as described here, because they should all be observable in each organism.
However, if there was variation within any of the paralogs, those variants would be orthologs (creating a transformation series), and coding them as binary characters would pose a problem. The problem is that because only a single state for each character is observable in an organism, only it can be coded as present, even though that state is part of the larger transformation series, any antecedent states of which are also truly present (in modified form). Platnick (1979) criticized unordered multistate coding because it does not preserve this hierarchic information among states, but unless one is willing to make assumptions about ancestral states, neither does binary coding. At least coding orthologs as states of a nonadditive multistate character eliminates some of the problems and biases encountered when attempting to code them as binary characters (Pimentel and Riggins, 1987; Hawkins et al., 1997; Hawkins, 2000). Those problems stem from the fact that transformation from the presence state (“1”) of one binary character to the presence state of another always requires two steps (loss of one and gain of the other) and thus biases against a direct transformation between those states, making independent gain of two features cost the same as transformation between them. It is useful to consider some examples of characters and states to see how these concepts relate to matrix coding. Eldredge and Cracraft (1980:30) provided an example to illustrate the arbitrariness of characters and states: . . . the character “feathers” is a common similarity of all birds, although specific character-states of the character “feathers” (e.g., variation in color texture, and pattern) would be similarities common to various groups of birds. At the same time, it is apparent that even the character “feathers” could be considered a character-state, say within the vertebrates, if the systematist were considering the “character” to be the vertebrate integument. . . 
An illustration of this situation is provided in Figure 1a, where symbols are used to denote features. In this diagram, the triangle represents a precursor indumentum type (such as the chondrichthyan denticle) and the square a scale, believed to be transformed at some point among tetrapods into a feather, designated by a circle. Among feathers there are three distinct unspecified types, designated by ticks on the circle.

FIGURE 1. A cladogram for five taxa (a) and corresponding data matrix (b), which shows three ways of encoding the variation.

Because birds have both scales and feathers (i.e., only some of their scales are transformed into feathers), there is the potential that they represent two characters (because having both would appear to mean that they fail the conjunction test). However, if it is just a proportion of the original scales that are transformed into feathers, then it becomes a question of whether the feathers and remaining scales are transforming independently. If they are, then two characters would be warranted, because two sets of orthologs have been established. If not (as assumed for the purpose of this example), then the variation comprises a single set of orthologs, so a single nonadditive multistate character could be used (Fig. 1b). However, this coding allows transformation between any two states in one step, which may be undesirable if one wishes to emphasize the homology of feathers. Another well-known option for this situation is to use two matrix characters, one coding for each different basic type of indumentum (denticle, scale, feather), and a second to code for the different types of feathers (and coding taxa with scales as inapplicable for feather type). Although this approach uses two columns in the matrix, it yields the same minimum number of apomorphies (four) as the multistate option. It does incur the problems of inapplicable states, however (Maddison, 1993).
A third option would be to use a binary character for each feature, but although this approach also uses four apomorphies, it requires two steps to go between any two states, making independent derivation of states cost the same as transformation among them.

Homology relationships among features can be more complex than this simple example when paralogy is also involved. Doyle and Davis (1998:106) illustrated a situation where a gene duplication occurs, resulting in a more complicated pattern. Their diagram is adapted here in Figure 2a to illustrate the different situations that can arise, using symbols to represent characters and states (which could be either molecular or morphological), in a character phylogeny that is equivalent to a gene tree. The symbols indicate similar units: those that have circles, regardless of shading, would all be recognized as “the same thing,” whereas shading differences indicate more minor variants. Three duplication events are depicted. Under the paradigm described here, a new character is created at each duplication event. Therefore, four characters (paralogs) are depicted in Figure 2a. The corresponding taxon tree is shown in Figure 2b.

FIGURE 2. A character (gene) tree for five taxa (a), its corresponding taxon tree (b), and two possible matrices for the data (c, d). The diamonds in (a) represent duplication events, symbols at tick marks represent transformations, and symbols at terminals are the states for those taxa. “D” in the matrix indicates a duplication matrix character. The multistate matrix character designated by (N) is an alternative way of coding the duplications.

Doyle and Davis (1998) made the point that although the loci in taxa whose divergence followed the duplication are clearly paralogous to one another, each paralog must be viewed as orthologous to the feature in the taxon preceding the duplication.
In terms of hierarchic relationship only, each of the two paralogs has exactly the same relationship to the sister of the pair, regardless of function. However, it may well be that one paralog retains the function/morphology of the ancestral locus, whereas the other acquires a new function/morphology. Fitch (2000) suggested the term isortholog to designate the member of the duplicated pair that retains the ancestral function. This relation can be seen in Figure 2a, where the character “square” persists through the first duplication event, at which point a new character (“circle”) is created by the duplication. The square in taxon A is coded as an ortholog to the squares in taxa B to E, and it is also an ortholog to the circle character. This means that a particular feature can be a state in each of two characters, which is not problematic because it will always represent the plesiomorphic state in each of them. Is a new character created even when two paralogs are indistinguishable? Yes; this is illustrated in Figure 2a at duplication 3. There are no differences between the paralogs that can be coded as character states. However, the duplication itself (the origination of the new character), as evidenced by the presence of a pair of identical structures, can be coded. This is similar to the “uninode” approach of Simmons et al. (2000) for analyzing paralogous loci simultaneously with those of preexisting unduplicated taxa, in which they coded a separate character in the species tree analysis to mark a duplication. Hence, two types of information potentially can be coded in a data set—the number of characters that are present in the taxa and the variation that occurs within each. 
When the number of characters present among taxa is uniform (i.e., with no taxa having “absence” states), such as with an invariant length DNA sequence or a morphological dataset in which all characters are present in all taxa, then one matrix character per paralog is sufficient to code the orthologs. If the number of characters present among taxa varies, this information can be coded in different ways. One option is to code a presence/absence matrix character for each paralog (representing duplication/other origination events) and a separate (perhaps multistate) matrix character to score orthologs for each paralog (Fig. 2c and as in Fig. 1 discussed above). In this case, the state for a paralog that is scored for taxa that preceded the duplication would be “inapplicable.” No matrix character was coded for the “keystone” character in the example in Figure 2 because there was no ortholog variation for that character. An equivalent way of accounting for the character number (paralog) variation would be to score a single character to enumerate the paralogs—essentially a transformation of characters rather than states (Fig. 2d). An example of this would be coding the number of vertebrae for a group. This is a meristic character, which only enumerates the iterated units but does not code any variation they might have. Additional matrix characters would be needed to code variation for individual vertebrae. Another option is to code both the presence/absence information for a character and its ortholog variation in a single matrix character (coding “absence” as a state; Fig. 2e). In this case a matrix character needs to be used specifically to score duplication 3, which was not accompanied by any ortholog changes. The number of state changes is five for all of these options. 
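The three coding options just described can be made concrete with a small sketch. The taxa, paralog names, and ortholog names below are hypothetical and are not the data of Figure 2; the function names and the "-" symbol for inapplicable scores are likewise assumptions made only for illustration. The sketch takes, for each taxon, the ortholog observed for each paralog (or nothing, if the paralog is absent) and writes out the matrix rows under each option. Unlike the treatment of the invariant character in Figure 2, it emits a state column even for a paralog with no ortholog variation.

```python
INAPPLICABLE = "-"  # assumed symbol for an inapplicable score

# Hypothetical observations (not the data of Figure 2): for each taxon, the
# ortholog (state) observed for each paralog (character), or None when the
# paralog is absent from that taxon.
OBSERVATIONS = {
    "A": {"square": "plain", "circle": None},
    "B": {"square": "plain", "circle": "open"},
    "C": {"square": "plain", "circle": "shaded"},
}
PARALOGS = ["square", "circle"]


def state_codes(paralog):
    """Number the distinct orthologs of one paralog in order of first appearance."""
    codes = {}
    for obs in OBSERVATIONS.values():
        state = obs[paralog]
        if state is not None and state not in codes:
            codes[state] = len(codes)
    return codes


def presence_plus_states():
    """Option 1: a presence/absence column per paralog, then a state column
    that is scored as inapplicable where the paralog is absent."""
    matrix = {}
    for taxon, obs in OBSERVATIONS.items():
        row = []
        for p in PARALOGS:
            present = obs[p] is not None
            row.append("1" if present else "0")
            row.append(str(state_codes(p)[obs[p]]) if present else INAPPLICABLE)
        matrix[taxon] = "".join(row)
    return matrix


def paralog_count():
    """Option 2: a single meristic column that only enumerates the paralogs."""
    return {taxon: str(sum(state is not None for state in obs.values()))
            for taxon, obs in OBSERVATIONS.items()}


def absence_as_state():
    """Option 3: one column per paralog, with absence coded as state 0 and
    the orthologs numbered from 1."""
    matrix = {}
    for taxon, obs in OBSERVATIONS.items():
        row = []
        for p in PARALOGS:
            row.append("0" if obs[p] is None else str(state_codes(p)[obs[p]] + 1))
        matrix[taxon] = "".join(row)
    return matrix


if __name__ == "__main__":
    print(presence_plus_states())  # {'A': '100-', 'B': '1010', 'C': '1011'}
    print(paralog_count())         # {'A': '1', 'B': '2', 'C': '2'}
    print(absence_as_state())      # {'A': '10', 'B': '11', 'C': '12'}
```

In this toy case the count column of option 2 records only how many paralogs a taxon has, so the difference between the two circle variants in taxa B and C would still need its own matrix character, as noted above for vertebrae.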
The important element that all of these coding options have in common is the recognition that character originations and state transformations are distinct and that both can be recorded and used as synapomorphies. This is consistent with the idea that taxon phylogenies are built from character phylogenies and that orthology and paralogy-type relations should be distinguished and used in the proper context.

In summary, there is a nonarbitrary basis for the distinction between characters and states at any level, and it is the difference between paralogs and orthologs, which can be recognized empirically by application of the conjunction criterion. Use of a character definition based on the paralogy/orthology difference results in circumscription of characters as whole transformation series. This definition does not preclude flexibility in the matrix coding of characters, however.

ACKNOWLEDGEMENTS

The author thanks Allan Baker, Marymegan Daly, Jerrold Davis, Roderic Page, Kurt Pickett, Daniel Potter, Chris Randle, Mark Simmons, Peter Stevens, Günter Wagner, John Wenzel, and Mark Wilkinson for discussion of these ideas and comments on the manuscript.

REFERENCES

Agosti, D., D. Jacobs, and R. DeSalle. 1996. On combining protein sequences and nucleic acid sequences in phylogenetic analysis: The homeobox protein case. Cladistics 12:65–82.
Baum, D. A., and K. L. Shaw. 1995. Genealogical perspectives on the species problem. Pages 289–303 in Experimental and molecular approaches to plant biosystematics (P. C. Hoch and A. G. Stephenson, eds.). Missouri Botanical Garden, St. Louis, Missouri.
Bock, W. J. 1973. Philosophical foundations of classical evolutionary classification. Syst. Zool. 22:375–392.
Bolker, J. A., and R. A. Raff. 1996. Developmental genetics and traditional homology. BioEssays 18:489–494.
Brower, A. V. Z. 2000. Homology and the inference of systematic relationships: Some historical and philosophical perspectives.
Pages 10–21 in Homology and systematics: Coding characters for phylogenetic analysis (R. Scotland and R. T. Pennington, eds.). Taylor and Francis, London.
Brower, A. V. Z., and V. Schawaroch. 1996. Three steps of homology assessment. Cladistics 12:265–272.
Cain, A. J., and G. A. Harrison. 1958. An analysis of the taxonomist’s judgment of affinity. Proc. Zool. Soc. Lond. 131:85–98.
Cotton, J. A., and R. D. M. Page. 2003. Gene tree parsimony vs. uninode coding for phylogenetic reconstruction. Mol. Phylogenet. Evol. 29:298–308.
Davis, P. H., and V. H. Heywood. 1963. Principles of angiosperm taxonomy. Van Nostrand, Princeton, New Jersey.
de Pinna, M. C. C. 1991. Concepts and tests of homology in the cladistic paradigm. Cladistics 7:367–394.
Doolittle, R. F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287–314.
Doyle, J. J., and J. I. Davis. 1998. Homology in molecular phylogenetics: a parsimony perspective. Pages 101–131 in Molecular systematics of plants II: DNA sequencing (D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds.). Kluwer, Boston.
Eldredge, N., and J. Cracraft. 1980. Phylogenetic patterns and the evolutionary process. Columbia University Press, New York.
Farris, J. S., A. G. Kluge, and M. J. Eckardt. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19:172–191.
Fitch, W. M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113.
Fitch, W. M. 2000. Homology: a personal view on some of the problems. Trends Genet. 16:227–231.
Freudenstein, J. V., K. M. Pickett, M. P. Simmons, and J. W. Wenzel. 2003. From basepairs to birdsongs: phylogenetic data in the age of genomics. Cladistics 19:333–347.
Frost, D. R., and A. G. Kluge. 1994. A consideration of epistemology in systematic biology, with special reference to species. Cladistics 10:259–294.
Ghiselin, M. T. 1984. “Definition,” “character,” and other equivocal terms. Syst. Zool. 33:104–110.
Goodman, M., J. Czelusniak, G. W. Moore, A. E.
Romero-Herrera, and G. Matsuda. 1979. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 28:132–163.
Gould, S. J., and R. C. Lewontin. 1979. The spandrels of San Marco and the Panglossian paradigm. Proc. R. Soc. Lond. B Biol. Sci. 205:581–598.
Gray, G. S., and W. M. Fitch. 1983. Evolution of antibiotic resistance genes: The DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Mol. Biol. Evol. 1:57–66.
Hawkins, J. A. 2000. A survey of primary homology assessment: different botanists perceive and define characters in different ways. Pages 22–53 in Homology and systematics: Coding characters for phylogenetic analysis (R. Scotland and R. T. Pennington, eds.). Taylor and Francis, London.
Hawkins, J. A., C. E. Hughes, and R. W. Scotland. 1997. Primary homology assessment, characters and character states. Cladistics 13:275–283.
Hennig, W. 1966. Phylogenetic systematics. University of Illinois Press, Urbana.
Hughes, A. L. 1994. The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. B Biol. Sci. 256:119–124.
Koonin, E. V. 2001. An apology for orthologs—or brave new memes. Genome Biol. 2: comment 1005.1–1005.2.
Lewontin, R. C. 1978. Adaptation. Sci. Am. 239:213–230.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Maddison, W. P. 1993. Missing data versus missing characters in phylogenetic analysis. Syst. Biol. 42:576–581.
Maslin, T. P. 1952. Morphological criteria of phyletic relationships. Syst. Zool. 1:49–70.
Mayr, E. 1942. Systematics and the origin of species. Columbia University Press, New York.
Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a problem in classification. Evolution 11:130–162.
Müller, G. B., and G. P. Wagner. 1991. Novelty in evolution: Restructuring the concept. Annu. Rev. Ecol. Syst. 22:229–256.
Nelson, G. 1994.
Homology and systematics. Pages 101–149 in Homology: The hierarchical basis of comparative biology (B. K. Hall, ed.). Academic Press, San Diego.
Nelson, G., and N. I. Platnick. 1981. Systematics and biogeography: Cladistics and vicariance. Columbia University Press, New York.
Newman, S. A., and G. B. Müller. 2001. Epigenetic mechanisms of character origination. Pages 559–579 in The character concept in evolutionary biology (G. P. Wagner, ed.). Academic Press, San Diego.
Nixon, K. C., and Q. D. Wheeler. 1990. An amplification of the phylogenetic species concept. Cladistics 6:211–223.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, New York.
Page, R. D. M. 1994. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol. 43:58–77.
Panchen, A. L. 1994. Richard Owen and the concept of homology. Pages 21–62 in Homology: The hierarchical basis of comparative biology (B. K. Hall, ed.). Academic Press, San Diego.
Patterson, C. 1982. Morphological characters and homology. Pages 21–74 in Problems of phylogenetic reconstruction (G. P. Wagner, ed.). Academic Press, London.
Patterson, C. 1988. Homology in classical and molecular biology. Mol. Biol. Evol. 5:603–625.
Pimentel, R. A., and R. Riggins. 1987. The nature of cladistic data. Cladistics 3:201–209.
Platnick, N. I. 1978. Classification, historical narratives, and hypotheses. Syst. Zool. 27:365–369.
Platnick, N. I. 1979. Philosophy and the transformation of cladistics. Syst. Zool. 28:537–546.
Pleijel, F. 1995. On character coding for phylogeny reconstruction. Cladistics 11:309–315.
Remane, A. 1952. Die Grundlagen des natürlichen Systems der vergleichenden Anatomie und der Phylogenetik. Geest and Portig, Leipzig.
Riedl, R. 1978. Order in living organisms: A systems analysis of evolution. John Wiley, Chichester.
Rudall, P. J., and R. M. Bateman. 2002.
Roles of synorganisation, zygomorphy and heterotopy in floral evolution: The gynostemium and labellum of orchids and other lilioid monocots. Biol. Rev. (Camb.) 77:403–441.
Sanderson, M. J., and J. J. Doyle. 1992. Reconstruction of organismal phylogenies from multigene families: Paralogy, concerted evolution, and homoplasy. Syst. Biol. 41:4–17.
Schuh, R. T. 2000. Biological systematics: Principles and applications. Cornell University Press, Ithaca, New York.
Shubin, N. H., and C. R. Marshall. 2000. Fossils, genes, and the origin of novelty. Pages 324–340 in Deep time: Paleobiology’s perspective (D. H. Erwin and S. L. Wing, eds.). Paleobiology 26 (4, supplement).
Simmons, M. P., C. D. Bailey, and K. C. Nixon. 2000. Phylogeny reconstruction using duplicate genes. Mol. Biol. Evol. 17:469–473.
Simmons, M. P., and J. V. Freudenstein. 2002a. Artifacts of coding amino acids and other composite characters for phylogenetic analysis. Cladistics 18:354–365.
Simmons, M. P., and J. V. Freudenstein. 2002b. Uninode coding vs gene tree parsimony for phylogenetic reconstruction using duplicate genes. Mol. Phylogenet. Evol. 23:481–498.
Sneath, P. A., and R. R. Sokal. 1962. Numerical taxonomy. Nature 193:855–860.
Solignac, M., C. Periquet, D. Anxolabehere, and C. Petit. 1995. Génétique et evolution. Hermann, Paris.
Sonnhammer, E. L. L., and E. V. Koonin. 2002. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18:619–620.
Stadler, B. M. R., P. F. Stadler, G. P. Wagner, and W. Fontana. 2001. The topology of the possible: Formal spaces underlying patterns of evolutionary change. J. Theor. Biol. 213:241–274.
Wagner, G. P. 1989a. The biological homology concept. Annu. Rev. Ecol. Syst. 20:51–69.
Wagner, G. P. 1989b. The origin of morphological characters and the biological basis of homology. Evolution 43:1157–1171.
Wagner, G. P., and P. F. Stadler. 2003. Quasi-independence, homology and the unity of type: A topological theory of characters. J. Theor.
Biol. 220:505–527.
Wheeler, W. C. 1996. Optimization alignment: The end of multiple sequence alignment in phylogenetics? Cladistics 12:1–10.
Wheeler, W. C., and D. S. Gladstein. 1994. MALIGN: A multiple sequence alignment program. J. Hered. 85:417–418.
Wheeler, W. C., and R. L. Honeycutt. 1988. Paired sequence difference in ribosomal RNAs: Evolutionary and phylogenetic implications. Mol. Biol. Evol. 5:90–96.
Wiley, E. O. 1978. The evolutionary species concept reconsidered. Syst. Zool. 27:88–92.
Wiley, E. O. 1980. Phylogenetic systematics and vicariance biogeography. Syst. Bot. 5:194–220.
Wiley, E. O. 1981. Phylogenetics. John Wiley and Sons, New York.
Wilkinson, M. 1995. A comparison of two methods of character construction. Cladistics 11:297–308.
Won, H., and S. S. Renner. 2003. Horizontal gene transfer from flowering plants to Gnetum. Proc. Natl. Acad. Sci. USA 100:10824–10829.
Zhang, J. 2003. Evolution by gene duplication: an update. Trends Ecol. Evol. 18:292–298.
Zimmer, E. A., S. L. Martin, S. M. Beverley, Y. W. Kan, and A. C. Wilson. 1980. Rapid duplication and loss of genes coding for the α chains of hemoglobin. Proc. Natl. Acad. Sci. USA 77:2158–2162.
Zouine, M., Q. Sculo, and B. Labedan. 2002. Correct assignment of homology is crucial when genomics meets molecular evolution. Comp. Funct. Genomics 3:488–493.

First submitted 29 July 2004; reviews returned 2 February 2005; final acceptance 7 June 2005
Associate Editor: Allan Baker

Syst. Biol. 54(6):973–983, 2005
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150500354647

Underparameterized Model of Sequence Evolution Leads to Bias in the Estimation of Diversification Rates from Molecular Phylogenies

LIAM J. REVELL,1 LUKE J. HARMON,1,2 AND RICHARD E. GLOR1,3

1 Department of Biology, Campus Box 1229, Washington University, St. Louis, Missouri 63130, USA; E-mail: [email protected] (L.J.R.)
2
Current address: Biodiversity Centre, University of British Columbia, 6270 University Blvd., Vancouver, British Columbia V6T 1Z4, Canada; E-mail: [email protected]
3 Current address: Center for Population Biology, University of California, One Shields Avenue, Davis, California 95616, USA; E-mail: [email protected]

Macroevolutionary inferences from molecular phylogenies are becoming increasingly common (see Harvey et al., 1996; Mooers and Heard, 1997; Pagel, 1999; Barraclough and Nee, 2001). Many methods in which phylogenies are invoked for historical inference assume that a molecular phylogeny is an errorless representation of the underlying phylogenetic history of the included taxa (but see Lutzoni et al., 2001; Huelsenbeck et al., 2000; Huelsenbeck and Rannala, 2003). However, molecular phylogenies are estimates of this history based on a particular model of evolution; thus, there is some error associated with their estimation (Huelsenbeck and Kirkpatrick, 1996). Here we explore the effects of a particular type of error in phylogenetic branch-length estimation, that caused by assuming an underparameterized model of molecular evolution, on the γ-statistic of Pybus and Harvey (2000), a statistic that tests for changes in the rate of diversification through time. Although we restrict our attention to the estimation of diversification rates, our findings are germane to any macroevolutionary inferences relying on the accurate estimation of phylogenetic branch lengths such as molecular dating (e.g.,