Contribution of Homoplasy and of Ancestral Polymorphism to the Evolution of Genes in Anthropoid Primates

Colm O’hUigin,*1 Yoko Satta,† Naoyuki Takahata,† and Jan Klein*

*Max-Planck-Institut für Biologie, Abteilung Immungenetik, Corrensstrasse 42, Tübingen, Germany; and †Department of Biosystems Science, The Graduate University for Advanced Studies, Hayama, Kanagawa, Japan

Molecular phylogenies of lineages that split from one another in short succession are often difficult to resolve because different loci and different sites within the same locus yield incongruent relationships. The incongruity is commonly attributed to two causes: differential assortment of ancestral polymorphisms and homoplasy. To assess the relative contribution of these two causes, sequences of 57 segments from 51 loci in six primate lineages (human, chimpanzee, gorilla, orangutan, macaque, and tamarin, abbreviated as H, C, G, O, M, and T, respectively) were subjected to “partitioning” analysis, in which phylogenetically informative sites were identified in all 15 pairwise comparisons of each of the 57 segments and tallied for their support or lack thereof for each of the theoretically possible phylogenies. The six lineages include one of the best known cases of a difficult-to-resolve phylogeny: the trichotomy (H, C, G), in which the three lineages may have diverged from each other within a short period of time. In this period many of the ancestral polymorphisms apparently persisted and yielded phylogenetically incongruent signals. By contrast, no ancestral polymorphism is expected to have survived during the interval separating the divergences of the O, M, and T lineages from the ancestor of the (H, C, G) group. Any phylogenetic incompatibilities at sites in the O, M, and T lineages relative to the (H, C, G) group are therefore presumably the result of homoplasy.
The frequency of homoplasy estimated in this manner is unexpectedly high: 12% for the (H, C, G) clade and 19% for the (H, C, G, O) clade. At least three-quarters of the 48% incompatibility observed in the (H, C) clade is attributable to the sorting out of ancestral polymorphisms coupled with intragenic recombination. Possible reasons for this high level of homoplasy in the O, M, and T lineages are discussed, and a computer simulation has been carried out to produce a model explaining the observed data.

1 Present address: Department of Genetics, Trinity College, Dublin, Ireland.

Abbreviations: C, chimpanzee; G, gorilla; H, human; M, macaque; NCBI, National Center for Biotechnology Information; NWM, New World monkeys; O, orangutan; OTU, operational taxonomic unit; OWM, Old World monkeys; PCR, polymerase chain reaction; T, tamarin.

Key words: homoplasy, parallelism, convergent evolution, trichotomy, primates, polymorphism.

Address for correspondence and reprints: Colm O’hUigin, Max-Planck-Institut für Biologie, Abteilung Immungenetik, Corrensstrasse 42, D-72076 Tübingen, Germany. E-mail: [email protected].

Mol. Biol. Evol. 19(9):1501–1513. 2002 © 2002 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Introduction

Attempts to determine phylogenetic relationships among closely allied taxa often yield discordant results depending on the gene or even part of the gene used in the reconstruction. As a consequence, phylogenies of many groups of taxa remain unresolved. Perhaps the best known example of a prolonged controversy now tentatively resolved by a consensus based on sequences of a large set of genes is the one involving the human species, specifically the question of its closest living relative (Miyamoto et al. 1988; Bailey 1993; Rogers 1993; Ruvolo 1997; Satta, Klein, and Takahata 2000). After elimination of the orangutan (O), favored by some anthropologists and paleontologists (Schwartz 1987), the issue came to be referred to as “the trichotomy problem,” the question of the relationship among three species, human (H), chimpanzee (C), and gorilla (G). The consensus approach identifies the chimpanzee as the nearest living relative of humans, but the evidence supporting this conclusion is not overwhelming. In the most recent and largest study encompassing 45 loci and 47 kb of sequence (Satta, Klein, and Takahata 2000), the consistency of the inferred relationship is rather poor. Of the 174 sites that are informative regarding the relationship between H, C, G, and O, only 91 (52%) support the (H, C) clade. Almost half the sites support alternative phylogenetic relationships between the species, either the (H, G) or the (C, G) clade. Inconsistency in the inferred patterns of shared-derived substitutions (incompatibility) is apparent both between and within the loci of the three species comprising the trichotomy.

The two main causes commonly invoked to explain why different portions of the genome provide different answers regarding the phylogenetic relationships within a group of taxa are assortment of ancestral polymorphism and homoplasy. In the former case, an ancestral population of species H, C, and G may contain two alleles, a and b, at locus 1 and two other alleles, x and y, at locus 2. If, for example, at locus 1 the a allele is subsequently fixed in species G, whereas allele b is fixed in species C and H, C will be judged as the closest relative of H by the analysis of this locus. If, on the other hand, allele x at locus 2 is fixed in species C, whereas the y allele is fixed in species G and H, G rather than C will appear to be the closest relative of H. Similarly, at the nucleotide level, two sites within a single gene may yield contradictory phylogenetic information if recombination takes place between them and their polymorphism is differentially resolved among the species.
Table 1
List of 57 Segments of 51 Loci Examined in the Present Study

| ABBREVIATION | GENE NAME | LOCATION | SIZE | ORIGIN OF SEQUENCE: Human | Chimpanzee | Gorilla | Orangutan | OWM | NWM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZFX | Zinc finger protein, X-linked | Xp22.2-p21.3 | 397 | X59738 | AB041907 | AB041909 | AB041911 | AB041917 | AB041921 |
| ZNFN1A1 | Zinc finger protein, subfamily 1A, member 1 | 7p12 | 554 | U40462 | AY091915 | AY091916 | AY091917 | AY091918 | AY091919 |
| CXCR4 | Chemokine (C-X-C) receptor 4 | 2q21 | 1,003 | AF025375 | U89798 | AF172232 | AF172231 | AF172224 | AF178084 |
| APOB | Apolipoprotein B: segment 1 | 2p24 | 175 | DS41109 | DS41109 | DS41109 | DS41109 | DS41109 | DS41109 |
| APOB | Apolipoprotein B100: segment 2 | 2p24 | 588 | M14162 | AY091920 | AY091921 | AY091922 | AY091923 | AY091924 |
| ACAT2 | TCP1-ACAT2 overlap | 6q25.3-q26 | 517 | NT007122 | AY091925 | AY091926 | AY091927 | AY091928 | AY091929 |
| VHL | Von Hippel-Lindau tumor supressor gene | 3p26-p25 | 432 | AF010238 | (1) | (1) | (1) | (1) | (1) |
| NGFB | Nerve growth factor, beta subunit | 1p13.1 | 630 | X52599 | AY091930 | AY091931 | AY091932 | AY091933 | AY091934 |
| SCG2 | Secretogranin II | 2q35-q36 | 1,011 | M25756 | AY091935 | AY091936 | AY091937 | AY091938 | AY091939 |
| ADRB2 | Adrenergic receptor, beta-2 | 5q32-q34 | 1,136 | J02960 | AY091940 | AY091941 | AY091942 | L38905 | AY091943 |
| UOX | Urate oxidase | 1p22 | 375 | M27696 | M69165 | U03509 | M69167 | M69168 | M69169 |
| DRPLA | Dentatorubral pallidoluysian atrophy | 12p13.31 | 478 | D38529 | AJ133270 | AJ133271 | AJ133272 | AJ133274 | AJ133275 |
| PRNP | Prion protein | 20pter-p12 | 759 | AF085477 | U08296 | U08300 | U08305 | U08311 | U08304 |
| NPPA | Natriuretic peptide precursor A | 1p36.2 | 452 | X01471 | AY091944 | AY091945 | AY091946 | AY091947 | AY091948 |
| ADRB3 | Adrenergic receptor, beta-3 | 8p12-p11.2 | 941 | X72861 | AY091949 | AY091950 | AY091951 | U63591 | AY091952 |
| ZFY | Zinc finger protein, Y-linked: segment 1 | Yp11.3 | 397 | J03134 | AB041908 | AB041910 | AB041912 | AB041918 | AB041922 |
| ZFY | Zinc finger protein, Y-linked: segment 2 | Yp11.3 | 693 | U24118 | U24117 | U24119 | X72698 | X58931 | X58936 |
| DRD4 | Dopamine receptor D4 | 11p15.5 | 261 | I12349 | AF010294 | AF010296 | AF010298 | AF125666 | AF125669 |
| RNASE6 | Ribonuclease A family, 6 | 14 | 453 | U64998 | AF037081 | AF037088 | AF037082 | AF037089 | AF037086 |
| CCR5 | Chemokine (C-C) receptor 5 | 3p21 | 1,019 | AF011500 | AF011540 | AF105291 | AF075446 | AF075450 | AF161923 |
| BRCA1 | Breast cancer, type 1 | 17q21 | 3,408 | AF005068 | AF019075 | AF019076 | AF019077 | AF019078 | AF019079 |
| OXTR | Oxytocin receptor | 3p26.2 | 827 | AC008151 | AY091953 | AY091954 | AY091955 | AY091956 | AY091957 |
| IFNG | Interferon gamma | 12q14 | 426 | J00219 | AF1647 | AF1647 | AF1647 | L26024 | X64659 |
| ABO | ABO blood group | 9q34 | 568 | (2) | (2) | (2) | (2) | (2) | AY091958 |
| TNF | Tumor necrosis factor: segment 1 | 6q21.3 | 738 | AP000505 | AF195663 | AF195664 | AF195665 | AF195667 | AF195668 |
| TNF | Tumor necrosis factor: segment 2 | 6q21.3 | 1,499 | X02910 | AY091959 | AY091960 | AY091961 | AY091962 | AY091963 |
| BCYRAN1 | Brain cytoplasmic RNA 1 | 2p16 | 584 | AF020057 | AF067778 | AF067779 | AF067780 | AF067784 | AF067788 |
| B2M | Beta-2-microglobulin | 15q21-q22 | 990 | M17987 | AY091964 | AY091965 | AY091966 | AY091967 | AY091968 |
| LYZ | Lysozyme | 12 | 447 | M21119 | U76912 | U76913 | U76914 | X60236 | U76922 |
| FUT2 | Fucosyl transferase 2 | 19q13.3 | 993 | AB004861 | AF080603 | AF080605 | AF111935 | AF080607 | AF111936 |
| PROC | Protein C | 2q13-q14 | 381 | U47685 | U77647 | U77648 | U77650 | U77651 | U77649 |
| C4B | Complement component 4B | 6p21.3 | 504 | U07852 | L38806 | L38799 | L38805 | L38802 | L38807 |
| F9 | Hemophilia B | Xq27.1-q27.2 | 1,028 | K02053 | AY091969 | AY091970 | AY091971 | AY091972 | AY091973 |
| RHAG | Rhesus blood group–associated glycoprotein | 6p21.1-p11 | 1,218 | AF031548 | AF177621 | AF177622 | AF177623 | AF177625 | AF132980 |
| DMP1 | Dental matrix acidic phosphoprotein 1 | 4q21 | 877 | U89012 | AY091974 | AY091975 | AY091976 | AY091977 | AY091978 |
| IL16 | Interleukin 16 | 15q26.1 | 930 | AF077011 | AF007879 | AY091979 | AY091980 | AF017107 | AF017109 |
| COX4 | Cytochrome c oxidase, subunit IV | 16q22-qter | 437 | U90915 | AF042747 | AF042750 | AF042753 | AF042759 | AF042765 |
| ODC1 | Ornithine decarboxylase 1 | 2p25 | 1,274 | M81740 | AY091981 | AY091982 | AY091983 | AY091984 | AY091985 |
| DAF | Decay-accelerating factor for complement | 1q32 | 655 | AB003312 | AY091986 | AY091987 | AY091988 | AY091989 | AY091990 |
| POMC | Proopiomelanocortin | 2p23.3 | 490 | J00291 | AY091991 | AY091992 | AY091993 | AY091994 | AY091995 |
| PAH | Phenylalanine hydroxylase | 12q24.1 | 1,137 | AF204239 | AY091996 | AY091997 | AY091998 | AY091999 | AY092000 |
| LCAT | Lecithin cholesterol acyltransferase | 16q22.1 | 1,412 | X04981 | AY092001 | AY092002 | AY092003 | AY092004 | AY092005 |
| OPN1SW | Opsin 1, short wave | 7q31.3-q32 | 1,201 | U53874 | AF039433 | AF039425 | AY092006 | AF158976 | U53875 |
| APOA1 | Apolipoprotein A-I | 11q23-q24 | 1,147 | J00098 | AY092007 | AY092008 | AY092009 | M83242 | AY092010 |
| AFP | Alpha-fetoprotein | 4q11-q13 | 575 | M16110 | U21916 | M38272 | AY092011 | AY092012 | AY092013 |
| HBBP1 | Hemoglobin, beta pseudogene 1 | 11p15.5 | 9,174 | D56900 | D56900 | D56900 | D56900 | D56900 | D56900 |
| FPR1 | Formyl peptide receptor 1 | 19q13.4 | 1,005 | M37128 | X97745 | X97736 | X97735 | X97734 | AY092014 |
| GHR | Growth hormone receptor | 5p13-p12 | 1,446 | AF155912 | AF209082 | AF211186 | AF211185 | AF209081 | AF209080 |
| HBG1 | Hemoglobin, gamma globin A | 11p15.5 | 8,932 | D510763 | D510763 | D510763 | D510763 | D510763 | D510763 |
| EPO | Erythropoietin | 7q22 | 1,335 | M11319 | AY092015 | AY092016 | AY092017 | AY092018 | AY092019 |
| PRM2 | Sperm protamine P2 | 16p13.2 | 539 | AF215713 | AF215711 | AF215712 | AF215714 | X71338 | X71335 |
| MSH2 | Mismatch repair enzyme MHS2 | 2p22-p21 | 558 | U23824 | AJ002051 | AJ002050 | AJ002052 | AJ002049 | AJ002053 |
| SRY | Sex-determining region Y | Yp11.3 | 612 | L10101 | X86380 | X86382 | X86383 | X86385 | X86386 |
| IL3 | Interleukin 3 | 5q31.1 | 1,382 | L43402 | AY092020 | AY092021 | AY092022 | X51890 | X74878 |
| INS | Insulin | 11p15.5 | 860 | V00565 | X61089 | AY092023 | AY092024 | X61092 | J02989 |
| RNASE3 | Ribonuclease A family, 3 | 14q24-q31 | 477 | X16545 | U24103 | U24097 | U24101 | U24098 | U24099 |
| RNASE2 | Ribonuclease A family, 2 | 14q24-q31 | 476 | M24157 | U24102 | U24100 | U24104 | U24096 | U24099 |

NOTE.—Wherever available, locus names and abbreviations are used according to the Human Genetic Nomenclature Committee as listed in the NCBI. Chromosomal location is taken from the NCBI database. The size of the aligned segments, after exclusion of indels, as well as initiation and stop codons, is given in base pairs. The origin of the sequences is indicated by accession codes in the GenBank database or literature references for the six OTUs represented in the study. The references are Kominato et al. (1992) (2) and Woodward et al. (2000) (1). OWM, Old World monkeys; NWM, New World monkeys.

The second major cause of phylogenetic ambiguity, homoplasy (i.e., independently attained similarity at a site), is commonly differentiated into parallel evolution (similarity acquired from the same ancestral condition) and evolutionary convergence (similarity attained from different ancestral conditions). Thus, for example, if a changes to b independently in G and H, whereas it remains unaltered in C, species G will appear to be more closely related to H than C, although in reality it may have diverged earlier than C from the lineage leading to H. The extent to which ancestral polymorphism and homoplasy contribute to the obfuscation of a phylogenetic relationship is not known.
In most molecular phylogenetic reconstructions, attempts are made to take homoplasy into account by correcting the observed sequence for presumed hidden substitutions with the help of one of the correction formulas available (Nei and Kumar 2000, pp. 33–50). The underlying assumptions of all these formulas are stochasticity of the evolutionary process at the molecular level and neutrality of the substitutions. The formulas differ in the extent to which they take into account various factors that may influence the stochasticity of the process, such as the ratio of transitions to transversions or the four-nucleotide content of the sequence.

Here we attempt to actually measure the extent to which ancestral polymorphism and homoplasy influence phylogenetic reconstruction. To this end, we use a large collection of primate sequences, one half of which we obtained in our Tübingen laboratory and the other half from databases. The collection was assembled for a variety of purposes, the estimate of the relative influence of ancestral polymorphism and homoplasy on phylogenetic reconstruction being one of them. The data set includes sequences of human, chimpanzee, gorilla, and orangutan, as well as representative species of Old World monkeys (OWM) and New World monkeys (NWM). It thus covers a range of divergence times extending from 5 MYA (the human-chimpanzee split; White, Suwa, and Asfaw 1998) to nearly 50 MYA (the Platyrrhini-Catarrhini split dated by Kumar and Hedges [1998] to 47.6 ± 8.3 MYA). It is this wide span of evolutionary time that allows us to use the data set for the present purpose. The expectation is that the degree to which ancestral polymorphism and homoplasy obscure phylogenetic relationships depends on the particular time frame of the evolutionary process.
To understand the reason for this dependence, consider two time intervals, one encompassing the period during which three closely related species lineages diverged from one another (e.g., G from [H, C], followed by the divergence of H from C) and the second covering the period from the first divergence (i.e., G from [H, C]) to the present time. In the first case, we must take into account that the resolution of ancestral neutral polymorphisms in a population consisting of 10^5 breeding individuals may take up to 3 Myr (Takahata 1993; Takahata and Satta 1997). Hence, if the interval between the first and the second divergence was <3 Myr (as it probably was in the case of the H, C, and G lineages), then the resolution of the ancestral polymorphism can be expected to have confounded the phylogeny of the three lineages. On the other hand, if the interval between the first and the second divergences was >3 Myr (as was the case for the divergences of the OWM, NWM, and ape lineages), the resolution of the ancestral polymorphism should not have had any confounding effect. As for the interval from the first divergence to the present, the length of the divergence time determines how much homoplasy can be expected. Because homoplasy at the molecular level generally involves more than one substitution at a site and because in stochastic processes two hits at a site are more probable in a long time interval than in a short one, in the short interval, the frequency of homoplasy can be expected to be negligibly low, unless the substitution rate is very high (Takahata 1995). Therefore, homoplasy may have confounded the phylogenetic relationship among some of the NWM, OWM, and ape genes, but it might not have influenced the phylogeny of the human, chimpanzee, and gorilla genes.
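The dependence of multiple hits on elapsed time can be illustrated with a small calculation. The sketch below is not part of the study; it assumes that the number of substitutions at a site follows a Poisson distribution with mean d (the expected number of substitutions per site), and the two d values are arbitrary illustrative choices for a short and a long interval. The function name `prob_two_or_more_hits` is hypothetical.

```python
import math


def prob_two_or_more_hits(d: float) -> float:
    """Probability of at least two substitutions at a site.

    Assumes the number of hits is Poisson-distributed with mean d,
    so P(>=2) = 1 - P(0) - P(1) = 1 - exp(-d) * (1 + d).
    """
    return 1.0 - math.exp(-d) * (1.0 + d)


# Illustrative per-site divergences (assumed values, not estimates from the data).
short_interval = 0.01  # a short evolutionary interval
long_interval = 0.10   # an interval ten times longer

p_short = prob_two_or_more_hits(short_interval)
p_long = prob_two_or_more_hits(long_interval)

print(f"short interval: {p_short:.2e}")
print(f"long interval:  {p_long:.2e}")
print(f"ratio:          {p_long / p_short:.0f}")
```

Under this simple model the chance of a double hit grows roughly with the square of d when d is small, so a tenfold longer interval raises it by about two orders of magnitude. Homoplasy additionally requires that the hits produce the same nucleotide on different branches, which this sketch does not model.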
The objectives of the present study were to estimate the degree of phylogenetic incompatibility for clades in which only homoplasy could be the cause, to infer the mechanisms by which homoplasy arises, and to determine the importance of homoplasy in phylogenetic reconstruction.

Materials and Methods

The Data Set

The collection of sequences used in the present study comprised orthologous genes at 51 loci in species representing the major groups of anthropoid primates: human, African and Asian great apes, OWM and NWM (table 1). The apes were represented by the common chimpanzee (Pan troglodytes), the pygmy chimpanzee (P. paniscus), lowland gorilla (Gorilla gorilla), and orangutan (Pongo pygmaeus); the OWM by the bear macaque (Macaca arctoides), the rhesus macaque (M. mulatta), the crab-eating macaque (M. fascicularis), the Japanese macaque (M. fuscata), the gelada baboon (Theropithecus gelada), the yellow baboon (Papio cynocephalus), the patas monkey (Erythrocebus patas), and the green monkey (Cercopithecus aethiops); and the NWM by the cotton-top tamarin (Saguinus oedipus), the golden-mantled tamarin (S. tripartitus), the common marmoset (Callithrix jacchus), the black tufted-ear marmoset (C. penicillata), the black-capped capuchin (Cebus apella), the common squirrel-monkey (Saimiri sciureus), the Bolivian squirrel monkey (S. boliviensis), the northern night monkey (Aotus trivirgatus), the southern night monkey (A. azarae), the red howler monkey (Alouatta seniculus), and the spider monkey (Ateles sp.). The human loci came from different chromosomes, and they represented different functional categories, from ubiquitously expressed housekeeping genes to genes restricted in their expression to specific tissues. The orthology of the loci was checked by phylogenetic analysis which led to the exclusion of four of the original 55 loci: HLA-G, RNR1 (ribosomal RNA1, 28S), and MICA, on grounds of possible paralogy within multigene families, and IVL (involucrin) because of a complicated mode of evolution (Teumer and Green 1989). The extent of sequence variability caused by either polymorphism or polymerase chain reaction (PCR) and sequencing errors was estimated by independently amplifying and resequencing segments of nine of the 51 genes. Human sequences were checked for accuracy by comparing them with the corresponding segments of the human genome database entries. If conflicts occurred, sequences that differed from other primate sequences by the fewest substitutions were used. When more than one sequence from a particular primate lineage was available in the databases, sequences from cotton-top tamarin, bear macaque, and common chimpanzee were chosen for consistency. In other cases, the sequence of the nearest available relative from the same lineage was used. For simplicity, we refer to the representatives of the individual lineages as operational taxonomic units (OTUs). Throughout the text, the human, chimpanzee, gorilla, orangutan, macaque (OWM), and tamarin (NWM) lineages (OTUs) are abbreviated as H, C, G, O, M, and T, respectively. Some genes (UOX, FPR1, HBG1) are functional in certain lineages but have become inactivated in others. In such cases, the gene is treated according to its functional state in the majority of the OTUs.

Partitioning Analysis

Sequences were aligned by eye; only in the case of the globin genes and ApoB segment 1 was an alignment obtained from the databases. All variable sites were then identified and classified as two-base, three-base, or four-base sites according to the number of nucleotide types found at each of them in the different species.
Following Satta, Klein, and Takahata (2000), each variable site was classified as consisting of singletons, doubletons, and tripletons depending on whether a variant nucleotide occurred in one, two, or three of the six OTUs, respectively. This classification is sufficient to adequately and unambiguously describe the configuration of a site of six OTUs; it is unnecessary to count quatratons (nucleotide shared by four OTUs), pentatons (sharing by five OTUs), or hexatons (an invariant nucleotide) at a site because these are already incorporated in the description of the site in terms of singletons, doubletons, and tripletons. For example, the sharing of a nucleotide by five sequences at a site (a pentaton) is already inferred by the observation that such a site consists of one singleton, no doubletons, and no tripletons. A quatraton is inferred either from the presence of two singletons, no doubletons, and no tripletons or from the presence of no singletons, one doubleton, and no tripletons. Consequently, groupings above the level of tripleton are redundant in the classification of a site consisting of six OTUs. The total number of different singletons, doubletons, and tripletons that the six species could theoretically yield is 6, 15, and 20, respectively. The partitioning pattern of a given site provides information about OTUs because a shared character is more likely to be derived from a single mutation at the stem of two branches than from two independent mutational events. To explain the system of partitioning, consider a site occupied by nucleotides g, c, c, a, a, and a in the OTUs H, C, G, O, M, and T, respectively. (To avoid confusion between nucleotide and OTU designations, here and subsequently we use italicized lowercase letters for the former and roman type uppercase letters for the latter.) 
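The classification just described lends itself to a short sketch. The Python function below is an illustration only, not the authors' software; the names `OTUS` and `classify_site` are hypothetical. It groups the six OTUs by the nucleotide they carry at one site and reports the number of nucleotide types together with the singletons, doubletons, and tripletons, using the example site g, c, c, a, a, a. One simplification is assumed: at a two-base site split three against three, both complementary tripletons are reported.

```python
from collections import defaultdict

OTUS = "HCGOMT"  # human, chimpanzee, gorilla, orangutan, macaque, tamarin


def classify_site(column: str):
    """Return (n_bases, singletons, doubletons, tripletons) for one site.

    `column` holds one nucleotide per OTU in the order H, C, G, O, M, T.
    Groups of four or more OTUs are not reported because they are implied
    by the singletons, doubletons, and tripletons at the same site.
    """
    if len(column) != len(OTUS):
        raise ValueError("expected one nucleotide per OTU")
    groups = defaultdict(str)
    for otu, base in zip(OTUS, column.lower()):
        groups[base] += otu  # collect the OTUs sharing each nucleotide
    by_size = {1: [], 2: [], 3: []}
    for members in groups.values():
        if len(members) in by_size:
            by_size[len(members)].append(members)
    return len(groups), by_size[1], by_size[2], by_size[3]


n_bases, singletons, doubletons, tripletons = classify_site("gccaaa")
print(n_bases, singletons, doubletons, tripletons)
```

For the example site the function reports three nucleotide types, the singleton H, the doubleton CG, and the tripleton OMT.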
The site contains one singleton because the nucleotide g occurs only in H and in no other OTU. The site also contains a doubleton because it is occupied by a c in both C and G but in no other OTU. Finally, the site also contains one tripleton because it is occupied by an a in O, M, and T but by a different nucleotide in the remaining three OTUs. If we did not know the root of the tree, this partitioning pattern would suggest that C and G share a common ancestor, as do O, M, and T (so that C, G, and H would share a common ancestor as well). But because we know the root, we can explain the observed pattern by assuming a single mutation in the stem of the C, G, and H branches. The partitioning of sites is then taken one step further. We notice that the chosen site has a g in H but that it is occupied by two different nucleotides in the other OTUs, c in C and G but a in O, M, and T. Altogether, the site is occupied by three different nucleotides in the six OTUs, and so it is classified as a three-base site. The sole singleton is classified as a three-base singleton. Similarly, the site contains a three-base doubleton and a three-base tripleton. If the site were occupied by g, a, a, a, a, and a in H, C, G, O, M, and T, respectively, it would consist of a two-base singleton, no doubletons, and no tripletons.

The classification was used to provide support for or against the arrangements of the OTUs into specific clades. Singletons are not phylogenetically informative and do not provide support for or against particular clades. Doubletons and tripletons that support the separation of OTUs into clades consistent with the consensus phylogeny (the one depicted in fig. 1) are called compatible. Doubletons and tripletons that support separation into clades inconsistent with the consensus phylogeny are called incompatible. A site can contain more than one doubleton or tripleton and so can be informative about more than one clade in the phylogeny. Because the number of variable sites in a data set is a function of the evolutionary rate and the total branch length of the phylogeny, the number of homoplasies is expected to increase with an increase in these two parameters. The estimated degree of homoplasy can therefore be expected to vary according to the species chosen as an out-group and the interval from the first divergence to the present time.

FIG. 1.—The simulated phylogeny of six species. The values t5 to t1 correspond to the divergence times of branches leading to tip sequences of T, M, O, G, C, and H, respectively. A0 is an ancestral sequence, and N1 to N4 are node sequences. The values d1 to d9 are the per-site nucleotide divergences of the individual branches.

Computer Simulation

To examine the effects of mutation rate and nucleotide composition on the frequency of homoplasy, computer simulations were undertaken. The divergence times of the H, C, G, O, M, and T lineages were designated as t1 to t5 (fig. 1). Under a given mutation rate μ (per site per unit time), the expected number of nucleotide substitutions per site on branches leading to the different OTUs was based on actual sequence data (see Results). The values used were d5 = t5μ = 0.05, d4 = t4μ = 0.027, d3 = t3μ = 0.0125, d2 = t2μ = 0.005, and d1 = t1μ = 0.0045 for the T, M, O, G, and C or H lineages, respectively. For each OTU, a single sequence was generated in each replication. The simulation began with an ancestral sequence, A0, composed of 1,000 identical nucleotides (all a’s). To generate a representative sequence of OTU T, for each of the 1,000 sites, a single uniform variable v was generated. If v was smaller than d5, a substitution was introduced at the site.
The identity of the introduced nucleotide was determined by using the four-state model, in which a nucleotide has an equal probability of changing to any of the three remaining nucleotides (an assumption underlying the Jukes-Cantor model; see Jukes and Cantor 1969), or the two-state model, in which an a can change only to t (or vice versa) and g can only change to c (or vice versa); such a situation can occur in extremely at- or gc-rich regions of the genome. If, however, the simulation is started with a’s only, the g↔c change can never happen. The two-state model also covers the case in which transitions and transversions have very different rates. The simulation assumes only two bases, a and t or g and c. If instead of these two bases, two different bases, such as a and g, are considered, the simulation does not change in principle and the a and g model corresponds to the extreme case of a strong substitutional bias.

To simulate the divergence of lineages, four node sequences (N1 to N4) and six tip sequences (T, M, O, G, C, and H) were generated (fig. 1). For example, N1 was generated from A0 with probability of substitutions of d6 = d5 − d4 at each site in the same way as above. From N1, the N2 node sequence and the M tip sequence were generated by nucleotide substitutions with probabilities of d7 = d4 − d3 and d3, respectively. The extent of incompatibility of the generated tip sequences was then examined by the same method as that applied to the actual data, and the extent of compatibility of the (H, C), (H, C, G), and (H, C, G, O) clades was estimated from the simulated sequences. To obtain the distribution of the extent of incompatibility, 10,000 incompatibility values for each clade were generated, and the proportion of incompatibility values that fell in a particular range was calculated. The range was defined as one of 20 divisions of an interval extending from 0 to 1.0.
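The simulation scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used in the study. Only the four-state model is implemented. The internal branch lengths leading to N3 and N4 are assumed by analogy with d6 and d7, the branch to M is given the length d4 listed for the M lineage, and the criterion for an informative site (two of H, C, and G share a nucleotide while the third matches O) is a simplification of the partitioning analysis. All function names are hypothetical.

```python
import random

BASES = "acgt"
# Per-site divergences quoted in the text for the T, M, O, G, and C or H lineages.
D5, D4, D3, D2, D1 = 0.05, 0.027, 0.0125, 0.005, 0.0045


def evolve(seq, d, rng):
    """Copy seq, substituting each site with probability d (four-state model)."""
    out = []
    for base in seq:
        if rng.random() < d:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return out


def simulate_tips(n_sites=1000, rng=None):
    """Generate one replicate of the six tip sequences from an all-a ancestor."""
    rng = rng or random.Random()
    a0 = ["a"] * n_sites
    tips = {"T": evolve(a0, D5, rng)}
    n1 = evolve(a0, D5 - D4, rng)    # d6 in the text
    tips["M"] = evolve(n1, D4, rng)  # assumed branch length (see lead-in)
    n2 = evolve(n1, D4 - D3, rng)    # d7 in the text
    tips["O"] = evolve(n2, D3, rng)
    n3 = evolve(n2, D3 - D2, rng)    # assumed by analogy with d6 and d7
    tips["G"] = evolve(n3, D2, rng)
    n4 = evolve(n3, D2 - D1, rng)    # assumed by analogy with d6 and d7
    tips["C"] = evolve(n4, D1, rng)
    tips["H"] = evolve(n4, D1, rng)
    return tips


def hcg_support(tips):
    """Count sites grouping (H,C), (H,G), or (C,G), with the third OTU matching O."""
    x = y = z = 0
    for h, c, g, o in zip(tips["H"], tips["C"], tips["G"], tips["O"]):
        if h == c != g and g == o:
            x += 1
        elif h == g != c and c == o:
            y += 1
        elif c == g != h and h == o:
            z += 1
    return x, y, z


def incompatibility(x, y, z):
    """Proportion of incompatible sites, or None when no site is informative."""
    total = x + y + z
    return (y + z) / total if total else None


rng = random.Random(1)  # fixed seed so that the run is repeatable
values = [incompatibility(*hcg_support(simulate_tips(rng=rng))) for _ in range(100)]
defined = [v for v in values if v is not None]
print(len(defined), "of", len(values), "replicates contained informative sites")
```

With branches this short, many 1,000-site replicates contain no site informative about H, C, and G, which is why the sketch returns None rather than a ratio in that case. How such replicates were treated in the original simulation is not stated here.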
To examine the effect of an increased mutation rate, a 10 times higher mutation rate was obtained by increasing the di (for i = 1 to 5) values 10-fold. An “incompatibility value” was defined as the proportion of sites incompatible with the proposed phylogeny relative to all sites informative with respect to this phylogeny. For example, consider the (H, C, G) phylogeny: if there are x, y, and z sites supporting (H, C) G, (H, G) C, and (C, G) H phylogenies, respectively, then the incompatibility value for the (H, C, G) phylogeny is given by (y + z)/(x + y + z). Incompatibility values for (H, C, G, O) and (H, C, G, O, M) phylogenies can be calculated in a similar way.

Results and Discussion

Characteristics of the Data Set

The data set used in the final analysis comprised 57 segments of 51 genes in six OTUs and hence 306 sequences altogether. The length of the sequences varied depending on the gene. The total length of the concatenated sequences, after the removal of gaps and initiation and stop codons, was 62,533 bp. Fifty-four thousand seventy-three of the total number of sites (86.3%) were invariant, and 1,402 (15.2%) of the 8,457 (13.7%) variable sites were phylogenetically informative. At the variable sites, singletons were present in similar numbers in H (287), C (256), and G (308) sequences. In the O, M, and T sequences, the number of singletons increased to 684, 1,705, and 4,189, respectively, corresponding to their increasingly greater phylogenetic distance from the other OTUs. To estimate the extent to which the interspecies comparisons might be influenced by either intraspecies polymorphism or by errors in sequence determination (either during PCR amplification or during sequencing), nine randomly chosen gene segments were reamplified and resequenced from all or nearly all the nonhuman OTUs. Comparison of the “old” and the “new” sequences (a total of about 30 kb in length) revealed 44 differences.
The number of differences varied from gene to gene, being highest in APOA1 (12 differences in a total of 4.8 kb of sequence from four OTUs) and lowest in POMC (two differences in a total of 2.4 kb of sequence from four OTUs). The mean was 1.5 differences per kilobasepair of sequence, all the differences being singletons (i.e., they were not shared with any other sequence in the alignment). Significantly, all the incompatible sites included in the resequenced set could be confirmed.

Partitioning Analysis to Identify the Nearest Living Relative of the Human Species

In a partitioning analysis, the sites at which differences occur between the studied OTUs are considered individually in terms of their support or the lack thereof for a particular phylogeny. Initially, the differential sites are classified into singletons, doubletons, or tripletons for each of the six OTUs separately or for the various combinations of the OTUs, as described in Materials and Methods (table 2). In the next step, the 41 partitions that are theoretically possible with a set of six OTUs are further classified into two-, three-, or four-base categories (see Materials and Methods). Phylogenetically informative partitions are then identified, their disposition regarding individual phylogenies is noted, and the sites supporting a particular phylogeny are tallied. Because a larger number of phylogenies are possible for clades with a greater number of OTUs, only those partitions that informatively group one OTU of the (H, C, G), (H, C, G, O), or (H, C, G, O, M) phylogenies with the appropriate out-group (O, M, or T, respectively) are considered in estimating the extent of incompatibility for each clade.

Table 2
Partition of Variable Sites in the 51 Loci from Six OTUs

Partition      Total   Two-Base   Three-Base   Four-Base
Singletons
  H              287        257           25           5
  C              257        225           31           1
  G              308        274           30           4
  O              685        598           86           1
  M            1,705      1,498          202           5
  T            4,188      3,909          271           8
Doubletons
  H C             46         43            3           0
  H G             14         11            3           0
  H O              6          5            1           0
  H M             14         13            1           0
  H T             31         30            1           0
  C G             29         24            5           0
  C O              8          6            2           0
  C M              8          8            0           0
  C T             15         13            2           0
  G O              6          6            0           0
  G M             17         16            1           0
  G T             31         29            2           0
  O M             85         60           25           0
  O T             88         79            9           0
  M T            660        638           22           0
Tripletons
  H C G          332        291           41           0
  H C O           21         14            7           0
  H C M            6          5            1           0
  H C T            8          8            0           0
  H G O           14         12            2           0
  H G M            6          3            3           0
  H G T            5          4            1           0
  H O M            3          3            0           0
  H O T            3          2            1           0
  H M T            7          6            1           0
  C G O           10          6            4           0
  C G M            1          0            1           0
  C G T            1          0            1           0
  C O M            0          0            0           0
  C O T            2          0            2           0
  C M T            0          0            0           0
  G O M            3          0            3           0
  G O T            0          0            0           0
  G M T            0          0            0           0
  O M T            3          0            3           0

NOTE: Classification of sites is explained in the text. The numbers of sites falling into the individual categories are given.

Table 2 displays the number of sites that fall into the individual partitions into which the differential sites in the alignment of the sequences from the six OTUs can be classified. The partitions differ in their phylogenetic informativeness. Singletons are uninformative about phylogenetic relationships among the species; only sites that contain at least two types of nucleotides, each type being shared by at least two OTUs, are considered to be phylogenetically informative. All the 15 possible doubletons, regardless of the number of nucleotide types found at the site, can be phylogenetically informative. By contrast, only the two- and three-base categories of the tripletons are phylogenetically informative because a four-base tripleton in reality consists of one tripleton and three singletons.
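The informativeness criterion stated above can be sketched in a few lines of Python. This is only an illustration, not the software used in the study: the function names `is_informative` and `partition`, the fixed OTU order, and the example alignment columns are all assumptions made for the sketch.

```python
from collections import Counter

# Assumed OTU order for one alignment column (hypothetical convention).
OTUS = ("H", "C", "G", "O", "M", "T")

def is_informative(column):
    """True if at least two nucleotide types are each shared by >= 2 OTUs."""
    counts = Counter(column)
    shared = [base for base, n in counts.items() if n >= 2]
    return len(shared) >= 2

def partition(column):
    """Group the OTU labels by the nucleotide each carries at this site."""
    groups = {}
    for otu, base in zip(OTUS, column):
        groups.setdefault(base, []).append(otu)
    return sorted("".join(members) for members in groups.values())

# Hypothetical columns: an H-C doubleton and an H singleton.
for col in ("AAGGGG", "AGGGGG"):
    print(col, partition(col), is_informative(col))
```

A column such as `AAGGGG` groups H with C against the other four OTUs and is informative, whereas a column in which only H differs is a singleton and is not.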
Altogether, 1,402 sites were found to be phylogenetically informative, and of these, approximately 90% provided information regarding the grouping of the OTUs under consideration here. The remaining 10% gave information on groupings not relevant to the study; for example, the grouping of O with M or of C, G, and M. Of the 89 sites that are informative about the H, C, and G relationship, 46 sites (52%) were found to support the (H, C) clade excluding all the other OTUs, a value similar to that found by Satta, Klein, and Takahata (2000). Of these, 43 sites were of the two-base type, and 3 sites were of the three-base type. Fourteen sites (11 of the two-base type and 3 of the three-base type) supported the (H, G) clade, and 29 sites (24 of the two-base and 5 of the three-base type; tables 2 and 3) supported the (C, G) clade. Thus, the results of the partitioning analysis uphold the conclusion reached in an earlier study with a different data set (Satta, Klein, and Takahata 2000), namely, that the chimpanzee is the nearest living relative of Homo sapiens. At the same time, however, the high proportion of incompatibility between the phylogenetically informative sites (with 48% of the sites supporting alternative phylogenies) indicates that sorting out of ancestral polymorphisms, homoplasy, or both have blurred the phylogenetic signals that might have otherwise indicated clearly the disjunction of the H, C, and G lineages.

Dissociation of Ancestral Polymorphism from Homoplasy

The results described in the preceding section indicate that the gorilla lineage diverged from the lineage leading to the common ancestor of human and chimpanzee before these last two species (lineages) diverged from each other. The interval between these two divergences was apparently relatively short, probably not more than 1–3 Myr, well within the range of persistence of ancestral polymorphism in a large population (Takahata 1995).
To estimate to what degree homoplasy might have contributed to the blurring of phylogenetic signals during this interval, it is necessary to extend the partitioning analysis to more distantly related lineages (O, M, T). Both the paleontological (Martin 1993) and molecular (Sarich and Wilson 1967; Sibley, Comstock, and Ahlquist 1990; Horai et al. 1992) data indicate that the orangutan lineage diverged from the lineage leading to the common ancestor of human, chimpanzee, and gorilla 12–15 MYA. Because the human and chimpanzee lineages diverged from each other 5 MYA, and the gorilla lineage diverged from the lineage of the (H, C) ancestor not more than 8 MYA (Horai et al. 1992), an interval of >4 Myr separated the divergence of the orangutan lineage from that of the (H, C, G) lineage. This interval is too long for any ancestral polymorphism (except that maintained by balancing selection; see Klein et al. 1998) to survive, and so any incompatibilities between phylogenetically informative sites in the analysis of the (H, C, G, O) phylogeny should be attributable to homoplasy.

The inclusion of the O OTU in the partitioning analysis revealed the existence of 332 sites that support the (H, C, G) clade excluding O with M as the out-group (table 3). Forty-five sites are inconsistent with this grouping in that they include O in the clade and exclude H (10 sites), C (14 sites), or G (21 sites). Thus, the (H, C, G) clade is supported by 88% of the informative sites, with the remaining 12% of sites supporting alternative phylogenies. The incompatibilities of the latter sites are presumably the result of homoplasy in the lineages leading to M on the one hand and to the (H, C, G, O) lineage on the other.
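As a worked check of the incompatibility value defined in Materials and Methods, the sketch below recomputes it for the (H, C) and (H, C, G) groupings from the counts in table 3, part A. The function name `incompatibility` is an assumption introduced for this illustration; only the site counts come from the paper.

```python
def incompatibility(supporting, alternatives):
    """Share of informative sites that conflict with the proposed grouping."""
    conflicting = sum(alternatives)
    return conflicting / (supporting + conflicting)

# Site counts from table 3, part A (all sites).
hc = incompatibility(46, [14, 29])         # (H, C) against (H, G) and (C, G)
hcg = incompatibility(332, [21, 14, 10])   # (H, C, G) against groupings that include O

print(f"(H, C):    {hc:.1%}")
print(f"(H, C, G): {hcg:.1%}")
```

The two values round to the 48% and 12% quoted in the text.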
The increase in the proportion of sites that support the standard phylogeny from 52% for the (H, C) clade to 88% for the (H, C, G) clade is presumably a reflection of the corresponding decrease in the contribution of ancestral polymorphism to the evolution of the two clades. Similarly, the (H, C, G, O) grouping excluding M and T is supported by 638 (81%) of the 789 informative sites, the remaining 19% of sites supporting alternative groupings: (H, C, G, M), 10%; (H, C, M, O), 3.6%; (H, M, G, O), 1.6%; or (M, C, G, O), 3.8%. Here, homoplasy can be assumed to have affected 19% of the informative sites. The increase in homoplasy from 12% for the (H, C, G, O) lineage to 19% for the (H, C, G, O, M) lineage is as expected, taking into account the increased divergence time of the out-group T to the latter group (>45 Myr; Martin 1993) in comparison with that of the out-group M to the former group (~30 Myr; Martin 1993) and assuming that the probability of substitution is a function of time. By assuming that maximally 12% of the informative sites have also been influenced by homoplasy during the evolution of the (H, C, G) group, we estimate that in maximally one-quarter of the 48% incompatible sites found in this group, the incongruence with the consensus phylogeny is the result of homoplasy. The incongruence of the remaining three-quarters of incompatible sites is presumably caused by the segregation of ancestral polymorphisms.

Table 3
Partitioning of Informative Sites in the 51 Loci Divided into Coding and Noncoding Regions

Partition                     All   Two-Base   Three-Base   Four-Base
(A) All sites (62,533 bp)
  (H, C) (G, O)                46         43            3           0
  (H, G) (C, O)                14         11            3           0
  (C, G) (H, O)                29         24            5           0
  Total                        89
  (H, C, G) (O, M)            332        291           41           0
  (H, C, O) (G, M)             21         14            7           0
  (H, G, O) (C, M)             14         12            2           0
  (C, G, O) (H, M)             10          6            4           0
  Total                       377
  (H, C, G, O) (M, T)         638        638
  (H, C, G, M) (O, T)          79         79
  (H, C, M, O) (G, T)          29         29
  (H, M, G, O) (C, T)          13         13
  (M, C, G, O) (H, T)          30         30
  Total                       789
(B) Coding sites (29,451 bp)
  (H, C) (G, O)                15         14            1           0
  (H, G) (C, O)                 5          4            1           0
  (C, G) (H, O)                 7          5            2           0
  Total                        27
  (H, C, G) (O, M)            129        112           17           0
  (H, C, O) (G, M)              8          4            4           0
  (H, G, O) (C, M)              6          4            2           0
  (C, G, O) (H, M)              5          2            3           0
  Total                       148
  (H, C, G, O) (M, T)         255        255
  (H, C, G, M) (O, T)          31         31
  (H, C, M, O) (G, T)          10         10
  (H, M, G, O) (C, T)           3          3
  (M, C, G, O) (H, T)          13         13
  Total                       312
(C) Noncoding sites (32,921 bp)
  (H, C) (G, O)                31         29            2           0
  (H, G) (C, O)                 9          7            2           0
  (C, G) (H, O)                22         19            3           0
  Total                        62
  (H, C, G) (O, M)            203        179           24           0
  (H, C, O) (G, M)             13         10            3           0
  (H, G, O) (C, M)              8          8            0           0
  (C, G, O) (H, M)              5          4            1           0
  Total                       229
  (H, C, G, O) (M, T)         382        382
  (H, C, G, M) (O, T)          48         48
  (H, C, M, O) (G, T)          19         19
  (H, M, G, O) (C, T)          10         10
  (M, C, G, O) (H, T)          17         17
  Total                       476

NOTE: The table shows the number of phylogenetically informative sites supporting each partition. OTUs forming a clade are included in parentheses. The sites are classified according to the system explained in the text.

Possible Reasons for the High Level of Homoplasy

The observation that 19% and 12% of the phylogenetically informative sites have undergone homoplasious substitutions during the time interval between the divergence of the Platyrrhini from the Catarrhini lineage and of the OWM from the ape lineage, respectively, is unexpected. The common perception, supported by computer simulations based on the standard models of molecular evolution (see below), is that homoplasies in intervals of these lengths are rare, on the order of a few percent at most. Even in cases in which intense selection is known or suspected to drive the substitution process (Takahata 1995) or in which functional convergences at the molecular level are postulated (Swanson, Irwin, and Wilson 1991; Irwin, White, and Wilson 1993; Lawn, Schwartz, and Patthy 1997), homoplasy is believed to be an exception rather than a rule.
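Returning briefly to the apportionment made earlier in this section, the arithmetic behind the 19% figure and the "one-quarter" bound can be written out as follows. This is a minimal sketch; the variable names are introduced here for illustration, and the 12% ceiling on homoplasy within the (H, C, G) group is the paper's own assumption, not a measured quantity.

```python
# (H, C, G, O) grouping, table 3 part A: 638 of 789 informative sites support it.
homoplasy_hcgo = (789 - 638) / 789

# Assumed ceiling on homoplasy within (H, C, G), taken from the comparison with O.
homoplasy_bound = 0.12
# Observed incompatibility for the (H, C) grouping.
incompatibility_hc = 0.48

share_homoplasy = homoplasy_bound / incompatibility_hc   # at most this share
share_polymorphism = 1 - share_homoplasy                 # at least this share

print(f"homoplasy in (H, C, G, O): {homoplasy_hcgo:.0%}")
print(f"max share due to homoplasy: {share_homoplasy:.0%}")
print(f"min share due to ancestral polymorphism: {share_polymorphism:.0%}")
```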
The following question therefore arises: What might be the reason for the high homoplasy found in the primate lineages? In what follows, we consider three possible answers to this question.

The first possibility is that the observed homoplasy is a manifestation of selection pressure for convergence in function. Many of the studied genes (ABO, RHAG, PRM2, RNASE3, SRY) are known or postulated to be under moderate-to-strong selection pressure (O’hUigin, Sato, and Klein 1997; Zhang, Rosenberg, and Nei 1998; Wyckoff, Wang, and Wu 2000). Could this pressure be responsible for the high homoplasy? To test this possibility, we divided the data set into coding and noncoding subsets and carried out the partitioning analysis separately on the two subsets (table 3, parts B and C). Of the 29,451 coding sites of the 51 loci, 3,018 (10%) were found to be variable, and of these, 545 (18%) sites are phylogenetically informative. Of the 545 sites, 27, 148, and 312 are informative about the (H, C, G), (H, C, G, O), and (H, C, G, O, M) phylogenies, respectively. Fifteen of 27 (56%) relevant informative sites support the (H, C) clade, whereas 129 of 148 (87%) sites support the (H, C, G) clade, and 255 of 312 (82%) sites are compatible with the (H, C, G, O) clade. Similarly, of the 32,921 noncoding sites, 5,424 (16%) have been found to be variable, and of these, 855 (16%) are phylogenetically informative. Of these, 62, 229, and 476 sites are informative about the (H, C, G), (H, C, G, O), and (H, C, G, O, M) phylogenies, respectively. Thirty-one of the 62 (50%) relevant informative sites support the (H, C) clade, 203 of 229 (89%) the (H, C, G) clade, and 382 of 476 (80%) the (H, C, G, O) clade. Thus, the differences in the proportion of incompatibilities between the coding and noncoding regions are small, and no strong tendency for homoplasy arising preferentially in the coding regions is apparent.
Selection therefore does not appear to play a dominant part in the generation of homoplasy. The second possibility is that the observed high level of homoplasy is caused by a bias in nucleotide composition. The mechanisms producing bias in equilibrium nucleotide frequencies in different genomic regions are unclear (Wolfe, Sharp, and Li 1989). Because maintenance of compositional bias increases the probability of like substitutions, such bias could be expected to indirectly influence the extent of homoplasy at certain sites in a gene and in certain regions of a genome. This effect should be most pronounced in the third positions of codons and noncoding regions where mutational bias is the primary determinant of nucleotide composition. The absence of a significant increase in homoplasy in noncoding regions noted above therefore provides an argument against this explanation. Further evidence emerges from an examination of the nucleotide composition of the individual genes (table 4). Compositional bias was measured by using the method of Kornegay et al. (1993). It was then related to the number of variable sites, number of phylogenetically informative sites, and number of sites compatible or incompatible with the consensus phylogeny (table 4). The measurement of incompatibility was limited to the (H, C, G, O) and (H, C, G, O, M) groups where no contribution of ancestral polymorphism should occur. Most (45 of 57 segments) of the genes in the data set showed some degree of incompatibility which ranged from 0% to 72%. The relatively high percentages of incompatibility found in some of the short genes (ZFX, ACAT2, DRD4, PROC) may be caused by stochastic effects associated with a low number of informative sites. The longer genes containing more informative sites probably provide a more reliable estimate of incompatibility. 
The gist of the comparison is that compositional bias in either coding or noncoding regions does not appear to have a strong effect on the degree of homoplasy found in the individual genes. Genes showing the highest levels of compositional bias often show a below-average (e.g., ADBR3, MSH2) or no (e.g., ZNFN1A1, APOB) homoplasy. By contrast, genes with high levels of homoplasy may have a low (e.g., RNASE3, PROC) or moderate (e.g., PRM2, LCAT) degree of compositional bias.

The third theoretically possible explanation for the observed high average level of homoplasy in the primate genes is a high mutation rate with an associated increased probability of multiple hits. An increased mutation rate can have a variety of causes, one of which is the presence of CpG dinucleotides prone to methylation and thus to a high frequency of C→T transitions (Green et al. 1990). To test the effect of an increased mutation rate, substitution rates were estimated for the individual genes. Assuming some correspondence between mutation and substitution rates and using Kimura’s two-parameter method (Kimura 1980), we calculated the per-site substitution rate (K) for each gene in all 15 pairwise comparisons of the six OTUs and then summed up the values and expressed the sum as percent K, a measure we refer to as “SK%” in table 4. The rates were calculated for all sites of a given gene and for synonymous sites separately. Except for a few genes (generally the slowly evolving ones), the two rate estimates correlated reasonably well with each other. The general tendency revealed by the estimates is that genes with higher mutation rates show higher degrees of homoplasy. Thus, the 14 most rapidly evolving gene segments in the (H, C, G, O) clade have 23% incompatible sites on average (100 of 437 sites), whereas the most slowly evolving gene segments have a mean incompatibility of 15% (9 of 60 sites), the overall mean incompatibility for this clade being 17%.
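The SK% measure described above can be sketched as follows, assuming gap-free aligned sequences and the standard Kimura (1980) two-parameter formula K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The function names, the toy 10-bp alignment, and the scaling of the sum by 100 are assumptions made for illustration; the paper does not spell out its exact implementation.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}

def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Same chemical class (purine/purine or pyrimidine/pyrimidine) = transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq1)
    q = transversions / len(seq1)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

def sum_k_percent(alignment):
    """Sum K over all pairwise comparisons and express it as a percentage."""
    pairs = combinations(sorted(alignment), 2)
    return 100 * sum(k2p_distance(alignment[a], alignment[b]) for a, b in pairs)

# Hypothetical toy alignment of the six OTUs (not data from the study).
toy = {
    "H": "ACGTACGTAC", "C": "ACGTACGTAT", "G": "ACGTACGCAC",
    "O": "ACATACGTAC", "M": "ACGTTCGTAC", "T": "GCGTTCGTAA",
}
print(f"SK% for the toy alignment: {sum_k_percent(toy):.1f}")
```

With six OTUs the sum runs over 15 pairs, which is why SK% values in table 4 can exceed 100.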
Eleven of the 12 gene segments that show no incompatibilities caused by homoplasy are in the slowly evolving set, and there is no segment without incompatibility in the set of the 25 most rapidly evolving gene segments. Nonetheless, it must be pointed out that although the tendency for the association of incompatibility with higher mutation (substitution) rates does exist, the association is not very strong and it is not clear to what degree the higher incompatibility levels could be attributed to the increased rates. At least some of the incompatibility may be related to certain special evolutionary characteristics of some of the genes. Thus, for example, the high incompatibility level of the RNASE3 gene found in the (H, C, G, O) clade (13 incompatibilities at 15 sites) may be attributed to its retention of the character of the RNASE2 gene from which it arose by duplication following the divergence of the Catarrhini from the Platyrrhini (Zhang, Rosenberg, and Nei 1998).

Table 4
Gene-by-Gene Analysis of Incompatibility

                 BIAS             SK%           NUMBER OF SITES
                                                                     (H, C)    (H, C, G)  (H, C, G, O)
Gene         Third    All     All     Syn.   Variable  Informative   c    i     c    i      c    i      %I
ZFX          0.107     —     18.6    80.3        13         2        0    0     0    0      1    1    50.0
ZNFN1A1      0.596     —     18.7    55.0        20         3        0    0     1    0      1    0     0.0
CXCR4        0.284     —     23.5    64.9        40        10        1    0     2    0      5    0     0.0
APOB         0.455     —     28.4    85.5         9         1        0    0     0    0      1    0     0.0
APOB         0.182     —     61.9    66.6        61        11        1    1     2    0      5    0     0.0
ACAT2        0.296   0.258   30.7    46.7        27         6        2    0     1    0      1    2    50.0
VHL          0.270     —     32.4    95.8        24         6        1    0     2    0      2    1    20.0
NGFB         0.207     —     37.0    98.6        38         6        0    0     2    0      3    0     0.0
SCG2         0.099     —     37.1    70.9        68         5        0    1     1    0      3    0     0.0
ADRB2        0.292     —     40.0    85.6        77        16        1    0     4    0      7    1     8.3
UOX            —     0.085   40.6    40.6        25         5        0    0     2    0      2    1    20.0
DRPLA        0.184     —     44.5    92.8        36         7        1    0     0    0      5    0     0.0
PRNP         0.222     —     48.6   136.2        57        19        0    0     7    1      8    1    11.8
NPPA         0.205   0.133   51.7    85.6        42         7        1    0     2    0      3    0     0.0
ADBR3        0.466     —     52.5   119.1        89         9        0    0     2    0      6    1    11.1
ZFY          0.090     —     53.9   227.9        31        17        0    0     2    0     14    0     0.0
ZFY            —     0.233  129.9   129.9       139        24        2    0    10    1      7    3    19.0
DRD4         0.575     —     55.4   132.9        26         3        0    0     0    1      1    1    66.7
RNASE6       0.197     —     55.7    84.7        41         7        0    0     0    0      5    2    28.6
CCR5         0.161     —     56.1   114.7        99        10        1    0     1    0      4    0     0.0
BRCA1        0.241     —     57.4    70.3       334        59        0    1    13    3     37    3    10.7
OXTR         0.536     —     57.6   126.1        82         8        0    1     2    1      1    1    40.0
IFNG         0.080     —     59.1    94.9        44         8        0    0     2    2      3    1    37.5
ABO          0.573     —     63.8   145.5        60        13        0    0     1    0      9    1     9.1
TNF            —     0.089   65.0    65.0        50        12        0    0     2    0      8    1     9.1
TNF          0.369   0.183   66.7    74.7       179        24        0    0    10    1      9    2    13.6
BCYRN1         —     0.060   65.3    65.3        65        10        0    0     1    0      7    0     0.0
B2M          0.079   0.129   66.2    66.3       117        16        2    0     3    0      9    2    14.3
LYZ          0.115     —     66.6    78.9        46        14        0    0     2    0     10    1     7.7
FUT2         0.344     —     66.7   161.8       112        15        0    0     2    1     11    1    13.3
PROC           —     0.039   67.9    67.9        45         6        1    0     1    1      0    1    66.7
C4B          0.110   0.192   68.1    77.1        59        10        0    2     6    0      0    0     0.0
F9           0.168   0.194   69.1    84.9       117        26        4    0     3    1     13    2    15.8
RHAG         0.088     —     69.8    92.2       133        37        0    4    18    0      7    1     3.8
DMP1         0.134     —     70.8   122.2       105        17        1    0     3    2      8    2    26.7
IL16         0.100     —     71.6   125         113        18        0    0     3    0      9    5    29.4
COX4         0.380     —     72.5   121.7        50        12        0    0     4    1      7    0     8.3
ODC1         0.154   0.130   75.0   108.2       161        25        2    1     8    1     10    0     5.3
DAF          0.313   0.232   78.6    78.9        84        16        0    0     3    0      9    2    14.3
POMC         0.544     —     80.7   154.9        62        19        1    0     7    1      6    1    13.3
PAH          0.231   0.096   80.7    85.6       154        24        1    0     4    1     14    2    14.3
LCAT         0.265   0.200   81.7    92.7       191        34        1    0     8    0     11    8    29.6
OPN1SW       0.254   0.100   82.7    92.4       161        25        1    0     8    1     12    2    13.0
APOA1        0.474   0.146   91.3   112.7       172        30        0    3    11    1     14    1     7.4
AFP          0.136   0.160   88.3   100.7        87        13        0    3     2    1      6    1    20.0
HBBP1          —     0.147   96.4    96.4     1,448       236        5    6    53    4    118   32    17.4
FPR1         0.205     —    100.6   152.8       167        21        0    1     5    1      7    3    25.0
GHR            —     0.103  104.0   104         248        32        1    3     4    2     13    4    26.1
HBG1         0.213   0.130  108.0   111.9     1,588       262       10    7    60    9    116   29    17.8
EPO          0.264   0.041  108.3   117.5       238        34        0    2    10    3     12    2    18.5
PRM2         0.320   0.233  112.7    84.1        99        17        0    4     1    2      5    3    45.5
MSH2         0.545   0.159  111.8   141.4       106        17        2    0     3    0      7    2    16.7
SRY          0.109     —    118.0   182.1       119        19        0    0     7    1      8    0     6.2
IL3          0.132   0.061  123.1   120.5       282        32        2    0     9    0     15    3    11.1
INS          0.441   0.182  146.2   161.5       186        29        1    3     2    0     14    3    15.8
RNASE3       0.155     —    151.4   151.6       112        19        0    0     3    0      2   13    72.2
RNASE2       0.125     —    171.5   255.6       120        19        0    0     7    1      7    2    17.6
Totals                                        8,457     1,402       46   43   332   45    638  151    16.8

NOTE: Genes are as listed in table 1. Nucleotide composition bias for third codon positions or all noncoding sites was calculated according to Kornegay et al. (1993). SK% is the sum of the percentage divergence (see text). c and i indicate the number of sites compatible and incompatible with the shown phylogeny, respectively. %I indicates the combined percentage of incompatibility for the (H, C, G) and (H, C, G, O) clades.

FIG. 2. Simulation results of the effects of the mutation rate and nucleotide composition on the observed extent of homoplasy. The abscissa shows the proportion of informative sites that are incompatible with the (H, C) clade (A), the (H, C, G) clade (B), or the (H, C, G, O) clade (C) in a set of generated sequences. The ordinate shows the proportion of replicates among 10,000 supporting the indicated incompatibility. μ, mutation rate; L, low; H, high (i.e., 10 times higher than L); 2S, two-state model; 4S, four-state (Jukes-Cantor) model.

Search for the Cause of High Homoplasy by Computer Simulation

Partitioning analysis excluded selection pressure, but not nucleotide composition, from being responsible for the high homoplasy of the primate genes and provided an indication that variation in the mutation rate of different sites might be a factor. To test whether a combination of the two factors, nucleotide composition bias and variability of mutation rates, might explain the data, a computer simulation was carried out. The influence of nucleotide composition was simulated by using either the four-state or the two-state model of molecular evolution. The effect of the mutation rate was assessed by letting the genes evolve with a rate of d_5 = 0.05 and then again with a 10-fold higher rate of d_5 = 0.5. (As mentioned earlier, the strong transition bias model is covered by the two-state model.) The first of these two values was obtained by taking the average divergence at synonymous sites of the 51 studied loci calculated from the comparison of the T sequences with the sequences of the other five OTUs (i.e., d_5 = 0.09862/2 = 0.05). Thus, the combination of the two variants of each of the two factors tested four different conditions under which the genes evolved.
The simulated evolutionary process was aimed at producing six OTUs related to one another in the manner depicted in figure 1 (for further details see Materials and Methods). The simulation was repeated 10,000 times for each of the four sets of conditions, and the results were summarized in a graphic form separately for the (H, C), (H, C, G), and (H, C, G, O) clades (fig. 2; panels A, B, and C, respectively). Plotted on the abscissa of the graph are the incompatibility values or the proportions of informative sites incompatible with the particular clade in the set of artificially generated sequences. The ordinate indicates the frequency with which each particular proportion occurred in the 10,000 replicates; it can also be interpreted as the probability of obtaining a particular proportion of incompatible sites in one simulation experiment or at one locus.

The simulation reveals that under the low mutation rate and the application of either the two- or four-state model, more than 80% of the replications show no incompatibility in both the (H, C) and the (H, C, G) clades (fig. 2A and B). In the remaining <20% replications, incompatibilities do occur, but because they are rare, they are widely scattered among the replications. Moreover, because mutations are generally rare, when an even rarer incompatible mutation does occur, it has a large effect on the incompatibility value. Consequently, the simulation is subject to a considerable “noise” reflected in the scattering of incompatibilities. Because the (H, C, G, O) clade is deeper than either the (H, C) or the (H, C, G) clade, the larger number of substitutions found in the deeper phylogeny results in less noise in the simulation results than that found in the shallower clades (fig. 2C). Although more incompatibilities occur, their proportion in the total number of substitutions varies far less than when only a few informative sites are present.
The increased number of incompatibilities is reflected in the observation that only 55% of replicates produce no incompatibilities under the four-state model and in that the value drops to 19% under the two-state model.

Higher mutation rates increase the proportion of incompatibility in each of the three clades in both the four- and two-state models. In the case of the (H, C) clade, the proportion of replicates without incompatibility falls below 10% under the four-state model and to 0% under the two-state model. Incompatibilities are distributed in varying proportions through most replicates, the variation again being the result of sampling effects in the shallow phylogeny. In the (H, C, G) clade, no replicate under either the four- or two-state model is without incompatibility. The distribution of incompatibility shows a peak at ~25% under the four-state model and a broad-shouldered peak at ~35% under the two-state model. Finally, in the (H, C, G, O) clade, the peak under the four-state model moves to ~30% and that under the two-state model to an equilibrium value of 80%. In the two-state model, four incompatibilities arise for every compatibility generated.

From these observations the following conclusions can be drawn. First, the level of homoplasy is insignificant when the mutation rate is uniformly low at all the sites and when the nucleotide composition is unbiased. Second, the two-state model does not markedly affect the extent of homoplasy in comparison with the four-state model when the range of sequence divergence is low (<10%). Third, a high mutation rate greatly increases the extent of homoplasy: even for a pair of OTUs with a short divergence time, the proportion of loci with compatible sites only is reduced to <10%. The reduction takes place under both models, but it is more pronounced under the two-state than under the four-state model.
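Why the two-state model yields more homoplasy than the four-state model can be illustrated with a closed-form calculation. This is a back-of-the-envelope sketch, not the simulation used in the study. It assumes the standard Jukes-Cantor and symmetric two-state transition probabilities, treats d as the expected number of substitutions per site on a branch, and considers only two independent branches of equal length; the function names are hypothetical.

```python
import math

def p_change_four_state(d):
    """Jukes-Cantor: probability that a site differs from its ancestor after d."""
    return 0.75 * (1 - math.exp(-4 * d / 3))

def p_change_two_state(d):
    """Symmetric two-state model: probability that a site differs from its ancestor."""
    return 0.5 * (1 - math.exp(-2 * d))

def p_parallel(d, states):
    """Probability that two independent branches of length d both end in the
    same state, and that this state differs from the ancestral one."""
    if states == 4:
        p = p_change_four_state(d)
        return p * p / 3   # three equally likely derived states
    if states == 2:
        p = p_change_two_state(d)
        return p * p       # only one derived state exists
    raise ValueError("states must be 2 or 4")

for d in (0.05, 0.5):      # low and 10-fold higher rate, as in the text
    print(d, round(p_parallel(d, 4), 5), round(p_parallel(d, 2), 5))
```

Under these assumptions the parallel-change probability rises steeply with d and is always larger with two states, in line with the qualitative conclusions drawn from figure 2.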
Simulation Based on a Mixed Rate Model

Taking these results into consideration and taking into account the possibility that mutation rates may vary from site to site and between different regions of a gene or genome, a mixed rate model was constructed and used in another set of computer simulations. To simulate the situation encountered with the actual data set more realistically, the number of replicates was reduced from 10,000 to 51, corresponding to the number of genes in the set. To provide for the heterogeneity of the mutation rate, we allowed 900 of the 1,000 sites at each gene to mutate at the low (1μ) rate and the remaining 100 sites at the 10 times higher (10μ) rate. In all other respects the simulation was carried out as in the first experiment. The observed compatibility values (62%, 85%, and 85% for the (H, C), (H, C, G), and (H, C, G, O) clades, respectively) were in reasonably good agreement with the actual data.

We have thus far considered compatible or incompatible sites for a particular clade irrespective of their occurrences within or between loci. However, a locus can be incompatible with a particular clade in two different ways: in one, all the informative (either compatible or incompatible) sites at the locus are incompatible (interlocus incompatibility), and in the other, the locus contains some sites incompatible with each other (intralocus incompatibility). In the experimental data, of the 57 sequence segments (table 4), 34 were informative for the (H, C) clade. Of these, six loci or segments (18%) showed intralocus incompatibility and contained 21 incompatible and 20 compatible sites within these loci. The relative extent of intralocus incompatibility for the (H, C) clade (21 vs. 20) was much larger than that for the (H, C, G) clade (44 vs. 244) and for the (H, C, G, O) clade (150 vs. 562).
The remaining 28 segments supported either the (H, C) (18 segments, 53%) or the (H, G) and (C, G) (10 segments, 29%, of interlocus incompatibility) grouping unambiguously. Hence, in total, 82% of segments supported a single phylogeny. This proportion was reduced to 28/53 = 53% for the (H, C, G) clade and 15/56 = 27% for the (H, C, G, O) clade, each with only one segment being incompatible with either of these clades. Similarly, in terms of the numbers of interlocus incompatible versus compatible sites, there were 22 versus 26 for the (H, C) clade but 1 versus 88 for the (H, C, G) clade and 1 versus 76 for the (H, C, G, O) clade. Thus, the actual data showed that compared with the (H, C, G) and (H, C, G, O) clades, both intra- and interlocus incompatibilities were notably high for the (H, C) clade.

Although the simulation result was in good agreement with the observed low extent of interlocus incompatibility for the (H, C, G) and (H, C, G, O) clades, it failed to account for the observed high extent of interlocus incompatibility for the (H, C) clade. Although by using a mixed rate model it was possible to generate a high degree of intralocus incompatibility, the extent of intralocus incompatibility tended to increase as the clade included more distantly related OTUs. This simulation result was again inconsistent with the observed high extent of intralocus incompatibility for the (H, C) clade and the low extent for the (H, C, G) and (H, C, G, O) clades. Thus, the simulation result suggested that the relatively high extent of intra- and interlocus incompatibility observed in the (H, C) clade cannot be accounted for by high mutation rates at particular sites or by homoplasy. It must therefore have a different cause, namely, ancestral polymorphism.
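The distinction between intralocus and interlocus incompatibility drawn above can be expressed as a small classifier over the per-segment counts of compatible (c) and incompatible (i) sites. The function name and category labels are assumptions made for this sketch; the five example rows are the (H, C) columns of table 4 for those segments.

```python
def classify_locus(compatible, incompatible):
    """Classify one segment from its compatible/incompatible site counts."""
    if compatible == 0 and incompatible == 0:
        return "uninformative"
    if compatible > 0 and incompatible > 0:
        return "intralocus incompatibility"   # sites within the locus disagree
    if incompatible > 0:
        return "interlocus incompatibility"   # every informative site conflicts
    return "compatible"

# (c, i) counts for the (H, C) clade, taken from table 4.
examples = {
    "HBG1": (10, 7),
    "HBBP1": (5, 6),
    "CXCR4": (1, 0),
    "SCG2": (0, 1),
    "ZFX": (0, 0),
}

for gene, (c, i) in examples.items():
    print(f"{gene:6s} c={c:2d} i={i:2d} -> {classify_locus(c, i)}")
```

Applied to all 57 segments, this rule separates the segments with mixed signal from those that support a single grouping unambiguously.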
Conclusions

To sum up, comparative analysis of 57 sequences obtained from 51 genes in six primate OTUs representing the human, chimpanzee, gorilla, orangutan, OWM, and NWM lineages reveals the occurrence of homoplasy (parallelism or convergence) at much higher frequencies than is generally expected (19% in the six-OTU lineage and 12% in the OWM and ape lineages). Together with ancestral polymorphism, homoplasy is therefore a major source of incongruence in phylogenetic reconstructions. Whereas ancestral polymorphism may obscure only phylogenies of lineages that diverged from one another within a short interval on the evolutionary time scale, no such restriction applies to homoplasy.

Of the three major factors considered here as potential causes of the observed high level of homoplasy, no compelling evidence could be mustered for the effect of selection. In particular, the observation that homoplasy is distributed equally between coding and noncoding parts of the genome argues against this explanation. Selection, however, is responsible for an increased tendency toward parallel substitutions in certain genes, such as those of the major histocompatibility complex (O’hUigin 1995; Kriener et al. 2000), not included in the data set used in the present study. Both the analysis of this data set and computer simulations implicate the two other factors: variation in nucleotide composition and, in particular, in mutation (substitution) rates. Taken in isolation, these two factors, however, do not fully account for the observed high level of homoplasy. The simulation study indicates that even in extreme cases of biased nucleotide composition, 10% incompatibility is reached very rarely when standard substitution models are applied. On the other hand, high mutation rates do appear to account for some but not all of the observed incompatibility, especially that observed in the (H, C, G) phylogeny.
But because we observed a high extent of interlocus incompatibility, whereas simulation led to a high degree of intralocus incompatibility, ancestral polymorphism must be an important factor in determining the (H, C, G) phylogeny. Compared with the (H, C, G) and (H, C, G, O) clades, the large proportion of intralocus-incompatible sites relative to that of intralocus-compatible sites in the (H, C) clade cannot be accounted for by homoplasy. Rather, it is most likely the result of the combined effect of intragenic recombination and ancestral polymorphism in the stem lineage of humans, chimpanzees, and gorillas. Therefore, the contribution of homoplasy caused by mutations may be insignificant with respect to the (H, C, G) phylogeny. On the other hand, in the case of the (H, C, G, O) and (H, C, G, O, M) phylogenies, the simulation indicates that if a small percentage of nucleotide sites are allowed to undergo mutations (substitution) at a rate 5- to 10-fold higher than the normal rate, homoplasy (and phylogenetic incompatibility) will occur at levels similar to those observed at the 51 loci under study. The number of interlocus-incompatible loci as well as interlocus-incompatible sites for the (H, C) clade is much larger than that for the (H, C, G) or (H, C, G, O) clades. If the latter incompatibility is attributed to homoplasy, then the former must be attributed to ancestral polymorphism.

The inference that a category of rapidly evolving sites must exist in primate DNA has implications for phylogenetic studies. Such sites might be expected to contribute inordinately to phylogenetic information obtained on lineages diverging in rapid succession because few other sites will have undergone substitution within the short time interval. Examples of rapidly evolving sites in specific genes are known (Green et al. 1990), but the extent of their occurrence in the genome has not been determined.
In cases in which rapidly evolving sites are the major source of information about phylogenetic relationships, they may have undergone several substitutions before more slowly evolving sites could contribute to phylogenetic resolution. In such cases, a phylogeny may be built primarily on sites that show a high degree of incompatibility and is likely to be incorrect.

Acknowledgments

We thank Dr. Naoko Takezaki for critical reading of the manuscript, Hana Jandova and Solveig Hirschle for technical assistance, and Jane Kraushaar for editorial assistance.

LITERATURE CITED

BAILEY, W. J. 1993. Hominoid trichotomy: a molecular overview. Evol. Anthropol. 2:100–108.
GREEN, P. M., A. J. MONTANDON, D. R. BENTLEY, R. LJUNG, I. M. NILSSON, and F. GIANNELLI. 1990. The incidence and distribution of CpG→TpG transitions in the coagulation factor IX gene. Nucleic Acids Res. 18:3227–3231.
HORAI, S., Y. SATTA, K. HAYASAKA, R. KONDO, T. INOUE, T. ISHIDA, S. HAYASHI, and N. TAKAHATA. 1992. Man’s place in Hominoidea revealed by mitochondrial DNA genealogy. J. Mol. Evol. 35:32–43.
IRWIN, D. M., R. T. WHITE, and A. C. WILSON. 1993. Characterization of the cow stomach lysozyme genes: repetitive DNA and concerted evolution. J. Mol. Evol. 37:355–366.
JUKES, T. H., and C. R. CANTOR. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. MUNRO, ed. Mammalian protein metabolism III. Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.
KLEIN, J., A. SATO, S. NAGL, and C. O’HUIGIN. 1998. Molecular trans-species polymorphism. Annu. Rev. Ecol. Syst. 29:1–21.
KOMINATO, Y., P. D. MCNEILL, M. YAMAMOTO, M. RUSSEL, S.-I. HAKOMORI, and F. YAMAMOTO. 1992. Animal histo-blood group ABO genes. Biochem. Biophys. Res. Commun. 189:154–165.
KORNEGAY, J. R., T. D. KOCHER, L. A. WILLIAMS, and A. C. WILSON. 1993. Pathways of lysozyme evolution inferred from the sequences of cytochrome b in birds. J. Mol. Evol. 37:367–379.
KRIENER, K., C. O’HUIGIN, H. TICHY, and J. KLEIN. 2000. Convergent evolution of major histocompatibility complex molecules in humans and New World monkeys. Immunogenetics 51:169–178.
KUMAR, S., and S. B. HEDGES. 1998. A molecular timescale for vertebrate evolution. Nature 392:917–920.
LAWN, R. M., K. SCHWARTZ, and L. PATTHY. 1997. Convergent evolution of apolipoprotein (a) in primates and hedgehog. Proc. Natl. Acad. Sci. USA 94:11992–11997.
MARTIN, R. D. 1993. Primate origins: plugging the gaps. Nature 363:223–234.
MIYAMOTO, M., B. F. KOOP, J. L. SLIGHTOM, M. GOODMAN, and M. TENNANT. 1988. Molecular systematics of higher primates: genealogical relations and classification. Proc. Natl. Acad. Sci. USA 85:7627–7631.
NEI, M., and S. KUMAR. 2000. Molecular evolution and phylogenetics. Oxford University Press, Oxford.
O’HUIGIN, C. 1995. Quantifying the degree of convergence in primate Mhc-DRB genes. Immunol. Rev. 143:123–140.
O’HUIGIN, C., A. SATO, and J. KLEIN. 1997. Evidence for convergent evolution of A and B blood group antigens in primates. Hum. Genet. 101:141–148.
ROGERS, J. 1993. The phylogenetic relationships among Homo, Pan and Gorilla: a population genetics perspective. J. Hum. Evol. 25:201–215.
RUVOLO, M. 1997. Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol. 14:248–265.
SARICH, V. M., and A. C. WILSON. 1967. Rates of albumin evolution in primates. Proc. Natl. Acad. Sci. USA 58:142–148.
SATTA, Y., J. KLEIN, and N. TAKAHATA. 2000. DNA archives and our nearest relative: the trichotomy problem revisited. Mol. Phylogenet. Evol. 14:259–275.
SCHWARTZ, J. H. 1987. The red ape. Orang-utans & human origins. Houghton Mifflin Company, Boston.
SIBLEY, C. G., J. A. COMSTOCK, and J. E. AHLQUIST. 1990. DNA hybridization evidence of hominoid phylogeny: a reanalysis of the data. J. Mol. Evol. 30:202–236.
SWANSON, K. W., D. M. IRWIN, and A. C. WILSON. 1991. Stomach lysozyme gene of the langur monkey: tests for convergence and positive selection. J. Mol. Evol. 33:418–425.
TAKAHATA, N. 1993. Allelic genealogy and human evolution. Mol. Biol. Evol. 10:2–22.
TAKAHATA, N. 1995. Mhc diversity and selection. Immunol. Rev. 143:225–247.
TAKAHATA, N., and Y. SATTA. 1997. Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl. Acad. Sci. USA 94:4811–4815.
TEUMER, J., and H. GREEN. 1989. Divergent evolution of part of the involucrin gene in the hominoids: unique intragenic duplications in the gorilla and human. Proc. Natl. Acad. Sci. USA 86:1283–1286.
WHITE, T. D., G. SUWA, and B. ASFAW. 1998. Australopithecus ramidus, a new species of early hominid from Aramis, Ethiopia. Nature 371:306–312.
WOLFE, K. H., P. M. SHARP, and W. H. LI. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337:283–285.
WOODWARD, E. R., A. BUCHBERGER, S. C. CLIFFORD, L. D. HURST, N. A. AFFARA, and E. R. MAHER. 2000. Comparative sequence analysis of the VHL tumor suppressor gene. Genomics 65:253–265.
WYCKOFF, G. J., W. WANG, and C. I. WU. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature 403:304–309.
ZHANG, J., H. ROSENBERG, and M. NEI. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708–3713.

NARUYA SAITOU, reviewing editor
Accepted May 2, 2002