Contribution of Homoplasy and of Ancestral Polymorphism to the
Evolution of Genes in Anthropoid Primates
Colm O’hUigin,*1 Yoko Satta,† Naoyuki Takahata,† and Jan Klein*
*Max-Planck-Institut für Biologie, Abteilung Immungenetik, Corrensstrasse 42, Tübingen, Germany; and †Department of
Biosystems Science, The Graduate University for Advanced Studies, Hayama, Kanagawa, Japan
Molecular phylogenies of lineages that split from one another in short succession are often difficult to resolve
because different loci and different sites within the same locus yield incongruent relationships. The incongruity is
commonly attributed to two causes: differential assortment of ancestral polymorphisms and homoplasy. To assess
the relative contribution of these two causes, sequences of 57 segments from 51 loci in six primate lineages (human,
chimpanzee, gorilla, orangutan, macaque, and tamarin, abbreviated as H, C, G, O, M, and T, respectively) were
subjected to “partitioning” analysis, in which phylogenetically informative sites were identified in all 15 pairwise
comparisons of each of the 57 segments and tallied for their support or lack thereof for each of the theoretically
possible phylogenies. The six lineages include one of the best known cases of a difficult-to-resolve phylogeny: the
trichotomy (H, C, G), in which the three lineages may have diverged from each other within a short period of time.
In this period many of the ancestral polymorphisms apparently persisted and yielded phylogenetically incongruent
signals. By contrast, no ancestral polymorphism is expected to have survived during the interval separating the
divergences of the O, M, and T lineages from the ancestor of the (H, C, G) group. Any phylogenetic incompatibilities
at sites in the O, M, and T lineages relative to the (H, C, G) group are therefore presumably the result of homoplasy.
The frequency of homoplasy estimated in this manner is unexpectedly high: 12% for the (H, C, G) clade and 19%
for the (H, C, G, O) clade. At least three-quarters of the 48% incompatibility observed in the (H, C) clade is
attributable to the sorting out of ancestral polymorphisms coupled with intragenic recombination. Possible reasons
for this high level of homoplasy in the O, M, and T lineages are discussed, and a computer simulation has been
carried out to produce a model explaining the observed data.
Introduction
Attempts to determine phylogenetic relationships
among closely allied taxa often yield discordant results
depending on the gene or even part of the gene used in
the reconstruction. As a consequence, phylogenies of
many groups of taxa remain unresolved. Perhaps the
best known example of a prolonged controversy now
tentatively resolved by a consensus based on sequences
of a large set of genes is the one involving the human
species, specifically the question of its closest living relative (Miyamoto et al. 1988; Bailey 1993; Rogers 1993;
Ruvolo 1997; Satta, Klein, and Takahata 2000). After
elimination of the orangutan (O), favored by some anthropologists and paleontologists (Schwartz 1987), the
issue came to be referred to as “the trichotomy problem,” the question of the relationship among three species, human (H), chimpanzee (C), and gorilla (G). The
consensus approach identifies the chimpanzee as the
nearest living relative of humans, but the evidence supporting this conclusion is not overwhelming. In the most
recent and largest study encompassing 45 loci and 47
kb of sequence (Satta, Klein, and Takahata 2000), the
consistency of the inferred relationship is rather poor.
Of the 174 sites that are informative regarding the relationship between H, C, G, and O, only 91 (52%) support the (H, C) clade. Almost half the sites support alternative phylogenetic relationships between the species, either the (H, G) or the (C, G) clade. Inconsistency in the inferred patterns of shared-derived
substitutions (incompatibility) is apparent both between
and within the loci of the three species comprising the
trichotomy.
1 Present address: Department of Genetics, Trinity College, Dublin, Ireland.
Abbreviations: C, chimpanzee; G, gorilla; H, human; M, macaque; NCBI, National Center for Biotechnology Information; NWM,
New World monkeys; O, orangutan; OTU, operational taxonomic unit;
OWM, Old World monkeys; PCR, polymerase chain reaction; T,
tamarin.
Key words: homoplasy, parallelism, convergent evolution, trichotomy, primates, polymorphism.
Address for correspondence and reprints: Colm O’hUigin, Max-Planck-Institut für Biologie, Abteilung Immungenetik, Corrensstrasse
42, D-72076 Tübingen, Germany. E-mail: [email protected].
Mol. Biol. Evol. 19(9):1501–1513. 2002
© 2002 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
The two main causes commonly invoked to explain
why different portions of the genome provide different
answers regarding the phylogenetic relationships within
a group of taxa are assortment of ancestral polymorphism
and homoplasy. In the former case, an ancestral population of species H, C, and G may contain two alleles, a
and b, at locus 1 and two other alleles, x and y, at locus
2. If, for example, at locus 1 the a allele is subsequently
fixed in species G, whereas allele b is fixed in species C
and H, C will be judged as the closest relative of H by
the analysis of this locus. If, on the other hand, allele x
at locus 2 is fixed in species C, whereas the y allele is
fixed in species G and H, G rather than C will appear to
be the closest relative of H. Similarly, at the nucleotide
level, two sites within a single gene may yield contradictory phylogenetic information if recombination takes
place between them and their polymorphism is differentially resolved among the species. The second major
cause of phylogenetic ambiguity, homoplasy (i.e., independently attained similarity at a site), is commonly differentiated into parallel evolution (similarity acquired
from the same ancestral condition) and evolutionary convergence (similarity attained from different ancestral conditions). Thus, for example, if a changes to b independently in G and H, whereas it remains unaltered in C,
species G will appear to be more closely related to H
than C, although in reality it may have diverged earlier
than C from the lineage leading to H.
Table 1
List of 57 Segments of 51 Loci Examined in the Present Study
                                                                                ORIGIN OF SEQUENCE
ABBREVIATION  GENE NAME                                    LOCATION      SIZE   Human      Chimpanzee Gorilla    Orangutan  OWM        NWM
ZFX           Zinc finger protein, X-linked                Xp22.2-p21.3  397    X59738     AB041907   AB041909   AB041911   AB041917   AB041921
ZNFN1A1       Zinc finger protein, subfamily 1A, member 1  7p12          554    U40462     AY091915   AY091916   AY091917   AY091918   AY091919
CXCR4         Chemokine (C-X-C) receptor 4                 2q21          1,003  AF025375   U89798     AF172232   AF172231   AF172224   AF178084
APOB          Apolipoprotein B: segment 1                  2p24          175    DS41109    DS41109    DS41109    DS41109    DS41109    DS41109
APOB          Apolipoprotein B100: segment 2               2p24          588    M14162     AY091920   AY091921   AY091922   AY091923   AY091924
ACAT2         TCP1-ACAT2 overlap                           6q25.3-q26    517    NT007122   AY091925   AY091926   AY091927   AY091928   AY091929
VHL           Von Hippel-Lindau tumor suppressor gene      3p26-p25      432    AF010238   (1)        (1)        (1)        (1)        (1)
NGFB          Nerve growth factor, beta subunit            1p13.1        630    X52599     AY091930   AY091931   AY091932   AY091933   AY091934
SCG2          Secretogranin II                             2q35-q36      1,011  M25756     AY091935   AY091936   AY091937   AY091938   AY091939
ADRB2         Adrenergic receptor, beta-2                  5q32-q34      1,136  J02960     AY091940   AY091941   AY091942   L38905     AY091943
UOX           Urate oxidase                                1p22          375    M27696     M69165     U03509     M69167     M69168     M69169
DRPLA         Dentatorubral pallidoluysian atrophy         12p13.31      478    D38529     AJ133270   AJ133271   AJ133272   AJ133274   AJ133275
PRNP          Prion protein                                20pter-p12    759    AF085477   U08296     U08300     U08305     U08311     U08304
NPPA          Natriuretic peptide precursor A              1p36.2        452    X01471     AY091944   AY091945   AY091946   AY091947   AY091948
ADRB3         Adrenergic receptor, beta-3                  8p12-p11.2    941    X72861     AY091949   AY091950   AY091951   U63591     AY091952
ZFY           Zinc finger protein, Y-linked: segment 1     Yp11.3        397    J03134     AB041908   AB041910   AB041912   AB041918   AB041922
ZFY           Zinc finger protein, Y-linked: segment 2     Yp11.3        693    U24118     U24117     U24119     X72698     X58931     X58936
DRD4          Dopamine receptor D4                         11p15.5       261    I12349     AF010294   AF010296   AF010298   AF125666   AF125669
RNASE6        Ribonuclease A family, 6                     14            453    U64998     AF037081   AF037088   AF037082   AF037089   AF037086
CCR5          Chemokine (C-C) receptor 5                   3p21          1,019  AF011500   AF011540   AF105291   AF075446   AF075450   AF161923
BRCA1         Breast cancer, type 1                        17q21         3,408  AF005068   AF019075   AF019076   AF019077   AF019078   AF019079
OXTR          Oxytocin receptor                            3p26.2        827    AC008151   AY091953   AY091954   AY091955   AY091956   AY091957
IFNG          Interferon gamma                             12q14         426    J00219     AF1647     AF1647     AF1647     L26024     X64659
ABO           ABO blood group                              9q34          568    (2)        (2)        (2)        (2)        (2)        AY091958
TNF           Tumor necrosis factor: segment 1             6q21.3        738    AP000505   AF195663   AF195664   AF195665   AF195667   AF195668
TNF           Tumor necrosis factor: segment 2             6q21.3        1,499  X02910     AY091959   AY091960   AY091961   AY091962   AY091963
BCYRAN1       Brain cytoplasmic RNA 1                      2p16          584    AF020057   AF067778   AF067779   AF067780   AF067784   AF067788
B2M           Beta-2-microglobulin                         15q21-q22     990    M17987     AY091964   AY091965   AY091966   AY091967   AY091968
LYZ           Lysozyme                                     12            447    M21119     U76912     U76913     U76914     X60236     U76922
FUT2          Fucosyl transferase 2                        19q13.3       993    AB004861   AF080603   AF080605   AF111935   AF080607   AF111936
PROC          Protein C                                    2q13-q14      381    U47685     U77647     U77648     U77650     U77651     U77649
C4B           Complement component 4B                      6p21.3        504    U07852     L38806     L38799     L38805     L38802     L38807
F9            Hemophilia B                                 Xq27.1-q27.2  1,028  K02053     AY091969   AY091970   AY091971   AY091972   AY091973
RHAG          Rhesus blood group–associated glycoprotein   6p21.1-p11    1,218  AF031548   AF177621   AF177622   AF177623   AF177625   AF132980
DMP1          Dental matrix acidic phosphoprotein 1        4q21          877    U89012     AY091974   AY091975   AY091976   AY091977   AY091978
IL16          Interleukin 16                               15q26.1       930    AF077011   AF007879   AY091979   AY091980   AF017107   AF017109
COX4          Cytochrome c oxidase, subunit IV             16q22-qter    437    U90915     AF042747   AF042750   AF042753   AF042759   AF042765
ODC1          Ornithine decarboxylase 1                    2p25          1,274  M81740     AY091981   AY091982   AY091983   AY091984   AY091985
DAF           Decay-accelerating factor for complement     1q32          655    AB003312   AY091986   AY091987   AY091988   AY091989   AY091990
POMC          Proopiomelanocortin                          2p23.3        490    J00291     AY091991   AY091992   AY091993   AY091994   AY091995
PAH           Phenylalanine hydroxylase                    12q24.1       1,137  AF204239   AY091996   AY091997   AY091998   AY091999   AY092000
LCAT          Lecithin cholesterol acyltransferase         16q22.1       1,412  X04981     AY092001   AY092002   AY092003   AY092004   AY092005
OPN1SW        Opsin 1, short wave                          7q31.3-q32    1,201  U53874     AF039433   AF039425   AY092006   AF158976   U53875
APOA1         Apolipoprotein A-I                           11q23-q24     1,147  J00098     AY092007   AY092008   AY092009   M83242     AY092010
AFP           Alpha-fetoprotein                            4q11-q13      575    M16110     U21916     M38272     AY092011   AY092012   AY092013
HBBP1         Hemoglobin, beta pseudogene 1                11p15.5       9,174  D56900     D56900     D56900     D56900     D56900     D56900
FPR1          Formyl peptide receptor 1                    19q13.4       1,005  M37128     X97745     X97736     X97735     X97734     AY092014
GHR           Growth hormone receptor                      5p13-p12      1,446  AF155912   AF209082   AF211186   AF211185   AF209081   AF209080
HBG1          Hemoglobin, gamma globin A                   11p15.5       8,932  D510763    D510763    D510763    D510763    D510763    D510763
EPO           Erythropoietin                               7q22          1,335  M11319     AY092015   AY092016   AY092017   AY092018   AY092019
PRM2          Sperm protamine P2                           16p13.2       539    AF215713   AF215711   AF215712   AF215714   X71338     X71335
MSH2          Mismatch repair enzyme MSH2                  2p22-p21      558    U23824     AJ002051   AJ002050   AJ002052   AJ002049   AJ002053
SRY           Sex-determining region Y                     Yp11.3        612    L10101     X86380     X86382     X86383     X86385     X86386
IL3           Interleukin 3                                5q31.1        1,382  L43402     AY092020   AY092021   AY092022   X51890     X74878
INS           Insulin                                      11p15.5       860    V00565     X61089     AY092023   AY092024   X61092     J02989
RNASE3        Ribonuclease A family, 3                     14q24-q31     477    X16545     U24103     U24097     U24101     U24098     U24099
RNASE2        Ribonuclease A family, 2                     14q24-q31     476    M24157     U24102     U24100     U24104     U24096     U24099
NOTE.—Wherever available, locus names and abbreviations are used according to the Human Genetic Nomenclature Committee as listed in the NCBI. Chromosomal location is taken from the NCBI database. The size of
the aligned segments, after exclusion of indels, as well as initiation and stop codons, is given in base pairs. The origin of the sequences is indicated by accession codes in the GenBank database or literature references for the six
OTUs represented in the study. The references are Kominato et al. (1992) (2) and Woodward et al. (2000) (1). OWM, Old World monkeys; NWM, New World monkeys.
The extent to which ancestral polymorphism and
homoplasy contribute to the obfuscation of a phylogenetic relationship is not known. In most molecular phylogenetic reconstructions, attempts are made to take homoplasy into account by correcting the observed sequence for presumed hidden substitutions with the help
of one of the correction formulas available (Nei and Kumar 2000, pp. 33–50). The underlying assumptions of
all these formulas are stochasticity of the evolutionary
process at the molecular level and neutrality of the substitutions. The formulas differ in the extent to which
they take into account various factors that may influence
the stochasticity of the process, such as the ratio of transitions to transversions or the four-nucleotide content of
the sequence.
Here we attempt to actually measure the extent to
which ancestral polymorphism and homoplasy influence
phylogenetic reconstruction. To this end, we use a large
collection of primate sequences, one half of which we
obtained in our Tübingen laboratory and the other half
from databases. The collection was assembled for a variety of purposes, the estimate of the relative influence
of ancestral polymorphism and homoplasy on phylogenetic reconstruction being one of them. The data set includes sequences of human, chimpanzee, gorilla, and
orangutan, as well as representative species of Old
World monkeys (OWM) and New World monkeys
(NWM). It thus covers a range of divergence times extending from 5 MYA (the human-chimpanzee split;
White, Suwa, and Asfaw 1998) to nearly 50 MYA (the
Platyrrhini-Catarrhini split dated by Kumar and Hedges
[1998] to 47.6 ± 8.3 MYA). It is this wide span of
evolutionary time that allows us to use the data set for
the present purpose. The expectation is that the degree
to which ancestral polymorphism and homoplasy obscure phylogenetic relationships depends on the particular time frame of the evolutionary process. To understand the reason for this dependence, consider two time
intervals, one encompassing the period during which
three closely related species lineages diverged from one
another (e.g., G from [H, C], followed by the divergence
of H from C) and the second covering the period from
the first divergence (i.e., G from [H, C]) to the present
time. In the first case, we must take into account that
the resolution of ancestral neutral polymorphisms in a
population consisting of 10^5 breeding individuals may
take up to 3 Myr (Takahata 1993; Takahata and Satta
1997). Hence, if the interval between the first and the
second divergence was <3 Myr (as it probably was in
the case of the H, C, and G lineages), then the resolution
of the ancestral polymorphism can be expected to have
confounded the phylogeny of the three lineages. On the
other hand, if the interval between the first and the second divergences was >3 Myr (as was the case for the
divergences of OWM, NWM, and ape lineages), the resolution of the ancestral polymorphism should not have
had any confounding effect. As for the interval from the
first divergence to the present, the length of the divergence time determines how much homoplasy can be expected. Because homoplasy at the molecular level generally involves more than one substitution at a site and
because in stochastic processes two hits at a site are
more probable in a long time interval than in a short
one, in the short interval, the frequency of homoplasy
can be expected to be negligibly low, unless the substitution rate is very high (Takahata 1995). Therefore, homoplasy may have confounded the phylogenetic relationship among some of the NWM, OWM, and ape
genes, but it might not have influenced the phylogeny
of the human, chimpanzee, and ape genes. The objectives of the present study were to estimate the degree of
phylogenetic incompatibility for clades in which only
homoplasy could be the cause, to infer the mechanisms
by which homoplasy arises, and to determine the importance of homoplasy in phylogenetic reconstruction.
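The dependence of multiple hits on interval length can be illustrated with a small calculation. The sketch below assumes that substitutions at a site follow a Poisson process with mean d (an assumption made here for illustration only) and compares a short and a long branch; the function name and the two d values are hypothetical and are not estimates from the present data.

```python
import math


def prob_multiple_hits(d):
    """Probability of at least two substitutions at a site.

    Assumes the number of hits is Poisson distributed with mean d
    (expected substitutions per site); this is an illustrative
    assumption, not a model taken from the text.
    """
    return 1.0 - math.exp(-d) * (1.0 + d)


if __name__ == "__main__":
    short_branch = 0.0045  # hypothetical per-site divergence, short interval
    long_branch = 0.05     # hypothetical per-site divergence, long interval
    p_short = prob_multiple_hits(short_branch)
    p_long = prob_multiple_hits(long_branch)
    print(f"P(>=2 hits), short interval: {p_short:.3e}")
    print(f"P(>=2 hits), long interval:  {p_long:.3e}")
    print(f"ratio long/short: {p_long / p_short:.1f}")
```

For small d the probability grows roughly as d squared over two, so a roughly 10-fold longer interval raises the chance of a double hit by about two orders of magnitude.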
Materials and Methods
The Data Set
The collection of sequences used in the present
study comprised orthologous genes at 51 loci in species
representing the major groups of anthropoid primates:
human, African and Asian great apes, OWM and NWM
(table 1). The apes were represented by the common
chimpanzee (Pan troglodytes), the pygmy chimpanzee
(P. paniscus), lowland gorilla (Gorilla gorilla), and
orangutan (Pongo pygmaeus); the OWM by the bear
macaque (Macaca arctoides), the rhesus macaque (M.
mulatta), the crab-eating macaque (M. fascicularis), the
Japanese macaque (M. fuscata), the gelada baboon
(Theropithecus gelada), the yellow baboon (Papio cynocephalus), the patas monkey (Erythrocebus patas),
and the green monkey (Cercopithecus aethiops); and the
NWM by the cotton-top tamarin (Saguinus oedipus), the
golden-mantled tamarin (S. tripartitus), the common
marmoset (Callithrix jacchus), the black tufted-ear marmoset (C. penicillata), the black-capped capuchin (Cebus apella), the common squirrel-monkey (Saimiri sciureus), the Bolivian squirrel monkey (S. boliviensis), the
northern night monkey (Aotus trivirgatus), the southern
night monkey (A. azarae), the red howler monkey (Alouatta seniculus), and the spider monkey (Ateles sp.).
The human loci came from different chromosomes, and
they represented different functional categories, from
ubiquitously expressed housekeeping genes to genes restricted in their expression to specific tissues. The orthology of the loci was checked by phylogenetic analysis which led to the exclusion of four of the original
55 loci: HLA-G, RNR1 (ribosomal RNA1, 28S), and
MICA, on grounds of possible paralogy within multigene families, and IVL (involucrin) because of a complicated mode of evolution (Teumer and Green 1989).
The extent of sequence variability caused by either polymorphism or polymerase chain reaction (PCR) and sequencing errors was estimated by independently amplifying and resequencing segments of nine of the 51
genes. Human sequences were checked for accuracy by
comparing them with the corresponding segments of the
human genome database entries. If conflicts occurred,
sequences that differed from other primate sequences by
the fewest substitutions were used. When more than one
sequence from a particular primate lineage was available
in the databases, sequences from cotton-top tamarin,
bear macaque, and common chimpanzee were chosen
for consistency. In other cases, the sequence of the nearest available relative from the same lineage was used.
For simplicity, we refer to the representatives of the individual lineages as operational taxonomic units
(OTUs). Throughout the text, the human, chimpanzee,
gorilla, orangutan, macaque (OWM), and tamarin
(NWM) lineages (OTUs) are abbreviated as H, C, G, O,
M, and T, respectively. Some genes (UOX, FPR1,
HBG1) are functional in certain lineages but have become inactivated in others. In such cases, the gene is
treated according to its functional state in the majority
of the OTUs.
Partitioning Analysis
Sequences were aligned by eye; only in the case of
the globin genes and ApoB segment 1 was an alignment
obtained from the databases. All variable sites were then
identified and classified as two-base, three-base, or fourbase sites according to the number of nucleotide types
found at each of them in the different species. Following
Satta, Klein, and Takahata (2000), each variable site was
classified as consisting of singletons, doubletons, and
tripletons depending on whether a variant nucleotide occurred in one, two, or three of the six OTUs, respectively. This classification is sufficient to adequately and
unambiguously describe the configuration of a site of
six OTUs; it is unnecessary to count quatratons (nucleotide shared by four OTUs), pentatons (sharing by five
OTUs), or hexatons (an invariant nucleotide) at a site
because these are already incorporated in the description
of the site in terms of singletons, doubletons, and tripletons. For example, the sharing of a nucleotide by five
sequences at a site (a pentaton) is already inferred by
the observation that such a site consists of one singleton,
no doubletons, and no tripletons. A quatraton is inferred
either from the presence of two singletons, no doubletons, and no tripletons or from the presence of no singletons, one doubleton, and no tripletons. Consequently,
groupings above the level of tripleton are redundant in
the classification of a site consisting of six OTUs. The
total number of different singletons, doubletons, and tripletons that the six species could theoretically yield is
6, 15, and 20, respectively. The partitioning pattern of
a given site provides information about OTUs because
a shared character is more likely to be derived from a
single mutation at the stem of two branches than from
two independent mutational events.
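The classification just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name classify_site, the OTU order, and the example site are hypothetical, and the handling of a two-base site split three against three (both halves are reported as tripletons) is a choice made here that may differ from the convention used for the published tallies.

```python
from itertools import combinations

# OTU order assumed for every aligned site: human, chimpanzee, gorilla,
# orangutan, macaque, tamarin.
OTUS = ("H", "C", "G", "O", "M", "T")


def classify_site(bases):
    """Classify one aligned site of six OTUs.

    Returns (number of distinct bases, singletons, doubletons, tripletons),
    where each partition is named by the OTUs sharing the nucleotide.
    Groups of four or more OTUs are not listed because they are implied
    by the smaller groups.
    """
    if len(bases) != len(OTUS):
        raise ValueError("expected one nucleotide per OTU")
    groups = {}
    for otu, base in zip(OTUS, bases):
        groups.setdefault(base, []).append(otu)
    by_size = {1: [], 2: [], 3: []}
    for members in groups.values():
        if len(members) in by_size:
            by_size[len(members)].append("".join(members))
    return len(groups), by_size[1], by_size[2], by_size[3]


def n_partitions(k):
    """Number of distinct partitions of size k among the six OTUs."""
    return len(list(combinations(OTUS, k)))


if __name__ == "__main__":
    # Hypothetical site: t in H and C, c in G, O, and M, a in T.
    print(classify_site("ttccca"))
    # Theoretical numbers of singletons, doubletons, and tripletons.
    print([n_partitions(k) for k in (1, 2, 3)])
```

The hypothetical site is a three-base site containing one singleton (T), one doubleton (H C), and one tripleton (G O M); the partition counts reproduce the 6, 15, and 20 quoted above.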
To explain the system of partitioning, consider a
site occupied by nucleotides g, c, c, a, a, and a in the
OTUs H, C, G, O, M, and T, respectively. (To avoid
confusion between nucleotide and OTU designations,
here and subsequently we use italicized lowercase letters
for the former and roman type uppercase letters for the
latter.) The site contains one singleton because the nucleotide g occurs only in H and in no other OTU. The
site also contains a doubleton because it is occupied by
a c in both C and G but in no other OTU. Finally, the
site also contains one tripleton because it is occupied by
an a in O, M, and T but by a different nucleotide in the
remaining three OTUs. If we did not know the root of
the tree, this partitioning pattern would suggest that C
and G share a common ancestor, as do O, M, and T (so
that C, G, and H would share a common ancestor as
well). But because we know the root, we can explain
the observed pattern by assuming a single mutation in
the stem of the C, G, and H branches. The partitioning
of sites is then taken one step further. We notice that the
chosen site has a g in H but that it is occupied by two
different nucleotides in the other OTUs, c in C and G
but a in O, M, and T. Altogether, the site is occupied by
three different nucleotides in the six OTUs, and so it is
classified as a three-base site. The sole singleton is classified as a three-base singleton. Similarly, the site contains a three-base doubleton and a three-base tripleton.
If the site were occupied by g, a, a, a, a, and a in H,
C, G, O, M, and T, respectively, it would consist of a
two-base singleton, no doubletons, and no tripletons.
The classification was used to provide support for
or against the arrangements of the OTUs into specific
clades. Singletons are not phylogenetically informative
and do not provide support for or against particular
clades. Doubletons and tripletons that support the separation of OTUs into clades consistent with the consensus phylogeny (the one depicted in fig. 1) are called
compatible. Doubletons and tripletons that support separation into clades inconsistent with the consensus phylogeny are called incompatible. A site can contain more
than one doubleton or tripleton and so can be informative about more than one clade in the phylogeny. Because the number of variable sites in a data set is a
function of the evolutionary rate and the total branch
length of the phylogeny, the number of homoplasies is
expected to increase with an increase in these two parameters. The estimated degree of homoplasy can therefore be expected to vary according to the species chosen
as an out-group and the interval from the first divergence to the present time.
FIG. 1.—The simulated phylogeny of six species. The values t5
to t1 correspond to the divergence times of branches leading to tip
sequences of T, M, O, G, C, and H, respectively. A0 is an ancestral
sequence, and N1 to N4 are node sequences. The values d1 to d9 are
the per-site nucleotide divergences of the individual branches.
Computer Simulation
To examine the effects of mutation rate and nucleotide composition on the frequency of homoplasy, computer simulations were undertaken. The divergence
times of the H, C, G, O, M, and T lineages were designated as t1 to t5 (fig. 1). Under a given mutation rate
μ (per site per unit time), the expected number of nucleotide substitutions per site on branches leading to the
different OTUs was based on actual sequence data (see
Results). The values used were d5 = t5μ = 0.05, d4 =
t4μ = 0.027, d3 = t3μ = 0.0125, d2 = t2μ = 0.005, and
d1 = t1μ = 0.0045 for the T, M, O, G, and C or H
lineages, respectively. For each OTU, a single sequence
was generated in each replication. The simulation began
with an ancestral sequence, A0, composed of 1,000
identical nucleotides (all a’s). To generate a representative sequence of OTU T, for each of the 1,000 sites, a
single uniform variable v was generated. If v was smaller than d5, a substitution was introduced at the site. The
identity of the introduced nucleotide was determined by
using the four-state model, in which a nucleotide has an
equal probability of changing to any of the three remaining nucleotides (an assumption underlying the
Jukes-Cantor model; see Jukes and Cantor 1969), or the
two-state model, in which an a can change only to t (or
vice versa) and g can only change to c (or vice versa);
such a situation can occur in extremely at- or gc-rich
regions of the genome. If, however, the simulation is
started with a’s only, the g↔c change can never happen.
The two-state model also covers the case in which transitions and transversions have very different rates. The
simulation assumes only two bases, a and t or g and c.
If instead of these two bases, two different bases, such
as a and g, are considered, the simulation does not
change in principle and the a and g model corresponds
to the extreme case of a strong substitutional bias. To
simulate the divergence of lineages, four node sequences
(N1 to N4) and six tip sequences (T, M, O, G, C, and
H) were generated (fig. 1). For example, N1 was generated from A0 with a substitution probability of d6 =
d5 − d4 at each site in the same way as above. From
N1, the N2 node sequence and the M tip sequence were
generated by nucleotide substitutions with probabilities
of d7 = d4 − d3 and d4, respectively. The extent of
incompatibility of the generated tip sequences was then
examined by the same method as that applied to the
actual data, and the extent of compatibility of the (H,
C), (H, C, G), and (H, C, G, O) clades was estimated
from the simulated sequences. To obtain the distribution
of the extent of incompatibility, 10,000 incompatibility
values for each clade were generated, and the proportion
of incompatibility values that fell in a particular range
was calculated. The range was defined as one of 20 divisions
of an interval extending from 0 to 1.0. To examine the effect of an increased mutation rate, a 10
times higher mutation rate was obtained by increasing
the di (for i = 1 to 5) values 10-fold. An “incompatibility value” was defined as the proportion of sites incompatible with the proposed phylogeny relative to all
sites informative with respect to this phylogeny. For example, consider the (H, C, G) phylogeny: if there are x,
y, and z sites supporting (H, C) G, (H, G) C, and (C,
G) H phylogenies, respectively, then the incompatibility
value for the (H, C, G) phylogeny is given by (y + z)/(x + y + z).
Incompatibility values for (H, C, G, O) and
(H, C, G, O, M) phylogenies can be calculated in a
similar way.
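The simulation scheme and the incompatibility value can be sketched as follows. This is a minimal illustration, not the program used in the study: only the four-state model is implemented, the function names are hypothetical, the internal branch values d8 = d3 − d2 and d9 = d2 − d1 are assumed by analogy with d6 and d7, and counts are pooled over replicates instead of building the distribution of per-replicate values.

```python
import random

ORDER = "HCGOMT"


def evolve(seq, d, rng, states="acgt"):
    """Copy seq, substituting each site with probability d.

    Four-state model: a substituted site changes to one of the three
    other nucleotides with equal probability.
    """
    out = []
    for base in seq:
        if rng.random() < d:
            base = rng.choice([s for s in states if s != base])
        out.append(base)
    return out


def simulate_tips(rng, n_sites=1000, d=(0.0045, 0.005, 0.0125, 0.027, 0.05)):
    """Generate one replicate of the six tip sequences from A0 (all a's)."""
    d1, d2, d3, d4, d5 = d
    a0 = ["a"] * n_sites
    tips = {"T": evolve(a0, d5, rng)}
    n1 = evolve(a0, d5 - d4, rng)   # d6
    tips["M"] = evolve(n1, d4, rng)
    n2 = evolve(n1, d4 - d3, rng)   # d7
    tips["O"] = evolve(n2, d3, rng)
    n3 = evolve(n2, d3 - d2, rng)   # d8, assumed by analogy with d6 and d7
    tips["G"] = evolve(n3, d2, rng)
    n4 = evolve(n3, d2 - d1, rng)   # d9, assumed by analogy with d6 and d7
    tips["C"] = evolve(n4, d1, rng)
    tips["H"] = evolve(n4, d1, rng)
    return tips


def hcg_support(tips):
    """Count sites at which exactly one pair among H, C, G shares a
    nucleotide found in no other OTU (doubletons H C, H G, and C G)."""
    counts = {"HC": 0, "HG": 0, "CG": 0}
    for column in zip(*(tips[o] for o in ORDER)):
        site = dict(zip(ORDER, column))
        for pair in counts:
            a, b = pair
            shared = site[a]
            others = [o for o in ORDER if o not in pair]
            if site[b] == shared and all(site[o] != shared for o in others):
                counts[pair] += 1
    return counts


def incompatibility(x, y, z):
    """Incompatibility value (y + z) / (x + y + z); NaN if no informative sites."""
    total = x + y + z
    return (y + z) / total if total else float("nan")


if __name__ == "__main__":
    rng = random.Random(2002)  # arbitrary seed for repeatability
    pooled = {"HC": 0, "HG": 0, "CG": 0}
    for _ in range(200):       # far fewer replicates than the 10,000 in the text
        for pair, n in hcg_support(simulate_tips(rng)).items():
            pooled[pair] += n
    print(pooled, incompatibility(pooled["HC"], pooled["HG"], pooled["CG"]))
```

As a check on the formula alone, doubleton totals of 46 (H C), 14 (H G), and 29 (C G) give an incompatibility value of 43/89, or about 0.48.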
Results and Discussion
Characteristics of the Data Set
The data set used in the final analysis comprised
57 segments of 51 genes in six OTUs and hence 306
sequences altogether. The length of the sequences varied
depending on the gene. The total length of the concatenated sequences, after the removal of gaps and initiation and stop codons, was 62,533 bp. Fifty-four thousand seventy-three of the total number of sites (86.3%)
were invariant, and 1,402 (15.2%) of the 8,457 (13.7%)
variable sites were phylogenetically informative. At the
variable sites, singletons were present in similar numbers in H (287), C (256), and G (308) sequences. In the
O, M, and T sequences, the number of singletons increased to 684, 1,705, and 4,189, respectively, corresponding to their increasingly greater phylogenetic distance from the other OTUs.
To estimate the extent to which the interspecies
comparisons might be influenced by either intraspecies
polymorphism or by errors in sequence determination
(either during PCR amplification or during sequencing),
nine randomly chosen gene segments were reamplified
and resequenced from all or nearly all the nonhuman
OTUs. Comparison of the “old” and the “new” sequences (a total of about 30 kb in length) revealed 44
differences. The number of differences varied from gene
to gene, being highest in APOA1 (12 differences in a
total of 4.8 kb of sequence from four OTUs) and lowest
in POMC (two differences in a total of 2.4 kb of sequence from four OTUs). The mean was 1.5 differences
per kilobasepair of sequence, all the differences being
singletons (i.e., they were not shared with any other sequence in the alignment). Significantly, all the incompatible sites included in the resequenced set could be
confirmed.
Partitioning Analysis to Identify the Nearest Living
Relative of the Human Species
In a partitioning analysis, the sites at which differences occur between the studied OTUs are considered
individually in terms of their support or the lack thereof
for a particular phylogeny. Initially, the differential sites
are classified into singletons, doubletons, or tripletons
for each of the six OTUs separately or for the various
combinations of the OTUs, as described in Materials and Methods (table 2). In the next step, the 41 partitions
that are theoretically possible with a set of six OTUs are further classified into two-, three-, or four-base
categories (see Materials and Methods). Phylogenetically informative partitions are then identified, their
deposition regarding individual phylogenies is noted, and the sites supporting a particular phylogeny are
tallied. Because a larger number of phylogenies are possible for clades with a greater number of OTUs, only
those partitions that informatively group one OTU of the (H, C, G), (H, C, G, O), or (H, C, G, O, M)
phylogenies with the appropriate out-group (O, M, or T, respectively) are considered in estimating the
extent of incompatibility for each clade.

Table 2
Partition of Variable Sites in the 51 Loci from Six OTUs

Partition        Total   Two-Base   Three-Base   Four-Base
Singletons
  H                287        257           25           5
  C                257        225           31           1
  G                308        274           30           4
  O                685        598           86           1
  M              1,705      1,498          202           5
  T              4,188      3,909          271           8
Doubletons
  H C               46         43            3           0
  H G               14         11            3           0
  H O                6          5            1           0
  H M               14         13            1           0
  H T               31         30            1           0
  C G               29         24            5           0
  C O                8          6            2           0
  C M                8          8            0           0
  C T               15         13            2           0
  G O                6          6            0           0
  G M               17         16            1           0
  G T               31         29            2           0
  O M               85         60           25           0
  O T               88         79            9           0
  M T              660        638           22           0
Tripletons
  H C G            332        291           41           0
  H C O             21         14            7           0
  H C M              6          5            1           0
  H C T              8          8            0           0
  H G O             14         12            2           0
  H G M              6          3            3           0
  H G T              5          4            1           0
  H O M              3          3            0           0
  H O T              3          2            1           0
  H M T              7          6            1           0
  C G O             10          6            4           0
  C G M              1          0            1           0
  C G T              1          0            1           0
  C O M              0          0            0           0
  C O T              2          0            2           0
  C M T              0          0            0           0
  G O M              3          0            3           0
  G O T              0          0            0           0
  G M T              0          0            0           0
  O M T              3          0            3           0

NOTE.—Classification of sites is explained in the text. The numbers of sites
falling into the individual categories are given.
Table 2 displays the number of sites that fall into
the individual partitions into which the differential sites
in the alignment of the sequences from the six OTUs
can be classified. The partitions differ in their phylogenetic informativeness. Singletons are uninformative
about phylogenetic relationships among the species;
only sites that contain at least two types of nucleotides,
each type being shared by at least two OTUs, are considered to be phylogenetically informative. All the 15
possible doubletons, regardless of the number of nucleotide types found at the site, can be phylogenetically
informative. By contrast, only the two- and three-base
categories of the tripletons are phylogenetically informative because a four-base tripleton in reality consists
of one tripleton and three singletons.
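The informativeness criterion stated above (at least two nucleotide types, each shared by at least two OTUs) can be sketched in Python as follows. This is a minimal illustration, not the authors' program; the column order H, C, G, O, M, T and the function names are assumptions made for the example, and the alignment columns are invented.

```python
from collections import Counter

OTUS = "HCGOMT"  # assumed column order for this sketch


def is_informative(column):
    """True if at least two nucleotide types are each shared by two or more OTUs."""
    counts = Counter(column)
    shared = [base for base, n in counts.items() if n >= 2]
    return len(shared) >= 2


def describe(column):
    """Return the number of nucleotide types and the OTU group carrying each type."""
    groups = {}
    for otu, base in zip(OTUS, column):
        groups.setdefault(base, "")
        groups[base] += otu
    return len(groups), groups


# Invented example columns, one character per OTU in the order H, C, G, O, M, T.
examples = {
    "singleton (T)": "AAAAAG",
    "two-base doubleton (H C)": "AAGGGG",
    "three-base tripleton (H C G)": "AAAGGC",
    "four-base tripleton (H C G)": "AAAGCT",
}
for label, column in examples.items():
    n_bases, groups = describe(column)
    print(label, n_bases, groups, is_informative(column))
```

The four-base tripleton is reported as uninformative because only one nucleotide type is shared; the remaining three OTUs each behave as singletons, as noted in the text.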
Altogether, 1,402 sites were found to be phylogenetically informative, and of these, approximately 90%
provided information regarding the grouping of the
OTUs under consideration here. The remaining 10%
gave information on groupings not relevant to the study;
for example, the grouping of O with M or of C, G, and
M. Of the 89 sites that are informative about the H, C,
and G relationship, 46 sites (52%) were found to support
the (H, C) clade excluding all the other OTUs, a value
similar to that found by Satta, Klein, and Takahata
(2000). Of these, 43 sites were of the two-base type, and
3 sites were of the three-base type. Fourteen sites (11
of the two-base type and 3 of the three-base type) supported the (H, G) clade, and 29 sites (24 of the two-base and 5 of the three-base type) supported the (C, G)
clade. Thus, the results of the partitioning analysis uphold the conclusion reached in an earlier study with a
different data set (Satta, Klein, and Takahata 2000),
namely, that the chimpanzee is the nearest living relative
of Homo sapiens. At the same time, however, the high
proportion of incompatibility between the phylogenetically informative sites (with 48% of the sites supporting
alternative phylogenies) indicates that sorting out of ancestral polymorphisms, homoplasy, or both have blurred
the phylogenetic signals that might have otherwise indicated clearly the disjunction of the H, C, and G
lineages.
Dissociation of Ancestral Polymorphism from
Homoplasy
The results described in the preceding section indicate that the gorilla lineage diverged from the lineage
leading to the common ancestor of human and chimpanzee before these last two species (lineages) diverged
from each other. The interval between these two divergences was apparently relatively short, probably not
more than 1–3 Myr, well within the range of persistence
of ancestral polymorphism in a large population (Takahata 1995). To estimate to what degree homoplasy might
have contributed to the blurring of phylogenetic signals
during this interval, it is necessary to extend the partitioning analysis by including more distantly related lineages (O, M, T) into it. Both the paleontological (Martin
1993) and molecular (Sarich and Wilson 1967; Sibley,
Comstock, and Ahlquist 1990; Horai et al. 1992) data
indicate that the orangutan lineage diverged from the
lineage leading to the common ancestor of human,
chimpanzee, and gorilla 12–15 MYA. Because the human and chimpanzee lineages diverged from each other
5 MYA, and the gorilla lineage diverged from the lineage of the (H, C) ancestor not more than 8 MYA (Horai
et al. 1992), an interval of >4 Myr separated the divergence of the orangutan lineage from that of the (H, C,
G) lineage. This interval is too long for any ancestral
polymorphism (except that maintained by balancing selection; see Klein et al. 1998) to survive, and so any
incompatibilities between phylogenetically informative
sites in the analysis of the (H, C, G, O) phylogeny
should be attributable to homoplasy.
The inclusion of the O OTU in the partitioning
analysis revealed the existence of 332 sites that support
the (H, C, G) clade excluding O with M as the out-group
(table 3). Forty-five sites are inconsistent with this
grouping in that they include O in the clade and exclude
H (10 sites), C (14 sites), or G (21 sites). Thus, the (H,
C, G) clade is supported by 88% of the informative sites,
with the remaining 12% of sites supporting alternative
phylogenies. The incompatibilities of the latter sites are
presumably the result of homoplasy in the lineages leading to M on the one hand and to the (H, C, G, O) lineage
on the other. The increase in the proportion of sites that
support the standard phylogeny from 52% for the (H,
C) clade to 88% for the (H, C, G) clade is presumably
a reflection of the corresponding decrease in the contribution of ancestral polymorphism to the evolution of the
two clades.
Similarly, the (H, C, G, O) grouping excluding M
and T is supported by 638 (81%) of the 789 informative
sites, the remaining 19% of sites supporting alternative
groupings: (H, C, G, M), 10%; (H, C, M, O), 3.6%; (H,
M, G, O), 1.6%; or (M, C, G, O), 3.8%. Here, homoplasy can be assumed to have affected 19% of the informative sites. The increase in homoplasy from 12%
for the (H, C, G, O) lineage to 19% for the (H, C, G,
O, M) lineage is as expected, taking into account the
increased divergence time of the out-group T to the latter group (>45 Myr; Martin 1993) in comparison with
that of the out-group M to the former group (~30 Myr;
Martin 1993) and assuming that the probability of substitution is a function of time. By assuming that maximally 12% of the informative sites have also been influenced by homoplasy during the evolution of the (H,
C, G) group, we estimate that in maximally one-quarter
of the 48% incompatible sites found in this group, the
incongruence with the consensus phylogeny is the result
of homoplasy. The incongruence of the remaining three-quarters of incompatible sites is presumably caused by
the segregation of ancestral polymorphisms.
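The one-quarter estimate can be restated as a short worked calculation. It assumes, as in the text, that homoplasy affects at most 12% of the informative sites in the (H, C, G) group and that 48% of the informative sites in this group are incompatible with the consensus phylogeny.

```latex
\[
\frac{\text{incompatibility due to homoplasy (upper bound)}}{\text{total incompatibility}}
  = \frac{0.12}{0.48} = 0.25,
\qquad
1 - 0.25 = 0.75 .
\]
```

Under these assumptions, at most one-quarter of the incompatible sites is attributable to homoplasy and at least three-quarters to the segregation of ancestral polymorphisms.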
Possible Reasons for the High Level of Homoplasy
The observation that 19% and 12% of the phylogenetically informative sites have undergone homoplasious
substitutions during the time interval between the divergence of the Platyrrhini from the Catarrhini lineage
and of the OWM from the ape lineage, respectively, is surprising and unexpected. The common perception,
supported by computer simulations based on the standard models of molecular evolution (see below), is that
homoplasies in intervals of these lengths are rare, on the order of a few percent at most. Even in cases in
which intense selection is known or suspected to drive the substitution process (Takahata 1995) or in which
functional convergences at the molecular level are postulated (Swanson, Irwin, and Wilson 1991; Irwin, White,
and Wilson 1993; Lawn, Schwartz, and Patthy 1997), homoplasy is believed to be an exception rather than a rule.
The following question therefore arises: What might be the reason for the high homoplasy found in the primate
lineages? In what follows, we consider three possible answers to this question.

Table 3
Partitioning of Informative Sites in the 51 Loci Divided into Coding and Noncoding Regions

Sites / Partition              All   Two-Base   Three-Base   Four-Base
(A) All (62,533 bp)
  (H, C) (G, O)                 46         43            3           0
  (H, G) (C, O)                 14         11            3           0
  (C, G) (H, O)                 29         24            5           0
  Total                         89
  (H, C, G) (O, M)             332        291           41
  (H, C, O) (G, M)              21         14            7
  (H, G, O) (C, M)              14         12            2
  (C, G, O) (H, M)              10          6            4
  Total                        377
  (H, C, G, O) (M, T)          638        638
  (H, C, G, M) (O, T)           79         79
  (H, C, M, O) (G, T)           29         29
  (H, M, G, O) (C, T)           13         13
  (M, C, G, O) (H, T)           30         30
  Total                        789
(B) Coding (29,451 bp)
  (H, C) (G, O)                 15         14            1           0
  (H, G) (C, O)                  5          4            1           0
  (C, G) (H, O)                  7          5            2           0
  Total                         27
  (H, C, G) (O, M)             129        112           17
  (H, C, O) (G, M)               8          4            4
  (H, G, O) (C, M)               6          4            2
  (C, G, O) (H, M)               5          2            3
  Total                        148
  (H, C, G, O) (M, T)          255        255
  (H, C, G, M) (O, T)           31         31
  (H, C, M, O) (G, T)           10         10
  (H, M, G, O) (C, T)            3          3
  (M, C, G, O) (H, T)           13         13
  Total                        312
(C) Noncoding (32,921 bp)
  (H, C) (G, O)                 31         29            2           0
  (H, G) (C, O)                  9          7            2           0
  (C, G) (H, O)                 22         19            3           0
  Total                         62
  (H, C, G) (O, M)             203        179           24
  (H, C, O) (G, M)              13         10            3
  (H, G, O) (C, M)               8          8            0
  (C, G, O) (H, M)               5          4            1
  Total                        229
  (H, C, G, O) (M, T)          382        382
  (H, C, G, M) (O, T)           48         48
  (H, C, M, O) (G, T)           19         19
  (H, M, G, O) (C, T)           10         10
  (M, C, G, O) (H, T)           17         17
  Total                        476

NOTE.—The table shows the number of phylogenetically informative sites supporting each partition. OTUs forming a
clade are included in parentheses. The sites are classified according to the system explained in the text.
The first possibility is that the observed homoplasy
is a manifestation of selection pressure for convergence
in function. Many of the studied genes (ABO, RHAG,
PRM2, RNASE3, SRY) are known or postulated to be
under moderate-to-strong selection pressure (O’hUigin,
Sato, and Klein 1997; Zhang, Rosenberg, and Nei 1998;
Wyckoff, Wang, and Wu 2000). Could this pressure be
responsible for the high homoplasy? To test this possibility, we divided the data set into coding and noncoding
subsets and carried out the partitioning analysis separately on the two subsets (table 3, parts B and C). Of
the 29,451 coding sites of the 51 loci, 3,018 (10%) were
found to be variable, and of these, 545 (18%) sites are
phylogenetically informative. Of the 545 sites, 27, 148,
and 312 are informative about the (H, C, G), (H, C, G,
O), and (H, C, G, O, M) phylogenies, respectively. Fifteen of 27 (56%) relevant informative sites support the
(H, C) clade, whereas 129 of 148 (87%) sites support
the (H, C, G) clade, and 255 of 312 (82%) sites are
compatible with the (H, C, G, O) clade. Similarly, of
the 32,921 noncoding sites, 5,424 (16%) have been
found to be variable, and of these, 855 (16%) are phylogenetically informative. Of these, 62, 229, and 476
sites are informative about the (H, C, G), (H, C, G, O),
and (H, C, G, O, M) phylogenies, respectively. Thirty-one of the 62 (50%) relevant informative sites support
the (H, C) clade, 203 of 229 (89%) the (H, C, G) clade,
and 382 of 476 (80%) the (H, C, G, O) clade. Thus, the
differences in the proportion of incompatibilities between the coding and noncoding regions are small, and
no strong tendency for homoplasy arising preferentially
in the coding regions is apparent. Selection therefore
does not appear to play a dominant part in the generation
of homoplasy.
The second possibility is that the observed high
level of homoplasy is caused by a bias in nucleotide
composition. The mechanisms producing bias in equilibrium nucleotide frequencies in different genomic regions are unclear (Wolfe, Sharp, and Li 1989). Because
maintenance of compositional bias increases the probability of like substitutions, such bias could be expected
to indirectly influence the extent of homoplasy at certain
sites in a gene and in certain regions of a genome. This
effect should be most pronounced in the third positions
of codons and noncoding regions where mutational bias
is the primary determinant of nucleotide composition.
The absence of a significant increase in homoplasy in
noncoding regions noted above therefore provides an
argument against this explanation. Further evidence
emerges from an examination of the nucleotide composition of the individual genes (table 4). Compositional
bias was measured by using the method of Kornegay et
al. (1993). It was then related to the number of variable
sites, number of phylogenetically informative sites, and
number of sites compatible or incompatible with the
consensus phylogeny (table 4). The measurement of incompatibility was limited to the (H, C, G, O) and (H,
C, G, O, M) groups where no contribution of ancestral
polymorphism should occur. Most (45 of 57 segments)
of the genes in the data set showed some degree of
incompatibility which ranged from 0% to 72%. The relatively high percentages of incompatibility found in
some of the short genes (ZFX, ACAT2, DRD4, PROC)
may be caused by stochastic effects associated with a
low number of informative sites. The longer genes containing more informative sites probably provide a more
reliable estimate of incompatibility. The gist of the comparison is that compositional bias in either coding or
noncoding regions does not appear to have a strong effect on the degree of homoplasy found in the individual
genes. Genes showing the highest levels of compositional bias often show a below average (e.g., ADBR3,
MSH2) or no (e.g., ZNFN1A1, APOB) homoplasy. By
contrast, genes with high levels of homoplasy may have
a low (e.g., RNASE3, PROC) or moderate (e.g., PRM2,
LCAT) degree of compositional bias.
The third theoretically possible explanation for the
observed high average level of homoplasy in the primate
genes is a high mutation rate with an associated increased probability of multiple hits. An increased mutation rate can have a variety of causes, one of which is
the presence of CpG dinucleotides prone to methylation
and thus to a high frequency of C→T transitions (Green
et al. 1990). To test the effect of an increased mutation
rate, substitution rates were estimated for the individual
genes. Assuming some correspondence between mutation and substitution rates and using Kimura’s two-parameter method (Kimura 1980), we calculated the per-site substitution rate (K) for each gene in all 15 pairwise
comparisons of the six OTUs and then summed up the
values and expressed the sum as percent K, a measure
we refer to as ‘‘ΣK%’’ in table 4. The rates were calculated for all sites of a given gene and for synonymous
sites separately. Except for a few genes (generally the
slowly evolving ones), the two rate estimates correlated
reasonably well with each other. The general tendency
revealed by the estimates is that genes with higher mutation rates show higher degrees of homoplasy. Thus,
the 14 most rapidly evolving gene segments in the (H,
C, G, O) clade have 23% incompatible sites on average
(100 of 437 sites), whereas the most slowly evolving
gene segments have a mean incompatibility of 15% (9
of 60 sites), the overall mean incompatibility for this
clade being 17%. Eleven of the 12 gene segments that
show no incompatibilities caused by homoplasy are in
the slowly evolving set, and there is no segment without
incompatibility in the set of the 25 most rapidly evolving gene segments. Nonetheless, it must be pointed out
that although the tendency for the association of incompatibility with higher mutation (substitution) rates does
exist, the association is not very strong and it is not clear
to what degree the higher incompatibility levels could
be attributed to the increased rates. At least some of the
incompatibility may be related to certain special evolutionary characteristics of some of the genes. Thus, for
example, the high incompatibility level of the RNASE3
gene found in the (H, C, G, O) clade (13 incompatibilities at 15 sites) may be attributed to its retention of the
character of the RNASE2 gene from which it arose by
duplication following the divergence of the Catarrhini
from the Platyrrhini (Zhang, Rosenberg, and Nei 1998).
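The rate measure used above, K summed over all 15 pairwise comparisons and expressed as a percentage, can be sketched in Python. The distance formula is the standard Kimura (1980) two-parameter estimator, K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], with P and Q the proportions of transitional and transversional differences. The function names and the toy alignment are assumptions made for illustration and are not data from the study.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences of equal length."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]  # skip gaps and ambiguous bases
    if not pairs:
        raise ValueError("no comparable sites")
    transitions = transversions = 0
    for a, b in pairs:
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(pairs)
    q = transversions / len(pairs)
    # math.log raises ValueError if the sequences are saturated (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def sigma_k_percent(alignment):
    """Sum of K over all pairwise comparisons, expressed as a percentage."""
    return 100 * sum(k2p_distance(alignment[a], alignment[b])
                     for a, b in combinations(sorted(alignment), 2))


# Invented 20-bp toy alignment for the six OTUs (not real sequence data).
toy = {
    "H": "ACGTACGTACGTACGTACGT",
    "C": "ACGTACGTACGTACGTACGC",
    "G": "ACGTACGTACGTACGTATGC",
    "O": "ACGTACGTACGAACGTATGC",
    "M": "ACGTGCGTACGAACGTATGC",
    "T": "CCGTGCGTACGAACGTATGC",
}
print(f"sum of K over 15 comparisons: {sigma_k_percent(toy):.1f}%")
```

Restricting the same calculation to synonymous sites would require a codon-aware site classification, which is outside the scope of this sketch.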
Search for the Cause of High Homoplasy by
Computer Simulation
Partitioning analysis excluded selection pressure,
but not nucleotide composition from being responsible
for the high homoplasy of the primate genes, and provided an indication that variation in the mutation rate of
different sites might be a factor. To test whether a combination of the two factors, nucleotide composition bias
and variability of mutation rates, might explain the data,
a computer simulation was carried out. The influence of
nucleotide composition was simulated by using either
the four-state or the two-state model of molecular evolution. The effect of the mutation rate was assessed by
letting the genes evolve with a rate of $d_5 = 0.05$ and then again with a 10-fold higher rate of $d_5 = 0.5$.
(As mentioned earlier, the strong transition bias model is covered by the two-state model.) The first of these
two values was obtained by taking the average divergence at synonymous sites of the 51 studied loci calculated
from the comparison of the T sequences with the sequences of the other five OTUs (i.e., $d_5 = 0.09862/2 = 0.05$).
Thus, the combination of the two variants of each of the two factors tested four different conditions under
which the genes evolved. The simulated evolutionary process was aimed at producing six OTUs related to one
another in the manner depicted in figure 1 (for further details see Materials and Methods). The simulation was
repeated 10,000 times for each of the four sets of conditions, and the results were summarized in a graphic
form separately for the (H, C), (H, C, G), and (H, C, G, O) clades (fig. 2; panels A, B, and C, respectively).
Plotted on the abscissa of the graph are the incompatibility values or the proportions of informative sites
incompatible with the particular clade in the set of artificially generated sequences. The ordinate indicates
the frequency with which each particular proportion occurred in the 10,000 replicates; it can also be
interpreted as the probability of obtaining a particular proportion of incompatible sites in one simulation
experiment or at one locus.

Table 4
Gene-by-Gene Analysis of Incompatibility

                Bias              ΣK%          Number of sites        (H, C)   (H, C, G) (H, C, G, O)
Gene        Third     All     All    Syn.  Variable  Informative    c    i     c    i     c    i     %I
ZFX         0.107       —    18.6    80.3        13            2    0    0     0    0     1    1   50.0
ZNFN1A1     0.596       —    18.7    55.0        20            3    0    0     1    0     1    0    0.0
CXCR4       0.284       —    23.5    64.9        40           10    1    0     2    0     5    0    0.0
APOB        0.455       —    28.4    85.5         9            1    0    0     0    0     1    0    0.0
APOB        0.182       —    61.9    66.6        61           11    1    1     2    0     5    0    0.0
ACAT2       0.296   0.258    30.7    46.7        27            6    2    0     1    0     1    2   50.0
VHL         0.270       —    32.4    95.8        24            6    1    0     2    0     2    1   20.0
NGFB        0.207       —    37.0    98.6        38            6    0    0     2    0     3    0    0.0
SCG2        0.099       —    37.1    70.9        68            5    0    1     1    0     3    0    0.0
ADRB2       0.292       —    40.0    85.6        77           16    1    0     4    0     7    1    8.3
UOX             —   0.085    40.6    40.6        25            5    0    0     2    0     2    1   20.0
DRPLA       0.184       —    44.5    92.8        36            7    1    0     0    0     5    0    0.0
PRNP        0.222       —    48.6   136.2        57           19    0    0     7    1     8    1   11.8
NPPA        0.205   0.133    51.7    85.6        42            7    1    0     2    0     3    0    0.0
ADBR3       0.466       —    52.5   119.1        89            9    0    0     2    0     6    1   11.1
ZFY         0.090       —    53.9   227.9        31           17    0    0     2    0    14    0    0.0
ZFY             —   0.233   129.9   129.9       139           24    2    0    10    1     7    3   19.0
DRD4        0.575       —    55.4   132.9        26            3    0    0     0    1     1    1   66.7
RNASE6      0.197       —    55.7    84.7        41            7    0    0     0    0     5    2   28.6
CCR5        0.161       —    56.1   114.7        99           10    1    0     1    0     4    0    0.0
BRCA1       0.241       —    57.4    70.3       334           59    0    1    13    3    37    3   10.7
OXTR        0.536       —    57.6   126.1        82            8    0    1     2    1     1    1   40.0
IFNG        0.080       —    59.1    94.9        44            8    0    0     2    2     3    1   37.5
ABO         0.573       —    63.8   145.5        60           13    0    0     1    0     9    1    9.1
TNF             —   0.089    65.0    65.0        50           12    0    0     2    0     8    1    9.1
TNF         0.369   0.183    66.7    74.7       179           24    0    0    10    1     9    2   13.6
BCYRN1          —   0.060    65.3    65.3        65           10    0    0     1    0     7    0    0.0
B2M         0.079   0.129    66.2    66.3       117           16    2    0     3    0     9    2   14.3
LYZ         0.115       —    66.6    78.9        46           14    0    0     2    0    10    1    7.7
FUT2        0.344       —    66.7   161.8       112           15    0    0     2    1    11    1   13.3
PROC            —   0.039    67.9    67.9        45            6    1    0     1    1     0    1   66.7
C4B         0.110   0.192    68.1    77.1        59           10    0    2     6    0     0    0    0.0
F9          0.168   0.194    69.1    84.9       117           26    4    0     3    1    13    2   15.8
RHAG        0.088       —    69.8    92.2       133           37    0    4    18    0     7    1    3.8
DMP1        0.134       —    70.8   122.2       105           17    1    0     3    2     8    2   26.7
IL16        0.100       —    71.6     125       113           18    0    0     3    0     9    5   29.4
COX4        0.380       —    72.5   121.7        50           12    0    0     4    1     7    0    8.3
ODC1        0.154   0.130    75.0   108.2       161           25    2    1     8    1    10    0    5.3
DAF         0.313   0.232    78.6    78.9        84           16    0    0     3    0     9    2   14.3
POMC        0.544       —    80.7   154.9        62           19    1    0     7    1     6    1   13.3
PAH         0.231   0.096    80.7    85.6       154           24    1    0     4    1    14    2   14.3
LCAT        0.265   0.200    81.7    92.7       191           34    1    0     8    0    11    8   29.6
OPN1SW      0.254   0.100    82.7    92.4       161           25    1    0     8    1    12    2   13.0
APOA1       0.474   0.146    91.3   112.7       172           30    0    3    11    1    14    1    7.4
AFP         0.136   0.160    88.3   100.7        87           13    0    3     2    1     6    1   20.0
HBBP1           —   0.147    96.4    96.4     1,448          236    5    6    53    4   118   32   17.4
FPR1        0.205       —   100.6   152.8       167           21    0    1     5    1     7    3   25.0
GHR             —   0.103   104.0     104       248           32    1    3     4    2    13    4   26.1
HBG1        0.213   0.130   108.0   111.9     1,588          262   10    7    60    9   116   29   17.8
EPO         0.264   0.041   108.3   117.5       238           34    0    2    10    3    12    2   18.5
PRM2        0.320   0.233   112.7    84.1        99           17    0    4     1    2     5    3   45.5
MSH2        0.545   0.159   111.8   141.4       106           17    2    0     3    0     7    2   16.7
SRY         0.109       —   118.0   182.1       119           19    0    0     7    1     8    0    6.2
IL3         0.132   0.061   123.1   120.5       282           32    2    0     9    0    15    3   11.1
INS         0.441   0.182   146.2   161.5       186           29    1    3     2    0    14    3   15.8
RNASE3      0.155       —   151.4   151.6       112           19    0    0     3    0     2   13   72.2
RNASE2      0.125       —   171.5   255.6       120           19    0    0     7    1     7    2   17.6
Totals                                        8,457        1,402   46   43   332   45   638  151   16.8

NOTE.—Genes are as listed in table 1. Nucleotide composition bias for third codon positions or all noncoding sites
was calculated according to Kornegay et al. (1993). ΣK% is the sum of the percentage divergence (see text). c and i
indicate the number of sites compatible and incompatible with the shown phylogeny, respectively. %I indicates the
combined percentage of incompatibility for the (H, C, G) and (H, C, G, O) clades.

FIG. 2.—Simulation results of the effects of the mutation rate and nucleotide composition on the observed extent
of homoplasy. The abscissa shows the proportion of informative sites that are incompatible with the (H, C) clade
(A), the (H, C, G) clade (B), or the (H, C, G, O) clade (C) in a set of generated sequences. The ordinate shows the
proportion of replicates among 10,000 supporting the indicated incompatibility. μ, mutation rate; L, low; H, high
(i.e., 10 times higher than L); 2S, two-state model; 4S, four-state (Jukes-Cantor) model.
The simulation reveals that under the low mutation
rate and the application of either the two- or four-state
model, more than 80% of the replications show no incompatibility in both the (H, C) and the (H, C, G) clades
(fig. 2A and B). In the remaining <20% of replications,
incompatibilities do occur, but because they are rare,
they are widely scattered among the replications. Moreover, because mutations are generally rare, when an even
rarer incompatible mutation does occur, it has a large
effect on the incompatibility value. Consequently, the
simulation is subject to a considerable ‘‘noise’’ reflected
in the scattering of incompatibilities. Because the (H, C,
G, O) clade is deeper than either the (H, C) or the (H,
C, G) clade, the larger number of substitutions found in
the deeper phylogeny results in less noise in the simulation results than that found in the shallower clades (fig.
2C). Although more incompatibilities occur, their proportion in the total number of substitutions varies far
less than when only a few informative sites are present.
The increased number of incompatibilities is reflected in
the observation that only 55% of replicates produce no
incompatibilities under the four-state model and in that
the value drops to 19% under the two-state model.
Higher mutation rates increase the proportion of incompatibility in each of the three clades in both the four- and two-state models. In the case of the (H, C) clade,
the proportion of replicates without incompatibility falls
below 10% under the four-state model and to 0% under
the two-state model. Incompatibilities are distributed in
varying proportions through most replicates, the variation again being the result of sampling effects in the
shallow phylogeny. In the (H, C, G) clade, no replicate
under either the four- or two-state model is without incompatibility. The distribution of incompatibility shows
a peak at ~25% under the four-state model and a broad-shouldered peak at ~35% under the two-state model.
Finally, in the (H, C, G, O) clade, the peak under the
four-state model moves to ~30% and that under the
two-state model to an equilibrium value of 80%. In the
two-state model, four incompatibilities arise for every
compatibility generated.
From these observations the following conclusions
can be drawn. First, the level of homoplasy is insignificant when the mutation rate is uniformly low at all the
sites and when the nucleotide composition is unbiased.
Second, the two-state model does not markedly affect
the extent of homoplasy in comparison with the four-state model when the range of sequence divergence is
low (<10%). Third, a high mutation rate greatly increases the extent of homoplasy: even for a pair of OTUs
with a short divergence time, the proportion of loci with
compatible sites only is reduced to <10%. The reduction takes place under both models, but it is more pronounced under the two-state than under the four-state
model.
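The kind of simulation discussed above can be sketched in Python under strong simplifying assumptions. The sketch evolves independent sites on a four-OTU tree (((H, C), G), O) under a symmetric k-state model (k = 4 corresponds to Jukes-Cantor, k = 2 to the two-state model) and tallies the sites supporting each pairing of H, C, and G. The branch lengths are hypothetical placeholders and not the $d_i$ values of the study, sites are pooled rather than grouped into 1,000-site replicates, and all function names are invented for the illustration.

```python
import math
import random


def evolve(state, d, k, rng):
    """Evolve one site along a branch of length d (expected substitutions
    per site) under a symmetric k-state model; k = 4 is Jukes-Cantor."""
    p_change = (k - 1) / k * (1.0 - math.exp(-k * d / (k - 1)))
    if rng.random() < p_change:
        return rng.choice([s for s in range(k) if s != state])
    return state


def simulate_site(branch, k, rng):
    """Return the states of H, C, G, O for one site on the tree (((H, C), G), O)."""
    root = rng.randrange(k)
    o = evolve(root, branch["O"], k, rng)
    hcg = evolve(root, branch["HCG"], k, rng)
    g = evolve(hcg, branch["G"], k, rng)
    hc = evolve(hcg, branch["HC"], k, rng)
    h = evolve(hc, branch["H"], k, rng)
    c = evolve(hc, branch["C"], k, rng)
    return h, c, g, o


def tally(n_sites, branch, k, seed=0):
    """Count sites supporting each pairing of H, C, G with O as the out-group."""
    rng = random.Random(seed)
    counts = {"HC": 0, "HG": 0, "CG": 0}
    for _ in range(n_sites):
        h, c, g, o = simulate_site(branch, k, rng)
        if h == c and g == o and h != g:
            counts["HC"] += 1      # compatible with the true (H, C) clade
        elif h == g and c == o and h != c:
            counts["HG"] += 1      # incompatible: homoplasy in this model
        elif c == g and h == o and c != h:
            counts["CG"] += 1      # incompatible: homoplasy in this model
    return counts


def incompatibility(counts):
    """Proportion of informative sites that contradict the (H, C) clade."""
    total = sum(counts.values())
    return (total - counts["HC"]) / total if total else 0.0


# Hypothetical branch lengths chosen for illustration only.
BRANCH = {"H": 0.01, "C": 0.01, "HC": 0.005, "G": 0.015, "HCG": 0.02, "O": 0.035}

for k in (4, 2):
    for factor in (1, 10):  # low rate and 10-fold higher rate
        scaled = {name: d * factor for name, d in BRANCH.items()}
        counts = tally(20000, scaled, k, seed=1)
        print(f"{k}-state, rate x{factor}: {counts}, "
              f"incompatibility = {incompatibility(counts):.2f}")
```

Because the model contains no ancestral polymorphism, every incompatible site it produces is a homoplasy, which is what makes such a simulation useful for separating the two sources of incongruence.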
Simulation Based on a Mixed Rate Model
Taking these results into consideration and taking
into account the possibility that mutation rates may vary
from site to site and between different regions of a gene
or genome, a mixed rate model was constructed and
used in another set of computer simulations. To simulate
the situation encountered with the actual data set more
realistically, the number of replicates was reduced from
10,000 to 51, corresponding to the number of genes in
the set. And to provide for the heterogeneity of the mutation rate, we allowed 900 of the 1,000 sites at each
gene to mutate at the low (1μ) rate and the remaining
100 sites at the 10 times higher (10μ) rate. In all other
respects the simulation was carried out as in the first
experiment. The observed compatibility values—62%,
85%, and 85% for the (H, C), (H, C, G), and (H, C, G,
O) clades, respectively—were in reasonably good agreement with the actual data.
We have thus far considered compatible or incompatible sites for a particular clade irrespective of their
occurrences within or between loci. However, a locus
can be incompatible with a particular clade in two different ways: in one, all the informative (either compatible or incompatible) sites at the locus are incompatible
(interlocus incompatibility), and in the other, the locus
contains some sites incompatible with each other (intralocus incompatibility). In the experimental data, of
the 57 sequence segments (table 4), 34 were informative
for the (H, C) clade. Of these, six loci or segments
(18%) showed intralocus incompatibility and contained
21 incompatible and 20 compatible sites within these
loci. The relative extent of intralocus incompatibility for
the (H, C) clade (21 vs. 20) was much larger than that
for the (H, C, G) clade (44 vs. 244) and for the (H, C,
G, O) clade (150 vs. 562). The remaining 28 segments
supported either the (H, C) (18 segments, 53%) or the
(H, G) and (C, G) (10 segments, 29%, of interlocus incompatibility) grouping unambiguously. Hence, in total,
82% of segments supported a single phylogeny. This
proportion reduced to 28/53 = 53% for the (H, C, G)
clade and 15/56 = 27% for the (H, C, G, O) clade, each
with only one segment being incompatible with either
of these clades. Similarly, in terms of the numbers of
interlocus incompatible versus compatible sites, there
were 22 versus 26 for the (H, C) clade but 1 versus 88
for the (H, C, G) clade and 1 versus 76 for the (H, C,
G, O) clade. Thus, the actual data showed that compared
with the (H, C, G) and (H, C, G, O) clades, both intra- and interlocus incompatibilities were notably high for
the (H, C) clade.
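The distinction between intralocus and interlocus incompatibility drawn above can be expressed as a small classification rule. The Python sketch below is illustrative only; the function name and category labels are assumptions, and the example counts are the (H, C) columns of table 4 for a few segments.

```python
def classify_locus(compatible, incompatible):
    """Classify a locus from its numbers of compatible and incompatible sites."""
    if compatible == 0 and incompatible == 0:
        return "uninformative"
    if compatible > 0 and incompatible > 0:
        return "intralocus incompatibility"   # sites disagree within the locus
    if incompatible > 0:
        return "interlocus incompatibility"   # all informative sites are incompatible
    return "compatible"


# (compatible, incompatible) site counts for the (H, C) clade, from table 4.
hc_counts = {
    "ZFX": (0, 0),
    "ZFY (second segment)": (2, 0),
    "BRCA1": (0, 1),
    "HBBP1": (5, 6),
    "HBG1": (10, 7),
}
for gene, (c, i) in hc_counts.items():
    print(f"{gene}: {classify_locus(c, i)}")
```

Applied to all 57 segments, a rule of this form separates segments that unambiguously support one grouping from those carrying internally conflicting sites.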
Although the simulation result was in good agreement with the observed low extent of interlocus incompatibility for the (H, C, G) and (H, C, G, O) clades, it
failed to account for the observed high extent of interlocus incompatibility for the (H, C) clade. Although by
using a mixed rate model it was possible to generate a
high degree of intralocus incompatibility, the extent of
intralocus incompatibility tended to increase because the
clade included distantly related OTUs. This simulation
result was again inconsistent with the observed high extent of intralocus incompatibility for the (H, C) clade
and the low extent for the (H, C, G) and (H, C, G, O)
clades. Thus, the simulation result suggested that the
relatively high extent of intra- and interlocus incompatibility observed in the (H, C) clade cannot be accounted
for by high mutation rates at particular sites or by homoplasy. It must therefore have a different cause, namely, ancestral polymorphism.
Conclusions
To sum up, comparative analysis of 57 sequences
obtained from 51 genes in six primate OTUs representing the human, chimpanzee, gorilla, orangutan, OWM,
and NWM lineages reveals the occurrence of homoplasy
(parallelism or convergence) at much higher frequencies
than is generally expected (19% in the six-OTU lineage
and 12% in the OWM and ape lineages). Together with
ancestral polymorphism, homoplasy is therefore a major
source of incongruence in phylogenetic reconstructions.
Whereas ancestral polymorphism may obscure only
phylogenies of lineages that diverged from one another
within a short interval on the evolutionary time scale,
no such restriction applies to homoplasy. Of the three
major factors considered here as potential causes of the
observed high level of homoplasy, no compelling evidence could be mustered for the effect of selection. In
particular, the observation that homoplasy is distributed
equally between coding and noncoding parts of the genome argues against this explanation. Selection, however, is responsible for an increased tendency toward
parallel substitutions in certain genes, such as those of
the major histocompatibility complex (O’hUigin 1995;
Kriener et al. 2000), not included in the data set used
in the present study. Both the analysis of this data set
and computer simulations implicate the two other factors: variation in nucleotide composition and, in particular, in mutation (substitution) rates. Taken in isolation,
however, these two factors do not fully account for the
observed high level of homoplasy. The simulation study
indicates that even in extreme cases of biased nucleotide
composition, 10% incompatibility is reached very rarely
when standard substitution models are applied. On the
other hand, high mutation rates do appear to account for
some but not all of the observed incompatibility, especially that observed in the (H, C, G) phylogeny. But
because we observed a high extent of interlocus incompatibility, whereas simulation led to a high degree of
intralocus incompatibility, ancestral polymorphism must
be an important factor in determining the (H, C, G) phylogeny. Compared with the (H, C, G) and (H, C, G, O)
clades, the large proportion of intralocus-incompatible
sites relative to that of intralocus-compatible sites in the
(H, C) clade cannot be accounted for by homoplasy.
Rather, it is most likely the result of the combined effect
of intragenic recombination and ancestral polymorphism
in the stem lineage of humans, chimpanzees, and gorillas. Therefore, the contribution of homoplasy caused by
mutations may be insignificant with respect to the (H,
C, G) phylogeny. On the other hand, in the case of the
(H, C, G, O) and (H, C, G, O, M) phylogenies, the
simulation indicates that if a small percentage of nucleotide sites are allowed to undergo mutation (substitution) at a rate 5- to 10-fold higher than the normal rate,
homoplasy (and phylogenetic incompatibility) will occur at levels similar to those observed at the 51 loci
under study. The number of interlocus-incompatible loci
as well as interlocus-incompatible sites for the (H, C)
clade is much larger than that for the (H, C, G) or (H,
C, G, O) clades. If the latter incompatibility is attributed
to homoplasy, then the former must be attributed to ancestral polymorphism.
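The effect of a small class of fast-evolving sites on incompatibility can be illustrated with a minimal sketch. It is not the simulation used in this study: the branch lengths, the nested-list tree encoding, the Jukes-Cantor substitution model, and the restriction to two-state informative sites are all assumptions chosen for illustration. Because the sketch contains no ancestral polymorphism, every informative site that conflicts with the assumed tree is homoplasious by construction.

```python
import math
import random

TAXA = ("H", "C", "G", "O", "M", "T")

# Assumed rooted tree (((((H,C),G),O),M),T). Each entry is (child, branch length);
# lengths are hypothetical, in expected substitutions per site at the normal rate.
TREE = [
    ("T", 0.048),
    ([("M", 0.028),
      ([("O", 0.016),
        ([("G", 0.008),
          ([("H", 0.006), ("C", 0.006)], 0.002)], 0.008)], 0.012)], 0.020),
]

# Informative splits implied by the assumed tree, named by the side lacking T.
TRUE_SPLITS = {frozenset("HC"), frozenset("HCG"), frozenset("HCGO")}


def evolve(state, length, rate, rng):
    """Jukes-Cantor: return the base at the end of a branch of the given length."""
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * rate * length / 3.0))
    if rng.random() < p_diff:
        return rng.choice([b for b in "ACGT" if b != state])
    return state


def simulate_site(node, state, rate, rng, out):
    """Evolve one site down the tree and record the leaf states in `out`."""
    for child, length in node:
        child_state = evolve(state, length, rate, rng)
        if isinstance(child, str):
            out[child] = child_state
        else:
            simulate_site(child, child_state, rate, rng, out)


def site_split(pattern):
    """Return the split induced by a two-state informative site, else None.

    `pattern` lists one base per taxon in TAXA order.
    """
    groups = {}
    for taxon, base in zip(TAXA, pattern):
        groups.setdefault(base, set()).add(taxon)
    if len(groups) != 2:
        return None
    a, b = groups.values()
    if len(a) < 2 or len(b) < 2:
        return None
    return frozenset(b if "T" in a else a)


def incompatible_fraction(n_sites, fast_fraction, fast_factor, seed=0):
    """Fraction of informative sites whose split conflicts with the assumed tree."""
    rng = random.Random(seed)
    informative = incompatible = 0
    for _ in range(n_sites):
        rate = fast_factor if rng.random() < fast_fraction else 1.0
        leaves = {}
        simulate_site(TREE, rng.choice("ACGT"), rate, rng, leaves)
        split = site_split([leaves[t] for t in TAXA])
        if split is None:
            continue
        informative += 1
        if split not in TRUE_SPLITS:
            incompatible += 1
    return incompatible / informative if informative else 0.0


if __name__ == "__main__":
    print("uniform rates:      ", incompatible_fraction(50_000, 0.0, 1.0))
    print("5% of sites at 10x: ", incompatible_fraction(50_000, 0.05, 10.0))
```

Under these assumptions the mixed-rate run yields a clearly larger incompatible fraction than the uniform-rate run, because the probability of two independent substitutions at a site grows roughly with the square of its rate. The absolute values depend entirely on the assumed branch lengths and should not be compared with the percentages reported here.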
The inference that a category of rapidly evolving
sites must exist in primate DNA has implications for
phylogenetic studies. Such sites might be expected to
contribute inordinately to phylogenetic information obtained on lineages diverging in rapid succession because
few other sites will have undergone substitution within
the short time interval. Examples of rapidly evolving
sites in specific genes are known (Green et al. 1990),
but the extent of their occurrence in the genome has not
been determined. In cases in which rapidly evolving
sites are the major source of information about phylogenetic relationships, they may have undergone several
substitutions before more slowly evolving sites could
contribute to phylogenetic resolution. In such cases, a
phylogeny may be built primarily on sites that show a
high degree of incompatibility and is likely to be
incorrect.
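A short worked calculation makes this argument concrete. The sketch below assumes, purely for illustration, that 5% of sites evolve 10-fold faster than the rest and that substitutions at a site follow a Poisson process; the function names and the numerical values are hypothetical and are not estimates from the 51 loci.

```python
import math


def fast_site_share(fast_fraction, fast_factor):
    """Expected share of all substitutions that fall on the fast sites."""
    fast = fast_fraction * fast_factor
    slow = 1.0 - fast_fraction
    return fast / (fast + slow)


def prob_multiple_hits(expected_hits):
    """P(two or more substitutions) for a Poisson count with the given mean."""
    return 1.0 - math.exp(-expected_hits) * (1.0 + expected_hits)


if __name__ == "__main__":
    share = fast_site_share(0.05, 10.0)
    print(f"share of substitutions on fast sites: {share:.3f}")
    # Hypothetical expected numbers of substitutions per site over the whole tree.
    for label, hits in (("normal site", 0.05), ("fast site", 0.5)):
        print(f"{label}: P(>=2 hits) = {prob_multiple_hits(hits):.4f}")
```

With these assumed values, the fast sites make up one twentieth of the sequence but receive about one third of all substitutions (0.5/1.45), and they are also far more likely than normal sites to have been hit more than once. This is the combination that makes them both influential and unreliable for resolving short internodes.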
Acknowledgments
We thank Dr. Naoko Takezaki for critical reading
of the manuscript, Hana Jandova and Solveig Hirschle
for technical assistance, and Jane Kraushaar for editorial
assistance.
LITERATURE CITED
BAILEY, W. J. 1993. Hominoid trichotomy: a molecular overview. Evol. Anthropol. 2:100–108.
GREEN, P. M., A. J. MONTANDON, D. R. BENTLEY, R. LJUNG,
I. M. NILSSON, and F. GIANNELLI. 1990. The incidence and
distribution of CpG→TpG transitions in the coagulation
factor IX gene. Nucleic Acids Res. 18:3227–3231.
HORAI, S., Y. SATTA, K. HAYASAKA, R. KONDO, T. INOUE, T.
ISHIDA, S. HAYASHI, and N. TAKAHATA. 1992. Man’s place
in hominoidea revealed by mitochondrial DNA genealogy.
J. Mol. Evol. 35:32–43.
IRWIN, D. M., R. T. WHITE, and A. C. WILSON. 1993. Characterization of the cow stomach lysozyme genes: repetitive
DNA and concerted evolution. J. Mol. Evol. 37:355–366.
JUKES, T. H., and C. R. CANTOR. 1969. Evolution of protein
molecules. Pp. 21–132 in H. N. MUNRO, ed. Mammalian
protein metabolism III. Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies
of nucleotide sequences. J. Mol. Evol. 16:111–120.
KLEIN, J., A. SATO, S. NAGL, and C. O’HUIGIN. 1998. Molecular trans-species polymorphism. Annu. Rev. Ecol. Syst. 29:1–21.
KOMINATO, Y., P. D. MCNEILL, M. YAMAMOTO, M. RUSSEL,
S.-I. HAKOMORI, and F. YAMAMOTO. 1992. Animal histo-blood group ABO genes. Biochem. Biophys. Res. Commun.
189:154–165.
KORNEGAY, J. R., T. D. KOCHER, L. A. WILLIAMS, and A. C.
WILSON. 1993. Pathways of lysozyme evolution inferred
from the sequences of cytochrome b in birds. J. Mol. Evol.
37:367–379.
KRIENER, K., C. O’HUIGIN, H. TICHY, and J. KLEIN. 2000. Convergent evolution of major histocompatibility complex molecules in humans and New World monkeys. Immunogenetics 51:169–178.
KUMAR, S., and S. B. HEDGES. 1998. A molecular timescale
for vertebrate evolution. Nature 392:917–920.
LAWN, R. M., K. SCHWARTZ, and L. PATTHY. 1997. Convergent evolution of apolipoprotein (a) in primates and hedgehog. Proc. Natl. Acad. Sci. USA 94:11992–11997.
MARTIN, R. D. 1993. Primate origins: plugging the gaps. Nature 363:223–234.
MIYAMOTO, M., B. F. KOOP, J. L. SLIGHTOM, M. GOODMAN,
and M. TENNANT. 1988. Molecular systematics of higher
primates: genealogical relations and classification. Proc.
Natl. Acad. Sci. USA 85:7627–7631.
NEI, M., and S. KUMAR. 2000. Molecular evolution and phylogenetics. Oxford University Press, Oxford.
O’HUIGIN, C. 1995. Quantifying the degree of convergence in
primate Mhc-DRB genes. Immunol. Rev. 143:123–140.
O’HUIGIN, C., A. SATO, and J. KLEIN. 1997. Evidence for convergent evolution of A and B blood group antigens in primates. Hum. Genet. 101:141–148.
ROGERS, J. 1993. The phylogenetic relationships among Homo,
Pan and Gorilla: a population genetics perspective. J. Hum.
Evol. 25:201–215.
RUVOLO, M. 1997. Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data
sets. Mol. Biol. Evol. 14:248–265.
SARICH, V. M., and A. C. WILSON. 1967. Rates of albumin
evolution in primates. Proc. Natl. Acad. Sci. USA 58:142–148.
SATTA, Y., J. KLEIN, and N. TAKAHATA. 2000. DNA archives
and our nearest relative: the trichotomy problem revisited.
Mol. Phylogenet. Evol. 14:259–275.
SCHWARTZ, J. H. 1987. The red ape. Orang-utans & human
origins. Houghton Mifflin Company, Boston.
SIBLEY, C. G., J. A. COMSTOCK, and J. E. AHLQUIST. 1990.
DNA hybridization evidence of hominoid phylogeny: a reanalysis of the data. J. Mol. Evol. 30:202–236.
SWANSON, K. W., D. M. IRWIN, and A. C. WILSON. 1991.
Stomach lysozyme gene of the langur monkey: tests for
convergence and positive selection. J. Mol. Evol. 33:418–425.
TAKAHATA, N. 1993. Allelic genealogy and human evolution.
Mol. Biol. Evol. 10:2–22.
———. 1995. Mhc diversity and selection. Immunol. Rev.
143:225–247.
TAKAHATA, N., and Y. SATTA. 1997. Evolution of the primate
lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl.
Acad. Sci. USA 94:4811–4815.
TEUMER, J., and H. GREEN. 1989. Divergent evolution of part
of the involucrin gene in the hominoids: unique intragenic
duplications in the gorilla and human. Proc. Natl. Acad. Sci.
USA 86:1283–1286.
WHITE, T. D., G. SUWA, and B. ASFAW. 1998. Australopithecus
ramidus, a new species of early hominid from Aramis, Ethiopia. Nature 371:306–312.
WOLFE, K. H., P. M. SHARP, and W. H. LI. 1989. Mutation
rates differ among regions of the mammalian genome. Nature 337:283–285.
WOODWARD, E. R., A. BUCHBERGER, S. C. CLIFFORD, L. D.
HURST, N. A. AFFARA, and E. R. MAHER. 2000. Comparative sequence analysis of the VHL tumor suppressor gene.
Genomics 65:253–265.
WYCKOFF, G. J., W. WANG, and C. I. WU. 2000. Rapid evolution of male reproductive genes in the descent of man.
Nature 403:304–309.
ZHANG, J., H. ROSENBERG, and M. NEI. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708–3713.
NARUYA SAITOU, reviewing editor
Accepted May 2, 2002