© 2000 Oxford University Press
Nucleic Acids Research, 2000, Vol. 28, No. 19 3801–3810
Amino acid and nucleotide recurrence in aligned
sequences: synonymous substitution patterns in
association with global and local base compositions
Manami Nishizawa and Kazuhisa Nishizawa*
Department of Biochemistry, Teikyo University School of Medicine, Kaga, Itabashi, Tokyo 173, Japan
Received May 15, 2000; Revised and Accepted August 14, 2000
ABSTRACT
The tendency for repetitiveness of nucleotides in
DNA sequences has been reported for a variety of
organisms. We show that the tendency for repetitive
use of amino acids is widespread and is observed
even for segments conserved between human and
Drosophila melanogaster at the level of >50% amino acid
identity. This indicates that repetitiveness influences not
only the weakly constrained segments but also those
sequence segments conserved among phyla. Not
only glutamine (Q) but also many of the 20 amino
acids show a comparable level of repetitiveness.
Repetitiveness in bases at codon position 3 is
stronger for human than for D.melanogaster,
whereas local repetitiveness in intron sequences is
similar between the two organisms. While genes for
immune system-specific proteins, but not ancient
human genes (i.e. human homologs of Escherichia
coli genes), have repetitiveness at codon bases 1 and
2, repetitiveness at codon base 3 for these groups is
similar, suggesting that the human genome has at
least two mechanisms generating local repetitiveness.
Neither amino acid nor nucleotide repetitiveness is
observed beyond the exon boundary, denying the
possibility that such repetitiveness could mainly
stem from natural selection on mRNA or protein
sequences. Analyses of mammalian sequence alignments show that while the ‘between gene’ GC content
heterogeneity, which is linked to ‘isochores’, is a
principal factor associated with the bias in substitution
patterns in human, ‘within gene’ heterogeneity in
nucleotide composition is also associated with such
bias on a more local scale. The relationship amongst
the various types of repetitiveness is discussed.
INTRODUCTION
Amino acid sequences of eukaryotic proteins have a tendency
for recurrent use of identical amino acids (1–5). For human
proteins, within a 1–10 residue distance of a glutamine (Q)
residue 27–38% more Q than expected by chance tends to
occur (5). Among human proteins, modern proteins that are
unique to human (or mammals) have a higher repetitiveness
(5). Such a tendency for clustering of the same amino acids has
been suggested to be caused by the tendency for simplicity of
DNA, which has been studied for non-coding and coding
regions by various methods (6–10). Such studies have discovered
concatemeric repeats of short DNA sequences, or microsatellites,
which are likely to be mainly generated by replication slippage
(10,11). They are highly polymorphic sequences and have
drawn much interest due to their usefulness in the construction
of genetic maps (see for example 12), assessment of genetic
diversity within a population (see for example 13–18) and due
to the importance in several genetic diseases (see for example
19,20). The most well-characterized case of the latter is expansion of the trinucleotide sequence CAG, which results in
stretching of the Q residues in amino acid sequences and is
believed to be the cause of many diseases involving myotonic
dystrophy (see for example 21–24). Other examples of triplet
repeats implicated in diseases include the CGG/CCG repeat
(see for example 25–27) and the GAA/TTC repeat (see for
example 28).
Because the majority of studies on microsatellites have been
concerned with particular loci with repetitive segments, it is
not yet clear how widely such repetitiveness is found in normal
genes, where sequences are under obvious functional
constraints. It seems important to ask whether the tendency for
repetitiveness can affect genetic variation, especially of those
sequences which are conserved between evolutionarily distant
organisms and thus not likely to be ‘junk’. In fact, the general
tendency depicted by statistical analyses (5,29,30) could
merely reflect the contribution of a small number of highly
repetitive sequences, because they are more frequently found
in regions which seemingly encode the weakly constrained
protein segments (5; our unpublished results). This is an
important issue because repetitiveness in conserved segments
may imply the involvement of a mechanism which does not
require a change in length. Of note, Golding and Glickman
have reported several convincing examples for ‘sequence-directed mutagenesis’ using human interferon genes (31).
Unlike replication slippage (6,10,32), sequence-directed
mutagenesis does not always result in alterations of gene length
(31). How widely is such repetitiveness present in conserved
sequences? And if it is widespread, how is it maintained?
First we studied the general statistical features of repetitiveness in mammals and Drosophila melanogaster, instead of
focusing on some particular examples. The general tendency
*To whom correspondence should be addressed. Tel: +81 3 3964 1211; Fax: +81 3 5375 6366; Email: [email protected]
for repetitiveness is present even for well-conserved sequences
(having >50% amino acid identity between human and
D.melanogaster). A similar degree of repetitiveness has been
observed for various codons for many of the 20 amino acids.
The local repetitiveness of concern here tends to be disrupted
by introns. We also analyzed the local repetitiveness in DNA
sequences and show that some local repetitiveness in humans
cannot be explained by sequence-directed mutagenesis. In fact,
even human genes encoding ancient proteins have repetitiveness
in bases at the codon 3 position, while their amino acid
sequences do not have a significant level of repetitiveness.
In general, nucleotide substitutions in a given species consist
of two steps: first, mutations in genes; second, fixation of the
mutant genes in the population (33). Analyses of synonymous
substitutions in humans show that both global (between gene)
and local (within gene) base compositions are factors associated
with a local imbalance in substitution patterns. While the
global factor, which is linked to the isochore scale GC content
heterogeneity, is strong for humans but not for mouse, the local
factor (on an ∼20 nt scale) can account for the very local
regional imbalance in substitutions for both species.
MATERIALS AND METHODS
Systems for sequence alignment and repetitiveness analyses
By modifying the BLAST program (34), we implemented our
system for repetitiveness analyses with a system for sequence
compilation based on homology searches between two
organisms. All the C source code and the sequence alignments
used are available from us or from our Web site
(http://village.infoweb.ne.jp/~gene), respectively. We first compared
protein sequences obtained from the SwissProt database
(http://www.expasy.ch/sprot/) between organisms and then, for the
sequences aligned with a high similarity (i.e. a BLAST
E-value of less than e–40 for the human–D.melanogaster comparison
and less than e–200 for the human–mouse and mouse–bovine
comparisons), we collected the corresponding cDNA and/or
genomic DNA sequences. All of the homologs were aligned
again using the CLUSTALW method (35) provided as Megalign
in Lasergene (Dnastar, Madison, WI). The intron sequences
used in Table 1 (human and D.melanogaster introns with a
total of 2.5 × 10^5 and 6.4 × 10^5 nt obtained from 573 and 542
genes, respectively) were downloaded from the NCBI web site
without any artifactual selections.
Protocol A
From the human–D.melanogaster alignments, we obtained highly
conserved peptide segments using the following procedure. First,
score each segment (of five residues) by summing the value of
each position scored based on the Blosum62 matrix (36). Then,
starting from the segments scored above the threshold T
(typically 19), scan the sequence in both directions using a
window of five residues, then record the score. The scanning is
stopped just before the window hits a ‘gap’ (unaligned
position) or it scores lower than 0. Collect the regions over
which the window was scanned at least once as ‘highly
conserved peptide segments’.
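Protocol A can be sketched in code as follows. This is a minimal illustration, not the authors' C implementation: the function name `conserved_segments`, the half-open column ranges it returns and the toy `identity_score` function (a stand-in for Blosum62 lookups) are assumptions made for the example.

```python
def conserved_segments(seq_a, seq_b, score, window=5, threshold=19, gap="-"):
    """Return half-open (start, end) column ranges of conserved segments.

    seq_a and seq_b are two rows of a pairwise alignment of equal length;
    score(a, b) returns the substitution score for one aligned pair.
    """
    n = len(seq_a)
    assert len(seq_b) == n, "aligned sequences must have equal length"

    def window_score(start):
        # None when the window leaves the alignment or touches a gap.
        if start < 0 or start + window > n:
            return None
        cols = range(start, start + window)
        if any(seq_a[c] == gap or seq_b[c] == gap for c in cols):
            return None
        return sum(score(seq_a[c], seq_b[c]) for c in cols)

    covered = [False] * n
    for seed in range(n - window + 1):
        s = window_score(seed)
        if s is None or s <= threshold:
            continue  # only windows scoring above T start a scan
        for step in (-1, 1):  # scan upstream, then downstream
            pos = seed
            while True:
                s = window_score(pos)
                if s is None or s < 0:
                    break  # stop just before a gap or a negative window
                for c in range(pos, pos + window):
                    covered[c] = True
                pos += step

    # Merge covered columns into contiguous ranges.
    segments, start = [], None
    for i, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    return segments


def identity_score(a, b):
    # Toy stand-in for a Blosum62 lookup (assumption): +4 match, -1 mismatch.
    return 4 if a == b else -1


human = "ACDEFGHIKLTMNPQR"
fly = "ACDEFGHIKL-MNPQW"
print(conserved_segments(human, fly, identity_score))  # [(0, 10)]
```

In this toy alignment the first ten columns form one segment; the four matching residues after the gap never reach the threshold and are left out.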
Repetitiveness in human–D.melanogaster aligned sequences
(Figs 1–3)
A homology search of all the fruit fly proteins in the SwissProt
database versus human proteins, followed by re-alignment and
trimming according to protocol A, yielded 1007 aligned
segments, which have ∼25–100% amino acid identity with few
gaps, between the two organisms. (The aligned segments are
available on our Web site; see above.) Monotonous sequences
such as QQQQ… were removed using a filter (described under
a default condition on the BLAST web site; http://www.ncbi.nlm.nih.gov).
The neighbor residues of, for example, Q were analyzed as
follows (see also ref. 2 and our Web site). Given a Q residue at
position i, we looked at the amino acid at each nearby position
i + n (n = 1, … , 25 or more if appropriate). Such data for each
n value are summed over all the Q residues, so that we know
the amino acid ‘composition’ at a given distance (n) from Q. For
example, in a 20-residue sequence MRKRQHSAVQNQTKCYRKSA there are three Q residues (3/20 = 15%). If we look
at only the +2 position from each Q (residues 7, 12 and 14), we find S, Q and
K. Thus, at n = 2 of Q, Q occurs at 1/3 = 33.3%, about twice the
average (33.3/15 = 2.2).
We define this index (here 2.2) as the ‘normalized frequency
of Q at n = 2 of Q’. Such an analysis is performed for different
n values, for all amino acids, using all the sequences in the
alignments. Note that if there is no specific correlation between
residues, all of the indices should be 1.0. We thus use {FYX(i)/FY}
as the normalized frequency of Y at position i of X. In this
study we were mainly concerned with FXX(i), which is just a
special case of FYX(i), where X = Y. Thus
FXX(i) = {number of X residues at the i position from each of
the X residues} ÷ {total number of residues at the i position
from each of X residues}.
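The normalized frequency defined above can be computed with a few lines of code. The sketch below is illustrative only; the function name `normalized_frequency` is hypothetical, and skipping positions that run past the end of a sequence is an assumption about edge handling that the text does not specify.

```python
from collections import Counter


def normalized_frequency(seqs, x, y, n):
    """Frequency of residue y at offset +n from each x, divided by the
    overall frequency of y in the sequences (1.0 means no correlation)."""
    hits = total = length = 0
    composition = Counter()
    for s in seqs:
        composition.update(s)
        length += len(s)
        for i, res in enumerate(s):
            if res == x and i + n < len(s):  # skip offsets past the end
                total += 1
                hits += s[i + n] == y
    if total == 0 or composition[y] == 0:
        return float("nan")
    return (hits / total) / (composition[y] / length)


# The worked example from the text: Q at +2 of Q.
value = normalized_frequency(["MRKRQHSAVQNQTKCYRKSA"], "Q", "Q", 2)
print(round(value, 1))  # 2.2
```

Setting `x` equal to `y` gives the repetitiveness index for a single residue type; summing the counters over many aligned sequences corresponds to the cumulative analysis described in the text.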
Analysis of repetitiveness in DNA sequences (Fig. 3 and
Table 1)
We employed the ‘dinucleotide method’, by which two dinucleotide sequences are compared position-by-position and given a
score of 6 (3 for each position) when they are identical, –2 (–1 for
each) when neither position has the same base and 2 [i.e. 3 + (–1)]
when only one is identical. For example, given an AG at position
0, if another AG is found at position i, AG(i) is scored as 6. If
TG is found at position j, AG(j) is scored as 2, because AG and
TG are the same at only one position. In a manner similar to the
amino acid repetitiveness analysis, repetitiveness scores are
cumulatively calculated for 16 (= 4 × 4) individual dinucleotide
motifs (and for a different distance n). These profiles are
usually combined into one profile (as in Fig. 3) after weighting
according to the relative frequency of each motif.
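The scoring rule of the dinucleotide method can be written down compactly. The following is a sketch under stated assumptions: the names `dinucleotide_score` and `dinucleotide_profile` are hypothetical, and the profile is computed as a pooled mean over all query positions, which weights each of the 16 motifs by how often it occurs. Any further normalization used for the published profiles is not reproduced here.

```python
def dinucleotide_score(a, b):
    """Score two dinucleotides: +3 per identical position, -1 otherwise."""
    return sum(3 if p == q else -1 for p, q in zip(a, b))


def dinucleotide_profile(seq, max_n):
    """Mean score between the dinucleotide at i and the one at i + n,
    for each distance n = 1..max_n (pooled over all query positions)."""
    profile = {}
    for n in range(1, max_n + 1):
        scores = [
            dinucleotide_score(seq[i:i + 2], seq[i + n:i + n + 2])
            for i in range(len(seq) - n - 1)
        ]
        profile[n] = sum(scores) / len(scores) if scores else float("nan")
    return profile


print(dinucleotide_score("AG", "AG"))  # 6
print(dinucleotide_score("AG", "TG"))  # 2
print(dinucleotide_profile("AGAGAGAG", 2))  # {1: -2.0, 2: 6.0}
```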
Intra-exon and inter-exon analyses of amino acid
recurrence (Fig. 5)
GenBank was screened with the keywords ‘human’, ‘complete
cds’ (for title search) and ‘genomic’ (for text search) and all the
obtained genomic sequences (133 genes, 960 exons) were
compiled. For analyses of the effect of introns on repetitiveness in
the protein sequences, the presence of an exon boundary was
taken into account upon data collection: amino acid recurrence
was analyzed as described above but was sorted into two
matrices based on whether an exon boundary exists in the
interval. For example, if the above 20-residue sequence is
encoded by a gene segment with one exon boundary at position
H in MRKRQHSAVQNQTKCYRKSA, S7 (i.e. at n = 2 of Q5)
is encoded by a different exon from that encoding Q5. In this
case, distinct matrices were used for the position n = 2 of Q5
(that has S) and for position n = 2 of Q12 (that has K), because,
for the latter, there is no intron between Q12 and K14.
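The sorting of residue pairs into intra-exon and inter-exon matrices can be illustrated as below. This is a sketch, not the original program: the function name `recurrence_by_exon` and the representation of exon boundaries as a sorted list of 0-based start indices are assumptions, as is the choice to treat H6 as the first residue of the second exon in the worked example.

```python
import bisect


def recurrence_by_exon(seq, exon_starts, x, n):
    """Count recurrences of residue x at offset +n, split by whether an
    exon boundary lies between the two positions.

    exon_starts: sorted 0-based indices at which each exon begins.
    Returns {"intra": [hits, total], "inter": [hits, total]}.
    """
    counts = {"intra": [0, 0], "inter": [0, 0]}
    for i, res in enumerate(seq):
        j = i + n
        if res != x or j >= len(seq):
            continue
        same_exon = (bisect.bisect_right(exon_starts, i)
                     == bisect.bisect_right(exon_starts, j))
        key = "intra" if same_exon else "inter"
        counts[key][1] += 1
        counts[key][0] += seq[j] == x
    return counts


# Worked example from the text, with the second exon assumed to start at H6.
print(recurrence_by_exon("MRKRQHSAVQNQTKCYRKSA", [0, 5], "Q", 2))
# {'intra': [1, 2], 'inter': [0, 1]}
```

Here the pair Q5/S7 falls in the inter-exon matrix, while Q10/Q12 and Q12/K14 fall in the intra-exon matrix.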
Synonymous substitution rate and repetitiveness (Table 2
and Supplementary Material available at NAR Online)
We first collected all of the human/bovine/mouse sequence
alignments which can be mutually aligned with a BLAST
score (e-value) better (less) than e–200. The alignments were realigned using CLUSTALW. Because many alignments were
partial, we trimmed the alignment to obtain segments which
were longer than 50 residues and had amino acid identities
>95% with no gap. The alignment data are available on our
Web site (see above). The relationship between local repetitiveness and the occurrence of synonymous substitutions was
analyzed as described in Supplementary Data.
Protocol B: synonymous substitutions and local nucleotide
compositions
To analyze the association between the ‘within gene’ variation
of the nucleotide content and the pattern of synonymous
substitution, we used the following indices. (We first explain
the parameters and then show the algorithm for the analyses.)
Let Flocal(G|A→Gsynon), for example, denote the proportion of
nucleotide G in ‘neighbor nucleotides’ in the ancestral
sequences, given that an ancestral A has synonymously
changed to G. (As described in Supplementary Data, we used a
parsimony method and did not take into account the position
where the ancestral sequence cannot be determined.) Although
there are alternative ways which can seemingly be arbitrarily
chosen, in this study we used those 11 neighbor nucleotides
located at the positions shown by x in xxx xxx ABC Dxx xxx.
Note that C is the position that we examined for occurrence of
synonymous substitution. Position D, as well as A and B, were
not considered due to possible direct linkage resulting from the
dinucleotide motif and/or codon preference. For example, in
the following alignments
… AGG GTC CTA TCG TCG CGG CCA… (human)
… AGG GTC CTA TCG TCA CGG CCA… (bovine)
… AGG GTC CTA TCG TCA CGG CCA… (mouse),
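Flocal(G|A→Gsynon) for the human sequence is 27.3% (the A→G change lies at the third base of codon 5; 3 of the 11 neighbor nucleotides CTA TCG, GG and CCA are G, so 3/11 = 27.3%). Similarly, we define Flocal(G|A→A),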
see the underlined bases). Similarly, we define Flocal(G|A→A),
i.e. the local G content in the segments near the A which has
not changed despite the ‘chance’ for synonymous substitution
to G. For example, codon 3 in the above example is CTA
(encoding Leu), which can be synonymously substituted to
CTG. Considering the 11 neighbor nucleotides (AGG GTC
and CG TCA), Flocal(G|A→A) here becomes 4/11 = 36.4%.
Because both Flocal(G|A→Gsynon) and Flocal(G|A→A) are
likely to be under the influence of the GC content of the gene,
in this case we subtracted ‘G% of the gene’ from each. Thus, we
define, ∆Flocal(G|A→Gsynon) and ∆Flocal(G|A→A) as follows.
∆Flocal(G|A→Gsynon) = Flocal(G|A→Gsynon) – Fgene(G)
∆Flocal(G|A→A) = Flocal(G|A→A) – Fgene(G),
where Fgene(G) is the content (%) of G in the gene.
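The two worked values above (27.3% and 36.4%) can be reproduced with a short sketch. The names `NEIGHBOR_OFFSETS`, `f_local` and `delta_f_local` are hypothetical, the ancestral sequence is taken to be the bovine/mouse consensus of the example alignment, and truncating the neighbor set near sequence ends is an assumption that the text does not address.

```python
# Offsets of the 11 neighbor nucleotides relative to the examined codon
# position 3 (offset 0) in the pattern "xxx xxx ABC Dxx xxx".
# A (-2), B (-1) and D (+1) are deliberately excluded.
NEIGHBOR_OFFSETS = (-8, -7, -6, -5, -4, -3, 2, 3, 4, 5, 6)


def f_local(ancestral, pos, base):
    """Percentage of `base` among the neighbor nucleotides of `pos`."""
    neighbors = [
        ancestral[pos + o]
        for o in NEIGHBOR_OFFSETS
        if 0 <= pos + o < len(ancestral)  # assumption: truncate at the ends
    ]
    return 100.0 * neighbors.count(base) / len(neighbors)


def delta_f_local(ancestral, pos, base, gene):
    """Local content minus the gene-wide content (%) of `base`."""
    f_gene = 100.0 * gene.count(base) / len(gene)
    return f_local(ancestral, pos, base) - f_gene


# Ancestral sequence inferred from the example alignment (codon 5 = TCA).
ancestral = "AGGGTCCTATCGTCACGGCCA"
print(round(f_local(ancestral, 14, "G"), 1))  # 27.3 (A -> G in human)
print(round(f_local(ancestral, 8, "G"), 1))   # 36.4 (A unchanged)
```

In a real analysis `gene` would be the full coding sequence of the gene; the 21-nt fragment here is too short to give a meaningful gene-wide G content.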
While these two indices deal with each of the A residues
which can be synonymously substituted, in some analyses we
also used ∆Filocal(G|A→Gsynon) and ∆Filocal(G|A→A), each
denoting the mean value of ∆Flocal(G|A→Gsynon) and
∆Flocal(G|A→A) analyzed over the gene i.
To analyze the general trend of the substitution patterns and
nucleotide compositions, we also introduced indices which
consider all the patterns of transition: A→G, G→A, T→C and
C→T, i.e.
∆Flocal(J|I→Jsynon) = Flocal(J|I→Jsynon) – Fgene(J)
∆Flocal(J|I→I) = Flocal(J|I→I) – Fgene(J),
where the pair I and J denotes any of A and G, G and A, T and
C and C and T. Following the above rule, we used
∆Filocal(J|I→Jsynon) [and ∆Filocal(J|I→I)] to denote the local
richness of the nucleotide (in comparison with the percentage
of the relevant nucleotide over the gene), given the occurrence
(and absence) of synonymous substitutions.
RESULTS AND DISCUSSION
Repetitiveness in conserved sequences
A BLAST-based system (34) for sequence comparison was
used to find homologous protein sequences between human
and D.melanogaster. To detect any clustering between
different (or the same) amino acids in the sequences obtained,
we cumulatively calculated the frequency of occurrence of
different amino acids in the proximity (1–5 residues downstream)
of each amino acid type. The results, i.e. the average amino
acid composition near individual amino acid types, are shown
in Figure 1A. A tendency for amino acid repetitiveness is
evident (from the high repetitiveness score along the diagonal),
which is in agreement with our previous findings (5). While
Figure 1A shows the frequency of amino acid Y near amino
acid X [or, correctly, FYX(i)/FY (averaged over i = +1~5),
where the average frequency of Y normalizes the profile],
FXX(i), i.e. the proportion of X at position i from X, was also
calculated. Then the obtained profiles (for the 20 amino acids)
were combined as described in Materials and Methods. The
combined profiles [ΣX FXX(i)] of the human and D.melanogaster proteins show that D.melanogaster proteins tend to
have a higher level of amino acid repetitiveness than
homologous human proteins (Fig. 1B).
Note, however, that the analyses in Figure 1 were based on
the sequences of entire proteins, not conserved segments. To
examine whether repetitiveness can be found in conserved
segments, we performed the same analysis for aligned
segments trimmed using protocol A as described in Materials
and Methods. (These segments consist of 1007 pairs for which
the mean amino acid identity score was 63.3% and the average
number of gaps was only 0.88 per 1000 residues. For more
details see http://village.infoweb.ne.jp/~gene.) It is still
evident that in close proximity the same amino acid tends to be
used again more frequently than expected by chance (Fig. 2A).
The results were similar for human and D.melanogaster (not
shown), due to the high sequence similarity of the trimmed
alignments. Comparison between proximal and distant positions
for individual amino acids (Fig. 2B) shows that, except for a few
amino acids, the same type of amino acid tends to occur ∼5–20%
more frequently in close proximity (1–5 positions, closed
circles) than expected by chance. At an increasing distance the
relative frequency of the same amino acid tends to decrease
(open circles). Thus, in general, the same amino acids tend to
cluster, even in the conserved segments.
Figure 1. Amino acid occurrence near different amino acids. (A) Amino acid occurrence at positions +1,…, +5 (where + means downstream) from a given type of
amino acid. The results for human proteins which have at least partial homology with a D.melanogaster protein are shown. The scores are shown by color, indicating
the frequency as percent change from the frequency expected by chance. For example, occurrence of Q near E is shown in yellow, which corresponds to 10–20,
indicating a 1.10- to 1.20-fold frequency of occurrence of Q near E, compared with the average occurrence of Q. (B) Amino acid repetitiveness profile in the human
and D.melanogaster sequences used in (A). These profiles represent the occurrence of amino acid X (X is any of 20 amino acids) at the indicated position shown
on the abscissa from amino acid X. To obtain these profiles, we combined the 20 individual profiles for 20 amino acids after weighting according to the relative
frequency of each amino acid (see ref. 5 for more details). For example, a score of 1.36 at position +3 (D.melanogaster plot) means that, on average, amino acids
tend to be used 36% more frequently than average at position +3.
For a less ambiguous analysis, we classified the
conserved segments based on the levels of amino acid identity
and calculated the repetitive profiles of the classes, as shown in
Figure 2C. The tendency for recurrence of the same amino
acids is still present for well-conserved segments (>50% identity),
whereas ancient human proteins, consisting of proteins whose
Escherichia coli homologs are known, show a less significant
level of repetitiveness. Note, however, that the repetitiveness
scores in the proximity (the +1 to +10 positions) are generally
not as high in Figure 2C as in Figure 1B, supporting our
previous hypothesis that repetitiveness tends to be strong
where constraints are weak (5).
We also performed cumulative analyses of the recurrence of
dinucleotide motifs in the cDNA sequences that encode the
proteins used above (Fig. 3). Note that, due to the bias in codon
usage and in amino acid occurrence, 3n (i.e. 3, 6, 9, …) intervals
always tend to have a high score (37). (In Fig. 3 the points
scoring >0.2 are the results for 3n intervals.) Intriguingly, for
both the entire proteins (Fig. 3A) and the segments (Fig. 3B)
the ‘non-3n’ repetitiveness (as represented in CAG TCA)
appears higher in near than in more distant positions (see
points <0.1 in Fig. 3A and B). A statistical test verifies this
notion (i.e. scores at positions +2 to +20 are higher than those
at +50 to +75 with P < 0.01 for both Fig. 3A and B). Simulation
analyses show that this difference cannot be attributed to any
feature of the encoded amino acid sequences per se (2; our
unpublished data).
In the above analyses of the ‘3n intervals’ the DNA nucleotide
positions (1, 2 and 3) in the codon were not discriminated.
Hence, we split the 3n repetitiveness into three distinct phases,
1–2, 2–3 and 3–1, where each number indicates the nucleotide
position in a codon. The results for 2–3 and 3–1, in comparison
with the data for mononucleotide repetitiveness, suggest that
they reflect the repetitiveness of the bases at the codon 3 position (not shown). Mononucleotide analysis shows that while
both human and D.melanogaster show repetitiveness at the
codon 3 position, the local repetitiveness is clearer for human
(Fig. 4A). (Note that the general level of the profile is likely
associated with between gene heterogeneity in GC content,
because normalization by overall frequency of each nucleotide
type enhanced the human profile, as shown in Fig. 4B.)
Strikingly, mononucleotide repetitiveness at codon position 3
is generally insensitive to amino acid repetitiveness (Fig. 4C).
For example, while ancient proteins have weak amino acid
repetitiveness, repetitiveness of codon 3 bases of the human
genes for such proteins is comparable with that of human genes
encoding immune system-specific proteins (Fig. 4C) (for
immune system proteins see ref. 5). The results for (codon
positions) 1–2, when applied to these sets of genes, were
largely similar to those for amino acid repetitiveness (Fig. 4D
and data not shown). This finding, i.e. the discrepancy between
repetitiveness in amino acid sequences and that in the codon 3
bases, suggests that in human, repetitiveness can be partly
generated by a mechanism which does not require the repetitive use of a short motif consisting of a few nucleotides.
Figure 2. Local repetitiveness in the human protein segments that can be aligned with D.melanogaster sequences. (A) Amino acid occurrence at positions +1,…,
+5 for a given type of amino acids. Results are presented in the same way as in Figure 1A. (B) Normalized frequency of the indicated amino acid type in the
proximity (1–5 residues, filled circles) of and at positions more distant (20–30 residues, open circles) from the same amino acid. For example, in the proximity of
Q residues (see the filled circle in column Q), Q occurs at a 1.18-fold frequency of the average (= 1.0). The data for C (cysteine) were 1.9 (1–5 residues) and 1.4
(20–30 residues). (C) Amino acid repetitiveness of the human segments as in (A) but classified based on the level of identity between human and D.melanogaster.
(We did not obtain segments with 25–50% identity under our criteria which did not allow frequent gaps.) The results for the ancient human proteins are also shown
(blue line). For each group, the profiles for 20 amino acids [as in (A)] were combined into one profile as described (5). To reduce fluctuation due to the limited
number of total residues, the profiles were smoothed with a window of three neighbor positions (as described on our Web site).
Table 1. Repetitiveness in intron DNA sequences of human and D.melanogaster

                                    Average over intervals                   Permuted sequence(a)
                                    <12(b)    13–29     30–60     61–100     (mean ± SD)
Mononucleotide (1-tuple) (×100)
  Human                             28.645    27.307    26.973    26.702     26.620 ± 0.116
  D.melanogaster                    28.538    27.563    27.119    26.950     26.693 ± 0.066
2-tuple (×100)
  Human                              8.861     8.059     7.777     7.594      7.112 ± 0.073
  D.melanogaster                     8.865     8.057     7.688     7.532      7.119 ± 0.041
3-tuple (×100)
  Human                              2.922     2.468     2.302     2.203      1.905 ± 0.037
  D.melanogaster                     3.049     2.474     2.230     2.163      1.910 ± 0.025
4-tuple (×100)
  Human                              1.078     0.789     0.699     0.654      0.512 ± 0.020
  D.melanogaster                     1.204     0.792     0.668     0.635      0.512 ± 0.014
5-tuple (×100)
  Human                              0.444     0.279     0.225     0.196      0.139 ± 0.012
  D.melanogaster                     0.518     0.270     0.210     0.194      0.138 ± 0.007
Dinucleotide (×100)
  Human                             27.860    18.582    15.843    13.714     13.002 ± 0.886
  D.melanogaster                    27.247    20.405    16.794    15.331     12.975 ± 0.534

(a) Nucleotides were permuted within each intron and then intervals (positions) +10 to +30 were analyzed.
(b) For the k-tuple method, positions k–12 were considered. For example, for the 3-tuple method, positions 3–12 were analyzed, because positions
+1 and +2 overlap the query word itself.
We next considered the question of whether or not some
constraints on mRNA sequences such as local codon usage
contribute to the local repetitiveness shown above. Hence, we
collected many human sequences for which the positions of the
exon boundaries are known. The repetitiveness score for amino
acids (Fig. 5) and nucleotides (not shown) is markedly lower
beyond the exon boundary of the associated gene. (Note that
we here examine the similarity between those exons encoding
consecutive parts of the mRNA, not an exon and intron.) Thus,
exon boundaries disrupt the amino acid and nucleotide
repetitiveness. This finding argues that the tendency for amino
acid recurrence does not result mainly from selection regarding
the use of codons.
Local repetitiveness in DNA sequences: introns and codon
3 positions
To examine further the claim that DNA has its own tendency to
generate repetitiveness and that this is the primary force generating the major part of the amino acid repetitiveness, we
analyzed the local repetitiveness in intron sequences. Intron
sequences of 573 human and 542 D.melanogaster genes were
analyzed by several methods, such as the dinucleotide method,
mononucleotide method and k-tuple method (Table 1). (While
our dinucleotide method gives partial scores for imperfect
matches, the k-tuple method gives scores only for perfect
matches, as represented by ATC in ATCXXATC.) In general,
D.melanogaster and human show similar degrees of local
repetitiveness. (Table 1 also shows the level of repetitiveness
in artificial intron sequences, generated by permuting the
nucleotides randomly within each intron, as a control.)
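The k-tuple method and its permutation control can be sketched as follows. This is an illustration under assumptions: the names `k_tuple_recurrence` and `permuted` are hypothetical, and the score is taken to be the fraction of query positions whose k-mer recurs exactly at distance n (the tabulated values would then be this fraction multiplied by 100 and averaged over a range of intervals).

```python
import random


def k_tuple_recurrence(seq, k, n):
    """Fraction of positions i whose k-mer recurs exactly at i + n."""
    pairs = len(seq) - n - k + 1
    if pairs <= 0:
        return float("nan")
    hits = sum(seq[i:i + k] == seq[i + n:i + n + k] for i in range(pairs))
    return hits / pairs


def permuted(seq, rng):
    """Shuffle the nucleotides of one intron, preserving its composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# The ATCXXATC example from the text: ATC recurs at distance 5.
print(k_tuple_recurrence("ATCGGATC", 3, 5))  # 1.0

# Control: the same statistic on a composition-preserving shuffle.
rng = random.Random(0)
intron = "ATCATCATCGGCATTACGATC"
control = k_tuple_recurrence(permuted(intron, rng), 3, 3)
```

Because each intron is shuffled separately, the control keeps the between-intron differences in base composition and removes only the local ordering.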
Local base composition has some effect on substitution
patterns
As described in the next section, the absolute value of the
repetitiveness score is greatly influenced by the global
(isochore scale) unevenness in GC content. However, it seems
likely that local repetitiveness, as shown above, is independent
of the isochore effect and associated with the local difference
in substitution pattern. Hence, we compiled alignments of
homologs among human, bovine and mouse and, based on the
parsimony method, inferred substitution at codon position 3 in
each lineage. We then analyzed the local frequency of nucleotide J (J is either A, G, C or T), i.e. local J%, near the sites
where an I→J synonymous transition ({I,J} is any of {A,G},
{G,A}, {T,C} or {C,T}) has occurred in human. As a control,
we also analyzed the local J% near the site of I→I, i.e. positions
where the I→J transition has not occurred in human despite
there being a chance of it. (Only codon position 3 was analyzed
by protocol B as set out in Materials and Methods.) In general,
{local J% – gene J%} is greater for those segments subject to a
synonymous transition than that for those segments without such
transitions (P < 0.001 for human and bovine and P < 0.005 for
mouse; data not shown). Analyses of substitution rates for sites
with different local nucleotide compositions showed consistent
results. For example, when {local A% – gene A%} is >20%,
synonymous G→A transitions occurred at 6.97% of eligible G
residues (n = 1957), while the total rate was 5.98% (hypergeometric distribution, P < 0.001). Similarly, the A→G rate
was 8.34% for A residues in a G-rich region (control 6.50%),
T→C was 9.04% (control 8.53%) and C→T was 8.55%
(control 7.25%) (both P < 0.001). While these data are
consistent with the idea that substitutions tend to maintain
local unevenness in nucleotide composition, they do not rule
out the possibility that a very small number of genes predominantly contribute to the difference. Hence, we also calculated,
for each gene, the average scores of {local J% – gene J%} near
I→J and I→I and then the pairs of data were analyzed
(Table 2). Again, there was a difference (P < 0.01) between the
results for near I→J and near I→I. A weak but similar
tendency was observed for mouse. For bovine, the strength of
the local factor was comparable to that of human. We thus
concluded that, in general, those transitions which contribute
to the local bias (i.e. within gene heterogeneity) in nucleotide
composition are significantly more frequent than those which
homogenize the local bias.
Table 2. Local content(a) of nucleotide J near the site of a synonymous I→J
transition (where I→J is any of the transitions A→G, G→A, T→C and C→T)
and the site with no change (i.e. I→I)

                              {local J% – gene J%} ± SD (n)    t-test(b)
Human
  Near transition (I→J)       0.721 ± 4.72 (415)               P < 0.01
  No transition (I→I)         0.040 ± 1.42 (415)
Mouse
  Near transition (I→J)       0.50 ± 3.5 (510)                 P < 0.05
  No transition (I→I)         0.17 ± 1.6 (510)
Bovine
  Near transition (I→J)       0.68 ± 4.4 (457)                 P < 0.005
  No transition (I→I)         0.01 ± 1.5 (457)

(a) {local J% – gene J%} represents ∆Filocal(J|I→Jsynon) for near transition and
∆Filocal(J|I→I) for near no transition, which are as defined in Materials and
Methods. The pairs of indices were determined for each gene and then the
pairs were treated as independent data. Genes which had been subjected to a
very small number (<5) of synonymous transitions were not used.
(b) Significance of the difference between the data for transition and no transition.
Figure 3. Repetitiveness in cDNA sequences. Repetitiveness was analyzed by
the dinucleotide scanning method (5; see also Materials and Methods).
(A) Results for entire cDNAs encoding the proteins analyzed in Figure 1.
(B) Results for cDNA segments corresponding to protein segments analyzed
in Figure 2. Note that the general level of non-3n profiles is different between
(A) and (B), which is most likely caused by stronger constraints on the
conserved segments (B) regarding the base composition at each position of the
codons.
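The per-gene comparison summarized in Table 2 pairs, for each gene, the mean {local J% – gene J%} near I→J sites with the mean near I→I sites. The sketch below computes a paired t statistic for such data. It rests on assumptions: the source states only that a t-test was applied to the pairs, so the paired form is a guess, the function name `paired_t` is hypothetical and the input values are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(near_transition, no_transition):
    """Paired t statistic and degrees of freedom for per-gene value pairs."""
    diffs = [a - b for a, b in zip(near_transition, no_transition)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical per-gene means (not data from the paper).
near = [1.0, 2.0, 3.0, 4.0]
none = [0.5, 1.0, 2.5, 3.0]
t, df = paired_t(near, none)
print(round(t, 3), df)  # 5.196 3
```

Working gene by gene in this way addresses the concern raised in the text that a few genes with many substitutions could dominate a pooled comparison.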
Isochore and its effect on the repetitiveness and
substitution
While our study was originally intended to determine local
factors, our procedure was unexpectedly found to be convenient
for studying an isochore scale effect. Here we summarize our
findings concerning global scale GC content unevenness and
its effect on substitution. [Note that this section deals with the
global scale effect, not directly related to local repetitiveness.
The data for this section (Tables S1–S5) are available at NAR
Online as Supplementary Data.] First, human/bovine/mouse
alignments show that our repetitiveness score is greatly influenced by GC unevenness within the genome and the trend in
substitution pattern. For example, substitutions in the mouse
lineage resulted in a marked reduction in the repetitiveness
score while in the bovine lineage this decrease is small
(Table S1). This is because, for mouse, substitutions tend to
destroy the GC unevenness by GC homogenization (Tables S1
and S2; see also 38). Secondly, while our analyses were based
on well-conserved (slowly evolving) genes, the well-known
phenomenon of high substitution rates in murids is evident
(Table S1; 39,40). Thirdly, in human (but not in mouse), genes
with different GC contents are subject to differently biased
directions of substitution, which largely maintains the human
isochore structure (Tables S3–S5). Finally, despite the
isochore-related difference, a general trend of a bias towards
AT is observed for human (Table S3). This trend is found for
GC-poor, GC-medium and GC-rich genes alike (Tables S3 and
S5).
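As a rough illustration of how substitution direction can be tallied by gene GC class, consider the sketch below. It assumes each substitution has already been inferred and is supplied as (gene GC fraction, ancestral base, derived base). The class thresholds (0.50 and 0.70) and the function names are hypothetical and are not the boundaries used in Tables S3–S5.

```python
from collections import Counter

STRONG = {"G", "C"}  # G and C; A and T are treated as the complementary class


def direction(src, dst):
    """Classify one base substitution as 'GC->AT', 'AT->GC' or 'neutral'."""
    if src in STRONG and dst not in STRONG:
        return "GC->AT"
    if src not in STRONG and dst in STRONG:
        return "AT->GC"
    return "neutral"  # G<->C or A<->T: GC content unchanged


def gc_class(gc, low=0.50, high=0.70):
    """Assign a gene to a GC class; thresholds are illustrative assumptions."""
    if gc < low:
        return "GC-poor"
    if gc < high:
        return "GC-medium"
    return "GC-rich"


def tally(records):
    """Count substitution directions per GC class.

    `records` is an iterable of (gene_gc, ancestral_base, derived_base).
    """
    table = {}
    for gc, src, dst in records:
        table.setdefault(gc_class(gc), Counter())[direction(src, dst)] += 1
    return table
```

Comparing the GC->AT and AT->GC counts within each class shows whether the direction of substitution depends on the GC content of the gene.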
These findings indicate that stationarity is unlikely to hold
for either the human or the mouse genes that we analyzed.
It has been shown that in human HLA genes more GC→AT
mutations have occurred, but that polymorphic alleles generated
by AT→GC mutations tend to segregate (spread in the population)
to a greater extent than do GC→AT mutant genes (41).
Although such findings provide evidence for the presence of
selection favoring GC mutation at codon position 3, it is not
clear to what extent the condition of stationarity (which is
required for the approach as in ref. 39) is also satisfied for
non-HLA genes. The fact that there is no correlation between GC
content and substitution rate (42) suggests that the general
trend towards AT substitutions is not specific to those slowly
evolving genes we analyzed. A careful examination of stationarity
utilizing multiple species alignments may be helpful.
Figure 4. Repetitiveness of codon 3 bases and codon position 1–2 bases. (A) Codon 3 bases of human and D.melanogaster genes. (B) As (A) but data normalized
to the total frequency of each base prior to summing over individual base profiles. (C) Codon position 3 bases of human ancient and immune system-specific genes.
(D) Codon 1–2 bases of the genes analyzed in (C).
Figure 5. Effect of exon boundaries on amino acid repetitiveness. Intra-exon
analysis was performed in a manner similar to that for Figures 1B and 3, while
inter-exon analysis dealt only with the cases where an exon boundary was
found between the two amino acids concerned.
Although it seems difficult to draw any conclusions from our
data regarding the selection/mutation debate (41), we surmise
that mutation plays a greater role than selection. First, the presence of
both a general trend towards GC→AT substitutions (Table S3)
and a marked difference in substitution pattern among
isochores with different GC contents (Tables S3 and S5) seems
difficult to explain by selection theory. Secondly, if
selection is the primary factor, it is difficult to explain why
mouse, bovine and human have totally different general trends
and intensity of the isochore-specific effect (Tables S3 and S4
and data not shown). Notably, a recent study on pseudogenes
strongly argues that mutation is the major factor (43).
There is a regional difference in substitution rate within the
human genome, suggesting that substitution rate is greatly
affected by mutation rate, which varies among regions of the
genome (42). Interestingly, the isochore structure does not
overlap the regional difference in substitution rate over the
human genome. The efficiency of the DNA repair system
varies over the genome, potentially explaining the regional
differences in substitution rate (42,44). Together with these
findings, the complex nature of the substitution pattern shown
in our data leads us to surmise that there are many factors
influencing the mutation rate and pattern.
Conclusions and further questions
We summarize the above findings as follows. First, any type of
local repetitiveness we have described in this study is different
from the ‘isochore (>100 kb) scale’ GC content unevenness,
because all the local repetitiveness is found on the scale of
<∼20 residues (<∼60 nt). Second, with respect to local
repetitiveness in DNA sequences, the human appears to have at
least two types of repetitiveness, one of which is not found in
D.melanogaster cDNAs, i.e. while introns of human and
D.melanogaster have similar patterns of repetitiveness (Table 1),
codon base 3 repetitiveness has a markedly different pattern
(Fig. 4A and B). In support of this, human ancient genes
(i.e. homologs of E.coli genes) have a significant level of
repetitiveness in codon base 3 (Fig. 4C), despite the fact that
the amino acid sequences and nucleotides at codon positions 1
and 2 have no significant level of repetitiveness (Fig. 4D and
unpublished data). Third, nucleotide substitution analyses
showed, rather unexpectedly, that the global GC content is an
influential factor strongly associated with the direction of
substitutions. The mouse has a different strategy for genomic
evolution from the human; the substitution patterns in mouse
are not as dependent on global GC% as those in human (Tables S3
and S4). Finally, the data nonetheless show that some part of
the bias in substitution pattern is associated not with global but
with local nucleotide content.
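To make the notion of local recurrence on a scale of up to about 20 residues concrete, the sketch below computes, for each separation d, the ratio of observed same-residue pairs to the number expected from overall composition. This is an illustrative measure only. It is not the repetitiveness score defined in Materials and Methods, and the function name `recurrence_profile` is hypothetical.

```python
from collections import Counter


def recurrence_profile(seq, max_dist=20):
    """Observed/expected ratio of same-residue pairs at each distance d.

    Expected counts assume positions are independent draws from the
    sequence's own composition, so a ratio above 1 at short distances
    indicates local recurrence of residues.
    """
    n = len(seq)
    counts = Counter(seq)
    # Probability that two independently drawn residues are identical.
    p_same = sum((c / n) ** 2 for c in counts.values())
    profile = {}
    for d in range(1, min(max_dist, n - 1) + 1):
        pairs = n - d  # number of position pairs (i, i + d)
        observed = sum(seq[i] == seq[i + d] for i in range(pairs))
        profile[d] = observed / (pairs * p_same)
    return profile
```

The same function works on amino acid strings or on a string of codon position 3 bases, which allows the two kinds of profile to be compared over the same range of distances.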
In our view, repetitiveness in amino acid sequences is largely
DNA sequence directed and similar to the repetitiveness in
introns in the sense of evolutionary distribution and possible
underlying mechanism; both could result from strand slippage,
although introns specifically prefer purine–pyrimidine dinucleotide motifs, such as CACACA (37). However, while strand
slippage is generally believed to change the length, our data
imply that some strand slippage does not require a change in
length.
Besides such slippage-like mechanisms, we also propose
that amino acid repetitiveness may be partly generated in
human by a local bias in substitution pattern and thus bias in
nucleotide composition. We previously showed that locally
GC-rich segments tend to encode R (Arg) (3). (On the local
scale G content and C content are not coupled. Therefore, GC-rich
segments are generated as an overlap of G-rich and C-rich
segments.) Therefore, some amino acid repetitiveness should
result from nucleotide composition bias. Thus, amino acid
repetitiveness is likely to be generated by at least two genomic
causes. One is sequence-directed mutagenesis, which is
possibly related to slippage. The other is a local bias in nucleotide content, which, as we show, is generated by a bias in
nucleotide substitution.
Due to the limited data that could be used for substitution
analysis, much remains unresolved. For example, we cannot
determine the relative contributions of the above factors
to amino acid repetitiveness. In the future,
advanced statistical methods, applied to a further
expanded genome database, especially of mammals,
may allow a more detailed examination regarding the relative
strength of the local as opposed to the global scale correlation
and assessment of various local factors. While the substitution
analyses in this study were concerned mostly with isolated
substitution events, in the future we may be able to use sufficient data for the analysis of double substitutions (such as
AA→GG; see 45), which may provide more information about
sequence-directed mutagenesis.
SUPPLEMENTARY MATERIAL
See Supplementary Material available at NAR Online.
ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their insightful
comments. This study was supported in part by Grants-in-Aid
for Scientific Research from the Ministry of Education Science
and Culture, Japan.
REFERENCES
1. Green,H. and Wang,N. (1994) Proc. Natl Acad. Sci. USA, 91, 4298–4302.
2. Karlin,S. and Burge,C. (1996) Proc. Natl Acad. Sci. USA, 93, 1560–1565.
3. Nishizawa,M. and Nishizawa,K. (1998) J. Mol. Evol., 47, 385–393.
4. Nishizawa,M. and Nishizawa,K. (1999) Proteins, 37, 284–292.
5. Nishizawa,K., Nishizawa,M. and Kim,K.S. (1999) J. Mol. Biol., 294,
937–953.
6. Tautz,D., Trick,M. and Dover,G.A. (1986) Nature, 322, 652–656.
7. Hancock,J.M. (1996) Bioessays, 18, 421–425.
8. Ohno,S. (1984) J. Mol. Evol., 20, 313–321.
9. Haring,D. and Kypr,J. (1999) J. Biomol. Struct. Dyn., 17, 267–273.
10. Levinson,G. and Gutman,G.A. (1987) Mol. Biol. Evol., 4, 203–221.
11. Sia,E.A., Jinks-Robertson,S. and Petes,T.D. (1997) Mutat. Res., 383,
61–70.
12. Dib,C., Faure,S., Fizames,C., Samson,D., Drouot,N., Vignal,A.,
Millasseau,P., Marc,S., Hazan,J., Seboun,E., Lathrop,M., Gyapay,G.,
Morissette,J. and Weissenbach,J. (1996) Nature, 380, 152–154.
13. Tautz,D. (1989) Nucleic Acids Res., 17, 6463–6471.
14. Cooper,G., Amos,W., Hoffman,D. and Rubinsztein,D.C. (1996)
Hum. Mol. Genet., 5, 1759–1766.
15. Kimmel,M., Chakraborty,R., Stivers,D.N. and Deka,R. (1996) Genetics,
143, 549–555.
16. Slatkin,M. (1995) Genetics, 139, 457–462.
17. Nielsen,R. (1997) Genetics, 146, 711–716.
18. Zhivotovsky,L.A. and Feldman,M.W. (1995) Proc. Natl Acad. Sci. USA,
92, 11549–11552.
19. Sinden,R.R. (1999) Am. J. Hum. Genet., 64, 346–353.
20. Chastain,P.D. and Sinden,R.R. (1998) J. Mol. Biol., 275, 405–411.
21. Wells,R.D. and Warren,S.T. (eds) (1998) Genetic Instabilities and
Hereditary Neurological Diseases. Academic Press, San Diego, CA.
22. Fu,Y.H., Pizzuti,A., Fenwick,R.G., King,J., Rajnarayan,S., Dunne,P.W.,
Dubel,J., Nasser,G.A., Ashizawa,T., de Jong,P., Wierringa,B.,
Korneluk,R., Perryman,M.B., Epstein,H.F. and Caskey,C.T. (1992)
Science, 255, 1256–1258.
23. Weber,J.L. and Wong,C. (1993) Hum. Mol. Genet., 2, 1123–1128.
24. Eriksson,M., Ansved,T., Edstrom,L., Anvret,M. and Carey,N. (1999)
Hum. Mol. Genet., 8, 1053–1060.
25. Caskey,C.T., Pizzuti,A., Fu,Y.H., Fenwick,R.G.,Jr and Nelson,D.L.
(1992) Science, 256, 784–789.
26. Pimentel,M.M. (1999) Int. J. Mol. Med., 3, 639–645.
27. Stallings,R.L. (1994) Genomics, 21, 116–121.
28. De Michele,G., Cavalcanti,F., Criscuolo,C., Pianese,L., Monticelli,A.,
Filla,A. and Cocozza,S. (1998) Hum. Mol. Genet., 7, 1901–1906.
29. Pupko,T. and Graur,D. (1999) J. Mol. Evol., 48, 313–316.
30. Mar Alba,M., Santibanez-Koref,M.F. and Hancock,J.M. (1999) J. Mol. Evol.,
49, 789–797.
31. Goldman,G.B. and Glickman,B.W. (1985) Proc. Natl Acad. Sci. USA, 82,
8577–8581.
32. Hancock,J.M. (1995) J. Mol. Evol., 41, 1038–1047.
33. Nei,M. (1987) Molecular Evolutionary Genetics. Columbia University
Press, New York, NY, pp. 19–38.
34. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990)
J. Mol. Biol., 215, 403–410.
35. Higgins,D.G., Thompson,J.D. and Gibson,T.J. (1996) Methods Enzymol.,
266, 383–402.
36. Henikoff,S. and Henikoff,J.G. (1993) Proteins, 17, 49–61.
37. Arques,D.G. and Michel,C.J. (1987) Nucleic Acids Res., 15, 7581–7592.
38. Mouchiroud,D., Robinson,M. and Gautier,C. (1997) Gene, 205, 317–322.
39. Britten,R.J. (1986) Science, 231, 1393–1398.
40. Li,W.-H., Tanimura,M. and Sharp,P.M. (1987) J. Mol. Evol., 25, 330–342.
41. Eyre-Walker,A. (2000) Genetics, 152, 675–683.
42. Matassi,G., Sharp,P.M. and Gautier,C. (1999) Curr. Biol., 9, 786–791.
43. Francino,M.P. and Ochman,H. (1999) Nature, 400, 30–31.
44. Hanawalt,P.C. (1989) Genome, 31, 605–611.
45. Averof,M., Rokas,A., Wolfe,K.H. and Sharp,P.M. (2000) Science, 287,
1283–1286.