Volume 10 Number 8 1982
Nucleic Acids Research
Comparative analysis of nucleic acid sequences by their general constraints
David J. Lipman and Jacob Maizel
Mathematical Research Branch, N.I.A.D.D.K., Building 31, Room 4B-54, and Laboratory of
Molecular Genetics, N.I.C.H.D., Bethesda, MD 20205, USA
Received 8 January 1982; Revised and Accepted 23 March 1982
ABSTRACT
We describe two measures of a nucleic acid sequence, derived from
Information Theory, which characterize the constraints toward nonuniform base
composition, and the constraints on the ordering of the bases. These two
measures distinguish extra-chromosomal coding sequences from all other coding
sequences examined.
The two measures separate eukaryotic coding sequences
into two groups: those with introns and those without introns. We have also
found a relationship between the general constraints of a subsequence and its
degree of conservation in related genes.
INTRODUCTION
The sequence of bases in a nucleic acid is the result of an interplay
between random mutations and functional constraints. We have used two global
measures, derived from Information Theory, to assess the intensity of
constraints acting on a sequence.
The first measure characterizes the
constraints acting toward a nonuniform base composition and the second measure
characterizes the constraints on the ordering of the bases.
With these two measures, we have found an essential difference between the
constraints acting on extra-chromosomal coding sequences and those acting on
prokaryotic, bacteriophage, eukaryotic viral and eukaryotic coding sequences.
Analysis of eukaryotic coding sequences with
these two measures separates them into two groups: those with introns, and
those without introns. We have also used these two measures to analyze the
subsequences of related genes. We detected similar constraints in regions
whose sequence is poorly conserved.
The constraints acting on the poorly
conserved subsequences appear quite different from those constraints acting on
well conserved subsequences.
© IRL Press Limited, 1 Falconberg Court, London W1V 5FG, U.K. 0305-1048/82/1008-2723$2.00/0
BACKGROUND
There is an overlap in the terms used in Information Theory and those of
Biology.
The concepts of Information Theory to be discussed in this paper have a
precise mathematical definition. However, this application of Information
Theory to biological systems is new, and the significance of these concepts
remains unclear. Though we shall later show some interesting and consistent
correlations between measures of nucleic acid sequences and the concepts of
Information Theory, one must be careful with its terminology as it is applied
to biological systems.
Entropy, H, of a probability distribution was defined in 1948 by Shannon (1)
as the "information" of the distribution. The validity and utility of equating
entropy with information can be demonstrated as follows. Consider two nucleic
acid sequences; sequence #1 has a base composition of 25% each base, and
sequence #2 is 100% adenine. Once one knows that sequence #2 is 100% adenine,
no additional information can be obtained by examining the succession of its
bases, since the sequence is unique. However, knowledge of the base
composition of sequence #1 tells very little about its base sequence, because
many sequences are possible having that same base composition. Each base in
sequence #1 is important in identifying that sequence among those
possibilities. One gains more information from an identification of each base
of sequence #1 than from each base of sequence #2.
Another aspect of communication that Information Theory deals with is
redundancy. Because of the redundancy in the English language, many letters
can be struck out of a message but it can still be understood. Redundancy
improves the reliability of reception of the message, over one which has no
redundancy, when there is some degree of "noise" in the transmission. This
improvement is achieved, however, at the expense of the quantity of
"information" in the message, as will be seen later.
One might speculate whether a demand for increased reliability is at the
level of DNA replication, protein conformation, protein-nucleic acid
interaction, or simply a relative resistance to the "noise" of random
mutations. These demands for reliability manifest themselves as constraints
on the evolution of the nucleic acid sequence. At this point, the measure of
redundancy alone does not specify the level in the biological system which
imposes these constraints.
The Information Theory parameters are based mainly on singlet, doublet,
triplet, ..., n-tuple counts. A sequence at any point in evolutionary time
evolves under many different or overlapping constraints, each constraint
making a varying contribution to the form of the sequence. Some constraints
may operate on the level of the individual bases and have little effect on the
relationship of neighboring bases. An example of such a constraint would be a
requirement for relatively weak base pairing in a region of double stranded
DNA. This constraint could work toward a high A,T base content without having
a specific effect on the relationship of neighboring bases. One might propose
that the codon table imposes a strong constraint on the relationship of
neighboring bases but, because of the degeneracy of the third codon
position, the codon table imposes a weaker constraint on the relationship of
doublets to their neighboring base. The relationship between measures based
on the different tuple counts can give a sense of the relative contributions
of the various constraints acting on a sequence.
As defined by Shannon (1), entropy,

$$H = -\sum_{i=1}^{4} p_i \log p_i \qquad \text{(equation 1)}$$

where logarithms are to base 2 and H is measured in bits. Maximum entropy for
nucleic acid sequences, Hmax, occurs with 25% each base and is equal to 2
bits.
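As an illustration of equation 1 (a minimal sketch, not part of the original analysis), the entropy of a base composition can be computed directly from base counts. The function name `base_entropy` and the two test sequences are assumptions chosen to mirror the 25%-each-base and 100%-adenine example given earlier.

```python
from collections import Counter
from math import log2


def base_entropy(seq: str) -> float:
    """Shannon entropy H (bits per base) of the base composition of seq.

    Uses H = sum_i p_i * log2(1 / p_i), which equals -sum_i p_i * log2(p_i).
    """
    counts = Counter(seq.upper())
    n = sum(counts.values())
    return sum((c / n) * log2(n / c) for c in counts.values())


H_MAX = log2(4)  # 2 bits: four equally frequent bases

uniform = "ACGT" * 25   # hypothetical sequence with 25% each base
poly_a = "A" * 100      # hypothetical sequence of 100% adenine

print(base_entropy(uniform))  # 2.0
print(base_entropy(poly_a))   # 0.0
```

The uniform composition reaches Hmax, while the all-adenine sequence carries no information per base.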
One can calculate an entropy based on overlapping doublet counts,
Hm(1); triplet counts, Hm(2); n-tuple counts, Hm(n-1), where Hm(1) is defined
as:

$$H_m(1) = -\sum_{i,j=1}^{4} p_{ij} \log p_{i/j} \qquad \text{(equation 2)}$$

where $p_{ij}$ = the probability of doublet ij, and $p_{i/j}$ = the
probability of base j following base i (2, pp. 47-72).
Gatlin proposed two measures of redundancy, D1 and D2, defined as
follows:

$$D1 = H_{\max} - H \qquad \text{(equation 3)}$$

$$D2 = H - H_m(1) \qquad \text{(equation 4)}$$

(2, pp. 73-105).
D1 represents the divergence from maximum entropy due to base composition, and
reflects the constraints toward nonuniform base composition. D2 is a measure
of the divergence from independence of the bases. If the bases are
independent, then by definition $p_{ij} = p_i p_j$ and $p_{i/j} = p_j$,
equation 2 reduces to equation 1, and D2 = 0. For example, as D2 increases,
the probability of a T at a particular sequence position increasingly depends
on at least its nearest neighbors.
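The reduction of equation 2 to equation 1 under independence can be written out in one short step. This is a worked check using only the definitions already given; the notation follows equations 1 to 4.

```latex
\begin{aligned}
H_m(1) &= -\sum_{i,j=1}^{4} p_{ij}\,\log p_{i/j}
        = -\sum_{i,j=1}^{4} p_i\,p_j \log p_j \\
       &= -\Big(\sum_{i=1}^{4} p_i\Big)\sum_{j=1}^{4} p_j \log p_j
        = -\sum_{j=1}^{4} p_j \log p_j = H, \\
D2 &= H - H_m(1) = 0 .
\end{aligned}
```

The second line uses only the fact that the base probabilities sum to 1.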
Formally, Shannon's measure for R, the total redundancy, is

$$R = (D1 + D2)/H_{\max} \qquad \text{(equation 5)}$$

Hmax is a constant and from here on we will consider R, the total redundancy,
to be:

$$R = D1 + D2 \qquad \text{(equation 6)}$$

From equations 3 and 4, it can be seen that R, the total redundancy, should
also be:

$$R = H_{\max} - H_m(1) \qquad \text{(equation 7)}$$

Therefore, the degree of divergence from 25% each base, and the degree of
divergence from independence of the bases, determine the magnitude of Hm(1)
with respect to Hmax.
To summarize, D1 is the divergence from maximum entropy due to base
composition. D2 is the divergence from independence of the bases. R, the
total redundancy, is the total divergence from maximum entropy. D1 measures
the constraints toward nonuniform base composition, D2 measures the
constraints on the ordering of the bases, and R measures the total
constraints.
METHODS
To calculate D1 and D2 on a nucleic acid sequence, singlet counts and
overlapping doublet counts are tabulated for a subsequence. H is estimated by
the following:

$$H = -\sum_{i=1}^{4} \frac{n_i}{N} \log\frac{n_i}{N} \qquad \text{(equation 8)}$$

where $n_i$ = the number of occurrences of base i, and N = the total number of
bases in the subsequence. Hm(1) is estimated by the following:

$$H_m(1) = -\sum_{i,j=1}^{4} \frac{n_{ij}}{N} \log\frac{n_{ij}}{n_i} \qquad \text{(equation 9)}$$

where $n_{ij}$ = the number of occurrences of doublet ij, $n_i$ = the total
number of occurrences of base i in the entire subsequence, and N = the total
number of overlapping doublets in the subsequence.
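The estimators in equations 8 and 9 can be sketched in a few lines. This is an illustrative reading of the formulas, not the original HP 9845A program; the function name `d1_d2` is an assumption, and the input is assumed to contain at least two bases.

```python
from collections import Counter
from math import log2

H_MAX = 2.0  # bits; maximum entropy for four bases


def d1_d2(seq: str) -> tuple[float, float]:
    """Estimate (D1, D2) in bits from singlet and overlapping doublet counts."""
    seq = seq.upper()
    singles = Counter(seq)                 # n_i
    n_bases = len(seq)                     # N in equation 8
    # Equation 8, written as sum p*log2(1/p) to avoid a negative zero.
    h = sum((c / n_bases) * log2(n_bases / c) for c in singles.values())

    doublets = Counter(zip(seq, seq[1:]))  # n_ij over overlapping doublets
    n_doublets = n_bases - 1               # N in equation 9
    # Equation 9: -sum (n_ij/N) log2(n_ij/n_i) = sum (n_ij/N) log2(n_i/n_ij).
    hm1 = sum((c / n_doublets) * log2(singles[i] / c)
              for (i, _j), c in doublets.items())

    return H_MAX - h, h - hm1              # equations 3 and 4


d1, d2 = d1_d2("ACGT" * 30)
print(round(d1, 4), round(d2, 4))
```

A strictly repeating sequence such as `"ACGT" * 30` has a uniform base composition (D1 near 0) but an almost fully determined ordering (D2 near 2 bits). Because n_i counts the final base, which starts no doublet, small edge effects can appear on short inputs.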
A qualitative analysis of the flow of information along a sequence was
obtained by measuring D1 and D2 in overlapping 120 base segments from one end
of the sequence to the other and then plotting the resulting D1 and D2 values
against sequence position. The D1 and D2 values were plotted above the
position of the midpoint of the subsequence sampled. Generally, 20 base jumps
were used between overlapping subsequences. For example, on a 1000 base long
sequence, the first sample includes bases #1-120, and D1 and D2 are plotted
above position 60; the second sample includes bases #21-140, and D1 and D2 are
plotted above position 80. Alternatively, D1 and D2 would be measured on just
the coding regions. For example, to measure the D1 and D2 for the beta-globin
genes studied, coding regions #1, #2, and #3 were grouped as one sample; the
two introns were excluded.
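The window layout described above can be sketched as follows. The generator name `window_midpoints` is an assumption; coordinates are 1-based and inclusive to match the worked example (bases #1-120 plotted above position 60, bases #21-140 above position 80).

```python
def window_midpoints(seq_len: int, window: int = 120, step: int = 20):
    """Yield (start, end, midpoint) for overlapping windows, 1-based inclusive."""
    for start in range(1, seq_len - window + 2, step):
        end = start + window - 1
        midpoint = start + window // 2 - 1   # 60 for the window 1-120
        yield start, end, midpoint


windows = list(window_midpoints(1000))
print(windows[0])    # (1, 120, 60)
print(windows[1])    # (21, 140, 80)
print(len(windows))  # 45
```

Each slice `seq[start - 1:end]` would then be passed to whatever D1, D2 estimator is in use, and the two values plotted above `midpoint`.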
D1 and D2 are statistical measures and it is important to know their
variability due to sampling error. We have conducted simulations which
generate sequences with known D1 and D2. At sample sizes of 300 bases,
measurements of D1 and D2 are accurate to within approximately ±.01 bits. At
sample sizes less than 300 bases, reliability decreases and a systematic
overestimation in the magnitude of D1 and D2 becomes significant.
In the graphical analysis, all measures are made on the same sample
size; the systematic overestimation is the same for all samples. To estimate
the reliability of measures on sample sizes of 120 bases, the mouse beta major
chain gene for hemoglobin was randomly shuffled and overlapping samples were
measured as in the graphical analysis. (The expected D1, D2 values should then
be the same everywhere in the sequence.) The standard deviation for the D1
measure was .018 bits, and for the D2 measure, the standard deviation was .024
bits. We will consider as statistically significant those differences in
amplitude of at least 3 standard deviations (.075 bits for D2).
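The shuffling control can be sketched along the following lines. This is a hedged illustration only: the function names, the fixed random seed, and the use of a synthetic test sequence are assumptions, and the numbers it produces are not expected to reproduce the .018 and .024 bit values reported for the mouse gene.

```python
import random
from collections import Counter
from math import log2
from statistics import stdev


def d1_d2(seq):
    """(D1, D2) in bits from singlet and overlapping doublet counts."""
    singles = Counter(seq)
    n = len(seq)
    h = sum((c / n) * log2(n / c) for c in singles.values())
    doublets = Counter(zip(seq, seq[1:]))
    hm1 = sum((c / (n - 1)) * log2(singles[i] / c)
              for (i, _j), c in doublets.items())
    return 2.0 - h, h - hm1


def shuffled_window_sd(seq, window=120, step=20, seed=0):
    """Shuffle seq once, then return the standard deviations of D1 and D2
    over overlapping windows of the shuffled sequence."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    bases = list(seq)
    rng.shuffle(bases)                 # destroys order, keeps composition
    shuffled = "".join(bases)
    values = [d1_d2(shuffled[s:s + window])
              for s in range(0, len(shuffled) - window + 1, step)]
    d1s, d2s = zip(*values)
    return stdev(d1s), stdev(d2s)


# Hypothetical 1000-base input standing in for a real gene sequence.
demo = "".join(random.Random(1).choices("ACGT", k=1000))
sd_d1, sd_d2 = shuffled_window_sd(demo)
print(sd_d1, sd_d2)
```

Three times the returned standard deviation would then serve as the significance threshold for differences in amplitude between windows.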
Analysis of homology between sequences was done with the dot matrix
program of J. Maizel (3). The dot matrix technique allows a rapid,
qualitative analysis of homology. In this program, every base of one sequence
(sequence #1) is compared to every base of another sequence (sequence #2), and
the results are plotted as a dot matrix diagram. The columns of the matrix
represent each base of sequence #1, and the rows represent each base of
sequence #2. If base #30 in sequence #1 matches base #50 in sequence #2, then
a dot is placed in column #30, row #50 of the matrix diagram. Homology is
indicated by an increased density of dots along a -45° diagonal; perfect
homology appears as a solid line. One can reduce the total number of dots on
the matrix by requiring a set number of consecutive base matches for a dot to
appear. This is useful when comparing two highly homologous sequences.
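The dot matrix comparison described above can be sketched as a set of (column, row) coordinates. This is not the program of reference 3; the function name `dot_matrix` and the choice to place the dot at the first base of a matching run are assumptions, since the text does not say where in a run the dot is drawn.

```python
def dot_matrix(seq1: str, seq2: str, min_run: int = 1) -> set[tuple[int, int]]:
    """Return 1-based (column, row) positions where min_run consecutive
    bases of seq1 (columns) match seq2 (rows), starting at that position."""
    dots = set()
    for col in range(len(seq1) - min_run + 1):
        for row in range(len(seq2) - min_run + 1):
            if seq1[col:col + min_run] == seq2[row:row + min_run]:
                dots.add((col + 1, row + 1))
    return dots


# Identical sequences give a solid diagonal.
print(sorted(dot_matrix("ACGT", "ACGT")))
# Requiring 3 consecutive matches removes chance single-base dots.
print(sorted(dot_matrix("ACGT", "TACG", min_run=3)))  # [(1, 2)]
```

This direct comparison takes time proportional to the product of the two sequence lengths, which is adequate for sequences of a few thousand bases.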
All the sequences studied in this paper were obtained from the nucleic
acid sequence data bank at the National Biomedical Research Foundation (4).
References for the sequences are available from the data bank.
Most of the calculations were done on the HP 9845A, and graphs were
generated on the HP 9872S plotter.
RESULTS
Table 1 contains the sequences analyzed by the graphical method only.
Tables 2A-C contain the D1 (constraints toward nonuniform base composition)
and D2 (constraints on the ordering of the bases) values calculated on coding
regions only. The sequences on Tables 2A-C were studied with the graphical
method as well. In figure 1, we have plotted R (the total constraints) versus
D2 (constraints on the ordering of the bases) from the values on Tables 2A-C.
From both Tables 2A-C and figure 1, it can be seen that the
extra-chromosomal coding sequences are the only group in which D1 has failed
to remain small with respect to the range of values of D2. This was true for
the extra-chromosomal sequences analyzed by the graphical method alone (table
1). It should also be noted that the eukaryotic coding sequences have higher
R (the total constraints), relative to all other groups, while their D1
(constraints toward nonuniform base composition) has remained small.
Figure 2 is a plot of D1 versus D2 for eukaryotic coding regions listed
on Tables 2A-B. The sequences cluster into two groups, above and below the
diagonal. Above the diagonal, except for two interferon sequences and the rat
somatotropin precursor mRNA, are exclusively coding regions from genes with
introns. Below the diagonal, with the exception of the insulin precursor gene
and the yeast actin gene, are coding regions from genes without introns. The
human corticotropin-lipotropin gene was not plotted, but it would group with
Table 1
Sequences Analyzed by Graphical Methods Only
Hemoglobin Alpha Chain Pseudogene - Human
12S and 16S rRNA and tRNA Genes- Human Mitochondrion
Alpha-Amylase, Pancreatic, mRNA - Mouse
Alpha-Amylase, Salivary, mRNA - Mouse
Hemoglobin Alpha Chain Gene - Mouse
Corticotropin-Lipotropin Precursor mRNA - Bovine
Somatostatin Precursor mRNA - Angelfish
Minicircle 51 DNA (0.98 kb) - Trypanosoma brucei Kinetoplast
Minicircle 201 DNA (1.0 kb) - Trypanosoma brucei Kinetoplast
Genome - Cauliflower Mosaic Virus
Table 2A
D1 and D2 for Coding Regions Only

 #  Organism: Gene                                            Region         # of Bases  D1     D2

Eukaryotic Genes with Introns
 1  Mouse Ig heavy chain V region germline gene PJ14          Coding                345  .0014  .0919
 2  Mouse Ig kappa chain V region germline gene K2            Coding                345  .0034  .1154
 3  Human hemoglobin β chain gene                             Coding                441  .0107  .1542
 4  Mouse T1 Ig kappa chain V region, differentiated gene     Coding                384  .0057  .0760
 5  Mouse hemoglobin β minor gene                             Coding                441  .0133  .1440
 6  Goat hemoglobin β chain pseudogene                        "Coding"              462  .0088  .0987
 7  Rat prolactin precursor gene                              Coding                678  .0078  .0733
 8  Mouse Ig kappa chain V region germline gene MOPC41        Coding                351  .0091  .0866
 9  Chicken ovalbumin mRNA fragments                          Coding               1158  .0096  .0862
10  Mouse Ig γ chain C region germline gene                   Coding               1365  .0128  .0812
11  Mouse Ig γ2B chain C region germline gene (version 1)     Coding               1008  .0143  .0839
12  Yeast actin gene                                          Coding               1125  .0127  .0411
13  Human insulin precursor gene                              Coding                330  .0659  .1468
14  Rat proinsulin 2 precursor gene                           Coding                330  .0455  .0945

Eukaryotic Genes without Introns
15  Human interferon α-2 precursor gene                       Coding                564  .0009  .1815
16  Human choriogonadotropin α chain precursor mRNA           Coding                348  .0040  .0471
17  Human interferon β precursor mRNA                         Coding                561  .0113  .1044
18  Yeast glyceraldehyde-3-phosphate dehydrogenase            Coding                999  .0052  .0406
19  Rat somatotropin precursor mRNA                           Coding                648  .0175  .0989
20  Sea urchin histone H2A gene                               Coding                372  .0145  .0544
21  Rat pancreatic α amylase mRNA                             Coding               1509  .0196  .0597
22  Rat proinsulin 1 precursor gene                           Coding                330  .0396  .1098
Table 2B
D1 and D2 for Coding Regions Only

 #  Organism                                                  Region         # of Bases  D1     D2

Eukaryotic Genes without Introns (Cont.)
23  Human somatotropin precursor mRNA                         Coding                759  .0226  .0600
24  Bovine parathyroid hormone precursor gene                 Coding                345  .0358  .0943
25  Sea urchin histone H2B gene                               Coding                369  .0314  .0633
26  Yeast iso-1-cytochrome c gene                             Coding                327  .0319  .0542
27  Sea urchin histone H3 gene                                Coding                408  .0398  .0520
28  Human corticotropin-lipotropin precursor gene             Coding                663  .1225  .0295

Eukaryotic Viral Sequences
29  Polyoma strain A2                                         Entire genome        5292  .0023  .0412
30  Influenza A/PR/8/34 matrix protein gene                   Coding                756  .0084  .0463
31  Potato spindle tuber viroid                               Entire                359  .0203  .1047
32  Hepatitis B virus                                         Entire genome        3182  .0087  .0276
33  Moloney murine sarcoma virus/SRC region                   Coding               1224  .0098  .0302
34  SV40 wild type                                            Entire genome        5226  .0254  .0699
35  BK virus strain MM                                        Entire               4963  .0291  .0707
36  Influenza A/UDORN/72 segment 8 (NS1 & NS2 protein)        Coding                835  .0241  .0497
37  Influenza A/Vic/3/75 hemagglutinin gene                   Coding               1701  .0226  .0440
38  Influenza A/NT/60/68/29C hemagglutinin gene               Coding               1713  .0199  .0384
39  Tobacco mosaic coat protein gene                          Coding                477  .0133  .0125

Prokaryotic (E. coli) Sequences
40  RPL K, RPL A, RPL J, RPL L genes and RPOB gene fragment   Coding               2541  .0028  .0446
41  TRP C, TRP B, TRP A genes                                 Coding               3351  .0049  .0375
42  RNA polymerase β chain fragment                           Coding               2874  .0052  .0340
43  Dihydrofolate reductase gene                              Coding                477  .0063  .0386
44  Chloramphenicol acetyltransferase gene, transposon TN9    Coding                657  .0077  .0257
45  Lactose permease gene                                     Coding               1251  .0263  .0293
Table 2C
D1 and D2 for Coding Regions Only

 #  Organism                                                  Region         # of Bases  D1     D2

Bacteriophage Sequences
46  T7 protein gene 0.3                                       Coding                351  .0078  .0547
47  MS2                                                       Entire genome        3569  .0015  .0038
48  λ Cro, cII and O protein genes                            Coding               1386  .0110  .0262
49  λ Int protein coding region                               Coding               1068  .0240  .0314
50  φX174                                                     Entire genome        5386  .0154  .0175
51  T7 protein genes 1.1, 1.2                                 Coding                381  .0178  .0157
52  λ N gene                                                  Coding                321  .0569  .0324
53  FD                                                        Entire genome        6908  .0361  .0157

Mitochondrial Sequences
54  Yeast: cytochrome oxidase polypeptide III                 Coding                807  .1377  .0415
55  Yeast: cytochrome oxidase polypeptide II                  Coding                753  .1610  .0348
56  Yeast: tRNA gene cluster fragments                        Coding                516  .1093  .0225
57  Yeast mitochondria cytochrome B gene                      Coding               1153  .1660  .0276
58  Human mito: cytochrome oxidase polypeptide II,            Coding                819  .0522  .0042
    Asp, Lys tRNA genes

Miscellaneous Sequences
59  Maize chloroplast 16S rRNA gene                           Coding               1490  .0245  .0075
the other coding sequences without introns below the diagonal.
The rat
insulin precursor 2 gene was not plotted, but, because of its high homology
with the human insulin precursor gene, it would also fall below the
diagonal.
The selection of genes without introns is quite broad, but the
selection of genes with introns is biased by the high proportion of globin and
immunoglobulin sequences, because at the time of this study they made up the
bulk of available exons. Nevertheless, a trend is apparent.
Figure 3 shows an analysis of the hemoglobin beta-major chain gene of
mouse. D1 (constraints toward nonuniform base composition) is the solid line,
D2 (constraints on the ordering of the bases) is the broken line, and R (the
Figure #1
D2 versus R from sequences on Tables 2A-C. Axis labels: R (ticks at .125
and .25) and D2 (tick at .2). Separate plot symbols mark eukaryotic coding
sequences, eukaryotic viral coding sequences, prokaryotic coding sequences,
bacteriophage coding sequences, and extra-chromosomal coding sequences.
Figure #2
D2 versus D1 from eukaryotic coding sequences on Tables 2A-B (excluding
the rat insulin precursor 2 gene and the human corticotropin-lipotropin
precursor gene). Axis labels: D2 (ticks at .1 and .2) and D1 (ticks at .025
and .05). Separate plot symbols mark coding sequences from genes with introns
and coding sequences from genes without introns.
Figure #3
D1 (solid) and D2 (dashed) for overlapping 120 base samples of the mouse
beta major chain gene for hemoglobin. The samples overlap by 100 bases, and
the values are plotted above the midpoint of each sample (see Methods). Each
large x-axis tick mark represents 100 bases, each y-axis tick mark represents
.1 bit. 1, coding region #1; 2, coding region #2; 3, coding region #3; A,
intron A; B, intron B.
total constraints) is the sum of the D1, D2 amplitudes. In the coding regions
and Intron A, D2 is significantly greater than D1. There is a large D1 peak
in Intron B occurring in the only region which shows little sequence
conservation between beta-globin genes (5), and D2 is significantly lower
(approximately .1 bit) than in the coding regions or Intron A. All the
beta-globin genes give a similar pattern, even in the poorly conserved Intron B.
Figures 4 and 5 show analyses of the human corticotropin-lipotropin precursor
gene and the bovine corticotropin-lipotropin precursor mRNA. These were the
only eukaryotic genes found to have D1 greater than D2 throughout the coding
regions. Starting at approximately base 325 of the human sequence, and base
275 of the bovine sequence, the D1, D2 patterns are very similar. Before this
point, the patterns are significantly dissimilar. Figure 6 is a dot matrix
diagram depicting the regions of homology between the two sequences. The
density of dots along the diagonal decreases in regions of lower homology. In
Figure #4
D1 (solid) and D2 (dashed) for overlapping 120 base samples of the human
corticotropin-lipotropin precursor gene. The samples overlap by 100 bases,
and the values are plotted above the midpoint of each sample (see Methods).
Each large x-axis tick mark represents 100 bases, each y-axis tick mark
represents .1 bit. P, region coding for precursor fragment; CT, region
coding for corticotropin; L, region coding for lipotropin.
the region where the D1, D2 patterns were dissimilar, there is little evident
homology. For the bovine corticotropin-lipotropin precursor mRNA (and the
corresponding regions of the human gene), the region coding for the
melanotropin-gamma, the first part of the region coding for the
lipotropin-beta, and the region immediately after the lipotropin-beta also
show little homology. However, examining figures 4 and 5, it can be seen that
the D1 pattern is as similar in these regions as in regions displaying strong
homology. Similar analyses of SV40 versus BK virus, and mouse salivary versus
pancreatic alpha-amylase genes, revealed the same general relationship of D1
(constraints toward nonuniform base composition) and subsequence conservation.
DISCUSSION
From figure 1, it could be seen that at any set level of R (the total
constraints), the extra-chromosomal coding sequences have a larger D1
Figure #5
D1 (solid) and D2 (dashed) for overlapping 120 base samples of the bovine
corticotropin-lipotropin precursor mRNA. The samples overlap by 100 bases,
and the values are plotted above the midpoint of each sample (see Methods).
Each large x-axis tick mark represents 100 bases, each y-axis tick mark
represents .1 bit. SS, region coding for the signal sequence; MG, region
coding for the melanotropin-gamma; CT, region coding for the corticotropin;
LB, region coding for the lipotropin-beta.
(constraints toward nonuniform base composition) and smaller D2 (constraints
on the ordering of the bases) than the prokaryotic, bacteriophage, viral and
eukaryotic coding sequences. From figure 2, it could be seen that at any set
level of R, the eukaryotic coding sequences without introns have a larger D1
and smaller D2 than those coding sequences (exons) which have introns. Is
there a similarity in the biological context of these observations?
The extra-chromosomal coding sequences have evolved in a static milieu;
the constraints on their sequences have been relatively constant as compared
with the constraints acting on the other groups of coding sequences.
Considering Gilbert's idea that exons represent functional units which may be
shuffled around to obtain new proteins (6), one would assume that the exons
have, on the average, been confronted with more frequent changes in
evolutionary constraints than coding regions without introns (7). Though the
Figure #6
Dot matrix diagram comparing the human corticotropin-lipotropin precursor
gene with the bovine corticotropin-lipotropin precursor mRNA (axes labeled
HUMAN and BOVINE). The density of dots along a -45° diagonal is correlated
with homology (see Methods). 4 consecutive base matches were required for a
dot to appear. Each tick mark represents 100 bases. SS, region coding for the
signal sequence; MG, region coding for the melanotropin-gamma; CT, region
coding for the corticotropin; LB, region coding for the lipotropin-beta; P,
region coding for the precursor fragment; L, region coding for the lipotropin.
eukaryotic coding sequences without introns have evolved under more varied
constraints than the extra-chromosomal coding sequences, their constraints
have, on the average, been less varied than those of the exons. These
observations suggest a relationship between the variability of the constraints
on a sequence throughout its evolutionary history, and the relative
proportions of its D1 and D2 measures.
The D1 and D2 measures detected relationships among groups of coding
sequences based on their general, underlying constraints. These measures also
detected relationships between corresponding subsequences of related genes.
Sequence homology is the property most often compared between corresponding
subsequences of related genes. D1 and D2 reflect more global, less specific
properties than sequence homology. If two subsequences have much homology
then their D1, D2 measures must be similar. If the subsequences have little
homology then their D1, D2 measures may or may not be similar. When two
subsequences have much homology, it is often assumed that their functional
constraints are similar. In the examples given of the beta-globin genes, the
corticotropin-lipotropin sequences, the alpha-amylase sequences, and the SV40,
BK virus sequences, the D1, D2 measures detected a similar pattern of
underlying constraints in subsequences which had little homology. In all
these cases, D1 (constraints toward nonuniform base composition) was unusually
high relative to R (the total constraints). For example, in the beta-globin
genes, the coding regions have high D2 (constraints on the ordering of the
bases) relative to D1 and are well conserved among beta-globin genes. Intron
B has high D1 relative to D2 and is poorly conserved. Intron A has high D2
relative to D1 and is conserved to a far greater degree than Intron B (5).
Therefore, the D1, D2 measures of corresponding subsequences can reveal
similarities in underlying constraints. There also appears to be a
relationship between these measures of a subsequence and the conservation of
that subsequence in related genes.
Smithies et al. (8), in a detailed analysis of the human Gγ and Aγ fetal
globin genes, demonstrated a correlation between A,T base content and
subsequence conservation. They aligned the two sequences and measured
homology and A,T base content of corresponding subsequences. Using graphic
and rank correlation methods, they found a negative correlation between A,T
base content and subsequence conservation. They found the same relationship
in two mouse beta-globin genes and in two mouse immunoglobulin genes. There
is an essential relationship between the findings of Smithies et al. and our
observation of a correlation between high relative D1 (constraints toward
nonuniform base composition) and subsequence conservation. Only 2 of the 73
subsequences for which they presented the A,T base content were less than
approximately 45% A,T. Generally speaking, D1, the divergence from 25% each
base, will rise as A,T content increases beyond 50%. In our analyses, the D1
peaks of the globin genes were due to peaks in A and/or T; the D1 peaks of the
alpha-amylase genes were due to peaks in A at the expense of the other three
bases. The D1 peaks in the corticotropin-lipotropin sequences, however, were
due to peaks in G and/or C, a case in which subsequence conservation was
positively correlated with A,T content. The observations of Smithies et al.,
though a more detailed analysis, appear to be a specific case of our more
general observation. A more general interpretation for the results of
Smithies et al. is that it is D1, and not specifically A,T content, which is
negatively correlated with subsequence conservation.
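The link between A,T content and D1 can be made concrete with a small calculation. This sketch assumes, purely for illustration, that A and T are equally frequent and that G and C are equally frequent; real subsequences need not satisfy this (the alpha-amylase D1 peaks, for instance, were due to A alone). The function name `d1_from_at_fraction` is an assumption.

```python
from math import log2


def d1_from_at_fraction(at: float) -> float:
    """D1 (bits) for a composition with A = T = at/2 and G = C = (1 - at)/2."""
    probs = [at / 2, at / 2, (1 - at) / 2, (1 - at) / 2]
    h = sum(p * log2(1 / p) for p in probs if p > 0)
    return 2.0 - h  # D1 = Hmax - H


for at in (0.3, 0.5, 0.6, 0.7):
    print(at, round(d1_from_at_fraction(at), 4))
```

Under this assumption D1 is zero at 50% A,T and rises symmetrically on either side, so a G,C-rich region raises D1 just as an A,T-rich one does. This matches the interpretation that D1, rather than A,T content as such, is the quantity related to conservation.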
A further point should be made about regions with high R of primarily D1
origin. Biologists often use the degree of a subsequence's conservation
between related sequences as the criterion of the selective pressure on it.
However, lack of conservation does not always indicate lack of evolutionary
constraints. If, in fact, these high D1 regions were under essentially no
selective constraints, they would tend to accumulate random mutations and thus
their D2 and D1 would tend toward zero. In the examples given, R, primarily
because of increased D1, remained high, indicating that a strong constraint
exists on these poorly conserved subsequences. That constraint is toward a
highly nonuniform base composition, and not toward a particular ordering of
the bases.
Thus the magnitude and type of redundancy in a subsequence are related, in a
potentially predictive way, to its conservation in related sequences. These
measures, and the concepts behind them, may prove useful in understanding the
evolution and function of subsequences. To define more accurately and
quantitatively the relationship between the type of redundancy in a
subsequence and the degree of its conservation in related sequences, one would
have to analyze the degree of conservation in subsequences of a family of
genes (the beta-globins, for example) using an algorithm such as that of Smith
and Waterman (9) and calculate D1 and D2 on samples of 300 bases or greater.
CONCLUSION
Two simple statistical measures derived from Information Theory were used
to detect the general, underlying constraints acting on nucleic acid
sequences. At any set level of R (the total constraints), extra-chromosomal
coding sequences were distinguished from all other coding sequences by their
high D1 (the constraints toward nonuniform base composition), relative to
their D2 (constraints on the ordering of the bases). The D1, D2 measures of
the other sequences reveal a constant, low intensity of constraints toward a
nonuniform base composition and, particularly in the eukaryotes, an increasing
intensity of constraints on the ordering of the bases. For the eukaryotes,
the exons generally had the highest D2 (constraints on the ordering of the
bases) relative to D1.
The D1, D2 measures detected similar underlying constraints in poorly
conserved subsequences of related genes and revealed a relationship between
the base composition of a subsequence and its conservation in related genes.
ACKNOWLEDGEMENTS
D.L. would like to thank Michael Waterman, Temple Smith and John Miller
for their advice and useful conversations.
REFERENCES
1. Shannon, C. (1948) Bell System Tech. J., 27, 379-423, 623-656.
2. Gatlin, L. (1972) Information Theory and The Living System, Columbia
   University Press.
3. Maizel, J., Lenk, R. (1981) P.N.A.S., 78, 7665-7669.
4. Nucleic Acid Sequence Data Base, National Biomedical Research Foundation,
   Georgetown University Medical Center.
5. Efstratiadis, A., Posakony, J., Maniatis, T., Lawn, R., O'Connell, C.,
   Spritz, R., DeRiel, J., Forget, B., Weissman, S., Slightom, J., Blechl, A.,
   Smithies, O., Baralle, F., Shoulders, C., Proudfoot, N. (1980) Cell, 21,
   653-668.
6. Gilbert, W. (1978) Nature, 271, 501.
7. Gilbert, W. (1981) Science, 214, 1305-1312.
8. Smithies, O., Engels, W., Devereux, J., Slightom, J., Shen, S. (1981)
   Cell, 26, 345-353.
9. Smith, T.F., Waterman, M.S. (1981) J. Mol. Biol., 147, 195-197.