Volume 10 Number 1 1982
Nucleic Acids Research
Key for protein coding sequence identification: computer analysis of codon strategy
Francis Rodier, Jaime Gabarro-Arpa, Ricardo Ehrlich and Claude Reiss
Institut de Recherche en Biologie Moléculaire, CNRS and University of Paris VII, 2 Place Jussieu, F-75251 Paris Cedex 5, France
Received 29 September 1981
ABSTRACT
The signal qualifying an AUG or GUG as an initiator in mRNAs processed by E. coli ribosomes is not found to be a systematic, literal homology sequence. In contrast, stability analysis reveals that initiators always occur within nucleic acid domains of low stability, for which a high A/U content is observed. Since no amino-acid selection pressure can be detected at the N-termini of the proteins, the A/U enrichment results from a biased usage of the code degeneracy. A computer analysis is presented which allows easy detection of the codon strategy. N-terminal codons carry rather systematically A or U in third position, which suggests a mechanism for translation initiation and helps to detect protein coding sequences in sequenced DNA.
INTRODUCTION
The unambiguous identification of protein coding sequences (PCS) is a primary goal of many DNA sequencing projects. However, even for species with fully sequenced genomes and extensively explored genetic organization or biochemistry, like bacteriophages, some proteins have long escaped identification (1), and the existence of yet undiscovered proteins cannot be ruled out. In prokaryotes, a PCS start is believed to be universally encoded by an initiation codon (almost always AUG or GUG), whereas a nonsense triplet (UAA, UGA or UAG) ends the amino-acid coding. This does not mean, however, that any coding frame, i.e. a sequence of codons exactly bordered by an initiation codon on the 5' side and a nonsense codon on the 3' side, is a PCS. The frequency of such sequences exceeds by at least an order of magnitude the number of proteins really encoded. As an example, for φX174 (2), G4 (3) and fd (4) bacteriophage DNA, respectively 140, 118 and 103 coding frames starting with AUG, and 66, 70 and 76 starting with GUG, are found, while less than a dozen proteins have been identified for each of these phages (all but one initiated with GUG).
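The counting of coding frames described above can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes a linear DNA string (the phage genomes are circular), counts every ATG or GTG as a separate frame even when several share the same stop, and the function name `coding_frames` is hypothetical.

```python
STARTS = ("ATG", "GTG")          # DNA equivalents of AUG and GUG
STOPS = {"TAA", "TGA", "TAG"}    # DNA equivalents of UAA, UGA and UAG


def coding_frames(dna, min_codons=1):
    """List (start, end, initiator) for each frame opened by ATG or GTG
    and closed by the first in-frame nonsense triplet.

    Positions are 0-based; `end` is one past the last base of the stop.
    Assumption: the sequence is treated as linear, not circular.
    """
    dna = dna.upper().replace("U", "T")
    frames = []
    for start in range(len(dna) - 2):
        initiator = dna[start:start + 3]
        if initiator not in STARTS:
            continue
        # Walk codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOPS:
                if (pos - start) // 3 >= min_codons:
                    frames.append((start, pos + 3, initiator))
                break
    return frames


print(coding_frames("CCATGAAAGTGTTTTAACC"))
```

Splitting the result by the `initiator` field gives separate AUG and GUG counts of the kind quoted in the text; whether the published counts were obtained with exactly these conventions is not stated in the paper.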
As first observed by Shine and Dalgarno (5), PCS translated by the E. coli ribosomes are introduced, 5 to 15 bases upstream of the initiation codon, by a sequence of usually three, sometimes more, bases complementary to the 16S rRNA 3' end (...CCUCC...).
© IRL Press Limited, 1 Falconberg Court, London W1V 5FG, U.K.
This introductory "keyword" is believed to point to the next downstream AUG or GUG as an f-Met codon and could thus identify a PCS. Again, for φX174, G4 and fd DNA respectively, 156, 121 and 116 spots have been detected having a substring of at least three contiguous bases of the keyword complement, i.e. ...GGAGG...
Among these, a large number are located between 5 and 20 bases upstream from an AUG or GUG triplet: in φX174, G4 and fd respectively, 73, 57 and 30 substrings of three contiguous bases of the ...GGAGG... set precede, within less than 20 bases, a coding frame starting with AUG, and 27, 29 and 27 a coding frame starting with GUG. Thus, the Shine-Dalgarno keyword does not necessarily identify a close, downstream initiator. Also, in real mRNAs, the first AUG or GUG downstream from the Shine-Dalgarno keyword is not necessarily the initiation codon (examples are gene J of G4 or φX174 (2,3), gene C of G4 (3); the Qβ replication gene (6) has even two AUG between the AGG keyword and the f-Met codon). Furthermore, PCS simply lacking the keyword introduction are known (fd genes VII and IX for instance). The Shine-Dalgarno sequence is thus neither necessary nor sufficient for qualifying an initiator.
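The keyword search described above can be sketched in a few lines. This is an illustration under stated assumptions: the distance is measured from the first base of the matched triplet to the first base of the initiator, and the function name `keyword_sites` and its parameters are hypothetical. Any substring of GGAGG of length three or more contains one of the triplets GGA, GAG or AGG, so testing triplets is sufficient.

```python
KEYWORD = "GGAGG"  # complement of the 16S rRNA 3' end motif ...CCUCC...
# All 3-base substrings of the keyword: GGA, GAG, AGG.
TRIMERS = {KEYWORD[i:i + 3] for i in range(len(KEYWORD) - 2)}


def keyword_sites(dna, initiator_pos, min_gap=5, max_gap=20):
    """Return 0-based positions of keyword triplets that start between
    min_gap and max_gap bases upstream of the initiator at initiator_pos."""
    dna = dna.upper().replace("U", "T")
    lo = max(0, initiator_pos - max_gap)
    hi = initiator_pos - min_gap
    return [p for p in range(lo, hi + 1) if dna[p:p + 3] in TRIMERS]


# The ATG at position 15 is preceded by AGGAGG starting at position 3.
print(keyword_sites("TTTAGGAGGTTTTTTATGAAA", 15))
```

An empty result for a known initiator corresponds to the case noted in the text of a PCS lacking the keyword introduction.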
Despite the use of powerful algorithms (F. Rodier, manuscript in preparation), high-speed computer analysis did not reveal any other firm, homologous base sequence within 50 bases on both sides of a known PCS start (when necessary, the mRNA sequence was completed to these limits with the corresponding DNA sequence).
The indispensable signal required for qualifying an initiation codon is not found, in the systems studied, as a literal homology sequence near the 5' end of the mRNA. Since it is most likely that the signal is present on the mRNA, it must be a collective property common to various base sequences.
Collective properties of base sequences
The derivation of thermodynamic functions of mRNA from the sequence would be simple if the structure of this molecule could be unambiguously derived from its sequence. It has been shown that even for tRNA molecules, no set of base-pairing schemes is available which could predict the secondary (clover-leaf) structure in all cases. Prediction of the structure of large RNA molecules is even more dubious (7).
However, an upper limit to the thermodynamic functions of RNA can be set by the values of the corresponding functions of the DNA template. The knowledge of the DNA sequence allows the straightforward derivation of its thermodynamic functions (8). These depend strongly on the precise local sequence, as illustrated for instance by the DNA stability, a thermodynamic function which is of immediate biological relevance here (9). In a given environment, a natural DNA molecule behaves as a sequence of discrete stability domains. Each domain has a precise location; its base pairs change state (paired or unpaired) cooperatively, for a particular, typical value of a relative stability parameter p (see key to Fig. 1). Local base composition and distribution mainly control the domain size, boundaries and relative stability, but the action between non-neighbour bases (teleaction) also contributes. The domains can be observed experimentally by denaturation experiments, and are very accurately predicted from the sequence by Azbel's model (10,11). Considering the RNA segment transcribed from a given DNA stability domain (with relative stability p), an upper limit to stability is reached when this RNA segment is fully base-paired (neglecting the tertiary structure contribution); its relative stability is then kp, k being a constant whose value depends on the nature of the pairing partners (ribo-ribo or deoxy-ribo pairing).
Initiators are located within domains of low stability
Superposition of the calculated stability profiles (p vs. sequence) of φX174, G4 and fd DNA (in replicative form, RF) on the experimental PCS maps of these phages shows that they start translation within domains of strikingly low relative stability with respect to neighbouring domains (Fig. 1). Similar behaviour is observed for all prokaryotic and eukaryotic PCSs we have studied so far (sample of more than 50 PCSs); the only exceptions are genes E of both G4 and φX174 bacteriophages (see Fig. 1); the products of these genes have not been identified on SDS-polyacrylamide gels (12). Table I gives the addresses, lengths and relative stabilities of these domains in φX174, G4 and fd PCSs. A detailed account of the stability analysis at protein coding starts will be published elsewhere (see also (13)).
Codon selection: a statistical analysis
The observed low stability around an actual initiator obviously results from a local A and U enrichment, an observation already reported by Scherer et al. (14). This enrichment sets a bias on the base composition of the codons following the initiator. Since we were unable to detect a significant bias for any of the 20 amino-acids following the initiators in the low stability dips, a particular use of the genetic code degeneracy must occur, allowing A/U enrichment of codons without amino-acid selection.
Figure 1. Stability profile of phage fd DNA (RF), between the end of gene VII and the start of gene III (vertical axis: p, with ticks at 0.4, 0.5 and 0.6). The stability profile (9) is the plot of the DNA stability parameter p vs. DNA sequence. If a change of state (native/unwound) is induced in DNA by an agent X (temperature, pH, concentration of denaturing (bio)chemicals, mechanical torque on the molecule, ...), then $p = (X_m - X_{AT})/(X_{GC} - X_{AT})$, where $X_m$, $X_{AT}$ and $X_{GC}$ are the measures of X inducing the change of state of an actual DNA domain (see text), of a pure random poly dAT and of poly dGC respectively. The value of the environmental parameter (11) is W = 4.5. Possible ribosome attachment sites preceding an ATG or GTG by less than 20 bases are indicated on the profile.
The way this occurs can be analysed systematically by comparing, for each amino-acid coded within the low stability dip, the theoretical and observed frequencies for finding a given nucleotide at a given codon rank. We will give a general treatment for this analysis, although a priori an A/U bias on the first or second base of codons is likely to result in a small overall bias only (two amino-acids, Arg and Leu, have codons with first base A/U or C, and Ser, unique with second base choice, allows C or G only). This treatment is useful for other purposes as well (purine-pyrimidine bias for instance).
Let FRECOD(i,j,k) be a tridimensional matrix, i varying from 1 to 21 and standing for the 20 amino-acids and the terminator, j varying from 1 to 4
TABLE I

                               DNA domain (W = 4.5)
Species  Gene   Gene start   start    end     stability (p)   r       Phase
φX174    A          64          22     114        .40         13/16     1
         A*        580         563     699        .46         23/39     1
         B        1158        1057    1177        .41         4/5       3
         K        1520        1510    1579        .35         14/20     2
         C        1602        1599    1613        .35         2/2       3
         D        1859        1843    1881        .37         6/6       2
         E        2037        2034    2080        .52         4/13      3
         J        2317        2310    2333        .34         5/5       1
         F        2470        2457    2488        .35         5/5       1
         G        3864        3857    3914        .33         12/16     3
         H        4400        4347    4418        .43         5/5       2
G4       A          59          53      99        .41         5/5       1
         A*        698         532     744        .42         10/14     1
         B        1276        1243    1371        .38         22/31     3
         K        1638        1612    1746        .36         24/36     2
         C        1720        1717    1749        .33         4/9       3
         D        1976        1893    2088        .36         27/37     1
         E        2154        2151    2199        .51         4/14      2
         J        2477        2440    2492        .27         4/4       1
         F        2600        2583    2619        .32         5/6       1
         G        4020        4006    4066        .40         9/15      2
         H        4564        4548    4570        .41         1/2       3
fd       X         496         436     508        .35         3/3       3
         V         843         817     864        .30         6/6       2
         VII      1108        1100    1110        .48         -         3
         IX       1206        1200    1229        .37         5/6       2
         VIII     1301        1266    1316        .38         3/4       1
         III      1579        1557    1669        .34         31/35     3
         VI       2856        2816    2888        .38         7/9       2
         I        3197        3137    3227        .29         8/9       1
         IV       4221        4177    4253        .22         8/10      2
         II       6007        5914    6034        .41         6/3       3

r = number of codons having A or U in third position / number of codons located in the stability dip in which the initiator is located. Only nine r values survive for the ten fd genes; they are assigned here on the basis of the dip lengths, leaving gene VII (whose dip ends two bases after the initiator) without a value.
and standing for the nucleotide type (T, C, A, G), and k varying from 1 to 3 and standing for the nucleotide rank in the codon. Element (i,j,k) of FRECOD is the number of times a given nucleotide (j) is present at codon rank (k) for the amino-acid (i). It can be recognized that FRECOD is the matrix form of the genetic code as presented by Grantham (15). From FRECOD, the (theoretical) occurrence frequency of nucleotide $j_0$ at rank $k_0$ in the codons for a given amino-acid $i_0$ is

$$X(i_0, j_0, k_0) = \frac{\mathrm{FRECOD}(i_0, j_0, k_0)}{\sum_{j=1}^{4} \mathrm{FRECOD}(i_0, j, k_0)} \qquad (1)$$

where $\sum_{j=1}^{4} \mathrm{FRECOD}(i_0, j, k_0)$ is the total number of codons for amino-acid $i_0$.
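As an illustration of equation (1), FRECOD can be built directly from the standard genetic code. The sketch below is not the authors' FORTRAN program: it stores the matrix as a dictionary keyed by one-letter amino-acid symbols ("*" for the terminator) rather than by an integer index i, and the names `frecod` and `x_theoretical` are hypothetical.

```python
from itertools import product

BASES = "TCAG"  # nucleotide order j = 1..4 used in the text
# Standard genetic code, codons enumerated in TCAG order (third base fastest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def frecod():
    """table[aa][j][k]: number of codons of `aa` with base j at rank k."""
    table = {aa: [[0] * 3 for _ in BASES] for aa in set(AMINO)}
    for codon, aa in CODE.items():
        for k, base in enumerate(codon):
            table[aa][BASES.index(base)][k] += 1
    return table


def x_theoretical(table, aa, base, rank):
    """Equation (1): theoretical frequency of `base` at `rank` (1..3)."""
    k = rank - 1
    total = sum(table[aa][j][k] for j in range(4))  # number of codons of aa
    return table[aa][BASES.index(base)][k] / total


table = frecod()
print(x_theoretical(table, "L", "T", 1))  # Leu: 2 of 6 codons start with T
print(x_theoretical(table, "S", "G", 2))  # Ser: 2 of 6 codons have G second
```

The two printed values reproduce the remark in the text that Leu allows a choice of first base and Ser a choice of second base.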
We now consider an actual PCS. As experimental counterpart to FRECOD, we define a matrix FREKEX(i,j), i and j being defined as for FRECOD (for the sake of simplicity, index k will be omitted, the codon rank under consideration being defined beforehand). In the nucleic acid sequence, the codons for a given amino-acid $i_0$ each contribute to element $(i_0, j_0)$ of FREKEX by 1 or 0, depending on whether the nucleotide species $j_0$ is present or not at the specified rank of the codon. Thus element $(i_0, j_0)$ of FREKEX is the number of times nucleotide $j_0$ appears at a given rank in the codons for amino-acid $i_0$ in the nucleic acid sequence investigated. The total number of analyzed codons for amino-acid $i_0$ is

$$n(i_0) = \sum_{j=1}^{4} \mathrm{FREKEX}(i_0, j)$$

and the experimental frequency with which nucleotide $j_0$ is found in codons for amino-acid $i_0$ at the specified rank is

$$Y(i_0, j_0) = \mathrm{FREKEX}(i_0, j_0) / n(i_0)$$
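The experimental counterpart can be sketched in the same spirit. The code below is a hypothetical illustration, assuming the sequence is given in frame and contains only A, C, G, T or U; the names `frekex` and `y_experimental` are not from the paper.

```python
from itertools import product

BASES = "TCAG"  # nucleotide order j = 1..4 used in the text
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def frekex(seq, rank):
    """counts[aa][j]: times base j occurs at `rank` (1..3) in codons of aa.

    `seq` is read in frame from its first base; a trailing partial codon
    is ignored.
    """
    seq = seq.upper().replace("U", "T")
    counts = {}
    for p in range(0, len(seq) - 2, 3):
        codon = seq[p:p + 3]
        row = counts.setdefault(CODE[codon], [0, 0, 0, 0])
        row[BASES.index(codon[rank - 1])] += 1
    return counts


def y_experimental(counts, aa, base):
    """Y(i0, j0) = FREKEX(i0, j0) / n(i0)."""
    n = sum(counts[aa])  # n(i0): number of analyzed codons for aa
    return counts[aa][BASES.index(base)] / n


counts = frekex("AUGAAAAAGAAA", rank=3)   # Met, Lys, Lys, Lys
print(counts["K"], y_experimental(counts, "K", "A"))
```

Here two of the three Lys codons end in A, so Y for (Lys, A) at rank 3 is 2/3, to be compared with the theoretical value 1/2.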
Various comparisons of experimental and theoretical frequencies can be made by combinations of Y and X ratios. For instance,

$$Z(i_0, A+U, 3) = \frac{Y(i_0, 1) + Y(i_0, 3)}{X(i_0, 1, 3) + X(i_0, 3, 3)}$$

or

$$Z(i_0, G+C, 3) = \frac{Y(i_0, 2) + Y(i_0, 4)}{X(i_0, 2, 3) + X(i_0, 4, 3)}$$

compare the experimental and theoretical frequencies for having A/U or G/C in third position of the codons for amino-acid $i_0$ (Y dealing with rank 3). These frequencies can be summed over all values of $i_0$, to yield the global nucleotide selection pressure at a given codon rank in the sequence investigated.
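A global third-base pressure of this kind can be sketched as below. The paper does not spell out how the sum over amino-acids is weighted, so this example makes an explicit assumption: the observed number of codons ending in A/U (or G/C) is divided by the number expected if each amino-acid used its synonymous codons uniformly. The function name `third_base_pressure` is hypothetical, and the returned triple is only meant to mirror the quantities called RAT, RGC and P = RAT/RGC in the paper.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families: amino-acid -> list of its codons.
FAMILIES = {}
for codon, aa in CODE.items():
    FAMILIES.setdefault(aa, []).append(codon)


def third_base_pressure(seq):
    """Return (RAT, RGC, P) for an in-frame sequence (assumed definition)."""
    seq = seq.upper().replace("U", "T")
    obs_au = obs_gc = exp_au = exp_gc = 0.0
    for p in range(0, len(seq) - 2, 3):
        codon = seq[p:p + 3]
        family = FAMILIES[CODE[codon]]
        # Theoretical A/U frequency at rank 3: X(i,1,3) + X(i,3,3).
        x_au = sum(c[2] in "AT" for c in family) / len(family)
        obs_au += codon[2] in "AT"
        obs_gc += codon[2] in "GC"
        exp_au += x_au
        exp_gc += 1.0 - x_au
    rat = obs_au / exp_au if exp_au else float("nan")
    rgc = obs_gc / exp_gc if exp_gc else float("nan")
    p_ratio = rat / rgc if rgc else float("inf")
    return rat, rgc, p_ratio


print(third_base_pressure("AAAGGTGCCCTA"))  # Lys, Gly, Ala, Leu
```

In this toy sequence three of four codons end in A or T while two are expected, so RAT = 1.5, RGC = 0.5 and P = 3.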
Coming back to the A/U bias observed in the low stability domains enclosing initiators (and terminators), Table II shows that it is carried almost exclusively by the third base of the codons, except for protein E of φX174 (the product of this gene has not been identified by SDS-polyacrylamide electrophoresis (12)). Furthermore, detailed analysis of the N-terminal amino-acid codons of φX174, G4 and fd genes shows that the A/U bias in third position covers the first 8 codons on average (Table I, last column). This number compares to the number of codons (seven) protected from digestion during translation initiation (15).
DISCUSSION
The results presented raise two questions of interest: (i) is the low stability observed around the initiator linked to the mechanism of translation start, and (ii) can this observation help to find proteins from the DNA sequence alone, without additional hints from biochemistry or genetics?
Because very little is known about the mechanism of translation initiation, one can only speculate on the role of stability around the initiator. A naive, mechanistic look suggests that this role may be important at two non-exclusive levels at least. It has been suggested in several
TABLE II

Third-base selection pressure for the N-terminal codons located within the initiator stability dip of φX174, G4 and fd genes. RAT and RGC are the ratios of the experimental to theoretical frequencies for having A or U, and G or C, in third position (contributed by all amino-acids). P = RAT/RGC. n is the number of N-terminal codons considered. K1 and E1 are the N-terminal, K2 and E2 the C-terminal parts of genes K and E respectively. W = 6.3.

The row and column alignment of this table is lost in the source. The values are listed below in the order in which they survive; "?" marks an illegible character.

Bacteriophage φX174, third base
Gene and RAT: A 1.353; A* 1.166; B 1.225; C 1.234; D 1.380; (no label) .741; E1 .667; (label illegible) .800; F 1.172; (no label) 1.323; H 1.4?0; (no label) 1.416; J (no value); (no label) 1.407
RGC and P (column order mixed): 2.126, .636, 1.408, .828, .746, 1.643, .757, 1.629, .587, 2.351, 1.269, .584, 1.333, .500, 1.214, .659, .783, 1.498, .635, 2.019, .462, 3.229, .565, 2.507, .560, 2.035, 2.513
n: 67, 79, 69, 35, 64, 53, 24, 29, 26, 60, 50, 29, 23

Bacteriophage G4, third base
A       1.376   .596   2.309
A       1.304   .677   1.927
B       1.225   1.627  .753
C       .884    1.106  1.252
        1.375   .603   2.283
D       1.224   .750   1.633
E       .947    1.051  1.109
                .575   1.919
G       1.295
H       1.185   .802   1.479
J       1.484   .483   3.074
        1.333   .651   2.048
        1.375   .609   2.259
n: 101, 134, 92, 90, 82, 31, 58, 28

Bacteriophage fd, third base
                .085   21.17
                .914   1.182
V       1.500   .375   4.000
VI      1.324   .621   2.132
VII     1.385   .54?   2.538
VII     1.081   .914   1.182
IX      1.360   2.234  .509
X       1.500   .450   3.333
        1.823
I
III     1.081
n: 50, 36, 12, 42, 12, 36, 24, 28, 80, 30, 44, 26
places that, as mRNA is synthesized, it could temporarily remain linked to the DNA template, and thus form part of a double (mRNA-coding DNA strand) or triple helix, from which it has to be removed before processing by the ribosome. If the stability of this helix were too high near the 5' end of the message, the removal of mRNA could be difficult or even impossible, thus disabling initiation of translation. The low local stability observed near initiators may thus be required to free mRNA from the template.
Furthermore, or alternatively, in order to have the free mRNA molecule easily threaded into the ribosome assembly, the 5' terminus of mRNA should be devoid of extensive, stable internal structure. This could occur either by preventing the bases at the 5' terminus from entering extensive hairpin structures (i.e. extended stretches complementary to the 5' end sequence are not found in the rest of the mRNA), or by allowing only for short hairpins of low stability in the 5' terminus of mRNA. The first scheme, though not impossible, might involve highly complex control mechanisms, whereas the second could simply be met if the 5' terminus of mRNA were enriched in A and U, which is actually observed.
In these schemes, 16S rRNA could be an active helper in preparing the 5' end of mRNA for processing by the ribosome. Interaction with mRNA at the (facultative) ribosome attachment site could remove the 5' end of the message from the template (RNA homoduplex is more stable than heteroduplex (16)), and/or avoid, or break when present, weak secondary messenger structure (in an A/U rich environment, the attachment site of mRNA is rich in G; thus the stability of even a short duplex with 16S rRNA is likely to exceed that of local, A/U rich intramessenger structures). As a consequence, the translation level could be controlled by the balance between the stabilities of heteroduplex or hairpins involving the 5' end of the message, and the stability of the 16S rRNA-mRNA duplex around the Shine-Dalgarno sequence; this is observed in several cases where translation level data are available (in the lac operon for instance, unpublished results).
The high A/U content around the initiator could also serve other purposes (one could think of initiation factors having a high affinity for A/U rich sequences, or isoaccepting tRNA selection, etc.). Due to the absence of experimental evidence, all these hypotheses remain, however, speculative.
Can the findings presented here help to detect proteins simply from the genome sequence? To answer the question, it must first be assessed whether all AUG or GUG, located in a stability dip and followed by a set of codons with rather frequent selection of A or U in third position, are initiators. RNAs which are not translated (E. coli 16S rRNA or viroid RNA for instance) have AUG or GUG located always within nucleic acid domains of stability exceeding p = .55 (results not shown). Although not a proof, this argument suggests strongly that initiator qualification depends indeed on local stability.
We have selected in the genomes of φX174 and G4 those AUG or GUG which satisfy the following conditions: (i) they should be located within a section of an mRNA for which hairpin stability is beyond .45; (ii) a majority of the N-terminal codons located in the same stability dip as the f-Met codon should have A/U in third position; (iii) they should be associated with a ribosome attachment site, i.e. a substring of at least three consecutive bases of GGAGG located 5 to 20 bases upstream. The following "practical" conditions were added: the reading frame introduced by the initiator candidate should extend over 10 codons at least, should be located in a transcribed part of the genome, and should not be on the third reading
TABLE III

Hypothetical proteins in the genomes of φX174 and G4 (see text). * stands for GUG initiators. Sequence numbering according to reference (2) for φX174 and (3) for G4. The reading frame of gene A has been taken as phase number 1 in both cases.

The column headers of this table are largely illegible in the source. The headers "Site" (address of the ribosome attachment site), "Codons" (length of the frame) and "Gene" (established gene overlapped) are inferred from the values. A cell has been corrected only where the name, the text or the frame arithmetic confirms it; "?" marks an illegible character or cell.

                         DNA domain
Name      Start   End    start   end    stability (p)   Site   Codons  Phase  Gene
φ346       346    1?03    316    405        .42          337    411      1    A, in phase with A
φ350       350     404    316    405        .42          337     11      2    A
φ512       512     581    496    563        .42          600     23      2    -
φ1753     1753    1798   1747   1774        .37         1738     15      1    -
φ*2007    2007    2310   1811   1991        .42         1?86    101      3    E
φ2184     2184    2310   2163      ?        .43         2170     42      3    E
φ2725     2725    3751   2649   2881        .41         2714    342      1    F
φ2732     2732    2855   2649      ?        .41         2714     41      2    -
φ*2735    2735    2855   2649   2811        .41         2729     40      2    -
φ4868     4868    5384   4840   4928        .40         4856    172      2    H
φ4908     4908    5151   4840   4921        .40         4898     81      3    -
G5?4       5?4     ?99    460    744        .42          5?2     43      1
G822       822    1056    808    129        .42          813     78      2    -
G1837     1837    1971   1824   1842        .37            ?     41      3    C
G*2097    2097    2442   18?3   2090        .40            ?    115      2    E
G*2274    2274    2442   2201   2255        .42         226?     56      2    E
G2301     2301    2442   2301   2343        .41         2279     47      2
G2871     2871    2985   2793   2873        .44         2851      ?      2    E
G3497     3497       ?   3809   3842        .37         3493    128      1
G4709     4709    5009   4687   4732        .47         4638    100      1
G5135     5135    5360   5033   5169        .41         5114     75      1    ?
G5140     5140    5575   5033   5169        .41         5114    145      3    H
frame in case two are already utilized by established PCS. The hypothetical proteins satisfying these conditions are listed in Table III.
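The selection conditions can be combined into a single filter, sketched below. This is a hypothetical illustration, not the authors' program. It assumes that the stability dip (its boundaries and relative stability p) has already been obtained from a separate stability calculation, that condition (i) means p below the 0.45 threshold, that the initiator codon itself is counted among the N-terminal codons, and it leaves out the conditions on transcription and on the third reading frame. All names (`Dip`, `qualifies`, ...) are illustrative.

```python
from dataclasses import dataclass

STOPS = {"TAA", "TGA", "TAG"}
TRIMERS = {"GGA", "GAG", "AGG"}  # 3-base substrings of GGAGG


@dataclass
class Dip:
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    p: float     # relative stability, supplied by the stability analysis


def frame_codons(dna, init):
    """Codons from the initiator up to the first in-frame stop (0 if none)."""
    for pos in range(init + 3, len(dna) - 2, 3):
        if dna[pos:pos + 3] in STOPS:
            return (pos - init) // 3
    return 0


def au_third_ratio(dna, init, dip):
    """(codons ending in A/T, codons lying wholly in the dip) from `init`."""
    codons = [dna[p:p + 3] for p in range(init, min(dip.end, len(dna)) - 2, 3)]
    return sum(c[2] in "AT" for c in codons), len(codons)


def has_attachment_site(dna, init, min_gap=5, max_gap=20):
    """True if a keyword triplet starts min_gap..max_gap bases upstream."""
    window = range(max(0, init - max_gap), init - min_gap + 1)
    return any(dna[p:p + 3] in TRIMERS for p in window)


def qualifies(dna, init, dip, max_p=0.45, min_codons=10):
    dna = dna.upper().replace("U", "T")
    if dna[init:init + 3] not in ("ATG", "GTG"):
        return False
    if not (dip.start <= init < dip.end and dip.p < max_p):   # condition (i)
        return False
    au, n = au_third_ratio(dna, init, dip)
    return (2 * au > n                                        # condition (ii)
            and has_attachment_site(dna, init)                # condition (iii)
            and frame_codons(dna, init) >= min_codons)        # practical


dna = "TTAGGAGGTTTTTT" + "ATG" + "AAAGCTGAATTACCTAATGTAAAAGATCAA" + "TAA"
print(qualifies(dna, 14, Dip(start=10, end=44, p=0.35)))
```

The pair returned by `au_third_ratio` corresponds to the ratio r defined under Table I, under the counting assumptions stated above.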
We notice that the average kp value for the putative gene starts is higher than for established genes; some hypothetical proteins of both φX174 and G4 are similar in size and location (φ*2007 and G*2097, which both include gene E, φ2184 and G2301, φ2732 and G2871, φ4908 and G5135).
We have shown recently that the local stability also characterizes prokaryotic promoters (17), allowing a precise mechanism for transcript initiation to be described (18). We believe that stability homologies, i.e. cooperative behaviour of bases, play in general an important role in gene organization, expression and regulation and are elements of a "genetic signal code" (19), for which the codon strategy deciphering program presented here is useful.
To whom all correspondence should be addressed.
REFERENCES
1. F. Sanger, G.M. Air, B.G. Barrel, N.L. Brown, A.R. Coulson, J.C. Fiddes, C.A. Hutchinson, P.M. Slocombe, M. Smith, Nature 265, 657 (1977)
2. F. Sanger, A.R. Coulson, T. Friedmann, G.M. Air, B.G. Barrel, N.L. Brown, J.C. Fiddes, C.A. Hutchinson III, P.M. Slocombe, M. Smith, J. Mol. Biol. 125, 225 (1978)
3. G.N. Godson, B.G. Barrel, R. Staden, J.C. Fiddes, Nature 276, 236 (1978)
4. E. Beck, R. Sommer, E.A. Auerswald, Ch. Kurz, B. Zink, G. Osterburg, H. Schaller, K. Sugisaki, T. Okamoto, M. Takanami, Nucleic Acids Res. 5, 4495 (1978)
5. J. Shine, L. Dalgarno, Proc. Natl. Acad. Sci. USA 71, 1342 (1974)
6. G. Pieczenik, P. Model, H.D. Robertson, J. Mol. Biol. 90, 191 (1974)
7. J. Ninio, Biochimie 61, 1133 (1979)
8. M.Ya. Azbel, Biopolymers 19, 61 (1980)
9. J. Gabarro-Arpa, R. Ehrlich, F. Rodier, C. Reiss, in DNA recombination, interactions and repair, S. Zadrazil and L. Sponar, eds., Pergamon, 211-221 (1980)
10. C. Reiss, J. Gabarro-Arpa, in Prog. Mol. and Subcell. Biol., F.E. Hahn, ed., Springer-Verlag, 5, 1-30 (1977)
11. F. Michel, J. Gabarro-Arpa, Biopolymers, to appear; see also J. Gabarro-Arpa, P. Tougard, C. Reiss, Nature 280, 515 (1979)
12. G.N. Godson, J.C. Fiddes, B.G. Barrel, F. Sanger, in Single Stranded DNA Phages, D.T. Denhardt, D. Dressler, D.S. Ray, eds., Cold Spring Harbor Lab., 51 (1978)
13. F. Rodier, J. Gabarro-Arpa, R. Ehrlich, C. Reiss, C.R. Acad. Sci. Paris 291D, 199 (1980)
14. G.F.E. Scherer, M.D. Walkinshaw, S. Arnott, D.J. Morre, Nucleic Acids Res. 8, 3895 (1980)
15. R. Grantham, FEBS Letters 95, 1 (1978)
16. M. Riley, B. Mailing, M.J. Chamberlin, J. Mol. Biol. 20, 359 (1966)
17. R. Ehrlich, M. Marin, J. Gabarro-Arpa, F. Rodier, B. Schmitt, C. Reiss, C.R. Acad. Sci. Paris 291D, 5 (1980)
18. R. Ehrlich, M. Marin, J. Gabarro-Arpa, F. Rodier, B. Schmitt, C. Reiss, C.R. Acad. Sci. Paris 291D, 177 (1980)
19. R. Ehrlich, F. Rodier, J. Gabarro-Arpa, M. Marin, B. Schmitt and C. Reiss, C.R. Acad. Sci. Paris, Séance 25 Mai 1981, in press.
APPENDIX
Input data structure

Card          Variable    Content                                                    Format
CARD 1        NAME        label                                                      A80
CARD 2        DECI( )     logical values                                             5A1
CARD 3        L           number of stretches to be studied
CARD 4        NOM( )      name for the stretch to be studied                         A80
CARD 5        NBASE( )    position of the codon base rank (1,2,3) to be analyzed
CARD 6        NSTAR1, NFIN1, NBORN1, NSTAR2, NFIN2, NBORN2
              NSTAR1      address of the beginning of the stability dip enclosing the initiator
              NFIN1       address of the end of the stability dip enclosing the initiator
              NBORN1      address of the initiator (AUG or GUG) of the coding sequence
              NSTAR2      address of the beginning of the stability dip enclosing the terminator
              NFIN2       address of the end of the stability dip enclosing the terminator
              NBORN2      address of the terminator
CARD 7        ADN( )      DNA or RNA sequence (in the form A, T, G, C)               A80
CARD 8, 9...              continues the sequence until the sequence ends

Cards 4, 5 and 6 are repeated L times (one for each coding sequence).
Printout structure
(for each gene to be analyzed)
1. Identification label
2. Indication of codon rank selected for the analysis
3. Name and comment of the DNA sequence under study
4. Start and end of the stability domain in which the initiator (or terminator) is located
5. Gene start address
6. A, T, G, C percentage per codon
7. a) Print of the messenger sequence
   b) Print of the amino-acid sequence
   c) Print of physical characteristics of each amino-acid (ACI, BAS, NGU, APO)
8. Print of statistical tables for each amino-acid and for the whole sequence.
Programs were written in FORTRAN77, and they are implemented on a UNIVAC 1100 computer.