Volume 10 Number 1 1982 Nucleic Acids Research

Key for protein coding sequence identification: computer analysis of codon strategy

Francis Rodier, Jaime Gabarro-Arpa, Ricardo Ehrlich and Claude Reiss

Institut de Recherche en Biologie Moléculaire, CNRS and University of Paris VII, 2 Place Jussieu, F-75251 Paris Cedex 5, France

Received 29 September 1981

ABSTRACT

The signal qualifying an AUG or GUG as an initiator in mRNAs processed by E. coli ribosomes is not found to be a systematic, literal homology sequence. In contrast, stability analysis reveals that initiators always occur within nucleic acid domains of low stability, for which a high A/U content is observed. Since no amino-acid selection pressure can be detected at the N-termini of the proteins, the A/U enrichment results from a biased usage of the code degeneracy. A computer analysis is presented which allows easy detection of the codon strategy. N-terminal codons rather systematically carry A or U in third position, which suggests a mechanism for translation initiation and helps to detect protein coding sequences in sequenced DNA.

INTRODUCTION

The unambiguous identification of protein coding sequences (PCS) is a primary goal of many DNA sequencing projects. However, even for species with fully sequenced genomes and extensively explored genetic organization or biochemistry, like bacteriophages, some proteins have long escaped identification (1), and the existence of yet undiscovered proteins cannot be ruled out.

In prokaryotes, a PCS start is believed to be universally encoded by an initiation codon (almost always AUG or GUG), whereas a nonsense triplet (UAA, UGA or UAG) ends the amino-acid coding. This does not mean, however, that any coding frame, i.e.
a sequence of codons exactly bordered by an initiation codon on the 5' side and a nonsense codon on the 3' end, is a PCS. The frequency of such sequences exceeds by at least an order of magnitude the number of proteins really encoded. As an example, for φX174 (2), G4 (3) and fd (4) bacteriophage DNA, respectively 140, 118 and 103 coding frames starting with AUG, and 66, 70 and 76 starting with GUG are found, while fewer than a dozen proteins have been identified for each of these phages (all but one initiated with GUG).

© IRL Press Limited, 1 Falconberg Court, London W1V 5FG, U.K.

As first observed by Shine and Dalgarno (5), PCS translated by the E. coli ribosomes are introduced, 5 to 15 bases upstream of the initiation codon, by a sequence of usually three, sometimes more, bases complementary to the 16S rRNA 3' end (...CCUCC...). This introductory "keyword" is believed to point to the next downstream AUG or GUG as an f-Met codon and could thus identify a PCS. Again, for φX174, G4 and fd DNA respectively, 156, 121 and 116 spots have been detected that carry a substring of at least three contiguous bases of the keyword complement, i.e. ...GGAGG... Among these, a large number are located between 5 and 20 bases upstream from an AUG or GUG triplet: in φX174, G4 and fd respectively, 73, 57 and 30 substrings of three contiguous bases of the ...GGAGG... set precede, within less than 20 bases, a coding frame starting with AUG, and 27, 29 and 27 a coding frame starting with GUG. Thus, the Shine-Dalgarno keyword does not necessarily identify a close, downstream initiator.
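The two counts discussed above (coding frames, and keyword substrings upstream of a start codon) can be illustrated with a minimal Python sketch. It is not the authors' program: the function names `coding_frames` and `has_sd_site` are hypothetical, the sequence is written in DNA letters (ATG/GTG for AUG/GUG), only one strand is scanned, and the upstream distance is assumed to be measured from the first base of the keyword substring to the first base of the start codon.

```python
STARTS = ("ATG", "GTG")          # DNA spelling of AUG and GUG
STOPS = ("TAA", "TGA", "TAG")    # DNA spelling of UAA, UGA and UAG
# Every substring of GGAGG with at least three bases begins with one of these.
SD_TRIPLETS = ("GGA", "GAG", "AGG")


def coding_frames(dna):
    """Return (start, end) for each start codon followed in frame by a stop codon."""
    dna = dna.upper()
    frames = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] not in STARTS:
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOPS:
                frames.append((start, pos + 3))
                break
    return frames


def has_sd_site(dna, start, min_dist=5, max_dist=20):
    """True if a GGAGG substring of 3+ bases begins min_dist..max_dist bases upstream."""
    dna = dna.upper()
    for dist in range(min_dist, max_dist + 1):
        site = start - dist
        if site >= 0 and dna[site:site + 3] in SD_TRIPLETS:
            return True
    return False
```

Applied to a whole phage genome, `coding_frames` would list every candidate of the kind counted in the text, and filtering that list with `has_sd_site` would give the subset preceded by a keyword substring; both remain far larger than the set of real genes, which is the point made above.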
Also, in real mRNAs, the first AUG or GUG downstream from the Shine-Dalgarno keyword is not necessarily the initiation codon (examples are gene J of G4 or φX174 (2,3) and gene C of G4 (3); the Qβ replication gene (6) even has two AUGs between the AGG keyword and the f-Met codon). Furthermore, PCS simply lacking the keyword introduction are known (fd genes VII and IX, for instance). The Shine-Dalgarno sequence is thus neither necessary nor sufficient for qualifying an initiator.

Despite the use of powerful algorithms (F. Rodier, manuscript in preparation), high-speed computer analysis did not reveal any other firm, homologous base sequence within 50 bases on both sides of a known PCS start (when necessary, the mRNA sequence was completed to these limits with the corresponding DNA sequence). The indispensable signal required for qualifying an initiation codon is not found, in the systems studied, as a literal homology sequence near the 5' end of the mRNA. Since it is most likely that the signal is present on the mRNA, it must be a collective property common to various base sequences.

Collective properties of base sequences

The derivation of thermodynamic functions of mRNA from the sequence would be simple if the structure of this molecule could be unambiguously derived from its sequence. It has been shown that even for tRNA molecules, no set of base-pairing schemes is available which could predict the secondary (clover-leaf) structure in all cases. Prediction of the structure of large RNA molecules is even more dubious (7). However, an upper limit to the thermodynamic functions of RNA can be set by the values of the corresponding functions of the DNA template. The knowledge of the DNA sequence allows the straightforward derivation of its thermodynamic functions (8).
These depend strongly on the precise local sequence, as illustrated for instance by the DNA stability, a thermodynamic function which is of immediate biological relevance here (9). In a given environment, a natural DNA molecule behaves as a sequence of discrete stability domains. Each domain has a precise location; its base pairs change state (paired or unpaired) cooperatively, for a particular, typical value of a relative stability parameter p (see key to Fig. 1). Local base composition and distribution mainly control the domain size, boundaries and relative stability, but the interaction between non-neighbouring bases (teleaction) also contributes. The domains can be observed experimentally by denaturation experiments, and are very accurately predicted from the sequence by Azbel's model (10,11). Considering the RNA segment transcribed from a given DNA stability domain (with relative stability p), an upper limit to stability is reached when this RNA segment is fully base-paired (neglecting the contribution of tertiary structure); its relative stability is then kp, k being a constant whose value depends on the nature of the pairing partners (ribo-ribo or deoxy-ribo pairing).

Initiators are located within domains of low stability

Superposition of the calculated stability profiles (p vs. sequence) of φX174, G4 and fd DNA (in replicative form, RF) on the experimental PCS maps of these phages shows that they start translation within domains of strikingly low relative stability with respect to neighbouring domains (Fig. 1). Similar behaviour is observed for all prokaryotic and eukaryotic PCSs we have studied so far (a sample of more than 50 PCSs); the only exceptions are genes E of both G4 and φX174 bacteriophages (see Fig. 1); the products of these genes have not been identified on SDS-polyacrylamide gels (12).
Table I gives the addresses, lengths and relative stabilities of these domains in φX174, G4 and fd PCSs. A detailed account of the stability analysis at protein coding starts will be published elsewhere (see also (13)).

Codon selection: a statistical analysis

The observed low stability around an actual initiator obviously results from a local A and U enrichment, an observation already reported by Scherer et al. (14). This enrichment sets a bias on the base composition of the codons following the initiator. Since we were unable to detect a significant bias for any of the 20 amino-acids following the initiators in the low stability dips, a particular use of the genetic code degeneracy must occur, allowing A/U enrichment of codons without amino-acid selection.

Figure 1. Stability profile of phage fd DNA (RF), between the end of gene VII and the start of gene III. The stability profile (9) is the plot of the DNA stability parameter p vs. DNA sequence. If a change of state (native/unwound) is induced in DNA by an agent X (temperature, pH, concentration of denaturing (bio)chemicals, mechanical torque on the molecule, ...), then $p = (X_m - X_{AT})/(X_{GC} - X_{AT})$, with $X_m$, $X_{AT}$ and $X_{GC}$ being the measure of X inducing the change of state of an actual DNA domain (see text), a pure random poly dAT and poly dGC respectively. The value of the environmental parameter (11) is W = 4.5. Possible ribosome attachment sites preceding an ATG or GTG by less than 20 bases are indicated.

The way this occurs can be analysed systematically by comparing, for each amino-acid coded within the low stability dip, the theoretical and observed frequencies for finding a given nucleotide at a given codon rank.
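As an illustration of the quantity at stake, the following minimal Python sketch counts how many codons of a stability dip carry A or U in third position, which is the ratio r reported in Table I. The function name `third_position_au_ratio` is hypothetical, and two conventions are assumptions rather than statements from the paper: addresses are 0-based, and the count runs from the initiator (included) to the end of the dip, keeping only codons that lie entirely inside it.

```python
def third_position_au_ratio(seq, initiator, dip_end):
    """Return (codons with A/U in third position, codons counted) inside the dip.

    `initiator` is the 0-based address of the first base of the start codon;
    `dip_end` is the 0-based address just past the last base of the dip.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA spelling
    hits = 0
    total = 0
    # Step codon by codon; a codon is counted only if all three bases are in the dip.
    for pos in range(initiator, dip_end - 2, 3):
        total += 1
        if seq[pos + 2] in "AT":
            hits += 1
    return hits, total
```

For example, `third_position_au_ratio("AUGAAACCGUUU", 0, 12)` counts the codons AUG, AAA, CCG and UUU and finds two of the four ending in A or U.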
We will give a general treatment for this analysis, although a priori an A/U bias on the first or second base of codons is likely to result in only a small overall bias (two amino-acids, Arg and Leu, have codons with first base A/U or C, and Ser, unique in having a choice of second base, allows C or G only). This treatment is useful for other purposes as well (purine-pyrimidine bias, for instance).

TABLE I

Species  Gene   Gene    DNA domain (W = 4.5)             r       Phase
                start   start   end     stability (p)
φX174    A        64      22     114    .40              13/16   1
         A*      580     563     699    .46              23/39   1
         B      1158    1057    1177    .41              4/5     3
         K      1520    1510    1579    .35              14/20   2
         C      1602    1599    1613    .35              2/2     3
         D      1859    1843    1881    .37              6/6     2
         E      2037    2034    2080    .52              4/13    3
         J      2317    2310    2333    .34              5/5     1
         F      2470    2457    2488    .35              5/5     1
         G      3864    3857    3914    .33              12/16   3
         H      4400    4347    4418    .43              5/5     2
G4       A        59      53      99    .41              5/5     1
         A*      698     532     744    .42              10/14   1
         B      1276    1243    1371    .38              22/31   3
         K      1638    1612    1746    .36              24/36   2
         C      1720    1717    1749    .33              4/9     3
         D      1976    1893    2088    .36              27/37   1
         E      2154    2151    2199    .51              4/14    2
         J      2477    2440    2492    .27              4/4     1
         F      2600    2583    2619    .32              5/6     1
         G      4020    4006    4066    .40              9/15    2
         H      4564    4548    4570    .41              1/2     3
fd       X       496     436     508    .35                      3
         V       843     817     864    .30                      2
         VII    1108    1100    1110    .48                      3
         IX     1206    1200    1229    .37                      2
         VIII   1301    1266    1316    .38                      1
         III    1579    1557    1669    .34                      3
         VI     2856    2816    2888    .38                      2
         I      3197    3137    3227    .29                      1
         IV     4221    4177    4253    .22                      2
         II     6007    5914    6034    .41                      3

r = number of codons having A or U in third position / number of codons located in the stability dip in which the initiator is located. For fd, only nine r values are legible for ten genes, so they are listed in printed order rather than assigned to rows: 3/3, 6/6, 5/6, 3/4, 31/35, 7/9, 8/9, 8/10, 6/3.

Let FRECOD(i, j, k) be a three-dimensional matrix, i varying from 1 to 21 and standing for the 20 amino-acids and the terminator, j varying from 1 to 4 and standing for the nucleotide type (T, C, A, G), and k varying from 1 to 3 and standing for the nucleotide rank in the codon. Element (i, j,
k) of FRECOD is the number of times a given nucleotide (j) is present at codon rank (k) for the amino-acid (i). It can be recognized that FRECOD is the matrix form of the genetic code as presented by Grantham (15). From FRECOD, the (theoretical) occurrence frequency of nucleotide $j_0$ at rank $k_0$ of a codon for amino-acid $i_0$ is

$$X(i_0, j_0, k_0) = \frac{\mathrm{FRECOD}(i_0, j_0, k_0)}{\sum_{j=1}^{4} \mathrm{FRECOD}(i_0, j, k_0)}$$

where $\sum_{j=1}^{4} \mathrm{FRECOD}(i_0, j, k_0)$ is the total number of codons for amino-acid $i_0$.

We now consider an actual PCS. As experimental counterpart to FRECOD, we define a matrix FREKEX(i, j), i and j being defined as for FRECOD (for the sake of simplicity, index k will be omitted, the codon rank under consideration being defined beforehand). In the nucleic acid sequence, the codons for a given amino-acid $i_0$ each contribute 1 or 0 to element $(i_0, j_0)$ of FREKEX, depending on whether the nucleotide species $j_0$ is present or not at the specified rank of the codon. Thus element $(i_0, j_0)$ of FREKEX is the number of times nucleotide $j_0$ appears at a given rank in the codons for amino-acid $i_0$ in the nucleic acid sequence investigated. The total number of analyzed codons for amino-acid $i_0$ is

$$n(i_0) = \sum_{j_0=1}^{4} \mathrm{FREKEX}(i_0, j_0)$$

and the experimental frequency with which nucleotide $j_0$ is found in codons for amino-acid $i_0$ at the specified rank is

$$Y(i_0, j_0) = \frac{\mathrm{FREKEX}(i_0, j_0)}{n(i_0)}$$

Various comparisons of experimental and theoretical frequencies can be made by combinations of Y and X ratios. For instance, with the nucleotide order (T, C, A, G) for j,

$$Z(i_0, A{+}U, 3) = \frac{Y(i_0, 1) + Y(i_0, 3)}{X(i_0, 1, 3) + X(i_0, 3, 3)}$$

or

$$Z(i_0, G{+}C, 3) = \frac{Y(i_0, 2) + Y(i_0, 4)}{X(i_0, 2, 3) + X(i_0, 4, 3)}$$

compare the experimental and theoretical frequencies for having A/U or G/C in third position of the codons for amino-acid $i_0$ (Y dealing with rank 3). These frequencies can be summed over all values of $i_0$, to yield the global nucleotide selection pressure at a given codon rank in the sequence investigated.
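The FRECOD and FREKEX matrices and the Z ratio defined above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' FORTRAN 77 program: the standard genetic code is assumed, amino-acids are indexed by one-letter symbols (`*` for the terminator) instead of integers 1 to 21, ranks are passed as 1 to 3 as in the text, and the function names `frecod`, `frekex` and `z_ratio` are hypothetical.

```python
from collections import defaultdict

BASES = "TCAG"  # nucleotide order j = 1..4 used in the text
# Standard genetic code, codons enumerated in TCAG order for each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def frecod():
    """table[aa][j][k]: number of codons of aa carrying base j at rank k (0-based)."""
    table = defaultdict(lambda: [[0] * 3 for _ in BASES])
    for codon, aa in CODE.items():
        for k, base in enumerate(codon):
            table[aa][BASES.index(base)][k] += 1
    return table


def frekex(codons, rank):
    """table[aa][j]: observed count of base j at the given rank (1..3) of aa codons."""
    table = defaultdict(lambda: [0] * 4)
    for codon in codons:
        table[CODE[codon]][BASES.index(codon[rank - 1])] += 1
    return table


def z_ratio(codons, aa, bases, rank):
    """Experimental over theoretical frequency of `bases` at `rank` for amino-acid aa."""
    theo = frecod()[aa]
    obs = frekex(codons, rank)[aa]
    n_theo = sum(theo[j][rank - 1] for j in range(4))  # codons coding for aa
    n_obs = sum(obs)                                   # n(i0) in the text
    if n_obs == 0:
        return None  # amino-acid absent from the analysed stretch
    x = sum(theo[BASES.index(b)][rank - 1] for b in bases) / n_theo
    y = sum(obs[BASES.index(b)] for b in bases) / n_obs
    return y / x if x else None
```

As a usage example, leucine has six codons of which three end in A or T, so X = 0.5; in a stretch containing the leucine codons CTT, TTA and CTG, two of three end in A or T, so Y = 2/3 and `z_ratio(["CTT", "TTA", "CTG"], "L", "AT", 3)` evaluates to about 1.33, indicating A/U enrichment in third position.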
Coming back to the A/U bias observed in the low stability domains enclosing initiators (and terminators), Table II shows that it is carried almost exclusively by the third base of the codons, except for protein E of φX174 (the product of this gene has not been identified by SDS-polyacrylamide electrophoresis (12)). Furthermore, detailed analysis of the N-terminal amino-acid codons of φX174, G4 and fd genes shows that the A/U bias in third position covers the first 8 codons on average (Table I, last column). This number compares to the number of codons (seven) protected from digestion during translation initiation (15).

TABLE II. Third-base selection pressure for the N-terminal codons located within the initiator stability dip of φX174, G4 and fd genes. RAT and RGC are the ratios of the experimental to theoretical frequencies for having A or U, and G or C, in third position (contributed by all amino-acids). P = RAT/RGC. N is the number of N-terminal codons considered. K1 and E1 are the N-terminal, K2 and E2 the C-terminal parts of genes K and E respectively. W = 6.3.
[Table body: columns GENE, RAT, RGC, P and N for each of the three bacteriophages; the numerical entries are not legible in this copy.]

DISCUSSION

The results presented raise two questions of interest: (i) is the low stability observed around the initiator linked to the mechanism of translation start, and (ii) can this observation help to find proteins from the DNA sequence alone, without additional hints from biochemistry or genetics?

Because very little is known about the mechanism of translation initiation, one can only speculate on the role of stability around the initiator. A naive, mechanistic look suggests that this role may be important at two, non-exclusive levels at least. It has been suggested in several places that, as mRNA is synthesized, it could temporarily remain linked to the DNA template, and thus form part of a double (mRNA-coding DNA strand) or triple helix, from which it has to be removed before processing by the ribosome. If the stability of this helix were too strong near the 5' end of the message, the removal of mRNA could be difficult or even impossible, thus disabling initiation of translation. The low local stability observed near initiators may thus be required to free mRNA from the template. Furthermore, or alternatively, in order to have the free mRNA molecule easily threaded into the ribosome assembly, the 5' terminus of mRNA should be devoid of extensive, stable internal structure.
This could occur either by preventing the bases at the 5' terminus from entering extensive hairpin structures (i.e. extended stretches complementary to the 5' end sequence are not found in the rest of the mRNA), or by allowing only short hairpins of low stability in the 5' terminus of mRNA. The first scheme, though not impossible, might involve highly complex control mechanisms, whereas the second could simply be met if the 5' terminus of mRNA were enriched in A and U, which is actually observed.

In these schemes, 16S rRNA could be an active helper in preparing the 5' end of mRNA for processing by the ribosome. Interaction with mRNA at the (facultative) ribosome attachment site could remove the 5' end of the message from the template (RNA homoduplex is more stable than heteroduplex (16)), and/or avoid, or break when present, weak secondary messenger structure (in an A/U-rich environment, the attachment site of mRNA is rich in G; thus the stability of even a short duplex with 16S rRNA is likely to exceed that of local, A/U-rich intramessenger structures). As a consequence, the translation level could be controlled by the balance between the stabilities of heteroduplex or hairpins involving the 5' end of the message, and the stability of the 16S rRNA-mRNA duplex around the Shine-Dalgarno sequence; this is observed in several cases where translation level data are available (in the lac operon for instance, unpublished results).

The high A/U content around the initiator could also serve other purposes (one could think of initiation factors having a high affinity for A/U-rich sequences, or isoaccepting tRNA selection, etc.). In the absence of experimental evidence, all these hypotheses remain however speculative.

Can the findings presented here help to detect proteins simply from the genome sequence?
To answer the question, it must first be assessed whether all AUG or GUG, located in a stability dip and followed by a set of codons with rather frequent selection of A or U in third position, are initiators. RNAs which are not translated (E. coli 16S rRNA or viroid RNA, for instance) have AUG or GUG always located within nucleic acid domains of stability exceeding p = .55 (results not shown). Although not a proof, this argument strongly suggests that initiator qualification does indeed depend on local stability.

We have selected in the genomes of φX174 and G4 those AUG or GUG which satisfy the following conditions: (i) they should be located within a section of an mRNA for which hairpin stability is beyond .45; (ii) a majority of the N-terminal codons located in the same stability dip as the f-Met codon should have A/U in third position; (iii) they should be associated with a ribosome attachment site, i.e. a substring of at least three consecutive bases of GGAGG located 5 to 20 bases upstream. The following "practical" conditions were added: the reading frame introduced by the initiator candidate should extend over at least 10 codons, should be located in a transcribed part of the genome, and should not be on the third reading frame in case two are already utilized by established PCS. The hypothetical proteins satisfying these conditions are listed in Table III.

TABLE III. Hypothetical proteins in the genomes of φX174 and G4 (see text). Sequence numbering according to reference (2) for φX174 and (3) for G4. * stands for GUG initiators. The reading frame of gene A has been taken as phase number 1 in both cases.
[Table body (candidate initiator addresses, enclosing stability domains, stabilities and reading phases) is not legible in this copy.]

We notice that the average kp value for the putative gene starts is higher than for established genes; some hypothetical proteins of both φX174 and G4 are similar in size and location (φ*2007 and G*2097, which both include gene E, φ2184 and G2301, φ2732 and G2871, φ4908 and G5135).

We have shown recently that local stability also characterizes prokaryotic promoters (17), allowing a precise mechanism for transcript initiation to be described (18). We believe that stability homologies, i.e. cooperative behaviour of bases, play in general an important role in gene organization, expression and regulation and are elements of a "genetic signal code" (19), for which the codon strategy deciphering program presented here is useful.

To whom all correspondence should be addressed.

REFERENCES

1. F. Sanger, G.M. Air, B.G. Barrell, N.L. Brown, A.R. Coulson, J.C. Fiddes, C.A. Hutchison, P.M. Slocombe, M. Smith, Nature, 265, 687 (1977)
2. F. Sanger, A.R. Coulson, T. Friedmann, G.M. Air, B.G. Barrell, N.L. Brown, J.C. Fiddes, C.A. Hutchison III, P.M. Slocombe, M. Smith, J. Mol. Biol., 125, 225 (1978)
3. G.N.
Godson, B.G. Barrell, R. Staden, J.C. Fiddes, Nature, 276, 236 (1978)
4. E. Beck, R. Sommer, E.A. Auerswald, Ch. Kurz, B. Zink, G. Osterburg, H. Schaller, K. Sugisaki, T. Okamoto, M. Takanami, Nucleic Acids Res., 5, 4495 (1978)
5. J. Shine, L. Dalgarno, Proc. Natl. Acad. Sci. USA, 71, 1342 (1974)
6. G. Pieczenik, P. Model, H.D. Robertson, J. Mol. Biol., 90, 191 (1974)
7. J. Ninio, Biochimie, 61, 1133 (1979)
8. M.Ya. Azbel, Biopolymers, 19, 61 (1980)
9. J. Gabarro-Arpa, R. Ehrlich, F. Rodier, C. Reiss, in DNA recombination, interactions and repair, S. Zadrazil and L. Sponar, eds., Pergamon, 211-221 (1980)
10. C. Reiss, J. Gabarro-Arpa, in Prog. Mol. and Subcell. Biol., F.E. Hahn, ed., Springer-Verlag, 5, 1-30 (1977)
11. F. Michel, J. Gabarro-Arpa, Biopolymers, to appear; see also J. Gabarro-Arpa, P. Tougard, C. Reiss, Nature, 280, 515 (1979)
12. G.N. Godson, J.C. Fiddes, B.G. Barrell, F. Sanger, in Single Stranded DNA Phages, D.T. Denhardt, D. Dressler, D.S. Ray, eds., Cold Spring Harbor Lab., 51 (1978)
13. F. Rodier, J. Gabarro-Arpa, R. Ehrlich, C. Reiss, C.R. Acad. Sci. Paris, 291D, 199 (1980)
14. G.F.E. Scherer, M.D. Walkinshaw, S. Arnott, D.J. Morre, Nucleic Acids Res., 8, 3895 (1980)
15. R. Grantham, FEBS Letters, 95, 1 (1978)
16. M. Riley, B. Maling, M.J. Chamberlin, J. Mol. Biol., 20, 359 (1966)
17. R. Ehrlich, M. Marin, J. Gabarro-Arpa, F. Rodier, B. Schmitt, C. Reiss, C.R. Acad. Sci. Paris, 291D, 5 (1980)
18. R. Ehrlich, M. Marin, J. Gabarro-Arpa, F. Rodier, B. Schmitt, C. Reiss, C.R. Acad. Sci. Paris, 291D, 177 (1980)
19. R. Ehrlich, F. Rodier, J. Gabarro-Arpa, M. Marin, B. Schmitt and C. Reiss, C.R. Acad. Sci.
Paris, Séance 25 Mai 1981, in press.

APPENDIX

Input data structure

Card          Variable(s)    Content                                                     Format
CARD 1        NAME           label                                                       A80
CARD 2        DECI( )        logical values                                              5A1
CARD 3        L              number of stretches to be studied
CARD 4        NOM( )         name for the stretch to be studied                          A80
CARD 5        NBASE( )       position of the codon base rank (1, 2, 3) to be analyzed
CARD 6        NSTAR1, NFIN1, NBORN1, NSTAR2, NFIN2, NBORN2
CARD 7        ADN( )         DNA or RNA sequence (in the form A, T, G, C)                A80
CARD 8, 9...                 continues the sequence until sequence ending

Cards 4, 5 and 6 are repeated L times (one for each coding sequence).

The variables of card 6 are:
NSTAR1   address of the beginning of the stability dip enclosing the initiator
NFIN1    address of the end of the stability dip enclosing the initiator
NBORN1   address of the initiator (AUG or GUG) of the coding sequence
NSTAR2   address of the beginning of the stability dip enclosing the terminator
NFIN2    address of the end of the stability dip enclosing the terminator
NBORN2   address of the terminator

Printout structure (for each gene to be analyzed)

1. Identification label
2. Indication of codon rank selected for the analysis
3. Name and comment of the DNA sequence under study
4. Start and end of the stability domain in which the initiator (or terminator) is located
5. Gene start address
6. A, T, G, C percentage per codon
7. a) Print of the messenger sequence
   b) Print of the amino-acid sequence
   c) Print of physical characteristics of each amino-acid (ACI, BAS, NGU, APO)
8. Print of statistical tables for each amino-acid and for the whole sequence.

Programs were written in FORTRAN 77 language, and they are implemented on a UNIVAC 1100 computer.
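For readers who want to experiment with the card 6 record described above, a minimal Python sketch follows. It is not a port of the FORTRAN 77 program: the class name `StretchBounds` and the helper `initiator_in_dip` are hypothetical, and since the appendix gives no format for card 6, whitespace-separated integers are assumed.

```python
from dataclasses import dataclass


@dataclass
class StretchBounds:
    """The six addresses of card 6 (field names follow the appendix)."""
    nstar1: int  # beginning of the stability dip enclosing the initiator
    nfin1: int   # end of the stability dip enclosing the initiator
    nborn1: int  # address of the initiator (AUG or GUG)
    nstar2: int  # beginning of the stability dip enclosing the terminator
    nfin2: int   # end of the stability dip enclosing the terminator
    nborn2: int  # address of the terminator

    @classmethod
    def from_card(cls, line):
        # Assumed layout: six whitespace-separated integers on one card.
        values = [int(field) for field in line.split()]
        if len(values) != 6:
            raise ValueError("card 6 must hold exactly six addresses")
        return cls(*values)

    def initiator_in_dip(self):
        """Check that the initiator address lies inside its stability dip."""
        return self.nstar1 <= self.nborn1 <= self.nfin1
```

For instance, `StretchBounds.from_card("22 114 64 1500 1600 1577")` uses the initiator dip of φX174 gene A from Table I (22 to 114, initiator at 64); the three terminator addresses in this example are invented placeholders.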