353 Bioscience Reports I, 353-375 (1981) Printed in Great Britain DNA s e q u e n c i n g and g e n e s t r u c t u r e Nobel l e c t u r e , g D e c e m b e r 1980 Walter GILBERT Department of Biochemistry and Molecular Biology, Harvard University, The Biological Laboratories, Cambridge, Massachusetts 02138, U.S.A. When we w o r k out the s t r u c t u r e of DNA m o l e c u l e s , we e x a m i n e the f u n d a m e n t a l level t h a t underlies all p r o c e s s e s in living cells. DNA is the i n f o r m a t i o n s t o r e t h a t u l t i m a t e l y d i c t a t e s the s t r u c t u r e of e v e r y gene p r o d u c t , d e l i n e a t e s e v e r y p a r t of the o r g a n i s m . The order of t h e bases along DNA c o n t a i n s the c o m p l e t e set of i n s t r u c t i o n s t h a t m a k e up the g e n e t i c i n h e r i t a n c e . We do not know how to i n t e r p r e t t h o s e instructions; l i k e a c h i l d , we c a n s p e l l o u t t h e a l p h a b e t without u n d e r s t a n d i n g m o r e than a f e w words on a page. I c a m e to the c h e m i c a l DNA sequencing by a c c i d e n t . Since the middle s i x t i e s m y w o r k had f o c u s s e d on t h e c o n t r o l of g e n e s in b a c t e r i a , studying a s p e c i f i c gene product, a p r o t e i n r e p r e s s o r m a d e by the c o n t r o l g e n e f o r t h e l a c o p e r o n ( t h e c l u s t e r of g e n e s t h a t m e t a b o l i z e the sugar l a c t o s e ) . Benno M~iller-Hill and I had i s o l a t e d and c h a r a c t e r i z e d this m o l e c u l e during t h e l a t e s i x t i e s a n d d e m o n s t r a t e d t h a t this p r o t e i n bound to b a c t e r i a l DNA i m m e d i a t e l y at the b e g i n n i n g of t h e f i r s t g e n e of t h e t h r e e - g e n e cluster that this r e p r e s s o r c o n t r o l l e d (1,2). In the y e a r s since then, my l a b o r a t o r y had shown t h a t this protein a c t e d by p r e v e n t i n g t h e RNA p o l y m e r a s e f r o m copying the l a c operon genes into RNA. I had used the f a c t t h a t the l a c r e p r e s s o r bound to DNA a t a s p e c i f i c r e g i o n , t h e o p e r a t o r , to i s o l a t e the DNA of this region by digesting all of the r e s t of the DNA with D N a s e to l e a v e only a small f r a g m e n t bound to t h e r e p r e s s o r , protected f r o m t h e a c t i o n of the e n z y m e . This i s o l a t e d a t w e n t y f i v e - b a s e - p a i r f r a g m e n t of DNA out of the 3 million base pairs in the bacterial chromosome. In t h e e a r l y s e v e n t i e s , Allan M a x a m and I worked out the s e q u e n c e of this small f r a g m e n t (3) by copying this DNA into short f r a g m e n t s of RNA and using on t h e s e RNA copies the s e q u e n c i n g m e t h o d s t h a t had b e e n d e v e l o p e d by S a n g e r a n d h i s c o l l e a g u e s in the l a t e sixties. This was a laborious process t h a t took s e v e r a l years. When a s t u d e n t , N a n c y Maizels, then d e t e r m i f l e d t h e s e q u e n c e of t h e f i r s t 63 b a s e s of the m e s s e n g e r RNA for the l a c operon genes, we d i s c o v e r e d t h a t t h e l a c r e p r e s s o r b o u n d t o D N A immediately a f t e r the s t a r t of the m e s s e n g e r RNA ( 4 ) , in a region t h a t lies under t h e RNA p o l y m e r a s e when it binds to DNA to i n i t i a t e RNA synthesis. We c o n t i n u e d to c h a r a c t e r i z e the l a c o p e r a t o r by sequencing a n u m b e r of m u t a t i o n s ( o p e r a t o r c o n s t i t u t i v e m u t a t i o n s ) t h a t d a m a g e d the ability of the r e p r e s s o r to bind to DNA. We w a n t e d to d e t e r m i n e m o r e DNA s e q u e n c e in the region to d e f i n e t h e p o l y m e r a s e binding site and o t h e r e l e m e n t s involved in l a c g e n e control; h o w e v e r , t h a t s e q u e n c e w a s w o r k e d o u t in a n o t h e r l a b o r a t o r y by D i c k s o n , A b e l s o n , B a r n e s , and R e z n i k o f f ( 5 ) . Thus by the middle 9 Nobel Foundation 1981 354 GILBERT s e v e n t i e s I knew all the sequences t h a t I had been curious about, and my students (David Pribnow and 3ohn Majors) and I w e r e t r y i n g t o a n s w e r q u e s t i o n s a b o u t t h e i n t e r a c t i o n of the RNA p o l y m e r a s e and o t h e r control f a c t o r s with DNA. At this point, a n o t h e r line of e x p e r i m e n t s was opened up by a new suggestion. Andrei Mirzabekov c a m e to visit me in early 1975. The purpose of his visit was twofold; to describe e x p e r i m e n t s t h a t he had been doing using dimethyl s u l f a t e to m e t h y l a t e the guanines and t h e adenines in DNA and to urge me to do a similar e x p e r i m e n t with the l a c repressor. Dimethyl s u l f a t e m e t h y l a t e s the guanines uniquely a t t h e /r p o s i t i o n , w h i c h is exposed in the major groove of the DNA double helix, while it m e t h y l a t e s the adenines at the /r position which is e x p o s e d in the minor groove (Fig. i ) . Mirzabekov had used this p r o p e r t y to a t t e m p t to d e t e r m i n e the disposition o f ~ h i s t o n e s and o f c e r t a i n antibiotics on the DNA molecule bY observing the blocking of the i n c o r p o r a t i o n of r a d i o a t i v e m e t h y l groups onto t h e g u a n i n e s and adenines of bulk DNA. He urged me to use this groove s p e c i f i c i t y to learn something a b o u t the i n t e r a c t i o n of the 7ac r e p r e s s o r w i t h t h e lac operator. H o w e v e r , the amounts of 2ac o p e r a t o r available were e x t r e m e l y small, and t h e r e was no o b v i o u s way of e x a m i n i n g t h e protein sitting on DNA to ask which bases in the sequence the protein would p r o t e c t against a t t a c k by the dimethyl s u l f a t e r e a g e n t . It was not until a f t e r a second visit by Mirzabekov t h a t an idea finally e m e r g e d . He and I, and Allan M a x a m and 3 a y G r a l l a , h a d lunch together. D u r i n g our c o n v e r s a t i o n I had an i d e a :[or an e x p e r i m e n t , which u l t i m a t e l y underlies our s e q u e n c i n g m e t h o d . We k n e w we c o u l d o b t a i n a defined DNA f r a g m e n t , 55 base-pairs long, which c a r r i e d near its c e n t e r the region to which the l a c r e p r e s s o r bound. This fragment was made by cutting the DNA sequentially with two different r e s t r i c t i o n enzymes, each defining one end of the fragment (see Fig. 2). Secondly, I knew that at every base along the DNA at which methylation occurred, that base could be removed by heat. Furthermore, once that had happened, only a sugar would be left holding the DNA chain together, and that sugar could be hydrolysed, in principle, in alkali to break the DNA chain. I put these ideas together by conjecturing that i i we labelled one end of one strand of the DNA fragment with radioactive phosphate, we might determine the point of methylation by measuring the distance between the labelled end and the point of breakage. We could get such labelled DNA by isolating a DNA fragment (by length by e l e c t r o phoresis through polyacrylamide gels) made by c u t t i n g w i t h one restriction enzyme, labelling both ends of t h a t f r a g m e n t , and then c u t t i n g it again w i t h a second r e s t r i c t i o n enzyme to release two separable double-stranded fragments, each having a label at one end but not the other. Using polynucleotide kinase this procedure would introduce a radioactive label into the 5' end of one of the DNA strands of the fragment bearing the operator while leaving the other unlabelled (Fig. 2). If we then modified t h a t DNA w i t h d i m e t h y l sulfate so that only an occasional adenine or guanine would be methylated, heated, and cleaved the DNA with alkali at the point of depurination, we would release among other fragments a labelled fragment extending from the unique point of labelling to the f i r s t point of breakage. Fig. 3 shows this idea. Any fragments frem the DNA SEQUENCING AND GENE S T R U C T U R E 355 ,,oor Sugar HI CH3 o H ./H ~.N-H / / ' \\ / 0 Sugar T N'~~ \ 4"1 l N~ uoor CH 3 Fig. i. Methylated cytosine-guanine and thymineadenine base pairs. The top of the figure shows a cytosine-guanine base pair methylated at the N7 position of guanine. The bottom of the figure shows a thymine-adenine base pair methylated at the N3 position of adenine. The region above each of the base pairs is exposed in the major groove of DNA. The region below each of the base pairs lies between the sugar phosphate backbones in the minor groove of the DNA double helix. o t h e r s t r a n d w o u l d be u n l a b e l l e d , as w o u l d any f r a g m e n t s arising beyond t h e f i r s t p o i n t of b r e a k a g e . If we c o u l d s e p a r a t e ~t h e s e fragments by s i z e , as we could in principle by e l e c t r o p h o r e s i s on a polyacrylamide g e l , w e m i g h t be a b l e to a s s o c i a t e t h e l a b e l l e d f r a g m e n t s b a c k to the known s e q u e n c e and thus i d e n t i f y e a c h guanine and a d e n i n e in t h e o p e r a t o r t h a t h a d b e e n m o d i f i e d by d i m e t h y l sulfate. If we could do the m o d i f i c a t i o n in the p r e s e n c e of the Zac r e p r e s s o r p r o t e i n bound to the DNA f r a g m e n t , then if t h e r e p r e s s o r lay close to the N7 of a guanine, we would not m o d i f y the DNA a t t h a t base, and the c o r r e s p o n d i n g f r a g m e n t w o u l d n o t a p p e a r in t h e analytical pattern. 356 GILBERT Alu Alu Hpa Hae Fig. 2. P r o c e d u r e for obtaining a double-stranded DNA fragment u n i q u e l y l a b e l l e d at one end of one strand. The figure shows the restriction cuts for the enzymes Alul (target AG/CT) and Hpa (target C/CGG producing an uneven end) in the neighborhood of the 2ac operator. The 2ac repressor is shown b o u n d to the DNA. By cutting the DNA from this region first with the enzyme A2u, then labelling with radioactive p32 the 5' termini of both strands of the DNA with polynucleotide kinase, and then cutting in turn with the enzyme Hpa, we can isolate a DNA fragment that carries the binding site for the repressor uniquely labelled at one end of one strand. I set out to do this e x p e r i m e n t . Allan M a x a m m a d e t h e labelled DNA f r a g m e n t s , and I began to learn how to modify and to b r e a k the DNA. This involved analysing the r e l e a s e of the bases f r o m DNA and the breakage steps separately. Finally the experiment was put together. Fig. 4 shows the results: an a u t o r a d i o g r a m of the e l e c t r o p h o r e t i c p a t t e r n displays a series of bands e x t e n d i n g downward in size f r o m t h e f u l l - l e n g t h f r a g m e n t , e a c h caused by the c l e a v a g e of the DNA at an adenine or a guanine. The s a m e t r e a t m e n t of the DNA f r a g m e n t with d i m e t h y l s u l f a t e , now c a r r i e d out in the p r e s e n c e of the r e p r e s s o r , produced a similar p a t t e r n , e x c e p t t h a t s o m e of the bands w e r e missing (lane one versus lane two in Fig. #). The e x p e r i m e n t was c l e a r l y a success in t h a t the p r e s e n c e of the r e p r e s s o r b l o c k e d the a t t a c k by d i m e t h y l s u l f a t e on s o m e of the guanines and s o m e of the adenines in the o p e r a t o r (6). I hoped t h a t the size d i s c r i m i n a t i o n would be a c c u r a t e enough to p e r m i t the a s s i g n m e n t of e a c h band in the p a t t e r n to a s p e c i f i c base in t h e s e q u e n c e . This proved true b e c a u s e the spacing in the p a t t e r n , and the p r e s e n c e of light and dark bands, the dark bands c o r r e s p o n d i n g to guanines and the light ones to a d e n i n e s , w e r e s u f f i c i e n t l y c h a r a c t e r i s t i c to c o r r e l a t e the two. The guanines r e a c t a b o u t five t i m e s m o r e rapidly with d i m e t h y l s u l f a t e , w h i l e t h e m e t h y l a t e d adenines a r e r e l e a s e d f r o m DNA m o r e rapLdly than t h e guanines during heating; the shift in i n t e n s i t i e s as a f u n c t i o n DNA SEQUENCING AND GENE STRUCTURE 3.57 I I I I I I I I I I""II I I I I I I I I I I G GG G 171111111111111111111 G GG G Modify CH3 ~I~ i ili i i i i i i i i i I i i i I i i,,ii'i G GG ,•I•'ilIIII G l l l ' l l l ' l l l l l l l l G 11111111 G ~"II G I I II I II Displace I I I I I GG I| I II II'~'I I I | | G ~,'I G CH3 GG II'I I I G 11111 G Cleave and Separate G Fig. 3. Outline of procedure to produce fragments of DNA by breaking the DNA at guanines. Consider an end-labelled strand of DNA. We modify an occasional guanine by methylation with d i m e t h y l s u l f a t e . Heating the DNA will then displace that guanine from the DNA strand, leaving behind the bare sugar; cleaving the DNA with alkali will break the DNA at the missing guanine; the fragments are then separated by size, the actual size of the fragment (followed because it carries the radioactive label) being determined by the position of the modified guanine. ol t h e t i m e of h e a t t r e a t m e n t could be used to establish unambiguously which base was which originally. Furt herm ore, the gel pat t ern is so c l e a r , bands corresponding to fragments differing by one base were resolved. At this point it was clear that this technique could determine the adenines and guanines along DNA for distances of the order of 40 nucleotides. By determining the purines on one strand and the purines on the c o m p l e m e n t a r y strand (as Fig. 4 does), one has in principle a com pl e t e sequencing method. 358 GILBERT H a v i n g in h a n d a r e a c t i o n t h a t will d e t e r m i n e and distinguish adenines and guanines, could we find r e a c t i o n s t h a t would distinguish c y t o s i n e s and t h y m i n e s ? Allan M a x a m and I t u r n e d our a t t e n t i o n to this end. ( F i r s t we e x a m i n e d a s e c o n d b i n d i n g s i t e f o r t h e Z a c r e p r e s s o r t h a t lies a f e w hundred bases f u r t h e r along the DNA, under the I i r s t gene of the operon. This binding site has no physiological function. We could l o c a t e this binding site on a r e s t r i c t i o n f r a g m e n t by r e p e a t i n g the m e t h y l a t i o n - p r o t e c t i o n e x p e r i m e n t and identifying bases p r o t e c t e d by t h e l a c r e p r e s s o r . I used the m e t h y l a t i o n p a t t e r n to a t t e m p t to p r e d i c t the positions ol the adenines and the guanines in t h e u n k n o w n D N A s e q u e n c e ; Allan M a x a m then used the w a n d e r i n g spot sequencing m e t h o d of Sanger and his c o w o r k e r s to d e t e r m i n e the DNA s e q u e n c e of this region to v e r i f y t h a t we had m a d e a s u c c e s s f u l prediction.) Allan Maxam then went on to do the next p a r t of t h e development. We k n e w t h a t h y d r a z i n e would a t t a c k the c y t o s i n e s DNA S E Q U E N C I N G AND GENE S T R U C T U R E 359 Fig. 4. Methylation protection experiment with the 2ac repressor. The columns show the pattern of cleavage along each strand of the 53- to 55-base-long fragment bearing the 2ac operator. The second column from the left represents the DNA (labelled at the 5' end of the 53-base-long strand) treated by dimethyl sulfate, cleaved by heat and alkali. The dark bands correspond to breaks at guanines; the light bands are breaks at adenines. The second column of the figure reads from the bottom: -,-,AAA--,--,---,---,-AA .... A-A-AA-A-A-.... The first column shows this same double-stranded piece of DNA treated with dimethyl sulfate in the presence of the 2ac repressor. The repressor prevents the interaction of dimethyl sulfate with the guanine a third of the way up the pattern and blocks the reaction with two adenines in the upper third of the pattern. The bands corresponding to these two adenines represent fragments that differ in length by one base~ 30 and 31 long. The right-hand side of this pattern shows the same experiment done with the label at the end of the other DNA strand. The sequence at the far right would be: -,-G-GGAA--G-GAG-GGA-AA-AA---.... and t h y m i n e s in DNA and d a m a g e t h e m s u f f i c i e n t l y , or e l i m i n a t e t h e m to f o r m a h y d r a z o n e , so t h a t a f u r t h e r t r e a t m e n t of t h e D N A w i t h b e n z a i d e h y d e i o l l o w e d by alkali (or a t r e a t m e n t with an a m i n e ) , would c l e a v e at the d a m a g e d base. This soon g a v e us a similar p a t t e r n , but b r o k e the DNA at the c y t o s i n e s and t h y m i n e s w i t h o u t d i s c r i m i n a t i o n . Allan M a x a m then d i s c o v e r e d t h a t salt, o n e - m o l a r s a l t in t h e 1 5 - m o l a r h y d r a z i n e , a l t e r e d t h e r e a c t i o n to s u p p r e s s t h e r e a c t i v i t y of the t h y m i n e s . The two r e a c t i o n s t o g e t h e r then positioned and distinguished t h e t h y m i n e s a n d t h e c y t o s i n e s in a D N A s e q u e n c e . This last d i s c r i m i n a t i o n c o n c e p t u a l l y c o m p l e t e d the m e t h o d . To i m p r o v e t h e d i s c r i m i n a t i o n b e t w e e n the purines, and to provide r e d u n d a n t i n f o r m a tion which would s e r v e to m a k e the s e q u e n c i n g m o r e s e c u r e a g a i n s t e r r o r s , we u s e d t h e f a c t t h a t the m e t h y l a t e d adenosines d e p u r i n a t e m o r e rapidly, e s p e c i a l l y in acid, to r e l e a s e the a d e n o s i n e s p r e f e r e n tially and thus to obtain four r e a c t i o n s : one for A ' s , one p r e f e r e n t i a l for G ' s , one for C ' s and T ' s , and one for C ' s d e t e r m i n i n g t h e T ' s by difference. T h i s s t a g e of t h e w o r k w a s c o m p l e t e d within a f e w months. As the range of resolution on the gels was e x t e n d e d t o w a r d 100 b a s e s , the c l e a v a g e at t h e p y r i m i d i n e s was not s a t i s f a c t o r y ; t h e result of the i n c o m p l e t e c l e a v a g e w a s t h a t t h e l o n g e r f r a g m e n t s c o n t a i n e d a v a r i e t y of i n t e r n a l d a m a g e s and the p a t t e r n blurred out. A f t e r m a n y m o n t h s of s e a r c h i n g , an a n s w e r was f o u n d . A primary a m i n e , aniline, will displace the h y d r a z i n e p r o d u c t s and p r o d u c e a b e t a e l i m i n a t i o n t h a t r e l e a s e s t h e p h o s p h a t e f r o m t h e 3' p o s i t i o n on t h e sugar, but it will not r e l e a s e the o t h e r p h o s p h a t e , and the m o b i l i t y of a DNA f r a g m e n t with a blocked 3' p h o s p h a t e bearing a s u g a r - a n i l i n e residue is d i f f e r e n t f r o m t h e f r e e - p h o s p h a t e - e n d e d chains from 360 GILBERT the other reactions. A s e c o n d a r y a m i n e , p i p e r i d i n e , is far m o r e e f f e c t i v e and triggers both beta eliminations as well as eliminating all t h e b r e a k d o w n p r o d u c t s of t h e h y d r a z i n e r e a c t i o n f r o m the sugar. This r e a g e n t c o m p l e t e d the DNA sequencing techniques (7). Although the d e v e l o p m e n t of the t e c h n i q u e s continued for a n o t h e r nine months, they were distributed f r e e l y to other groups t h a t w a n t e d to use t h e m . Fig. 5 shows an actual sequencing p a t t e r n from the 1978 period, used in the work described in (8). F i g . 6 s h o w s t w o e x a m p l e s of t h e c h e m i s t r y (9). The logic behind the chemical m e t h o d is to divide the a t t a c k into two steps. In the first we use a r e a g e n t t h a t c a r r i e s the s p e c i f i c i t y , DNA SEQUENCING AND GENE S T R U C T U R E 361 Fig. 5. Actual sequencing pattern from the 1978 period. Products of four different chemical reactions, applied to a DNA fragment about 150 bases long~ are electrophoresed on a polyacrylamide gel; three loadings produce sets of patterns that have moved different distances down the gel. The four columns correspond to reactions that break the DNA: i) primarily at the adenines, 2) only at the guanines, 3) at the cytosines but not the thymines, and 4) at both the cytosines and thymines. The very shortest fragments are at the bottom right-hand side of the picture and the sequence is read up the gel recognizing first the band in the left-hand column corresponding to A, a band in the two right-hand columns corresponding to C, a band in the far right-hand column corresponding to T, a band in the left-hand column corresponding to A, and so forth. After reading up as far as possible, the sequence continues in the sets of bands at the left-hand side of the gel and then still further in the pattern in the center of the gel. From the original p h o t o g r a p h the sequence of the entire fragment can be read. The fragment is from the genomic DNA corresponding to the variable region of the lambda light chain of mouse immunoglobulin (8). but we l i m i t the e x t e n t of t h a t r e a c t i o n to only one base out of s e v e r a l hundred possible t a r g e t s in e a c h DNA f r a g m e n t . This p e r m i t s t h e r e a c t i o n to be used in the domain of g r e a t e s t s p e c i f i c i t y : only the v e r y initial s t a g e s of a c h e m i c a l r e a c t i o n a r e i n v o l v e d . The s e c o n d s t e p , t h e c l e a v a g e of t h e D N A s t r a n d , must be c o m p l e t e . Since the t a r g e t has a l r e a d y been distinguished f r o m the o t h e r bases along t h e DNA chain by the p r e l i m i n a r y d a m a g e , we can use vigorous, q u a n t i t a t i v e r e a c t i o n conditions. The result is a clean break, r e l e a s i n g a f r a g m e n t w i t h o u t hidden d a m a g e s , which is required if the m o b i l i t i e s of t h e f r a g m e n t s a r e to be v e r y closely c o r r e l a t e d so t h a t the bands will not blur. ( T h e s p e c i f i c i t y need be only a b o u t a f a c t o r of ten for t h e s e q u e n c e to be read u n a m b i g u o u s l y . ) Today, l a t e r d e v e l o p m e n t s of the t e c h n i q u e (9) have m o d i f i e d the guanine r e a c t i o n and r e p l a c e d the d i m e t h y l s u l f a t e adenosine r e a c t i o n with a d i r e c t d e p u r i n a t i o n r e a c t i o n t h a t r e l e a s e s both the adenines and guanines equally. These changes, and the i n t r o d u c t i o n of the very thin gels by S a n g e r ' s group (10), now m a k e it possible to read s e q u e n c e s out b e t w e e n 200-400 bases f r o m the point of labelling. The actual c h e m i c a l workup, the analysis on gels, and the a u t o r a d i o g r a p h y are the s h o r t p a r t of the process. The m a j o r t i m e spent in DNA sequencing is s p e n t in the p r e p a r a t i o n of the DNA f r a g m e n t s and on the e l e m e n t s of strategy. The speed of the sequencing c o m e s only in p a r t f r o m t h e a b i l i t y t o r e a d o f f q u i c k l y s e v e r a l h u n d r e d bases of D N A - at a glance. The m o r e i m p o r t a n t e l e m e n t is the linear p r e s e n t a t i o n of t h e p r o b l e m . R a t h e r than s e q u e n c e r a n d o m l y , one can begin at one end of a r e s t r i c t i o n m a p and m o v e r a t i o n a l l y through a g e n e - or c o n s t r u c t the r e s t r i c t i o n map as o n e goes. 362 GILBERT o ~H~ HN , , J ~ N \ o G 0 ,/ 0 HN IO1 5L-.-O_ / 0 0 0 ,~,,,~Hc=o II H2N N 5LO-- P--O--CH3 CH. >~ N--H I 0- O H N/I""NI~N ~ 5 - O - - PI - - O - - C ~ n _v 0- 0 I 0 I OH- -O~P=O I 0 0 I I - 0 - - P=O I 0 -O--P=O I 0 1~ 0 0 , 51--0-- IP--O- + 0 , II 5--O--P--O--CH3 OH H I I ~/"~ H,C--C--CH--CH--C--N~..~ CH3 H~ N ~ H c , . = O HzN , ~ N ~"'NH= -~(~,,--.-~ 9 - + o- O-0-- OH- ~ 0 I -O--P==O I 0 P--=O I 0 0 b HN~CH3 H ~ cH3 o 3 ?u r II 5--O--P~0--CH2 , 0 I "0--P--O I 0 0 o Sin--O--P--O-- CHE o+ ~. ,~ ", | H2CBC__CHmCH_C~N~ + -0 ! -O--P--O I 0 0! "O--PmO I 0 0 I -O--P~O I 0 9-o-,~-o~ O.- e/"-~ ~c-%2 01 "O--PmO I 0 ,~ ;; f II 5 - 0 - - .P--O--CH2 ' O" ~_~o 0 I -O~PmO I 0 " N-NH 2 DNA S E Q U E N C I N G AND GENE S T R U C T U R E 363 Fig. 6. Examples of the detailed chemistry involved in breaking the DNA. (a) The guanine breakage. The guanines are first methylated with dimethyl sulfate. The imidazole ring is opened by treatment with alkali ( d u r i n g the piperidine treatment). Piperidine displaces the base and then t r i g g e r s two b e t a eliminations that release both phosphates from the sugar and cleave the DNA strand leaving a 3' and a 5' phosphate. (b) The hydrazine attack on a thymine that breaks the DNA at the pyrimidines. T h e f i r s t l o n g s e q u e n c e was done by a g r a d u a t e student, Phillip F a r a b a u g h , who used the new t e c h n i q u e s to s e q u e n c e t h e gene for t h e 2ac repressor (11). The p r o t e i n s e q u e n c e of this gene p r o d u c t had been worked out in the e a r l y s e v e n t i e s by B e y r e u t h e r and his c o workers (12). S i n c e the a m i n o - a c i d s e q u e n c e was known, he could quickly (a f e w m o n t h s ) e s t a b l i s h t h e DNA s e q u e n c e . However, the DNA s e q u e n c e showed t h a t t h e r e w e r e e r r o r s in the p r o t e i n s e q u e n c e , two a m i n o acids dropped a t one place and e l e v e n at a n o t h e r . Since t h e p r o t e i n s e q u e n c e contains 360 a m i n o acids, he had to work out a gene of 1080 bases. DNA s e q u e n c i n g is f a s t e r and m o r e a c c u r a t e t h a n protein sequencing. T h e r e a s o n f o r t h i s is t h a t DNA is a linear i n f o r m a t i o n s t o r e . B e c a u s e the c h e m i s t r y of e a c h r e s t r i c t i o n f r a g m e n t is like any o t h e r - t h e y d i f f e r only in length - t h e r e is no p a r t i c u l a r reason for losing t r a c k of t h e m , e x c e p t for t h e v e r y s m a l l e s t . By s e q u e n c i n g across the joins b e t w e e n the f r a g m e n t s , one e s t a b l i s h e s an u n a m b i g u o u s order. P r o t e i n s , on the o t h e r hand, a r e strings of a m i n o acids used by N a t u r e to c r e a t e a wide v a r i e t y of c h e m i s t r i e s . When a p r o t e i n is f r a g m e n t e d , the fragments can exhibit quite different properties, s o m e of which m a y be unusually u n f o r t u n a t e in t e r m s of solubility or loss. T h e r e is no simple way of keeping a c c o u n t of t h e t o t a l c o n t e n t of a m i n o acids, or of the order of f r a g m e n t s , as t h e r e is for DNA, w h e r e t h e length of t h e r e s t r i c t i o n f r a g m e n t s can easily be measured. 3effrey Miller and his coworkers had done an extensive analysis of the appearance of mutations in the 2ac repressor gene. Three sites in the gene are hotspots, at which the mutation rate is some 10 times higher than at other sites. D N A sequencing showed that at each of these sites there was a modified base, a 5-methyl cytosine) in the sequence (13). (The chemical sequencing detects the presence of the 5-methyl cytosine directly, because the methyl group suppresses c o m p l e t e l y t h e r e a c t i v i t y of this base in the h y d r a z i n e r e a c t i o n . A b l a n k s p a c e a p p e a r s in t h e s e q u e n c e , but on t h e o t h e r s t r a n d is a guanine.) T h e h i g h m u t a t i o n r a t e is a t r a n s i t i o n t o a t h y m i n e . 5-Methyl c y t o s i n e o c c u r s at a low f r e q u e n c y in DNA" this o b s e r v a t i o n shows t h a t it is a m u t a g e n . What is the e x p l a n a t i o n ? D e a m i n a t i o n of c y t o s i n e to uracil o c c u r s n a t u r a l l y . If this o c c u r r e d in DNA it could lead to a t r a n s i t i o n ; h o w e v e r , it usually does not, s i n c e t h e r e is an e n z y m e t h a t scans DNA e x a m i n i n g it for deoxyuridine ( 1 4 ) . When it finds this base in DNA, m i s m a t c h e d or not, it b r e a k s t h e g l y c o s i d i c bond and r e m o v e s the uracil. This is then r e c o g n i z e d as a d e f e c t in 364 GILBERT D N A , a n d a n o t h e r group of e n z y m e s then r e p a i r the d e p y r i m i d i n a t e d spot. H o w e v e r , 5 - m e t h y l c y t o s i n e d e a m i n a t e s to t h y m i n e - a n a t u r a l c o m p o n e n t of DNA. On r e p a i r or r e s y n t h e s i s a t r a n s i t i o n will ensue. This whole a r g u m e n t explains why t h y m i n e is used in DNA - the e x t r a m e t h y l g r o u p s e r v e s to suppress the e f f e c t s of the n a t u r a l r a t e of deamination. To f i n d o u t how e a s y and how a c c u r a t e DNA sequencing was, I asked a student, G r e g o r S u t c l i f f e , to s e q u e n c e the ampicillin r e s i s t a n c e gene, the beta-lactamase gene, of Escherichia coli. This gene is c a r r i e d on a v a r i e t y of plasmids, including a small c o n s t r u c t e d plasmid, p B R 3 2 2 , in E. c o 2 / . All t h a t he knew a b o u t the protein was an a p p r o x i m a t e m o l e c u l a r weight, and t h a t a c e r t a i n r e s t r i c t i o n c u t on the plasmid inactivated t h a t gene. He had no previous e x p e r i e n c e with DNA sequencing when he s e t out to work out the s t r u c t u r e of DNA for this gene. A f t e r seven months he had worked out a b o u t 1000 bases of d o u b l e - s t r a n d e d DNA, sequencing one s t r a n d a n d t h e n sequencing the o t h e r for c o n f i r m a t i o n . The unique long reading f r a m e d e t e r m i n e d the s e q u e n c e of the protein p r o d u c t of this gene, a p r o t e i n of 286 r e s i d u e s ( 1 5 ) . We t h o u g h t t h a t t h e D N A s e q u e n c e was unambiguous. Luckily t h e r e was a v a i l a b l e , f r o m A m b l e r ' s l a b o r a t o r y , p a r t i a l s e q u e n c e i n f o r m a t i o n a b o u t the protein which had been o b t a i n e d as a result of s e v e r a l y e a r s ' work a t t e m p t i n g to develop a s e q u e n c e for the b e t a - l a c t a m a s e (16). This i n f o r m a t i o n , while not s u f f i c i e n t to d e t e r m i n e the protein s e q u e n c e d i r e c t l y , was a d e q u a t e to c o n f i r m t h a t t h e p r e d i c t i o n of t h e D N A sequencing was c o r r e c t . S u t c l i f f e then became very enthusiastic and s e q u e n c e d t h e r e s t of t h e p l a s m i d pBR322 during the n e x t six months, to finish his thesis. He s e q u e n c e d both strands of this ~ 3 6 2 - b a s e - p a i r - l o n g plasmid in o r d e r to c o n f i r m the s e q u e n c e (17). The c h e m i c a l sequencing is unambiguou% e x c e p t for an o c c a s i o n a l c h a r a c t e r i s t i c f e a t u r e in t h e D N A f r a g m e n t itself t h a t c a u s e s it to m o v e a n o m a l o u s l y during the gel e l e c t r o p h o r e s i s . As longer and longer s t r a n d s a r e being a n a l y s e d on the gel, a hairpin loop can f o r m at one end of the f r a g m e n t if the s e q u e n c e is s u f f i c i e n t l y self-complementary. As the f r a g m e n t a t i o n passes through this portion of the m o l e c u l e , the mobilities on the gel do not d e c r e a s e u n i f o r m l y as a function of length, but some of the m o l e c u l e s m o v e a b e r r a n t l y , a f e a t u r e c a l l e d c o m p r e s s i o n , b e c a u s e the bands on an a u t o r a d i o g r a p h b e c o m e close t o g e t h e r , or can o v e r l a p to c o n c e a l one or m o r e bases. T h i s r a r e f e a t u r e o c c u r s a b o u t o n c e e v e r y t h o u s a n d bases. It is resolved by sequencing the o p p o s i t e strand in the o t h e r d i r e c t i o n along the double-stranded m o l e c u l e ( o r t h e s a m e s t r a n d in the o p p o s i t e c h e m i c a l d i r e c t i o n ) b e c a u s e the hairpin will f o r m w h e n a d i f f e r e n t r e g i o n of t h e s e q u e n c e is exposed and the c o m p r e s s i o n f e a t u r e will o c c u r in a d i f f e r e n t place in the s e q u e n c e . If b o t h s t r a n d s of t h e DNA helix a r e s e q u e n c e d , the s e q u e n c e can be unambiguous. The Structure of Genes T h e f i r s t g e n e s to be s e q u e n c e d , t h o s e in b a c t e r i a , yielded an e x p e c t e d s t r u c t u r e : a contiguous series of codons lying upon t h e DNA b e t w e e n an initiation signal and one of the t e r m i n a t o r signals. B e f o r e the position at which the RNA copy will s t a r t , t h e r e lies a site for t h e R N A p o l y m e r a s e , i n t e r a c t i n g with the Pribnow box, a region of DNA S E Q U E N C I N G AND GENE S T R U C T U R E 365 s e q u e n c e homology lying one turn of the helix b e f o r e the initial base of t h e m e s s e n g e r RNA, and also with a n o t h e r r e g i o n of h o m o l o g y , thirty-five bases before the start. Thus one could u n d e r s t a n d the b a c t e r i a l gene in t e r m s of a binding s i t e for t h e RNA p o l y m e r a s e , and f u r t h e r binding sites for r e p r e s s o r s and a c t i v a t o r p r o t e i n s around and under the p o l y m e r a s e . A l t e r n a t i v e l y , the c o n t r o l on t r a n s c r i p t i o n could be e x e r c i s e d by a c o n t r o l of the t e r m i n a t i o n function: new proteins or an e l e g a n t t r a n s l a t i o n c o n t r o l (18) could d e t e r m i n e w h e t h e r or not the p o l y m e r a s e would r e a d past a s t o p signal into a new gene. When t h e f i r s t g e n e s f r o m v e r t e b r a t e s were transferred into bacteria by t h e r e c o m b i n a n t D N A t e c h n i q u e s a n d s e q u e n c e d , an entirely different structure emerged. The coding s e q u e n c e s for globin (19,20), for immunoglobulin (21), and for o v a l b u m i n (22) did not lie on t h e D N A a s a c o n t i n u o u s s e r i e s of c o d o n s b u t r a t h e r ' w e r e interrupted by long s t r e t c h e s of non-coding DNA. The d i s c o v e r y of RNA splicing in a d e n o v i r u s by S h a r p a n d his c o w o r k e r s ( 2 3 ) a n d Broker and R o b e r t s and t h e i r c o w o r k e r s (24) p a v e d t h e way for this new s t r u c t u r e . They had shown t h a t a f t e r the original t r a n s c r i p t i o n of D N A i n t o a l o n g RNA, regions of this RNA a r e spliced out: some s t r e t c h e s a r e excised and the r e m a i n i n g portions a r e fused t o g e t h e r by an as y e t u n d e f i n e d e n z y m a t i c process. The exons ( 2 5 ) , regions of t h e DNA t h a t will be e x p r e s s e d in m a t u r e m e s s a g e , a r e s e p a r a t e d f r o m e a c h o t h e r by introns, regions of DNA t h a t lie within the g e n e t i c e l e m e n t but whose t r a n s c r i p t s will be s p l i c e d o u t of t h e m e s s a g e . Fig. 7 shows this process: the original t r a n s c r i p t of a gene (now t h o u g h t of as a t r a n s c r i p t i o n unit) will u n d e r g o a s e r i e s of s p l i c e s b e f o r e being able to f u n c t i o n as a m a t u r e m e s s a g e in the c y t o p l a s m . Fig. 8 shows a f e w e x a m p l e s . V e r t e b r a t e genes can h a v e m a n y - 8, 15, e v e n 50 - e x o n s ( 2 9 , 3 0 ) , and the exons a r e for the m o s t p a r t s h o r t coding s t r e t c h e s s e p a r a t e d by hundreds to s e v e r a l t h o u s a n d s of b a s e pairs of intron DNA. The rapid sequencing has m e a n t t h a t we c a n w o r k o u t t h e D N A s e q u e n c e of a n y of t h e s e c o m p l e x g e n e structures. But can we u n d e r s t a n d t h e m ? The emerging generalization is t h a t p r o k a r y o t i c genes have c o n t i g u o u s coding s e q u e n c e s while the g e n e s for the highest e u k a r y o t e s a r e c h a r a c t e r i z e d by a c o m p l e x e x o n / i n t r o n s t r u c t u r e . As we m o v e up f r o m p r o k a r y o t e s , the s i m p l e s t e u k a r y o t e s , such as y e a s t s , h a v e f e w introns; f u r t h e r up the e v o l u t i o n a r y ladder the genes a r e m o r e broken up. ( Y e a s t m i t o c h o n d r i a have introns; a r e t h e y an e x c e p t i o n to this pattern?) Are we seeing t h e e m e r g e n c e of the i n t r o n - e x o n s t r u c t u r e r i s i n g to e v e r g r e a t e r d e g r e e s of c o m p l e x i t y as we m o v e up to the v e r t e b r a t e s , or t h e loss of p r e e x i s t i n g i n t r o n - e x o n s t r u c t u r e s as w e m o v e d o w n to the s i m p l e s t i n v e r t e b r a t e s and the p r o k a r y o t e s ? One view considers the splicing as an a d a p t a t i o n t h a t b e c o m e s "ever m o r e n e c e s s a r y in m o r e h i g h l y s t r u c t u r e d organisms. T h e o t h e r view considers t h e s p l i c i n g as l o s t if t h e o r g a n i s m m a k e s a c h o i c e t o s i m p l i f y and to r e p l i c a t e its DNA m o r e rapidly, to go through m o r e g e n e r a t i o n s in a s h o r t t i m e , and t h u s t o be u n d e r a s i g n i f i c a n t p r e s s u r e to r e s t r i c t its DNA c o n t e n t ( 3 1 ) . What role can this g e n e r a l i n t r o n - e x o n s t r u c t u r e play in the g e n e s of the higher o r g a n i s m s ? Although m o s t genes t h a t h a v e been studied h a v e this s t r u c t u r e , t h e r e a r e two n o t a b l e e x c e p t i o n s : t h e g e n e s f o r the histones and t h o s e for the i n t e r f e r o n s . This last d e m o n s t r a t e s t h a t 366 GILBERT Fig. 7. A t r a n s c r i p t i o n unit c o r r e s p o n d i n g to alternating exons and introns. The whole gene, a transcription unit, is copied into RNA terminating in a poly(A) tail. The regions corresponding to introns are spliced out leaving a messenger RNA made up of the three exons, the regions that are e x p r e s s e d in the mature message. t h e r e can be no a b s o l u t e l y e s s e n t i a l role t h a t the introns must play, t h e r e can be no a b s o l u t e n e e d f o r s p l i c i n g in o r d e r to e x p r e s s a p r o t e i n in m a m m a l i a n cells. Although t h e r e is a line of e x p e r i m e n t s t h a t shows t h a t s o m e m e s s e n g e r s m u s t have at l e a s t a s i n g l e s p l i c e m a d e b e f o r e t h e y c a n be e x p r e s s e d , t h e r e is no e v i d e n c e t h a t the g r e a t m u l t i p l i c i t y of splices a r e needed. T h e r e is a pair of genes for i n s u l i n in t h e r a t , t h a t d i f f e r in t h e n u m b e r of introns; both are e x p r e s s e d - which d e m o n s t r a t e s t h a t the intron t h a t splits the coding region of one of t h e m has no e s s e n t i a l role, in cis, in the expression of t h a t gene ( 3 2 ) . Although a c o m m o n c o n j e c t u r e is t h a t the splicing m i g h t h a v e a r e g u l a t o r y r o l e , so f a r t h e r e is no t i s s u e - d e p e n d e n t splicing p a t t e r n t h a t could be i n t e r p r e t e d as showing the e x i s t e n c e of a gene (or tissue) s p e c i f i c splicing e n z y m e . The introns a r e much longer than the exons. Their DNA s e q u e n c e drifts rapidly by m u t a t i o n and small additions and deletions ( a c c u m u lating c h a n g e s as rapidly as possible at the s a m e r a t e as t h e s i l e n t c h a n g e s in codons). This suggests t h a t it is not their s e q u e n c e t h a t is r e l e v a n t , but t h e i r length. Their f u n c t i o n is to m o v e the exons a p a r t along the c h r o m o s o m e . DNA SEQUENCING ANI~ GENE STRUCTURE 367 globin Ig light chain Ig heavy chain 9 Fig. 8. Examples of the intron-exon structure of a few genes. (i) The gene for globin is broken up by two introns into three exons (20). (2) The functional gene in a myeloma cell for the immunoglobulin lambda light chain is broken up into a short exon corresponding to the hydrophobic leader sequence, an exon corresponding to the V region, and then, after an intron of some t h o u s a n d bases, an exon corres p o n d i n g to the 112 amino acids of the constant region (26). (3) A typical gene for a g a m m a h e a v y chain of i m m u n o g l o b u l i n (27928). The mature gene corresponds to a hydrophobic leader sequence, an exon corresponding to the variable region, and then, after a long intr0n, a series of exons: the first corresponding to the first domain of the constant region:, the second corresponding to a 1 5 - a m i n o - a c i d h i n g e region, the third corresponding to the second domain of the constant region, and the fourth exon corresponding to the third domain of the constant region. A consequence of the separation of exons by long introns is that the recombination frequency, both illegitimate and legitimate, between e x o n s will be higher (25). This will increase the rate, over evolutionary time, at which the exons, representing p a r t s of t h e p r o t e i n structures, will be shuffled and reassorted to make new combinations. Consider the process by which a structural d o m a i n is d u p l i c a t e d t o m a k e t h e t w o - d o m a i n s t r u c t u r e of t h e light chain of the immunoglobulins (or duplicated again to make the four-dom ai n s t r u c t u r e of the heavy chain, or combined to make the triple s t r u c t u r e represented by ovomucoid ( 2 9 ) ) . C l a s s i c a l l y , this i n v o l v e d a p r e c i s e u n e q u a l crossing-over that fused the two copies of the original gene, in phase, to make a double-length gene. As Fig. 9 shows, this process involves an e x t r e m e l y rare, precise illegitimate event (a recombination event th at leads to the fusing of two DNA sequences at a point where t here is no matching of sequence) that has as its consequence the synthesis at a high level of the new, presumably more useful, double-length gene product. C o n s i d e r t h e s a m e p r o c e s s agai nst the background of a general splicing mechanism. Again, the process of forming the double g e n e mu s t involve an i11egitimate recombination event, but now that 368 GILBERT G X b > e v e n t can occur anywhere Within a stretch of I000 to 10 000 bases flanking the 3' side of one copy and the 5' side of the other to form an i n t r o n s e p a r a t i n g t h e genes for the two domains. From a long transcript across this region, even i n e f f i c i e n t splicing may produce the n e w d o u b l e - l e n g t h gene product. This will happen some t06 to I08 times more rapidly than the classica! process because of all of t h e d i f f e r e n t combinations of sites at which the recombination can occur. If the long transcript can be spliced~ even at a low frequency, some of the double-length product can be made. This is a faster way for evolution to form the final gene: proceeding through a rapid step to a s t r u c t u r e t h a t can p r o d u c e a small a m o u n t of the useful gene product. Small mutational steps can b e selected to p r o d u c e b e t t e r splicing signals and thus more of the gene product. If the splicing DNA S E Q U E N C I N G AND GENE S T R U C T U R E 369 Fig. 9. A double-length gene product arises through unequal crossing over. (a) The classical process by which a gene c o r r e s p o n d i n g to a single polypeptide chain might have its length doubled by a c r o s s i n g over. The top two lines indicate two coding regions, brought into a c c i d e n t a l a p p o s i t i o n by some act of i l l e g i t i m a t e r e c o m b i n a t i o n which fuses the carboxy terminal region (the 3' end) of one copy of the gene to the amino terminal region (the 5' end) of the other. This rare illegitimate event (involving no sequence m a t c h i n g ) would, if it occurs in phase, produce a double-length gene which could code for a d o u b l e - l e n g t h RNA which in turn translates into a double-length protein containing the reiteration of a basic domain. (b) The same process occurring in the presence of the splicing function. Now the unequal crossing over can occur anywhere to the 3' side of one copy of the gene and anywhere in front of the 5' end of the other copy of the gene to produce a gene containing ~wo exons separated by a long intron. I conjecture that the long transcript of this region now will be spliced at some low frequency to produce a mature message encoding the reiterated protein. signals already exist, recombination w i t h i n i n t r o n s p r o v i d e s an i m m e d i a t e way to build p o l y m e r i c s t r u c t u r e s o u t of s i m p l e r u n i t s . One would p r e d i c t t h a t p o l y m e r i c s t r u c t u r e s , m a d e up of s i m p l e r units, will be found to have genes in which the i n t r o n - e x o n s t r u c t u r e of t h e p r i m i t i v e u n i t is r e p e a t e d , s e p a r a t e d again by introns. T h a t is the case. The r a t e of l e g i t i m a t e r e c o m b i n a t i o n b e t w e e n the exons of a gene will be i n c r e a s e d by the introns. Consider two m u t a t i o n s to b e t t e r f u n c t i o n i n g , a r i s i n g in d i f f e r e n t p a r t s of a gene and spreading, by s e l e c t i o n , through the population. Classically, both m u t a t i o n s could end up in a single p o l y p e p t i d e chain, a f t e r both genes find t h e i r way into a single diploid individual, by homologous r e c o m b i n a t i o n within t h e gene. Fig. 10 shows t h a t this process also should be s p e e d e d s o m e 10to 1 0 0 - f o l d by s p r e a d i n g t h e e x o n s a p a r t . This effect will be s t r o n g e s t if t h e e x o n s c a n e v o l v e s e p a r a t e l y - if t h e y r e p r e s e n t s t r u c t u r e s t h a t can a c c u m u l a t e s u c c e s s f u l c h a n g e s i n d e p e n d e n t l y . F u r t h e r m o r e , one can c h a n g e the p a t t e r n of exons by changing the initiation or t e r m i n a t i o n of t h e RNA t r a n s c r i p t , to add e x t r a exons or t o t i e t o g e t h e r e x o n s f r o m o n e region of the DNA to exons f r o m a n o t h e r . This has been o b s e r v e d in adenovirus, and is found in n o t a b l e e x a m p l e s in the immunoglobulins in which exons can be added to or s u b t r a c t e d f r o m the c a r b o x y t e r m i n u s of the h e a v y c h a i n to m o d i f y the protein. H o o d ' s l a b o r a t o r y has shown t h a t - t h i s p r o c e s s is used to switch b e t w e e n two d i f f e r e n t f o r m s of an IgM h e a v y chain ( 3 3 ) . A membrane-bound f o r m is s y n t h e s i z e d by a longer t r a n s c r i p t , which splices on two additional exons and splices out p a r t of the last exon of t h e s h o r t e r t r a n s c r i p t . The shorter transcript synthesizes a s e c r e t e d f o r m of the protein. In a similar way the s w i t c h of t h e V 370 GILBERT (] X ( q b IIII | Fig. i0. Introns speed legitimate recombination. (a) The classical pattern by which two mutati0ns~ one occurring in one copy of the gene~ at the left end~ and the other occurring in the other copy of the gene~ at the r i g h t - h a n d end~ might get together by r e c o m b i n a t i o n happening in the homologous stretch of DNA that separates the two mutations. This recombination can create a single gene c a r r y i n g both mutations. (b) The same process happening in a gene in which the mutations occur in separate exons separated by an intron. Now the recombination can occur anywhere9 either in the exon or within the intron~ to produce a new g e n e carrying both mutations. Since the rate of recombination will be directly proportional to the distance along the DNA between the mutations~ it will be faster. DNA S E Q U E N C I N G AND GENE S T R U C T U R E 371 region f r o m IgM to an IgD c o n s t a n t region is p r o b a b l y the result of a d i f f e r e n t , still longer~ t r a n s c r i p t which s p l i c e s a c r o s s to a t t a c h t h e V - r e g i o n e x o n s to the new c o n s t a n t - r e g i o n exons of the d e l t a class. T h e s e c o m b i n a t i o n s of genes h a v e c e r t a i n l y been c r e a t e d by r e c o m bination e v e n t s within the DNA t h a t u l t i m a t e l y b e c o m e s the intron of t h e longer t r a n s c r i p t i o n unit. T h e m o s t s t r i k i n g p r e d i c t i o n of t h i s e v o l u t i o n a r y view is t h a t s e p a r a t e e l e m e n t s defined by t h e e x o n s h a v e s o m e f u n c t i o n a l s i g n i f i c a n c e , t h a t t h e s e e l e m e n t s h a v e been a s s o r t e d and put t o g e t h e r in new c o m b i n a t i o n s t o m a k e up t h e p r o t e i n s t h a t we k n o w . Gene p r o d u c t s a r e a s s e m b l e d o u t o f previously a c h i e v e d solutions of the structure-function problem. C l e a r e x a m p l e s of this a r e still m e a g e r . The hydrophobic l e a d e r s e q u e n c e which is involved in the t r a n s f e r of p r o t e i n s through m e m b r a n e s , a n d ~vhich is t r i m m e d o f f a f t e r t h e s e c r e t i o n , is o f t e n on s e p a r a t e exons - m o s t notably in the i m m u n o globulins ( s e e Fig. 8), but also in ovomuc6'id ( 2 9 ) . In t h e p a i r of g e n e s for insulin in the rat~ a p r o d u c t o~ r e c e n t duplication ( 3 2 ) , the two chains of insulin lie on s e p a r a t e ex0ns in one gen% on a single exon in the o t h e r . The a n c e s t r a l g e n e ( t h e c o m m o n s t r u c t u r e in o t h e r s p e c i e s ( 3 4 ) ) has the additional intron - s u g g e s t i n g t h a t the gene was put t o g e t h e r originally f r o m s e p a r a t e pieces. The gene for l y s o z y m e is broken up into four exons; the second one c a r r i e s the c r i t i c a l a m i n o acids of the a c t i v e s i t e and m o s t of the s u b s t r a t e c o n t a c t s ( 3 5 ) . In the gene for globin the c e n t r a l exon encodes a l m o s t all of the h e m e contacts. Fig. 11 shows a s c h e m a t i c dissection of the m o l e c u l e . A r e c e n t e x p e r i m e n t (36) has shown t h a t the p o l y p e p t i d e t h a t c o r r e sponds to the c e n t r a l exon in itself is a h e m e - b i n d i n g ' m i n i g l o b i n ' ; the side exons h a v e provided p o l y p e p t i d e m a t e r i a l to s t a b i l i z e the protein. A t t h e s a m e m o m e n t t h a t the rapid sequencing m e t h o d s and the m o l e c u l a r cloning g a v e us the p r o m i s e of being able to work out t h e s t r u c t u r e of any gen% the ability to a c h i e v e a c o m p l e t e u n d e r s t a n d i n g of t h e g e n e t i c m a t e r i a l , N a t u r e r e v e a l e d h e r s e l f to be m o r e c o m p l e x than we had imagined. We could not read the gene p r o d u c t d i r e c t l y f r o m the c h r o m o s o m e by DNA sequencing alone. We m u s t appeal to t h e s e q u e n c e of the a c t u a l protein, or at l e a s t the s e q u e n c e of the m a t u r e m e s s e n g e r RNA, to l e a r n t h e i n t r o n - e x o n structure of t h e gene. N o n e t h e l e s s t h e h o p e e x i s t s , t h a t as we look down on the s e q u e n c e of DNA in the c h r o m o s o m % we will n o t l e a r n s i m p l y t h e p r i m a r y s t r u c t u r e Of the gene products, but we will learn a s p e c t s of the f u n c t i o n a l s t r u c t u r e of the p r o t e i n s - p u t t o g e t h e r o v e r e v o l u t i o n a r y t i m e as exons linked through introns. My i n t e r e s t in biology has always c e n t e r e d on two p r o b l e m s , how is t h e g e n e t i c i n f o r m a t i o n m a d e m a n i f e s t ? and how is it c o n t r o l l e d ? We h a v e l e a r n e d much a b o u t the way in which a gene is t r a n s l a t e d into protein. The control of genes in p r o k a r y o t e s is well u n d e r s t o o d , but for e u k a r y o t e s the c r i t i c a l m e c h a n i s m s of c o n t r o l a r e s t i l l n o t known. T h e p u r p o s e of r e s e a r c h is to e x p l o r e the unknown. The desire for new knowledge calls f o r t h the a n s w e r s to new questions. I o w e a g r e a t d e b t to m y s t u d e n t s and c o l l a b o r a t o r s o v e r the years; the g r e a t e s t to 3ira Watson~ who s t i m u l a t e d m y i n t e r e s t in m o l e c u l a r biology~ to Benno Muller-Hill~ with whom I worked on the 2 a c repressor~ and to Allan Maxam~ with whom I d e v e l o p e d t h e DNA sequencing. 372 GILBERT P.., DNA SEQUENCING AND GENE STRUCTURE 373 Fig. ii. A schematic dissection of globin into the product of the separate exons. (a) The black arrows show the points at which the structure of a chain of globin is interrupted by the introns. (The structures of the chains of globin and of myoglobin are very similar; the schematic structure s h o w n is myoglobin.) The introns interrupt the protein in the alpha-helical regions to break the protein into three portions, shown in ( b ) . The product of the central exon surrounds the heme; the products of the other two exons, I conjecture, wrap around and stabilize the protein. References i. 2. 3. 4. 5. 6. 7. 8. 9. I0. ii. 12. 13. G i l b e r t W & MHller-Hill B (1966) Isolation of the 2ac repressor. Proc. Natl. Acad. Sci. U.S.A. 56, 1891-1898. Gilbert W & MUller-Hill B (1967) The 2ac operator is DNA. Proc. Natl. Acad. Sci. U.S.A. 58, 2415-2421. Gilbert W & Maxam A (1973) The nucleotide sequence of the 2ac operator. Proc. Natl. Acad. Sci. U.S.A. 70, 3581-3584. Gilbert W, Maizels N & Maxam A (1975) Sequences of controlling regions of the E. coli lactose operator. Genetics 79, 227. Dickson RC, Abelson J, Barnes WM & Reznikoff WS (1975) Genetic regulation: the lac control region. Science 187, 27-35. Gilbert W, Maxam A & Mirzabekov A (1976) Contacts between the 2ac repressor and DNA revealed by methylation. In Contro2 of R i b o s o m e Synthesis, pp 139-148, Alfred Benzon Symposium 9, Munksgaard. Maxam AM & Gilbert W (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74, 560-564. Tonegawa S, Maxam AM, Tizard R, Bernard O & Gilbert W (1978) Sequence of a mouse germ-line gene for a variable region of an immunoglobulin light chain. Proc. Natl. Acad. Sci U.S.A. 75 1 1485-1489. Maxam AM & Gilbert W (1980) Sequencing end-labeled DNA with base-specific chemical cleavages. Methods in Enzymology 65, 499-560. Sanger F & Coulson AR (1978) The use of thin acrylamide gels for DNA sequencing. FEBS Lett. 87, 107-110. Farabaugh PJ (1978) Sequence of the lacI gene. Nature 274, 765-769. Beyreuther K, Adler K, Fanning E, Murray C~ Klemm A & Geisler N (1975) Amino-acid sequence of lac repressor from Escherichia coli. Eur. J. Biochem. 59, 491-509. Coulondre C, Miller JH, Farabaugh PJ & Gilbert W (1978) Molecular basis of base substitution hotspots in Escherichia coli. Nature 274, 775-780. 374 14. GILBERT Lindahl T, Ljungquist S, Siegert W, Nyberg B & Sperens B (1977) DNA N-glycosidases. J. Biol. Chem. 252, 3286-3294. 15. Sutcliffe JG (1978) Nucleotide sequence of the ampicillin resistance gene of Escherichia co2i plasmid pBR322. Proc. Natl. Acad. Sci. U.S.A. 75, 3737-3741. 16. Ambler RP & Scott GK (1978) Partial amino acid sequence of penicillinase coded by Escherichia co2i plasmid R6K. Proc. Natl. Acad. Sci. U.S.A. 75, 3732-3736. 17. Sutcliffe JG (1978) Complete nucleotide sequence of the Escherichia co2i plasmid pBR322. Cold Spring Harbor Symposia on Quantitative Biology 43, 77-90. 18. For a review, see: Yanofsky C (1981) Attenuation in the control of expression of bacterial operons. Nature 289, 751-758. 19. Tilghman SM, Tiemeier DC, Seidman JG, Peterlin BM, Sullivan M, Maizel JV & Leder P (1978) Intervening sequence of DNA identified in the structural portion of a mouse ~-globin gene. Proc. Natl. Acad. Sci. U.S.A. 75, 725-729. 20. Konkel DA, Tilghman SM & Leder P (1978) The sequence of the chromosomal mouse 8-globin major gene: homologies in capping, splicing and poly(A) sites. Cell 15, 1125-1132. 21. Brack C & Tonegawa S (1977) Variable and constant parts of the immunoglobulin light chain gene of a mouse myeloma cell are 1250 nontranslated bases apart. Proc. Natl. Acad. Sci. U.S.A. 74, 5652-5656. 22. Breathnach R, Mandel JL & Chambon P (1977) Ovalbumin gene is split in chicken DNA. Nature 270, 314-319. 23. Berget SM, Moore C & Sharp PA (1977) Spliced segments at the 5' terminus of adenovirus 2 late mRNA. Proc. Natl. Acad. Sci. U.S.A. 74, 3171-3175. 24. Chow LT, Gelinas RE, Broker TR & Roberts RJ (1977) An amazing sequence arrangement at the 5' ends of adenovirus 2 messenger RNA. Cell 12, I-8. 25. Gilbert W (1978) Why genes in pieces? Nature 271, 501. 26. Bernard O, Hozumi N & Tonegawa S (1978) Sequences of mouse immunoglobulin light chain genes before and after somatic changes. Cell 15, 1133-1144. 27. Sakano H, Rogers JH, H~ppi K, Brack C, Traunecker A, Maki R~ Wall R & Tonegawa S (1979) Domains and the hinge region of an immunoglobulin heavy chain are encoded in separate DNA segments. Nature 277, 627-633. 28. Honjo T, Obata M, Yamawaki-Kataoka Y, Kataoka T, Kawakami T, Takahashi N & Mano Y (1979) Cloning and complete nucleotide sequence of mouse immunoglobin yl chain gene. Cell 18, 559-568. 29. Stein JP, Catterall JF, Kristo P, Means AR & O'Malley BW (1980) Ovomucoid intervening sequences specify functional domains and generate protein polymorphism. Cell 21, 681-687. 30. Yamado Y, Avvedimento VE, Mudryj M, Ohkubo H, Vogeli G, Irani M, Pastan I & de Crombrugghe B (1980) The collagen gene: evidence for its evolutionary assembly by amplification of a DNA segment containing an exon of 54 bp. Cell 22, 887-892. DNA SEQUENCING AND GENE STRUCTURE 31. 32. 33. 34. 35. 36. 375 Doolittle WF (1978) Genes in pieces: were they ever together? Nature 272, 581. Lomedico P, Rosenthal N9 Efstratiadis A, Gilbert W~ Kolodner R & Tizard R (1979) The structure and evolution of the two nonallelic rat preproinsulin genes. Cell 18, 545-558. Early P~ Rogers J9 Davis M~ Calame K, Bond M, Wall R & Hood L (1980) Two mRNAs can be produced from a single immunoglobin gene by alternative RNA processing pathways. Cell 20, 313-319. Perler Fy Efstratiadis A~ Lomedico P9 Gilbert W9 Kolodner R & Dodgson J (1980) The evolution of genes: the chicken preproinsulin gene. Cell 20, 555-566. Jung A~ Sippel AE~ Grez M & Sch~tz G (1980) Exons encode functional and structural units of chicken lysozyme. Proc. Natl. Acad. Sci. U.S.A. 77, 5759-5763. Craik CS, Buchman SR & Beychok S (1980) Characterization of globin domains: heme binding to the central exon product. Proc. Natl. Acad. Sci. U.S.A. 77, 1384-1388.
© Copyright 2026 Paperzz