DNA sequencing and gene structure

353
Bioscience Reports I, 353-375 (1981)
Printed in Great Britain
DNA s e q u e n c i n g and g e n e s t r u c t u r e
Nobel l e c t u r e , g D e c e m b e r 1980
Walter GILBERT
Department of Biochemistry and Molecular Biology,
Harvard University, The Biological Laboratories,
Cambridge, Massachusetts 02138, U.S.A.
When we w o r k out the s t r u c t u r e of DNA m o l e c u l e s , we e x a m i n e
the f u n d a m e n t a l level t h a t underlies all p r o c e s s e s in living cells.
DNA
is the i n f o r m a t i o n s t o r e t h a t u l t i m a t e l y d i c t a t e s the s t r u c t u r e of e v e r y
gene p r o d u c t , d e l i n e a t e s e v e r y p a r t of the o r g a n i s m . The order of t h e
bases along DNA c o n t a i n s the c o m p l e t e set of i n s t r u c t i o n s t h a t m a k e
up the g e n e t i c i n h e r i t a n c e .
We do not know how to i n t e r p r e t t h o s e
instructions;
l i k e a c h i l d , we c a n s p e l l o u t t h e a l p h a b e t without
u n d e r s t a n d i n g m o r e than a f e w words on a page.
I c a m e to the c h e m i c a l DNA sequencing by a c c i d e n t .
Since the
middle s i x t i e s m y w o r k had f o c u s s e d on t h e c o n t r o l of g e n e s in
b a c t e r i a , studying a s p e c i f i c gene product, a p r o t e i n r e p r e s s o r m a d e by
the c o n t r o l g e n e f o r t h e l a c o p e r o n ( t h e c l u s t e r of g e n e s t h a t
m e t a b o l i z e the sugar l a c t o s e ) .
Benno M~iller-Hill and I had i s o l a t e d
and c h a r a c t e r i z e d this m o l e c u l e during t h e l a t e s i x t i e s a n d d e m o n s t r a t e d t h a t this p r o t e i n bound to b a c t e r i a l DNA i m m e d i a t e l y at the
b e g i n n i n g of t h e f i r s t g e n e of t h e t h r e e - g e n e
cluster that this
r e p r e s s o r c o n t r o l l e d (1,2).
In the y e a r s since then, my l a b o r a t o r y had
shown t h a t this protein a c t e d by p r e v e n t i n g t h e RNA p o l y m e r a s e f r o m
copying the l a c operon genes into RNA. I had used the f a c t t h a t the
l a c r e p r e s s o r bound to DNA a t a s p e c i f i c r e g i o n , t h e o p e r a t o r ,
to
i s o l a t e the DNA of this region by digesting all of the r e s t of the DNA
with D N a s e to l e a v e only a small f r a g m e n t bound to t h e r e p r e s s o r ,
protected
f r o m t h e a c t i o n of the e n z y m e .
This i s o l a t e d a t w e n t y f i v e - b a s e - p a i r f r a g m e n t of DNA out of the 3 million base pairs in the
bacterial chromosome.
In t h e e a r l y s e v e n t i e s , Allan M a x a m and I
worked out the s e q u e n c e of this small f r a g m e n t (3) by copying this
DNA into short f r a g m e n t s of RNA and using on t h e s e RNA copies the
s e q u e n c i n g m e t h o d s t h a t had b e e n d e v e l o p e d by S a n g e r a n d h i s
c o l l e a g u e s in the l a t e sixties.
This was a laborious process t h a t took
s e v e r a l years.
When a s t u d e n t , N a n c y Maizels, then d e t e r m i f l e d t h e
s e q u e n c e of t h e f i r s t 63 b a s e s of the m e s s e n g e r RNA for the l a c
operon genes, we d i s c o v e r e d t h a t t h e l a c r e p r e s s o r b o u n d t o D N A
immediately
a f t e r the s t a r t of the m e s s e n g e r RNA ( 4 ) , in a region
t h a t lies under t h e RNA p o l y m e r a s e when it binds to DNA to i n i t i a t e
RNA synthesis.
We c o n t i n u e d to c h a r a c t e r i z e the l a c o p e r a t o r by
sequencing a n u m b e r of m u t a t i o n s ( o p e r a t o r c o n s t i t u t i v e m u t a t i o n s )
t h a t d a m a g e d the ability of the r e p r e s s o r to bind to DNA. We w a n t e d
to d e t e r m i n e m o r e DNA s e q u e n c e in the region to d e f i n e t h e p o l y m e r a s e binding site and o t h e r e l e m e n t s involved in l a c g e n e control;
h o w e v e r , t h a t s e q u e n c e w a s w o r k e d o u t in a n o t h e r l a b o r a t o r y by
D i c k s o n , A b e l s o n , B a r n e s , and R e z n i k o f f ( 5 ) .
Thus by the middle
9
Nobel
Foundation
1981
354
GILBERT
s e v e n t i e s I knew all the sequences t h a t I had been curious about, and
my students (David Pribnow and 3ohn Majors) and I w e r e t r y i n g t o
a n s w e r q u e s t i o n s a b o u t t h e i n t e r a c t i o n of the RNA p o l y m e r a s e and
o t h e r control f a c t o r s with DNA.
At this point, a n o t h e r line of e x p e r i m e n t s was opened up by a new
suggestion.
Andrei Mirzabekov c a m e to visit me in early 1975. The
purpose of his visit was twofold; to describe e x p e r i m e n t s t h a t he had
been doing using dimethyl s u l f a t e to m e t h y l a t e the guanines and t h e
adenines in DNA and to urge me to do a similar e x p e r i m e n t with the
l a c repressor.
Dimethyl s u l f a t e m e t h y l a t e s the guanines uniquely a t
t h e /r p o s i t i o n , w h i c h is exposed in the major groove of the DNA
double helix, while it m e t h y l a t e s the adenines at the /r position which
is e x p o s e d in the minor groove (Fig. i ) .
Mirzabekov had used this
p r o p e r t y to a t t e m p t to d e t e r m i n e the disposition o f ~ h i s t o n e s and o f
c e r t a i n antibiotics on the DNA molecule bY observing the blocking of
the i n c o r p o r a t i o n of r a d i o a t i v e m e t h y l groups onto t h e g u a n i n e s and
adenines of bulk DNA. He urged me to use this groove s p e c i f i c i t y to
learn something a b o u t the i n t e r a c t i o n of the 7ac r e p r e s s o r w i t h t h e
lac operator.
H o w e v e r , the amounts of 2ac o p e r a t o r available were
e x t r e m e l y small, and t h e r e was no o b v i o u s way of e x a m i n i n g t h e
protein sitting on DNA to ask which bases in the sequence the protein
would p r o t e c t against a t t a c k by the dimethyl s u l f a t e r e a g e n t .
It was not until a f t e r a second visit by Mirzabekov t h a t an idea
finally e m e r g e d .
He and I, and Allan M a x a m and 3 a y G r a l l a , h a d
lunch together.
D u r i n g our c o n v e r s a t i o n I had an i d e a :[or an
e x p e r i m e n t , which u l t i m a t e l y underlies our s e q u e n c i n g m e t h o d .
We
k n e w we c o u l d o b t a i n a defined DNA f r a g m e n t , 55 base-pairs long,
which c a r r i e d near its c e n t e r the region to which the l a c r e p r e s s o r
bound. This fragment was made by cutting the DNA sequentially with
two different r e s t r i c t i o n enzymes, each defining one end of the
fragment (see Fig. 2). Secondly, I knew that at every base along the
DNA at which methylation occurred, that base could be removed by
heat.
Furthermore, once that had happened, only a sugar would be
left holding the DNA chain together, and that sugar could be hydrolysed, in principle, in alkali to break the DNA chain. I put these
ideas together by conjecturing that i i we labelled one end of one
strand of the DNA fragment with radioactive phosphate, we might
determine the point of methylation by measuring the distance between
the labelled end and the point of breakage.
We could get such
labelled DNA by isolating a DNA fragment (by length by e l e c t r o phoresis through polyacrylamide gels) made by c u t t i n g w i t h one
restriction enzyme, labelling both ends of t h a t f r a g m e n t , and then
c u t t i n g it again w i t h a second r e s t r i c t i o n enzyme to release two
separable double-stranded fragments, each having a label at one end
but not the other.
Using polynucleotide kinase this procedure would
introduce a radioactive label into the 5' end of one of the DNA
strands of the fragment bearing the operator while leaving the other
unlabelled (Fig. 2).
If we then modified t h a t DNA w i t h d i m e t h y l
sulfate so that only an occasional adenine or guanine would be
methylated, heated, and cleaved the DNA with alkali at the point of
depurination, we would release among other fragments a labelled
fragment extending from the unique point of labelling to the f i r s t
point of breakage.
Fig. 3 shows this idea. Any fragments frem the
DNA SEQUENCING
AND GENE S T R U C T U R E
355
,,oor
Sugar
HI
CH3 o H ./H
~.N-H
/
/ ' \\
/
0
Sugar
T
N'~~
\
4"1
l
N~
uoor
CH 3
Fig. i.
Methylated cytosine-guanine and thymineadenine base pairs.
The top of the figure shows a
cytosine-guanine base pair methylated at the N7
position of guanine. The bottom of the figure shows
a thymine-adenine base pair methylated at the N3
position of adenine.
The region above each of the
base pairs is exposed in the major groove of DNA.
The region below each of the base pairs lies between
the sugar phosphate backbones in the minor groove of
the DNA double helix.
o t h e r s t r a n d w o u l d be u n l a b e l l e d , as w o u l d any f r a g m e n t s arising
beyond t h e f i r s t p o i n t of b r e a k a g e .
If we c o u l d s e p a r a t e ~t h e s e
fragments
by s i z e , as we could in principle by e l e c t r o p h o r e s i s on a
polyacrylamide
g e l , w e m i g h t be a b l e to a s s o c i a t e t h e l a b e l l e d
f r a g m e n t s b a c k to the known s e q u e n c e and thus i d e n t i f y e a c h guanine
and a d e n i n e in t h e o p e r a t o r t h a t h a d b e e n m o d i f i e d by d i m e t h y l
sulfate.
If we could do the m o d i f i c a t i o n in the p r e s e n c e of the Zac
r e p r e s s o r p r o t e i n bound to the DNA f r a g m e n t , then if t h e r e p r e s s o r
lay close to the N7 of a guanine, we would not m o d i f y the DNA a t
t h a t base, and the c o r r e s p o n d i n g f r a g m e n t w o u l d n o t a p p e a r in t h e
analytical pattern.
356
GILBERT
Alu
Alu
Hpa
Hae
Fig. 2.
P r o c e d u r e for obtaining a double-stranded
DNA fragment u n i q u e l y l a b e l l e d at one end of one
strand.
The figure shows the restriction cuts for
the enzymes Alul (target AG/CT) and Hpa (target C/CGG
producing an uneven end) in the neighborhood of the
2ac operator.
The 2ac repressor is shown b o u n d to
the DNA.
By cutting the DNA from this region first
with the enzyme A2u, then labelling with radioactive
p32 the 5' termini of both strands of the DNA with
polynucleotide kinase, and then cutting in turn with
the enzyme Hpa, we can isolate a DNA fragment that
carries the binding site for the repressor uniquely
labelled at one end of one strand.
I set out to do this e x p e r i m e n t .
Allan M a x a m m a d e t h e labelled
DNA f r a g m e n t s , and I began to learn how to modify and to b r e a k the
DNA.
This involved analysing the r e l e a s e of the bases f r o m DNA and
the breakage steps separately.
Finally the experiment
was put
together.
Fig. 4 shows the results:
an a u t o r a d i o g r a m of the e l e c t r o p h o r e t i c p a t t e r n displays a series of bands e x t e n d i n g downward in size
f r o m t h e f u l l - l e n g t h f r a g m e n t , e a c h caused by the c l e a v a g e of the
DNA at an adenine or a guanine.
The s a m e t r e a t m e n t of the DNA
f r a g m e n t with d i m e t h y l s u l f a t e , now c a r r i e d out in the p r e s e n c e of the
r e p r e s s o r , produced a similar p a t t e r n , e x c e p t t h a t s o m e of the bands
w e r e missing (lane one versus lane two in Fig. #).
The e x p e r i m e n t
was c l e a r l y a success in t h a t the p r e s e n c e of the r e p r e s s o r b l o c k e d
the a t t a c k by d i m e t h y l s u l f a t e on s o m e of the guanines and s o m e of
the adenines in the o p e r a t o r (6).
I hoped t h a t the size d i s c r i m i n a t i o n
would be a c c u r a t e enough to p e r m i t the a s s i g n m e n t of e a c h band in
the p a t t e r n to a s p e c i f i c base in t h e s e q u e n c e .
This proved true
b e c a u s e the spacing in the p a t t e r n , and the p r e s e n c e of light and dark
bands, the dark bands c o r r e s p o n d i n g to guanines and the light ones to
a d e n i n e s , w e r e s u f f i c i e n t l y c h a r a c t e r i s t i c to c o r r e l a t e the two.
The
guanines r e a c t a b o u t five t i m e s m o r e rapidly with d i m e t h y l s u l f a t e ,
w h i l e t h e m e t h y l a t e d adenines a r e r e l e a s e d f r o m DNA m o r e rapLdly
than t h e guanines during heating; the shift in i n t e n s i t i e s as a f u n c t i o n
DNA SEQUENCING AND GENE STRUCTURE
3.57
I I I I I I I I I I""II I I I I I I I I I I
G
GG
G
171111111111111111111
G
GG
G
Modify
CH3
~I~
i ili
i i i i i i i i i I i i i I i i,,ii'i
G
GG
,•I•'ilIIII
G
l l l ' l l l ' l l l l l l l l
G
11111111
G
~"II
G
I I II
I II
Displace
I I I I I
GG
I| I II II'~'I I I | |
G
~,'I
G
CH3
GG
II'I I I
G
11111
G
Cleave
and
Separate
G
Fig. 3. Outline of procedure to produce fragments of
DNA by breaking the DNA at guanines.
Consider an
end-labelled strand of DNA. We modify an occasional
guanine by methylation with d i m e t h y l s u l f a t e .
Heating the DNA will then displace that guanine from
the DNA strand, leaving behind the bare sugar;
cleaving the DNA with alkali will break the DNA at
the missing guanine; the fragments are then separated
by size, the actual size of the fragment (followed
because it carries the radioactive label) being
determined by the position of the modified guanine.
ol t h e t i m e of h e a t t r e a t m e n t could be used to establish unambiguously which base was which originally. Furt herm ore, the gel pat t ern
is so c l e a r , bands corresponding to fragments differing by one base
were resolved.
At this point it was clear that this technique could
determine the adenines and guanines along DNA for distances of the
order of 40 nucleotides. By determining the purines on one strand and
the purines on the c o m p l e m e n t a r y strand (as Fig. 4 does), one has in
principle a com pl e t e sequencing method.
358
GILBERT
H a v i n g in h a n d a r e a c t i o n t h a t will d e t e r m i n e
and distinguish
adenines and guanines, could we find r e a c t i o n s t h a t would distinguish
c y t o s i n e s and t h y m i n e s ?
Allan M a x a m and I t u r n e d our a t t e n t i o n to
this end.
( F i r s t we e x a m i n e d a s e c o n d b i n d i n g s i t e f o r t h e Z a c
r e p r e s s o r t h a t lies a f e w hundred bases f u r t h e r along the DNA, under
the I i r s t gene of the operon.
This binding site has no physiological
function.
We could l o c a t e this binding site on a r e s t r i c t i o n f r a g m e n t
by r e p e a t i n g the m e t h y l a t i o n - p r o t e c t i o n e x p e r i m e n t
and identifying
bases p r o t e c t e d by t h e l a c r e p r e s s o r .
I used the m e t h y l a t i o n p a t t e r n
to a t t e m p t to p r e d i c t the positions ol the adenines and the guanines in
t h e u n k n o w n D N A s e q u e n c e ; Allan M a x a m then used the w a n d e r i n g
spot sequencing m e t h o d of Sanger and his c o w o r k e r s to d e t e r m i n e the
DNA s e q u e n c e of this region to v e r i f y t h a t we had m a d e a s u c c e s s f u l
prediction.)
Allan Maxam then went on to do the next p a r t of t h e
development.
We k n e w t h a t h y d r a z i n e
would a t t a c k the c y t o s i n e s
DNA S E Q U E N C I N G
AND GENE S T R U C T U R E
359
Fig. 4.
Methylation protection experiment with the
2ac repressor.
The columns show the pattern of
cleavage along each strand of the 53- to 55-base-long
fragment bearing the 2ac operator. The second column
from the left represents the DNA (labelled at the 5'
end of the 53-base-long strand) treated by dimethyl
sulfate, cleaved by heat and alkali.
The dark bands
correspond to breaks at guanines; the light bands are
breaks at adenines.
The second column of the figure
reads from the bottom:
-,-,AAA--,--,---,---,-AA .... A-A-AA-A-A-....
The first column shows this same double-stranded
piece of DNA treated with dimethyl sulfate in the
presence of the 2ac repressor. The repressor prevents
the interaction of dimethyl sulfate with the guanine
a third of the way up the pattern and blocks the
reaction with two adenines in the upper third of the
pattern.
The bands corresponding to these two
adenines represent fragments that differ in length by
one base~ 30 and 31 long.
The right-hand side of
this pattern shows the same experiment done with the
label at the end of the other DNA strand.
The
sequence at the far right would be:
-,-G-GGAA--G-GAG-GGA-AA-AA---....
and t h y m i n e s in DNA and d a m a g e t h e m s u f f i c i e n t l y , or e l i m i n a t e t h e m
to f o r m a h y d r a z o n e , so t h a t a f u r t h e r t r e a t m e n t of t h e D N A w i t h
b e n z a i d e h y d e i o l l o w e d by alkali (or a t r e a t m e n t with an a m i n e ) , would
c l e a v e at the d a m a g e d base. This soon g a v e us a similar p a t t e r n , but
b r o k e the DNA at the c y t o s i n e s and t h y m i n e s w i t h o u t d i s c r i m i n a t i o n .
Allan M a x a m then d i s c o v e r e d t h a t salt, o n e - m o l a r s a l t in t h e 1 5 - m o l a r
h y d r a z i n e , a l t e r e d t h e r e a c t i o n to s u p p r e s s t h e r e a c t i v i t y of the
t h y m i n e s . The two r e a c t i o n s t o g e t h e r then positioned and distinguished
t h e t h y m i n e s a n d t h e c y t o s i n e s in a D N A s e q u e n c e .
This last
d i s c r i m i n a t i o n c o n c e p t u a l l y c o m p l e t e d the m e t h o d .
To i m p r o v e t h e
d i s c r i m i n a t i o n b e t w e e n the purines, and to provide r e d u n d a n t i n f o r m a tion which would s e r v e to m a k e the s e q u e n c i n g m o r e s e c u r e a g a i n s t
e r r o r s , we u s e d t h e f a c t t h a t the m e t h y l a t e d adenosines d e p u r i n a t e
m o r e rapidly, e s p e c i a l l y in acid, to r e l e a s e the a d e n o s i n e s p r e f e r e n tially and thus to obtain four r e a c t i o n s :
one for A ' s , one p r e f e r e n t i a l
for G ' s , one for C ' s and T ' s , and one for C ' s d e t e r m i n i n g t h e T ' s by
difference.
T h i s s t a g e of t h e w o r k w a s c o m p l e t e d within a f e w
months.
As the range of resolution on the gels was e x t e n d e d t o w a r d
100 b a s e s , the c l e a v a g e at t h e p y r i m i d i n e s was not s a t i s f a c t o r y ; t h e
result of the i n c o m p l e t e c l e a v a g e w a s t h a t t h e l o n g e r f r a g m e n t s
c o n t a i n e d a v a r i e t y of i n t e r n a l d a m a g e s and the p a t t e r n blurred out.
A f t e r m a n y m o n t h s of s e a r c h i n g , an a n s w e r was f o u n d .
A primary
a m i n e , aniline, will displace the h y d r a z i n e p r o d u c t s and p r o d u c e a b e t a
e l i m i n a t i o n t h a t r e l e a s e s t h e p h o s p h a t e f r o m t h e 3' p o s i t i o n on t h e
sugar, but it will not r e l e a s e the o t h e r p h o s p h a t e , and the m o b i l i t y of
a DNA f r a g m e n t with a blocked 3' p h o s p h a t e bearing a s u g a r - a n i l i n e
residue
is d i f f e r e n t f r o m t h e f r e e - p h o s p h a t e - e n d e d
chains from
360
GILBERT
the other reactions.
A s e c o n d a r y a m i n e , p i p e r i d i n e , is far m o r e
e f f e c t i v e and triggers both beta eliminations as well as eliminating all
t h e b r e a k d o w n p r o d u c t s of t h e h y d r a z i n e r e a c t i o n f r o m the sugar.
This r e a g e n t c o m p l e t e d the DNA sequencing techniques (7).
Although
the d e v e l o p m e n t of the t e c h n i q u e s continued for a n o t h e r nine months,
they were distributed f r e e l y to other groups t h a t w a n t e d to use t h e m .
Fig. 5 shows an actual sequencing p a t t e r n from the 1978 period, used
in the work described in (8).
F i g . 6 s h o w s t w o e x a m p l e s of t h e
c h e m i s t r y (9).
The logic behind the chemical m e t h o d is to divide the a t t a c k into
two steps. In the first we use a r e a g e n t t h a t c a r r i e s the s p e c i f i c i t y ,
DNA SEQUENCING
AND GENE S T R U C T U R E
361
Fig. 5. Actual sequencing pattern from the 1978 period.
Products of four different chemical reactions, applied
to a DNA fragment about 150 bases long~ are electrophoresed on a polyacrylamide gel; three loadings produce
sets of patterns that have moved different distances
down the gel.
The four columns correspond to reactions
that break the DNA:
i) primarily at the adenines, 2)
only at the guanines, 3) at the cytosines but not the
thymines, and 4) at both the cytosines and thymines.
The very shortest fragments are at the bottom right-hand
side of the picture and the sequence is read up the gel
recognizing first the band in the left-hand column
corresponding to A, a band in the two right-hand columns
corresponding to C, a band in the far right-hand column
corresponding
to T, a band in the left-hand column
corresponding to A, and so forth.
After reading up as
far as possible, the sequence continues in the sets of
bands at the left-hand side of the gel and then still
further in the pattern in the center of the gel.
From
the original p h o t o g r a p h the sequence of the entire
fragment can be read.
The fragment is from the genomic
DNA corresponding to the variable region of the lambda
light chain of mouse immunoglobulin (8).
but we l i m i t the e x t e n t of t h a t r e a c t i o n to only one base out of
s e v e r a l hundred possible t a r g e t s in e a c h DNA f r a g m e n t .
This p e r m i t s
t h e r e a c t i o n to be used in the domain of g r e a t e s t s p e c i f i c i t y :
only
the v e r y initial s t a g e s of a c h e m i c a l r e a c t i o n a r e i n v o l v e d .
The
s e c o n d s t e p , t h e c l e a v a g e of t h e D N A s t r a n d , must be c o m p l e t e .
Since the t a r g e t has a l r e a d y been distinguished f r o m the o t h e r bases
along t h e DNA chain by the p r e l i m i n a r y d a m a g e , we can use vigorous,
q u a n t i t a t i v e r e a c t i o n conditions. The result is a clean break, r e l e a s i n g
a f r a g m e n t w i t h o u t hidden d a m a g e s , which is required if the m o b i l i t i e s
of t h e f r a g m e n t s a r e to be v e r y closely c o r r e l a t e d so t h a t the bands
will not blur.
( T h e s p e c i f i c i t y need be only a b o u t a f a c t o r of ten for
t h e s e q u e n c e to be read u n a m b i g u o u s l y . )
Today, l a t e r d e v e l o p m e n t s of the t e c h n i q u e (9) have m o d i f i e d the
guanine r e a c t i o n and r e p l a c e d the d i m e t h y l s u l f a t e adenosine r e a c t i o n
with a d i r e c t d e p u r i n a t i o n r e a c t i o n t h a t r e l e a s e s both the adenines and
guanines equally. These changes, and the i n t r o d u c t i o n of the very thin
gels by S a n g e r ' s group (10), now m a k e it possible to read s e q u e n c e s
out b e t w e e n 200-400 bases f r o m the point of labelling.
The actual
c h e m i c a l workup, the analysis on gels, and the a u t o r a d i o g r a p h y are the
s h o r t p a r t of the process. The m a j o r t i m e spent in DNA sequencing is
s p e n t in the p r e p a r a t i o n of the DNA f r a g m e n t s and on the e l e m e n t s of
strategy.
The speed of the sequencing c o m e s only in p a r t f r o m t h e
a b i l i t y t o r e a d o f f q u i c k l y s e v e r a l h u n d r e d bases of D N A - at a
glance.
The m o r e i m p o r t a n t e l e m e n t is the linear p r e s e n t a t i o n of t h e
p r o b l e m . R a t h e r than s e q u e n c e r a n d o m l y , one can begin at one end of
a r e s t r i c t i o n m a p and m o v e r a t i o n a l l y through a g e n e - or c o n s t r u c t
the r e s t r i c t i o n map as o n e goes.
362
GILBERT
o
~H~
HN , , J ~ N \
o
G
0
,/
0
HN
IO1
5L-.-O_
/
0
0
0
,~,,,~Hc=o
II H2N N
5LO-- P--O--CH3
CH.
>~
N--H
I
0-
O H N/I""NI~N
~
5 - O - - PI - - O - - C ~ n _v
0-
0
I
0
I
OH-
-O~P=O
I
0
0
I
I
- 0 - - P=O
I
0
-O--P=O
I
0
1~
0
0
,
51--0-- IP--O- +
0
,
II
5--O--P--O--CH3
OH
H I
I ~/"~
H,C--C--CH--CH--C--N~..~
CH3
H~ N ~ H c , . = O
HzN , ~ N ~"'NH=
-~(~,,--.-~
9 -
+
o-
O-0--
OH-
~
0
I
-O--P==O
I
0
P--=O
I
0
0
b
HN~CH3
H ~
cH3
o
3 ?u
r
II
5--O--P~0--CH2
,
0
I
"0--P--O
I
0
0
o
Sin--O--P--O-- CHE
o+
~.
,~
", |
H2CBC__CHmCH_C~N~
+
-0
!
-O--P--O
I
0
0!
"O--PmO
I
0
0
I
-O--P~O
I
0
9-o-,~-o~
O.-
e/"-~
~c-%2
01
"O--PmO
I
0
,~
;;
f
II
5 - 0 - - .P--O--CH2
'
O"
~_~o
0
I
-O~PmO
I
0
" N-NH 2
DNA S E Q U E N C I N G AND GENE S T R U C T U R E
363
Fig. 6. Examples of the detailed chemistry involved
in breaking the DNA.
(a) The guanine breakage. The
guanines are first methylated with dimethyl sulfate.
The imidazole ring is opened by treatment with alkali
( d u r i n g the piperidine treatment).
Piperidine
displaces the base and then t r i g g e r s two b e t a
eliminations that release both phosphates from the
sugar and cleave the DNA strand leaving a 3' and a 5'
phosphate.
(b) The hydrazine attack on a thymine
that breaks the DNA at the pyrimidines.
T h e f i r s t l o n g s e q u e n c e was done by a g r a d u a t e student, Phillip
F a r a b a u g h , who used the new t e c h n i q u e s to s e q u e n c e t h e gene for t h e
2ac repressor
(11).
The p r o t e i n s e q u e n c e of this gene p r o d u c t had
been worked out in the e a r l y s e v e n t i e s by B e y r e u t h e r
and his c o workers (12).
S i n c e the a m i n o - a c i d s e q u e n c e was known, he could
quickly (a f e w m o n t h s ) e s t a b l i s h t h e DNA s e q u e n c e .
However, the
DNA s e q u e n c e showed t h a t t h e r e w e r e e r r o r s in the p r o t e i n s e q u e n c e ,
two a m i n o acids dropped a t one place and e l e v e n at a n o t h e r .
Since
t h e p r o t e i n s e q u e n c e contains 360 a m i n o acids, he had to work out a
gene of 1080 bases. DNA s e q u e n c i n g is f a s t e r and m o r e a c c u r a t e t h a n
protein sequencing.
T h e r e a s o n f o r t h i s is t h a t DNA is a linear
i n f o r m a t i o n s t o r e . B e c a u s e the c h e m i s t r y of e a c h r e s t r i c t i o n f r a g m e n t
is like any o t h e r - t h e y d i f f e r only in length - t h e r e is no p a r t i c u l a r
reason for losing t r a c k of t h e m , e x c e p t for t h e v e r y s m a l l e s t .
By
s e q u e n c i n g across the joins b e t w e e n the f r a g m e n t s , one e s t a b l i s h e s an
u n a m b i g u o u s order.
P r o t e i n s , on the o t h e r hand, a r e strings of a m i n o
acids used by N a t u r e to c r e a t e a wide v a r i e t y of c h e m i s t r i e s .
When a
p r o t e i n is f r a g m e n t e d ,
the fragments
can exhibit quite different
properties,
s o m e of which m a y be unusually u n f o r t u n a t e in t e r m s of
solubility or loss. T h e r e is no simple way of keeping a c c o u n t of t h e
t o t a l c o n t e n t of a m i n o acids, or of the order of f r a g m e n t s , as t h e r e is
for DNA, w h e r e t h e length of t h e r e s t r i c t i o n f r a g m e n t s can easily be
measured.
3effrey Miller and his coworkers had done an extensive analysis of
the appearance of mutations in the 2ac repressor gene. Three sites in
the gene are hotspots, at which the mutation rate is some 10 times
higher than at other sites. D N A sequencing showed that at each of
these sites there was a modified base, a 5-methyl cytosine) in the
sequence (13). (The chemical sequencing detects the presence of the
5-methyl cytosine directly, because the methyl group suppresses
c o m p l e t e l y t h e r e a c t i v i t y of this base in the h y d r a z i n e r e a c t i o n .
A
b l a n k s p a c e a p p e a r s in t h e s e q u e n c e , but on t h e o t h e r s t r a n d is a
guanine.)
T h e h i g h m u t a t i o n r a t e is a t r a n s i t i o n t o a t h y m i n e .
5-Methyl c y t o s i n e o c c u r s at a low f r e q u e n c y in DNA" this o b s e r v a t i o n
shows t h a t it is a m u t a g e n .
What is the e x p l a n a t i o n ?
D e a m i n a t i o n of
c y t o s i n e to uracil o c c u r s n a t u r a l l y .
If this o c c u r r e d in DNA it could
lead to a t r a n s i t i o n ; h o w e v e r , it usually does not, s i n c e t h e r e is an
e n z y m e t h a t scans DNA e x a m i n i n g it for deoxyuridine ( 1 4 ) .
When it
finds this base in DNA, m i s m a t c h e d or not, it b r e a k s t h e g l y c o s i d i c
bond and r e m o v e s the uracil.
This is then r e c o g n i z e d as a d e f e c t in
364
GILBERT
D N A , a n d a n o t h e r group of e n z y m e s then r e p a i r the d e p y r i m i d i n a t e d
spot.
H o w e v e r , 5 - m e t h y l c y t o s i n e d e a m i n a t e s to t h y m i n e - a n a t u r a l
c o m p o n e n t of DNA.
On r e p a i r or r e s y n t h e s i s a t r a n s i t i o n will ensue.
This whole a r g u m e n t explains why t h y m i n e is used in DNA - the e x t r a
m e t h y l g r o u p s e r v e s to suppress the e f f e c t s of the n a t u r a l r a t e of
deamination.
To f i n d o u t how e a s y and how a c c u r a t e DNA sequencing was, I
asked a student, G r e g o r S u t c l i f f e , to s e q u e n c e the ampicillin r e s i s t a n c e
gene, the beta-lactamase
gene, of Escherichia coli.
This gene is
c a r r i e d on a v a r i e t y of plasmids, including a small c o n s t r u c t e d plasmid,
p B R 3 2 2 , in E. c o 2 / .
All t h a t he knew a b o u t the protein was an
a p p r o x i m a t e m o l e c u l a r weight, and t h a t a c e r t a i n r e s t r i c t i o n c u t on
the plasmid inactivated
t h a t gene.
He had no previous e x p e r i e n c e
with DNA sequencing when he s e t out to work out the s t r u c t u r e of
DNA for this gene.
A f t e r seven months he had worked out a b o u t
1000 bases of d o u b l e - s t r a n d e d DNA, sequencing one s t r a n d a n d t h e n
sequencing the o t h e r for c o n f i r m a t i o n .
The unique long reading f r a m e
d e t e r m i n e d the s e q u e n c e of the protein p r o d u c t of this gene, a p r o t e i n
of 286 r e s i d u e s ( 1 5 ) .
We t h o u g h t t h a t t h e D N A s e q u e n c e was
unambiguous.
Luckily t h e r e was a v a i l a b l e , f r o m A m b l e r ' s l a b o r a t o r y ,
p a r t i a l s e q u e n c e i n f o r m a t i o n a b o u t the protein which had been o b t a i n e d
as a result of s e v e r a l y e a r s ' work a t t e m p t i n g to develop a s e q u e n c e
for the b e t a - l a c t a m a s e (16).
This i n f o r m a t i o n , while not s u f f i c i e n t to
d e t e r m i n e the protein s e q u e n c e d i r e c t l y , was a d e q u a t e to c o n f i r m t h a t
t h e p r e d i c t i o n of t h e D N A sequencing was c o r r e c t .
S u t c l i f f e then
became very enthusiastic
and s e q u e n c e d t h e r e s t of t h e p l a s m i d
pBR322 during the n e x t six months, to finish his thesis. He s e q u e n c e d
both strands of this ~ 3 6 2 - b a s e - p a i r - l o n g plasmid in o r d e r to c o n f i r m
the s e q u e n c e (17).
The c h e m i c a l sequencing is unambiguou% e x c e p t
for an o c c a s i o n a l c h a r a c t e r i s t i c f e a t u r e in t h e D N A f r a g m e n t
itself
t h a t c a u s e s it to m o v e a n o m a l o u s l y during the gel e l e c t r o p h o r e s i s .
As
longer and longer s t r a n d s a r e being a n a l y s e d on the gel, a hairpin loop
can f o r m at one end of the f r a g m e n t if the s e q u e n c e is s u f f i c i e n t l y
self-complementary.
As the f r a g m e n t a t i o n passes through this portion
of the m o l e c u l e , the mobilities on the gel do not d e c r e a s e u n i f o r m l y
as a function of length, but some of the m o l e c u l e s m o v e a b e r r a n t l y , a
f e a t u r e c a l l e d c o m p r e s s i o n , b e c a u s e the bands on an a u t o r a d i o g r a p h
b e c o m e close t o g e t h e r , or can o v e r l a p to c o n c e a l one or m o r e bases.
T h i s r a r e f e a t u r e o c c u r s a b o u t o n c e e v e r y t h o u s a n d bases.
It is
resolved by sequencing the o p p o s i t e strand in the o t h e r d i r e c t i o n along
the double-stranded
m o l e c u l e ( o r t h e s a m e s t r a n d in the o p p o s i t e
c h e m i c a l d i r e c t i o n ) b e c a u s e the hairpin will f o r m w h e n a d i f f e r e n t
r e g i o n of t h e s e q u e n c e is exposed and the c o m p r e s s i o n f e a t u r e will
o c c u r in a d i f f e r e n t place in the s e q u e n c e .
If b o t h s t r a n d s of t h e
DNA helix a r e s e q u e n c e d , the s e q u e n c e can be unambiguous.
The Structure
of Genes
T h e f i r s t g e n e s to be s e q u e n c e d , t h o s e in b a c t e r i a , yielded an
e x p e c t e d s t r u c t u r e : a contiguous series of codons lying upon t h e DNA
b e t w e e n an initiation signal and one of the t e r m i n a t o r signals. B e f o r e
the position at which the RNA copy will s t a r t , t h e r e lies a site for
t h e R N A p o l y m e r a s e , i n t e r a c t i n g with the Pribnow box, a region of
DNA S E Q U E N C I N G
AND GENE S T R U C T U R E
365
s e q u e n c e homology lying one turn of the helix b e f o r e the initial base
of t h e m e s s e n g e r RNA, and also with a n o t h e r r e g i o n of h o m o l o g y ,
thirty-five
bases before the start.
Thus one could u n d e r s t a n d the
b a c t e r i a l gene in t e r m s of a binding s i t e for t h e RNA p o l y m e r a s e , and
f u r t h e r binding sites for r e p r e s s o r s and a c t i v a t o r p r o t e i n s around and
under the p o l y m e r a s e . A l t e r n a t i v e l y , the c o n t r o l on t r a n s c r i p t i o n could
be e x e r c i s e d by a c o n t r o l of the t e r m i n a t i o n function:
new proteins
or an e l e g a n t t r a n s l a t i o n c o n t r o l (18) could d e t e r m i n e w h e t h e r or not
the p o l y m e r a s e would r e a d past a s t o p signal into a new gene.
When t h e f i r s t g e n e s f r o m v e r t e b r a t e s
were transferred
into
bacteria
by t h e r e c o m b i n a n t
D N A t e c h n i q u e s a n d s e q u e n c e d , an
entirely different structure emerged.
The coding s e q u e n c e s for globin
(19,20), for immunoglobulin (21), and for o v a l b u m i n (22) did not lie
on t h e D N A a s a c o n t i n u o u s s e r i e s of c o d o n s b u t r a t h e r ' w e r e
interrupted
by long s t r e t c h e s of non-coding DNA.
The d i s c o v e r y of
RNA splicing in a d e n o v i r u s by S h a r p a n d his c o w o r k e r s ( 2 3 ) a n d
Broker and R o b e r t s and t h e i r c o w o r k e r s (24) p a v e d t h e way for this
new s t r u c t u r e .
They had shown t h a t a f t e r the original t r a n s c r i p t i o n of
D N A i n t o a l o n g RNA, regions of this RNA a r e spliced out:
some
s t r e t c h e s a r e excised and the r e m a i n i n g portions a r e fused t o g e t h e r by
an as y e t u n d e f i n e d e n z y m a t i c process.
The exons ( 2 5 ) , regions of
t h e DNA t h a t will be e x p r e s s e d in m a t u r e m e s s a g e , a r e s e p a r a t e d
f r o m e a c h o t h e r by introns, regions of DNA t h a t lie within the g e n e t i c
e l e m e n t but whose t r a n s c r i p t s will be s p l i c e d o u t of t h e m e s s a g e .
Fig. 7 shows this process:
the original t r a n s c r i p t of a gene (now
t h o u g h t of as a t r a n s c r i p t i o n unit) will u n d e r g o a s e r i e s of s p l i c e s
b e f o r e being able to f u n c t i o n as a m a t u r e m e s s a g e in the c y t o p l a s m .
Fig. 8 shows a f e w e x a m p l e s .
V e r t e b r a t e genes can h a v e m a n y - 8,
15, e v e n 50 - e x o n s ( 2 9 , 3 0 ) , and the exons a r e for the m o s t p a r t
s h o r t coding s t r e t c h e s s e p a r a t e d by hundreds to s e v e r a l t h o u s a n d s of
b a s e pairs of intron DNA.
The rapid sequencing has m e a n t t h a t we
c a n w o r k o u t t h e D N A s e q u e n c e of a n y of t h e s e c o m p l e x g e n e
structures.
But can we u n d e r s t a n d t h e m ?
The emerging generalization
is t h a t p r o k a r y o t i c
genes have
c o n t i g u o u s coding s e q u e n c e s while the g e n e s for the highest e u k a r y o t e s
a r e c h a r a c t e r i z e d by a c o m p l e x e x o n / i n t r o n s t r u c t u r e .
As we m o v e up
f r o m p r o k a r y o t e s , the s i m p l e s t e u k a r y o t e s , such as y e a s t s , h a v e f e w
introns; f u r t h e r up the e v o l u t i o n a r y ladder the genes a r e m o r e broken
up.
( Y e a s t m i t o c h o n d r i a have introns; a r e t h e y an e x c e p t i o n to this
pattern?)
Are we seeing t h e e m e r g e n c e of the i n t r o n - e x o n s t r u c t u r e
r i s i n g to e v e r g r e a t e r d e g r e e s of c o m p l e x i t y as we m o v e up to the
v e r t e b r a t e s , or t h e loss of p r e e x i s t i n g i n t r o n - e x o n s t r u c t u r e s
as w e
m o v e d o w n to the s i m p l e s t i n v e r t e b r a t e s and the p r o k a r y o t e s ?
One
view considers the splicing as an a d a p t a t i o n t h a t b e c o m e s "ever m o r e
n e c e s s a r y in m o r e h i g h l y s t r u c t u r e d
organisms.
T h e o t h e r view
considers t h e s p l i c i n g as l o s t if t h e o r g a n i s m m a k e s a c h o i c e t o
s i m p l i f y and to r e p l i c a t e its DNA m o r e rapidly, to go through m o r e
g e n e r a t i o n s in a s h o r t t i m e , and t h u s t o be u n d e r a s i g n i f i c a n t
p r e s s u r e to r e s t r i c t its DNA c o n t e n t ( 3 1 ) .
What role can this g e n e r a l i n t r o n - e x o n s t r u c t u r e play in the g e n e s
of the higher o r g a n i s m s ?
Although m o s t genes t h a t h a v e been studied
h a v e this s t r u c t u r e , t h e r e a r e two n o t a b l e e x c e p t i o n s : t h e g e n e s f o r
the histones and t h o s e for the i n t e r f e r o n s . This last d e m o n s t r a t e s t h a t
366
GILBERT
Fig. 7.
A t r a n s c r i p t i o n unit c o r r e s p o n d i n g to
alternating exons and introns.
The whole gene, a
transcription unit, is copied into RNA terminating in
a poly(A) tail. The regions corresponding to introns
are spliced out leaving a messenger RNA made up of
the three exons, the regions that are e x p r e s s e d in
the mature message.
t h e r e can be no a b s o l u t e l y e s s e n t i a l role t h a t the introns must play,
t h e r e can be no a b s o l u t e n e e d f o r s p l i c i n g in o r d e r to e x p r e s s a
p r o t e i n in m a m m a l i a n cells.
Although t h e r e is a line of e x p e r i m e n t s
t h a t shows t h a t s o m e m e s s e n g e r s m u s t have at l e a s t a s i n g l e s p l i c e
m a d e b e f o r e t h e y c a n be e x p r e s s e d , t h e r e is no e v i d e n c e t h a t the
g r e a t m u l t i p l i c i t y of splices a r e needed.
T h e r e is a pair of genes for
i n s u l i n in t h e r a t , t h a t d i f f e r in t h e n u m b e r of introns; both are
e x p r e s s e d - which d e m o n s t r a t e s t h a t the intron t h a t splits the coding
region of one of t h e m has no e s s e n t i a l role, in cis, in the expression
of t h a t gene ( 3 2 ) . Although a c o m m o n c o n j e c t u r e is t h a t the splicing
m i g h t h a v e a r e g u l a t o r y r o l e , so f a r t h e r e is no t i s s u e - d e p e n d e n t
splicing p a t t e r n t h a t could be i n t e r p r e t e d as showing the e x i s t e n c e of
a gene (or tissue) s p e c i f i c splicing e n z y m e .
The introns a r e much longer than the exons.
Their DNA s e q u e n c e
drifts rapidly by m u t a t i o n and small additions and deletions ( a c c u m u lating c h a n g e s as rapidly as possible at the s a m e r a t e as t h e s i l e n t
c h a n g e s in codons).
This suggests t h a t it is not their s e q u e n c e t h a t is
r e l e v a n t , but t h e i r length.
Their f u n c t i o n is to m o v e the exons a p a r t
along the c h r o m o s o m e .
DNA SEQUENCING ANI~ GENE STRUCTURE
367
globin
Ig light chain
Ig heavy chain
9
Fig. 8.
Examples of the intron-exon structure of a
few genes.
(i) The gene for globin is broken up by
two introns into three exons (20).
(2) The functional gene in a myeloma cell for the immunoglobulin
lambda light chain is broken up into a short exon
corresponding to the hydrophobic leader sequence, an
exon corresponding to the V region, and then, after
an intron of some t h o u s a n d bases, an exon corres p o n d i n g to the 112 amino acids of the constant
region (26).
(3) A typical gene for a g a m m a h e a v y
chain of i m m u n o g l o b u l i n (27928).
The mature gene
corresponds to a hydrophobic leader sequence, an exon
corresponding to the variable region, and then, after
a long intr0n, a series of exons: the first corresponding to the first domain of the constant region:,
the second corresponding to a 1 5 - a m i n o - a c i d h i n g e
region, the third corresponding to the second domain
of the constant region, and the fourth exon corresponding to the third domain of the constant region.
A consequence of the separation of exons by long introns is that
the recombination frequency, both illegitimate and legitimate, between
e x o n s will be higher (25).
This will increase the rate, over evolutionary time, at which the exons, representing p a r t s of t h e p r o t e i n
structures, will be shuffled and reassorted to make new combinations.
Consider the process by which a structural d o m a i n is d u p l i c a t e d t o
m a k e t h e t w o - d o m a i n s t r u c t u r e of t h e light chain of the immunoglobulins (or duplicated again to make the four-dom ai n s t r u c t u r e of
the heavy chain, or combined to make the triple s t r u c t u r e represented
by ovomucoid ( 2 9 ) ) .
C l a s s i c a l l y , this i n v o l v e d a p r e c i s e u n e q u a l
crossing-over that fused the two copies of the original gene, in phase,
to make a double-length gene. As Fig. 9 shows, this process involves
an e x t r e m e l y rare, precise illegitimate event (a recombination event
th at leads to the fusing of two DNA sequences at a point where t here
is no matching of sequence) that has as its consequence the synthesis
at a high level of the new, presumably more useful, double-length gene
product.
C o n s i d e r t h e s a m e p r o c e s s agai nst the background of a
general splicing mechanism. Again, the process of forming the double
g e n e mu s t involve an i11egitimate recombination event, but now that
368
GILBERT
G
X
b
>
e v e n t can occur anywhere Within a stretch of I000 to 10 000 bases
flanking the 3' side of one copy and the 5' side of the other to form
an i n t r o n s e p a r a t i n g t h e genes for the two domains.
From a long
transcript across this region, even i n e f f i c i e n t splicing may produce the
n e w d o u b l e - l e n g t h gene product.
This will happen some t06 to I08
times more rapidly than the classica! process because of all of t h e
d i f f e r e n t combinations of sites at which the recombination can occur.
If the long transcript can be spliced~ even at a low frequency, some
of the double-length product can be made.
This is a faster way for
evolution to form the final gene:
proceeding through a rapid step to
a s t r u c t u r e t h a t can p r o d u c e a small a m o u n t of the useful gene
product.
Small mutational steps can b e selected to p r o d u c e b e t t e r
splicing signals and thus more of the gene product.
If the splicing
DNA S E Q U E N C I N G
AND GENE S T R U C T U R E
369
Fig. 9.
A double-length gene product arises through
unequal crossing over.
(a) The classical process by
which a gene c o r r e s p o n d i n g to a single polypeptide
chain might have its length doubled by a c r o s s i n g
over.
The top two lines indicate two coding regions,
brought into a c c i d e n t a l a p p o s i t i o n by some act of
i l l e g i t i m a t e r e c o m b i n a t i o n which fuses the carboxy
terminal region (the 3' end) of one copy of the gene
to the amino terminal region (the 5' end) of the
other.
This rare illegitimate event (involving no
sequence m a t c h i n g ) would, if it occurs in phase,
produce a double-length gene which could code for a
d o u b l e - l e n g t h RNA which in turn translates into a
double-length protein containing the reiteration of a
basic domain.
(b) The same process occurring in the
presence of the splicing function.
Now the unequal
crossing over can occur anywhere to the 3' side of one
copy of the gene and anywhere in front of the 5' end
of the other copy of the gene to produce a gene
containing ~wo exons separated by a long intron.
I
conjecture that the long transcript of this region now
will be spliced at some low frequency to produce a
mature message encoding the reiterated protein.
signals
already
exist, recombination
w i t h i n i n t r o n s p r o v i d e s an
i m m e d i a t e way to build p o l y m e r i c s t r u c t u r e s
o u t of s i m p l e r u n i t s .
One would p r e d i c t t h a t p o l y m e r i c s t r u c t u r e s , m a d e up of s i m p l e r units,
will be found to have genes in which the i n t r o n - e x o n s t r u c t u r e of t h e
p r i m i t i v e u n i t is r e p e a t e d , s e p a r a t e d again by introns.
T h a t is the
case.
The r a t e of l e g i t i m a t e r e c o m b i n a t i o n b e t w e e n the exons of a gene
will be i n c r e a s e d by the introns.
Consider two m u t a t i o n s to b e t t e r
f u n c t i o n i n g , a r i s i n g in d i f f e r e n t
p a r t s of a gene and spreading, by
s e l e c t i o n , through the population.
Classically, both m u t a t i o n s could
end up in a single p o l y p e p t i d e chain, a f t e r both genes find t h e i r way
into a single diploid individual, by homologous r e c o m b i n a t i o n within t h e
gene. Fig. 10 shows t h a t this process also should be s p e e d e d s o m e 10to 1 0 0 - f o l d by s p r e a d i n g t h e e x o n s a p a r t .
This effect will be
s t r o n g e s t if t h e e x o n s c a n e v o l v e s e p a r a t e l y - if t h e y r e p r e s e n t
s t r u c t u r e s t h a t can a c c u m u l a t e s u c c e s s f u l c h a n g e s i n d e p e n d e n t l y .
F u r t h e r m o r e , one can c h a n g e the p a t t e r n of exons by changing the
initiation or t e r m i n a t i o n of t h e RNA t r a n s c r i p t , to add e x t r a exons or
t o t i e t o g e t h e r e x o n s f r o m o n e region of the DNA to exons f r o m
a n o t h e r . This has been o b s e r v e d in adenovirus, and is found in n o t a b l e
e x a m p l e s in the immunoglobulins in which exons can be added to or
s u b t r a c t e d f r o m the c a r b o x y t e r m i n u s of the h e a v y c h a i n to m o d i f y
the protein.
H o o d ' s l a b o r a t o r y has shown t h a t - t h i s p r o c e s s is used to
switch b e t w e e n two d i f f e r e n t f o r m s of an IgM h e a v y chain ( 3 3 ) .
A
membrane-bound
f o r m is s y n t h e s i z e d by a longer t r a n s c r i p t , which
splices on two additional exons and splices out p a r t of the last exon
of t h e s h o r t e r t r a n s c r i p t .
The shorter transcript
synthesizes a
s e c r e t e d f o r m of the protein.
In a similar way the s w i t c h of t h e V
370
GILBERT
(]
X
(
q
b
IIII
|
Fig. i0.
Introns speed legitimate recombination.
(a) The classical pattern by which two mutati0ns~
one occurring in one copy of the gene~ at the left
end~ and the other occurring in the other copy of
the gene~ at the r i g h t - h a n d end~ might get
together by r e c o m b i n a t i o n
happening
in the
homologous stretch of DNA that separates the two
mutations. This recombination can create a single
gene c a r r y i n g both mutations.
(b) The same
process happening in a gene in which the mutations
occur in separate exons separated by an intron.
Now the recombination can occur anywhere9 either
in the exon or within the intron~ to produce a new
g e n e carrying both mutations.
Since the rate of
recombination will be directly proportional to the
distance along the DNA between the mutations~ it
will be faster.
DNA S E Q U E N C I N G
AND GENE S T R U C T U R E
371
region f r o m IgM to an IgD c o n s t a n t region is p r o b a b l y the result of a
d i f f e r e n t , still longer~ t r a n s c r i p t which s p l i c e s a c r o s s to a t t a c h t h e
V - r e g i o n e x o n s to the new c o n s t a n t - r e g i o n exons of the d e l t a class.
T h e s e c o m b i n a t i o n s of genes h a v e c e r t a i n l y been c r e a t e d by r e c o m bination e v e n t s within the DNA t h a t u l t i m a t e l y b e c o m e s the intron of
t h e longer t r a n s c r i p t i o n unit.
T h e m o s t s t r i k i n g p r e d i c t i o n of t h i s e v o l u t i o n a r y view is t h a t
s e p a r a t e e l e m e n t s defined by t h e e x o n s h a v e s o m e f u n c t i o n a l s i g n i f i c a n c e , t h a t t h e s e e l e m e n t s h a v e been a s s o r t e d and put t o g e t h e r in
new c o m b i n a t i o n s t o m a k e up t h e p r o t e i n s t h a t we k n o w .
Gene
p r o d u c t s a r e a s s e m b l e d o u t o f previously a c h i e v e d solutions of the
structure-function problem.
C l e a r e x a m p l e s of this a r e still m e a g e r .
The hydrophobic l e a d e r s e q u e n c e which is involved in the t r a n s f e r of
p r o t e i n s through m e m b r a n e s ,
a n d ~vhich is t r i m m e d o f f a f t e r t h e
s e c r e t i o n , is o f t e n on s e p a r a t e exons - m o s t notably in the i m m u n o globulins ( s e e Fig. 8), but also in ovomuc6'id ( 2 9 ) .
In t h e p a i r of
g e n e s for insulin in the rat~ a p r o d u c t o~ r e c e n t duplication ( 3 2 ) , the
two chains of insulin lie on s e p a r a t e ex0ns in one gen% on a single
exon in the o t h e r . The a n c e s t r a l g e n e ( t h e c o m m o n s t r u c t u r e in o t h e r
s p e c i e s ( 3 4 ) ) has the additional intron - s u g g e s t i n g t h a t the gene was
put t o g e t h e r originally f r o m s e p a r a t e pieces.
The gene for l y s o z y m e
is broken up into four exons; the second one c a r r i e s the c r i t i c a l a m i n o
acids of the a c t i v e s i t e and m o s t of the s u b s t r a t e c o n t a c t s ( 3 5 ) .
In
the gene for globin the c e n t r a l exon encodes a l m o s t all of the h e m e
contacts.
Fig. 11 shows a s c h e m a t i c dissection of the m o l e c u l e .
A
r e c e n t e x p e r i m e n t (36) has shown t h a t the p o l y p e p t i d e t h a t c o r r e sponds to the c e n t r a l exon in itself is a h e m e - b i n d i n g ' m i n i g l o b i n ' ; the
side exons h a v e provided p o l y p e p t i d e m a t e r i a l to s t a b i l i z e the protein.
A t t h e s a m e m o m e n t t h a t the rapid sequencing m e t h o d s and the
m o l e c u l a r cloning g a v e us the p r o m i s e of being able to work out t h e
s t r u c t u r e of any gen% the ability to a c h i e v e a c o m p l e t e u n d e r s t a n d i n g
of t h e g e n e t i c m a t e r i a l , N a t u r e r e v e a l e d h e r s e l f to be m o r e c o m p l e x
than we had imagined.
We could not read the gene p r o d u c t d i r e c t l y
f r o m the c h r o m o s o m e by DNA sequencing alone.
We m u s t appeal to
t h e s e q u e n c e of the a c t u a l protein, or at l e a s t the s e q u e n c e of the
m a t u r e m e s s e n g e r RNA, to l e a r n t h e i n t r o n - e x o n
structure
of t h e
gene.
N o n e t h e l e s s t h e h o p e e x i s t s , t h a t as we look down on the
s e q u e n c e of DNA in the c h r o m o s o m % we will n o t l e a r n s i m p l y t h e
p r i m a r y s t r u c t u r e Of the gene products, but we will learn a s p e c t s of
the f u n c t i o n a l s t r u c t u r e of the p r o t e i n s - p u t t o g e t h e r o v e r e v o l u t i o n a r y t i m e as exons linked through introns.
My i n t e r e s t in biology has always c e n t e r e d on two p r o b l e m s ,
how
is t h e g e n e t i c i n f o r m a t i o n m a d e m a n i f e s t ? and how is it c o n t r o l l e d ?
We h a v e l e a r n e d much a b o u t the way in which a gene is t r a n s l a t e d
into protein.
The control of genes in p r o k a r y o t e s is well u n d e r s t o o d ,
but for e u k a r y o t e s the c r i t i c a l m e c h a n i s m s of c o n t r o l a r e s t i l l n o t
known.
T h e p u r p o s e of r e s e a r c h is to e x p l o r e the unknown.
The
desire for new knowledge calls f o r t h the a n s w e r s to new questions.
I o w e a g r e a t d e b t to m y s t u d e n t s and c o l l a b o r a t o r s o v e r the
years; the g r e a t e s t to 3ira Watson~ who s t i m u l a t e d m y i n t e r e s t in
m o l e c u l a r biology~ to Benno Muller-Hill~ with whom I worked on the
2 a c repressor~ and to Allan Maxam~ with whom I d e v e l o p e d t h e DNA
sequencing.
372
GILBERT
P..,
DNA SEQUENCING AND GENE STRUCTURE
373
Fig. ii.
A schematic dissection of globin into the
product of the separate exons.
(a) The black arrows
show the points at which the structure of a chain of
globin is interrupted by the introns.
(The structures of the chains of globin and of myoglobin are
very similar; the schematic structure s h o w n is
myoglobin.)
The introns interrupt the protein in the
alpha-helical regions to break the protein into three
portions, shown in ( b ) . The product of the central
exon surrounds the heme; the products of the other
two exons, I conjecture, wrap around and stabilize
the protein.
References
i.
2.
3.
4.
5.
6.
7.
8.
9.
I0.
ii.
12.
13.
G i l b e r t W & MHller-Hill B (1966) Isolation of the 2ac
repressor.
Proc. Natl. Acad. Sci. U.S.A. 56, 1891-1898.
Gilbert W & MUller-Hill B (1967) The 2ac operator is DNA.
Proc. Natl. Acad. Sci. U.S.A. 58, 2415-2421.
Gilbert W & Maxam A (1973) The nucleotide sequence of the 2ac
operator.
Proc. Natl. Acad. Sci. U.S.A. 70, 3581-3584.
Gilbert W, Maizels N & Maxam A (1975) Sequences of controlling regions of the E. coli lactose operator.
Genetics
79, 227.
Dickson RC, Abelson J, Barnes WM & Reznikoff WS (1975)
Genetic regulation: the lac control region.
Science 187,
27-35.
Gilbert W, Maxam A & Mirzabekov A (1976) Contacts between the
2ac repressor and DNA revealed by methylation.
In Contro2 of
R i b o s o m e Synthesis, pp 139-148, Alfred Benzon Symposium 9,
Munksgaard.
Maxam AM & Gilbert W (1977) A new method for sequencing DNA.
Proc. Natl. Acad. Sci. U.S.A. 74, 560-564.
Tonegawa S, Maxam AM, Tizard R, Bernard O & Gilbert W (1978)
Sequence of a mouse germ-line gene for a variable region of
an immunoglobulin light chain.
Proc. Natl. Acad. Sci U.S.A.
75 1 1485-1489.
Maxam AM & Gilbert W (1980) Sequencing end-labeled DNA with
base-specific chemical cleavages.
Methods in Enzymology 65,
499-560.
Sanger F & Coulson AR (1978) The use of thin acrylamide gels
for DNA sequencing.
FEBS Lett. 87, 107-110.
Farabaugh PJ (1978) Sequence of the lacI gene.
Nature 274,
765-769.
Beyreuther K, Adler K, Fanning E, Murray C~ Klemm A & Geisler
N (1975) Amino-acid sequence of lac repressor from Escherichia coli. Eur. J. Biochem. 59, 491-509.
Coulondre C, Miller JH, Farabaugh PJ & Gilbert W (1978)
Molecular basis of base substitution hotspots in Escherichia
coli. Nature 274, 775-780.
374
14.
GILBERT
Lindahl T, Ljungquist S, Siegert W, Nyberg B & Sperens B
(1977) DNA N-glycosidases. J. Biol. Chem. 252, 3286-3294.
15. Sutcliffe JG (1978) Nucleotide sequence of the ampicillin
resistance gene of Escherichia co2i plasmid pBR322. Proc.
Natl. Acad. Sci. U.S.A. 75, 3737-3741.
16. Ambler RP & Scott GK (1978) Partial amino acid sequence of
penicillinase coded by Escherichia co2i plasmid R6K. Proc.
Natl. Acad. Sci. U.S.A. 75, 3732-3736.
17. Sutcliffe JG (1978) Complete nucleotide sequence of the
Escherichia co2i plasmid pBR322. Cold Spring Harbor Symposia
on Quantitative Biology 43, 77-90.
18. For a review, see: Yanofsky C (1981) Attenuation in the
control of expression of bacterial operons. Nature 289,
751-758.
19. Tilghman SM, Tiemeier DC, Seidman JG, Peterlin BM, Sullivan
M, Maizel JV & Leder P (1978) Intervening sequence of DNA
identified in the structural portion of a mouse ~-globin
gene. Proc. Natl. Acad. Sci. U.S.A. 75, 725-729.
20. Konkel DA, Tilghman SM & Leder P (1978) The sequence of the
chromosomal mouse 8-globin major gene: homologies in capping,
splicing and poly(A) sites. Cell 15, 1125-1132.
21. Brack C & Tonegawa S (1977) Variable and constant parts of
the immunoglobulin light chain gene of a mouse myeloma cell
are 1250 nontranslated bases apart. Proc. Natl. Acad. Sci.
U.S.A. 74, 5652-5656.
22. Breathnach R, Mandel JL & Chambon P (1977) Ovalbumin gene is
split in chicken DNA. Nature 270, 314-319.
23. Berget SM, Moore C & Sharp PA (1977) Spliced segments at the
5' terminus of adenovirus 2 late mRNA. Proc. Natl. Acad.
Sci. U.S.A. 74, 3171-3175.
24. Chow LT, Gelinas RE, Broker TR & Roberts RJ (1977) An amazing
sequence arrangement at the 5' ends of adenovirus 2 messenger
RNA. Cell 12, I-8.
25. Gilbert W (1978) Why genes in pieces? Nature 271, 501.
26. Bernard O, Hozumi N & Tonegawa S (1978) Sequences of mouse
immunoglobulin light chain genes before and after somatic
changes. Cell 15, 1133-1144.
27. Sakano H, Rogers JH, H~ppi K, Brack C, Traunecker A, Maki R~
Wall R & Tonegawa S (1979) Domains and the hinge region of an
immunoglobulin heavy chain are encoded in separate DNA
segments. Nature 277, 627-633.
28. Honjo T, Obata M, Yamawaki-Kataoka Y, Kataoka T, Kawakami T,
Takahashi N & Mano Y (1979) Cloning and complete nucleotide
sequence of mouse immunoglobin yl chain gene. Cell 18,
559-568.
29. Stein JP, Catterall JF, Kristo P, Means AR & O'Malley BW
(1980) Ovomucoid intervening sequences specify functional
domains and generate protein polymorphism. Cell 21,
681-687.
30. Yamado Y, Avvedimento VE, Mudryj M, Ohkubo H, Vogeli G, Irani
M, Pastan I & de Crombrugghe B (1980) The collagen gene:
evidence for its evolutionary assembly by amplification of a
DNA segment containing an exon of 54 bp. Cell 22, 887-892.
DNA SEQUENCING AND GENE STRUCTURE
31.
32.
33.
34.
35.
36.
375
Doolittle WF (1978) Genes in pieces: were they ever
together? Nature 272, 581.
Lomedico P, Rosenthal N9 Efstratiadis A, Gilbert W~ Kolodner
R & Tizard R (1979) The structure and evolution of the two
nonallelic rat preproinsulin genes. Cell 18, 545-558.
Early P~ Rogers J9 Davis M~ Calame K, Bond M, Wall R & Hood L
(1980) Two mRNAs can be produced from a single immunoglobin
gene by alternative RNA processing pathways. Cell 20,
313-319.
Perler Fy Efstratiadis A~ Lomedico P9 Gilbert W9 Kolodner R &
Dodgson J (1980) The evolution of genes: the chicken
preproinsulin gene. Cell 20, 555-566.
Jung A~ Sippel AE~ Grez M & Sch~tz G (1980) Exons encode
functional and structural units of chicken lysozyme. Proc.
Natl. Acad. Sci. U.S.A. 77, 5759-5763.
Craik CS, Buchman SR & Beychok S (1980) Characterization of
globin domains: heme binding to the central exon product.
Proc. Natl. Acad. Sci. U.S.A. 77, 1384-1388.