Vol. 48 No. 3/2001
587–598
QUARTERLY
I dedicate this review in memory of Professor Jacek Augustyniak, who introduced me to
the world of genes and genomes
Review
The human genome structure and organization*
Wojciech Makałowski
National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, U.S.A.
Received: 22 January, 2001; accepted: 26 February, 2001
The genetic information of humans is encoded in two genomes: nuclear and mitochondrial. Both reflect the molecular evolution of humans, from the beginning of life (about 4.5 billion years ago) until the origin of the species Homo sapiens about 100 000 years ago. For this reason, the human genome contains some features that are common to different groups of organisms and some that are unique to Homo sapiens. The 3.2 × 10^9 base pairs of the human nuclear genome are packed into 23 chromosomes of different sizes. The smallest chromosome, the 21st, contains 5 × 10^7 base pairs, while the biggest one, the 1st, contains 2.63 × 10^8 base pairs. Although the nucleotide sequence of all chromosomes has been established, the organisation of the nuclear genome still raises questions; for example, the exact number of genes encoded by the human genome is still unknown, with estimates ranging from 30 to 150 thousand genes. Coding sequences represent a few percent of the human nuclear genome. The majority of the genome is represented by repetitive sequences (about 50%) and noncoding unique sequences. This part of the genome is frequently, and wrongly, called "junk DNA". The distribution of genes on chromosomes is irregular: DNA fragments containing a low percentage of GC pairs code for fewer genes than fragments with a high percentage of GC pairs.
*Presented at the XXXVI Meeting of the Polish Biochemical Society, Poznań, 13 September 2000, Poland.
Mailing address: NCBI/NLM/NIH, 45 Center Drive, MSC 6510, Bldg. 45, Room 6As.47A, Bethesda, MD 20892-6510, U.S.A., phone: (301) 435 5989; fax: (301) 480 2918; e-mail: [email protected]
Abbreviations: CDS, coding DNA sequence; EST, expressed sequence tag; FISH, fluorescence in situ hybridization; HERV, human endogenous retrovirus; LINE, long interspersed repetitive element; LTR, long terminal repeat; SAR, scaffold-attachment region; SINE, short interspersed repetitive element; TE, transposable element; UTR, untranslated region.
INTRODUCTION: HISTORICAL PERSPECTIVE
From the beginning of humanity, people have been interested in themselves. They were well aware of two aspects of living nature: an immense variability within each species and the tendency for characteristics of parents to be transmitted to their offspring. Pre-Socratic philosophers had already noticed that people share some characteristics, e.g. they usually have, with some exceptions, two hands, a nose, and a large forehead; in other words, they are alike. On the other hand, everybody is different, and nobody should have a problem distinguishing those two gentlemen by such characteristics as eyes, cheeks, or shirts. Ancient people were also aware that the above was true for both intra- and inter-species comparisons.
The question arises: how does it happen that our children are more similar to their parents than to monkeys? The problem already intrigued pre-Socratic philosophers. Probably the first person who publicly expressed his thoughts on the subject was Anaxagoras of Clazomenae. According to his teaching, seed material is carried from all parts of the body to the reproductive organs by the humors. Fertilization is the mixing of the seed material of father and mother. That all parts of the body participate in the production of seed material was thought to be documented by the fact that blue-eyed parents have blue-eyed children and baldheaded men have sons that become baldheaded (not a very good prospect for my own children). The idea of panspermy, or pangenesis, was adopted and taught by the famous physician Hippocrates (about 460–377 B.C.) and was widely accepted until the end of the nineteenth century, including by Charles Darwin. One of the greatest scientists of all time, Aristotle of Stagira, had a different view of the problem. Aristotle's theory of inheritance, as described in one of his major works, De generatione animalium, was holistic. He held that the contributions of males and females were not equal. The semen of the male contributes the form-giving principle, eidos, while the menstrual blood, catamenia, of the female is the unformed substance shaped by the eidos of the semen. "The female always provides the material, the male provides that which fashions the material into shape; this in our view, is the specific characteristic of each sex: that is what it means to be male or to be female." (Aristotle, 1965).
The twentieth century witnessed the accelerated development of biology, and with it an understanding of the nature of the inheritance process. Consequently, an effort to decipher the blueprint of our species began. Several biological discoveries were especially important for deciphering the human genome. Everything started with the rediscovery of Mendel's laws by Hugo Marie De Vries (1900), followed by Thomas H. Morgan's work linking chromosomes to heredity in 1910 (Morgan, 1910). In 1953, James D. Watson and Francis H.C. Crick unraveled the structure of DNA (Watson & Crick, 1953a; Watson & Crick, 1953b). Eight years later, Johann H. Matthaei and Marshall Nirenberg performed experiments that enabled the deciphering of the genetic code. With the development of fast methods of DNA sequencing in the mid-seventies (Maxam & Gilbert, 1977; Sanger et al., 1977), followed by the automation of cloning and sequencing in the nineties, the way to understanding our blueprint became clear. By now, many complete genomes of both prokaryotic and eukaryotic organisms have been sequenced. For up-to-date tables of completed genomes, go to http://www.ebi.ac.uk/genomes/. On June 26, 2000, virtually all news agencies in the world announced the completion of a working draft of the human genome. This accomplishment was considered so important for humankind that, instead of being announced at a scientific conference or in a scientific journal, as is usual for scientific milestones, it was presented at a special press conference organized at The White House in Washington, D.C. Within several days, the faces of major players from both the private and public sectors appeared on magazine covers around the world, including the Polish weeklies Polityka and Wprost. It is worth pointing out that the public genome project had already completed the sequences of two chromosomes: 22 (December, 1999) (Dunham et al., 1999) and 21 (May, 2000) (Hattori et al., 2000). The working draft of the human genome was published by both projects last January.
HUMAN GENOME: GENERAL INFORMATION
Our genetic material is stored in two organelles: the nucleus and mitochondria. This review focuses on the nuclear genome, in which 3.2 billion bp are packed into 22 pairs of autosomes and two sex chromosomes, X and Y. Human chromosomes are not of equal size; the smallest, chromosome 21, is 54 million bp long; the largest, chromosome 1, is almost five times bigger at 249 million bp (see Table 1).
Genomic sequences can be divided in several ways. From the functional point of view we can distinguish genes, pseudogenes, and non-coding DNA (Fig. 1). Only a minute fraction of the genome, about 3%, codes for proteins. There are many pseudogenes in the human genome (0.5%), but most of the genome consists of introns and intergenic DNA. Almost half of these sequences consist of different transposons; moreover, the remaining non-coding DNA most likely originated from transposable elements as well, but with time they have mutated beyond recognition.

Table 1. Physical sizes of human chromosomes
Chromosome    Size (Mbp)
1             249
2             237
3             192
4             183
5             174
6             165
7             153
8             135
9             132
10            132
11            132
12            123
13            108
14            105
15            99
16            84
17            81
18            75
19            69
20            63
21            54
22            57
X             141
Y             60

Figure 1. Fractions of different sequences in the human genome.
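The figures in Table 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the sizes are copied from Table 1, while the dictionary and function names are invented here. It sums the chromosome sizes and computes the size ratio of the largest to the smallest chromosome.

```python
# Chromosome sizes in Mbp, copied from Table 1 of this review.
CHROMOSOME_SIZES_MBP = {
    "1": 249, "2": 237, "3": 192, "4": 183, "5": 174, "6": 165,
    "7": 153, "8": 135, "9": 132, "10": 132, "11": 132, "12": 123,
    "13": 108, "14": 105, "15": 99, "16": 84, "17": 81, "18": 75,
    "19": 69, "20": 63, "21": 54, "22": 57, "X": 141, "Y": 60,
}


def summarize(sizes):
    """Return (total Mbp, smallest chromosome, largest chromosome, size ratio)."""
    smallest = min(sizes, key=sizes.get)
    largest = max(sizes, key=sizes.get)
    total = sum(sizes.values())
    return total, smallest, largest, sizes[largest] / sizes[smallest]


total, smallest, largest, ratio = summarize(CHROMOSOME_SIZES_MBP)
print(f"Sum of Table 1 sizes: {total} Mbp")
print(f"Smallest: chromosome {smallest}; largest: chromosome {largest}")
print(f"Largest/smallest ratio: {ratio:.2f}")
```

The tabulated sizes add up to roughly 3.0 billion bp, somewhat less than the 3.2 billion bp quoted in the text, and the ratio of about 4.6 matches the statement that chromosome 1 is almost five times bigger than chromosome 21.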
SEQUENCE COMPLEXITY
The human genome contains various levels of complexity, as demonstrated by reassociation kinetics. Such analyses of the human genome estimate that 60% of the DNA is either single copy or present in very few copies; 30% of the DNA is moderately repetitive; and 10% is considered highly repetitive.
Various staining techniques demonstrate alternative banding patterns of mitotic chromosomes, referred to as karyograms. Although the three broad classes of DNA are scattered throughout the chromosome, chromosomal banding patterns reflect levels of compartmentalization of the DNA. The C-banding technique yields dark-staining regions of the chromosome (C bands), referred to as heterochromatin. These regions are highly coiled, contain highly repetitive DNA, and are typically found at the centromeres, telomeres, and on the Y chromosome. They are composed of long arrays of tandem repeats, and therefore some may have a nucleotide composition that differs significantly from the remainder of the genome (approximately 40–42% GC). This means that they can be separated from the bulk of the genome by buoyant density (caesium chloride) gradient centrifugation. Gradient centrifugation results in a major band and three minor bands referred to as satellite bands, hence the term satellite DNA.
The G-banding technique yields a pattern of
alternating light and dark bands reflecting
variations in base composition, time of replication, chromatin conformation, and the density of genes and repetitive sequences. Therefore, the karyograms define chromosomal organization and allow for identification of the
different chromosomes. The darker bands, or
G bands, are comparatively more condensed,
more AT-rich, less gene-rich and replicate
later than the DNA within the pale bands,
which correspond to the R bands by an alternative staining technique. More recently,
these alternative banding patterns have been
correlated to the level of compaction of scaffold-attachment regions (SARs).
The human genome may also be compartmentalized into large (> 300 kb) segments of DNA that are homogeneous in base composition, referred to as isochores (Bernardi, 2000), based on sequence analysis and compositional mapping. L1 and L2 are GC-poor (or 'light') isochore families representing about 62% of the genome. The H1, H2 and H3 (heavy) isochore classes are increasingly GC-rich. There is some correlation between isochores and chromosomal bands. G bands are almost exclusively composed of GC-poor isochores, with a minor contribution from H1. R bands can be classified further into T bands (R banding at elevated temperatures), which are composed mainly of H2 and H3 isochores, and R' bands (non-T R bands), which comprise nearly equal amounts of GC-rich (primarily H1) and GC-poor isochores.
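Base composition, the quantity underlying isochores, is simple to compute from a sequence. The following is a minimal sketch and not the compositional mapping procedure used in the cited work: the function names, window sizes, and toy sequence are assumptions made for illustration, and no isochore family thresholds are implied.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(base in "GC" for base in counted) / len(counted)


def windowed_gc(seq: str, window: int, step: int) -> list:
    """GC content of consecutive windows; a real analysis would use very large windows."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


# Toy example: an AT-rich stretch followed by a GC-rich stretch.
toy = "ATATATTAAT" * 3 + "GCGGCCGCGA" * 3
print([round(value, 2) for value in windowed_gc(toy, window=10, step=10)])
```

Scanning a chromosome with windows of hundreds of kilobases in this way would reveal the long compositionally homogeneous stretches described above.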
Additionally, there are five human chromosomes (13, 14, 15, 21, 22) distinguished at
their terminus by a thin bridge with rounded
ends referred to as chromosomal satellites.
These contain repeats of genes coding for
rRNA and ribosomal proteins that coalesce to
form the nucleolus and are known as the nucleolar organizing regions.
HUMAN GENE NUMBER
It is interesting that the number of genes encoded by our genome is not known and will probably remain unknown long after completion of the human genome sequencing. Nevertheless, in the last decade, several groups have tried to answer this question using different methods (see Table 2). Unfortunately, the estimates differ widely, from as low as 28 000 up to 80 000 genes per haploid human genome. The whole genomic community is so excited by this mysterious number that it decided to organize the Gene Sweepstake. The Gene Sweepstake will run between 2000 and 2003, and its detailed rules may be found at: http://www.ensembl.org/Genesweep/. As of January 2001, 165 bets had been made, with gene numbers between 27 462 and 153 478 and a mean value of 61 710.
Table 2. Estimation of human gene number using different methods
Gene number       Method                  Reference
80 000            CpG islands             (Antequera & Bird, 1994)
64 000            ESTs                    (Fields et al., 1994)
35 000            ESTs                    (Ewing & Green, 2000)
28 000–34 000     Comparative genomics    (Roest Crollius et al., 2000)
30 000            Gene punctuation        (Yang et al., 2001)

EXON CHARACTERISTICS
In most human genes, coding sequences are interrupted by stretches of non-coding sequences, which are spliced out during mRNA maturation. Using the nomenclature introduced by Walter Gilbert (Gilbert, 1978), human genes look like mosaics consisting of a series of exons (DNA sequences that can subsequently be found in the mature mRNA) and introns (silent DNA sequences that are absent from the final mRNA). As nothing in nature is simple, some introns carry significant information and even code for other complete genes (see the description of nested genes below). Initially, it was thought that introns occurred only in untranslated parts of mRNA and that coding sequences (CDS) were not interrupted. However, it soon became clear that introns could be found in all domains of the mRNA molecule. Therefore, exons can be classified as follows: 5' UTR exons, coding exons, 3' UTR exons, and all possible combinations of those three main types, including single exons that cover the whole mRNA. The latter are very interesting from the point of view of evolutionary biology, because in most cases they are retroposed copies of "regular" genes with introns. Michael Zhang of Cold Spring Harbor Laboratory analyzed 4731 human exons (Zhang, 1998). It appears that human exons are relatively short, with a median length of 167 bp and a mean of 216 bp. The shortest exon was only 12 bp long, while the longest was 6609 bp. These numbers have to be taken with some caution because they are based on GenBank annotation, which is sometimes not very precise. Mixed exons (including both coding and non-coding sequences) tend to be longer than single-type exons, especially those at the end of the message; not surprisingly so, since 3' UTRs are relatively long in mammalian mRNAs. In our analysis of over 2000 human mRNA sequences, the median and mean sizes of human message domains were as follows: 118 nt and 191 nt for 5' UTRs, 1191 and 1424 for CDSs, and 534 and 576 for 3' UTRs, respectively (Makalowski et al., 1996; Makalowski & Boguski, 1998).
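The mean domain sizes quoted above give a rough picture of an average human message. The sketch below only rearranges those published means; the variable names are invented here, and adding the means assumes that all three values describe the same set of mRNAs.

```python
# Mean domain lengths in nucleotides, as quoted in the text.
MEAN_DOMAIN_NT = {"5' UTR": 191, "CDS": 1424, "3' UTR": 576}

mean_mrna_nt = sum(MEAN_DOMAIN_NT.values())
domain_share = {
    name: length / mean_mrna_nt for name, length in MEAN_DOMAIN_NT.items()
}
# Three nucleotides per codon; the count includes the stop codon.
mean_codons = MEAN_DOMAIN_NT["CDS"] / 3

print(f"Mean message length: {mean_mrna_nt} nt")
for name, share in domain_share.items():
    print(f"{name}: {share:.1%} of the message")
print(f"Mean CDS length: about {mean_codons:.0f} codons")
```

On these numbers, roughly two thirds of an average message is coding sequence, and the 3' UTR is about three times longer than the 5' UTR.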
GENE DISTRIBUTION
Genes may be transcribed from either the same or the opposite strand of the genome, i.e. they may lie in the same (tail-to-head) or opposite orientation (head-to-head or tail-to-tail). Although the vast majority of the human genome consists of non-exonic sequences, a surprisingly large number of genes occupy the same genomic space. About 6% of human genes reside in introns of other genes (Wong et al., 2000). For example, intron 27 of the NF1 gene hosts three other genes that have small introns of their own, suggesting that they are not products of retroposition (see Fig. 2). Additionally, over 100 gene pairs overlap at the 3' end, i.e. their 3' UTRs occupy the same region, though on different strands (I. Makalowska, personal communication). The TPR and MSF genes map to the same region of chromosome 1. The last exon of the TPR gene is 872 nt long and overlaps completely with the last exon of the MSF gene (200 nt). Interestingly, the very end of the MSF gene overlaps with the intron of the TPR gene (see Fig. 3).

Figure 2. An example of nested genes.
The human sequence from chromosome 1 (GenBank accession number AC004526) was analyzed using GeneMachine (Makalowska et al., 2001). Connected closed boxes represent gene models as predicted by GenScan software (Burge & Karlin, 1997) and boxes with arrows represent results of BLASTn search; AC004526 was used as a query against nr database.
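The orientation terms used above (tail-to-head, head-to-head, tail-to-tail) and the notions of nested and overlapping genes can be made concrete with a small data structure. This is a minimal sketch under stated assumptions: the `Gene` class, the function names, and all coordinates are hypothetical, and "nested" is judged on whole gene spans rather than on individual introns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate on the forward strand
    end: int     # rightmost coordinate on the forward strand
    strand: str  # "+" or "-"


def orientation(a: Gene, b: Gene) -> str:
    """Relative orientation of two genes on the same chromosome."""
    if a.strand == b.strand:
        return "tail-to-head"
    left, _right = sorted((a, b), key=lambda gene: gene.start)
    # Left gene on "+" and right gene on "-": the 3' ends face each other.
    return "tail-to-tail" if left.strand == "+" else "head-to-head"


def spatial_relation(a: Gene, b: Gene) -> str:
    """Classify two gene spans as separate, nested, or overlapping."""
    if a.end < b.start or b.end < a.start:
        return "separate"
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    return "nested" if a_contains_b or b_contains_a else "overlapping"


# Hypothetical coordinates, chosen only to exercise each case.
host = Gene("HOST", 1_000, 90_000, "+")
guest = Gene("GUEST", 30_000, 35_000, "-")
gene_a = Gene("A", 100, 5_000, "+")
gene_b = Gene("B", 4_800, 9_000, "-")

print(spatial_relation(host, guest))
print(spatial_relation(gene_a, gene_b), orientation(gene_a, gene_b))
```

Gene pairs that overlap at their 3' ends on different strands, like those described in the text, come out as "overlapping" and "tail-to-tail" in this scheme.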
Unlike in plant genomes, most non-exonic sequence in the human genome consists of introns (Wong et al., 2000). However, genes are not equally distributed throughout the genome. There is a distinct association between GC-richness and gene density. This is consistent with the association of most genes with CpG islands, the 500–1000 bp GC-rich segments flanking (usually at the 5' end) most housekeeping and many tissue-specific genes. The clustering of CpG islands, as demonstrated by fluorescence in situ hybridization, further depicts gene-poor and gene-rich chromosomal segments (Craig & Bickmore, 1994). As a consequence, more than half of human genes are located in the so-called "genomic core" (isochores H2 and H3), which comprises only 12% of the human genome (see Table 3).
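The contrast between the genomic core and the rest of the genome can be expressed as a relative gene density. The sketch below uses only the fractions given in the text and in Table 3 (54% of genes in 12% of the genome, 46% of genes in the remaining 88%); the function and variable names are invented for illustration.

```python
def relative_density(gene_fraction: float, genome_fraction: float) -> float:
    """Gene density of a compartment relative to the genome-wide average."""
    return gene_fraction / genome_fraction


core = relative_density(gene_fraction=0.54, genome_fraction=0.12)
empty = relative_density(gene_fraction=0.46, genome_fraction=0.88)
fold_difference = core / empty

print(f"Genomic core: {core:.2f} times the average gene density")
print(f"'Empty' space: {empty:.2f} times the average gene density")
print(f"Core versus 'empty' space: about {fold_difference:.1f}-fold")
```

The resulting difference of close to an order of magnitude agrees with the densities listed in Table 3 (one gene per 10 kbp versus one gene per 100 kbp).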
Figure 3. An example of overlapping genes.
The human sequence from chromosome 1 (GenBank accession number AL133533) was analyzed using GeneMachine (Makalowska et al., 2001). Connected closed boxes represent gene models as predicted by GenScan software (Burge & Karlin, 1997) and open boxes with arrows represent results of BLASTn search; AL133533 was used as a query against nr database.
Table 3. Gene density in different isochores
                   Genomic core    "Empty" space
Isochore type      H2 and H3       L, H1
Genome fraction    12%             88%
Gene fraction      54%             46%
Gene density       1/10 kbp        1/100 kbp

GENE FAMILIES
Many genes can be clustered in groups of different sizes based on sequence similarity. The similarity between two genes varies from genes coding for identical products to genes in which product similarity is barely detectable and/or limited to short sequence stretches called sequence motifs. Gene families arose during evolution by gene duplications over different periods of time, as reflected in sequence similarity. In general, more similar genes shared a common ancestor more recently than genes with weaker similarity, although gene conversion can result in very similar or identical gene copies regardless of the time of gene duplication. Gene duplication can occur by different mechanisms, such as unequal recombination or retroposition. Not all duplicated genes remain active; some of them end up in genomic oblivion and are called pseudogenes. Some pseudogenes can be rescued from genomic death by capturing a promoter and regulatory elements in the course of evolution, as happened with the θ-globin gene, which was rescued by an Alu element after 200 million years of silent existence (see discussion in Makalowski, 1995).
The histone gene family is an example of very similar genes. It consists of five genes that tend to be linked, although in differing arrays of variable copy numbers dispersed in the human genome. The individual genes of a particular histone family encode essentially identical products (i.e. all H4 genes code for the identical H4 protein). Analysis of individual human genomic clones has identified isolated histone genes, e.g. H4, clusters of two or more histone genes, or clusters of all histone genes, e.g. H3-H4-H1-H3-H2A-H2B (Hentschel & Birnstiel, 1981). A majority of histone genes form a large cluster on human chromosome 6 (6p21.3) and a small cluster at 1q21. Interestingly, histone genes lack introns, a rare feature for eukaryotic genes.
Genes that encode ribosomal RNA (rRNA)
total about 0.4% of the DNA in the human genome. The individual genes of a particular
rRNA family are essentially identical. The
28S, 5.8S and 18S rRNA genes are clustered
with spacer units in tandem arrays of approximately 60 copies each yielding about 2 million
bp of DNA. These clusters are present on the
short arms of five acrocentric chromosomes
and form the nucleolar organizing regions,
hence approximately 300 copies. These three
rRNA genes are transcribed as a single unit
and then cleaved. 5S rRNA genes are clustered on chromosome 1q.
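The rRNA figures in this paragraph can be cross-checked against each other. The following sketch uses only numbers from the text (five acrocentric chromosomes, about 60 copies and about 2 million bp per cluster, a 3.2 billion bp genome); the constant names are invented here, and the calculation covers only the clustered 28S, 5.8S and 18S arrays.

```python
GENOME_BP = 3_200_000_000       # nuclear genome size used in this review
CLUSTERS = 5                    # one array on each acrocentric chromosome
COPIES_PER_CLUSTER = 60         # approximate copies per tandem array
BP_PER_CLUSTER = 2_000_000      # approximate DNA per tandem array

total_copies = CLUSTERS * COPIES_PER_CLUSTER
total_bp = CLUSTERS * BP_PER_CLUSTER
genome_percent = 100 * total_bp / GENOME_BP

print(f"rRNA repeat units in total: about {total_copies}")
print(f"DNA in the five arrays: about {total_bp / 1e6:.0f} Mbp")
print(f"Share of the genome: about {genome_percent:.2f}%")
```

The five arrays account for roughly 0.3% of the genome, the same order as the 0.4% quoted for all rRNA genes, which also include the 5S genes on chromosome 1q.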
Some genes in the human genome share
highly conserved amino-acid domains with
weak overall similarity. These often have developmental function. There are nine dispersed paired box (Pax) genes that contain
highly conserved DNA binding domains with
six α-helices. The homeobox or Hox genes
share a common 60 amino-acid sequence. In
humans there are four Hox gene clusters,
each on a different chromosome. However,
the individual genes in the cluster demonstrate greater similarity to a counterpart gene
in another cluster than to the other genes in
the same cluster.
There are pseudogenes that are the result of
retroposition (retropseudogenes). The
pseudogenes lack introns and the flanking
DNA sequences of the functional locus and
therefore are not products of gene duplication. The generation of these types of elements is dependent on the reverse transcriptase of other retroelements such as LINEs.
REPETITIVE SEQUENCES
The human genome is occupied by stretches
of DNA sequences of various length that exist
in variable copy number. These repetitive sequences may be in a tandem orientation or
they may be dispersed throughout the genome. Repetitive sequences may be classified
by function, dispersal patterns, and sequence
relatedness. Satellite DNA typically refers to
highly repetitive sequences with no known
function and interspersed repeat sequences
are typically the products of transposable element integration, including retrogenes and
retropseudogenes of a functional gene. For an up-to-date list of human repetitive elements, visit RepBase at http://www.girinst.org/.
GENOMIC DUPLICATIONS
Thirty years ago, Susumu Ohno put forth a hypothesis about two duplications of the whole genome in the early stages of vertebrate evolution (Ohno, 1970). According to his hypothesis, most vertebrate gene families should give three or four well-defined branches, as presented in Fig. 4. Unfortunately, analysis of over 10 000 vertebrate gene families does not support Ohno's hypothesis (Makalowski, unpublished observation). Nevertheless, duplications in the human genome do exist, and they play a significant role in the evolution of genes and of the genome. Although sometimes very large, they appear to be on a local, not a global, scale. For example, the comparison of the complete human chromosome 21 sequence with both itself and other human sequences revealed many large duplications, with the largest intra-chromosomal duplication being 189 kb (position 188–377 and 14 795–15 002 in q arm) and the largest detected inter-chromosomal duplication being a region of over 100 kb from the q arm of chromosome 21 (position 646–751) duplicated in chromosome 22 (position 45–230) (RIKEN, 2000).

Figure 4. A hypothetical phylogenetic tree of a vertebrate gene family under Ohno's hypothesis about two genome duplications in early vertebrate evolution.
The Drosophila gene represents an outgroup and four clusters of a gene family are encircled. An asterisk (*) marks the first genome duplication and a hash sign (#) marks points of the second genome duplication. Different branch lengths suggest different evolutionary rates after ancestral gene duplication.
MICROSATELLITES, MINISATELLITES, AND MACROSATELLITES
Microsatellites are small arrays of short simple tandem repeats, primarily 4 bp or less. Different arrays are found dispersed throughout the genome, although dinucleotide CA/TG repeats are most common, making up 0.5% of the genome. Runs of As and Ts are common as well. Microsatellites have no known functions. However, CA/TG dinucleotide pairs can form the Z-DNA conformation in vitro, which may indicate some function. Repeat unit copy number variation of microsatellites apparently occurs by replication slippage. The expansion of trinucleotide repeats within genes has been associated with genetic disorders such as Huntington disease or fragile-X syndrome.
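Because microsatellites are perfect tandem repeats of a very short unit, they can be located with a regular expression. The sketch below is an illustration only and is not taken from any tool cited in this review: the function name, the default thresholds (`max_unit`, `min_repeats`), and the toy sequence are assumptions, and only perfect repeats are reported.

```python
import re


def find_microsatellites(seq: str, max_unit: int = 4, min_repeats: int = 5):
    """Return (start, unit, copies) for perfect tandem repeats of short units.

    The unit is matched lazily, so the shortest repeating unit is reported,
    and it must occur at least min_repeats times in a row.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence: five copies of the dinucleotide CA flanked by other bases.
toy = "GG" + "CA" * 5 + "TT"
for start, unit, copies in find_microsatellites(toy):
    print(f"({unit})n repeat with n = {copies} at position {start}")
```

Variation in the reported copy number between individuals at the same locus is what replication slippage, described above, would produce.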
Minisatellites are tandemly repeated sequences of DNA of lengths ranging from 1 kbp
to 15 kbp. For example, telomeric DNA sequences contain 10–15 kb of hexanucleotide
repeats, most commonly TTAGGG in the human genome, at the termini of the chromosomes. These sequences are added by telomerase to ensure complete replication of the
chromosome.
Macrosatellites are very long arrays, up to
hundreds of kilobases, of tandemly repeated
DNA. There are three satellite bands observed
by buoyant density centrifugation. However,
not all satellite sequences are resolved by density gradient centrifugation, e.g. alpha satellite DNA or alphoid DNA that constitute the
bulk of centromeric heterochromatin on all
chromosomes. The interchromosomal divergence of the alpha satellite families allows the
different chromosomes to be distinguished by
fluorescence in situ hybridization (FISH).
TRANSPOSABLE ELEMENTS
The human genome contains interspersed
repeat sequences that have largely amplified
in copy number by movement throughout the
genome. Those sequences (transposable elements or TEs) can be divided into two classes
based on the mode of transposition (Finnegan, 1989). The Class I elements are TEs
which transpose by replication that involves
an RNA intermediate which is reverse transcribed back to DNA prior to reinsertion.
These are called retroelements and include
LTR transposons, which are structurally similar to integrated retroviruses, non-LTR elements (LINEs and SINEs), and retrogenes
(see Fig. 5). Class II elements move by a conservative cut-and-paste mechanism: the excision of the donor element is followed by its reinsertion elsewhere in the genome. Integration of Class I and Class II transposable elements results in the duplication of a short sequence of DNA, the target site. There are about 500 families of such transposons. Most transposition has occurred via an RNA intermediate, yielding classes of sequences referred to as retroelements (more than 400
families, e.g. Alu, L1, retrogenes, MIR). However, there is also evidence of an ancient
DNA-mediated transposition (more than 60
families of class II (DNA) transposons, e.g.
THE-1, Charlie, Tigger, mariner).
RETROELEMENTS
Short interspersed repetitive elements
(SINEs) and long interspersed repetitive elements (LINEs) are the two most abundant
classes of repeats in human, and represent
the two major classes of mammalian retrotransposons. Structural features shared by
LINEs and SINEs include an A-rich 3' end
and the lack of long terminal repeats (LTRs);
these features distinguish them from retroviruses and related retroelements.
A full-length LINE (or L1 element) is approximately 6.1 kbp, although most are truncated pseudogenes with various 5' ends due to incomplete reverse transcription. There are about 100 000 copies of L1 sequences in our genome. Approximately 1% of the estimated 3500 full-length LINEs have functional RNA polymerase II promoter sequences along with two intact open reading frames necessary to generate new L1 copies. Individual LINEs contain a poly-A tail and are flanked by direct repeats. LINE mobilization activity has been verified in both germinal and somatic tissues.

Figure 5. The structure of different human transposable elements.
Open arrows denote duplicated target sites and closed arrows denote long terminal repeats (LTRs). The following abbreviations are used: CP, capsid; NC, nucleocapsid; Pr, proteinase; RT, reverse transcriptase; Int, integrase; ORF, open reading frame; A and B denote the polymerase III internal promoter.
The Alu element is estimated at 500 000–
900 000 copies in the human genome representing the primary SINE family, the most
successful transposon in any genome. Sequence comparisons suggest that Alu repeats
were derived from the 7SL RNA gene. Each
Alu element is about 280 bp with a dimeric
structure, contains RNA polymerase III promoter sequences, and typically has an A-rich
tail and flanking direct repeats (generated
during integration). Although Alu elements
are present in all primate genomes, more than
2000 Alu elements have integrated within the
human genome subsequent to the divergence
of humans from the great apes.
The human genome also contains families of retroviral-related sequences. These are characterized by sequences encoding enzymes for retroposition and contain LTRs. In addition, solitary LTRs of these elements may be located throughout the genome. There are several low-abundance (10–1000 copies) human endogenous retrovirus (HERV) families, with individual elements ranging from 6 to 10 kb, collectively encompassing about 1% of the genome.
CLASS II ELEMENTS
Class II elements contain inverted repeats (10–500 bp) at their termini and encode a transposase that catalyses transposition. They move by excision at the donor site and reinsertion elsewhere in the genome by a non-replicative mechanism. The human genome hosts a number of repeated sequences that originated from more than 60 different DNA transposons.
The mariner 'fossils' present in our genome closely resemble members of three subfamilies identified in insects, adding to the already extensive evidence that horizontal transfer between genomes has been important in genomic evolution. Other human DNA transposon remains also show high similarity to sequences in distantly related organisms. Nevertheless, the level of sequence divergence suggests that the activity of all identified elements predates human evolution.
CONCLUSIONS
The 3.2 billion bp of our genetic blueprint are packed into 23 pairs of chromosomes, or 46 DNA molecules. Only a fraction of the genome is occupied by protein-coding exons, and the majority of non-exonic sequences consist of repetitive elements. Functional exons contribute merely 2% of the genome, up to 50% of the genome is occupied by repetitive elements, and the remaining 48% is called unique DNA, most of which probably originated from mobile elements that diverged over time beyond recognition. Different evolutionary forces shape the composition and structure of the human genome. It appears that different mobile elements play a significant role in this process (reviewed recently in Makalowski, 2000). The human genome is a dynamic entity: new functional elements appear and old ones become extinct, as genes evolve according to the birth-and-death rule (Ota & Nei, 1994), similarly to species evolution. This confirms that the theory of evolution is truly universal and applies not only to all organisms but to all levels of life as well.
I would like to thank Izabela Makalowska for
sharing unpublished data and Jakub Makalowski for preparing Fig. 5.
REFERENCES
Antequera, F. & Bird, A. (1994) Predicting the total number of human genes. Nat. Genet. 8, 114.
Aristotle (1965) De generatione animalium. Oxonii,
E Typographeo Clarendoniano.
Bernardi, G. (2000) Isochores and the evolutionary genomics of vertebrates. Gene 241, 3–17.
Burge, C. & Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA.
J. Mol. Biol. 268, 78–94.
Craig, J.M. & Bickmore, W.A. (1994) The distribution of CpG islands in mammalian chromosomes. Nat. Genet. 7, 376–382.
Dunham, I., Shimizu, N. et al. (1999) The DNA sequence of human chromosome 22. Nature
402, 489–495.
Ewing, B. & Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25, 232–234.
Fields, C., Adams, M.D., White, O. & Venter, C.O.
(1994) How many genes in the human genome? Nat. Genet. 7, 345–346.
Finnegan, D.J. (1989) Eukaryotic transposable elements and genome evolution. Trends Genet.
5, 103–107.
Gilbert, W. (1978) Why genes in pieces? Nature
271, 501.
Hattori, M. et al. (2000) The DNA sequence of human chromosome 21. The chromosome 21
mapping and sequencing consortium (see
comments). Nature 405, 311–319.
Hentschel, C.C. & Birnstiel, M.L. (1981) The organization and expression of histone gene families. Cell 25, 301–313.
Makalowska, I. et al. (2001) GeneMachine: A tool for sequence analysis and annotation. submitted.
Makalowski, W. (1995) SINEs as a genomic scrap yard: An essay on genomic evolution; in The Impact of Short Interspersed Elements (SINEs) on the Host Genome (Maraia, R.J. & Austin, R.G., eds.) pp. 81–104, Landes Company.
Makalowski, W. (2000) Genomic scrap yard: How genomes utilize all that junk. Gene 259, 61–67.
Makalowski, W. & Boguski, M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. U.S.A. 95, 9407–9412.
598
W. Makalowski
Makalowski, W., Zhang, J. & Boguski, M.S. (1996)
Comparative analysis of 1196 orthologous
mouse and human full-length mRNA and protein sequences. Genome Res. 6, 846–857.
Maxam, A.M. & Gilbert, W. (1977) A new method
for sequencing DNA. Proc. Natl. Acad. Sci.
U.S.A. 74, 560–564.
Morgan, T.H. (1910) Chromosomes and heredity.
Amer. Nat. 44, 449–496.
Ohno, S. (1970) Evolution by Gene Duplication.
Springer Verlag, New York.
Ota, T. & Nei, M. (1994) Divergent evolution and
evolution by the birth-and-death process in the
immunoglobulin VH gene family. Mol. Biol.
Evol. 11, 469–482.
RIKEN (2000) http://hgp.gsc.riken.go.jp/chr21/annotation.htm
Roest Crollius, H. et al. (2000) Estimate of human
gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25, 235–238.
Sanger, F., Nicklen, S. & Coulson, A.R. (1977)
DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74,
5463–5467.
Watson, J.D. & Crick, F.H.C. (1953a) Genetical implications of the structure of deoxyribonucleic
acid. Nature 171, 964–967.
Watson, J.D. & Crick, F.H.C. (1953b) Molecular
structure of nucleic acids: A structure for
deoxyribose nucleic acid. Nature 171,
737–738.
Wong, G.K., Passey, D.A., Huang, Y., Yang, Z. &
Yu, J. (2000) Is “Junk” DNA mostly intron
DNA? Genome Res. 10, 1672–1678.
Yang, C.Z. et al. (2001) Gene Punctuation. Submitted.
Zhang, M.Q. (1998) Statistical features of human
exons and their flanking regions. Hum. Mol.
Genet. 7, 919–932.
PART III
INFORMATION PATHWAYS
24 Genes and Chromosomes
25 DNA Metabolism
26 RNA Metabolism
27 Protein Metabolism
28 Regulation of Gene Expression
The third and final part of this book explores the biochemical mechanisms underlying the apparently contradictory requirements for both genetic continuity and
the evolution of living organisms. What is the molecular
nature of genetic material? How is genetic information
transmitted from one generation to the next with high
fidelity? How do the rare changes in genetic material
that are the raw material of evolution arise? How is genetic information ultimately expressed in the amino acid
sequences of the astonishing variety of protein molecules in a living cell?
The fundamental unit of information in living systems is the gene. A gene can be defined biochemically
as a segment of DNA (or, in a few cases, RNA) that encodes the information required to produce a functional
biological product. The final product is usually a protein, so much of the material in Part III concerns genes
that encode proteins. A functional gene product might
also be one of several classes of RNA molecules. The
storage, maintenance, and metabolism of these informational units form the focal points of our discussion in
Part III.
Modern biochemical research on gene structure and
function has brought to biology a revolution comparable to that stimulated by the publication of Darwin’s theory on the origin of species nearly 150 years ago. An understanding of how information is stored and used in
cells has brought penetrating new insights to some of
the most fundamental questions about cellular structure
and function. A comprehensive conceptual framework
for biochemistry is now unfolding.
Today’s understanding of information pathways has
arisen from the convergence of genetics, physics, and
chemistry in modern biochemistry. This was epitomized
by the discovery of the double-helical structure of DNA,
postulated by James Watson and Francis Crick in 1953
(see Fig. 8–15). Genetic theory contributed the concept
of coding by genes. Physics permitted the determination of molecular structure by x-ray diffraction analysis.
Chemistry revealed the composition of DNA. The profound impact of the Watson-Crick hypothesis arose from
its ability to account for a wide range of observations
derived from studies in these diverse disciplines.
This revolution in our understanding of the structure of DNA inevitably stimulated questions about its
function. The double-helical structure itself clearly suggested how DNA might be copied so that the information it contains can be transmitted from one generation
to the next. Clarification of how the information in DNA
is converted into functional proteins came with the discovery of both messenger RNA and transfer RNA and
with the deciphering of the genetic code.
These and other major advances gave rise to the
central dogma of molecular biology, comprising the
three major processes in the cellular utilization of genetic information. The first is replication, the copying
of parental DNA to form daughter DNA molecules with
identical nucleotide sequences. The second is transcription, the process by which parts of the genetic
message encoded in DNA are copied precisely into RNA.
The third is translation, whereby the genetic message
encoded in messenger RNA is translated on the ribosomes into a polypeptide with a particular sequence of
amino acids.
Replication: DNA → DNA
Transcription: DNA → RNA
Translation: RNA → Protein
The central dogma of molecular biology, showing the general pathways of information flow via replication, transcription, and translation. The term “dogma” is a misnomer. Introduced by Francis Crick at a time when little evidence supported these ideas, the dogma has become a well-established principle.
Part III explores these and related processes. In
Chapter 24 we examine the structure, topology, and
packaging of chromosomes and genes. The processes
underlying the central dogma are elaborated in Chapters 25 through 27. Finally, we turn to regulation, examining how the expression of genetic information is
controlled (Chapter 28).
A major theme running through these chapters is
the added complexity inherent in the biosynthesis of
macromolecules that contain information. Assembling
nucleic acids and proteins with particular sequences of
nucleotides and amino acids represents nothing less
than preserving the faithful expression of the template
upon which life itself is based. We might expect the formation of phosphodiester bonds in DNA or peptide
bonds in proteins to be a trivial feat for cells, given the
arsenal of enzymatic and chemical tools described in
Part II. However, the framework of patterns and rules
established in our examination of metabolic pathways
thus far must be enlarged considerably to take into
account molecular information. Bonds must be formed
between particular subunits in informational biopolymers, avoiding either the occurrence or the persistence
of sequence errors. This has an enormous impact on the
thermodynamics, chemistry, and enzymology of the
biosynthetic processes. Formation of a peptide bond requires an energy input of only about 21 kJ/mol of bonds
and can be catalyzed by relatively simple enzymes. But
to synthesize a bond between two specific amino acids
at a particular point in a polypeptide, the cell invests
about 125 kJ/mol while making use of more than 200
enzymes, RNA molecules, and specialized proteins. The
chemistry involved in peptide bond formation does not
change because of this requirement, but additional
processes are layered over the basic reaction to ensure
that the peptide bond is formed between particular
amino acids. Information is expensive.
The dynamic interaction between nucleic acids and
proteins is another central theme of Part III. With the
important exception of a few catalytic RNA molecules
(discussed in Chapters 26 and 27), the processes that
make up the pathways of cellular information flow are
catalyzed and regulated by proteins. An understanding
of these enzymes and other proteins can have practical
as well as intellectual rewards, because they form the
basis of recombinant DNA technology (introduced in
Chapter 9).
Chapter 24
GENES AND CHROMOSOMES
24.1 Chromosomal Elements
24.2 DNA Supercoiling
24.3 The Structure of Chromosomes

DNA topoisomerases are the magicians of the DNA world. By allowing DNA strands or double helices to pass through each other, they can solve all of the topological problems of DNA in replication, transcription and other cellular transactions.
—James Wang, article in Nature Reviews in Molecular Cell Biology, 2002

Supercoiling, in fact, does more for DNA than act as an executive enhancer; it keeps the unruly, spreading DNA inside the cramped confines that the cell has provided for it.
—Nicholas Cozzarelli, Harvey Lectures, 1993

Almost every cell of a multicellular organism contains the same complement of genetic material—its genome. Just look at any human individual for a hint of the wealth of information contained in each human cell. Chromosomes, the nucleic acid molecules that are the repository of an organism’s genetic information, are the largest molecules in a cell and may contain thousands of genes as well as considerable tracts of intergenic DNA. The 16 chromosomes in the relatively small genome of the yeast Saccharomyces cerevisiae have molecular masses ranging from 1.5 × 10^8 to 1 × 10^9 daltons, corresponding to DNA molecules with 230,000 to 1,532,000 contiguous base pairs (bp). Human chromosomes range up to 279 million bp.
The very size of DNA molecules presents an interesting biological puzzle, given that they are generally much longer than the cells or viral packages that contain them (Fig. 24–1). In this chapter we shift our focus from the secondary structure of DNA, considered in Chapter 8, to the extraordinary degree of organization required for the tertiary packaging of DNA into chromosomes. We first examine the elements within viral and cellular chromosomes, then assess their size and organization. We next consider DNA topology, providing a description of the coiling of DNA molecules. Finally, we discuss the protein-DNA interactions that organize chromosomes into compact structures.
FIGURE 24–1 Bacteriophage T2 protein coat surrounded by its single, linear molecule of DNA. The DNA was released by lysing the bacteriophage particle in distilled water and allowing the DNA to spread on the water surface. An undamaged T2 bacteriophage particle consists of a head structure that tapers to a tail by which the bacteriophage attaches itself to the outer surface of a bacterial cell. All the DNA shown in this electron micrograph is normally packaged inside the phage head. (Scale bar: 0.5 μm.)
24.1 Chromosomal Elements
Cellular DNA contains genes and intergenic regions,
both of which may serve functions vital to the cell. The
more complex genomes, such as those of eukaryotic
cells, demand increased levels of chromosomal organization, and this is reflected in the chromosome’s structural features. We begin by considering the different
types of DNA sequences and structural elements within
chromosomes.
Genes Are Segments of DNA That Code
for Polypeptide Chains and RNAs
Our understanding of genes has evolved tremendously
over the last century. Classically, a gene was defined as
a portion of a chromosome that determines or affects a
single character or phenotype (visible property), such
as eye color. George Beadle and Edward Tatum proposed
a molecular definition of a gene in 1940. After exposing
spores of the fungus Neurospora crassa to x rays and
other agents known to damage DNA and cause alterations
in DNA sequence (mutations), they detected mutant
fungal strains that lacked one or another specific enzyme, sometimes resulting in the failure of an entire
metabolic pathway. Beadle and Tatum concluded that a
gene is a segment of genetic material that determines
or codes for one enzyme: the one gene–one enzyme
hypothesis. Later this concept was broadened to one
gene–one polypeptide, because many genes code for
proteins that are not enzymes or for one polypeptide of
a multisubunit protein.
The modern biochemical definition of a gene is even
more precise. A gene is all the DNA that encodes the
primary sequence of some final gene product, which can
be either a polypeptide or an RNA with a structural or
catalytic function. DNA also contains other segments or
sequences that have a purely regulatory function. Regulatory sequences provide signals that may denote the
beginning or the end of genes, or influence the transcription of genes, or function as initiation points for
replication or recombination (Chapter 28). Some genes
can be expressed in different ways to generate multiple
gene products from one segment of DNA. The special
transcriptional and translational mechanisms that allow
this are described in Chapters 26 through 28.
George W. Beadle, 1903–1989
Edward L. Tatum, 1909–1975

DNA          5′ CGT GGA TAC ACT TTT GCC GTT TCT 3′
             3′ GCA CCT ATG TGA AAA CGG CAA AGA 5′   Template strand
mRNA         5′ CGU GGA UAC ACU UUU GCC GUU UCU 3′
Polypeptide     Arg Gly Tyr Thr Phe Ala Val Ser
                (amino terminus → carboxyl terminus)
FIGURE 24–2 Colinearity of the coding nucleotide sequences of DNA and mRNA and the amino acid sequence of a polypeptide chain. The triplets of nucleotide units in DNA determine the amino acids in a protein through the intermediary mRNA. One of the DNA strands serves as a template for synthesis of mRNA, which has nucleotide triplets (codons) complementary to those of the DNA. In some bacterial and many eukaryotic genes, coding sequences are interrupted at intervals by regions of noncoding sequences (called introns).

We can make direct estimations of the minimum overall size of genes that encode proteins. As described in detail in Chapter 27, each amino acid of a polypeptide chain is coded for by a sequence of three consecutive nucleotides in a single strand of DNA (Fig. 24–2), with these “codons” arranged in a sequence that corresponds to the sequence of amino acids in the polypeptide that the gene encodes. A polypeptide chain of 350 amino acid residues (an average-size chain) corresponds to 1,050 bp. Many genes in eukaryotes and a few in prokaryotes are interrupted by noncoding DNA segments and are therefore considerably longer than this simple calculation would suggest.
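The colinearity shown in Fig. 24–2 and the size estimate above can be checked with a short script. This is an illustrative sketch only, not part of the original text: the function names are hypothetical, the codon table is assumed to be the standard genetic code (covered in Chapter 27), and introns, stop codons, and regulatory sequences are ignored, so the returned length is a lower bound.

```python
from itertools import product

# Standard genetic code with codons enumerated in TCAG order
# (an assumption of this sketch; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_strand):
    """Base-by-base complement of the coding strand, written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in coding_strand)


def transcribe(coding_strand):
    """The mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(coding_strand):
    """Translate the complete codons of a coding strand (one-letter codes)."""
    usable = len(coding_strand) - len(coding_strand) % 3
    return "".join(CODON_TABLE[coding_strand[i:i + 3]]
                   for i in range(0, usable, 3))


def min_coding_length_bp(n_residues):
    """Minimum coding length: three base pairs per amino acid residue."""
    return 3 * n_residues


# Coding strand read 5'->3' from Fig. 24-2.
coding = "CGTGGATACACTTTTGCCGTTTCT"
print(template_strand(coding))
print(transcribe(coding))
print(translate(coding))
print(min_coding_length_bp(350))
```

For the sequence in Fig. 24–2 the translation is RGYTFAVS, that is Arg-Gly-Tyr-Thr-Phe-Ala-Val-Ser, and a 350-residue chain gives 1,050 bp.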
How many genes are in a single chromosome? The
Escherichia coli chromosome, one of the prokaryotic
genomes that has been completely sequenced, is a circular DNA molecule (in the sense of an endless loop
rather than a perfect circle) with 4,639,221 bp. These
base pairs encode about 4,300 genes for proteins and
another 115 genes for stable RNA molecules. Among eukaryotes, the approximately 3.2 billion base pairs of the
human genome include 30,000 to 35,000 genes on 24
different chromosomes.
DNA Molecules Are Much Longer Than the Cellular
Packages That Contain Them
Chromosomal DNAs are often many orders of magnitude longer than the cells or viruses in which they are
found (Fig. 24–1; Table 24–1). This is true of every class
of organism or parasite.
Viruses Viruses are not free-living organisms; rather,
they are infectious parasites that use the resources of a
host cell to carry out many of the processes they require to propagate. Many viral particles consist of no
more than a genome (usually a single RNA or DNA molecule) surrounded by a protein coat.
Almost all plant viruses and some bacterial and animal viruses have RNA genomes. These genomes tend to be particularly small. For example, the genomes of mammalian retroviruses such as HIV are about 9,000 nucleotides long, and that of the bacteriophage Qβ has 4,220 nucleotides. Both types of viruses have single-stranded RNA genomes.
The genomes of DNA viruses vary greatly in size (Table 24–1). Many viral DNAs are circular for at least part of their life cycle. During viral replication within a host cell, specific types of viral DNA called replicative forms may appear; for example, many linear DNAs become circular and all single-stranded DNAs become double-stranded. A typical medium-sized DNA virus is bacteriophage λ (lambda), which infects E. coli. In its replicative form inside cells, λ DNA is a circular double helix. This double-stranded DNA contains 48,502 bp and has a contour length of 17.5 μm. Bacteriophage φX174 is a much smaller DNA virus; the DNA in the viral particle is a single-stranded circle, and the double-stranded replicative form contains 5,386 bp. Although viral genomes are small, the contour lengths of their DNAs are much greater than the long dimensions of the viral particles that contain them. The DNA of bacteriophage T4, for example, is about 290 times longer than the viral particle itself (Table 24–1).
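The contour lengths quoted here follow from multiplying the number of base pairs by the rise per base pair. The sketch below is a rough check, not part of the original text: the function and dictionary names are hypothetical, the base-pair counts and particle dimensions are those listed in Table 24–1, and the default rise of 0.34 nm (3.4 Å) per base pair is the value stated in the table note. With that rise the computed lengths come out about 6% shorter than the tabulated ones, which correspond to roughly 0.36 nm per base pair, so the rise is left as a parameter.

```python
NM_PER_BP = 0.34  # rise per base pair in nm (3.4 angstroms), per the table note

# name: (base pairs, long dimension of viral particle in nm), from Table 24-1
PHAGES = {
    "phiX174": (5_386, 25),
    "T7": (39_936, 78),
    "lambda": (48_502, 190),
    "T4": (168_889, 210),
}


def contour_length_nm(base_pairs, nm_per_bp=NM_PER_BP):
    """Contour length of a double-stranded DNA molecule in nanometres."""
    return base_pairs * nm_per_bp


for name, (bp, particle_nm) in PHAGES.items():
    length = contour_length_nm(bp)
    ratio = length / particle_nm
    print(f"{name:8s} {bp:>8,d} bp  {length:>9,.0f} nm  {ratio:5.0f}x particle")
```

Calling `contour_length_nm(bp, nm_per_bp=0.36)` instead reproduces the lengths listed in the table.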
Bacteria A single E. coli cell contains almost 100 times
as much DNA as a bacteriophage particle. The chromosome of an E. coli cell is a single double-stranded
circular DNA molecule. Its 4,639,221 bp have a contour
length of about 1.7 mm, some 850 times the length of
the E. coli cell (Fig. 24–3). In addition to the very large,
circular DNA chromosome in their nucleoid, many bacteria contain one or more small circular DNA molecules
that are free in the cytosol. These extrachromosomal
elements are called plasmids (Fig. 24–4; see also
p. 311). Most plasmids are only a few thousand base
pairs long, but some contain more than 10,000 bp. They
carry genetic information and undergo replication to
yield daughter plasmids, which pass into the daughter
cells at cell division. Plasmids have been found in yeast
and other fungi as well as in bacteria.
In many cases plasmids confer no obvious advantage on their host, and their sole function appears to be self-propagation. However, some plasmids carry genes that are useful to the host bacterium. For example, some plasmid genes make a host bacterium resistant to antibacterial agents. Plasmids carrying the gene for the enzyme β-lactamase confer resistance to β-lactam antibiotics such as penicillin and amoxicillin (see Box 20–1). These and similar plasmids may pass from an antibiotic-resistant cell to an antibiotic-sensitive cell of the same or another bacterial species, making the recipient cell antibiotic resistant. The extensive use of antibiotics in some human populations has served as a strong selective force, encouraging the spread of antibiotic resistance–coding plasmids (as well as transposable elements, described below, that harbor similar genes) in disease-causing bacteria and creating bacterial strains that are resistant to several antibiotics. Physicians are becoming increasingly reluctant to prescribe antibiotics unless a clear clinical need is confirmed. For similar reasons, the widespread use of antibiotics in animal feeds is being curbed.

TABLE 24–1 The Sizes of DNA and Viral Particles for Some Bacterial Viruses (Bacteriophages)
Virus         Size of viral DNA (bp)   Length of viral DNA (nm)   Long dimension of viral particle (nm)
φX174                  5,386                    1,939                          25
T7                    39,936                   14,377                          78
λ (lambda)            48,502                   17,460                         190
T4                   168,889                   60,800                         210
Note: Data on size of DNA are for the replicative form (double-stranded). The contour length is calculated assuming that each base pair occupies a length of 3.4 Å (see Fig. 8–15).

FIGURE 24–3 The length of the E. coli chromosome (1.7 mm) depicted in linear form relative to the length of a typical E. coli cell (2 μm).
FIGURE 24–4 DNA from a lysed E. coli cell. In this electron micrograph several small, circular plasmid DNAs are indicated by white arrows. The black spots and white specks are artifacts of the preparation.
Eukaryotes A yeast cell, one of the simplest eukaryotes, has 2.6 times more DNA in its genome than an E.
coli cell (Table 24–2). Cells of Drosophila, the fruit fly
used in classical genetic studies, contain more than 35
times as much DNA as E. coli cells, and human cells
have almost 700 times as much. The cells of many plants
and amphibians contain even more. The genetic material
of eukaryotic cells is apportioned into chromosomes, the
diploid (2n) number depending on the species (Table
24–2). A human somatic cell, for example, has 46 chromosomes (Fig. 24–5). Each chromosome of a eukaryotic cell, such as that shown in Figure 24–5a, contains
a single, very large, duplex DNA molecule. The DNA
molecules in the 24 different types of human chromosomes (22 matching pairs plus the X and Y sex chromosomes) vary in length over a 25-fold range. Each type
of chromosome in eukaryotes carries a characteristic set
of genes. Interestingly, the number of genes does not
vary nearly as much as does genome size (see Chapter
9 for a discussion of the types of sequences, besides
genes, that contribute to genome size).
The DNA of one human genome (22 chromosomes
plus X and Y or two X chromosomes), placed end to
end, would extend for about a meter. Most human cells
are diploid and each cell contains a total of 2 m of DNA.
An adult human body contains approximately 10^14 cells
and thus a total DNA length of 2 × 10^11 km. Compare
this with the circumference of the earth (4 × 10^4 km)
or the distance between the earth and the sun
(1.5 × 10^8 km)—a dramatic illustration of the extraordinary degree of DNA compaction in our cells.
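The arithmetic behind this comparison is easy to reproduce. The sketch below is illustrative only: the constant names are hypothetical, the cell count, DNA length per cell, and astronomical distances are the approximate figures given in the text, and the 0.34 nm rise per base pair is assumed from the note to Table 24–1.

```python
HAPLOID_GENOME_BP = 3.2e9        # approximate human genome size (from the text)
NM_PER_BP = 0.34                 # assumed rise per base pair, in nm
DNA_PER_DIPLOID_CELL_M = 2.0     # metres of DNA per diploid cell (from the text)
CELLS_PER_ADULT = 1e14           # approximate number of cells (from the text)
EARTH_CIRCUMFERENCE_KM = 4e4
EARTH_SUN_DISTANCE_KM = 1.5e8

# One haploid genome laid end to end, in metres ("about a meter").
haploid_length_m = HAPLOID_GENOME_BP * NM_PER_BP * 1e-9

# Total DNA in the body, converted from metres to kilometres.
total_km = DNA_PER_DIPLOID_CELL_M * CELLS_PER_ADULT / 1000.0

print(f"haploid genome length: {haploid_length_m:.2f} m")
print(f"total DNA length:      {total_km:.1e} km")
print(f"earth circumferences:  {total_km / EARTH_CIRCUMFERENCE_KM:.1e}")
print(f"earth-sun distances:   {total_km / EARTH_SUN_DISTANCE_KM:.0f}")
```

On these figures the total is 2 × 10^11 km, about five million times the circumference of the earth and more than a thousand times the distance from the earth to the sun.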
FIGURE 24–5 Eukaryotic chromosomes. (a) A pair of linked and condensed
sister chromatids from a human chromosome. Eukaryotic chromosomes are
in this state after replication and at metaphase during mitosis. (b) A complete
set of chromosomes from a leukocyte from one of the authors. There are 46
chromosomes in every normal human somatic cell.
Eukaryotic cells also have organelles, mitochondria
(Fig. 24–6) and chloroplasts, that contain DNA. Mitochondrial DNA (mtDNA) molecules are much smaller
than the nuclear chromosomes. In animal cells, mtDNA
contains fewer than 20,000 bp (16,569 bp in human
mtDNA) and is a circular duplex. Each mitochondrion
typically has two to ten copies of this mtDNA molecule,
and the number can rise to hundreds in certain cells
when an embryo is undergoing cell differentiation. In a
few organisms (trypanosomes, for example) each mitochondrion contains thousands of copies of mtDNA, organized into a complex and interlinked matrix known as
a kinetoplast. Plant cell mtDNA ranges in size from
200,000 to 2,500,000 bp. Chloroplast DNA (cpDNA) also
exists as circular duplexes and ranges in size from
120,000 to 160,000 bp. The evolutionary origin of mitochondrial and chloroplast DNAs has been the subject of
much speculation. A widely accepted view is that they
are vestiges of the chromosomes of ancient bacteria that
gained access to the cytoplasm of host cells and became
the precursors of these organelles (see Fig. 1–36).
FIGURE 24–6 A dividing mitochondrion. Some mitochondrial
proteins and RNAs are encoded by one of the copies of the mitochondrial DNA (none of which are visible here). The DNA (mtDNA)
is replicated each time the mitochondrion divides, before cell division.
TABLE 24–2 DNA, Gene, and Chromosome Content in Some Genomes
Organism                                 Total DNA (bp)    Number of chromosomes*   Approximate number of genes
Bacterium (Escherichia coli)                  4,639,221              1                       4,405
Yeast (Saccharomyces cerevisiae)             12,068,000             16†                      6,200
Nematode (Caenorhabditis elegans)            97,000,000             12‡                     19,000
Plant (Arabidopsis thaliana)                125,000,000             10                      25,500
Fruit fly (Drosophila melanogaster)         180,000,000             18                      13,600
Plant (Oryza sativa; rice)                  480,000,000             24                      57,000
Mouse (Mus musculus)                      2,500,000,000             40                      30,000–35,000
Human (Homo sapiens)                      3,200,000,000             46                      30,000–35,000
Note: This information is constantly being refined. For the most current information, consult the websites for the individual genome projects.
* The diploid chromosome number is given for all eukaryotes except yeast.
† Haploid chromosome number. Wild yeast strains generally have eight (octoploid) or more sets of these chromosomes.
‡ Number for females, with two X chromosomes. Males have an X but no Y, thus 11 chromosomes in all.
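The figures in Table 24–2 show that gene number varies far less than genome size. As a rough illustration, the sketch below computes each genome's size relative to E. coli and its gene density. It is not part of the original text: the names are hypothetical, the values are the approximate ones in the table, and the midpoint 32,500 is assumed for the 30,000–35,000 range given for mouse and human.

```python
# name: (total DNA in bp, approximate number of genes), from Table 24-2
GENOMES = {
    "E. coli": (4_639_221, 4_405),
    "S. cerevisiae": (12_068_000, 6_200),
    "C. elegans": (97_000_000, 19_000),
    "A. thaliana": (125_000_000, 25_500),
    "D. melanogaster": (180_000_000, 13_600),
    "O. sativa": (480_000_000, 57_000),
    "M. musculus": (2_500_000_000, 32_500),   # midpoint of range (assumption)
    "H. sapiens": (3_200_000_000, 32_500),    # midpoint of range (assumption)
}


def relative_size(name, reference="E. coli"):
    """Genome size as a multiple of the reference genome."""
    return GENOMES[name][0] / GENOMES[reference][0]


def genes_per_mbp(name):
    """Approximate number of genes per million base pairs."""
    base_pairs, genes = GENOMES[name]
    return genes / (base_pairs / 1e6)


for name in GENOMES:
    print(f"{name:16s} {relative_size(name):7.1f}x E. coli   "
          f"{genes_per_mbp(name):7.1f} genes/Mbp")
```

The relative sizes match the ratios quoted in the text (about 2.6 for yeast and almost 700 for human), while gene density falls from several hundred genes per million base pairs in E. coli to roughly ten in human.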
Mitochondrial DNA codes for the mitochondrial tRNAs
and rRNAs and for a few mitochondrial proteins. More
than 95% of mitochondrial proteins are encoded by nuclear DNA. Mitochondria and chloroplasts divide when
the cell divides. Their DNA is replicated before and during division, and the daughter DNA molecules pass into
the daughter organelles.
Eukaryotic Genes and Chromosomes
Are Very Complex
Many bacterial species have only one chromosome per
cell and, in nearly all cases, each chromosome contains
only one copy of each gene. A very few genes, such as
those for rRNAs, are repeated several times. Genes and
regulatory sequences account for almost all the DNA in
prokaryotes. Moreover, almost every gene is precisely
colinear with the amino acid sequence (or RNA sequence) for which it codes (Fig. 24–2).
The organization of genes in eukaryotic DNA is
structurally and functionally much more complex. The
study of eukaryotic chromosome structure, and more
recently the sequencing of entire eukaryotic genomes,
has yielded many surprises. Many, if not most, eukaryotic genes have a distinctive and puzzling structural
feature: their nucleotide sequences contain one or more
intervening segments of DNA that do not code for the
amino acid sequence of the polypeptide product. These
nontranslated inserts interrupt the otherwise colinear
relationship between the nucleotide sequence of the
gene and the amino acid sequence of the polypeptide it
encodes. Such nontranslated DNA segments in genes
are called intervening sequences or introns, and the
coding segments are called exons. Few prokaryotic
genes contain introns.
In higher eukaryotes, the typical gene has much
more intron sequence than sequences devoted to exons. For example, in the gene coding for the single
polypeptide chain of the avian egg protein ovalbumin
(Fig. 24–7), the introns are much longer than the exons; altogether, seven introns make up 85% of the gene’s
DNA. In the gene for the β subunit of hemoglobin, a single intron contains more than half of the gene’s DNA.
The gene for the muscle protein titin is the intron champion, with 178 introns. Genes for histones appear to have
no introns. In most cases the function of introns is not
clear. In total, only about 1.5% of human DNA is “coding” or exon DNA, carrying information for protein or
RNA products. However, when the much larger introns
are included in the count, as much as 30% of the human genome consists of genes.
The relative paucity of genes in the human genome
leaves a lot of DNA unaccounted for. Figure 24–8
provides a summary of sequence types. Much of the
nongene DNA is in the form of repeated sequences of
several kinds. Perhaps most surprising, about half the
human genome is made up of moderately repeated sequences that are derived from transposable elements—
segments of DNA, ranging from a few hundred to several thousand base pairs long, that can move from one
location to another in the genome. Transposable elements (transposons) are a kind of molecular parasite,
efficiently making a home within the host genome. Many
have genes encoding proteins that catalyze the transposition process, described in more detail in Chapters
25 and 26. Some transposons in the human genome are
active, moving at a low frequency, but most are inactive
relics, evolutionarily altered by mutations. Although
these elements generally do not encode proteins or
RNAs that are used in human cells, they have played a major role in human evolution: movement of transposons can lead to the redistribution of other genomic sequences.
FIGURE 24–7 Introns in two eukaryotic genes. The gene for ovalbumin has seven introns (A to G), splitting the coding sequences into eight exons (L, and 1 to 7). The gene for the β subunit of hemoglobin has two introns and three exons, including one intron that alone contains more than half the base pairs of the gene. [Diagram labels, hemoglobin β subunit gene: exon 1, 90 bp; intron A, 131 bp; exon 2, 222 bp; intron B, 851 bp; exon 3, 126 bp.]
Another 3% or so of the human genome consists of
highly repetitive sequences, also referred to as
simple-sequence DNA or simple sequence repeats
(SSR). These short sequences, generally less than
10 bp long, are sometimes repeated millions of times per
cell. The simple-sequence DNA has also been called satellite DNA, so named because its unusual base composition often causes it to migrate as “satellite” bands (separated from the rest of the DNA) when fragmented cellular DNA samples are centrifuged in a cesium chloride density gradient. Studies suggest that simple-sequence DNA does not encode proteins or RNAs. Unlike the transposable elements, the highly repetitive DNA can have identifiable functional importance in human cellular metabolism, because much of it is associated with two defining features of eukaryotic chromosomes: centromeres and telomeres.
[Pie chart: Transposons 45% (LINEs 21%, SINEs 13%, retroviruslike 8%); Genes 30% (exons 1.5%, introns and noncoding segments 28.5%); Miscellaneous 25% (SSR 3%, SD 5%, unlisted “?” 17%).]
FIGURE 24–8 Types of sequences in the human genome. This pie chart divides the genome into transposons (transposable elements), genes, and miscellaneous sequences. There are four main classes of transposons. Long interspersed elements (LINEs), 6 to 8 kbp long (1 kbp = 1,000 bp), typically include a few genes encoding proteins that catalyze transposition. The genome has about 850,000 LINEs. Short interspersed elements (SINEs) are about 100 to 300 bp long. Of the 1.5 million in the human genome more than 1 million are Alu elements, so called because they generally include one copy of the recognition sequence for AluI, a restriction endonuclease (see Fig. 9–3). The genome also contains 450,000 copies of retroviruslike transposons, 1.5 to 11 kbp long. Although these are “trapped” in the genome and cannot move from one cell to another, they are evolutionarily related to the retroviruses (Chapter 26), which include HIV. A final class of transposons (making up 1% and not shown here) consists of a variety of transposon remnants that differ greatly in length.
About 30% of the genome consists of sequences included in genes for proteins, but only a small fraction of this DNA is in exons (coding sequences). Miscellaneous sequences include simple-sequence repeats (SSR) and large segmental duplications (SD), the latter being segments that appear more than once in different locations. Among the unlisted sequence elements (denoted by a question mark) are genes encoding RNAs (which can be harder to identify than genes for proteins) and remnants of transposons that have been evolutionarily altered so that they are now hard to identify.
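The percentages in Fig. 24–8 can be turned into approximate amounts of DNA. The sketch below is a hedged illustration, not part of the original text: the names are hypothetical, the percentages are the slice labels of the pie chart, and the genome size of 3.2 billion bp is the approximate figure used elsewhere in the chapter. Any gap between a category total and its labelled slices is reported rather than assigned to a class.

```python
GENOME_BP = 3.2e9  # approximate size of the human genome, in bp

# Category totals and labelled slices, in percent of the genome (Fig. 24-8).
TOP_LEVEL = {"transposons": 45.0, "genes": 30.0, "miscellaneous": 25.0}
BREAKDOWN = {
    "transposons": {"LINEs": 21.0, "SINEs": 13.0, "retroviruslike": 8.0},
    "genes": {"exons": 1.5, "introns and noncoding segments": 28.5},
    "miscellaneous": {"SSR": 3.0, "SD": 5.0, "unlisted (?)": 17.0},
}


def percent_to_bp(percent, genome_bp=GENOME_BP):
    """Convert a percentage of the genome into base pairs."""
    return genome_bp * percent / 100.0


def unlabelled_remainder(category):
    """Percentage of a category not covered by its labelled slices."""
    return TOP_LEVEL[category] - sum(BREAKDOWN[category].values())


for category, total in TOP_LEVEL.items():
    print(f"{category}: {total}% = {percent_to_bp(total):.2e} bp")
    for label, pct in BREAKDOWN[category].items():
        print(f"    {label}: {pct}% = {percent_to_bp(pct):.2e} bp")
    print(f"    not labelled in the chart: {unlabelled_remainder(category)}%")
```

On these numbers the 1.5% of the genome in exons corresponds to roughly 48 million bp.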
SUMMARY 24.1 Chromosomal Elements
■ Genes are segments of a chromosome that contain the information for a functional polypeptide or RNA molecule. In addition to genes, chromosomes contain a variety of regulatory sequences involved in replication, transcription, and other processes.
■ Genomic DNA and RNA molecules are generally orders of magnitude longer than the viral particles or cells that contain them.
■ Many genes in eukaryotic cells, and a few in bacteria, are interrupted by noncoding sequences called introns. The coding segments separated by introns are called exons.
■ Less than one-third of human genomic DNA consists of genes. Much of the remainder consists of repeated sequences of various types. Nucleic acid parasites known as transposons account for about half of the human genome.
■ Eukaryotic chromosomes have two important special-function repetitive DNA sequences: centromeres, which are attachment points for the mitotic spindle, and telomeres, located at the ends of chromosomes.
FIGURE 24–9 Important structural elements of a yeast chromosome. Labeled elements: telomere; centromere; telomere; unique sequences (genes), dispersed repeats, and multiple replication origins.
The centromere (Fig. 24–9) is a sequence of DNA
that functions during cell division as an attachment
point for proteins that link the chromosome to the mitotic spindle. This attachment is essential for the equal
and orderly distribution of chromosome sets to daughter cells. The centromeres of Saccharomyces cerevisiae have been isolated and studied. The sequences
essential to centromere function are about 130 bp long
and are very rich in A=T pairs. The centromeric sequences of higher eukaryotes are much longer and, unlike those of yeast, generally contain simple-sequence
DNA, which consists of thousands of tandem copies of
one or a few short sequences of 5 to 10 bp, in the same
orientation. The precise role of simple-sequence DNA
in centromere function is not yet understood.
Telomeres (Greek telos, “end”) are sequences at
the ends of eukaryotic chromosomes that help stabilize
the chromosome. The best-characterized telomeres are
those of the simpler eukaryotes. Yeast telomeres end
with about 100 bp of imprecisely repeated sequences of
the form
$(5')\;(\mathrm{T}_x\mathrm{G}_y)_n$
$(3')\;(\mathrm{A}_x\mathrm{C}_y)_n$
where x and y are generally between 1 and 4. The number of telomere repeats, n, is in the range of 20 to 100
for most single-celled eukaryotes and generally more
than 1,500 in mammals. The ends of a linear DNA molecule cannot be routinely replicated by the cellular replication machinery (which may be one reason why bacterial DNA molecules are circular). Repeated telomeric
sequences are added to eukaryotic chromosome ends
primarily by the enzyme telomerase (see Fig. 26–35).
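The repeat form given above can be turned into a small pattern search. The following is only a sketch, assuming that x and y each lie strictly between 1 and 4 (the text says "generally") and that the G-rich strand is read 5'→3'; the function name and the example sequence are hypothetical.

```python
import re

# One repeat unit TxGy, with x and y each between 1 and 4 (assumed range).
TELOMERE_UNIT = r"T{1,4}G{1,4}"
# One or more consecutive units that run to the 3' end of the sequence.
TELOMERE_RUN = re.compile(rf"(?:{TELOMERE_UNIT})+$")


def terminal_repeat_length(seq: str) -> int:
    """Return the length (in nucleotides) of the TxGy run at the 3' end, or 0."""
    match = TELOMERE_RUN.search(seq.upper())
    return len(match.group(0)) if match else 0


# Hypothetical sequence: 6 nt of unique DNA, four regular repeats,
# then one imprecise repeat of the kind described for yeast.
example = "ACGTCA" + "TTGGGG" * 4 + "TGTGGG"
print(terminal_repeat_length(example))
```

Because the pattern tolerates any mix of x and y within the assumed range, it accepts imprecise repeats as well as perfectly regular ones.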
Artificial chromosomes (Chapter 9) have been constructed as a means of better understanding the functional significance of many structural features of eukaryotic chromosomes. A reasonably stable artificial linear
chromosome requires only three components: a centromere, telomeres at each end, and sequences that allow
the initiation of DNA replication. Yeast artificial chromosomes (YACs; see Fig. 9–8) have been developed as a
research tool in biotechnology. Similarly, human artificial
chromosomes (HACs) are being developed for the treatment of genetic diseases by somatic gene therapy.
24.2 DNA Supercoiling
Cellular DNA, as we have seen, is extremely compacted,
implying a high degree of structural organization. The
folding mechanism must not only pack the DNA but also
permit access to the information in the DNA. Before
considering how this is accomplished in processes such
as replication and transcription, we need to examine an
important property of DNA structure known as supercoiling.
Supercoiling means the coiling of a coil. A telephone
cord, for example, is typically a coiled wire. The path
taken by the wire between the base of the phone and
the receiver often includes one or more supercoils (Fig.
24–10). DNA is coiled in the form of a double helix, with
both strands of the DNA coiling around an axis. The
further coiling of that axis upon itself (Fig. 24–11) produces DNA supercoiling. As detailed below, DNA
supercoiling is generally a manifestation of structural
strain. When there is no net bending of the DNA axis
upon itself, the DNA is said to be in a relaxed state.
We might have predicted that DNA compaction involved some form of supercoiling. Perhaps less predictable is that replication and transcription of DNA also
affect and are affected by supercoiling. Both processes
FIGURE 24–24 Plectonemic and solenoidal supercoiling. (a) Plectonemic supercoiling takes the form of extended right-handed coils. Solenoidal negative supercoiling takes the form of tight left-handed turns about an imaginary tubelike structure. The two forms are readily interconverted, although the solenoidal form is generally not observed unless certain proteins are bound to the DNA. (b) Plectonemic (top) and solenoidal supercoiling of the same DNA molecule, drawn to scale. Solenoidal supercoiling provides a much greater degree of compaction.
extended right-handed supercoils characteristic of the plectonemic form, solenoidal supercoiling involves tight left-handed turns, similar to the shape taken up by a garden hose neatly wrapped on a reel. Although their structures are dramatically different, plectonemic and solenoidal supercoiling are two forms of negative supercoiling that can be taken up by the same segment of underwound DNA. The two forms are readily interconvertible. Although the plectonemic form is more stable in solution, the solenoidal form can be stabilized by protein binding and is the form found in chromatin. It provides a much greater degree of compaction (Fig. 24–24b). Solenoidal supercoiling is the mechanism by which underwinding contributes to DNA compaction.
SUMMARY 24.2 DNA Supercoiling
■ Most cellular DNAs are supercoiled. Underwinding decreases the total number of helical turns in the DNA relative to the relaxed, B form. To maintain an underwound state, DNA must be either a closed circle or bound to protein. Underwinding is quantified by a topological parameter called linking number, Lk.
■ Underwinding is measured in terms of specific linking difference, σ (also called superhelical density), which is (Lk − Lk0)/Lk0. For cellular DNAs, σ is typically −0.05 to −0.07, which means that approximately 5% to 7% of the helical turns in the DNA have been removed. DNA underwinding facilitates strand separation by enzymes of DNA metabolism.
■ DNAs that differ only in linking number are called topoisomers. Enzymes that underwind and/or relax DNA, the topoisomerases, catalyze changes in linking number. The two classes of topoisomerases, type I and type II, change Lk in increments of 1 or 2, respectively, per catalytic event.
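The specific linking difference summarized here, σ = (Lk − Lk0)/Lk0, is easy to check numerically. The sketch below assumes a helical repeat of 10.5 bp per turn for relaxed B-form DNA (a value not stated in this passage) and uses a hypothetical 2,100 bp plasmid; the function names are illustrative only.

```python
BP_PER_TURN = 10.5  # assumed helical repeat of relaxed B-form DNA (bp per turn)


def relaxed_linking_number(n_bp: int) -> float:
    """Lk0 of a relaxed, closed-circular DNA of n_bp base pairs."""
    return n_bp / BP_PER_TURN


def superhelical_density(lk: float, lk0: float) -> float:
    """Specific linking difference: sigma = (Lk - Lk0) / Lk0."""
    return (lk - lk0) / lk0


# Hypothetical plasmid: 2,100 bp, so Lk0 = 200; removing 12 turns gives Lk = 188.
lk0 = relaxed_linking_number(2100)
sigma = superhelical_density(188, lk0)
print(lk0, round(sigma, 3))
```

With these numbers, 12 of 200 turns have been removed, so σ falls at −0.06, inside the range quoted for cellular DNAs.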
24.3 The Structure of Chromosomes
The term “chromosome” is used to refer to a nucleic
acid molecule that is the repository of genetic information in a virus, a bacterium, a eukaryotic cell, or an organelle. It also refers to the densely colored bodies seen
in the nuclei of dye-stained eukaryotic cells, as visualized using a light microscope.
Chromatin Consists of DNA and Proteins
The eukaryotic cell cycle (see Fig. 12–41) produces remarkable changes in the structure of chromosomes (Fig.
24–25). In nondividing eukaryotic cells (in G0) and
those in interphase (G1, S, and G2), the chromosomal
material, chromatin, is amorphous and appears to be
randomly dispersed in certain parts of the nucleus. In
the S phase of interphase the DNA in this amorphous
state replicates, each chromosome producing two sister
chromosomes (called sister chromatids) that remain associated with each other after replication is complete.
The chromosomes become much more condensed during prophase of mitosis, taking the form of a species-specific number of well-defined pairs of sister chromatids (Fig. 24–5).
Chromatin consists of fibers containing protein and
DNA in approximately equal masses, along with a small
amount of RNA. The DNA in the chromatin is very
tightly associated with proteins called histones, which
package and order the DNA into structural units called
nucleosomes (Fig. 24–26). Also found in chromatin are
many nonhistone proteins, some of which help maintain
chromosome structure, others that regulate the expression of specific genes (Chapter 28). Beginning with
nucleosomes, eukaryotic chromosomal DNA is packaged
into a succession of higher-order structures that ultimately yield the compact chromosome seen with the
light microscope. We now turn to a description of this
structure in eukaryotes and compare it with the packaging of DNA in bacterial cells.
FIGURE 24–25 Changes in chromosome structure during
the eukaryotic cell cycle. Cellular DNA is uncondensed
throughout interphase. The interphase period can be
subdivided (see Fig. 12–41) into the G1 (gap) phase; the S
(synthesis) phase, when the DNA is replicated; and the G2
phase, in which the replicated chromosomes cohere to one
another. The DNA undergoes condensation in the prophase
of mitosis. Cohesins (green) and condensins (red) are
proteins involved in cohesion and condensation (discussed
later in the chapter). The architecture of the cohesin-condensin-DNA complex is not yet established, and the
interactions shown here are figurative, simply suggesting
their role in condensation of the chromosome. During
metaphase, the condensed chromosomes line up along a
plane halfway between the spindle poles. One chromosome
of each pair is linked to each spindle pole via microtubules
that extend between the spindle and the centromere. The
sister chromatids separate at anaphase, each drawn toward
the spindle pole to which it is connected. After cell division
is complete, the chromosomes decondense and the cycle
begins anew.
[Figure 24–25 labels: Interphase (G1, S, G2); Mitosis (prophase, metaphase, anaphase); duplex DNA; replication and cohesion; replication occurs from multiple origins of replication, and daughter chromatids are linked by cohesins; replication completed; condensation; alignment; separation; cohesins; condensins; spindle pole.]
FIGURE 24–26 Nucleosomes. Regularly spaced nucleosomes consist of histone complexes bound to DNA. (a) Schematic illustration, with labels for the histone core of the nucleosome and the linker DNA of the nucleosome, and (b) electron micrograph (scale bar, 50 nm).
Histones Are Small, Basic Proteins
Found in the chromatin of all eukaryotic cells, histones
have molecular weights between 11,000 and 21,000 and
are very rich in the basic amino acids arginine and lysine (together these make up about one-fourth of the
amino acid residues). All eukaryotic cells have five major classes of histones, differing in molecular weight and
amino acid composition (Table 24–3). The H3 histones
are nearly identical in amino acid sequence in all
eukaryotes, as are the H4 histones, suggesting strict
conservation of their functions. For example, only 2 of
102 amino acid residues differ between the H4 histone
molecules of peas and cows, and only 8 differ between
the H4 histones of humans and yeast. Histones H1, H2A,
and H2B show less sequence similarity among eukaryotic species.
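The conservation figures quoted for histone H4 can be restated as percent identity. This is a minimal sketch that assumes both comparisons are made over the same 102 aligned residues (the text states this length only for the pea and cow comparison); the function name is hypothetical.

```python
def percent_identity(length: int, differences: int) -> float:
    """Percent of aligned residues that are identical between two sequences."""
    return 100.0 * (length - differences) / length


H4_LENGTH = 102  # residues in histone H4, as quoted in the text

# Differences quoted in the text; the human/yeast length is assumed to be 102.
pea_vs_cow = percent_identity(H4_LENGTH, 2)
human_vs_yeast = percent_identity(H4_LENGTH, 8)
print(round(pea_vs_cow, 1), round(human_vs_yeast, 1))
```

Under that assumption, both comparisons come out above 90% identity, which is the sense in which H4 is described as nearly identical across eukaryotes.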
Each type of histone has variant forms, because certain amino acid side chains are enzymatically modified
by methylation, ADP-ribosylation, phosphorylation, glycosylation, or acetylation. Such modifications affect the
net electric charge, shape, and other properties of
histones, as well as the structural and functional properties of the chromatin, and they play a role in the regulation of transcription (Chapter 28).
Nucleosomes Are the Fundamental Organizational
Units of Chromatin
The eukaryotic chromosome depicted in Figure 24–5
represents the compaction of a DNA molecule about
10^5 μm long into a cell nucleus that is typically 5 to
10 μm in diameter. This compaction involves several
levels of highly organized folding. Subjection of chromosomes to treatments that partially unfold them reveals
a structure in which the DNA is bound tightly to beads
of protein, often regularly spaced (Fig. 24–26). The
beads in this “beads-on-a-string” arrangement are complexes of histones and DNA. The bead plus the connecting DNA that leads to the next bead form the nucleosome, the fundamental unit of organization upon
which the higher-order packing of chromatin is built.
The bead of each nucleosome contains eight histone
molecules: two copies each of H2A, H2B, H3, and H4.
The spacing of the nucleosome beads provides a repeating unit typically of about 200 bp, of which 146 bp
are bound tightly around the eight-part histone core and
the remainder serve as linker DNA between nucleosome
beads. Histone H1 binds to the linker DNA. Brief treatment of chromatin with enzymes that digest DNA causes
preferential degradation of the linker DNA, releasing histone particles containing 146 bp of bound DNA that have
been protected from digestion. Researchers have crystallized nucleosome cores obtained in this way, and
x-ray diffraction analysis reveals a particle made up of
the eight histone molecules with the DNA wrapped
around it in the form of a left-handed solenoidal supercoil (Fig. 24–27).
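The repeat-unit numbers in this paragraph lend themselves to simple bookkeeping. The sketch below takes the 200 bp repeat and 146 bp core from the text and applies them to a hypothetical 100,000 bp stretch of DNA; it counts only the eight core histones per bead and ignores histone H1.

```python
REPEAT_BP = 200        # typical nucleosome repeat length quoted in the text
CORE_BP = 146          # DNA wrapped tightly around the histone core
HISTONES_PER_CORE = 8  # two copies each of H2A, H2B, H3, and H4


def linker_length(repeat_bp: int = REPEAT_BP, core_bp: int = CORE_BP) -> int:
    """Base pairs of linker DNA left over in each repeat."""
    return repeat_bp - core_bp


def nucleosome_count(dna_bp: int, repeat_bp: int = REPEAT_BP) -> int:
    """Whole nucleosome repeats that fit in a DNA of dna_bp base pairs."""
    return dna_bp // repeat_bp


dna_bp = 100_000  # hypothetical stretch of chromosomal DNA
n = nucleosome_count(dna_bp)
print(linker_length(), n, n * HISTONES_PER_CORE)
```

Real repeat lengths vary between organisms and tissues, so the 200 bp default should be treated as a typical value rather than a constant.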
A close inspection of this structure reveals why eukaryotic DNA is underwound even though eukaryotic
cells lack enzymes that underwind DNA. Recall that the
solenoidal wrapping of DNA in nucleosomes is but one
form of supercoiling that can be taken up by underwound (negatively supercoiled) DNA. The tight wrapping of DNA around the histone core requires the removal of about one helical turn in the DNA. When the
protein core of a nucleosome binds in vitro to a relaxed,
closed-circular DNA, the binding introduces a negative
supercoil. Because this binding process does not break
the DNA or change the linking number, the formation
of a negative solenoidal supercoil must be accompanied
by a compensatory positive supercoil in the unbound region of the DNA (Fig. 24–28). As mentioned earlier, eukaryotic topoisomerases can relax positive supercoils.
Relaxing the unbound positive supercoil leaves the negative supercoil fixed (through its binding to the nucleosome histone core) and results in an overall decrease
in linking number. Indeed, topoisomerases have proved
necessary for assembling chromatin from purified histones and closed-circular DNA in vitro.
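The linking-number argument above can be followed step by step with a toy model. This is a sketch only: it treats each nucleosome as fixing exactly one negative supercoil (the text says "about one helical turn"), and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClosedCircularDNA:
    delta_lk: int = 0          # Lk - Lk0 for the whole molecule
    bound_supercoils: int = 0  # solenoidal supercoils fixed on histone cores
    free_supercoils: int = 0   # plectonemic supercoils in unbound DNA

    def bind_histone_core(self) -> None:
        """No strand is broken, so delta_lk is unchanged: the bound negative
        supercoil is offset by a positive supercoil in the unbound DNA."""
        self.bound_supercoils -= 1
        self.free_supercoils += 1

    def relax_free_supercoils(self) -> None:
        """A topoisomerase removes the unbound supercoils, changing Lk."""
        self.delta_lk -= self.free_supercoils
        self.free_supercoils = 0


dna = ClosedCircularDNA()   # relaxed circle
dna.bind_histone_core()     # delta_lk still 0
dna.relax_free_supercoils() # one net negative supercoil remains
print(dna)
```

At every step delta_lk equals the sum of bound and free supercoils, which is the conservation rule the paragraph relies on.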
FIGURE 24–27 DNA wrapped around a nucleosome core. (a) Space-filling representation of the nucleosome protein core, with different colors for the different histones (H2A, H2B, H3, and H4) (PDB ID 1AOI). (b) Top and (c) side views of the crystal structure of a nucleosome with 146 bp of bound DNA. The protein is depicted as a gray surface contour, with the bound DNA in blue. The DNA binds in a left-handed solenoidal supercoil that circumnavigates the histone complex 1.8 times. A schematic drawing is included in (c) for comparison with other figures depicting nucleosomes.
TABLE 24–3 Types and Properties of Histones
                                  Number of       Content of basic amino
           Molecular              amino acid      acids (% of total)
Histone    weight                 residues        Lys        Arg
H1*        21,130                 223             29.5       11.3
H2A*       13,960                 129             10.9       19.3
H2B*       13,774                 125             16.0       16.4
H3         15,273                 135             19.6       13.3
H4         11,236                 102             10.8       13.7
*The sizes of these histones vary somewhat from species to species. The numbers given here are for bovine histones.
Another factor that affects the binding of DNA to histones in nucleosome cores is the sequence of the bound DNA. Histone cores do not bind randomly to
DNA; rather, they tend to position themselves at certain
locations. This positioning is not fully understood but in
some cases appears to depend on a local abundance of
A=T base pairs in the DNA helix where it is in contact
with the histones (Fig. 24–29). The tight wrapping of
the DNA around the nucleosome’s histone core requires
compression of the minor groove of the helix at these
points, and a cluster of two or three A=T base pairs
makes this compression more likely.
Other proteins are required for the positioning of
some nucleosome cores on DNA. In several organisms,
certain proteins bind to a specific DNA sequence and
then facilitate the formation of a nucleosome core
nearby. Precise positioning of nucleosome cores can
play a role in the expression of some eukaryotic genes
(Chapter 28).
FIGURE 24–28 Chromatin assembly. (a) Relaxed, closed-circular DNA. (b) Binding of a histone core to form a nucleosome induces one negative supercoil (bound, solenoidal). In the absence of any strand breaks, a positive supercoil (unbound, plectonemic) must form elsewhere in the DNA (ΔLk = 0). (c) Relaxation of this positive supercoil by a topoisomerase leaves one net negative supercoil (ΔLk = −1).
FIGURE 24–29 Positioning of a nucleosome to make optimal use of A=T base pairs where the histone core is in contact with the minor groove of the DNA helix. (Figure labels: histone core; DNA; A=T pairs abundant.)
FIGURE 24–30 The 30 nm fiber, a higher-order organization of nucleosomes. (a) Schematic illustration of the probable structure of the
fiber, showing nucleosome packing. (b) Electron micrograph.
Nucleosomes Are Packed into Successively
Higher Order Structures
Wrapping of DNA around a nucleosome core compacts
the DNA length about sevenfold. The overall compaction
in a chromosome, however, is greater than 10,000-fold—
ample evidence for even higher orders of structural organization. In chromosomes isolated by very gentle
methods, nucleosome cores appear to be organized into
a structure called the 30 nm fiber (Fig. 24–30). This
packing requires one molecule of histone H1 per nucleosome core. Organization into 30 nm fibers does not extend over the entire chromosome but is punctuated by
regions bound by sequence-specific (nonhistone) DNA-binding proteins. The 30 nm structure also appears to
depend on the transcriptional activity of the particular
region of DNA. Regions in which genes are being transcribed are apparently in a less-ordered state that contains little, if any, histone H1.
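The compaction figures quoted so far can be cross-checked with a little arithmetic. The sketch below assumes an axial rise of 0.34 nm per base pair for B-form DNA (not stated in this passage) and reuses the 200 bp repeat, the sevenfold factor, and the 10^5 μm and 10 μm lengths given in the text; all names are illustrative.

```python
RISE_NM_PER_BP = 0.34  # assumed axial rise of B-form DNA, in nm per base pair
REPEAT_BP = 200        # nucleosome repeat length from the text
NUCLEOSOME_FOLD = 7    # "about sevenfold" compaction from the text


def contour_length_nm(n_bp: int) -> float:
    """Length of fully extended B-form DNA of n_bp base pairs, in nm."""
    return n_bp * RISE_NM_PER_BP


def compaction_ratio(dna_length: float, packed_length: float) -> float:
    """Fold compaction, with both lengths in the same units."""
    return dna_length / packed_length


extended_nm = contour_length_nm(REPEAT_BP)    # DNA in one repeat, extended
per_repeat_nm = extended_nm / NUCLEOSOME_FOLD # length it occupies when wrapped
overall = compaction_ratio(1e5, 10)           # 10^5 um of DNA in a 10 um nucleus
print(round(extended_nm, 1), round(per_repeat_nm, 1), overall)
```

Under these assumptions each repeat occupies roughly 10 nm along the string of beads, and the whole-chromosome ratio matches the 10,000-fold figure in the text.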
The 30 nm fiber, a second level of chromatin organization, provides an approximately 100-fold compaction of the DNA. The higher levels of folding are not
yet understood, but it appears that certain regions of
DNA associate with a nuclear scaffold (Fig. 24–31). The
scaffold-associated regions are separated by loops of
DNA with perhaps 20 to 100 kbp. The DNA in a loop
may contain a set of related genes. For example, in
Drosophila complete sets of histone-coding genes seem
to cluster together in loops that are bounded by scaffold attachment sites (Fig. 24–32). The scaffold itself
appears to contain several proteins, notably large amounts of histone H1 (located in the interior of the fiber) and topoisomerase II. The presence of topoisomerase II further emphasizes the relationship between DNA underwinding and chromatin structure. Topoisomerase II is so important to the maintenance of chromatin structure that inhibitors of this enzyme can kill rapidly dividing cells. Several drugs used in cancer chemotherapy are topoisomerase II inhibitors that allow the enzyme to promote strand breakage but not the resealing of the breaks.
FIGURE 24–31 A partially unraveled human chromosome, revealing numerous loops of DNA attached to a scaffoldlike structure.
FIGURE 24–32 Loops of chromosomal DNA attached to a nuclear scaffold. The DNA in the loops is packaged as 30 nm fibers, so the loops are the next level of organization. Loops often contain groups of genes with related functions. Complete sets of histone-coding genes, as shown in this schematic illustration, appear to be clustered in loops of this kind. Unlike most genes, histone genes occur in multiple copies in many eukaryotic genomes. (Figure labels: 30 nm fiber; nuclear scaffold; histone genes H1, H2A, H2B, H3, and H4.)
Evidence exists for additional layers of organization
in eukaryotic chromosomes, each dramatically enhancing the degree of compaction. One model for achieving
this compaction is illustrated in Figure 24–33. Higher-order chromatin structure probably varies from chromosome to chromosome, from one region to the next in
a single chromosome, and from moment to moment in
the life of a cell. No single model can adequately describe these structures. Nevertheless, the principle is
clear: DNA compaction in eukaryotic chromosomes is
likely to involve coils upon coils upon coils . . .
Three-Dimensional Packaging of Nuclear Chromosomes
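The "coils upon coils" model can be made concrete by multiplying out the counts labeled in Figure 24–33 (about 75,000 bp per loop, 6 loops per rosette, 30 rosettes per coil, 10 coils per chromatid). This is a sketch of that one model only; as the text stresses, real chromosomes are unlikely to be so uniform, and the variable names are hypothetical.

```python
# Counts taken from the labels of Figure 24-33 (one proposed model).
BP_PER_LOOP = 75_000
LOOPS_PER_ROSETTE = 6
ROSETTES_PER_COIL = 30
COILS_PER_CHROMATID = 10


def bp_per_level() -> dict:
    """Base pairs contained in one unit at each level of the model."""
    rosette = BP_PER_LOOP * LOOPS_PER_ROSETTE
    coil = rosette * ROSETTES_PER_COIL
    chromatid = coil * COILS_PER_CHROMATID
    return {"loop": BP_PER_LOOP, "rosette": rosette,
            "coil": coil, "chromatid": chromatid}


for level, bp in bp_per_level().items():
    print(f"{level}: {bp:,} bp")
```

The product for one chromatid is on the order of 10^8 bp, which is the size range of a large human chromosome.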
[Figure 24–33 labels: two chromatids (10 coils each); one coil (30 rosettes).]
Condensed Chromosome Structures Are Maintained
by SMC Proteins
A third major class of chromatin proteins, in addition to
the histones and topoisomerases, is the SMC proteins
(structural maintenance of chromosomes). The primary
structure of SMC proteins consists of five distinct domains (Fig. 24–34a). The amino- and carboxyl-terminal
globular domains, N and C, each of which has part of
an ATP hydrolytic site, are connected by two regions of
α-helical coiled-coil motifs (see Fig. 4–11) that are joined
by a hinge domain. The proteins are generally dimeric,
forming a V-shaped complex that is thought to be tied
together through their hinge domains (Fig. 24–34b). One
N and one C domain come together to form a complete
ATP hydrolytic site at each end of the V.
Proteins in the SMC family are found in all types of
organisms, from bacteria to humans. Eukaryotes have
two major types, cohesins and condensins (Fig. 24–25).
The cohesins play a substantial role in linking together
sister chromatids immediately after replication and
keeping them together as the chromosomes condense
to metaphase. This linkage is essential if chromosomes
are to segregate properly at cell division. The detailed
mechanism by which cohesins link sister chromosomes,
and the role of ATP hydrolysis, are not yet understood.
The condensins are essential to the condensation of
chromosomes as cells enter mitosis. In the laboratory,
condensins bind to DNA in a manner that creates positive supercoils; that is, condensin binding causes the
DNA to become overwound, in contrast to the underwinding induced by the binding of nucleosomes. It is not
yet clear how this helps to compact the chromatin, although one possibility is presented in Figure 24–35.
[Figure 24–33 labels (continued): one rosette (6 loops); nuclear scaffold; one loop (~75,000 bp); 30 nm fiber; “beads-on-a-string” form of chromatin; DNA.]
FIGURE 24–33 Compaction of DNA in a eukaryotic chromosome. Model for levels of organization that could provide DNA compaction in the chromosomes of eukaryotes. The levels take the form of coils upon coils. In cells, the higher-order structures (above the 30 nm fibers) are unlikely to be as uniform as depicted here.
FIGURE 24–34 Structure of SMC proteins. (a) The five domains of the SMC primary structure (N, coiled coil, hinge, coiled coil, C). N and C denote the amino-terminal and carboxyl-terminal domains, respectively. (b) Each polypeptide is folded so that the two coiled-coil domains wrap around each other and the N and C domains come together to form a complete ATP-binding site. Two of these domains are linked at the hinge region to form the dimeric V-shaped molecule. (c) Electron micrograph of SMC proteins from Bacillus subtilis (scale bar, 50 nm).
FIGURE 24–35 Model for the effect of condensins on DNA supercoiling. Binding of condensins to a closed-circular DNA in the presence of topoisomerase I leads to the production of positive supercoils (+). Wrapping of the DNA about the condensin introduces positive supercoils because it wraps in the opposite sense to a solenoidal supercoil (see Fig. 24–24). The compensating negative supercoils (−) that appear elsewhere in the DNA are then relaxed by topoisomerase I. In the chromosome, it is the wrapping of the DNA about condensin that may contribute to DNA condensation. (Figure labels: relaxed DNA; condensin; ATP; topoisomerase I.)
Bacterial DNA Is Also Highly Organized
We now turn briefly to the structure of bacterial chromosomes. Bacterial DNA is compacted in a structure called the nucleoid, which can occupy a significant fraction of the cell volume (Fig. 24–36). The DNA appears to be attached at one or more points to the inner surface of the plasma membrane. Much less is known about the structure of the nucleoid than of eukaryotic chromatin. In E. coli, a scaffoldlike structure appears to organize the circular chromosome into a series of looped domains, as described above for chromatin. Bacterial DNA does not seem to have any structure comparable to the local organization provided by nucleosomes in eukaryotes. Histonelike proteins are abundant in E. coli (the best-characterized example is a two-subunit protein called HU, Mr 19,000), but these proteins bind and dissociate within minutes, and no regular, stable DNA-histone structure has been found. The bacterial chromosome is a relatively dynamic molecule, possibly reflecting a requirement for more ready access to its genetic information. The bacterial cell division cycle can be as short as 15 min, whereas a typical eukaryotic cell may not divide for hours or even months. In addition, a much greater fraction of prokaryotic DNA is used to encode RNA and/or protein products. Higher rates of cellular metabolism in bacteria mean that a much higher proportion of the DNA is being transcribed or replicated at a given time than in most eukaryotic cells.
FIGURE 24–36 E. coli cells showing nucleoids. The DNA is stained with a dye that fluoresces when exposed to UV light. The light area defines the nucleoid. Note that some cells have replicated their DNA but have not yet undergone cell division and hence have multiple nucleoids. (Scale bar, 2 μm.)
With this overview of the complexity of DNA structure, we are now ready to turn, in the next chapter, to
a discussion of DNA metabolism.
SUMMARY 24.3 The Structure of Chromosomes
■ The fundamental unit of organization in the chromatin of eukaryotic cells is the nucleosome, which consists of histones and a 200 bp segment of DNA. A core protein particle containing eight histones (two copies each of histones H2A, H2B, H3, and H4) is encircled by a segment of DNA (about 146 bp) in the form of a left-handed solenoidal supercoil.
■ Nucleosomes are organized into 30 nm fibers, and the fibers are extensively folded to provide the 10,000-fold compaction required to fit a typical eukaryotic chromosome into a cell nucleus. The higher-order folding involves attachment to a nuclear scaffold that contains histone H1, topoisomerase II, and SMC proteins.
■ Bacterial chromosomes are also extensively compacted into the nucleoid, but the chromosome appears to be much more dynamic and irregular in structure than eukaryotic chromatin, reflecting the shorter cell cycle and very active metabolism of a bacterial cell.
Key Terms
Terms in bold are defined in the glossary.
gene
genome
chromosome
phenotype
mutation
regulatory sequence
plasmid
intron
exon
simple-sequence DNA
satellite DNA
centromere
telomere
supercoil
relaxed DNA
topology
underwinding
linking number
specific linking difference (σ)
superhelical density
topoisomers
topoisomerases
plectonemic
solenoidal
chromatin
histones
nucleosome
30 nm fiber
SMC proteins
cohesins
condensins
nucleoid
Problems
1. Packaging of DNA in a Virus Bacteriophage T2 has
a DNA of molecular weight 120 × 10^6 contained in a head
about 210 nm long. Calculate the length of the DNA (assume
the molecular weight of a nucleotide pair is 650) and compare it with the length of the T2 head.
2. The DNA of Phage M13 The base composition of
phage M13 DNA is A, 23%; T, 36%; G, 21%; C, 20%. What
does this tell you about the DNA of phage M13?
3. The Mycoplasma Genome The complete genome of
the simplest bacterium known, Mycoplasma genitalium, is
a circular DNA molecule with 580,070 bp. Calculate the molecular weight and contour length (when relaxed) of this molecule. What is Lk0 for the Mycoplasma chromosome? If
σ = −0.06, what is Lk?
4. Size of Eukaryotic Genes An enzyme isolated from
rat liver has 192 amino acid residues and is coded for by a
gene with 1,440 bp. Explain the relationship between the
number of amino acid residues in the enzyme and the number of nucleotide pairs in its gene.
5. Linking Number A closed-circular DNA molecule in
its relaxed form has an Lk of 500. Approximately how many
base pairs are in this DNA? How is the linking number altered
(increases, decreases, doesn’t change, becomes undefined)
when (a) a protein complex is bound to form a nucleosome,
(b) one DNA strand is broken, (c) DNA gyrase and ATP are
added to the DNA solution, or (d) the double helix is denatured by heat?
6. Superhelical Density Bacteriophage λ infects E. coli
by integrating its DNA into the bacterial chromosome. The
success of this recombination depends on the topology of the
E. coli DNA. When the superhelical density (σ) of the E. coli
DNA is greater than −0.045, the probability of integration is
20%; when σ is less than −0.06, the probability is 70%.
Plasmid DNA isolated from an E. coli culture is found to have
a length of 13,800 bp and an Lk of 1,222. Calculate σ for this
DNA and predict the likelihood that bacteriophage λ will be
able to infect this culture.
7. Altering Linking Number (a) What is the Lk of a
5,000 bp circular duplex DNA molecule with a nick in one
strand? (b) What is the Lk of the molecule in (a) when the
nick is sealed (relaxed)? (c) How would the Lk of the molecule in (b) be affected by the action of a single molecule of
E. coli topoisomerase I? (d) What is the Lk of the molecule
in (b) after eight enzymatic turnovers by a single molecule of
DNA gyrase in the presence of ATP? (e) What is the Lk of the
molecule in (d) after four enzymatic turnovers by a single molecule of bacterial type I topoisomerase? (f) What is the Lk of
the molecule in (d) after binding of one nucleosome?
8. Chromatin Early evidence that helped researchers
define nucleosome structure is illustrated by the agarose gel
below, in which the thick bands represent DNA. It was generated by briefly treating chromatin with an enzyme that
degrades DNA, then removing all protein and subjecting the
purified DNA to electrophoresis. Numbers at the side of the
gel denote the position to which a linear DNA of the indicated
size would migrate. What does this gel tell you about chromatin structure? Why are the DNA bands thick and spread
out rather than sharply defined?
9. DNA Structure Explain how the underwinding of a B-DNA helix might facilitate or stabilize the formation of Z-DNA.
10. Maintaining DNA Structure (a) Describe two structural features required for a DNA molecule to maintain a negatively supercoiled state. (b) List three structural changes
that become more favorable when a DNA molecule is negatively supercoiled. (c) What enzyme, with the aid of ATP, can
generate negative superhelicity in DNA? (d) Describe the
physical mechanism by which this enzyme acts.
11. Yeast Artificial Chromosomes (YACs) YACs are
used to clone large pieces of DNA in yeast cells. What three
types of DNA sequences are required to ensure proper replication and propagation of a YAC in a yeast cell?
[Figure for Problem 8: agarose gel with size markers at 1,000, 800, 600, 400, and 200 bp.]
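The quantitative problems above (for example Problems 1 and 6) rest on a few simple relations. The following is a minimal Python sketch, assuming standard B-DNA values that the problem statements do not give (a rise of 0.34 nm per base pair and 10.5 bp per helical turn) and the usual definitions Lk0 = N/10.5 and σ = (Lk − Lk0)/Lk0. The function names are illustrative only.

```python
# Assumed B-DNA constants (not given in the problem statements above).
RISE_NM_PER_BP = 0.34   # axial rise per base pair, in nm
BP_PER_TURN = 10.5      # base pairs per helical turn of relaxed B-DNA


def base_pairs_from_mw(molecular_weight, mw_per_bp=650.0):
    """Number of base pairs in a duplex of the given molecular weight."""
    return molecular_weight / mw_per_bp


def contour_length_nm(n_bp):
    """Contour length of relaxed duplex DNA, in nm."""
    return n_bp * RISE_NM_PER_BP


def relaxed_linking_number(n_bp):
    """Lk0 of a relaxed closed-circular duplex (left unrounded)."""
    return n_bp / BP_PER_TURN


def superhelical_density(lk, n_bp):
    """sigma = (Lk - Lk0) / Lk0."""
    lk0 = relaxed_linking_number(n_bp)
    return (lk - lk0) / lk0


# Problem 1 inputs: T2 DNA of molecular weight 120 x 10^6.
t2_bp = base_pairs_from_mw(120e6)
t2_length_nm = contour_length_nm(t2_bp)
print(f"T2 DNA: {t2_bp:.0f} bp, {t2_length_nm / 1000:.1f} um")

# Problem 6 inputs: 13,800 bp plasmid with Lk = 1,222.
sigma = superhelical_density(1222, 13800)
print(f"sigma = {sigma:.3f}")
```

Under these assumptions the T2 DNA comes out at roughly 63 µm, about 300 times the length of the 210 nm head, and the plasmid has σ of about −0.07. A different assumed helical repeat would shift both Lk0 and σ slightly.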
SINEs and LINEs: the art of biting the hand that feeds you
Alan M Weiner
SINEs and LINEs are short and long interspersed
retrotransposable elements, respectively, that invade new
genomic sites using RNA intermediates. SINEs and LINEs are
found in almost all eukaryotes (although not in Saccharomyces
cerevisiae) and together account for at least 34% of the
human genome. The noncoding SINEs depend on reverse
transcriptase and endonuclease functions encoded by partner
LINEs. With the completion of many genome sequences,
including our own, the database of SINEs and LINEs has
taken a great leap forward. The new data pose new questions
that can only be answered by detailed studies of the
mechanism of retroposition. Current work spans the
biochemistry of reverse transcription and integration in vitro,
target site selection in vivo, nucleocytoplasmic transport of the
RNA and ribonucleoprotein intermediates, and mechanisms of
genomic turnover. Two particularly exciting new ideas are that
SINEs may help cells survive physiological stress, and that the
evolution of SINEs and LINEs has been shaped by the forces
of RNA interference. Taken together, these studies promise to
explain the birth and death of SINEs and LINEs, and the
contribution of these repetitive sequence families to the
evolution of genomes.
Addresses
Department of Biochemistry, HSB J417, University of Washington,
Box 357350, Seattle, WA 98195-7350, USA;
e-mail: [email protected]
Current Opinion in Cell Biology 2002, 14:343–350
0955-0674/02/$ — see front matter
© 2002 Elsevier Science Ltd. All rights reserved.
Abbreviations
AP      apurinic/apyrimidinic
EN      endonuclease
LINE    long interspersed repeated sequence
ORF     open reading frame
pA      polyadenylation site
pol     RNA polymerase
RNAi    RNA interference
RT      reverse transcriptase
SINE    short interspersed repeated sequence
SRP     signal recognition particle
Introduction
The dawn of the genomic age has transformed every aspect
of biology, and the study of retrotransposable elements is
no exception. Until complete genomic sequences began to
appear in 1997, whoever studied SINEs and LINEs (short
and long interspersed repeated sequences, as rhymefully
named by Singer [1]) had to worry that any individual
retrotransposon sequence might be misleadingly different
from other members of the same sequence class. For
example, with a database in 1981 of only a few dozen
Alu sequences (the most abundant SINE in the human
genome), no firm conclusions could be drawn about the
remaining 1,090,000 genomic Alu elements [2••] whose
existence was known solely from DNA reassociation
kinetics. Sampling 0.002% of the genome simply does not
inspire confidence.
Now, with many genomic sequences nearing completion,
we can examine essentially all members of a particular
class of SINEs or LINEs in a particular genome, asking
questions and drawing conclusions that should withstand
the test of time. This is the good news.
Genomic anatomy rarely reveals mechanism, which is best
investigated in the cold room or the tissue culture hood. So
for now, the most that can be said for the genome
sequences of yeast, Arabidopsis, flies, worms, mice and
humans is that they have sharpened our questions about
SINEs and LINEs without really answering any of them.
Historically, the first breakthrough from genome structure
to molecular mechanism came in 1993, when the R2 protein
of the insect retrotransposon R2Bm was shown to nick the
target DNA to generate a primer for reverse transcription
of the R2Bm RNA in situ [3]. This brought retroposition
within reach of conventional divide-and-conquer biochemistry. The second breakthrough came in 1996, when
high-frequency retroposition of a human LINE element
was achieved in cultured somatic cells [4]. This meant that
retroposition could be dissected using the powerful
methods of surrogate somatic cell genetics, including
selection for rare events. Together, these two breakthroughs
dispelled any residual fears that retroposition might occur
only in experimentally inaccessible germ line cells [5] or
might occur less frequently than the typical research grant
must be renewed.
SINEs and LINEs: a parts list and owner’s
manual
LINEs are autonomous retroelements; SINEs are their
dependants. As shown in Figure 1, LINEs contain an
unusual internal promoter for RNA polymerase II (pol II),
one or two open reading frames (ORFs) (where the second
ORF is accessed through a frameshift), and a 3′-terminal
polyadenylation site (pA) lacking the usual downstream
efficiency element [6]. ORF1 encodes an essential protein
of unknown function [7••] while ORF2 encodes a bifunctional polypeptide with both reverse transcriptase (RT)
and DNA endonuclease (EN) activity.
EN is downstream of RT in older LINEs (e.g. R2, CRE1
and -2, SLACS, CZAR, Dong and R4), and upstream
of RT in younger LINEs (e.g. L1, Jockey and CR1) [8].
The downstream EN module in older LINEs is a
sequence-specific restriction-like EN; the upstream EN
module in younger LINEs is an apurinic/apyrimidinic
(AP) endonuclease that usually, but not always, lacks
Nuclear and gene expression
Figure 1. A parts list for SINEs and LINEs.
[Diagram, top to bottom: SINEs (pol III promoter, A-rich tail); older LINEs (R2, CRE1, SLACS: pol II promoter, ORF (RT/R-EN), pA, A-rich tail); younger LINEs (L1, jockey, CR1: pol II promoter, ORF1, ORF2 (AP-EN/RT), pA, A-rich tail); retropseudogenes ('processed genes': pA, A-rich tail); protein-encoding genes (pol II promoter, pA).]
+1 represents the transcription start site.
AP-EN, apurinic/apyrimidinic endonuclease;
pA, polyadenylation signal lacking downstream
efficiency element; pol II and pol III, RNA
polymerase II and III promoters; R-EN,
restriction-like endonuclease; RT, reverse
transcriptase. Solid black arrows represent
7–20 base pair target-site duplications.
Dashed lines indicate deletion of intron
sequences in retropseudogenes derived from
mature mRNA. Only the most common forms
of LINE gene organization are shown; variants
exist. Note that some SINEs have 3′-terminal
dinucleotide or trinucleotide repeats instead
of A-rich tails, and that partner SINEs and
LINEs share common 3′-terminal sequences.
sequence-specificity [9]. The polyadenylated LINE
transcript is exported to the cytoplasm (presumably like
any other intron-less mRNA) [10,11] and translated.
However, the nascent RT preferentially binds in cis to
the LINE RNA encoding it [7••,12,13]. The resulting
ribonucleoprotein (RNP) complex of a functional LINE
RNA with the bifunctional RT/EN polypeptide then
enters the nucleus, where it initiates a process called
‘target-primed reverse transcription’. In this process, the
EN component of the RT/EN polypeptide nicks the target
DNA to generate a 3′-hydroxyl group, which is used by
the RT component of the RT/EN polypeptide to prime
reverse transcription of the LINE RNA in situ on the
chromosome [3]. Thus, the LINE cDNA never exists free
of the chromosome, as originally postulated [14].
Moreover, the R2 endonuclease is activated by RNA,
presumably to prevent it from wreaking freelance chromosomal damage [15]. Following reverse transcription, the
DNA repair machinery present in somatic [7••], as well as
germ line, cells mends the broken DNA, creating a staggered
break and generating flanking target site duplications of
7–20 nucleotides. Most LINEs are 5′ truncated, an observation
commonly attributed to incomplete reverse transcription.
But other interpretations are possible. A newly retroposed,
full-length LINE element is ‘fertile’ — that is, capable
of further rounds of retroposition — because both the
internal pol II promoter at the 5′ end of the LINE, and
the fully internal polyadenylation signal at the 3′ end,
guarantee that the new element will be transcriptionally
competent to produce an essentially identical polyadenylated LINE mRNA.
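The sequence bookkeeping behind a target-site duplication can be sketched in a few lines of Python. This only illustrates the outcome described above (the same 7–20 nucleotides of target sequence flanking both ends of the new element) and is not a model of the enzymology. The function name, the example sequences, and the choice of a 7-nucleotide duplication are hypothetical.

```python
def insert_with_tsd(target, nick_pos, element, tsd_len):
    """Insert `element` into `target` and duplicate the staggered-break bases.

    The `tsd_len` bases starting at `nick_pos` end up on both sides of the
    insert, mimicking repair of a staggered break at the target site.
    Returns the new locus and the duplicated sequence.
    """
    if not 0 <= nick_pos <= len(target) - tsd_len:
        raise ValueError("staggered break must lie within the target")
    duplicated = target[nick_pos:nick_pos + tsd_len]
    left = target[:nick_pos] + duplicated
    right = target[nick_pos:]  # begins with the same duplicated bases
    return left + element + right, duplicated


# Hypothetical 21-nt target; "[LINE]" stands in for the retroposed element.
target = "TTGACCGTAAGCTTGGCATCA"
new_locus, tsd = insert_with_tsd(target, nick_pos=5, element="[LINE]", tsd_len=7)
print(new_locus)
print("target-site duplication:", tsd)
```

Reading the printed locus from either side of the placeholder shows the same seven bases, which is the signature used to recognise retroposed elements and retropseudogenes in genomic sequence.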
SINEs are similar to LINEs, but shorter, simpler, and
almost certainly dependent on LINE RT/EN functions
for retroposition. SINEs have an internal promoter for
RNA polymerase III (pol III) instead of pol II and a 3′-terminal
A-rich tract (or occasionally dinucleotide or trinucleotide
repeats) instead of the pol II polyadenylation signal, and they contain
no significant ORFs, but they otherwise function much like
LINEs. A functional SINE must be free of any oligothymidylate tracts of length n (where n ≥ 4), which can function as
pol III termination signals. As a result, SINE transcription
continues through the A-rich region until it encounters a
random oligothymidylate tract downstream.
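The termination rule in this paragraph lends itself to a short check. The sketch below is illustrative Python, not a published tool: it scans the non-template strand for the first run of four or more T residues and, as a simplification, treats the transcript as ending at the fourth T of that run. The function names and example sequences are hypothetical.

```python
import re


def first_pol3_terminator(seq, min_run=4):
    """Index of the first run of at least `min_run` T residues, or None."""
    match = re.search("T{%d,}" % min_run, seq.upper())
    return match.start() if match else None


def pol3_transcript(sine, downstream_flank, min_run=4):
    """DNA-sense sequence transcribed from a SINE plus its 3' flank.

    Simplification: transcription is taken to stop after `min_run` T residues
    of the first oligo(T) tract; if none is found, the whole input is returned.
    """
    template = (sine + downstream_flank).upper()
    stop = first_pol3_terminator(template, min_run)
    if stop is None:
        return template
    return template[:stop + min_run]


# Hypothetical SINE ending in an A-rich tract, followed by random flank.
sine = "GGCCGAGCAAAAAA"
flank = "GCATTCGTTTTGCA"

print(first_pol3_terminator(sine))   # no internal terminator expected
print(pol3_transcript(sine, flank))  # reads through the A-rich tract
```

A SINE copy for which `first_pol3_terminator` returns an index inside the element itself would, by the argument above, be truncated during transcription and so could not give rise to full-length copies.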
The A-rich tract of SINE elements is presumed to function
as the template for initiation of reverse transcription,
because sequences downstream of the A-rich region are
not retroposed; however, this has not been proven. A
newly retroposed SINE element is ‘fertile’ because the
internal pol III promoter and the 3′-terminal A-rich tract
together guarantee that the new element will be transcriptionally competent to produce new SINE RNAs that
are essentially identical to the original.
The best evidence that SINEs piggyback on LINE
RT/EN functions is the discovery that SINEs sometimes
share common 3′-terminal sequences with ‘partner’ LINEs
[16]. This would facilitate ‘retropositional parasitism’ of
SINEs on LINEs by increasing the efficiency with which
the LINE RT recognises the 3′ end of a cognate SINE
[17,18••]. Not surprisingly, some SINEs are truncated
LINEs, such as those arising from the RTE class of retrotransposons [19], but truncated LINEs are not necessarily
SINEs [20].
No discussion of SINEs and LINEs would be complete
without mentioning ‘retropseudogenes’ (also known as
‘processed genes’) — complete or 5′-truncated copies of
mature mRNAs flanked by short target-site duplications of
7–20 nucleotides. As long suspected, and now demonstrated
experimentally [7••], retropseudogenes are generated
when the LINE RT/EN binds a mature (presumably
cytoplasmic) mRNA instead of a LINE or SINE RNA.
The resulting retroposed mRNA lacks introns, but has
gained a 3′-terminal poly(A) tail, added post-transcriptionally
during maturation of the mRNA precursor. Unlike SINEs
and LINEs, retropseudogenes are infertile (‘dead on
arrival’) because they lack an internal promoter. Shared
3′-terminal sequences help SINEs exploit LINEs by
overcoming the natural cis preference of LINE RTs for the
RNAs that encoded them [17,18••]. Although generation
of retropseudogenes depends on random binding of the
LINE RT to mRNAs, retroposition is apparently driven
forward by the overwhelming abundance of potential
mRNA templates, and perhaps also by the ability of
3′-terminal poly(A) tracts to facilitate initiation of reverse
transcription by template–primer slippage [21].
Crossing the great divide: nucleocytoplasmic
transport
The study of SINEs and LINEs will not get far without
some serious cell biology, because all available evidence
indicates that the RNA intermediates for retroposition of
SINEs, LINEs and retropseudogenes are captured — if
not actually reverse transcribed — in the cytoplasm. LINE
RT/EN grabs LINE mRNA primarily in cis in the
cytoplasm [13], and carries the mRNA into the nucleus as
an RNP, as is also the case for retroviruses [22]. The almost
complete absence of retropseudogenes containing unexcised
introns (rat preproinsulin 1 being one of the few exceptions
[23,24]) provides further evidence that the LINE RT/EN
first binds RNA intermediates in the cytoplasm. However,
it remains to be seen whether SINEs such as primate
Alu elements [25] and rodent ID elements [5], which are
derived from cytoplasmic RNAs, must retropose through
cytoplasmic RNA intermediates.
If the RNA intermediates for retroposition are cytoplasmic,
SINEs and LINEs may have always been under selection
for cytoplasmic stability as well as efficient nuclear export
(an active process driven by a GTP gradient and requiring
specific cargo-binding proteins [10,11,26,27]). Interestingly,
retroposition of dimeric human Alu elements derived from
the 7SL RNA component of the signal recognition particle
(SRP) appears to be facilitated by binding of two of the six
SRP proteins (SRP9 and SRP14) to the right monomer,
possibly to assure nuclear export and/or stabilize full-length
Alu RNAs in the cytoplasm [25]. Thus, cytoplasmic RNA
retroposition intermediates may evolve to bind proteins
that guarantee nuclear export and/or cytoplasmic stability,
while avoiding proteins that would interfere with reverse
transcription or nuclear import. For example, the nuclear
export apparatus only recognises mature tRNA with
correct 5′ and 3′ ends [28]. This could be one explanation
for why SINEs derived from tRNA often retain a recognisable tRNA fold [29,30]. Another explanation might be
that the tRNA fold, like the shared 3′ terminus with a
partner LINE, facilitates binding of the LINE RT/EN
[17]. Similarly, binding of poly(A) binding proteins [31] to
the poly(A) tail of LINEs, and possibly the 3′-terminal
A-rich tract of SINEs [32], may facilitate nuclear export or
cytoplasmic stability.
LINEs: a maelstrom of modules?
A phylogeny of the individual parts (the LINE EN and
RT functions) is not necessarily a phylogeny of the whole
because retroelements, like viruses and organismal genomes,
are a ‘maelstrom of modules’ [33]. The most comprehensive
phylogeny to date is extremely revealing [34,35••]: all
LINE-like elements (or ‘non-LTR retrotransposons’)
share a common ancestral RT, most closely related to the
RT of certain group II introns. The earliest LINE-like
elements possessed a sequence-specific restriction-like
endonuclease downstream of the RT module. This
downstream restriction endonuclease was later replaced
by an upstream AP EN, and still later some of these
elements acquired an RNase H domain downstream of the
RT. All of the restriction-like endonucleases are sequence-specific [8], but the AP EN can be nonspecific or, more
rarely, sequence-specific, as in the Bombyx R1Bm element
[36]. Perhaps when we learn the function of the second
ORF that is present only in the younger elements, we will
begin to understand whether LINEs were assembled from
pre-existing cellular parts, or cellular functions were
devised for pre-existing LINE or group II intron functions.
A SINE is born
Most but not all SINEs are derived from tRNA; exceptions
include rodent ID elements derived from neuronal
BC1 RNA, also found in male germ cells [5], and primate
Alu elements derived from the ubiquitous 7SL RNA
component of the SRP [37]. However, any SINE that too
closely resembles the parent RNA could function as a
dominant-negative mutant, or be sequestered from
retroposition by interaction with proteins that normally
bind the parent RNA. Thus, as mentioned above, the trick
may be to retain sufficient resemblance to the parent
RNA to bind proteins that are important for transport or
stability, but not those proteins that would interfere with
retroposition. Certainly, preservation of the internal pol III
transcription factor binding sites (the A and B boxes) is not
sufficient to explain preservation of a recognisable tRNA
fold. We also can’t explain why an abundant pol III
transcript like 5S ribosomal RNA has apparently never
given rise to a SINE; why primate Alu elements are dimeric
while the vast majority of mammalian SINEs are
monomeric; or why the V-SINE (vertebrate-specific SINE)
superfamily maintains a highly conserved core sequence
sandwiched between the tRNA-like 5′ end and the
LINE-like 3′ end [18••]. We can’t even explain why there
are no LINEs or SINEs in the budding yeast S. cerevisiae,
despite an abundance of retroviral-like elements and the
presence of L1-like retrotransposons in the pathogenic yeasts
Candida albicans and Cryptococcus neoformans [38,39].
For SINEs that share 3′-terminal sequences with a partner
LINE and persist by ‘retropositional parasitism’ [17,18••],
it is not difficult to imagine how the SINE got its tail: the
LINE RT would generate a short cDNA (equivalent to a
retroviral ‘strong stop’ cDNA) by copying the 3′-terminal
LINE RNA sequence. This cDNA would then switch
from the LINE template to the RNA parent of the
SINE-to-be, thus attaching an RT landing pad to an RNA
carrying an internal pol III promoter. One can also imagine
that 3′-terminal identity with a partner LINE enables
SINEs to more efficiently pirate the LINE RT, a cis-acting
enzyme designed to keep functional LINEs alive by
ignoring RNAs derived from the vast excess of moribund
or nonfunctional elements [7••,13]. However, ‘retropositional
parasitism’ is not risk-free: the human SINE MIR, which
shares 50 bases of 3′-terminal sequence with the partner
LINE2 element, was doomed when LINE2 (L2)
became extinct [2••].
Other SINEs, such as primate Alu sequences, lack partner
LINEs, perhaps because the L1 RT binds tightly to the
A-rich tail and is less dependent on auxiliary or adjacent
RNA sequences. Only careful studies of LINE RT
activities will confirm, modify or enable us to reject
these stories.
Knowing when to stop: dodging host genes
and seeking safe havens
All retrotransposable elements are insertional mutagens
that can cause disease or disability by inserting near or
within essential genes [40,41]. As a result, retroelements
face an existential dilemma [42]. Unlike infectious viral
elements that can afford to kill one host in the process of
infecting others, noninfectious retrotransposable elements
are captive but rebellious passengers within the host
genome [42]. (We ignore the possible horizontal transfer of
SINEs and LINEs as stowaways in retroviral particles
[43–45].) If the retroelement multiplies too recklessly,
it will kill the host; but if it does not multiply fast enough
to offset natural loss and decay, it cannot survive. This
balancing act pits the ingenuity of the retroelement against
the ingenuity of the host; the goal is to overrun, but not
overcome the host. The abundance of dead retroelements
littering the human genomic landscape provides mute
testimony regarding this endless battle [2••]; however, we
do not understand why one family of LINEs (say, L1)
succeeds another (say, L2), or what enables one family of
SINEs or LINEs to replicate explosively, while others eke
out a minimal existence without vanishing into oblivion.
New data from the (nearly) complete human genome
sequence underscore the magnitude of the problem:
1,500,000 SINEs (70% of them Alu elements) account
for 13% of our genome, and 850,000 LINEs for another
21% of the genome, giving a grand total of 34% transposable elements [2••]. Moreover, the distribution of
transposable elements is inexplicably uneven, best
exemplified by a 525 kb segment of chromosome Xp11,
where the density of retroposable elements is 89%, and the
four human homeobox gene clusters (HoxA, HoxB, HoxC,
and HoxD), where the density of interspersed repeats is
less than 2%. It is a wonder that our genes have survived
the bombardment.
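The copy numbers and genome fractions quoted above can be cross-checked with simple arithmetic. The Python sketch below assumes a haploid genome size of 3.2 × 10^9 bp, a figure that this section does not state; the dictionary layout and function name are illustrative only.

```python
GENOME_BP = 3.2e9  # assumed haploid genome size; not given in this section

# Copy numbers and genome percentages as quoted in the text [2].
REPEAT_FAMILIES = {
    "SINE": {"copies": 1_500_000, "genome_percent": 13},
    "LINE": {"copies": 850_000, "genome_percent": 21},
}


def mean_copy_length_bp(copies, genome_percent, genome_bp=GENOME_BP):
    """Average length per copy implied by copy number and genome share."""
    return genome_bp * genome_percent / 100 / copies


total_percent = sum(f["genome_percent"] for f in REPEAT_FAMILIES.values())
alu_copies = REPEAT_FAMILIES["SINE"]["copies"] * 70 // 100  # 70% of SINEs

print(f"SINEs + LINEs: {total_percent}% of the genome")
print(f"Alu copies: {alu_copies:,}")
for name, fam in REPEAT_FAMILIES.items():
    length = mean_copy_length_bp(fam["copies"], fam["genome_percent"])
    print(f"mean {name} copy length: {length:.0f} bp")
```

With the assumed genome size, the implied mean LINE copy is only about three times the mean SINE copy, far less than the 20-fold size difference the review later quotes for LINEs relative to SINEs. That fits the earlier observation that most LINE copies are 5′ truncated.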
The choice of integration site is a difficult one for any
transposable element: on the one hand, it is advantageous
to avoid integration in highly active genomic regions where
major damage might be done [46•]. On the other hand, it is
disadvantageous to be sequestered in inactive regions
where transcriptional activity is low. One solution is to
target tandemly repeated genes (rRNA, trypanosome and
nematode trans-spliced leaders [47,48]) or dispersed
multigene families (tRNA). For example, the insect R1
and R2 LINE elements carry a sequence-specific
endonuclease that cuts within the rDNA repeat unit [36].
With excess rRNA coding capacity provided by several
hundred tandem repeats of the rDNA, the host can endure
a significant burden of insertions. Yet the R1 element is
scarce in Bombyx rDNA where concerted evolution or other
forces select against integrants [49•], while the very same
R1 element is superabundant in Drosophila rDNA where it
may interrupt 50–70% of the tandem rDNA repeats [50].
Another strategy — not yet shown for a SINE or LINE — is
to land near genes, but not in them. In yeast, the Ty3
endogenous retrovirus targets the 5′ flanking region of
tRNA genes by a protein–protein interaction between the Ty3
integration machinery and components of the transcription
factor TFIIIB [51,52]. tRNA gene expression is apparently
unaffected, but the Ty3 element is thereby guaranteed a
home in a constitutively open chromatin region.
Mysteries of the genome: inverse distribution
of SINEs and LINEs
Curiously, human SINEs (Alu elements) are concentrated
in gene-rich GC-rich regions of the genome, and LINEs in
gene-poor AT-rich regions [2••,53••], supporting previous
evidence for the existence of isochores (extended regions
of compositionally or functionally similar DNA) [54,55•].
This inverse distribution is unlikely to reflect preferential
integration of SINEs in GC-rich regions, because SINEs
are widely believed to pirate the LINE RT/EN protein,
and also unlikely to reflect preferential loss from AT-rich
regions, as these appear to tolerate high levels of ‘junk’
DNA. Thus, SINEs may be preferentially retained in
GC-rich regions (perhaps positively selected to augment
the stress response, as described below [56,57,58••]). Or
LINEs might be preferentially lost from GC-rich regions
(perhaps because the 20-fold larger LINEs are more
dangerous insertional mutagens [46•]).
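Comparing repeat density with base composition starts with tabulating GC content along a sequence. The following is a minimal Python sketch of that first step, assuming fixed, non-overlapping windows by default; the function names and the toy sequence are hypothetical, and real isochore analyses use far larger windows than shown here.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def windowed_gc(seq, window, step=None):
    """List of (start, GC fraction) for each full window along `seq`."""
    step = step or window
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: a GC-rich stretch followed by an AT-rich stretch.
toy = "GGCGCCGCGG" + "ATTATAAATT"
for start, gc in windowed_gc(toy, window=10):
    print(start, f"{gc:.2f}")
```

Pairing each window's GC fraction with the number of SINE and LINE copies annotated in it would give the kind of table from which the inverse distribution described above can be read off.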
Alternatively, an excess of L1 elements on the human Y
chromosome, and to a lesser extent on the human X
chromosome, suggests that LINEs if not SINEs may be
purged by recombination between homologues containing
occupied and empty target sites [46•]. Although these gene
conversion events would have to be directional to purge
newly integrated retrotransposable elements instead of
fixing them, recombinational differences between GC-rich
and AT-rich regions of the genome might then explain the
inverse distribution of SINEs and LINEs.
Another possible explanation for the inverse genomic
distribution of SINEs and LINEs would be differential
cell cycle regulation at one (or more) of the many steps
in the retroposition process. Among the steps that could
be cell-cycle regulated are transcription of the RNAs
themselves [59], nuclear export and cytoplasmic stabilization
of the RNAs, nuclear import of the RNAs or RNP
intermediates, differential chromatin condensation
(preferential decondensation of gene-rich GC-rich regions
may facilitate integration), different DNA replication
schedules (gene-rich GC-rich DNA is usually replicated
earlier than gene-poor AT-rich regions), and differential DNA
repair (preferential decondensation of gene-rich GC-rich
regions may render the underlying DNA more susceptible
to damage, with integration as a byproduct or consequence).
Although there is no direct evidence that chromatin
structure can affect the integration of SINEs and LINEs,
there are retroviral precedents: nucleosomes generally
block, but occasionally enhance, mammalian retroviral
integration [60] and protein–protein interactions are
responsible for targeting Ty3 to tRNA promoters [51,52]
and Ty5 to silent chromatin [61].
The inverse distribution of SINEs and LINEs could also
be influenced by differential DNA methylation within
gene-poor AT-rich and gene-rich GC-rich regions, but
there are conflicting reports of the effect of CpG
methylation on SINE and LINE transcription. Natural
CpG methylation in HeLa cells inhibits Alu transcription,
and inhibition is relieved by treating the cells with the
demethylating drug 5-azacytidine [32]. In contrast, artificial
CpG methylation inhibited L1 but not Alu transcription
in both transient and stable expression assays when the
methylated CpG binding protein MeCP2 was overexpressed and/or targeted to the SINE or LINE reporter
constructs by a Gal4 DNA-binding domain [62•]. MeCP2
is a transcriptional repressor that tethers the Sin3A–HDAC1
and Sin3A–HDAC2 histone deacetylase complexes to
CpG-methylated DNA.
The taming of the shrewd: SINEs, LINEs and
RNA interference
RNA interference (RNAi) is a form of post-transcriptional
gene silencing triggered by double-stranded RNA. It is
found in many organisms, including flies [63,64], worms
[65] and mammals [66]. In flies [67••], worms [68,69] and
vertebrates, including humans [67••], RNAi appears to be
a normal regulatory mechanism in which single-stranded
‘micro RNAs’, around 22 nt in size and derived from
somewhat larger developmentally controlled RNA precursors, anneal with target mRNAs and trigger their
degradation. RNAi also plays a protective or defensive role,
and has been implicated in silencing transposons [70–72]
and viruses [73–75] in fungi and plants.
SINEs and LINEs appear to be subject to RNAi [76,77••].
Although SINEs and LINEs can be independently
transcribed from internal promoters, co-transcription of
SINEs and LINEs from external promoters will generate
antisense as well as sense transcripts. These sense and
antisense transcripts of SINEs and LINEs could in
principle anneal to form double-stranded RNAs capable
of triggering the degradation of any other transcripts
containing SINE and LINE sequences, whether independently transcribed from internal SINE and LINE
promoters or co-transcribed as part of larger RNAs. To
explain how mRNAs and mRNA precursors containing all
or part of a SINE or LINE sequence can survive attack by
RNAi, one might speculate that divergence between any
pair of SINE or LINE sequences is usually too great to
trigger RNAi. In any event, RNAi may have played a role
in the evolution of SINEs and LINEs, as it is part of the
cellular environment in which SINEs and LINEs arise and
propagate. We do not know whether RNAi evolved first as
a defensive or developmental regulatory mechanism, but an
extreme view would be that SINEs and LINEs could not
flourish until RNAi evolved to protect the cell from them.
Accidental travellers
SINEs and LINEs, no less than retropseudogenes, can
provide the ‘seeds of evolution’ [78]. Perhaps the single
most conspicuous demonstration of the power of retroposition is the functional, tissue-specific human pgk-2 locus, an
autosomal phosphoglycerate kinase retrogene expressed
during spermatogenesis and lacking the ten introns of the
ubiquitously expressed X-linked pgk-1 gene. How PGK-2
acquired both a promoter and tissue specificity is an
interesting question. Was it a lucky insertion into a tissue-specific promoter or chromosomal region, or was the
specificity an accident that was then perfected by subsequent selection? Almost equally remarkable is the rat
preproinsulin I gene, a functional retroposon which, having
lost one of the two ancestral preproinsulin II introns, may
be the sole instance in which a partially spliced mRNA
appears to have served as the RNA intermediate [23].
Instances of useful genomic mayhem created by
retroposition of SINEs and LINEs are still relatively sparse
(see http://exppc01.uni-muenster.de/expath/alltables.htm),
but more will surely emerge from detailed analysis of
the human genome sequence. For example, L1s can
co-transduce 3′ flanking sequences [79,80], and integration
of Alu elements in reverse orientation can introduce
fortuitous 3′ splice sites that are capable of diversifying
the spectrum of alternatively spliced mRNAs (R Sorek and
G Ast, personal communication). There is also one report
of a rodent B2 SINE that carries a pol II promoter [81•].
Remarkably, this portable pol II promoter does not interfere
with the internal pol III promoter function required for
B2 retroposition, and in one case the mobile pol II promoter
drives transcription of a typical gene encoding the laminin
variant, Lama3.
Are SINEs useful parasites?
From the moment of their discovery, SINEs and LINEs
were treated as genomic parasites [82], an internal infection
that could be kept in check but rarely cured. However,
evolution, like science, is rife with reversals of fortune.
The first group to sequence human Alu elements has now
proposed that SINEs, once considered a source of genomic
stress, may in fact help ease cells through physiological
stresses that induce Alu transcription such as heat shock
and translational inhibition [56,57,58••]. The induced
SINE transcripts would then bind to PKR kinase, blocking
the ability of this kinase to inhibit translation by phosphorylating the initiation factor eIF2α.
Although initially greeted with scepticism, the view that
SINEs may lend us a helping hand has received significant
support from an unexpected direction: the International
Human Genome Sequencing Consortium argued, in
presenting the first draft of the human genome sequence
[2••], that over-representation of Alu elements in gene-rich
GC-rich DNA is most easily explained if ‘SINEs actually
earn their keep in the genome’ by positive selection, as
proposed by Schmid [56,57,58••]. Although the jury is still
out, at least we can now look forward to a fair trial.
Update
A recent publication [83] suggests that methylation may be
responsible for the mysteriously uneven genomic distribution
of SINEs, perhaps by affecting the insertion and/or deletion
of the elements.
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
1. Singer MF: SINEs and LINEs: highly repeated short and long
interspersed sequences in mammalian genomes. Cell 1982,
28:433-434.
2. •• Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC,
Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial
sequencing and analysis of the human genome. Nature 2001,
409:860-921.
Although the first scholarly presentation of the human genome sequence
might have been expected to focus primarily on genes, the public consortium
did a marvellous job of discussing the general appearance of the genomic
landscape, and the role that SINEs and LINEs may have played in shaping
that rugged geography. The uneven distribution of SINEs in the human
genome suggests that retroposable elements are subject to positive as well
as negative selection, reinforcing provocative recent evidence that SINEs
may be good for us after all (see Li and Schmid [2001] [58••] below).
3. Luan DD, Korman MH, Jakubczak JL, Eickbush TH: Reverse
transcription of R2Bm RNA is primed by a nick at the
chromosomal target site: a mechanism for non-LTR
retrotransposition. Cell 1993, 72:595-605.
4. Moran JV, Holmes SE, Naas TP, DeBerardinis RJ, Boeke JD,
Kazazian HH Jr: High frequency retrotransposition in cultured
mammalian cells. Cell 1996, 87:917-927.
5. Muslimov IA, Lin Y, Heller M, Brosius J, Zakeri Z, Tiedge H: A small
RNA in testis and brain: implications for male germ cell
development. J Cell Sci 2002, 115:1243-1250.
6. Hans H, Alwine JC: Functionally significant secondary structure of
the simian virus 40 late polyadenylation signal. Mol Cell Biol
2000, 20:2926-2932.
7. •• Esnault C, Maestre J, Heidmann T: Human LINE retrotransposons
generate processed pseudogenes. Nat Genet 2000, 24:363-367.
Since the discovery of SINEs, LINEs and processed pseudogenes in the
early 1980s, it was widely assumed that LINE reverse transcriptase would
be responsible for propagating noncoding SINEs and generating processed
retropseudogenes derived from mRNAs. This work provides the long-awaited
evidence for this long-held hypothesis, and develops the experimental tools
for understanding the functions of LINE reverse transcriptase and integrase
at the molecular level.
8. Yang J, Malik HS, Eickbush TH: Identification of the endonuclease
domain encoded by R2 and other site-specific, non-long terminal
repeat retrotransposable elements. Proc Natl Acad Sci USA 1999,
96:7847-7852.
9.
Feng Q, Moran JV, Kazazian HH Jr, Boeke JD: Human L1
retrotransposon encodes a conserved endonuclease required for
retrotransposition. Cell 1996, 87:905-916.
10. Pasquinelli AE, Ernst RK, Lund E, Grimm C, Zapp ML, Rekosh D,
Hammarskjold ML, Dahlberg JE: The constitutive transport element
(CTE) of Mason–Pfizer monkey virus (MPMV) accesses a cellular
mRNA export pathway. EMBO J 1997, 16:7500-7510.
11. Paca RE, Ogert RA, Hibbert CS, Izaurralde E, Beemon KL: Rous
sarcoma virus DR posttranscriptional elements use a novel RNA
export pathway. J Virol 2000, 74:9507-9514.
12. Kimberland ML, Divoky V, Prchal J, Schwahn U, Berger W,
Kazazian HH Jr: Full-length human L1 insertions retain the capacity
for high frequency retrotransposition in cultured cells. Hum Mol
Genet 1999, 8:1557-1560.
13. Wei W, Gilbert N, Ooi SL, Lawler JF, Ostertag EM, Kazazian HH,
Boeke JD, Moran JV: Human L1 retrotransposition: cis
preference versus trans complementation. Mol Cell Biol 2001,
21:1429-1439.
14. Van Arsdell SW, Denison RA, Bernstein LB, Weiner AM, Manser T,
Gesteland RF: Direct repeats flank three small nuclear RNA
pseudogenes in the human genome. Cell 1981, 26:11-17.
15. Yang J, Eickbush TH: RNA-induced changes in the activity of the
endonuclease encoded by the R2 retrotransposable element.
Mol Cell Biol 1998, 18:3455-3465.
16. Okada N, Hamada M, Ogiwara I, Ohshima K: SINEs and
LINEs share common 3′ sequences: a review. Gene 1997,
205:229-243.
17. Ogiwara I, Miya M, Ohshima K, Okada N: Retropositional parasitism
of SINEs on LINEs: identification of SINEs and LINEs in
elasmobranchs. Mol Biol Evol 1999, 16:1238-1250.
18. •• Ogiwara I, Miya M, Ohshima K, Okada N: V-SINEs: a new
superfamily of vertebrate SINEs that are widespread in vertebrate
genomes and retain a strongly conserved segment within each
repetitive unit. Genome Res 2002, 12:316-324.
The Okada group was the first to show that SINEs sometimes impersonate
LINEs by imitating (or stealing) their 3′-terminal sequences. The shared
3′-terminal sequences may facilitate retroposition by enabling SINE RNA
to bind LINE reverse transcriptase more tightly. V-SINEs are a stunning
example of this kind of ‘retropositional parasitism’ in which one mobile
element exploits another.
19. Malik HS, Eickbush TH: The RTE class of non-LTR retrotransposons
is widely distributed in animals and is the origin of many SINEs.
Mol Biol Evol 1998, 15:1123-1134.
20. Nikaido M, Okada N: CetSINEs and AREs are not SINEs but are
parts of cetartiodactyl L1. Mamm Genome 2000, 11:1123-1126.
21. Bebenek K, Abbotts J, Roberts JD, Wilson SH, Kunkel TA: Specificity
and mechanism of error-prone replication by human
immunodeficiency virus-1 reverse transcriptase. J Biol Chem
1989, 264:16948-16956.
22. Kenna MA, Brachmann CB, Devine SE, Boeke JD: Invading the
yeast nucleus: a nuclear localization signal at the C terminus of
Ty1 integrase is required for transposition in vivo. Mol Cell Biol
1998, 18:1115-1124.
23. Soares MB, Schon E, Henderson A, Karathanasis SK, Cate R, Zeitlin S,
Chirgwin J, Efstratiadis A: RNA-mediated gene duplication: the rat
preproinsulin I gene is a functional retroposon. Mol Cell Biol
1985, 5:2090-2103.
24. Weiner AM, Deininger PL, Efstratiadis A: Nonviral retroposons:
genes, pseudogenes, and transposable elements generated by
the reverse flow of genetic information. Annu Rev Biochem 1986,
55:631-661.
25. Sarrowa J, Chang DY, Maraia RJ: The decline in human Alu
retroposition was accompanied by an asymmetric decrease
in SRP9/14 binding to dimeric Alu RNA and increased
expression of small cytoplasmic Alu RNA. Mol Cell Biol 1997,
17:1144-1151.
26. Kuersten S, Ohno M, Mattaj IW: Nucleocytoplasmic transport: Ran,
beta and beyond. Trends Cell Biol 2001, 11:497-503.
27. Dahlberg JE, Lund E: Functions of the GTPase Ran in RNA export
from the nucleus. Curr Opin Cell Biol 1998, 10:400-408.
28. Lund E, Dahlberg JE: Proofreading and amino acylation of
tRNAs before export from the nucleus. Science 1998,
282:2082-2085.
29. Daniels GR, Deininger PL: Repeat sequence families derived from
mammalian tRNA genes. Nature 1985, 317:819-822.
30. Sakamoto K, Okada N: Rodent type 2 Alu family, rat identifier
sequence, rabbit C family, and bovine or goat 73-bp repeat may
have evolved from tRNA genes. J Mol Evol 1985, 22:134-140.
31. Voeltz GK, Ongkasuwan J, Standart N, Steitz JA: A novel embryonic
poly(A) binding protein, ePAB, regulates mRNA deadenylation in
Xenopus egg extracts. Genes Dev 2001, 15:774-788.
32. Liu WM, Maraia RJ, Rubin CM, Schmid CW: Alu transcripts:
cytoplasmic localisation and regulation by DNA methylation.
Nucleic Acids Res 1994, 22:1087-1095.
33. Gibbs A, Calisher CH, Garcia-Arenal F (eds): Molecular Basis of
Virus Evolution. Proceedings of a Fundacion Juan March Symposium.
Cambridge: Cambridge University Press, 1995:603.
34. Malik HS, Burke WD, Eickbush TH: The age and evolution of
non-LTR retrotransposable elements. Mol Biol Evol 1999,
16:793-805.
35. •• Malik HS, Eickbush TH: Phylogenetic analysis of ribonuclease H
domains suggests a late, chimeric origin of LTR retrotransposable
elements and retroviruses. Genome Res 2001, 11:1187-1197.
The Eickbush group has led the way in using phylogeny to attack a fundamental problem in the evolution of mobile elements: who came first, and
who stole what from whom? Were new elements assembled from parts of
the old, or from pre-existing cellular parts; or did mobile elements both steal
cellular functions and bequeath new functions to the cell? This paper argues
that all LINE-like elements share a common ancestral reverse transcriptase
(RT) most closely related to the RT of certain mobile group II introns that also
move through RNA intermediates.
36. Feng Q, Schumann G, Boeke JD: Retrotransposon R1Bm
endonuclease cleaves the target sequence. Proc Natl Acad Sci
USA 1998, 95:2083-2088.
37. Ullu E, Tschudi C: Alu sequences are processed 7SL RNA genes.
Nature 1984, 312:171-172.
38. Goodwin TJ, Poulter RT: The diversity of retrotransposons in the
yeast Cryptococcus neoformans. Yeast 2001, 18:865-880.
39. Goodwin TJ, Ormandy JE, Poulter RT: L1-like non-LTR
retrotransposons in the yeast Candida albicans. Curr Genet 2001,
39:83-91.
40. Kazazian HH Jr: An estimated frequency of endogenous insertional
mutations in humans. Nat Genet 1999, 22:130.
41. Deininger PL, Batzer MA: Alu repeats and human disease. Mol
Genet Metab 1999, 67:183-193.
42. Craigie R: Hotspots and warm spots: integration specificity of
retroelements. Trends Genet 1992, 8:187-190.
43. Peters G, Harada F, Dahlberg JE, Panet A, Haseltine WA, Baltimore D:
Low-molecular-weight RNAs of Moloney murine leukemia virus:
identification of the primer for RNA-directed DNA synthesis. J Virol
1977, 21:1031-1041.
44. Ikawa Y, Ross J, Leder P: An association between globin
messenger RNA and 60S RNA derived from Friend leukemia
virus. Proc Natl Acad Sci USA 1974, 71:1154-1158.
45. Linial M, Medeiros E, Hayward WS: An avian oncovirus mutant
(SE 21Q1b) deficient in genomic RNA: biological and biochemical
characterization. Cell 1978, 15:1371-1381.
46. • Boissinot S, Entezam A, Furano AV: Selection against deleterious
LINE-1-containing loci in the human lineage. Mol Biol Evol 2001,
18:926-935.
This is the clearest evidence to date showing that LINEs can be subject to
negative selection, and are kicked out of genomic regions where they have
the potential to do harm.
47. Aksoy S, Williams S, Chang S, Richards FF: SLACS retrotransposon
from Trypanosoma brucei gambiense is similar to mammalian
LINEs. Nucleic Acids Res 1990, 18:785-792.
48. Malik HS, Eickbush TH: NeSL-1, an ancient lineage of site-specific
non-LTR retrotransposons from Caenorhabditis elegans. Genetics
2000, 154:193-203.
49. • Perez-Gonzalez CE, Eickbush TH: Dynamics of R1 and R2 elements
in the rDNA locus of Drosophila simulans. Genetics 2001,
158:1557-1567.
The mobile retroelements R1 and R2 insert specifically into tandem repeats
of arthropod rRNA genes but, as determined by a new PCR assay, R1 and
R2 seldom spread to other chromosomes in the population. This indicates
that recombinational bias purges new R1 and R2 insertions, and also suggests
that the actual frequency of R1 and R2 retrotransposition is much higher
than previously thought.
50. Wellauer PK, Dawid IB: The structural organization of ribosomal
DNA in Drosophila melanogaster. Cell 1977, 10:193-212.
51. Aye M, Dildine SL, Claypool JA, Jourdain S, Sandmeyer SB:
A truncation mutant of the 95-kilodalton subunit of transcription
factor IIIC reveals asymmetry in Ty3 integration. Mol Cell Biol
2001, 21:7839-7851.
52. Kim JM, Vanguri S, Boeke JD, Gabriel A, Voytas DF: Transposable
elements and genome organization: a comprehensive survey of
retrotransposons revealed by the complete Saccharomyces
cerevisiae genome sequence. Genome Res 1998, 8:464-478.
53. •• Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG,
Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of the
human genome. Science 2001, 291:1304-1351.
The human genome sequence has fully confirmed the evidence, painstakingly
assembled by the Bernardi group over many years, documenting large-scale
inhomogeneities within the human genome. These unexpected local differences
in base composition, gene density and density of mobile elements result in
‘isochores’ — extended chromosomal regions of similar composition and
presumably function — but we still do not know how or why isochores arise.
Perhaps clever bioinformatics will suggest some useful experiments to probe
the significance and origin of isochores.
54. Pavlicek A, Jabbari K, Paces J, Paces V, Hejnar JV, Bernardi G:
Similar integration but different stability of Alus and LINEs in the
human genome. Gene 2001, 276:39-45.
55. • Pavlicek A, Paces J, Clay O, Bernardi G: A compact view of
isochores in the draft human genome sequence. FEBS Lett 2002,
511:165-169.
See annotation Venter et al. (2001) [53••].
56. Chu WM, Ballard R, Carpick BW, Williams BR, Schmid CW:
Potential Alu function: regulation of the activity of
double-stranded RNA-activated kinase PKR. Mol Cell Biol 1998,
18:58-68.
57. Liu WM, Chu WM, Choudary PV, Schmid CW: Cell stress and
translational inhibitors transiently increase the abundance of
mammalian SINE transcripts. Nucleic Acids Res 1995,
23:1758-1765.
58. •• Li TH, Schmid CW: Differential stress induction of individual Alu
loci: implications for transcription and retrotransposition. Gene
2001, 276:135-141.
This is the most recent in a series of experiments arguing, against common
wisdom, that SINEs may be useful or even subject to positive selection. It is
only fitting that this provocative hypothesis should emerge from the group
that discovered Alu elements over two decades ago.
59. Scott PH, Cairns CA, Sutcliffe JE, Alzuherri HM, McLees A, Winter AG,
White RJ: Regulation of RNA polymerase III transcription during
cell cycle entry. J Biol Chem 2001, 276:1005-1014.
60. Cost GJ, Golding A, Schlissel MS, Boeke JD: Target DNA
chromatinization modulates nicking by L1 endonuclease. Nucleic
Acids Res 2001, 29:573-577.
61. Xie W, Gai X, Zhu Y, Zappulla DC, Sternglanz R, Voytas DF:
Targeting of the yeast Ty5 retrotransposon to silent chromatin is
mediated by interactions between integrase and Sir4p. Mol Cell
Biol 2001, 21:6606-6614.
62. • Yu F, Zingler N, Schumann G, Stratling WH: Methyl-CpG-binding
protein 2 represses LINE-1 expression and retrotransposition but
not Alu transcription. Nucleic Acids Res 2001, 29:4493-4501.
Using both transient and stable assays for transcription of artificially methylated
DNA, the authors show that methylation may differentially affect SINE and
LINE retroposition. This unexpected observation has the potential to explain
the uneven distribution of SINEs and LINEs. See also Greally (2002) [83].
63. Hammond SM, Bernstein E, Beach D, Hannon GJ: An RNA-directed
nuclease mediates post-transcriptional gene silencing in
Drosophila cells. Nature 2000, 404:293-296.
64. Caplen NJ, Parrish S, Imani F, Fire A, Morgan RA: Specific inhibition
of gene expression by small double-stranded RNAs in
invertebrate and vertebrate systems. Proc Natl Acad Sci USA
2001, 98:9742-9747.
65. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC:
Potent and specific genetic interference by double-stranded RNA
in Caenorhabditis elegans. Nature 1998, 391:806-811.
66. Elbashir SM, Lendeckel W, Tuschl T: RNA interference is mediated
by 21- and 22-nucleotide RNAs. Genes Dev 2001, 15:188-200.
67. •• Lagos-Quintana M, Rauhut R, Lendeckel W, Tuschl T: Identification
of novel genes coding for small expressed RNAs. Science 2001,
294:853-858.
In many eukaryotes, including humans, double-stranded RNAs are
processed into ‘micro RNAs’ (miRNAs) that can regulate expression of
complementary mRNAs, or mediate an unusual post-transcriptional gene-silencing phenomenon called RNAi (‘RNA interference’). RNAi appears to
be a potent cellular defence mechanism against viruses and DNA transposons,
and it may do the same for SINEs and LINEs (see Jensen et al. [1999] [77••]).
68. Lau NC, Lim LP, Weinstein EG, Bartel DP: An abundant class of tiny
RNAs with probable regulatory roles in Caenorhabditis elegans.
Science 2001, 294:858-862.
69. Lee RC, Ambros V: An extensive class of small RNAs in
Caenorhabditis elegans. Science 2001, 294:862-864.
70. Tabara H, Sarkissian M, Kelly WG, Fleenor J, Grishok A, Timmons L,
Fire A, Mello CC: The rde-1 gene, RNA interference, and
transposon silencing in C. elegans. Cell 1999, 99:123-132.
71. Ketting RF, Haverkamp TH, van Luenen HG, Plasterk RH: Mut-7 of
C. elegans, required for transposon silencing and RNA
interference, is a homolog of Werner syndrome helicase and
RNaseD. Cell 1999, 99:133-141.
72. Wu-Scharf D, Jeong B, Zhang C, Cerutti H: Transgene and
transposon silencing in Chlamydomonas reinhardtii by a
DEAH-box RNA helicase. Science 2000, 290:1159-1162.
73. Dougherty WG, Lindbo JA, Smith HA, Parks TD, Swaney S,
Proebsting WM: RNA-mediated virus resistance in
transgenic plants: exploitation of a cellular pathway possibly
involved in RNA degradation. Mol Plant Microbe Interact 1994,
7:544-552.
74. Mourrain P, Beclin C, Elmayan T, Feuerbach F, Godon C, Morel JB,
Jouette D, Lacombe AM, Nikic S, Picault N et al.: Arabidopsis
SGS2 and SGS3 genes are required for posttranscriptional
gene silencing and natural virus resistance. Cell 2000,
101:533-542.
75. Dalmay T, Horsefield R, Braunstein TH, Baulcombe DC: SDE3
encodes an RNA helicase required for post-transcriptional gene
silencing in Arabidopsis. EMBO J 2001, 20:2069-2078.
76. Jensen S, Gassama MP, Heidmann T: Cosuppression of I
transposon activity in Drosophila by I-containing sense and
antisense transgenes. Genetics 1999, 153:1767-1774.
77. •• Jensen S, Gassama MP, Heidmann T: Taming of transposable
elements by homology-dependent gene silencing. Nat Genet
1999, 21:209-212.
If SINEs and LINEs give rise to duplex RNA, as might be expected if these
retroelements are subject to readthrough transcription from external promoters,
RNA interference may be responsible for preventing sudden explosive
growth of SINE and LINE families.
78. Brosius J: Retroposons — seeds of evolution. Science 1991,
251:753.
79. Moran JV, DeBerardinis RJ, Kazazian HH Jr: Exon shuffling by
L1 retrotransposition. Science 1999, 283:1530-1534.
80. Goodier JL, Ostertag EM, Kazazian HH Jr: Transduction of
3′-flanking sequences is common in L1 retrotransposition. Hum
Mol Genet 2000, 9:653-657.
81. • Ferrigno O, Virolle T, Djabari Z, Ortonne JP, White RJ, Aberdam D:
Transposable B2 SINE elements can provide mobile RNA
polymerase II promoters. Nat Genet 2001, 28:77-81.
As mobile elements, SINEs and LINEs have the potential to destroy (by
insertional mutagenesis), to create (by generating functional retropseudogenes), and to empower (by giving old genes new promoters or regulatory
signals as described in this novel story). Thus, in genomic evolution, as in life,
mayhem is sometimes useful.
82. Orgel LE, Crick FH: Selfish DNA: the ultimate parasite. Nature
1980, 284:604-607.
83. Greally JM: Short interspersed transposable elements (SINEs) are
excluded from imprinted regions in the human genome. Proc Natl
Acad Sci USA 2002, 99:327-332.