Vol. 48 No. 3/2001, 587–598
QUARTERLY

I dedicate this review to the memory of Professor Jacek Augustyniak, who introduced me to the world of genes and genomes.

Review

The human genome structure and organization*

Wojciech Makałowski†

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, U.S.A.

Received: 22 January, 2001; accepted: 26 February, 2001

Human genetic information is encoded in two genomes: nuclear and mitochondrial. Both reflect the molecular evolution of humans, from the beginning of life (about 4.5 billion years ago) to the origin of the species Homo sapiens about 100 000 years ago. For this reason the human genome contains some features that are common to different groups of organisms and some that are unique to Homo sapiens. The 3.2 × 10⁹ base pairs of the human nuclear genome are packed into 23 chromosomes of different sizes. The smallest, chromosome 21, contains 5 × 10⁷ base pairs, while the largest, chromosome 1, contains 2.63 × 10⁸ base pairs. Although the nucleotide sequence of all chromosomes has been established, the organisation of the nuclear genome still raises questions. For example, the exact number of genes encoded by the human genome is still unknown, with estimates ranging from 30 to 150 thousand genes. Coding sequences represent a few percent of the human nuclear genome. The majority of the genome consists of repetitive sequences (about 50%) and noncoding unique sequences. This part of the genome is frequently, and wrongly, called junk DNA. The distribution of genes on chromosomes is irregular: DNA fragments with a low percentage of GC pairs encode fewer genes than fragments with a high percentage of GC pairs.

*Presented at the XXXVI Meeting of the Polish Biochemical Society, Poznań, 13 September 2000, Poland.
†Mailing address: NCBI/NLM/NIH, 45 Center Drive, MSC 6510, Bldg.
45, Room 6As.47A, Bethesda, MD 20892-6510, U.S.A., phone: (301) 435 5989; fax: (301) 480 2918; e-mail: [email protected]

Abbreviations: CDS, coding DNA sequence; EST, expressed sequence tag; FISH, fluorescence in situ hybridization; HERV, human endogenous retrovirus; LINE, long interspersed repetitive element; LTR, long terminal repeat; SAR, scaffold-attachment region; SINE, short interspersed repetitive element; TE, transposable element; UTR, untranslated region.

INTRODUCTION

HISTORICAL PERSPECTIVE

From the beginning of humanity, people have been interested in themselves. They were well aware of two aspects of living nature: an immense variability within each species and the tendency for characteristics of parents to be transmitted to their offspring. Pre-Socratic philosophers had already noticed that people share some characteristics: with some exceptions they have two hands, a nose and a large forehead. In other words, they are alike. On the other hand, everybody is different, and nobody would have trouble telling two gentlemen apart by such characteristics as eyes, cheeks, or shirts. Ancient people were also aware that this holds for both intra- and inter-species comparisons. The question arises: how does it happen that our children are more similar to their parents than to monkeys?

The problem already intrigued pre-Socratic philosophers. Probably the first person who publicly expressed his thoughts on the subject was Anaxagoras of Clazomenae. According to his teaching, seed material is carried from all parts of the body to the reproductive organs by the humors. Fertilization is the mixing of the seed material of father and mother. That all parts of the body participate in the production of seed material is documented by the fact that blue-eyed parents have blue-eyed children and baldheaded men have sons that become baldheaded, not a very good prospect for my own children.
The idea of panspermy or pangenesis was adopted and taught by the famous physician Hippocrates (about 460–377 B.C.) and was widely accepted until the end of the nineteenth century, including by Charles Darwin. One of the greatest scientists of all time, Aristotle of Stagira, had a different view of the problem. Aristotle's theory of inheritance, as described in one of his major works, De generatione animalium, was holistic. He held that the contributions of males and females were not equal. The semen of the male contributes the form-giving principle, eidos, while the menstrual blood, catamenia, of the female is the unformed substance shaped by the eidos of the semen. “The female always provides the material, the male provides that which fashions the material into shape; this in our view, is the specific characteristic of each sex: that is what it means to be male or to be female.” (Aristotle, 1965).

The twentieth century witnessed an accelerated development of biology, and with it the nature of the inheritance process came to be understood. Consequently, an effort to decipher the blueprint of our species began. Several biological discoveries were especially important for deciphering the human genome. Everything started with the rediscovery of Mendel's laws by Hugo Marie De Vries (1900), followed by Thomas H. Morgan's work linking chromosomes to heredity in 1910 (Morgan, 1910). In 1953, James D. Watson and Francis H.C. Crick unraveled the structure of DNA (Watson & Crick, 1953a; Watson & Crick, 1953b). In 1961, J. Heinrich Matthaei and Marshall Nirenberg performed the experiments that made it possible to decipher the genetic code. With the development of fast methods of DNA sequencing in the mid-seventies (Maxam & Gilbert, 1977; Sanger et al., 1977), followed by the automation of cloning and sequencing in the nineties, the way to understanding our blueprint became clear. By now, many complete genomes of both prokaryotic and eukaryotic organisms have been sequenced.
For up-to-date tables of completed genomes, go to http://www.ebi.ac.uk/genomes/.

On June 26, 2000, virtually all news agencies in the world announced the completion of a working draft of the human genome. This accomplishment was considered so important for humankind that instead of being announced at a scientific conference or in a scientific journal, as used to be the case with scientific milestones, it was presented at a special press conference in the White House in Washington, D.C. Within several days the faces of the major players from both the private and public sectors appeared on magazine covers around the world, including the Polish weeklies Polityka and Wprost. It is worth pointing out that the public genome project had already completed the sequence of two chromosomes: 22 (December, 1999) (Dunham et al., 1999) and 21 (May, 2000) (Hattori et al., 2000). The working draft of the human genome was published by both projects last January.

HUMAN GENOME

GENERAL INFORMATION

Our genetic material is stored in two organelles: the nucleus and the mitochondria. This review is focused on the nuclear genome, in which 3.2 billion bp are packed into 22 pairs of autosomes and two sex chromosomes, X and Y. Human chromosomes are not of equal size; the smallest, chromosome 21, is 54 million bp long; the largest, chromosome 1, is almost five times bigger at 249 million bp (see Table 1).

Genomic sequences can be divided in several ways. From the functional point of view we can distinguish genes, pseudogenes, and non-coding DNA (Fig. 1). Only a minute fraction of the genome (about 3%) codes for proteins. There are many pseudogenes in the human genome (0.5%), but most of the genome consists of introns and intergenic DNA. Almost half of these sequences consist of different transposons; moreover, the remaining non-coding DNA most likely originated from transposable elements as well, but with time they have mutated beyond recognition.

Table 1. Physical sizes of human chromosomes

    Chromosome    Size (Mbp)
    1             249
    2             237
    3             192
    4             183
    5             174
    6             165
    7             153
    8             135
    9             132
    10            132
    11            132
    12            123
    13            108
    14            105
    15             99
    16             84
    17             81
    18             75
    19             69
    20             63
    21             54
    22             57
    X             141
    Y              60

Figure 1. Fractions of different sequences in the human genome.

SEQUENCE COMPLEXITY

The human genome contains various levels of complexity, as demonstrated by reassociation kinetics. Such analyses of the human genome estimate that 60% of the DNA is either single copy or present in very few copies; 30% of the DNA is moderately repetitive; and 10% is considered highly repetitive. Various staining techniques demonstrate alternative banding patterns of mitotic chromosomes, referred to as karyograms. Although the three broad classes of DNA are scattered throughout the chromosome, chromosomal banding patterns reflect levels of compartmentalization of the DNA. The C-banding technique yields dark-staining regions of the chromosome (C bands), referred to as heterochromatin. These regions are highly coiled, contain highly repetitive DNA, and are typically found at the centromeres, telomeres, and on the Y chromosome. They are composed of long arrays of tandem repeats, and therefore some may have a nucleotide composition that differs significantly from the remainder of the genome (approximately 40–42% GC). That means that they can be separated from the bulk of the genome by buoyant density (caesium chloride) gradient centrifugation. Gradient centrifugation results in a major band and three minor bands referred to as satellite bands, hence the term satellite DNA. The G-banding technique yields a pattern of alternating light and dark bands reflecting variations in base composition, time of replication, chromatin conformation, and the density of genes and repetitive sequences. Therefore, the karyograms define chromosomal organization and allow for identification of the different chromosomes.
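The base-composition differences mentioned above (a bulk genome of roughly 40–42% GC against regions that deviate from it) can be made concrete with a small calculation. The Python sketch below is an illustration only and is not a method used in this review: the function names `gc_fraction` and `windowed_gc`, the window size, and the toy sequence are all assumptions introduced here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> list:
    """Return (start, GC fraction) for each full window of `window` bases."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: an AT-rich stretch followed by a GC-rich one.
    toy = "ATATTAAATT" + "GCGGCCGCGA"
    for start, frac in windowed_gc(toy, window=10, step=10):
        print(start, frac)
```

On real chromosomes the windows would be far larger (tens to hundreds of kilobases), and the choice of window size strongly affects how sharp the compositional boundaries appear.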
The darker bands, or G bands, are comparatively more condensed, more AT-rich, less gene-rich and replicate later than the DNA within the pale bands, which correspond to the R bands obtained by an alternative staining technique. More recently, these alternative banding patterns have been correlated with the level of compaction of scaffold-attachment regions (SARs).

The human genome may also be compartmentalized into large (> 300 kb) segments of DNA that are homogeneous in base composition, referred to as isochores (Bernardi, 2000), based on sequence analysis and compositional mapping. L1 and L2 are GC-poor (or 'light') isochore families representing about 62% of the genome. The H1, H2 and H3 (heavy) isochore classes are increasingly GC-rich. There is some correlation between isochores and chromosomal bands. G bands are almost exclusively composed of GC-poor isochores, with a minor contribution from H1. R bands can be classified further into T bands (R banding at elevated temperatures), which are composed mainly of H2 and H3 isochores, and R′ bands (non-T R bands), which comprise nearly equal amounts of GC-rich (primarily H1) and GC-poor isochores.

Additionally, there are five human chromosomes (13, 14, 15, 21, 22) distinguished at their terminus by a thin bridge with rounded ends, referred to as chromosomal satellites. These contain repeats of genes coding for rRNA and ribosomal proteins that coalesce to form the nucleolus and are known as the nucleolar organizing regions.

HUMAN GENE NUMBER

It is interesting that the number of genes encoded by our genome is not known and probably will not be known until long after completion of the human genome sequencing. Nevertheless, in the last decade, several groups have tried to answer this question using different methods (see Table 2). Unfortunately, the estimates differ widely, from as few as 28 000 to as many as 80 000 genes per haploid human genome.
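To get a feeling for what these estimates imply, one can divide the genome size quoted in this review by each estimate. The short Python sketch below does only that arithmetic. The uniform-spacing assumption is made purely for illustration (genes are in fact unevenly distributed), and the names `GENOME_BP`, `estimates` and `mean_spacing_kb` are introduced here.

```python
GENOME_BP = 3.2e9  # haploid nuclear genome size quoted in this review

# Lowest and highest gene-number estimates quoted in the text.
estimates = {"low": 28_000, "high": 80_000}


def mean_spacing_kb(genome_bp: float, n_genes: int) -> float:
    """Average genome length per gene in kilobases, assuming uniform spacing."""
    return genome_bp / n_genes / 1_000


spacing = {name: mean_spacing_kb(GENOME_BP, n) for name, n in estimates.items()}

if __name__ == "__main__":
    for name, kb in spacing.items():
        print(f"{name} estimate: about {kb:.0f} kb of genome per gene")
```

Under these assumptions the two extremes correspond to roughly one gene per 114 kb and one gene per 40 kb, respectively.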
The whole genomic community is so excited by this mysterious number that it decided to organize the Gene Sweepstake. The Gene Sweepstake will run between 2000 and 2003 and its detailed rules may be found at http://www.ensembl.org/Genesweep/. As of January 2001, 165 bets had been made, with gene numbers between 27 462 and 153 478 and a mean value of 61 710.

Table 2. Estimation of human gene number using different methods

    Gene number      Method                  Reference
    80 000           CpG islands             (Antequera & Bird, 1994)
    64 000           ESTs                    (Fields et al., 1994)
    35 000           ESTs                    (Ewing & Green, 2000)
    28 000–34 000    Comparative genomics    (Roest Crollius et al., 2000)
    30 000           Gene punctuation        (Yang et al., 2001)

EXON CHARACTERISTICS

In most human genes, coding sequences are interrupted by stretches of non-coding sequence, which are spliced out during mRNA maturation. Using the nomenclature introduced by Walter Gilbert (Gilbert, 1978), human genes look like mosaics, consisting of a series of exons (DNA sequences that can subsequently be found in the mature mRNA) and introns (silent DNA sequences that are absent from the final mRNA). As nothing in nature is simple, some introns carry significant information and even code for other complete genes (see the description of nested genes below). Initially, it was thought that introns occurred only in the untranslated parts of mRNA and that coding sequences (CDS) were not interrupted. However, it soon became clear that introns can be found in all domains of the mRNA molecule. Therefore, exons can be classified as follows: 5′ UTR exons, coding exons, 3′ UTR exons, and all possible combinations of those three main types, including single exons that cover the whole mRNA. The latter are very interesting from the evolutionary biology point of view, because in most cases they are retroposed copies of “regular” genes with introns.

Michael Zhang of Cold Spring Harbor Laboratory analyzed 4731 human exons (Zhang, 1998). It appears that human exons are relatively short, with a median length of 167 bp and a mean of 216 bp. The shortest exon was only 12 bp long, while the longest was 6609 bp. These numbers have to be taken with some caution because they are based on GenBank annotation, which is sometimes not very precise. Mixed exons (including both coding and non-coding sequences) tend to be longer than single-type exons, especially those at the end of the message; this is not surprising, since 3′ UTRs are relatively long in mammalian mRNAs. In our analysis of over 2000 human mRNA sequences, the median and mean sizes of human message domains were as follows: 118 nt and 191 nt for 5′ UTRs, 1191 nt and 1424 nt for CDSs, and 534 nt and 576 nt for 3′ UTRs, respectively (Makalowski et al., 1996; Makalowski & Boguski, 1998).

GENE DISTRIBUTION

Genes may be transcribed from either the same or the opposite strand of the genome, i.e. they may lie in the same orientation (tail-to-head) or in opposite orientations (head-to-head or tail-to-tail). Although the vast majority of the human genome consists of non-exonic sequences, a surprisingly large number of genes occupy the same genomic space. About 6% of human genes reside in introns of other genes (Wong et al., 2000). For example, intron 27 of the NF1 gene hosts three other genes that have small introns of their own, suggesting that they are not products of retroposition (see Fig. 2). Additionally, over 100 gene pairs overlap at the 3′ end, i.e. their 3′ UTRs occupy the same region, though on different strands (I. Makalowska, personal communication). The TPR and MSF genes map to the same region of chromosome 1. The last exon of the TPR gene is 872 nt long and overlaps completely with the last exon of the MSF gene (200 nt). Interestingly, the very end of the MSF gene overlaps with an intron of the TPR gene (see Fig. 3).

Unlike in plant genomes, most non-exonic sequence in the human genome is accounted for by introns (Wong et al., 2000). However, genes are not equally distributed throughout the genome. There is a distinct association between GC-richness and gene density. This is consistent with the association of most genes with CpG islands, the 500–1000 bp GC-rich segments flanking (usually at the 5′ end) most housekeeping and many tissue-specific genes. The clustering of CpG islands, as demonstrated by fluorescence in situ hybridization, further delineates gene-poor and gene-rich chromosomal segments (Craig & Bickmore, 1994). As a consequence, more than half of human genes are located in the so-called “genomic core” (isochores H2 and H3), which comprises only 12% of the human genome (see Table 3).

Figure 2. An example of nested genes. The human sequence from chromosome 1 (GenBank accession number AC004526) was analyzed using GeneMachine (Makalowska et al., 2001). Connected closed boxes represent gene models as predicted by GenScan software (Burge & Karlin, 1997) and boxes with arrows represent results of a BLASTn search; AC004526 was used as a query against the nr database.

Figure 3. An example of overlapping genes. The human sequence from chromosome 1 (GenBank accession number AL133533) was analyzed using GeneMachine (Makalowska et al., 2001). Connected closed boxes represent gene models as predicted by GenScan software (Burge & Karlin, 1997) and open boxes with arrows represent results of a BLASTn search; AL133533 was used as a query against the nr database.

Table 3. Gene density in different isochores

                     Isochore type    Genome fraction    Gene fraction    Gene density
    Genomic core     H2 and H3        12%                54%              1/10 kbp
    Empty space      L, H1            88%                46%              1/100 kbp

GENE FAMILIES

Many genes can be clustered into groups of different sizes based on sequence similarity. The similarity between two genes ranges from genes coding for identical products to genes in which product similarity is barely detectable and/or limited to short sequence stretches called sequence motifs. Gene families arose during evolution through gene duplications over different periods of time, as reflected in sequence similarity. In general, more similar genes shared a common ancestor more recently than genes with weaker similarity, although gene conversion can result in very similar or identical gene copies regardless of the time of gene duplication. Gene duplication can occur by different mechanisms, such as unequal recombination or retroposition. Not all duplicated genes remain active; some of them end up in genomic oblivion and are called pseudogenes. Some pseudogenes can be rescued from genomic death by capturing a promoter and regulatory elements in the course of evolution, as happened with the θ-globin gene, which was rescued by an Alu element after 200 million years of silent existence (see discussion in Makalowski, 1995).

The histone gene family is an example of very similar genes. It consists of five genes that tend to be linked, although in differing arrays of variable copy number dispersed in the human genome. The individual genes of a particular histone family encode essentially identical products (i.e. all H4 genes code for the identical H4 protein). Analysis of individual human genomic clones has identified isolated histone genes, e.g. H4, clusters of two or more histone genes, and clusters of all histone genes, e.g. H3-H4-H1-H3-H2A-H2B (Hentschel & Birnstiel, 1981). A majority of histone genes form a large cluster on human chromosome 6 (6p21.3) and a small cluster at 1q21. Interestingly, histone genes lack introns, a rare feature for eukaryotic genes.

Genes that encode ribosomal RNA (rRNA) total about 0.4% of the DNA in the human genome. The individual genes of a particular rRNA family are essentially identical. The 28S, 5.8S and 18S rRNA genes are clustered with spacer units in tandem arrays of approximately 60 copies each, yielding about 2 million bp of DNA.
These clusters are present on the short arms of five acrocentric chromosomes and form the nucleolar organizing regions, giving approximately 300 copies in total. These three rRNA genes are transcribed as a single unit and then cleaved. 5S rRNA genes are clustered on chromosome 1q.

Some genes in the human genome share highly conserved amino-acid domains despite weak overall similarity. These often have developmental functions. There are nine dispersed paired box (Pax) genes that contain highly conserved DNA-binding domains with six α-helices. The homeobox or Hox genes share a common 60 amino-acid sequence. In humans there are four Hox gene clusters, each on a different chromosome. However, an individual gene in a cluster shows greater similarity to its counterpart gene in another cluster than to the other genes in the same cluster.

There are also pseudogenes that are the result of retroposition (retropseudogenes). These pseudogenes lack introns and the flanking DNA sequences of the functional locus and therefore are not products of gene duplication. The generation of these types of elements depends on the reverse transcriptase of other retroelements such as LINEs.

REPETITIVE SEQUENCES

The human genome is occupied by stretches of DNA sequence of various lengths that exist in variable copy number. These repetitive sequences may be arranged in tandem or dispersed throughout the genome. Repetitive sequences may be classified by function, dispersal pattern, and sequence relatedness. Satellite DNA typically refers to highly repetitive sequences with no known function, while interspersed repeat sequences are typically the products of transposable element integration, including retrogenes and retropseudogenes of a functional gene. For an up-to-date list of human repetitive elements visit RepBase at http://www.girinst.org/.
GENOMIC DUPLICATIONS

Thirty years ago, Susumu Ohno put forth a hypothesis about two duplications of the whole genome in the early stages of vertebrate evolution (Ohno, 1970). According to his hypothesis, most vertebrate gene families should give three or four well-defined branches, as presented in Fig. 4. Unfortunately, analysis of over 10 000 vertebrate gene families does not support Ohno's hypothesis (Makalowski, unpublished observation). Nevertheless, duplications in the human genome do exist and they play a significant role in the evolution of genes and the genome. Although sometimes very large, they appear to be on a local, not a global scale. For example, the comparison of the complete human chromosome 21 sequence with both itself and other human sequences revealed many large duplications, with the largest intra-chromosomal duplication being 189 kb (positions 188–377 and 14 795–15 002 in the q arm) and the largest detected inter-chromosomal duplication being a region of over 100 kb from the q arm of chromosome 21 (position 646–751) duplicated in chromosome 22 (position 45–230) (RIKEN, 2000).

Figure 4. A hypothetical phylogenetic tree of a vertebrate gene family under Ohno's hypothesis of two genome duplications in early vertebrate evolution. The Drosophila gene represents an outgroup and the four clusters of the gene family are encircled. An asterisk (*) marks the first genome duplication and a hash sign (#) marks the points of the second genome duplication. Different branch lengths suggest different evolutionary rates after the ancestral gene duplication.

MICROSATELLITES, MINISATELLITES, AND MACROSATELLITES

Microsatellites are small arrays of short simple tandem repeats, primarily 4 bp or less. Different arrays are found dispersed throughout the genome, although dinucleotide CA/TG repeats are the most common, making up 0.5% of the genome. Runs of As and Ts are common as well. Microsatellites have no known functions.
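The definition above (tandem repeats of a unit of 4 bp or less) translates directly into a simple sequence scan. The following Python sketch is one possible way to do it, using a regular expression with a back-reference. It is not a tool used in this review, and the function name `find_microsatellites`, the `min_copies` threshold of five copies, and the test sequence are assumptions chosen for illustration.

```python
import re


def find_microsatellites(seq: str, min_copies: int = 5, max_unit: int = 4):
    """Yield (start, unit, copies) for tandem repeats with unit length <= max_unit.

    min_copies (assumed to be >= 2) is an arbitrary illustrative threshold.
    """
    # Shortest possible unit of 1..max_unit bases (lazy quantifier), followed
    # by at least (min_copies - 1) further copies of that same unit.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        yield match.start(), unit, len(match.group(0)) // len(unit)


if __name__ == "__main__":
    # Hypothetical sequence: a (CA)6 dinucleotide repeat and a run of seven As.
    example = "GG" + "CA" * 6 + "GT" + "A" * 7 + "C"
    for start, unit, copies in find_microsatellites(example):
        print(start, unit, copies)
```

Because the unit is matched lazily, a run such as CACACACACACA is reported with the unit CA rather than CACA. A scan like this only locates repeats; it says nothing about whether they have a function.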
However, CA/TG dinucleotide repeats can form the Z-DNA conformation in vitro, which may indicate some function. Variation in the repeat unit copy number of microsatellites apparently occurs by replication slippage. The expansion of trinucleotide repeats within genes has been associated with genetic disorders such as Huntington disease or fragile-X syndrome.

Minisatellites are tandemly repeated sequences of DNA with lengths ranging from 1 kbp to 15 kbp. For example, telomeric DNA sequences contain 10–15 kb of hexanucleotide repeats, most commonly TTAGGG in the human genome, at the termini of the chromosomes. These sequences are added by telomerase to ensure complete replication of the chromosome.

Macrosatellites are very long arrays, up to hundreds of kilobases, of tandemly repeated DNA. There are three satellite bands observed by buoyant density centrifugation. However, not all satellite sequences are resolved by density gradient centrifugation, e.g. alpha satellite DNA or alphoid DNA, which constitutes the bulk of centromeric heterochromatin on all chromosomes. The interchromosomal divergence of the alpha satellite families allows the different chromosomes to be distinguished by fluorescence in situ hybridization (FISH).

TRANSPOSABLE ELEMENTS

The human genome contains interspersed repeat sequences that have largely amplified in copy number by movement throughout the genome. Those sequences (transposable elements or TEs) can be divided into two classes based on the mode of transposition (Finnegan, 1989). The Class I elements are TEs which transpose by replication that involves an RNA intermediate, which is reverse transcribed back to DNA prior to reinsertion. These are called retroelements and include LTR transposons, which are structurally similar to integrated retroviruses, non-LTR elements (LINEs and SINEs), and retrogenes (see Fig. 5).
Class II elements move by a conservative cut-and-paste mechanism: the excision of the donor element is followed by its reinsertion elsewhere in the genome. Integration of Class I and Class II transposable elements results in the duplication of a short sequence of DNA, the target site. There are about 500 families of such transposons. Most transposition has occurred via an RNA intermediate, yielding classes of sequences referred to as retroelements (more than 400 families, e.g. Alu, L1, retrogenes, MIR). However, there is also evidence of ancient DNA-mediated transposition (more than 60 families of class II (DNA) transposons, e.g. THE-1, Charlie, Tigger, mariner).

Figure 5. The structure of different human transposable elements. Open arrows denote duplicated target sites and closed arrows denote long terminal repeats (LTRs). The following abbreviations are used: CP, capsid; NC, nucleocapsid; Pr, proteinase; RT, reverse transcriptase; Int, integrase; ORF, open reading frame; A and B denote the polymerase III internal promoter.

RETROELEMENTS

Short interspersed repetitive elements (SINEs) and long interspersed repetitive elements (LINEs) are the two most abundant classes of repeats in humans, and represent the two major classes of mammalian retrotransposons. Structural features shared by LINEs and SINEs include an A-rich 3′ end and the lack of long terminal repeats (LTRs); these features distinguish them from retroviruses and related retroelements. A full-length LINE (or L1 element) is approximately 6.1 kbp long, although most copies are truncated pseudogenes with various 5′ ends due to incomplete reverse transcription. There are about 100 000 copies of L1 sequences in our genome. Approximately 1% of the estimated 3500 full-length LINEs have functional RNA polymerase II promoter sequences along with the two intact open reading frames necessary to generate new L1 copies. Individual LINEs contain a poly-A tail and are flanked by direct repeats.
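The L1 figures quoted above can be combined in a few lines of arithmetic. The Python sketch below uses only numbers stated in this review; the variable names are introduced here for illustration, and the results are rough, order-of-magnitude values rather than measurements.

```python
L1_COPIES = 100_000        # approximate number of L1 copies quoted in the text
FULL_LENGTH_L1 = 3_500     # estimated number of full-length L1 elements
ACTIVE_FRACTION = 0.01     # about 1% of full-length elements appear functional
FULL_LENGTH_BP = 6_100     # approximate length of a full-length L1 (6.1 kbp)
GENOME_BP = 3.2e9          # haploid genome size quoted in this review

# Number of potentially active elements implied by the figures above.
active_elements = round(FULL_LENGTH_L1 * ACTIVE_FRACTION)

# Share of all L1 copies that are full length.
full_length_share = FULL_LENGTH_L1 / L1_COPIES

# DNA contributed by full-length copies alone, and its share of the genome.
full_length_bp_total = FULL_LENGTH_L1 * FULL_LENGTH_BP
genome_share = full_length_bp_total / GENOME_BP

if __name__ == "__main__":
    print(active_elements, full_length_share, full_length_bp_total, genome_share)
```

Taken at face value, these figures imply only a few dozen potentially active elements, with full-length copies accounting for well under 1% of the genome.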
LINE mobilization activity has been verified in both germinal and somatic tissues.

The Alu element, estimated at 500 000–900 000 copies in the human genome, is the primary SINE family and the most successful transposon in any genome. Sequence comparisons suggest that Alu repeats were derived from the 7SL RNA gene. Each Alu element is about 280 bp long with a dimeric structure, contains RNA polymerase III promoter sequences, and typically has an A-rich tail and flanking direct repeats (generated during integration). Although Alu elements are present in all primate genomes, more than 2000 Alu elements have integrated within the human genome subsequent to the divergence of humans from the great apes.

The human genome also contains families of retroviral-related sequences. These are characterized by sequences encoding enzymes for retroposition and contain LTRs. In addition, solitary LTRs of these elements may be located throughout the genome. There are several low-abundance (10–1000 copies) human endogenous retrovirus (HERV) families, with individual elements ranging from 6 to 10 kb, collectively encompassing about 1% of the genome.

CLASS II ELEMENTS

Class II elements contain inverted repeats (10–500 bp) at their termini and encode a transposase that catalyses transposition. They move by excision at the donor site and reinsertion elsewhere in the genome by a non-replicative mechanism. The human genome hosts a number of repeated sequences that originated from more than 60 different DNA transposons. The mariner 'fossils' present in our genome closely resemble members of three subfamilies identified in insects, adding to the already extensive evidence that horizontal transfer between genomes has been important in genomic evolution. Other human DNA transposon remains also show high similarity to sequences in distantly related organisms.
Nevertheless, the level of sequence divergence suggests that the activity of all identified elements predates human evolution.

CONCLUSIONS

The 3.2 billion bp of our genetic blueprint are packed into 23 pairs of chromosomes, or 46 DNA molecules. Only a fraction of the genome is occupied by protein-coding exons, and the majority of non-exonic sequences consists of repetitive elements. Functional exons contribute merely 2% of the genome, up to 50% of the genome is occupied by repetitive elements, and the remaining 48% is called unique DNA, most of which probably originated from mobile elements that have diverged over time beyond recognition. Different evolutionary forces shape the composition and structure of the human genome. It appears that different mobile elements play a significant role in this process (reviewed recently in Makalowski, 2000). The human genome is a dynamic entity: new functional elements appear and old ones become extinct as genes evolve according to the birth-and-death rule (Ota & Nei, 1994), similarly to the evolution of species. This confirms that the theory of evolution is truly universal and applies not only to all organisms but to all levels of life as well.

I would like to thank Izabela Makalowska for sharing unpublished data and Jakub Makalowski for preparing Fig. 5.

REFERENCES

Antequera, F. & Bird, A. (1994) Predicting the total number of human genes. Nat. Genet. 8, 114.
Aristotle (1965) De generatione animalium. Oxonii, E Typographeo Clarendoniano.
Bernardi, G. (2000) Isochores and the evolutionary genomics of vertebrates. Gene 241, 3–17.
Burge, C. & Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94.
Craig, J.M. & Bickmore, W.A. (1994) The distribution of CpG islands in mammalian chromosomes. Nat. Genet. 7, 376–382.
Dunham, I., Shimizu, N. et al. (1999) The DNA sequence of human chromosome 22. Nature 402, 489–495.
Ewing, B. & Green, P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet.
25, 232–234.
Fields, C., Adams, M.D., White, O. & Venter, C.O. (1994) How many genes in the human genome? Nat. Genet. 7, 345–346.
Finnegan, D.J. (1989) Eukaryotic transposable elements and genome evolution. Trends Genet. 5, 103–107.
Gilbert, W. (1978) Why genes in pieces? Nature 271, 501.
Hattori, M. et al. (2000) The DNA sequence of human chromosome 21. The chromosome 21 mapping and sequencing consortium (see comments). Nature 405, 311–319.
Hentschel, C.C. & Birnstiel, M.L. (1981) The organization and expression of histone gene families. Cell 25, 301–313.
Makalowska, I. et al. (2001) GeneMachine: A tool for sequence analysis and annotation. Submitted.
Makalowski, W. (1995) SINEs as a genomic scrap yard: An essay on genomic evolution; in The Impact of Short Interspersed Elements (SINEs) on the Host Genome (Maraia, R.J. & Austin, R.G., eds.) pp. 81–104, Landes Company.
Makalowski, W. (2000) Genomic scrap yard: How genomes utilize all that junk. Gene 259, 61–67.
Makalowski, W. & Boguski, M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. U.S.A. 95, 9407–9412.
Makalowski, W., Zhang, J. & Boguski, M.S. (1996) Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6, 846–857.
Maxam, A.M. & Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74, 560–564.
Morgan, T.H. (1910) Chromosomes and heredity. Amer. Nat. 44, 449–496.
Ohno, S. (1970) Evolution by Gene Duplication. Springer Verlag, New York.
Ota, T. & Nei, M. (1994) Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol. Biol. Evol. 11, 469–482.
RIKEN (2000) http://hgp.gsc.riken.go.jp/chr21/annotation.htm
Roest Crollius, H. et al. (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25, 235–238.
Sanger, F., Nicklen, S. & Coulson, A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74, 5463–5467.
Watson, J.D. & Crick, F.H.C. (1953a) Genetical implications of the structure of deoxyribonucleic acid. Nature 171, 964–967.
Watson, J.D. & Crick, F.H.C. (1953b) Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature 171, 737–738.
Wong, G.K., Passey, D.A., Huang, Y., Yang, Z. & Yu, J. (2000) Is Junk DNA mostly intron DNA? Genome Res. 10, 1672–1678.
Yang, C.Z. et al. (2001) Gene Punctuation. Submitted.
Zhang, M.Q. (1998) Statistical features of human exons and their flanking regions. Hum. Mol. Genet. 7, 919–932.

PART III INFORMATION PATHWAYS

24 Genes and Chromosomes
25 DNA Metabolism
26 RNA Metabolism
27 Protein Metabolism
28 Regulation of Gene Expression

The third and final part of this book explores the biochemical mechanisms underlying the apparently contradictory requirements for both genetic continuity and the evolution of living organisms. What is the molecular nature of genetic material? How is genetic information transmitted from one generation to the next with high fidelity? How do the rare changes in genetic material that are the raw material of evolution arise? How is genetic information ultimately expressed in the amino acid sequences of the astonishing variety of protein molecules in a living cell?

The fundamental unit of information in living systems is the gene. A gene can be defined biochemically as a segment of DNA (or, in a few cases, RNA) that encodes the information required to produce a functional biological product. The final product is usually a protein, so much of the material in Part III concerns genes that encode proteins. A functional gene product might also be one of several classes of RNA molecules.
The storage, maintenance, and metabolism of these informational units form the focal points of our discussion in Part III.

Modern biochemical research on gene structure and function has brought to biology a revolution comparable to that stimulated by the publication of Darwin’s theory on the origin of species nearly 150 years ago. An understanding of how information is stored and used in cells has brought penetrating new insights to some of the most fundamental questions about cellular structure and function. A comprehensive conceptual framework for biochemistry is now unfolding.

Today’s understanding of information pathways has arisen from the convergence of genetics, physics, and chemistry in modern biochemistry. This was epitomized by the discovery of the double-helical structure of DNA, postulated by James Watson and Francis Crick in 1953 (see Fig. 8–15). Genetic theory contributed the concept of coding by genes. Physics permitted the determination of molecular structure by x-ray diffraction analysis. Chemistry revealed the composition of DNA. The profound impact of the Watson-Crick hypothesis arose from its ability to account for a wide range of observations derived from studies in these diverse disciplines.

This revolution in our understanding of the structure of DNA inevitably stimulated questions about its function. The double-helical structure itself clearly suggested how DNA might be copied so that the information it contains can be transmitted from one generation to the next. Clarification of how the information in DNA is converted into functional proteins came with the discovery of both messenger RNA and transfer RNA and with the deciphering of the genetic code.

These and other major advances gave rise to the central dogma of molecular biology, comprising the three major processes in the cellular utilization of genetic information. The first is replication, the copying of parental DNA to form daughter DNA molecules with identical nucleotide sequences.
The second is transcription, the process by which parts of the genetic message encoded in DNA are copied precisely into RNA. The third is translation, whereby the genetic message encoded in messenger RNA is translated on the ribosomes into a polypeptide with a particular sequence of amino acids.

Figure: The central dogma of molecular biology, showing the general pathways of information flow via replication (DNA to DNA), transcription (DNA to RNA), and translation (RNA to protein). The term “dogma” is a misnomer. Introduced by Francis Crick at a time when little evidence supported these ideas, the dogma has become a well-established principle.

Part III explores these and related processes. In Chapter 24 we examine the structure, topology, and packaging of chromosomes and genes. The processes underlying the central dogma are elaborated in Chapters 25 through 27. Finally, we turn to regulation, examining how the expression of genetic information is controlled (Chapter 28).

A major theme running through these chapters is the added complexity inherent in the biosynthesis of macromolecules that contain information. Assembling nucleic acids and proteins with particular sequences of nucleotides and amino acids represents nothing less than preserving the faithful expression of the template upon which life itself is based. We might expect the formation of phosphodiester bonds in DNA or peptide bonds in proteins to be a trivial feat for cells, given the arsenal of enzymatic and chemical tools described in Part II. However, the framework of patterns and rules established in our examination of metabolic pathways thus far must be enlarged considerably to take into account molecular information. Bonds must be formed between particular subunits in informational biopolymers, avoiding either the occurrence or the persistence of sequence errors.
This has an enormous impact on the thermodynamics, chemistry, and enzymology of the biosynthetic processes. Formation of a peptide bond requires an energy input of only about 21 kJ/mol of bonds and can be catalyzed by relatively simple enzymes. But to synthesize a bond between two specific amino acids at a particular point in a polypeptide, the cell invests about 125 kJ/mol while making use of more than 200 enzymes, RNA molecules, and specialized proteins. The chemistry involved in peptide bond formation does not change because of this requirement, but additional processes are layered over the basic reaction to ensure that the peptide bond is formed between particular amino acids. Information is expensive.

The dynamic interaction between nucleic acids and proteins is another central theme of Part III. With the important exception of a few catalytic RNA molecules (discussed in Chapters 26 and 27), the processes that make up the pathways of cellular information flow are catalyzed and regulated by proteins. An understanding of these enzymes and other proteins can have practical as well as intellectual rewards, because they form the basis of recombinant DNA technology (introduced in Chapter 9).

Chapter 24 GENES AND CHROMOSOMES

24.1 Chromosomal Elements
24.2 DNA Supercoiling
24.3 The Structure of Chromosomes

DNA topoisomerases are the magicians of the DNA world. By allowing DNA strands or double helices to pass through each other, they can solve all of the topological problems of DNA in replication, transcription and other cellular transactions.
James Wang, article in Nature Reviews in Molecular Cell Biology, 2002

Supercoiling, in fact, does more for DNA than act as an executive enhancer; it keeps the unruly, spreading DNA inside the cramped confines that the cell has provided for it.
Nicholas Cozzarelli, Harvey Lectures, 1993

Almost every cell of a multicellular organism contains the same complement of genetic material: its genome. Just look at any human individual for a hint of the wealth of information contained in each human cell. Chromosomes, the nucleic acid molecules that are the repository of an organism’s genetic information, are the largest molecules in a cell and may contain thousands of genes as well as considerable tracts of intergenic DNA. The 16 chromosomes in the relatively small genome of the yeast Saccharomyces cerevisiae have molecular masses ranging from 1.5 × 10⁸ to 1 × 10⁹ daltons, corresponding to DNA molecules with 230,000 to 1,532,000 contiguous base pairs (bp). Human chromosomes range up to 279 million bp.

FIGURE 24–1 Bacteriophage T2 protein coat surrounded by its single, linear molecule of DNA. The DNA was released by lysing the bacteriophage particle in distilled water and allowing the DNA to spread on the water surface. An undamaged T2 bacteriophage particle consists of a head structure that tapers to a tail by which the bacteriophage attaches itself to the outer surface of a bacterial cell. All the DNA shown in this electron micrograph is normally packaged inside the phage head. (Scale bar: 0.5 μm.)

The very size of DNA molecules presents an interesting biological puzzle, given that they are generally much longer than the cells or viral packages that contain them (Fig. 24–1). In this chapter we shift our focus from the secondary structure of DNA, considered in Chapter 8, to the extraordinary degree of organization required for the tertiary packaging of DNA into chromosomes. We first examine the elements within viral and cellular chromosomes, then assess their size and organization. We next consider DNA topology, providing a description of the coiling of DNA molecules. Finally, we discuss the protein-DNA interactions that organize chromosomes into compact structures.
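As a rough consistency check on the yeast figures quoted above, the mass of a double-stranded DNA molecule can be estimated from its length. The sketch below assumes an average of about 650 daltons per base pair, a common rule of thumb that is not stated in the text; the constant and function names are illustrative only.

```python
# Assumption (not from the text): an average base pair of double-stranded
# DNA has a mass of roughly 650 daltons.
AVG_DALTONS_PER_BP = 650.0


def dsdna_mass_daltons(base_pairs: int,
                       daltons_per_bp: float = AVG_DALTONS_PER_BP) -> float:
    """Estimate the molecular mass (in daltons) of a double-stranded DNA."""
    return base_pairs * daltons_per_bp


if __name__ == "__main__":
    # Smallest and largest yeast chromosomes quoted in the text.
    for bp in (230_000, 1_532_000):
        print(f"{bp:>9,} bp is roughly {dsdna_mass_daltons(bp):.3g} Da")
```

Under this assumption both estimates land within a few percent of the quoted range of 1.5 × 10⁸ to 1 × 10⁹ daltons.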
24.1 Chromosomal Elements

Cellular DNA contains genes and intergenic regions, both of which may serve functions vital to the cell. The more complex genomes, such as those of eukaryotic cells, demand increased levels of chromosomal organization, and this is reflected in the chromosome’s structural features. We begin by considering the different types of DNA sequences and structural elements within chromosomes.

Genes Are Segments of DNA That Code for Polypeptide Chains and RNAs

Our understanding of genes has evolved tremendously over the last century. Classically, a gene was defined as a portion of a chromosome that determines or affects a single character or phenotype (visible property), such as eye color. George Beadle and Edward Tatum proposed a molecular definition of a gene in 1940. After exposing spores of the fungus Neurospora crassa to x rays and other agents known to damage DNA and cause alterations in DNA sequence (mutations), they detected mutant fungal strains that lacked one or another specific enzyme, sometimes resulting in the failure of an entire metabolic pathway. Beadle and Tatum concluded that a gene is a segment of genetic material that determines or codes for one enzyme: the one gene–one enzyme hypothesis. Later this concept was broadened to one gene–one polypeptide, because many genes code for proteins that are not enzymes or for one polypeptide of a multisubunit protein.

The modern biochemical definition of a gene is even more precise. A gene is all the DNA that encodes the primary sequence of some final gene product, which can be either a polypeptide or an RNA with a structural or catalytic function. DNA also contains other segments or sequences that have a purely regulatory function. Regulatory sequences provide signals that may denote the beginning or the end of genes, or influence the transcription of genes, or function as initiation points for replication or recombination (Chapter 28).
Some genes can be expressed in different ways to generate multiple gene products from one segment of DNA. The special transcriptional and translational mechanisms that allow this are described in Chapters 26 through 28.

We can make direct estimations of the minimum overall size of genes that encode proteins. As described in detail in Chapter 27, each amino acid of a polypeptide chain is coded for by a sequence of three consecutive nucleotides in a single strand of DNA (Fig. 24–2), with these “codons” arranged in a sequence that corresponds to the sequence of amino acids in the polypeptide that the gene encodes. A polypeptide chain of 350 amino acid residues (an average-size chain) corresponds to 1,050 bp. Many genes in eukaryotes and a few in prokaryotes are interrupted by noncoding DNA segments and are therefore considerably longer than this simple calculation would suggest.

FIGURE 24–2 Colinearity of the coding nucleotide sequences of DNA and mRNA and the amino acid sequence of a polypeptide chain. The triplets of nucleotide units in DNA determine the amino acids in a protein through the intermediary mRNA. One of the DNA strands serves as a template for synthesis of mRNA, which has nucleotide triplets (codons) complementary to those of the DNA. In some bacterial and many eukaryotic genes, coding sequences are interrupted at intervals by regions of noncoding sequences (called introns).

DNA (5′ to 3′):                   CGT GGA TAC ACT TTT GCC GTT TCT
DNA template strand (3′ to 5′):   GCA CCT ATG TGA AAA CGG CAA AGA
mRNA (5′ to 3′):                  CGU GGA UAC ACU UUU GCC GUU UCU
Polypeptide (amino terminus to carboxyl terminus): Arg Gly Tyr Thr Phe Ala Val Ser

(Portraits: George W. Beadle, 1903–1989; Edward L. Tatum, 1909–1975.)

How many genes are in a single chromosome?
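Before turning to that question, the colinearity shown in Fig. 24–2 and the minimum-size estimate can be sketched in a few lines. This is only an illustration: the codon table is deliberately limited to the eight codons in the figure, and the function names are made up for the example.

```python
# Partial codon table: only the eight codons that appear in Fig. 24-2.
CODON_TABLE = {
    "CGU": "Arg", "GGA": "Gly", "UAC": "Tyr", "ACU": "Thr",
    "UUU": "Phe", "GCC": "Ala", "GUU": "Val", "UCU": "Ser",
}

# Base-pairing rules for copying a DNA template into RNA.
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') complementary to a template written 3'->5'."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_3_to_5)


def translate(mrna: str) -> list[str]:
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    return [CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3)]


def min_coding_bp(residues: int) -> int:
    """Minimum coding length: three base pairs per amino acid residue."""
    return 3 * residues


if __name__ == "__main__":
    template = "GCACCTATGTGAAAACGGCAAAGA"  # template strand from Fig. 24-2
    mrna = transcribe(template)
    print(mrna)
    print(" ".join(translate(mrna)))
    print(min_coding_bp(350))
```

A real translation routine would use the full 64-codon genetic code and handle start and stop codons; the reduced table keeps the example short.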
The Escherichia coli chromosome, one of the prokaryotic genomes that have been completely sequenced, is a circular DNA molecule (in the sense of an endless loop rather than a perfect circle) with 4,639,221 bp. These base pairs encode about 4,300 genes for proteins and another 115 genes for stable RNA molecules. Among eukaryotes, the approximately 3.2 billion base pairs of the human genome include 30,000 to 35,000 genes on 24 different chromosomes.

DNA Molecules Are Much Longer Than the Cellular Packages That Contain Them

Chromosomal DNAs are often many orders of magnitude longer than the cells or viruses in which they are found (Fig. 24–1; Table 24–1). This is true of every class of organism or parasite.

Viruses

Viruses are not free-living organisms; rather, they are infectious parasites that use the resources of a host cell to carry out many of the processes they require to propagate. Many viral particles consist of no more than a genome (usually a single RNA or DNA molecule) surrounded by a protein coat.

Almost all plant viruses and some bacterial and animal viruses have RNA genomes. These genomes tend to be particularly small. For example, the genomes of mammalian retroviruses such as HIV are about 9,000 nucleotides long, and that of the bacteriophage Qβ has 4,220 nucleotides. Both types of viruses have single-stranded RNA genomes.

The genomes of DNA viruses vary greatly in size (Table 24–1). Many viral DNAs are circular for at least part of their life cycle. During viral replication within a host cell, specific types of viral DNA called replicative forms may appear; for example, many linear DNAs become circular and all single-stranded DNAs become double-stranded. A typical medium-sized DNA virus is bacteriophage λ (lambda), which infects E. coli. In its replicative form inside cells, λ DNA is a circular double helix. This double-stranded DNA contains 48,502 bp and has a contour length of 17.5 μm.
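The relation between base-pair count and contour length is a single multiplication, sketched below. The rise per base pair is left as a parameter because it is an assumption: the default of 0.34 nm is the usual B-form value, whereas the lengths quoted in this section (17.5 μm for λ DNA) are reproduced more closely with about 0.36 nm per base pair. The function name is illustrative.

```python
def contour_length_nm(base_pairs: int, rise_nm_per_bp: float = 0.34) -> float:
    """Contour length of a duplex DNA, assuming a fixed rise per base pair."""
    return base_pairs * rise_nm_per_bp


if __name__ == "__main__":
    lambda_bp = 48_502  # bacteriophage lambda, replicative form
    for rise in (0.34, 0.36):
        length_um = contour_length_nm(lambda_bp, rise) / 1_000
        print(f"rise {rise} nm/bp gives about {length_um:.1f} micrometers")
```

The same function applied to the 4,639,221 bp of the E. coli chromosome gives a length between 1.5 and 1.7 mm, depending on which rise is assumed.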
Bacteriophage φX174 is a much smaller DNA virus; the DNA in the viral particle is a single-stranded circle, and the double-stranded replicative form contains 5,386 bp. Although viral genomes are small, the contour lengths of their DNAs are much greater than the long dimensions of the viral particles that contain them. The DNA of bacteriophage T4, for example, is about 290 times longer than the viral particle itself (Table 24–1).

Bacteria

A single E. coli cell contains almost 100 times as much DNA as a bacteriophage λ particle. The chromosome of an E. coli cell is a single double-stranded circular DNA molecule. Its 4,639,221 bp have a contour length of about 1.7 mm, some 850 times the length of the E. coli cell (Fig. 24–3). In addition to the very large, circular DNA chromosome in their nucleoid, many bacteria contain one or more small circular DNA molecules that are free in the cytosol. These extrachromosomal elements are called plasmids (Fig. 24–4; see also p. 311). Most plasmids are only a few thousand base pairs long, but some contain more than 10,000 bp. They carry genetic information and undergo replication to yield daughter plasmids, which pass into the daughter cells at cell division. Plasmids have been found in yeast and other fungi as well as in bacteria.

In many cases plasmids confer no obvious advantage on their host, and their sole function appears to be self-propagation. However, some plasmids carry genes that are useful to the host bacterium. For example, some plasmid genes make a host bacterium resistant to antibacterial agents. Plasmids carrying the gene for the enzyme β-lactamase confer resistance to β-lactam antibiotics such as penicillin and amoxicillin (see Box 20–1). These and similar plasmids may pass from an antibiotic-resistant cell to an antibiotic-sensitive cell of the same or another bacterial species, making the recipient cell antibiotic resistant.
The extensive use of antibiotics in some human populations has served as a strong selective force, encouraging the spread of antibiotic resistance–coding plasmids (as well as transposable elements, described below, that harbor similar genes) in disease-causing bacteria and creating bacterial strains that are resistant to several antibiotics. Physicians are becoming increasingly reluctant to prescribe antibiotics unless a clear clinical need is confirmed. For similar reasons, the widespread use of antibiotics in animal feeds is being curbed.

TABLE 24–1 The Sizes of DNA and Viral Particles for Some Bacterial Viruses (Bacteriophages)

Virus         Size of viral DNA (bp)   Length of viral DNA (nm)   Long dimension of viral particle (nm)
φX174                  5,386                    1,939                          25
T7                    39,936                   14,377                          78
λ (lambda)            48,502                   17,460                         190
T4                   168,889                   60,800                         210

Note: Data on size of DNA are for the replicative form (double-stranded). The contour length is calculated assuming that each base pair occupies a length of 3.4 Å (see Fig. 8–15).

FIGURE 24–3 The length of the E. coli chromosome (1.7 mm) depicted in linear form relative to the length of a typical E. coli cell (2 μm).

FIGURE 24–4 DNA from a lysed E. coli cell. In this electron micrograph several small, circular plasmid DNAs are indicated by white arrows. The black spots and white specks are artifacts of the preparation.

Eukaryotes

A yeast cell, one of the simplest eukaryotes, has 2.6 times more DNA in its genome than an E. coli cell (Table 24–2). Cells of Drosophila, the fruit fly used in classical genetic studies, contain more than 35 times as much DNA as E. coli cells, and human cells have almost 700 times as much. The cells of many plants and amphibians contain even more. The genetic material of eukaryotic cells is apportioned into chromosomes, the diploid (2n) number depending on the species (Table 24–2).
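The genome-size comparisons in this passage are simple ratios of the base-pair totals listed in Table 24–2. A minimal sketch, with dictionary keys and the function name chosen only for illustration:

```python
# Total DNA per genome in base pairs, as listed in Table 24-2.
GENOME_BP = {
    "E. coli": 4_639_221,
    "yeast": 12_068_000,
    "Drosophila": 180_000_000,
    "human": 3_200_000_000,
}


def fold_vs_ecoli(organism: str) -> float:
    """Genome size of an organism expressed as a multiple of E. coli's."""
    return GENOME_BP[organism] / GENOME_BP["E. coli"]


if __name__ == "__main__":
    for name in ("yeast", "Drosophila", "human"):
        print(f"{name}: about {fold_vs_ecoli(name):.1f} times the E. coli genome")
```

The ratios agree with the rounded statements in the text: about 2.6 for yeast, more than 35 for Drosophila, and just under 700 for human cells.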
A human somatic cell, for example, has 46 chromosomes (Fig. 24–5). Each chromosome of a eukaryotic cell, such as that shown in Figure 24–5a, contains a single, very large, duplex DNA molecule. The DNA molecules in the 24 different types of human chromosomes (22 matching pairs plus the X and Y sex chromosomes) vary in length over a 25-fold range. Each type of chromosome in eukaryotes carries a characteristic set of genes. Interestingly, the number of genes does not vary nearly as much as does genome size (see Chapter 9 for a discussion of the types of sequences, besides genes, that contribute to genome size).

The DNA of one human genome (22 chromosomes plus X and Y or two X chromosomes), placed end to end, would extend for about a meter. Most human cells are diploid and each cell contains a total of 2 m of DNA. An adult human body contains approximately 10¹⁴ cells and thus a total DNA length of 2 × 10¹¹ km. Compare this with the circumference of the earth (4 × 10⁴ km) or the distance between the earth and the sun (1.5 × 10⁸ km): a dramatic illustration of the extraordinary degree of DNA compaction in our cells.

FIGURE 24–5 Eukaryotic chromosomes. (a) A pair of linked and condensed sister chromatids from a human chromosome. Eukaryotic chromosomes are in this state after replication and at metaphase during mitosis. (b) A complete set of chromosomes from a leukocyte from one of the authors. There are 46 chromosomes in every normal human somatic cell.

Eukaryotic cells also have organelles, mitochondria (Fig. 24–6) and chloroplasts, that contain DNA. Mitochondrial DNA (mtDNA) molecules are much smaller than the nuclear chromosomes. In animal cells, mtDNA contains fewer than 20,000 bp (16,569 bp in human mtDNA) and is a circular duplex.
Each mitochondrion typically has two to ten copies of this mtDNA molecule, and the number can rise to hundreds in certain cells when an embryo is undergoing cell differentiation. In a few organisms (trypanosomes, for example) each mitochondrion contains thousands of copies of mtDNA, organized into a complex and interlinked matrix known as a kinetoplast. Plant cell mtDNA ranges in size from 200,000 to 2,500,000 bp. Chloroplast DNA (cpDNA) also exists as circular duplexes and ranges in size from 120,000 to 160,000 bp. The evolutionary origin of mitochondrial and chloroplast DNAs has been the subject of much speculation. A widely accepted view is that they are vestiges of the chromosomes of ancient bacteria that gained access to the cytoplasm of host cells and became the precursors of these organelles (see Fig. 1–36).

Mitochondrial DNA codes for the mitochondrial tRNAs and rRNAs and for a few mitochondrial proteins. More than 95% of mitochondrial proteins are encoded by nuclear DNA. Mitochondria and chloroplasts divide when the cell divides. Their DNA is replicated before and during division, and the daughter DNA molecules pass into the daughter organelles.

FIGURE 24–6 A dividing mitochondrion. Some mitochondrial proteins and RNAs are encoded by one of the copies of the mitochondrial DNA (none of which are visible here). The DNA (mtDNA) is replicated each time the mitochondrion divides, before cell division.

TABLE 24–2 DNA, Gene, and Chromosome Content in Some Genomes

Organism                                 Total DNA (bp)    Number of chromosomes*   Approximate number of genes
Bacterium (Escherichia coli)                  4,639,221              1                       4,405
Yeast (Saccharomyces cerevisiae)             12,068,000             16†                      6,200
Nematode (Caenorhabditis elegans)            97,000,000             12‡                     19,000
Plant (Arabidopsis thaliana)                125,000,000             10                      25,500
Fruit fly (Drosophila melanogaster)         180,000,000             18                      13,600
Plant (Oryza sativa; rice)                  480,000,000             24                      57,000
Mouse (Mus musculus)                      2,500,000,000             40                      30,000–35,000
Human (Homo sapiens)                      3,200,000,000             46                      30,000–35,000

Note: This information is constantly being refined. For the most current information, consult the websites for the individual genome projects.
* The diploid chromosome number is given for all eukaryotes except yeast.
† Haploid chromosome number. Wild yeast strains generally have eight (octoploid) or more sets of these chromosomes.
‡ Number for females, with two X chromosomes. Males have an X but no Y, thus 11 chromosomes in all.

Eukaryotic Genes and Chromosomes Are Very Complex

Many bacterial species have only one chromosome per cell and, in nearly all cases, each chromosome contains only one copy of each gene. A very few genes, such as those for rRNAs, are repeated several times. Genes and regulatory sequences account for almost all the DNA in prokaryotes. Moreover, almost every gene is precisely colinear with the amino acid sequence (or RNA sequence) for which it codes (Fig. 24–2).

The organization of genes in eukaryotic DNA is structurally and functionally much more complex. The study of eukaryotic chromosome structure, and more recently the sequencing of entire eukaryotic genomes, has yielded many surprises. Many, if not most, eukaryotic genes have a distinctive and puzzling structural feature: their nucleotide sequences contain one or more intervening segments of DNA that do not code for the amino acid sequence of the polypeptide product. These nontranslated inserts interrupt the otherwise colinear relationship between the nucleotide sequence of the gene and the amino acid sequence of the polypeptide it encodes. Such nontranslated DNA segments in genes are called intervening sequences or introns, and the coding segments are called exons. Few prokaryotic genes contain introns.

In higher eukaryotes, the typical gene has much more intron sequence than sequences devoted to exons.
For example, in the gene coding for the single polypeptide chain of the avian egg protein ovalbumin (Fig. 24–7), the introns are much longer than the exons; altogether, seven introns make up 85% of the gene’s DNA. In the gene for the β subunit of hemoglobin, a single intron contains more than half of the gene’s DNA. The gene for the muscle protein titin is the intron champion, with 178 introns. Genes for histones appear to have no introns. In most cases the function of introns is not clear. In total, only about 1.5% of human DNA is “coding” or exon DNA, carrying information for protein or RNA products. However, when the much larger introns are included in the count, as much as 30% of the human genome consists of genes.

The relative paucity of genes in the human genome leaves a lot of DNA unaccounted for. Figure 24–8 provides a summary of sequence types. Much of the nongene DNA is in the form of repeated sequences of several kinds. Perhaps most surprising, about half the human genome is made up of moderately repeated sequences that are derived from transposable elements: segments of DNA, ranging from a few hundred to several thousand base pairs long, that can move from one location to another in the genome. Transposable elements (transposons) are a kind of molecular parasite, efficiently making a home within the host genome. Many have genes encoding proteins that catalyze the transposition process, described in more detail in Chapters 25 and 26. Some transposons in the human genome are active, moving at a low frequency, but most are inactive relics, evolutionarily altered by mutations. Although these elements generally do not encode proteins or RNAs that are used in human cells, they have played a major role in human evolution: movement of transposons can lead to the redistribution of other genomic sequences.

Another 3% or so of the human genome consists of highly repetitive sequences, also referred to as simple-sequence DNA or simple sequence repeats (SSR). These short sequences, generally less than 10 bp long, are sometimes repeated millions of times per cell. The simple-sequence DNA has also been called satellite DNA, so named because its unusual base composition often causes it to migrate as “satellite” bands (separated from the rest of the DNA) when fragmented cellular DNA samples are centrifuged in a cesium chloride density gradient. Studies suggest that simple-sequence DNA does not encode proteins or RNAs. Unlike the transposable elements, the highly repetitive DNA can have identifiable functional importance in human cellular metabolism, because much of it is associated with two defining features of eukaryotic chromosomes: centromeres and telomeres.

FIGURE 24–7 Introns in two eukaryotic genes. The gene for ovalbumin has seven introns (A to G), splitting the coding sequences into eight exons (L, and 1 to 7). The gene for the β subunit of hemoglobin has two introns and three exons, including one intron that alone contains more than half the base pairs of the gene. (Figure labels for the hemoglobin β subunit gene: exon 1, 90 bp; intron A, 131 bp; exon 2, 222 bp; intron B, 851 bp; exon 3, 126 bp.)

FIGURE 24–8 Types of sequences in the human genome. This pie chart divides the genome into transposons (transposable elements), genes, and miscellaneous sequences. (Chart labels: transposons 45%, comprising LINEs 21%, SINEs 13%, and retroviruslike 8%; genes 30%, comprising exons 1.5% and introns and noncoding segments 28.5%; miscellaneous 25%, comprising SSR 3%, SD 5%, and unlisted “?” 17%.) There are four main classes of transposons. Long interspersed elements (LINEs), 6 to 8 kbp long (1 kbp = 1,000 bp), typically include a few genes encoding proteins that catalyze transposition. The genome has about 850,000 LINEs. Short interspersed elements (SINEs) are about 100 to 300 bp long.
Of the 1.5 million in the human genome more than 1 million are Alu elements, so called because they generally include one copy of the recognition sequence for AluI, a restriction endonuclease (see Fig. 9–3). The genome also contains 450,000 copies of retroviruslike transposons, 1.5 to 11 kbp long. Although these are “trapped” in the genome and cannot move from one cell to another, they are evolutionarily related to the retroviruses (Chapter 26), which include HIV. A final class of transposons (making up 1% and not shown here) consists of a variety of transposon remnants that differ greatly in length. About 30% of the genome consists of sequences included in genes for proteins, but only a small fraction of this DNA is in exons (coding sequences). Miscellaneous sequences include simple-sequence repeats (SSR) and large segmental duplications (SD), the latter being segments that appear more than once in different locations. Among the unlisted sequence elements (denoted by a question mark) are genes encoding RNAs (which can be harder to identify than genes for proteins) and remnants of transposons that have been evolutionarily altered so that they are now hard to identify. 8885d_c24_920-947 930 2/11/04 Chapter 24 Telomere 1:36 PM Page 930 mac76 mac76:385_reb: Genes and Chromosomes Centromere Telomere SUMMARY 24.1 Chromosomal Elements ■ Genes are segments of a chromosome that contain the information for a functional polypeptide or RNA molecule. In addition to genes, chromosomes contain a variety of regulatory sequences involved in replication, transcription, and other processes. ■ Genomic DNA and RNA molecules are generally orders of magnitude longer than the viral particles or cells that contain them. ■ Many genes in eukaryotic cells, and a few in bacteria, are interrupted by noncoding sequences called introns. The coding segments separated by introns are called exons. ■ Less than one-third of human genomic DNA consists of genes. 
Much of the remainder consists of repeated sequences of various types. Nucleic acid parasites known as transposons account for about half of the human genome. ■ Eukaryotic chromosomes have two important special-function repetitive DNA sequences: centromeres, which are attachment points for the mitotic spindle, and telomeres, located at the ends of chromosomes. Unique sequences (genes), dispersed repeats, and multiple replication origins FIGURE 24–9 Important structural elements of a yeast chromosome. The centromere (Fig. 24–9) is a sequence of DNA that functions during cell division as an attachment point for proteins that link the chromosome to the mitotic spindle. This attachment is essential for the equal and orderly distribution of chromosome sets to daughter cells. The centromeres of Saccharomyces cerevisiae have been isolated and studied. The sequences essential to centromere function are about 130 bp long and are very rich in AUT pairs. The centromeric sequences of higher eukaryotes are much longer and, unlike those of yeast, generally contain simple-sequence DNA, which consists of thousands of tandem copies of one or a few short sequences of 5 to 10 bp, in the same orientation. The precise role of simple-sequence DNA in centromere function is not yet understood. Telomeres (Greek telos, “end”) are sequences at the ends of eukaryotic chromosomes that help stabilize the chromosome. The best-characterized telomeres are those of the simpler eukaryotes. Yeast telomeres end with about 100 bp of imprecisely repeated sequences of the form (5)(TxGy)n (3)(AxCy)n where x and y are generally between 1 and 4. The number of telomere repeats, n, is in the range of 20 to 100 for most single-celled eukaryotes and generally more than 1,500 in mammals. The ends of a linear DNA molecule cannot be routinely replicated by the cellular replication machinery (which may be one reason why bacterial DNA molecules are circular). 
Repeated telomeric sequences are added to eukaryotic chromosome ends primarily by the enzyme telomerase (see Fig. 26–35).

Artificial chromosomes (Chapter 9) have been constructed as a means of better understanding the functional significance of many structural features of eukaryotic chromosomes. A reasonably stable artificial linear chromosome requires only three components: a centromere, telomeres at each end, and sequences that allow the initiation of DNA replication. Yeast artificial chromosomes (YACs; see Fig. 9–8) have been developed as a research tool in biotechnology. Similarly, human artificial chromosomes (HACs) are being developed for the treatment of genetic diseases by somatic gene therapy.

24.2 DNA Supercoiling

Cellular DNA, as we have seen, is extremely compacted, implying a high degree of structural organization. The folding mechanism must not only pack the DNA but also permit access to the information in the DNA. Before considering how this is accomplished in processes such as replication and transcription, we need to examine an important property of DNA structure known as supercoiling.

Supercoiling means the coiling of a coil. A telephone cord, for example, is typically a coiled wire. The path taken by the wire between the base of the phone and the receiver often includes one or more supercoils (Fig. 24–10). DNA is coiled in the form of a double helix, with both strands of the DNA coiling around an axis. The further coiling of that axis upon itself (Fig. 24–11) produces DNA supercoiling. As detailed below, DNA supercoiling is generally a manifestation of structural strain. When there is no net bending of the DNA axis upon itself, the DNA is said to be in a relaxed state.

We might have predicted that DNA compaction involved some form of supercoiling. Perhaps less predictable is that replication and transcription of DNA also affect and are affected by supercoiling.
Both processes […]

Instead of the extended right-handed supercoils characteristic of the plectonemic form, solenoidal supercoiling involves tight left-handed turns, similar to the shape taken up by a garden hose neatly wrapped on a reel. Although their structures are dramatically different, plectonemic and solenoidal supercoiling are two forms of negative supercoiling that can be taken up by the same segment of underwound DNA. The two forms are readily interconvertible. Although the plectonemic form is more stable in solution, the solenoidal form can be stabilized by protein binding and is the form found in chromatin. It provides a much greater degree of compaction (Fig. 24–24b). Solenoidal supercoiling is the mechanism by which underwinding contributes to DNA compaction.

FIGURE 24–24 Plectonemic and solenoidal supercoiling. (a) Plectonemic supercoiling takes the form of extended right-handed coils. Solenoidal negative supercoiling takes the form of tight left-handed turns about an imaginary tubelike structure. The two forms are readily interconverted, although the solenoidal form is generally not observed unless certain proteins are bound to the DNA. (b) Plectonemic (top) and solenoidal supercoiling of the same DNA molecule, drawn to scale. Solenoidal supercoiling provides a much greater degree of compaction.

SUMMARY 24.2 DNA Supercoiling

■ Most cellular DNAs are supercoiled. Underwinding decreases the total number of helical turns in the DNA relative to the relaxed, B form. To maintain an underwound state, DNA must be either a closed circle or bound to protein. Underwinding is quantified by a topological parameter called linking number, Lk.
■ Underwinding is measured in terms of specific linking difference, σ (also called superhelical density), which is (Lk − Lk0)/Lk0. For cellular DNAs, σ is typically −0.05 to −0.07, which means that approximately 5% to 7% of the helical turns in the DNA have been removed. DNA underwinding facilitates strand separation by enzymes of DNA metabolism.
■ DNAs that differ only in linking number are called topoisomers. Enzymes that underwind and/or relax DNA, the topoisomerases, catalyze changes in linking number. The two classes of topoisomerases, type I and type II, change Lk in increments of 1 or 2, respectively, per catalytic event.

24.3 The Structure of Chromosomes

The term “chromosome” is used to refer to a nucleic acid molecule that is the repository of genetic information in a virus, a bacterium, a eukaryotic cell, or an organelle. It also refers to the densely colored bodies seen in the nuclei of dye-stained eukaryotic cells, as visualized using a light microscope.

Chromatin Consists of DNA and Proteins

The eukaryotic cell cycle (see Fig. 12–41) produces remarkable changes in the structure of chromosomes (Fig. 24–25). In nondividing eukaryotic cells (in G0) and those in interphase (G1, S, and G2), the chromosomal material, chromatin, is amorphous and appears to be randomly dispersed in certain parts of the nucleus. In the S phase of interphase the DNA in this amorphous state replicates, each chromosome producing two sister chromosomes (called sister chromatids) that remain associated with each other after replication is complete. The chromosomes become much more condensed during prophase of mitosis, taking the form of a species-specific number of well-defined pairs of sister chromatids (Fig. 24–5).

Chromatin consists of fibers containing protein and DNA in approximately equal masses, along with a small amount of RNA. The DNA in the chromatin is very tightly associated with proteins called histones, which package and order the DNA into structural units called nucleosomes (Fig. 24–26). Also found in chromatin are many nonhistone proteins, some of which help maintain chromosome structure, others that regulate the expression of specific genes (Chapter 28).
Beginning with nucleosomes, eukaryotic chromosomal DNA is packaged into a succession of higher-order structures that ultimately yield the compact chromosome seen with the light microscope. We now turn to a description of this structure in eukaryotes and compare it with the packaging of DNA in bacterial cells.

FIGURE 24–25 Changes in chromosome structure during the eukaryotic cell cycle. Cellular DNA is uncondensed throughout interphase. The interphase period can be subdivided (see Fig. 12–41) into the G1 (gap) phase; the S (synthesis) phase, when the DNA is replicated; and the G2 phase, in which the replicated chromosomes cohere to one another. The DNA undergoes condensation in the prophase of mitosis. Cohesins (green) and condensins (red) are proteins involved in cohesion and condensation (discussed later in the chapter). The architecture of the cohesin-condensin-DNA complex is not yet established, and the interactions shown here are figurative, simply suggesting their role in condensation of the chromosome. During metaphase, the condensed chromosomes line up along a plane halfway between the spindle poles. One chromosome of each pair is linked to each spindle pole via microtubules that extend between the spindle and the centromere. The sister chromatids separate at anaphase, each drawn toward the spindle pole to which it is connected. After cell division is complete, the chromosomes decondense and the cycle begins anew. Figure labels: interphase (G1, S, G2) and mitosis (prophase, metaphase, anaphase); steps: replication and cohesion, condensation, alignment, separation; note at S phase: replication occurs from multiple origins of replication, and daughter chromatids are linked by cohesins.

Histones Are Small, Basic Proteins

FIGURE 24–26 Nucleosomes.
Regularly spaced nucleosomes consist of histone complexes bound to DNA. (a) Schematic illustration and (b) electron micrograph.

Found in the chromatin of all eukaryotic cells, histones have molecular weights between 11,000 and 21,000 and are very rich in the basic amino acids arginine and lysine (together these make up about one-fourth of the amino acid residues). All eukaryotic cells have five major classes of histones, differing in molecular weight and amino acid composition (Table 24–3). The H3 histones are nearly identical in amino acid sequence in all eukaryotes, as are the H4 histones, suggesting strict conservation of their functions. For example, only 2 of 102 amino acid residues differ between the H4 histone molecules of peas and cows, and only 8 differ between the H4 histones of humans and yeast. Histones H1, H2A, and H2B show less sequence similarity among eukaryotic species.

Each type of histone has variant forms, because certain amino acid side chains are enzymatically modified by methylation, ADP-ribosylation, phosphorylation, glycosylation, or acetylation. Such modifications affect the net electric charge, shape, and other properties of histones, as well as the structural and functional properties of the chromatin, and they play a role in the regulation of transcription (Chapter 28).

Nucleosomes Are the Fundamental Organizational Units of Chromatin

The eukaryotic chromosome depicted in Figure 24–5 represents the compaction of a DNA molecule about 10⁵ μm long into a cell nucleus that is typically 5 to 10 μm in diameter. This compaction involves several levels of highly organized folding. Subjection of chromosomes to treatments that partially unfold them reveals a structure in which the DNA is bound tightly to beads of protein, often regularly spaced (Fig. 24–26). The beads in this “beads-on-a-string” arrangement are complexes of histones and DNA.
The bead plus the connecting DNA that leads to the next bead form the nucleosome, the fundamental unit of organization upon which the higher-order packing of chromatin is built. The bead of each nucleosome contains eight histone molecules: two copies each of H2A, H2B, H3, and H4. The spacing of the nucleosome beads provides a repeating unit typically of about 200 bp, of which 146 bp are bound tightly around the eight-part histone core and the remainder serve as linker DNA between nucleosome beads. Histone H1 binds to the linker DNA. Brief treatment of chromatin with enzymes that digest DNA causes preferential degradation of the linker DNA, releasing histone particles containing 146 bp of bound DNA that have been protected from digestion. Researchers have crystallized nucleosome cores obtained in this way, and x-ray diffraction analysis reveals a particle made up of the eight histone molecules with the DNA wrapped around it in the form of a left-handed solenoidal supercoil (Fig. 24–27). A close inspection of this structure reveals why eukaryotic DNA is underwound even though eukaryotic cells lack enzymes that underwind DNA. Recall that the solenoidal wrapping of DNA in nucleosomes is but one form of supercoiling that can be taken up by underwound (negatively supercoiled) DNA. The tight wrapping of DNA around the histone core requires the removal of about one helical turn in the DNA. When the protein core of a nucleosome binds in vitro to a relaxed, closed-circular DNA, the binding introduces a negative supercoil. Because this binding process does not break the DNA or change the linking number, the formation of a negative solenoidal supercoil must be accompanied by a compensatory positive supercoil in the unbound region of the DNA (Fig. 24–28). As mentioned earlier, eukaryotic topoisomerases can relax positive supercoils. 
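The linking-number bookkeeping for nucleosome assembly on closed-circular DNA can be followed step by step in a few lines of Python. This is a minimal sketch, not a model of any real experiment. It assumes a helical repeat of 10.5 bp per turn for relaxed B-DNA (a value not stated in this passage), one nucleosome per 200 bp, and one negative solenoidal supercoil per bound histone core. The function name `assemble_chromatin` and the 4,200 bp example are hypothetical.

```python
BP_PER_TURN = 10.5        # assumption: helical repeat of relaxed B-DNA
BP_PER_NUCLEOSOME = 200   # repeating unit quoted in the text


def assemble_chromatin(length_bp: int, relax_positive: bool = True) -> dict:
    """Follow Lk while histone cores bind a relaxed, closed-circular DNA."""
    lk0 = length_bp / BP_PER_TURN                  # Lk of the relaxed circle
    n_nucleosomes = length_bp // BP_PER_NUCLEOSOME

    # Binding breaks no strands, so Lk is unchanged: each bound negative
    # (solenoidal) supercoil is offset by one unbound positive supercoil.
    bound_negative = n_nucleosomes
    unbound_positive = n_nucleosomes
    delta_lk = 0

    if relax_positive:
        # A topoisomerase removes the unbound positive supercoils;
        # each one removed lowers Lk by 1.
        delta_lk -= unbound_positive
        unbound_positive = 0

    return {
        "lk0": lk0,
        "nucleosomes": n_nucleosomes,
        "bound_negative": bound_negative,
        "unbound_positive": unbound_positive,
        "delta_lk": delta_lk,
        "sigma": delta_lk / lk0,   # specific linking difference (Lk - Lk0)/Lk0
    }


before = assemble_chromatin(4200, relax_positive=False)
after = assemble_chromatin(4200, relax_positive=True)
print(before["delta_lk"], after["delta_lk"])  # 0 -21
print(round(after["sigma"], 4))               # -0.0525
```

Under these assumptions a fully loaded circle ends up with a specific linking difference of about −0.05, which is in the range quoted earlier in the chapter for cellular DNA.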
Relaxing the unbound positive supercoil leaves the negative supercoil fixed (through its binding to the nucleosome histone core) and results in an overall decrease in linking number. Indeed, topoisomerases have proved necessary for assembling chromatin from purified histones and closed-circular DNA in vitro.

FIGURE 24–27 DNA wrapped around a nucleosome core. (a) Space-filling representation of the nucleosome protein core, with different colors for the different histones (PDB ID 1AOI). (b) Top and (c) side views of the crystal structure of a nucleosome with 146 bp of bound DNA. The protein is depicted as a gray surface contour, with the bound DNA in blue. The DNA binds in a left-handed solenoidal supercoil that circumnavigates the histone complex 1.8 times. A schematic drawing is included in (c) for comparison with other figures depicting nucleosomes.

TABLE 24–3 Types and Properties of Histones

                                                      Content of basic amino acids (% of total)
Histone   Molecular weight   Number of amino acid residues   Lys     Arg
H1*       21,130             223                             29.5     1.3
H2A*      13,960             129                             10.9     9.3
H2B*      13,774             125                             16.0     6.4
H3        15,273             135                              9.6    13.3
H4        11,236             102                             10.8    13.7

* The sizes of these histones vary somewhat from species to species. The numbers given here are for bovine histones.

Another factor that affects the binding of DNA to histones in nucleosome cores is the sequence of the bound DNA. Histone cores do not bind randomly to DNA; rather, they tend to position themselves at certain locations. This positioning is not fully understood but in some cases appears to depend on a local abundance of A=T base pairs in the DNA helix where it is in contact with the histones (Fig. 24–29).
The tight wrapping of the DNA around the nucleosome’s histone core requires compression of the minor groove of the helix at these points, and a cluster of two or three A=T base pairs makes this compression more likely.

Other proteins are required for the positioning of some nucleosome cores on DNA. In several organisms, certain proteins bind to a specific DNA sequence and then facilitate the formation of a nucleosome core nearby. Precise positioning of nucleosome cores can play a role in the expression of some eukaryotic genes (Chapter 28).

FIGURE 24–28 Chromatin assembly. (a) Relaxed, closed-circular DNA. (b) Binding of a histone core to form a nucleosome induces one negative supercoil. In the absence of any strand breaks, a positive supercoil must form elsewhere in the DNA (ΔLk = 0). (c) Relaxation of this positive supercoil by a topoisomerase leaves one net negative supercoil (ΔLk = −1).

FIGURE 24–29 Positioning of a nucleosome to make optimal use of A=T base pairs where the histone core is in contact with the minor groove of the DNA helix.

FIGURE 24–30 The 30 nm fiber, a higher-order organization of nucleosomes. (a) Schematic illustration of the probable structure of the fiber, showing nucleosome packing. (b) Electron micrograph.

Nucleosomes Are Packed into Successively Higher Order Structures

Wrapping of DNA around a nucleosome core compacts the DNA length about sevenfold. The overall compaction in a chromosome, however, is greater than 10,000-fold, ample evidence for even higher orders of structural organization. In chromosomes isolated by very gentle methods, nucleosome cores appear to be organized into a structure called the 30 nm fiber (Fig. 24–30). This packing requires one molecule of histone H1 per nucleosome core.
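The compaction figures quoted in this section can be checked with simple arithmetic. The Python sketch below rests on two assumptions that are not given in the text: a rise of 0.34 nm per base pair for B-DNA and a nucleosome bead roughly 10 nm across. The 200 bp repeat, the 146 bp core, the 10⁵ μm DNA length, and the 5 to 10 μm nucleus are the values given earlier in the section. All function names are hypothetical.

```python
RISE_NM_PER_BP = 0.34     # assumption: axial rise per base pair in B-DNA
BEAD_DIAMETER_NM = 10.0   # assumption: approximate size of a nucleosome bead


def contour_length_nm(n_bp: float) -> float:
    """Length of n_bp of extended B-DNA, in nanometers."""
    return n_bp * RISE_NM_PER_BP


def nucleosome_compaction(repeat_bp: int = 200) -> float:
    """Fold compaction when one repeat of DNA is packed into one bead."""
    return contour_length_nm(repeat_bp) / BEAD_DIAMETER_NM


def linker_bp(repeat_bp: int = 200, core_bp: int = 146) -> int:
    """Base pairs left over as linker DNA in each repeating unit."""
    return repeat_bp - core_bp


def overall_compaction(dna_length_um: float, nucleus_diameter_um: float) -> float:
    """Ratio of extended DNA length to the diameter of the nucleus."""
    return dna_length_um / nucleus_diameter_um


print(round(nucleosome_compaction(), 1))   # 6.8, i.e. about sevenfold
print(linker_bp())                         # 54
print(overall_compaction(1e5, 10.0))       # 10000.0
print(overall_compaction(1e5, 5.0))        # 20000.0
```

With these assumptions, the nucleosome level alone accounts for a factor of about 7, far short of the 10,000-fold or more implied by fitting 10⁵ μm of DNA into a 5 to 10 μm nucleus, which is why further levels of folding are needed.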
Organization into 30 nm fibers does not extend over the entire chromosome but is punctuated by regions bound by sequence-specific (nonhistone) DNA-binding proteins. The 30 nm structure also appears to depend on the transcriptional activity of the particular region of DNA. Regions in which genes are being transcribed are apparently in a less-ordered state that contains little, if any, histone H1. The 30 nm fiber, a second level of chromatin organization, provides an approximately 100-fold compaction of the DNA.

FIGURE 24–31 A partially unraveled human chromosome, revealing numerous loops of DNA attached to a scaffoldlike structure.

FIGURE 24–32 Loops of chromosomal DNA attached to a nuclear scaffold. The DNA in the loops is packaged as 30 nm fibers, so the loops are the next level of organization. Loops often contain groups of genes with related functions. Complete sets of histone-coding genes, as shown in this schematic illustration, appear to be clustered in loops of this kind. Unlike most genes, histone genes occur in multiple copies in many eukaryotic genomes.

The higher levels of folding are not yet understood, but it appears that certain regions of DNA associate with a nuclear scaffold (Fig. 24–31). The scaffold-associated regions are separated by loops of DNA with perhaps 20 to 100 kbp. The DNA in a loop may contain a set of related genes. For example, in Drosophila complete sets of histone-coding genes seem to cluster together in loops that are bounded by scaffold attachment sites (Fig. 24–32). The scaffold itself appears to contain several proteins, notably large amounts of histone H1 (located in the interior of the fiber) and topoisomerase II. The presence of topoisomerase II further emphasizes the relationship between DNA underwinding and chromatin structure. Topoisomerase II is so important to the maintenance of chromatin structure that inhibitors of this enzyme can kill
rapidly dividing cells. Several drugs used in cancer chemotherapy are topoisomerase II inhibitors that allow the enzyme to promote strand breakage but not the resealing of the breaks.

Evidence exists for additional layers of organization in eukaryotic chromosomes, each dramatically enhancing the degree of compaction. One model for achieving this compaction is illustrated in Figure 24–33. Higher-order chromatin structure probably varies from chromosome to chromosome, from one region to the next in a single chromosome, and from moment to moment in the life of a cell. No single model can adequately describe these structures. Nevertheless, the principle is clear: DNA compaction in eukaryotic chromosomes is likely to involve coils upon coils upon coils . . .

FIGURE 24–33 Compaction of DNA in a eukaryotic chromosome. Model for levels of organization that could provide DNA compaction in the chromosomes of eukaryotes. The levels take the form of coils upon coils. In cells, the higher-order structures (above the 30 nm fibers) are unlikely to be as uniform as depicted here. Levels labeled in the figure: DNA; “beads-on-a-string” form of chromatin; 30 nm fiber; one loop (~75,000 bp), attached to the nuclear scaffold; one rosette (6 loops); one coil (30 rosettes); two chromatids (10 coils each).

Condensed Chromosome Structures Are Maintained by SMC Proteins

A third major class of chromatin proteins, in addition to the histones and topoisomerases, is the SMC proteins (structural maintenance of chromosomes). The primary structure of SMC proteins consists of five distinct domains (Fig. 24–34a). The amino- and carboxyl-terminal globular domains, N and C, each of which has part of an ATP hydrolytic site, are connected by two regions of α-helical coiled-coil motifs (see Fig. 4–11) that are joined by a hinge domain. The proteins are generally dimeric, forming a V-shaped complex that is thought to be tied together through their hinge domains (Fig. 24–34b). One N and one C domain come together to form a complete ATP hydrolytic site at each end of the V.

Proteins in the SMC family are found in all types of organisms, from bacteria to humans. Eukaryotes have two major types, cohesins and condensins (Fig. 24–25). The cohesins play a substantial role in linking together sister chromatids immediately after replication and keeping them together as the chromosomes condense to metaphase. This linkage is essential if chromosomes are to segregate properly at cell division. The detailed mechanism by which cohesins link sister chromosomes, and the role of ATP hydrolysis, are not yet understood. The condensins are essential to the condensation of chromosomes as cells enter mitosis. In the laboratory, condensins bind to DNA in a manner that creates positive supercoils; that is, condensin binding causes the DNA to become overwound, in contrast to the underwinding induced by the binding of nucleosomes. It is not yet clear how this helps to compact the chromatin, although one possibility is presented in Figure 24–35.

FIGURE 24–34 Structure of SMC proteins. (a) The five domains of the SMC primary structure. N and C denote the amino-terminal and carboxyl-terminal domains, respectively. (b) Each polypeptide is folded so that the two coiled-coil domains wrap around each other and the N and C domains come together to form a complete ATP-binding site. Two of these domains are linked at the hinge region to form the dimeric V-shaped molecule. (c) Electron micrograph of SMC proteins from Bacillus subtilis.

FIGURE 24–35 Model for the effect of condensins on DNA supercoiling. Binding of condensins to a closed-circular DNA in the presence of topoisomerase I leads to the production of positive supercoils (+). Wrapping of the DNA about the condensin introduces positive supercoils because it wraps in the opposite sense to a solenoidal supercoil (see Fig. 24–24). The compensating negative supercoils (−) that appear elsewhere in the DNA are then relaxed by topoisomerase I. In the chromosome, it is the wrapping of the DNA about condensin that may contribute to DNA condensation.

Bacterial DNA Is Also Highly Organized

We now turn briefly to the structure of bacterial chromosomes. Bacterial DNA is compacted in a structure called the nucleoid, which can occupy a significant fraction of the cell volume (Fig. 24–36). The DNA appears to be attached at one or more points to the inner surface of the plasma membrane. Much less is known about the structure of the nucleoid than of eukaryotic chromatin. In E. coli, a scaffoldlike structure appears to organize the circular chromosome into a series of looped domains, as described above for chromatin. Bacterial DNA does not seem to have any structure comparable to the local organization provided by nucleosomes in eukaryotes. Histonelike proteins are abundant in E. coli (the best-characterized example is a two-subunit protein called HU, Mr 19,000), but these proteins bind and dissociate within minutes, and no regular, stable DNA-histone structure has been found.

The bacterial chromosome is a relatively dynamic molecule, possibly reflecting a requirement for more ready access to its genetic information. The bacterial cell division cycle can be as short as 15 min, whereas a typical eukaryotic cell may not divide for hours or even months. In addition, a much greater fraction of prokaryotic DNA is used to encode RNA and/or protein products. Higher rates of cellular metabolism in bacteria mean that a much higher proportion of the DNA is being transcribed or replicated at a given time than in most eukaryotic cells.

FIGURE 24–36 E. coli cells showing nucleoids. The DNA is stained with a dye that fluoresces when exposed to UV light. The light area defines the nucleoid. Note that some cells have replicated their DNA but have not yet undergone cell division and hence have multiple nucleoids. Scale bar: 2 μm.

With this overview of the complexity of DNA structure, we are now ready to turn, in the next chapter, to a discussion of DNA metabolism.

SUMMARY 24.3 The Structure of Chromosomes

■ The fundamental unit of organization in the chromatin of eukaryotic cells is the nucleosome, which consists of histones and a 200 bp segment of DNA. A core protein particle containing eight histones (two copies each of histones H2A, H2B, H3, and H4) is encircled by a segment of DNA (about 146 bp) in the form of a left-handed solenoidal supercoil.
■ Nucleosomes are organized into 30 nm fibers, and the fibers are extensively folded to provide the 10,000-fold compaction required to fit a typical eukaryotic chromosome into a cell nucleus. The higher-order folding involves attachment to a nuclear scaffold that contains histone H1, topoisomerase II, and SMC proteins.
■ Bacterial chromosomes are also extensively compacted into the nucleoid, but the chromosome appears to be much more dynamic and irregular in structure than eukaryotic chromatin, reflecting the shorter cell cycle and very active metabolism of a bacterial cell.

Key Terms

Terms in bold are defined in the glossary.
gene; genome; chromosome; phenotype; mutation; regulatory sequence; plasmid; intron; exon; simple-sequence DNA; satellite DNA; centromere; telomere; supercoil; relaxed DNA; topology; underwinding; linking number; specific linking difference (σ); superhelical density; topoisomers; topoisomerases; plectonemic; solenoidal; chromatin; histones; nucleosome; 30 nm fiber; SMC proteins; cohesins; condensins; nucleoid

Problems

1. Packaging of DNA in a Virus Bacteriophage T2 has a DNA of molecular weight 120 × 10⁶ contained in a head about 210 nm long. Calculate the length of the DNA (assume the molecular weight of a nucleotide pair is 650) and compare it with the length of the T2 head.

2. The DNA of Phage M13 The base composition of phage M13 DNA is A, 23%; T, 36%; G, 21%; C, 20%. What does this tell you about the DNA of phage M13?

3.
The Mycoplasma Genome The complete genome of the simplest bacterium known, Mycoplasma genitalium, is a circular DNA molecule with 580,070 bp. Calculate the molecular weight and contour length (when relaxed) of this molecule. What is Lk0 for the Mycoplasma chromosome? If σ = −0.06, what is Lk?

4. Size of Eukaryotic Genes An enzyme isolated from rat liver has 192 amino acid residues and is coded for by a gene with 1,440 bp. Explain the relationship between the number of amino acid residues in the enzyme and the number of nucleotide pairs in its gene.

5. Linking Number A closed-circular DNA molecule in its relaxed form has an Lk of 500. Approximately how many base pairs are in this DNA? How is the linking number altered (increases, decreases, doesn’t change, becomes undefined) when (a) a protein complex is bound to form a nucleosome, (b) one DNA strand is broken, (c) DNA gyrase and ATP are added to the DNA solution, or (d) the double helix is denatured by heat?

6. Superhelical Density Bacteriophage λ infects E. coli by integrating its DNA into the bacterial chromosome. The success of this recombination depends on the topology of the E. coli DNA. When the superhelical density (σ) of the E. coli DNA is greater than −0.045, the probability of integration is 20%; when σ is less than −0.06, the probability is 70%. Plasmid DNA isolated from an E. coli culture is found to have a length of 13,800 bp and an Lk of 1,222. Calculate σ for this DNA and predict the likelihood that bacteriophage λ will be able to infect this culture.

7. Altering Linking Number (a) What is the Lk of a 5,000 bp circular duplex DNA molecule with a nick in one strand? (b) What is the Lk of the molecule in (a) when the nick is sealed (relaxed)? (c) How would the Lk of the molecule in (b) be affected by the action of a single molecule of E. coli topoisomerase I? (d) What is the Lk of the molecule in (b) after eight enzymatic turnovers by a single molecule of DNA gyrase in the presence of ATP?
(e) What is the Lk of the molecule in (d) after four enzymatic turnovers by a single molecule of bacterial type I topoisomerase? (f) What is the Lk of the molecule in (d) after binding of one nucleosome?

8. Chromatin Early evidence that helped researchers define nucleosome structure is illustrated by the agarose gel below, in which the thick bands represent DNA. It was generated by briefly treating chromatin with an enzyme that degrades DNA, then removing all protein and subjecting the purified DNA to electrophoresis. Numbers at the side of the gel denote the position to which a linear DNA of the indicated size would migrate. What does this gel tell you about chromatin structure? Why are the DNA bands thick and spread out rather than sharply defined?
[Figure: agarose gel; size markers at 1,000 bp, 800 bp, 600 bp, 400 bp and 200 bp]

9. DNA Structure Explain how the underwinding of a B-DNA helix might facilitate or stabilize the formation of Z-DNA.

10. Maintaining DNA Structure (a) Describe two structural features required for a DNA molecule to maintain a negatively supercoiled state. (b) List three structural changes that become more favorable when a DNA molecule is negatively supercoiled. (c) What enzyme, with the aid of ATP, can generate negative superhelicity in DNA? (d) Describe the physical mechanism by which this enzyme acts.

11. Yeast Artificial Chromosomes (YACs) YACs are used to clone large pieces of DNA in yeast cells. What three types of DNA sequences are required to ensure proper replication and propagation of a YAC in a yeast cell?

SINEs and LINEs: the art of biting the hand that feeds you
Alan M Weiner

SINEs and LINEs are short and long interspersed retrotransposable elements, respectively, that invade new genomic sites using RNA intermediates. SINEs and LINEs are found in almost all eukaryotes (although not in Saccharomyces cerevisiae) and together account for at least 34% of the human genome.
The noncoding SINEs depend on reverse transcriptase and endonuclease functions encoded by partner LINEs. With the completion of many genome sequences, including our own, the database of SINEs and LINEs has taken a great leap forward. The new data pose new questions that can only be answered by detailed studies of the mechanism of retroposition. Current work ranges from the biochemistry of reverse transcription and integration in vitro to target site selection in vivo, nucleocytoplasmic transport of the RNA and ribonucleoprotein intermediates, and mechanisms of genomic turnover. Two particularly exciting new ideas are that SINEs may help cells survive physiological stress, and that the evolution of SINEs and LINEs has been shaped by the forces of RNA interference. Taken together, these studies promise to explain the birth and death of SINEs and LINEs, and the contribution of these repetitive sequence families to the evolution of genomes.

Addresses
Department of Biochemistry, HSB J417, University of Washington, Box 357350, Seattle, WA 98195-7350, USA; e-mail: [email protected]

Current Opinion in Cell Biology 2002, 14:343–350
0955-0674/02/$, see front matter
© 2002 Elsevier Science Ltd. All rights reserved.

Abbreviations
AP: apurinic/apyrimidinic
EN: endonuclease
LINE: long interspersed repeated sequence
ORF: open reading frame
pA: polyadenylation site
pol: RNA polymerase
RNAi: RNA interference
RT: reverse transcriptase
SINE: short interspersed repeated sequence
SRP: signal recognition particle

Introduction

The dawn of the genomic age has transformed every aspect of biology, and the study of retrotransposable elements is no exception. Until complete genomic sequences began to appear in 1997, whoever studied SINEs and LINEs (short and long interspersed repeated sequences, as rhymefully named by Singer [1]) had to worry that any individual retrotransposon sequence might be misleadingly different from other members of the same sequence class.
For example, with a database in 1981 of only a few dozen Alu sequences (the most abundant SINE in the human genome), no firm conclusions could be drawn about the remaining 1,090,000 genomic Alu elements [2••] whose existence was known solely from DNA reassociation kinetics. Sampling 0.002% of the genome simply does not inspire confidence. Now, with many genomic sequences nearing completion, we can examine essentially all members of a particular class of SINEs or LINEs in a particular genome, asking questions and drawing conclusions that should withstand the test of time. This is the good news. Genomic anatomy rarely reveals mechanism, which is best investigated in the cold room or the tissue culture hood. So for now, the most that can be said for the genome sequences of yeast, Arabidopsis, flies, worms, mice and humans is that they have sharpened our questions about SINEs and LINEs without really answering any of them.

Historically, the first breakthrough from genome structure to molecular mechanism came in 1993, when the R2 protein of the insect retrotransposon R2Bm was shown to nick the target DNA to generate a primer for reverse transcription of the R2Bm RNA in situ [3]. This brought retroposition within reach of conventional divide-and-conquer biochemistry. The second breakthrough came in 1996, when high-frequency retroposition of a human LINE element was achieved in cultured somatic cells [4]. This meant that retroposition could be dissected using the powerful methods of surrogate somatic cell genetics, including selection for rare events. Together, these two breakthroughs dispelled any residual fears that retroposition might occur only in experimentally inaccessible germ line cells [5] or might occur less frequently than the typical research grant must be renewed.

SINEs and LINEs: a parts list and owner’s manual

LINEs are autonomous retroelements; SINEs are their dependants.
As shown in Figure 1, LINEs contain an unusual internal promoter for RNA polymerase II (pol II), one or two open reading frames (ORFs) (where the second ORF is accessed through a frameshift), and a 3′-terminal polyadenylation site (pA) lacking the usual downstream efficiency element [6]. ORF1 encodes an essential protein of unknown function [7••] while ORF2 encodes a bifunctional polypeptide with both reverse transcriptase (RT) and DNA endonuclease (EN) activity. EN is downstream of RT in older LINEs (e.g. R2, CRE1 and -2, SLACS, CZAR, Dong and R4), and upstream of RT in younger LINEs (e.g. L1, Jockey and CR1) [8]. The downstream EN module in older LINEs is a sequence-specific restriction-like EN; the upstream EN module in younger LINEs is an apurinic/apyrimidinic (AP) endonuclease that usually, but not always, lacks sequence-specificity [9].

Figure 1. A parts list for SINEs and LINEs. Elements shown: SINEs (pol III promoter, A-rich tail); older LINEs (R2, CRE1, SLACS; pol II promoter, single ORF encoding RT/R-EN, pA, A-rich tail); younger LINEs (L1, jockey, CR1; pol II promoter, ORF1, ORF2 encoding AP-EN/RT, pA, A-rich tail); protein-encoding genes (pol II promoter, pA); and retropseudogenes (‘processed genes’; pA, A-rich tail). +1 represents the transcription start site. AP-EN, apurinic/apyrimidinic endonuclease; pA, polyadenylation signal lacking downstream efficiency element; pol II and pol III, RNA polymerase II and III promoters; R-EN, restriction-like endonuclease; RT, reverse transcriptase. Solid black arrows represent 7–20 base pair target-site duplications. Dashed lines indicate deletion of intron sequences in retropseudogenes derived from mature mRNA. Only the most common forms of LINE gene organization are shown; variants exist. Note that some SINEs have 3′-terminal dinucleotide or trinucleotide repeats instead of A-rich tails, and that partner SINEs and LINEs share common 3′-terminal sequences.
The polyadenylated LINE transcript is exported to the cytoplasm (presumably like any other intron-less mRNA) [10,11] and translated. However, the nascent RT preferentially binds in cis to the LINE RNA encoding it [7••,12,13]. The resulting ribonucleoprotein (RNP) complex of a functional LINE RNA with the bifunctional RT/EN polypeptide then enters the nucleus, where it initiates a process called ‘target-primed reverse transcription’. In this process, the EN component of the RT/EN polypeptide nicks the target DNA to generate a 3′-hydroxyl group, which is used by the RT component of the RT/EN polypeptide to prime reverse transcription of the LINE RNA in situ on the chromosome [3]. Thus, the LINE cDNA never exists free of the chromosome, as originally postulated [14]. Moreover, the R2 endonuclease is activated by RNA, presumably to prevent it from wreaking freelance chromosomal damage [15]. Following reverse transcription, the DNA repair machinery present in somatic [7••], as well as germ line, cells mends the broken DNA, creating a staggered break and generating flanking target site duplications of 7–20 nucleotides. Most LINEs are 5′ truncated, an observation commonly attributed to incomplete reverse transcription. But other interpretations are possible. A newly retroposed, full-length LINE element is ‘fertile’ — that is, capable of further rounds of retroposition — because both the internal pol II promoter at the 5′ end of the LINE, and the fully internal polyadenylation signal at the 3′ end, guarantee that the new element will be transcriptionally competent to produce an essentially identical polyadenylated LINE mRNA. SINEs are similar to LINEs, but shorter, simpler, and almost certainly dependent on LINE RT/EN functions for retroposition. 
SINEs have an internal promoter for RNA polymerase III (pol III) instead of pol II, a 3′-terminal A-rich tract instead of the pol II polyadenylation signal (or occasionally dinucleotide or trinucleotide repeats), contain no significant ORFs, but otherwise function much like LINEs. A functional SINE must be free of any oligothymidylate tracts (T_n, where n ≥ 4) which can function as pol III termination signals. As a result, SINE transcription continues through the A-rich region until it encounters a random oligothymidylate tract downstream. The A-rich tract of SINE elements is presumed to function as the template for initiation of reverse transcription, because sequences downstream of the A-rich region are not retroposed; however, this has not been proven. A newly retroposed SINE element is ‘fertile’ because the internal pol III promoter and the 3′-terminal A-rich tract together guarantee that the new element will be transcriptionally competent to produce new SINE RNAs that are essentially identical to the original.

The best evidence that SINEs piggyback on LINE RT/EN functions is the discovery that SINEs sometimes share common 3′-terminal sequences with ‘partner’ LINEs [16]. This would facilitate ‘retropositional parasitism’ of SINEs on LINEs by increasing the efficiency with which the LINE RT recognises the 3′ end of a cognate SINE [17,18••]. Not surprisingly, some SINEs are truncated LINEs, such as those arising from the RTE class of retrotransposons [19], but truncated LINEs are not necessarily SINEs [20].

No discussion of SINEs and LINEs would be complete without mentioning ‘retropseudogenes’ (also known as ‘processed genes’): complete or 5′-truncated copies of mature mRNAs flanked by short target-site duplications of 7–20 nucleotides.
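The pol III termination rule described above for SINE transcripts can be illustrated with a short Python sketch. It is a deliberately simplified model, not a description of real pol III behaviour: it assumes that transcription simply stops at the first run of four or more T residues on the sense strand, and the function name `first_pol3_terminator` and the toy sequences are hypothetical, chosen only for illustration.

```python
import re

def first_pol3_terminator(sense_dna, start=0, min_run=4):
    """Return (start, end) of the first run of >= min_run T residues
    at or after `start` on the sense strand, or None if there is none."""
    pattern = re.compile("T{%d,}" % min_run)
    match = pattern.search(sense_dna.upper(), start)
    return match.span() if match else None

# Toy locus (hypothetical sequences): SINE body, A-rich tail, flanking DNA
# that happens to contain a chance oligo(T) tract.
sine_body = "GGCCGGGCGCGGTGGCTCA"
a_rich_tail = "AAAAAAAAAAAA"
flank = "GCATGCCTTTTTGACC"
locus = sine_body + a_rich_tail + flank

# The element itself carries no terminator, so the modelled transcript
# reads through the A-rich tail into the flank.
span = first_pol3_terminator(locus)
transcript_end = span[1] if span else len(locus)
print(locus[:transcript_end])
```

In this toy case the element plus its tail contains no T tract of the required length, so the modelled transcript extends into the flanking DNA and ends at the chance oligo(T) run there. That is the read-through behaviour the text describes.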
As long suspected, and now demonstrated experimentally [7••], retropseudogenes are generated when the LINE RT/EN binds a mature (presumably cytoplasmic) mRNA instead of a LINE or SINE RNA. The resulting retroposed mRNA lacks introns, but has gained a 3′-terminal poly(A) tail, added post-transcriptionally during maturation of the mRNA precursor. Unlike SINEs and LINEs, retropseudogenes are infertile (‘dead on arrival’) because they lack an internal promoter. Shared 3′-terminal sequences help SINEs exploit LINEs by overcoming the natural cis preference of LINE RTs for the RNAs that encoded them [17,18••]. Although generation of retropseudogenes depends on random binding of the LINE RT to mRNAs, retroposition is apparently driven forward by the overwhelming abundance of potential mRNA templates, and perhaps also by the ability of 3′-terminal poly(A) tracts to facilitate initiation of reverse transcription by template–primer slippage [21].

Crossing the great divide: nucleocytoplasmic transport

The study of SINEs and LINEs will not get far without some serious cell biology, because all available evidence indicates that the RNA intermediates for retroposition of SINEs, LINEs and retropseudogenes are captured, if not actually reverse transcribed, in the cytoplasm. LINE RT/EN grabs LINE mRNA primarily in cis in the cytoplasm [13], and carries the mRNA into the nucleus as an RNP, as is also the case for retroviruses [22]. The almost complete absence of retropseudogenes containing unexcised introns (rat preproinsulin 1 being one of the few exceptions [23,24]) provides further evidence that the LINE RT/EN first binds RNA intermediates in the cytoplasm. However, it remains to be seen whether SINEs such as primate Alu elements [25] and rodent ID elements [5], which are derived from cytoplasmic RNAs, must retropose through cytoplasmic RNA intermediates.
If the RNA intermediates for retroposition are cytoplasmic, SINEs and LINEs may have always been under selection for cytoplasmic stability as well as efficient nuclear export (an active process driven by a GTP gradient and requiring specific cargo-binding proteins [10,11,26,27]). Interestingly, retroposition of dimeric human Alu elements derived from the 7SL RNA component of the signal recognition particle (SRP) appears to be facilitated by binding of two of the six SRP proteins (SRP9 and SRP14) to the right monomer, possibly to assure nuclear export and/or stabilize full-length Alu RNAs in the cytoplasm [25]. Thus, cytoplasmic RNA retroposition intermediates may evolve to bind proteins that guarantee nuclear export and/or cytoplasmic stability, while avoiding proteins that would interfere with reverse transcription or nuclear import. For example, the nuclear export apparatus only recognises mature tRNA with correct 5′ and 3′ ends [28]. This could be one explanation for why SINEs derived from tRNA often retain a recognisable tRNA fold [29,30]. Another explanation might be that the tRNA fold, like the shared 3′ terminus with a partner LINE, facilitates binding of the LINE RT/EN [17]. Similarly, binding of poly(A) binding proteins [31] to the poly(A) tail of LINEs, and possibly the 3′-terminal A-rich tract of SINEs [32], may facilitate nuclear export or cytoplasmic stability.

LINEs: a maelstrom of modules?

A phylogeny of the individual parts (the LINE EN and RT functions) is not necessarily a phylogeny of the whole because retroelements, like viruses and organismal genomes, are a ‘maelstrom of modules’ [33]. The most comprehensive phylogeny to date is extremely revealing [34,35••]: all LINE-like elements (or ‘non-LTR retrotransposons’) share a common ancestral RT, most closely related to the RT of certain group II introns. The earliest LINE-like elements possessed a sequence-specific restriction-like endonuclease downstream of the RT module.
This downstream restriction endonuclease was later replaced by an upstream AP EN, and still later some of these elements acquired an RNase H domain downstream of the RT. All of the restriction-like endonucleases are sequence-specific [8], but the AP EN can be nonspecific or, more rarely, sequence-specific, as in the Bombyx R1Bm element [36]. Perhaps when we learn the function of the second ORF that is present only in the younger elements, we will begin to understand whether LINEs were assembled from pre-existing cellular parts, or cellular functions were devised for pre-existing LINE or group II intron functions.

A SINE is born

Most but not all SINEs are derived from tRNA; exceptions include rodent ID elements derived from neuronal BC1 RNA, also found in male germ cells [5], and primate Alu elements derived from the ubiquitous 7SL RNA component of the SRP [37]. However, any SINE that too closely resembles the parent RNA could function as a dominant-negative mutant, or be sequestered from retroposition by interaction with proteins that normally bind the parent RNA. Thus, as mentioned above, the trick may be to retain sufficient resemblance to the parent RNA to bind proteins that are important for transport or stability, but not those proteins that would interfere with retroposition. Certainly, preservation of the internal pol III transcription factor binding sites (the A and B boxes) is not sufficient to explain preservation of a recognisable tRNA fold. We also can’t explain why an abundant pol III transcript like 5S ribosomal RNA has apparently never given rise to a SINE; why primate Alu elements are dimeric while the vast majority of mammalian SINEs are monomeric; or why the V-SINE (vertebrate-specific SINE) superfamily maintains a highly conserved core sequence sandwiched between the tRNA-like 5′ end and the LINE-like 3′ end [18••]. We can’t even explain why there are no LINEs or SINEs in the budding yeast S.
cerevisiae, despite an abundance of retroviral-like elements and the presence of L1-like retrotransposons in the pathogenic yeasts Candida albicans and Cryptococcus neoformans [38,39].

For SINEs that share 3′-terminal sequences with a partner LINE and persist by ‘retropositional parasitism’ [17,18••], it is not difficult to imagine how the SINE got its tail: the LINE RT would generate a short cDNA (equivalent to a retroviral ‘strong stop’ cDNA) by copying the 3′-terminal LINE RNA sequence. This cDNA would then switch from the LINE template to the RNA parent of the SINE-to-be, thus attaching a RT landing pad to an RNA carrying an internal pol III promoter. One can also imagine that 3′-terminal identity with a partner LINE enables SINEs to more efficiently pirate the LINE RT, a cis-acting enzyme designed to keep functional LINEs alive by ignoring RNAs derived from the vast excess of moribund or nonfunctional elements [7••,13]. However, ‘retropositional parasitism’ is not risk-free: the human SINE MIR, which shares 50 bases of 3′-terminal sequence with the partner LINE2 element, was doomed when LINE2 (L2) became extinct [2••]. Other SINEs, such as primate Alu sequences, lack partner LINEs, perhaps because the L1 RT binds tightly to the A-rich tail and is less dependent on auxiliary or adjacent RNA sequences. Only careful studies of LINE RT activities will confirm, modify or enable us to reject these stories.

Knowing when to stop: dodging host genes and seeking safe havens

All retrotransposable elements are insertional mutagens that can cause disease or disability by inserting near or within essential genes [40,41]. As a result, retroelements face an existential dilemma [42]. Unlike infectious viral elements that can afford to kill one host in the process of infecting others, noninfectious retrotransposable elements are captive but rebellious passengers within the host genome [42].
(We ignore the possible horizontal transfer of SINEs and LINEs as stowaways in retroviral particles [43–45].) If the retroelement multiplies too recklessly, it will kill the host; but if it does not multiply fast enough to offset natural loss and decay, it cannot survive. This balancing act pits the ingenuity of the retroelement against the ingenuity of the host; the goal is to overrun, but not overcome the host. The abundance of dead retroelements littering the human genomic landscape provides mute testimony regarding this endless battle [2••]; however, we do not understand why one family of LINEs (say, L1) succeeds another (say, L2), or what enables one family of SINEs or LINEs to replicate explosively, while others eke out a minimal existence without vanishing into oblivion. New data from the (nearly) complete human genome sequence underscore the magnitude of the problem: 1,500,000 SINEs (70% of them Alu elements) account for 13% of our genome, and 850,000 LINEs for another 21% of the genome, giving a grand total of 34% transposable elements [2••]. Moreover, the distribution of transposable elements is inexplicably uneven, best exemplified by a 525 kb segment of chromosome Xp11, where the density of retroposable elements is 89%, and the four human homeobox gene clusters (HoxA, HoxB, HoxC, and HoxD), where the density of interspersed repeats is less than 2%. It is a wonder that our genes have survived the bombardment. The choice of integration site is a difficult one for any transposable element: On the one hand, it is advantageous to avoid integration in highly active genomic regions where major damage might be done [46•]. On the other hand, it is disadvantageous to be sequestered in inactive regions where transcriptional activity is low. One solution is to target tandemly repeated genes (rRNA, trypanosome and nematode trans-spliced leaders [47,48]) or dispersed multigene families (tRNA). 
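The copy numbers and genome fractions quoted above imply rough average lengths per element copy. The sketch below does that arithmetic. It assumes a haploid genome size of 3.2 × 10^9 bp, a figure not given in this section, so treat it as an assumption; the variable and function names are hypothetical.

```python
GENOME_BP = 3.2e9  # ASSUMED haploid genome size; not stated in the text above

# (copy number, fraction of genome) as quoted in the text from ref. [2]
families = {
    "SINEs": (1_500_000, 0.13),
    "LINEs": (850_000, 0.21),
}

def mean_length_bp(copies, fraction, genome_bp=GENOME_BP):
    """Average length per copy implied by a copy number and a genome fraction."""
    return fraction * genome_bp / copies

for name, (copies, fraction) in families.items():
    print(f"{name}: ~{mean_length_bp(copies, fraction):.0f} bp per copy")

# Combined share of the genome taken up by the two families.
total_fraction = sum(fraction for _, fraction in families.values())
print(f"combined fraction: {total_fraction:.0%}")
```

Under that assumed genome size, the implied averages are roughly 280 bp per SINE copy and roughly 790 bp per LINE copy. The LINE figure would fit the earlier remark that most LINEs are 5′ truncated if full-length elements are several kilobases long, which is itself an assumption here.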
For example, the insect R1 and R2 LINE elements carry a sequence-specific endonuclease that cuts within the rDNA repeat unit [36]. With excess rRNA coding capacity provided by several hundred tandem repeats of the rDNA, the host can endure a significant burden of insertions. Yet the R1 element is scarce in Bombyx rDNA where concerted evolution or other forces select against integrants [49•], while the very same R1 element is superabundant in Drosophila rDNA where it may interrupt 50–70% of the tandem rDNA repeats [50]. Another strategy, not yet shown for a SINE or LINE, is to land near genes, but not in them. In yeast, the Ty3 endogenous retrovirus targets the 5′ flanking region of tRNA genes by a protein–protein interaction between the Ty3 integration machinery and components of the transcription factor TFIIIB [51,52]. tRNA gene expression is apparently unaffected, but the Ty3 element is thereby guaranteed a home in a constitutively open chromatin region.

Mysteries of the genome: inverse distribution of SINEs and LINEs

Curiously, human SINEs (Alu elements) are concentrated in gene-rich GC-rich regions of the genome, and LINEs in gene-poor AT-rich regions [2••,53••], supporting previous evidence for the existence of isochores (extended regions of compositionally or functionally similar DNA) [54,55•]. This inverse distribution is unlikely to reflect preferential integration of SINEs in GC-rich regions, because SINEs are widely believed to pirate the LINE RT/EN protein, and also unlikely to reflect preferential loss from AT-rich regions, as these appear to tolerate high levels of ‘junk’ DNA. Thus, SINEs may be preferentially retained in GC-rich regions (perhaps positively selected to augment the stress response, as described below [56,57,58••]). Or LINEs might be preferentially lost from GC-rich regions (perhaps because the 20-fold larger LINEs are more dangerous insertional mutagens [46•]).
Alternatively, an excess of L1 elements on the human Y chromosome, and to a lesser extent on the human X chromosome, suggests that LINEs if not SINEs may be purged by recombination between homologues containing occupied and empty target sites [46•]. Although these gene conversion events would have to be directional to purge newly integrated retrotransposable elements instead of fixing them, recombinational differences between GC-rich and AT-rich regions of the genome might then explain the inverse distribution of SINEs and LINEs.

Another possible explanation for the inverse genomic distribution of SINEs and LINEs would be differential cell cycle regulation at one (or more) of the many steps in the retroposition process. Among the steps that could be cell-cycle regulated are transcription of the RNAs themselves [59], nuclear export and cytoplasmic stabilization of the RNAs, nuclear import of the RNAs or RNP intermediates, differential chromatin condensation (preferential decondensation of gene-rich GC-rich regions may facilitate integration), different DNA replication schedules (gene-rich GC-rich DNA is usually replicated earlier than gene-poor AT-rich regions), and differential DNA repair (preferential decondensation of gene-rich GC-rich regions may render the underlying DNA more susceptible to damage, with integration as a byproduct or consequence). Although there is no direct evidence that chromatin structure can affect the integration of SINEs and LINEs, there are retroviral precedents: nucleosomes generally block, but occasionally enhance, mammalian retroviral integration [60] and protein–protein interactions are responsible for targeting Ty3 to tRNA promoters [51,52] and Ty5 to silent chromatin [61].
The inverse distribution of SINEs and LINEs could also be influenced by differential DNA methylation within gene-poor AT-rich and gene-rich GC-rich regions, but there are conflicting reports of the effect of CpG methylation on SINE and LINE transcription. Natural CpG methylation in HeLa cells inhibits Alu transcription, and inhibition is relieved by treating the cells with the demethylating drug 5-azacytidine [32]. In contrast, artificial CpG methylation inhibited L1 but not Alu transcription in both transient and stable expression assays when the methylated CpG binding protein MeCP2 was overexpressed and/or targeted to the SINE or LINE reporter constructs by a Gal4 DNA-binding domain [62•]. MeCP2 is a transcriptional repressor that tethers the Sin3A–HDAC1 and Sin3A–HDAC2 histone deacetylase complexes to CpG-methylated DNA.

The taming of the shrewd: SINEs, LINEs and RNA interference

RNA interference (RNAi) is a form of post-transcriptional gene silencing triggered by double-stranded RNA. It is found in many organisms, including flies [63,64], worms [65] and mammals [66]. In flies [67••], worms [68,69] and vertebrates, including humans [67••], RNAi appears to be a normal regulatory mechanism in which single-stranded ‘micro RNAs’, around 22 nt in size and derived from somewhat larger developmentally controlled RNA precursors, anneal with target mRNAs and trigger their degradation. RNAi also plays a protective or defensive role, and has been implicated in silencing transposons [70–72] and viruses [73–75] in fungi and plants. SINEs and LINEs appear to be subject to RNAi [76,77••]. Although SINEs and LINEs can be independently transcribed from internal promoters, co-transcription of SINEs and LINEs from external promoters will generate antisense as well as sense transcripts.
These sense and antisense transcripts of SINEs and LINEs could in principle anneal to form double-stranded RNAs capable of triggering the degradation of any other transcripts containing SINE and LINE sequences, whether independently transcribed from internal SINE and LINE promoters or co-transcribed as part of larger RNAs. To explain how mRNAs and mRNA precursors containing all or part of a SINE or LINE sequence can survive attack by RNAi, one might speculate that divergence between any pair of SINE or LINE sequences is usually too great to trigger RNAi. In any event, RNAi may have played a role in the evolution of SINEs and LINEs, as it is part of the cellular environment in which SINEs and LINEs arise and propagate. We do not know whether RNAi evolved first as a defensive or developmental regulatory mechanism, but an extreme view would be that SINEs and LINEs could not flourish until RNAi evolved to protect the cell from them.

Accidental travellers

SINEs and LINEs, no less than retropseudogenes, can provide the ‘seeds of evolution’ [78]. Perhaps the single most conspicuous demonstration of the power of retroposition is the functional, tissue-specific human pgk-2 locus, an autosomal phosphoglycerate kinase retrogene expressed during spermatogenesis and lacking the ten introns of the ubiquitously expressed X-linked pgk-1 gene. How PGK-2 acquired both a promoter and tissue specificity is an interesting question. Was it a lucky insertion into a tissue-specific promoter or chromosomal region, or was the specificity an accident that was then perfected by subsequent selection? Almost equally remarkable is the rat preproinsulin I gene, a functional retroposon which, having lost one of the two ancestral preproinsulin II introns, may be the sole instance in which a partially spliced mRNA appears to have served as the RNA intermediate [23].
Instances of useful genomic mayhem created by retroposition of SINEs and LINEs are still relatively sparse (see http://exppc01.uni-muenster.de/expath/alltables.htm), but more will surely emerge from detailed analysis of the human genome sequence. For example, L1s can co-transduce 3′ flanking sequences [79,80], and integration of Alu elements in reverse orientation can introduce fortuitous 3′ splice sites that are capable of diversifying the spectrum of alternatively spliced mRNAs (R Sorek and G Ast, personal communication). There is also one report of a rodent B2 SINE that carries a pol II promoter [81•]. Remarkably, this portable pol II promoter does not interfere with the internal pol III promoter function required for B2 retroposition, and in one case the mobile pol II promoter drives transcription of a typical gene encoding the laminin variant, Lama3.

Are SINEs useful parasites?

From the moment of their discovery, SINEs and LINEs were treated as genomic parasites [82], an internal infection that could be kept in check but rarely cured. However, evolution, like science, is rife with reversals of fortune. The first group to sequence human Alu elements has now proposed that SINEs, once considered a source of genomic stress, may in fact help ease cells through physiological stresses that induce Alu transcription such as heat shock and translational inhibition [56,57,58••]. The induced SINE transcripts would then bind to PKR kinase, blocking the ability of this kinase to inhibit translation by phosphorylating the initiation factor eIF2α.
Although initially greeted with scepticism, the view that SINEs may lend us a helping hand has received significant support from an unexpected direction: the International Human Genome Sequencing Consortium argued, in presenting the first draft of the human genome sequence [2••], that over-representation of Alu elements in gene-rich GC-rich DNA is most easily explained if ‘SINEs actually earn their keep in the genome’ by positive selection, as proposed by Schmid [56,57,58••]. Although the jury is still out, at least we can now look forward to a fair trial.

Update

A recent publication [83] suggests that methylation may be responsible for the mysteriously uneven genomic distribution of SINEs, perhaps by affecting the insertion and/or deletion of the elements.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Singer MF: SINEs and LINEs: highly repeated short and long interspersed sequences in mammalian genomes. Cell 1982, 28:433-434.

2. •• Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
Although the first scholarly presentation of the human genome sequence might have been expected to focus primarily on genes, the public consortium did a marvellous job of discussing the general appearance of the genomic landscape, and the role that SINEs and LINEs may have played in shaping that rugged geography. The uneven distribution of SINEs in the human genome suggests that retroposable elements are subject to positive as well as negative selection, reinforcing provocative recent evidence that SINEs may be good for us after all (see Li et al. [2001] [58••] below).

3.
Luan DD, Korman MH, Jakubczak JL, Eickbush TH: Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell 1993, 72:595-605.

4. Moran JV, Holmes SE, Naas TP, DeBerardinis RJ, Boeke JD, Kazazian HH Jr: High frequency retrotransposition in cultured mammalian cells. Cell 1996, 87:917-927.

5. Muslimov IA, Lin Y, Heller M, Brosius J, Zakeri Z, Tiedge H: A small RNA in testis and brain: implications for male germ cell development. J Cell Sci 2002, 115:1243-1250.

6. Hans H, Alwine JC: Functionally significant secondary structure of the simian virus 40 late polyadenylation signal. Mol Cell Biol 2000, 20:2926-2932.

7. •• Esnault C, Maestre J, Heidmann T: Human LINE retrotransposons generate processed pseudogenes. Nat Genet 2000, 24:363-367.
Since the discovery of SINEs, LINEs and processed pseudogenes in the early 1980s, it was widely assumed that LINE reverse transcriptase would be responsible for propagating noncoding SINEs and generating processed retropseudogenes derived from mRNAs. This work provides the long-awaited evidence for this long-held hypothesis, and develops the experimental tools for understanding the functions of LINE reverse transcriptase and integrase at the molecular level.

8. Yang J, Malik HS, Eickbush TH: Identification of the endonuclease domain encoded by R2 and other site-specific, non-long terminal repeat retrotransposable elements. Proc Natl Acad Sci USA 1999, 96:7847-7852.

9. Feng Q, Moran JV, Kazazian HH Jr, Boeke JD: Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell 1996, 87:905-916.

10. Pasquinelli AE, Ernst RK, Lund E, Grimm C, Zapp ML, Rekosh D, Hammarskjold ML, Dahlberg JE: The constitutive transport element (CTE) of Mason–Pfizer monkey virus (MPMV) accesses a cellular mRNA export pathway. EMBO J 1997, 16:7500-7510.

11.
Paca RE, Ogert RA, Hibbert CS, Izaurralde E, Beemon KL: Rous sarcoma virus DR posttranscriptional elements use a novel RNA export pathway. J Virol 2000, 74:9507-9514. 12. Kimberland ML, Divoky V, Prchal J, Schwahn U, Berger W, Kazazian HH Jr: Full-length human L1 insertions retain the capacity for high frequency retrotransposition in cultured cells. Hum Mol Genet 1999, 8:1557-1560. 13. Wei W, Gilbert N, Ooi SL, Lawler JF, Ostertag EM, Kazazian HH, Boeke JD, Moran JV: Human L1 retrotransposition: cis preference versus trans complementation. Mol Cell Biol 2001, 21:1429-1439. 14. Van Arsdell SW, Denison RA, Bernstein LB, Weiner AM, Manser T, Gesteland RF: Direct repeats flank three small nuclear RNA pseudogenes in the human genome. Cell 1981, 26:11-17. 15. Yang J, Eickbush TH: RNA-induced changes in the activity of the endonuclease encoded by the R2 retrotransposable element. Mol Cell Biol 1998, 18:3455-3465. 16. Okada N, Hamada M, Ogiwara I, Ohshima K: SINEs and LINEs share common 3′′ sequences: a review. Gene 1997, 205:229-243. 17. Ogiwara I, Miya M, Ohshima K, Okada N: Retropositional parasitism of SINEs on LINEs: identification of SINEs and LINEs in elasmobranchs. Mol Biol Evol 1999, 16:1238-1250. 18. Ogiwara I, Miya M, Ohshima K, Okada N: V-SINEs: a new •• superfamily of vertebrate SINEs that are widespread in vertebrate genomes and retain a strongly conserved segment within each repetitive unit. Genome Res 2002, 12:316-324. The Okada group was the first to show that SINEs sometimes impersonate LINEs by imitating (or stealing) their 3′-terminal sequences. The shared 3′-terminal sequences may facilitate retroposition by enabling SINE RNA to bind LINE reverse transcriptase more tightly. V-SINEs are a stunning example of this kind of ‘retropositional parasitism’ in which one mobile element exploits another. 19. Malik HS, Eickbush TH: The RTE class of non-LTR retrotransposons is widely distributed in animals and is the origin of many SINEs. 
Mol Biol Evol 1998, 15:1123-1134. 20. Nikaido M, Okada N: CetSINEs and AREs are not SINEs but are parts of cetartiodactyl L1. Mamm Genome 2000, 11:1123-1126. 21. Bebenek K, Abbotts J, Roberts JD, Wilson SH, Kunkel TA: Specificity and mechanism of error-prone replication by human immunodeficiency virus-1 reverse transcriptase. J Biol Chem 1989, 264:16948-16956. 22. Kenna MA, Brachmann CB, Devine SE, Boeke JD: Invading the yeast nucleus: a nuclear localization signal at the C terminus of Ty1 integrase is required for transposition in vivo. Mol Cell Biol 1998, 18:1115-1124. 23. Soares MB, Schon E, Henderson A, Karathanasis SK, Cate R, Zeitlin S, Chirgwin J, Efstratiadis A: RNA-mediated gene duplication: the rat preproinsulin I gene is a functional retroposon. Mol Cell Biol 1985, 5:2090-2103. SINEs and LINEs: the art of biting the hand that feeds you Weiner 24. Weiner AM, Deininger PL, Efstratiadis A: Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu Rev Biochem 1986, 55:631-661. 25. Sarrowa J, Chang DY, Maraia RJ: The decline in human Alu retroposition was accompanied by an asymmetric decrease in SRP9/14 binding to dimeric Alu RNA and increased expression of small cytoplasmic Alu RNA. Mol Cell Biol 1997, 17:1144-1151. 26. Kuersten S, Ohno M, Mattaj IW: Nucleocytoplasmic transport: Ran, beta and beyond. Trends Cell Biol 2001, 11:497-503. 27. Dahlberg JE, Lund E: Functions of the GTPase Ran in RNA export from the nucleus. Curr Opin Cell Biol 1998, 10:400-408. 28. Lund E, Dahlberg JE: Proofreading and amino acylation of tRNAs before export from the nucleus. Science 1998, 282:2082-2085. 29. Daniels GR, Deininger PL: Repeat sequence families derived from mammalian tRNA genes. Nature 1985, 317:819-822. 30. Sakamoto K, Okada N: Rodent type 2 Alu family, rat identifier sequence, rabbit C family, and bovine or goat 73-bp repeat may have evolved from tRNA genes. J Mol Evol 1985, 22:134-140. 31. 
Voeltz GK, Ongkasuwan J, Standart N, Steitz JA: A novel embryonic poly(A) binding protein, ePAB, regulates mRNA deadenylation in Xenopus egg extracts. Genes Dev 2001, 15:774-788. 32. Liu WM, Maraia RJ, Rubin CM, Schmid CW: Alu transcripts: cytoplasmic localisation and regulation by DNA methylation. Nucleic Acids Res 1994, 22:1087-1095. 33. Gibbs A, Calisher CH, Garcia-Arenal F (eds): Molecular Basis of Virus Evolution. Proceedings of a Fundacion Juan March Symposium. Cambridge: Cambridge University Press, 1995:603. 34. Malik HS, Burke WD, Eickbush TH: The age and evolution of non-LTR retrotransposable elements. Mol Biol Evol 1999, 16:793-805. 35. Malik HS, Eickbush TH: Phylogenetic analysis of ribonuclease H •• domains suggests a late, chimeric origin of LTR retrotransposable elements and retroviruses. Genome Res 2001, 11:1187-1197. The Eickbush group has led the way in using phylogeny to attack a fundamental problem in the evolution of mobile elements: who came first, and who stole what from whom? Were new elements assembled from parts of the old, or from pre-existing cellular parts; or did mobile elements both steal cellular functions and bequeath new functions to the cell? This paper argues that all LINE-like elements share a common ancestral reverse transcriptase (RT) most closely related to the RT of certain mobile group II introns that also move through RNA intermediates. 36. Feng Q, Schumann G, Boeke JD: Retrotransposon R1Bm endonuclease cleaves the target sequence. Proc Natl Acad Sci USA 1998, 95:2083-2088. 37. Ullu E, Tschudi C: Alu sequences are processed 7SL RNA genes. Nature 1984, 312:171-172. 38. Goodwin TJ, Poulter RT: The diversity of retrotransposons in the yeast Cryptococcus neoformans. Yeast 2001, 18:865-880. 39. Goodwin TJ, Ormandy JE, Poulter RT: L1-like non-LTR retrotransposons in the yeast Candida albicans. Curr Genet 2001, 39:83-91. 40. Kazazian HH Jr: An estimated frequency of endogenous insertional mutations in humans. 
Nat Genet 1999, 22:130. 41. Deininger PL, Batzer MA: Alu repeats and human disease. Mol Genet Metab 1999, 67:183-193. 42. Craigie R: Hotspots and warm spots: integration specificity of retroelements. Trends Genet 1992, 8:187-190. 43. Peters G, Harada F, Dahlberg JE, Panet A, Haseltine WA, Baltimore D: Low-molecular-weight RNAs of Moloney murine leukemia virus: identification of the primer for RNA-directed DNA synthesis. J Virol 1977, 21:1031-1041. 44. Ikawa Y, Ross J, Leder P: An association between globin messenger RNA and 60S RNA derived from Friend leukemia virus. Proc Natl Acad Sci USA 1974, 71:1154-1158. 45. Linial M, Medeiros E, Hayward WS: An avian oncovirus mutant (SE 21Q1b) deficient in genomic RNA: biological and biochemical characterization. Cell 1978, 15:1371-1381. 349 46. Boissinot S, Entezam A, Furano AV: Selection against deleterious • LINE-1-containing loci in the human lineage. Mol Biol Evol 2001, 18:926-935. This is the clearest evidence to date showing that LINEs can be subject to negative selection, and are kicked out of genomic regions where they have the potential to do harm. 47. Aksoy S, Williams S, Chang S, Richards FF: SLACS retrotransposon from Trypanosoma brucei gambiense is similar to mammalian LINEs. Nucleic Acids Res 1990, 18:785-792. 48. Malik HS, Eickbush TH: NeSL-1, an ancient lineage of site-specific non-LTR retrotransposons from Caenorhabditis elegans. Genetics 2000, 154:193-203. 49. Perez-Gonzalez CE, Eickbush TH: Dynamics of R1 and R2 elements • in the rDNA locus of Drosophila simulans. Genetics 2001, 158:1557-1567. The mobile retroelements R1 and R2 insert specifically into tandem repeats of arthropod rRNA genes but, as determined by a new PCR assay, R1 and R2 seldom spread to other chromosomes in the population. This indicates that recombinational bias purges new R1 and R2 insertions, and also suggests that the actual frequency of R1 and R2 retrotransposition is much higher than previously thought. 50. 
Wellauer PK, Dawid IB: The structural organization of ribosomal DNA in Drosophila melanogaster. Cell 1977, 10:193-212. 51. Aye M, Dildine SL, Claypool JA, Jourdain S, Sandmeyer SB: A truncation mutant of the 95-kilodalton subunit of transcription factor IIIC reveals asymmetry in Ty3 integration. Mol Cell Biol 2001, 21:7839-7851. 52. Kim JM, Vanguri S, Boeke JD, Gabriel A, Voytas DF: Transposable elements and genome organization: a comprehensive survey of retrotransposons revealed by the complete Saccharomyces cerevisiae genome sequence. Genome Res 1998, 8:464-478. 53. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, •• Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of the human genome. Science 2001, 291:1304-1351. The human genome sequence has fully confirmed the evidence, painstakingly assembled by the Bernardi group over many years, documenting large-scale inhomogeneities within the human genome. These unexpected local differences in base composition, gene density and density of mobile elements result in ‘isochores’ — extended chromosomal regions of similar composition and presumably function — but we still do not know how or why isochores arise. Perhaps clever bioinformatics will suggest some useful experiments to probe the significance and origin of isochores. 54. Pavlicek A, Jabbari K, Paces J, Paces V, Hejnar JV, Bernardi G: Similar integration but different stability of Alus and LINEs in the human genome. Gene 2001, 276:39-45. 55. Pavlicek A, Paces J, Clay O, Bernardi G: A compact view of • isochores in the draft human genome sequence. FEBS Lett 2002, 511:165-169. See annotation Venter et al. (2001) [53••]. 56. Chu WM, Ballard R, Carpick BW, Williams BR, Schmid CW: Potential Alu function: regulation of the activity of double-stranded RNA-activated kinase PKR. Mol Cell Biol 1998, 18:58-68. 57. 
Liu WM, Chu WM, Choudary PV, Schmid CW: Cell stress and translational inhibitors transiently increase the abundance of mammalian SINE transcripts. Nucleic Acids Res 1995, 23:1758-1765. 58. Li TH, Schmid CW: Differential stress induction of individual Alu •• loci: implications for transcription and retrotransposition. Gene 2001, 276:135-141. This is the most recent in a series of experiments arguing, against common wisdom, that SINEs may be useful or even subject to positive selection. It is only fitting that this provocative hypothesis should emerge from the group that discovered Alu elements over two decades ago. 59. Scott PH, Cairns CA, Sutcliffe JE, Alzuherri HM, McLees A, Winter AG, White RJ: Regulation of RNA polymerase III transcription during cell cycle entry. J Biol Chem 2001, 276:1005-1014. 60. Cost GJ, Golding A, Schlissel MS, Boeke JD: Target DNA chromatinization modulates nicking by L1 endonuclease. Nucleic Acids Res 2001, 29:573-577. 61. Xie W, Gai X, Zhu Y, Zappulla DC, Sternglanz R, Voytas DF: Targeting of the yeast Ty5 retrotransposon to silent chromatin is mediated by interactions between integrase and Sir4p. Mol Cell Biol 2001, 21:6606-6614. 350 Nuclear and gene expression 62. Yu F, Zingler N, Schumann G, Stratling WH: Methyl-CpG-binding • protein 2 represses LINE-1 expression and retrotransposition but not Alu transcription. Nucleic Acids Res 2001, 29:4493-4501. Using both transient and stable assays for transcription of artificially methylated DNA, the authors show that methylation may differentially affect SINE and LINE retroposition. This unexpected observation has the potential to explain the uneven distribution of SINEs and LINEs. See also Greally (2002) [83]. 63. Hammond SM, Bernstein E, Beach D, Hannon GJ: An RNA-directed nuclease mediates post-transcriptional gene silencing in Drosophila cells. Nature 2000, 404:293-296. 64. 
Caplen NJ, Parrish S, Imani F, Fire A, Morgan RA: Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci USA 2001, 98:9742-9747. 65. Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC: Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 1998, 391:806-811. 66. Elbashir SM, Lendeckel W, Tuschl T: RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes Dev 2001, 15:188-200. 67. •• Lagos-Quintana M, Rauhut R, Lendeckel W, Tuschl T: Identification of novel genes coding for small expressed RNAs. Science 2001, 294:853-858. In many eukaryotes, including humans, double-stranded RNAs are processed into ‘micro RNAs’ (miRNAs) that can regulate expression of complementary mRNAs, or mediate an unusual post-transcriptional genesilencing phenomenon called RNAi (‘RNA interference’). RNAi appears to be a potent cellular defence mechanism against viruses and DNA transposons, and it may do the same for SINEs and LINEs (see Jensen et al. [1999] [77••]). 73. Dougherty WG, Lindbo JA, Smith HA, Parks TD, Swaney S, Proebsting WM: RNA-mediated virus resistance in transgenic plants: exploitation of a cellular pathway possibly involved in RNA degradation. Mol Plant Microbe Interact 1994, 7:544-552. 74. Mourrain P, Beclin C, Elmayan T, Feuerbach F, Godon C, Morel JB, Jouette D, Lacombe AM, Nikic S, Picault N et al.: Arabidopsis SGS2 and SGS3 genes are required for posttranscriptional gene silencing and natural virus resistance. Cell 2000, 101:533-542. 75. Dalmay T, Horsefield R, Braunstein TH, Baulcombe DC: SDE3 encodes an RNA helicase required for post-transcriptional gene silencing in Arabidopsis. EMBO J 2001, 20:2069-2078. 76. Jensen S, Gassama MP, Heidmann T: Cosuppression of I transposon activity in Drosophila by I-containing sense and antisense transgenes. Genetics 1999, 153:1767-1774. 77. 
•• Jensen S, Gassama MP, Heidmann T: Taming of transposable elements by homology-dependent gene silencing. Nat Genet 1999, 21:209-212. If SINEs and LINEs give rise to duplex RNA, as might be expected if these retroelements are subject to readthrough transcription from external promoters, RNA interference may be responsible for preventing sudden explosive growth of SINE and LINE families. 78. Brosius J: Retroposons — seeds of evolution. Science 1991, 251:753. 79. Moran JV, DeBerardinis RJ, Kazazian HH Jr: Exon shuffling by L1 retrotransposition. Science 1999, 283:1530-1534. 68. Lau NC, Lim LP, Weinstein EG, Bartel DP: An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 2001, 294:858-862. 80. Goodier JL, Ostertag EM, Kazazian HH Jr: Transduction of 3′′-flanking sequences is common in L1 retrotransposition. Hum Mol Genet 2000, 9:653-657. 69. Lee RC, Ambros V: An extensive class of small RNAs in Caenorhabditis elegans. Science 2001, 294:862-864. 81. Ferrigno O, Virolle T, Djabari Z, Ortonne JP, White RJ, Aberdam D: • Transposable B2 SINE elements can provide mobile RNA polymerase II promoters. Nat Genet 2001, 28:77-81. As mobile elements, SINEs and LINEs have the potential to destroy (by insertional mutagenesis), to create (by generating functional retropseudogenes), and to empower (by giving old genes new promoters or regulatory signals as described in this novel story). Thus, in genomic evolution, as in life, mayhem is sometimes useful. 70. Tabara H, Sarkissian M, Kelly WG, Fleenor J, Grishok A, Timmons L, Fire A, Mello CC: The rde-1 gene, RNA interference, and transposon silencing in C. elegans. Cell 1999, 99:123-132. 71. Ketting RF, Haverkamp TH, van Luenen HG, Plasterk RH: Mut-7 of C. elegans, required for transposon silencing and RNA interference, is a homolog of Werner syndrome helicase and RNaseD. Cell 1999, 99:133-141. 72. 
Wu-Scharf D, Jeong B, Zhang C, Cerutti H: Transgene and transposon silencing in Chlamydomonas reinhardtii by a DEAH-box RNA helicase. Science 2000, 290:1159-1162. 82. Orgel LE, Crick FH: Selfish DNA: the ultimate parasite. Nature 1980, 284:604-607. 83. Greally JM. Short interspersed transposable elements (SINEs) are excluded from imprinted regions in the human genome. Proc Natl Acad Sci USA 2002, 99:327-332.