Trends in protein evolution inferred from sequence and structure analysis

L Aravind, Raja Mazumder, Sona Vasudevan and Eugene V Koonin*

Complementary developments in comparative genomics, protein structure determination and in-depth comparison of protein sequences and structures have provided a better understanding of the prevailing trends in the emergence and diversification of protein domains. The investigation of deep relationships among different classes of proteins involved in key cellular functions, such as nucleic acid polymerases and other nucleotide-dependent enzymes, indicates that a substantial set of diverse protein domains evolved within the primordial, ribozyme-dominated RNA world.

Addresses
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
*e-mail: [email protected]

Current Opinion in Structural Biology 2002, 12:392–399

0959-440X/02/$, see front matter
Published by Elsevier Science Ltd.

Abbreviations
aaRS     aminoacyl-tRNA synthetases
ds       double-stranded
EP       eukaryotic primase
ETFP     electron transfer flavoprotein
FAD      flavin adenine dinucleotide
HEH      helix-extension-helix
HNTase   HIGH nucleotidyl transferase
LUCA     last universal common ancestor
LysRS    lysyl-tRNA synthetase
NAD      nicotinamide adenine dinucleotide
RDRP     RNA-dependent RNA polymerase
RRM      RNA recognition motif
SMBD     small-molecule-binding domain

Introduction
Within the short time span of the last six years of the 20th century, genomic sequences and entire protein complements of organisms representing major branches of the tree of life have become available [1,2••]. Concomitantly, progress in X-ray crystallography and NMR techniques, and directed efforts in the solution of structures of diverse proteins have produced a fairly detailed map of the protein universe [3–6].
Matching these experimental efforts, advances in the computational analysis of proteins through sequence profile searches allowed the detection of deep relationships among proteins that previously were amenable only to the direct comparison of structures [7]. Some of the basic principles of protein evolution were already apparent from the earliest large-scale comparisons of protein structures and sequences. The most important of these was the unification of all proteins, despite their enormous sequence diversity, into a relatively small set of structural folds [8,9]. Domains sharing the same fold display essentially the same three-dimensional folding pattern and topology of core secondary structure elements. Within these folds, the proteins could be further grouped into one or more superfamilies, which are monophyletic assemblages characterized by sequence signatures and/or structural features unique to the constituent members. In turn, superfamilies are usually divided into families, compact sets of homologous proteins that share significant sequence similarity. Domains with identical or similar biochemical activities do not necessarily have the same fold, which is a strong argument for a divergent, as opposed to convergent, origin of each fold [10,11]. Taken together, these observations imply that the majority of proteins descended from a relatively small set of ancestral domains through divergent evolution. Several schemes have successfully exploited the above principles to classify available protein structures. Some of these classification systems, such as FSSP [12] and CATH [13], employ automatic clustering based on pairwise comparisons of protein backbone (Cα) atoms. In contrast, SCOP [8,14] is based on careful case-by-case comparative analysis of individual structures and often captures subtle relationships that are not easily detected by automatic procedures alone. 
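The automatic schemes mentioned above group structures on the basis of pairwise comparison scores. The following is a minimal sketch of that general idea only, using plain single-linkage clustering over a score table. The domain labels, the scores and the threshold are hypothetical, and this is not the actual FSSP or CATH procedure.

```python
from collections import defaultdict


def single_linkage_clusters(names, pair_scores, threshold):
    """Group items linked by any pairwise score >= threshold (union-find)."""
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(list)
    for name in names:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical domain labels and similarity scores, for illustration only.
names = ["dom_A", "dom_B", "dom_C", "dom_D"]
pair_scores = {
    ("dom_A", "dom_B"): 8.5,
    ("dom_B", "dom_C"): 6.1,
    ("dom_A", "dom_D"): 1.2,
    ("dom_C", "dom_D"): 0.9,
}

clusters = single_linkage_clusters(names, pair_scores, threshold=5.0)
print(clusters)
```

With single linkage, dom_A and dom_C end up in the same cluster through dom_B even though they were never scored against each other directly. Chaining of this kind is one reason why case-by-case inspection can give a different grouping from an automatic procedure.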
Although these fundamentals of our view of the protein world continue to be reinforced, computational analysis of the wealth of data produced by genomic and structural studies leads to a finer understanding of the tendencies of protein and domain evolution at various levels. In particular, higher order relationships among proteins hitherto considered as having different, unrelated folds are becoming apparent. This new understanding sheds some light on the earliest events of protein evolution, stretching back to the time when polypeptides should have been taking over from the ancestral RNA world. On many occasions, the divergence of superfamilies within a given fold, including trends in the elaboration of the structural core during evolution, also can be understood in considerable detail. Furthermore, the principles behind the evolutionary mobility of domains (resulting in domain accretion in multidomain proteins) [1], the rapid sequence divergence associated with the reallocation of functions and the emergence of several distinct functions among relatively close members of a protein family often come into focus [15•]. Here, we briefly discuss some emerging trends in protein evolution, using as case stories comparisons of sequences and structures that have become available in the past two years.

Vastly different fates of homologous domains

Figure 1. Structures of five distinct versions of the HEH fold domain and domain architectures of the corresponding proteins. The HEH domains were detected using the PSI-BLAST program with a series of position-specific scoring matrices specific for various forms of HEH and for the LEM domain ([17] and references therein). Domains are denoted by distinct shapes and colors. aatrsII, class II aaRS; AP-end, apurinic endonuclease; H, HEH domain; K-TRS, LysRS; Ku, core domain of the DNA-binding protein Ku; McrA, McrA family nuclease; Miz, Miz finger domain (a predicted E3 component of ubiquitin ligase); OB, OB fold nucleic-acid-binding domain; PARP, poly(ADP-ribose) polymerase; RRM, RNA recognition motif; SPRY, poorly characterized conserved domain originally identified in SP1 and ryanodine receptors; TM, transmembrane domain; vWA, von Willebrand A domain. The connectors point from the structural models to depictions of the domain architectures of proteins containing the respective forms of the HEH domain.

A remarkable aspect of protein evolution that has been well illustrated by recent developments is that different versions of the same domain often have vastly different evolutionary trajectories. A case in point is the small (~40 amino acid residues) helix-extension-helix (HEH) domain, which was originally identified in eukaryotic chromatin-associated proteins containing the SAP domain (one of the distinct HEH fold domains), bacterial transcription termination factor Rho (Rho-N domain located at the N terminus of the Rho proteins), lysyl-tRNA synthetase (LysRS) and endonuclease VII ([16,17]; Figure 1). The structure of the LEM domain, a DNA-binding module from nuclear membrane proteins, indicated that this domain also has the HEH fold [18,19•,20•]. All HEH domains are known (or predicted) to bind nucleic acids, suggesting that this was the ancestral function of the HEH fold [17,18,19•]. However, different (super)families of HEH domains dramatically differ in terms of the diversity of the domain architectures of the corresponding proteins. The version of the HEH fold present in LysRS was not detected in any other context.
The Rho-N domain and the LEM domain show a slightly greater range of architectures, either as small, stand-alone HEH fold proteins or as fusions to a small set of other domains. In contrast, the SAP domain is found in at least seven distinct protein architectures, in each case tethering different domains to eukaryotic chromatin [16]. It seems that the HEH domain present in LysRS was narrowly adapted to the specific function of this protein and was unsuitable for reutilization for other functions. In contrast, the Rho-N, LEM and SAP domains appear to have occupied more or less generic niches as nucleic-acid-binding moieties, enabling them to contribute to different functions, for example, in the complex eukaryotic chromatin.

Figure 2. Structures of distinct forms of the PAS-like fold (panels: Sec22, D-aminopeptidase, PAS, GAF and profilin). The β strands of the PAS-like domain are designated S1–S5. ‘Peptide interaction’, ‘enzymatic’ and ‘small molecule binding’ refer to the principal functions of different domains that have this fold.

Another variation on this theme became apparent with the discovery of the diversity of proteins containing PAS-like domains. This fold is present in protein domains with an enormous diversity of biochemical functions: dedicated small-molecule-binding domains (SMBDs), such as the PAS, GAF and CACHE superfamilies [21•,22,23]; enzymes, such as β-lactamases/D-aminopeptidases [24]; and peptide-binding domain superfamilies, such as profilins, globular domains of synaptobrevins and the Roadblock-MglB protein superfamily ([25•,26•,27,28]; Figure 2). Two SMBD superfamilies (PAS and GAF domains) and the Roadblock-MglB proteins are represented in all three primary kingdoms (bacteria, archaea and eukaryotes), suggesting that they had already diverged from each other in the last universal common ancestor (LUCA) of modern life forms. The β-lactamase-related enzymes emerged early in bacterial evolution, whereas synaptobrevins and profilins emerged at the onset of eukaryotic evolution. Although GAF and PAS domains diverged from each other at an early stage of evolution, they were recruited to perform analogous functions, as sensors of diverse ligands, in a mélange of evolutionarily variable signaling pathways. Accordingly, PAS and GAF domains have undergone massive proliferation to form large superfamilies exhibiting numerous lineage-specific combinations with signaling domains, such as protein kinases, nucleotide cyclases or phosphodiesterases, and DNA-binding domains [21•,29]. On the other end of the spectrum, the peptide-binding domains with this fold were chiefly recruited as structural components of the cytoskeletal and trafficking apparatus. These proteins typically belong to much smaller superfamilies, often including only one or a few highly conserved sets of orthologs. The enzymes that have a PAS-like fold, the β-lactamases, face a different range of constraints, the key feature being the conservation of the active site residues, whereas the rest of the sequence may show great divergence.

The above examples show that the demography of individual domain superfamilies within versatile folds depends on the original functional niche colonized by the progenitor of each superfamily. Domains associated with complex regulatory functions, such as signaling or chromatin dynamics, tend to proliferate via multiple duplications, giving rise to a vast diversity of sequences and multidomain architectures.
The emergence of complex regulatory functions during evolution seems to involve the recruitment of a versatile domain, followed by a positive feedback cycle, which allows the proliferation of the domain and results in the creation of new, related functional niches through the domain’s diversification. In contrast, domain forms associated with specific, critical functions tend to evolve under strong purifying selection forces and consequently show little diversity in sequence or domain architectures. Those domain versions that adopt catalytic functions typically show sequence conservation in the vicinity of the active sites but diverge in other regions of the molecule, producing new substrate specificities.

Structural and functional diversification of folds through the elaboration of simple ancient cores

New structures of nucleic acid polymerases, including the translesion DNA repair polymerases of the DinB family [30••,31••], the RNA-dependent RNA polymerase (RDRP) from the double-stranded (ds) RNA phage φ6 [32••] and the eukaryotic primase (EP) [33••], became available during the past year. These structures, combined with detailed sequence analyses of the respective superfamilies, throw light on the evolution of complex structures from relatively simple ancestral cores.

The DNA polymerases of the A, B and Y (DinB-like) superfamilies, the RDRPs of positive-strand and dsRNA viruses, the reverse transcriptases and the nucleotide cyclases all share a common catalytic core, the ‘palm domain’ [9,34]. The core of the palm domain is a (βαβ)2 unit, with two conserved acidic residues at the end of strand 1 and in the loop between strands 2 and 3, that chelate two divalent cations (Figure 3). All these enzymes have similar metal-dependent catalytic mechanisms, suggesting that they evolved from a common ancestor that already possessed polymerase activity. Most of the polymerases additionally have extensions to or insertions into the palm domain, the ‘finger modules’, which ensure tighter interactions with the template nucleic acid. The finger modules of different polymerase superfamilies often appear to be unrelated and typically have a high α-helical content.

Recent comparative genome analysis resulted in the prediction of a new family of DNA polymerases that are distinct from all previously described polymerase superfamilies, are present predominantly in thermophilic archaea and bacteria, and appear to be key components of a previously undetected, thermophile-specific DNA repair system. These predicted polymerases have abbreviated finger modules and, in this respect, resemble the nucleotide cyclases [35•].

The palm domain of DNA and RNA polymerases appears to be a version of the RNA recognition motif (RRM)-like fold that is seen in ancient nucleic-acid-binding domains, such as ribosomal protein S6 [36] and pseudouridine synthases [37,38•]. These observations suggest that the ancestral polymerase probably evolved from a nucleic-acid-binding domain that might have functioned as an accessory to a self-replicating nucleic acid. Subsequently, the metal-binding active site probably evolved within the ancestral palm domain, along with the intrinsic catalytic activity. Finger domains appear to have been independently inserted, after the divergence of the major polymerase lineages, as adaptations for fine-tuning the polymerases in the context of their specific template and biological functions. In most cases, the finger domains appear to have emerged from coiled-coil structures that condensed to more complex globular units. Alternatively, some finger domains apparently have been derived from the highly mobile zinc ribbon domain or from other small, metal-binding modules.
Figure 3. Topology of (a) the classic palm domain of DNA and RNA polymerases, and (b) the version of the palm domain present in EPs. The four core β strands of the palm domain are labeled S1–S4 and the helices are depicted as cylinders. The sidechains shown in stick representation are the active site residues and the coordinated Mg2+ ions are shown as blue circles. The zinc flap is a small zinc-coordinating module with a CxHx(n)CxxC signature inserted into the palm domain of EPs.

The structure of EP [33••] revealed that it had no relationship with the catalytic TOPRIM domain present in bacterial-type (DnaG-like) primases and in a variety of topoisomerases and nucleases [39–41]. Also, the catalytic domain of EP showed no immediately apparent relationship with the palm domain of DNA and RNA polymerases. However, the detection of highly divergent bacterial homologs of the EPs helped in defining the conserved core shared by all members of this superfamily [42]; a structural comparison of this conserved core with other known structures revealed a similarity to the classic palm domains. Two active site motifs of EPs, the N-terminal DxD motif and the central RXXH motif, perfectly match the two active site motifs of the palm domain, although the equivalent of strand 4 in the latter is highly distorted in the EPs (Figure 3). The bacterial homologs of the EPs contain only the core domain without any inserts, whereas the archaeal and eukaryotic versions have prominent inserts, such as a small metal-chelating module (e.g. zinc flap) and a large α-helical unit (e.g. finger module). Thus, it appears likely that the EPs also evolved from an ancestral (βαβ)2 RRM-like nucleic-acid-binding core by distortion and elaboration through the insertion of α-helical modules.
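The short signatures quoted for the EPs (DxD, RXXH and the zinc-flap pattern with two fixed cysteine pairs around a variable spacer) can be read as simple sequence patterns in which x stands for any residue. The sketch below is only an illustration, under stated assumptions: the toy sequence is invented, and the spacer range of 1 to 20 residues is an arbitrary choice, since the text does not give one.

```python
import re

# "x"/"X" in a signature means "any residue", written "." in a regular expression.
# ASSUMPTION: the variable zinc-flap spacer is taken as 1-20 residues; the
# source text does not state the actual range.
SIGNATURES = {
    "DxD": r"D.D",
    "RXXH": r"R..H",
    "zinc_flap": r"C.H.{1,20}C..C",
}


def find_signatures(sequence):
    """Return {signature name: [0-based start positions]} for one sequence."""
    hits = {}
    for name, pattern in SIGNATURES.items():
        # A lookahead reports every start position, including overlapping ones.
        scanner = re.compile(f"(?=({pattern}))")
        hits[name] = [match.start() for match in scanner.finditer(sequence)]
    return hits


# Invented toy sequence, not taken from any real primase.
toy_sequence = "MKDLDAAGRSTHLLCAHGGSPKCEECTT"
hits = find_signatures(toy_sequence)
print(hits)
```

Patterns this short match by chance in many unrelated proteins, so a hit is at most a hint. The relationship described in the text rests on structural comparison of the conserved core, not on motif matching alone.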
The active site of the EPs, while occupying a similar spatial configuration, appears to have evolved differently from the classic palm domains: the EPs contain an acidic and basic residue pair in place of the two acidic residues of the palm domains of other polymerases. In addition to the unusual structure of the region equivalent to strand 4, the EPs contain a conserved acidic residue associated with the active site that has no equivalents in classic palm domains (Figure 3).

Early evolution of proteins: glimpses of the RNA world

Figure 4. An evolutionary tree of the ‘Rossmannoid’ nucleotide-binding domains, with an emphasis on the evolution of the HUP domain. The letters indicate major clades that were delineated through sequence and structure comparisons: (a) HUP superclass; (b) USPA-photolyase-ETFP class within the HUP superclass; (c) HNTases and class I aaRS; (d) TOPRIM class, including DnaG-like primases, topoisomerases IA, II and VI, and Old-family nucleases; (e) haloacid dehalogenase class, including P-type ATPases and diverse phosphoesterases; (f) a clade comprising the Receiver and TIR (an adaptor domain involved in eukaryotic programmed cell death) domains involved in different forms of signaling; (g) the typical Rossmann fold class, including FAD/NAD-dependent oxidoreductases, SAM-dependent methyltransferases and Sir2-like NAD-dependent enzymes. The earliest stages of evolution leading to the divergence of the major classes from each other are presently not amenable to reconstruction and are presented as an unresolved ‘big bang’ event. The yellow circle represents the LUCA of modern life forms and the branchings shown in the figure roughly depict the diversity of each class of domain that is traceable to LUCA. vWA, von Willebrand A domain.

A growing number of recently determined structures of nucleotide-binding domains appear to be variations of the basic ‘Rossmannoid’ domain, which consists of multiple α/β units arranged as a three-layered sandwich. The Rossmannoids include NAD- and FAD-dependent oxidoreductases (enzymes with the prototype Rossmann fold), SAM-dependent methyltransferases [43], the Sir2-like NAD-dependent enzymes [44•,45•], the FtsZ-tubulin family of GTPases [46], the TOPRIM domains [39–41], the haloacid dehalogenase-type hydrolases [47,48], PP-loop-containing ATPases (PP-ATPases) [49], USPA-like domains [50,51,52••], photolyases, and class I aminoacyl-tRNA synthetases (aaRS) and the related HIGH nucleotidyl transferases (HNTases) [53]. Given their similar topologies and equivalent spatial locations of the nucleotide-binding sites, it appears likely that all these domain superfamilies evolved from a primordial, generic, nucleotide-binding domain.

Attempts were undertaken to detect footprints of some of the divergence events that resulted in the emergence of clearly recognizable extant domain superfamilies from the hypothetical ancestral nucleotide-binding protein. A combination of sequence and structure comparison methods, along with formal cladistic analysis of structural and sequence features, showed that the PP-ATPases, USPA-like domains, photolyases, electron transfer flavoproteins (ETFPs) and class I aaRS-HNTase superfamilies comprised a monophyletic assemblage, to the exclusion of all other Rossmannoid domains [54••].
It appears that, within this HUP domain class (after HNTase, USPA-like, PP-ATPase), the PP-ATPases diverged first, followed by the USPA-photolyase-ETFP assemblage and finally aaRS and HNTases (Figure 4). The ancestor of the HUP class probably was a generic enzyme that bound ATP and hydrolyzed it to AMP. Comparative genomics reveals that LUCA already encoded 15–18 members of the HUP domain class, including nine aaRS and one RNA-modifying enzyme (thiouridine synthase) [54••]. A corollary of these observations is that several rounds of duplication and divergence of the HUP domains antedate LUCA, and that, during this phase of early evolution, the ancestral, nonspecific HUP proteins performed multiple nucleotide-dependent functions, currently allotted to their distinct descendants. However, it seems unlikely, if not outright impossible, that the ancestral HUP domains with generic ATP pyrophosphatase and/or nucleotidyltransferase activity could, all on their own, catalyze reactions requiring highly specific molecular interactions, such as tRNA aminoacylation or thiouridylation of unique bases in RNA. Furthermore, before the divergence of the aaRS, which form just a terminal branch in the evolutionary tree of HUP domains (Figure 4), HUP domain proteins could not confer specificity on the translation process in the fashion that modern aaRS do. The hypothesis that, in the ancestral translation system, many functions currently performed by proteins relied on RNAs, including the ancestors of rRNA and tRNA [55], offers a plausible solution to this conundrum. Given that, even at this early stage of evolution, the HUP domains already had catalytic and ligand-binding capabilities, the RNAs were probably mainly responsible for the specificity of the corresponding reactions. Several additional rounds of duplication and divergence appear to separate the ancestral HUP domain from the deeper and even less specific common ancestors of large groups of Rossmannoid domains (Figure 4).
Thus, it appears that a fairly efficient and accurate translation system existed in the ancient RNA world, with, at best, assistance from proteins that had low biochemical specificity. From a complementary perspective, this and similar analyses of other protein classes suggest that much of the diversity of protein domains had already evolved within the RNA world. With their functional specificity increasing via multiple rounds of duplication and divergence, proteins gradually displaced the ancient RNAs from most of their functions, except for a few truly indispensable ones, such as translation elongation.

Conclusions
Recent analyses of sequences and structures suggest several generalizations regarding the mode of protein evolution. The same functionally versatile protein fold often includes both highly populated and diverse, and sparse superfamilies. It appears that regulatory functions and highly diversified protein superfamilies co-evolve: duplications within a superfamily generate new functions and these new functional niches, via a positive feedback loop, allow further selection of additional members of the same superfamily. Structural and sequence comparisons also begin to clarify how some of the most complex cellular machines, such as nucleic acid polymerases, have evolved on multiple occasions by the elaboration of simple ancestral cores through inserts. These inserts appear to have been independently acquired by different polymerases and have converged to similar functions related to template interaction, despite their distinct structures. Finally, comparative analysis of deep relationships among protein families may throw light on the early evolution of proteins in the RNA world. Ancestors of a large number of domains in modern proteins appear to have evolved well before the extant, predominantly protein-based translation system was established.
Thus, proteins with generic biochemical properties probably collaborated with catalytic RNAs early in evolution, with the latter responsible for the specificity of molecular interactions. This primordial stage of evolution of the protein world was followed by functional diversification of proteins and the gradual displacement of RNAs from enzymatic functions. Future developments in comparative genomics and structural analysis are likely to shed even more light on the actual details surrounding these fundamental evolutionary events.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Koonin EV, Aravind L, Kondrashov AS: The impact of comparative genomics on our understanding of evolution. Cell 2000, 101:573-576.

2. •• International Human Genome Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
The draft sequence of the human genome is accompanied by a fairly detailed analysis of the encoded proteins. Probably the greatest surprise brought about by the human genome sequence is the relatively small number of predicted protein-coding genes, 30 000–40 000, which is only about 50% more than the nematode Caenorhabditis elegans. However, comparison of domain organizations of orthologous proteins shows that domain accretion (accumulation of additional domains), particularly in signaling and regulatory proteins, makes a substantial contribution to the greater complexity of vertebrate proteomes compared to those of simpler eukaryotes.

3. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA: From structure to function: approaches and limitations. Nat Struct Biol 2000, 7(suppl):991-994.

4. Montelione GT, Zheng D, Huang YJ, Gunsalus KC, Szyperski T: Protein NMR spectroscopy in structural genomics. Nat Struct Biol 2000, 7(suppl):982-985.

5. Brenner SE: Target selection for structural genomics. Nat Struct Biol 2000, 7(suppl):967-969.

6. Sanchez R, Pieper U, Melo F, Eswar N, Marti-Renom MA, Madhusudhan MS, Mirkovic N, Sali A: Protein structure modeling for structural genomics. Nat Struct Biol 2000, 7(suppl):986-990.

7. Koonin EV, Wolf YI, Aravind L: Protein fold recognition using sequence profiles and its application in structural genomics. Adv Protein Chem 2000, 54:245-275.

8. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247:536-540.

9. Murzin AG: How far divergent evolution goes in proteins. Curr Opin Struct Biol 1998, 8:380-387.

10. Doolittle RF: Convergent evolution: the need to be explicit. Trends Biochem Sci 1994, 19:15-18.

11. Galperin MY, Walker DR, Koonin EV: Analogous enzymes: independent inventions in enzyme evolution. Genome Res 1998, 8:779-790.

12. Holm L, Sander C: The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res 1996, 24:206-209.

13. Orengo CA, Bray JE, Buchan DW, Harrison A, Lee D, Pearl FM, Sillitoe I, Todd AE, Thornton JM: The CATH protein family database: a resource for structural and functional annotation of genomes. Proteomics 2002, 2:11-21.

14. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002, 30:264-267.

15. • Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307:1113-1143.
Analysis of the functional variation of homologous enzyme superfamilies containing two or more enzymes, as defined by the CATH protein structure classification, using the Enzyme Commission (EC) scheme revealed that the majority of superfamilies display variation in enzyme function. In many cases, this functional diversity could be correlated with local sequence variation and domain shuffling. Typically, substrate specificity is diverse across a superfamily, although the reaction chemistry is usually conserved.

16. Aravind L, Koonin EV: SAP - a putative DNA-binding motif involved in chromosomal organization. Trends Biochem Sci 2000, 25:112-114.

17. Aravind L, Koonin EV: Prokaryotic homologs of the eukaryotic DNA-end-binding protein Ku, novel domains in the Ku protein and prediction of a prokaryotic double-strand break repair system. Genome Res 2001, 11:1365-1374.

18. Cai M, Huang Y, Ghirlando R, Wilson KL, Craigie R, Clore GM: Solution structure of the constant region of nuclear envelope protein LAP2 reveals two LEM-domain structures: one binds BAF and the other binds DNA. EMBO J 2001, 20:4399-4407.

19. • Wolff N, Gilquin B, Courchay K, Callebaut I, Worman HJ, Zinn-Justin S: Structural analysis of emerin, an inner nuclear membrane protein mutated in X-linked Emery-Dreifuss muscular dystrophy. FEBS Lett 2001, 501:171-176.
Structural analysis of the N-terminal domain of emerin, which is involved in binding laminin A/C, reveals that this domain has the LEM fold. A conserved solvent-exposed surface was detected in different LEM domains. Further structural comparisons allowed the unification of LEM with HEH domains.

20. • Laguri C, Gilquin B, Wolff N, Romi-Lebrun R, Courchay K, Callebaut I, Worman HJ, Zinn-Justin S: Structural characterization of the LEM motif common to three human inner nuclear membrane proteins. Structure 2001, 9:503-511.
The solution of the three-dimensional structures of the LEM and LEM-like domains of lamina-associated protein 2 showed that both domains adopt the same fold (largely composed of two parallel α helices). The authors hypothesize that LEM is a protein–protein interaction domain and that the conserved region of LEM domains present at the surface of helix 2 could mediate interaction between LEM and a common protein partner.

21. • Anantharaman V, Koonin EV, Aravind L: Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J Mol Biol 2001, 307:1271-1292.
Intracellular SMBDs show notable evolutionary mobility and substantially contribute to the generation of lineage-specific domain architectures. Analysis of 21 SMBDs described in this paper revealed numerous instances of the re-invention of similar domain architectures involving functionally related but not homologous domains. This suggests that similar selective forces have acted on various SMBDs, which has resulted in the formation of multidomain proteins that fit a limited number of functional stereotypes. Analysis of protein domain architecture resulted in the prediction of the functions and modes of regulation of a variety of uncharacterized proteins.

22. Ho YS, Burden LM, Hurley JH: Structure of the GAF domain, a ubiquitous signaling motif and a new class of cyclic GMP receptor. EMBO J 2000, 19:5288-5299.

23. Anantharaman V, Aravind L: Cache - a signaling domain common to animal Ca(2+)-channel subunits and a class of prokaryotic chemotaxis receptors. Trends Biochem Sci 2000, 25:535-537.

24. Goffin C, Ghuysen JM: Multimodular penicillin-binding proteins: an enigmatic family of orthologs and paralogs. Microbiol Mol Biol Rev 1998, 62:1079-1093.

25. • Tochio H, Tsui MM, Banfield DK, Zhang M: An autoinhibitory mechanism for nonsyntaxin SNARE proteins revealed by the structure of Ykt6p. Science 2001, 293:698-702.
This paper shows that the structure of the N-terminal domain of the nonsyntaxin SNARE protein Ykt6p is unrelated to the syntaxin structure, but instead resembles the fold of the actin regulatory protein profilin. However, similar to syntaxins, Ykt6p forms a fold-back conformation, whereby the N-terminal domain binds the C terminus of the protein. The N-terminal domain of Ykt6p is shown to contribute to the assembly of SNARE complexes.

26. • Gonzalez LC, Weis WI, Scheller RH: A novel snare N-terminal domain revealed by the crystal structure of Sec22b. J Biol Chem 2001, 276:24203-24211.
In this paper, the structure of the N-terminal domain of Sec22b, a nonsyntaxin SNARE involved in endoplasmic reticulum/Golgi membrane trafficking, is described as an apparent circular permutation of a profilin-like domain. The analysis of conserved residues in the N-terminal domain of Sec22b led to the prediction of the binding site.

27. Nodelman IM, Bowman GD, Lindberg U, Schutt CE: X-ray structure determination of human profilin II: a comparative structural analysis of human profilins. J Mol Biol 1999, 294:1271-1285.

28. Koonin EV, Aravind L: Dynein light chains of the Roadblock/LC7 group belong to an ancient protein superfamily implicated in NTPase regulation. Curr Biol 2000, 10:R774-R776.

29. Taylor BL, Zhulin IB: PAS domains: internal sensors of oxygen, redox potential, and light. Microbiol Mol Biol Rev 1999, 63:479-506.

30. •• Silvian LF, Toth EA, Pham P, Goodman MF, Ellenberger T: Crystal structure of a DinB family error-prone DNA polymerase from Sulfolobus solfataricus. Nat Struct Biol 2001, 8:984-989.
This paper, together with [31••], reports the crystal structure of the Sulfolobus solfataricus Dbh protein (Dpo4), which provides insights into the mechanism and evolutionary relationships of the Y (DinB-like) family of DNA polymerases, which includes error-prone translesion polymerases. The Dbh protein has a palm domain similar to those of other polymerases but, unlike high-fidelity polymerases, has small fingers, which gives the active site an open appearance, suggesting less limitations on mispairing.

31. •• Ling H, Boudsocq F, Woodgate R, Yang W: Crystal structure of a Y-family DNA polymerase in action: a mechanism for error-prone and lesion-bypass replication. Cell 2001, 107:91-102.
In this work, crystal structures of Dpo4 in ternary complexes with DNA and a correct or incorrect incoming nucleotide are analyzed. It is shown that, although Dpo4 has a palm domain similar to those of other polymerases, it makes only limited and unspecific contacts with the replicating base pair, which results in the relaxed selection of the base to be incorporated. In addition, the crystal structure of Dpo4 translocating two template bases to the active site at once suggests a mechanism for bypassing thymine dimers.

32. •• Butcher SJ, Grimes JM, Makeyev EV, Bamford DH, Stuart DI: A mechanism for initiating RNA-dependent RNA polymerization. Nature 2001, 410:235-240.
Analysis of the crystal structure of the RDRP from the dsRNA bacteriophage φ6 showed a specific relationship with the RDRP from hepatitis C virus (HCV), supporting the monophyly of dsRNA viruses and positive-strand RNA viruses. In addition, this paper describes structures of the complexes between φ6 RDRP and the RNA template and substrates, revealing unusual details of the polymerization mechanism. Initiation of RNA synthesis by this enzyme involves two molecules of NTPs, one of which functions as the primer.

33. •• Augustin MA, Huber R, Kaiser JT: Crystal structure of a DNA-dependent RNA polymerase (DNA primase). Nat Struct Biol 2001, 8:57-61.
The determination of the crystal structure of the catalytic subunit of DNA primase from the archaeon Pyrococcus furiosus described in this paper confirmed the conclusion, previously reached on the basis of sequence comparisons, that there is no significant similarity and, accordingly, no evolutionary relationship between archaeo-eukaryotic and bacterial primases. It was noticed that, although the constellation of amino acid residues in the active center of the archaeal primase resembles those in the active centers of other polymerases, their core folds appear to be unrelated. However, more detailed comparisons described in this review reveal the presence of a palm domain in the archaeo-eukaryotic primases.
34. Wang J, Sattar AK, Wang CC, Karam JD, Konigsberg WH, Steitz TA: Crystal structure of a pol alpha family replication DNA polymerase from bacteriophage RB69. Cell 1997, 89:1087-1099.
35. • Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV: A DNA repair system specific for thermophilic archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res 2002, 30:482-496. A comparison of gene order in prokaryotic genomes using a new algorithm for the identification of conserved gene arrays, combined with protein sequence and structure analysis, resulted in the detection of a conserved gene neighborhood that is present chiefly in hyperthermophilic archaea and bacteria, and is predicted to encode a DNA repair system specific to (hyper)thermophiles. One of the key components of the potential thermophile-specific repair system is a predicted DNA polymerase from a new superfamily that shows only limited sequence similarity to other polymerases, but appears to contain all the hallmarks of the palm domain and the predicted catalytic residues. This polymerase, however, is predicted to have short finger modules resembling those in translesion polymerases of the Y family. Along with the absence of Y-family polymerases in most thermophiles, this suggests that the predicted novel repair system might be a thermophile-specific functional analog of the translesion repair system, which, in mesophiles, is centered on the Y-family polymerases.
36. Agalarov SC, Sridhar Prasad G, Funke PM, Stout CD, Williamson JR: Structure of the S15,S6,S18-rRNA complex: assembly of the 30S ribosome central domain. Science 2000, 288:107-113.
37. Foster PG, Huang L, Santi DV, Stroud RM: The structural basis for tRNA recognition and pseudouridine formation by pseudouridine synthase I. Nat Struct Biol 2000, 7:23-27.
38. • Hoang C, Ferre-D’Amare AR: Cocrystal structure of a tRNA Psi55 pseudouridine synthase: nucleotide flipping by an RNA-modifying enzyme. Cell 2001, 107:929-939. A comparison of the crystal structure of the TruB pseudouridine synthase with the structure of TruA, combined with earlier sequence comparisons, leads to the conclusion that all pseudouridine synthases evolved from a common ancestor. It is shown that the recognition of the T loop of the substrate tRNAs by TruB is driven by shape complementarity. The reaction mechanism involves flipping of the modified uracil, a feature that is common to many nucleic acid modification enzymes.
39. Aravind L, Leipe DD, Koonin EV: Toprim - a conserved catalytic domain in type IA and II topoisomerases, DnaG-type primases, OLD family nucleases and RecR proteins. Nucleic Acids Res 1998, 26:4205-4213.
40. Podobnik M, McInerney P, O’Donnell M, Kuriyan J: A TOPRIM domain in the crystal structure of the catalytic core of Escherichia coli primase confirms a structural link to DNA topoisomerases. J Mol Biol 2000, 300:353-362.
41. Keck JL, Roche DD, Lynch AS, Berger JM: Structure of the RNA polymerase domain of E. coli primase. Science 2000, 287:2482-2486.
42. Koonin EV, Wolf YI, Kondrashov AS, Aravind L: Bacterial homologs of the small subunit of eukaryotic DNA primase. J Mol Microbiol Biotechnol 2000, 2:509-512.
43. Bujnicki JM: Comparison of protein structures reveals monophyletic origin of the AdoMet-dependent methyltransferase family and mechanistic convergence rather than recent differentiation of N4-cytosine and N6-adenine DNA methylation. In Silico Biol 1999, 1:175-182.
44. • Min J, Landry J, Sternglanz R, Xu RM: Crystal structure of a SIR2 homolog-NAD complex. Cell 2001, 105:269-279. The high-resolution crystal structure of the NAD-dependent protein deacetylase SIR2 revealed two domains, a typical Rossmann fold nucleotide-binding domain and a small domain containing a zinc ribbon. A distinct mode of NAD binding, in a pocket between the two domains, is described.
45. • Finnin MS, Donigian JR, Pavletich NP: Structure of the histone deacetylase SIRT2. Nat Struct Biol 2001, 8:621-625. Analysis of the crystal structure of the human NAD-dependent histone deacetylase SIRT2 reveals the Rossmann fold nucleotide-binding domain and a smaller domain that consists of a helical module and a zinc ribbon. The probable catalytic site is identified as the groove between these two domains (see also [44••]). Another pocket formed by the helical module and lined with hydrophobic residues is proposed as the likely protein-binding site.
46. Nogales E, Downing KH, Amos LA, Lowe J: Tubulin and FtsZ form a distinct family of GTPases. Nat Struct Biol 1998, 5:451-458.
47. Wang W, Kim R, Jancarik J, Yokota H, Kim SH: Crystal structure of phosphoserine phosphatase from Methanococcus jannaschii, a hyperthermophile, at 1.8 Å resolution. Structure 2001, 9:65-71.
48. Cho H, Wang W, Kim R, Yokota H, Damo S, Kim SH, Wemmer D, Kustu S, Yan D: BeF(3)(-) acts as a phosphate analog in proteins phosphorylated on aspartate: structure of a BeF(3)(-) complex with phosphoserine phosphatase. Proc Natl Acad Sci USA 2001, 98:8525-8530.
49. Bork P, Koonin EV: A P-loop-like motif in a widespread ATP pyrophosphatase domain: implications for the evolution of sequence motifs and enzyme activity. Proteins 1994, 20:347-355.
50. Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK, Yokota H, Kim R, Kim SH: Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci USA 1998, 95:15189-15193.
51. Makarova KS, Aravind L, Galperin MY, Grishin NV, Tatusov RL, Wolf YI, Koonin EV: Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell. Genome Res 1999, 9:608-628.
52. •• Sousa MC, McKay DB: Structure of the universal stress protein of Haemophilus influenzae. Structure 2001, 9:1135-1141. The crystal structure of the bacterial UspA protein, although generally similar to the previously reported structure of the MJ0577 protein from the archaeon Methanococcus jannaschii, differs in that UspA does not seem to have an ATP-binding pocket and, in fact, has not been shown to bind ATP. It is hypothesized that the UspA protein superfamily, which has been identified on the basis of sequence conservation, comprises two distinct subsets, only one of which consists of nucleotide-binding proteins.
53. Bork P, Holm L, Koonin EV, Sander C: The cytidylyltransferase superfamily: identification of the nucleotide-binding site and fold prediction. Proteins 1995, 22:259-266.
54. •• Aravind L, Anantharaman V, Koonin EV: Monophyly of class I aminoacyl tRNA synthetase, USPA, ETFP, photolyase, and PP-ATPase nucleotide-binding domains: implications for protein evolution in the RNA world. Proteins 2002, 48:1-14. The monophyly of the catalytic domains of the class I aaRS and the HNTases, nucleotide-binding domains related to the UspA protein (USPA domains), photolyases, ETFPs and PP-ATPases was discerned by detailed structure/sequence analysis. Phyletic distribution patterns within the major lineages of the HUP (after HIGH, USPA and PP-loop) class proteins suggest that the LUCA of all modern life forms already encoded several enzymes of this class. Comparative analysis of the HUP domain proteins also allows the reconstruction of a series of early evolutionary events antedating LUCA.
55. Crick FH: The origin of the genetic code. J Mol Biol 1968, 38:367-379.