Trends in protein evolution inferred from sequence and
structure analysis
L Aravind, Raja Mazumder, Sona Vasudevan and Eugene V Koonin*
Complementary developments in comparative genomics,
protein structure determination and in-depth comparison of
protein sequences and structures have provided a better
understanding of the prevailing trends in the emergence and
diversification of protein domains. The investigation of deep
relationships among different classes of proteins involved in
key cellular functions, such as nucleic acid polymerases and
other nucleotide-dependent enzymes, indicates that a
substantial set of diverse protein domains evolved within the
primordial, ribozyme-dominated RNA world.
Addresses
National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, MD 20894, USA
*e-mail: [email protected]
Current Opinion in Structural Biology 2002, 12:392–399
0959-440X/02/$ — see front matter
Published by Elsevier Science Ltd.
Abbreviations
aaRS    aminoacyl-tRNA synthetases
ds      double-stranded
EP      eukaryotic primase
ETFP    electron transfer flavoprotein
FAD     flavin adenine dinucleotide
HEH     helix-extension-helix
HNTase  HIGH nucleotidyl transferase
LUCA    last universal common ancestor
LysRS   lysyl-tRNA synthetase
NAD     nicotinamide adenine dinucleotide
RDRP    RNA-dependent RNA polymerase
RRM     RNA recognition motif
SMBD    small-molecule-binding domain
Introduction
Within the short time span of the last six years of the 20th
century, genomic sequences and entire protein complements of organisms representing major branches of the
tree of life have become available [1,2••]. Concomitantly,
progress in X-ray crystallography and NMR techniques,
and directed efforts in the solution of structures of diverse
proteins have produced a fairly detailed map of the protein
universe [3–6]. Matching these experimental efforts,
advances in the computational analysis of proteins through
sequence profile searches allowed the detection of deep
relationships among proteins that previously were
amenable only to the direct comparison of structures [7].
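
As a rough illustration of what a sequence profile is, the sketch below builds a toy position-specific scoring matrix from a small ungapped alignment and uses it to score candidate segments. It is not the PSI-BLAST procedure: the uniform background frequencies, the fixed pseudocount, the function names and the toy alignment are all simplifying assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes an ungapped alignment of equal-length sequences and a
    uniform background distribution (a deliberate simplification).
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)  # missing residues count as 0
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical five-column alignment, invented for illustration only.
toy_alignment = ["DLDGT", "DVDGT", "DIDGS", "DLDAT"]
pssm = build_pssm(toy_alignment)

# A segment resembling the alignment outscores an unrelated one.
print(score(pssm, "DLDGT") > score(pssm, "KKKKK"))
```

Because each column is scored separately, residues that are conserved across the family contribute more than variable positions, which is the property that lets profiles pick up relationships that pairwise comparison misses.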
Some of the basic principles of protein evolution were
already apparent from the earliest large-scale comparisons
of protein structures and sequences. The most important
of these was the unification of all proteins, despite their
enormous sequence diversity, into a relatively small set of
structural folds [8,9]. Domains sharing the same fold
display essentially the same three-dimensional folding
pattern and topology of core secondary structure elements.
Within these folds, the proteins could be further grouped
into one or more superfamilies, which are monophyletic
assemblages characterized by sequence signatures and/or
structural features unique to the constituent members. In
turn, superfamilies are usually divided into families,
compact sets of homologous proteins that share significant
sequence similarity. Domains with identical or similar
biochemical activities do not necessarily have the same
fold, which is a strong argument for a divergent, as opposed
to convergent, origin of each fold [10,11]. Taken together,
these observations imply that the majority of proteins
descended from a relatively small set of ancestral domains
through divergent evolution. Several schemes have
successfully exploited the above principles to classify
available protein structures. Some of these classification
systems, such as FSSP [12] and CATH [13], employ automatic clustering based on pairwise comparisons of protein
backbone (Cα) atoms. In contrast, SCOP [8,14] is based on
careful case-by-case comparative analysis of individual
structures and often captures subtle relationships that are
not easily detected by automatic procedures alone.
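
The fold, superfamily and family levels described above can be sketched as a simple nested classification. The example below is a minimal illustration only: the class and function names are invented here, the fold and superfamily labels are taken from examples discussed in this review, and the family labels are placeholders rather than entries from SCOP, CATH or FSSP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainClass:
    fold: str
    superfamily: str
    family: str


def deepest_shared_level(a, b):
    """Return the most specific level at which two domains are grouped."""
    if a.fold != b.fold:
        return "none"
    if a.superfamily != b.superfamily:
        return "fold"
    if a.family != b.family:
        return "superfamily"
    return "family"


# Family labels below are placeholders, not real database entries.
pas = DomainClass("PAS-like", "PAS", "PAS family (placeholder)")
gaf = DomainClass("PAS-like", "GAF", "GAF family (placeholder)")
sap = DomainClass("HEH", "SAP", "SAP family (placeholder)")

print(deepest_shared_level(pas, gaf))  # same fold, different superfamilies
print(deepest_shared_level(pas, sap))  # different folds
```

Under this scheme, sharing a family implies sharing a superfamily and a fold, but not the reverse, which mirrors the nesting of the three levels in the text.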
Although these fundamentals of our view of the protein
world continue to be reinforced, computational analysis of
the wealth of data produced by genomic and structural
studies leads to a finer understanding of the tendencies of
protein and domain evolution at various levels. In particular,
higher order relationships among proteins hitherto considered as having different, unrelated folds are becoming
apparent. This new understanding sheds some light on the
earliest events of protein evolution, stretching back to the
time when polypeptides were presumably taking over
from the ancestral RNA world. On many occasions, the
divergence of superfamilies within a given fold, including
trends in the elaboration of the structural core during
evolution, also can be understood in considerable detail.
Furthermore, the principles behind the evolutionary
mobility of domains (resulting in domain accretion in
multidomain proteins) [1], the rapid sequence divergence
associated with the reallocation of functions and the
emergence of several distinct functions among relatively
close members of a protein family often come into focus [15•].
Here, we briefly discuss some emerging trends in protein
evolution, using as case studies comparisons of sequences and
structures that have become available in the past two years.
Vastly different fates of homologous domains
Figure 1 [Panel labels: SAP domain, Rho-N, Endonuclease VII, LEM domain; structural models and domain-architecture diagrams not reproduced]
Structures of five distinct versions of the HEH fold domain and domain
architectures of the corresponding proteins. The HEH domains were
detected using the PSI-BLAST program with a series of position-specific
scoring matrices specific for various forms of HEH and for the
LEM domain ([17] and references therein). Domains are denoted by
distinct shapes and colors. aatrsII, class II aaRS; AP-end, apurinic
endonuclease; H, HEH domain; K-TRS, LysRS; Ku, core domain of the
DNA-binding protein Ku; McrA, McrA family nuclease; Miz, Miz finger
domain (a predicted E3 component of ubiquitin ligase); OB, OB fold
nucleic-acid-binding domain; PARP, poly(ADP-ribose) polymerase;
RRM, RNA recognition motif; SPRY, poorly characterized conserved
domain originally identified in SP1 and ryanodine receptors;
TM, transmembrane domain; vWA, von Willebrand A domain.
The connectors point from the structural models to depictions of the
domain architectures of proteins containing the respective forms of the
HEH domain.

A remarkable aspect of protein evolution that has been
well illustrated by recent developments is that different
versions of the same domain often have vastly different
evolutionary trajectories. A case in point is the small
(~40 amino acid residues) helix-extension-helix (HEH)
domain, which was originally identified in eukaryotic
chromatin-associated proteins containing the SAP domain
(one of the distinct HEH fold domains), bacterial transcription termination factor Rho (Rho-N domain located at
the N terminus of the Rho proteins), lysyl-tRNA synthetase
(LysRS) and endonuclease VII ([16,17]; Figure 1). The
structure of the LEM domain, a DNA-binding module
from nuclear membrane proteins, indicated that this
domain also has the HEH fold [18,19•,20•]. All HEH
domains are known (or predicted) to bind nucleic acids,
suggesting that this was the ancestral function of the HEH
fold [17,18,19•]. However, different (super)families of
HEH domains dramatically differ in terms of the diversity
of the domain architectures of the corresponding proteins.
The version of the HEH fold present in LysRS was not
detected in any other context. The Rho-N domain and the
LEM domain show a slightly greater range of architectures,
either as small, stand-alone HEH fold proteins or as
fusions to a small set of other domains. In contrast, the
SAP domain is found in at least seven distinct protein
architectures, in each case tethering different domains to
eukaryotic chromatin [16]. It seems that the HEH domain
present in LysRS was narrowly adapted to the specific
function of this protein and was unsuitable for reutilization
for other functions. In contrast, the Rho-N, LEM and SAP
domains appear to have occupied more or less generic
niches as nucleic-acid-binding moieties, enabling them to
contribute to different functions, for example, in the
complex eukaryotic chromatin.
Another variation on this theme became apparent with the
discovery of the diversity of proteins containing PAS-like
domains. This fold is present in protein domains with an
enormous diversity of biochemical functions: dedicated
small-molecule-binding domains (SMBDs), such as the
PAS, GAF and CACHE superfamilies [21•,22,23];
enzymes, such as β-lactamases/D-aminopeptidases [24];
and peptide-binding domain superfamilies, such as profilins,
globular domains of synaptobrevins and the Roadblock-MglB protein superfamily ([25•,26•,27,28]; Figure 2). Two
SMBD superfamilies (PAS and GAF domains) and the
Roadblock-MglB proteins are represented in all three
primary kingdoms (bacteria, archaea and eukaryotes),
suggesting that they had already diverged from each other
in the last universal common ancestor (LUCA) of modern
life forms. The β-lactamase-related enzymes emerged
early in bacterial evolution, whereas synaptobrevins and
profilins emerged at the onset of eukaryotic evolution.

Figure 2 [Panel labels: Sec22, D-Aminopeptidase, PAS, GAF, Profilin; functional groupings: peptide interaction, enzymatic, small molecule binding; structural models not reproduced]
Structures of distinct forms of the PAS-like
fold. The β strands of the PAS-like domain are
designated S1–S5. ‘Peptide interaction’,
‘enzymatic’ and ‘small molecule binding’ refer
to the principal functions of different domains
that have this fold.

Although GAF and PAS domains diverged from each other
at an early stage of evolution, they were recruited to
perform analogous functions, as sensors of diverse ligands,
in a mélange of evolutionarily variable signaling pathways.
Accordingly, PAS and GAF domains have undergone
massive proliferation to form large superfamilies exhibiting
numerous lineage-specific combinations with signaling
domains, such as protein kinases, nucleotide cyclases or
phosphodiesterases, and DNA-binding domains [21•,29].
On the other end of the spectrum, the peptide-binding
domains with this fold were chiefly recruited as structural
components of the cytoskeletal and trafficking apparatus.
These proteins typically belong to much smaller superfamilies, often including only one or a few highly
conserved sets of orthologs. The enzymes that have a
PAS-like fold, the β-lactamases, face a different range of
constraints, the key feature being the conservation of the
active site residues, whereas the rest of the sequence may
show great divergence.
The above examples show that the demography of
individual domain superfamilies within versatile folds
depends on the original functional niche colonized by the
progenitor of each superfamily. Domains associated with
complex regulatory functions, such as signaling or
chromatin dynamics, tend to proliferate via multiple
duplications, giving rise to a vast diversity of sequences
and multidomain architectures. The emergence of
complex regulatory functions during evolution seems to
involve the recruitment of a versatile domain, followed by
a positive feedback cycle, which allows the proliferation of
the domain and results in the creation of new, related
functional niches through the domain’s diversification. In
contrast, domain forms associated with specific, critical
functions tend to evolve under strong purifying selection
forces and consequently show little diversity in sequence
or domain architectures. Those domain versions that adopt
catalytic functions typically show sequence conservation in
the vicinity of the active sites but diverge in other regions
of the molecule, producing new substrate specificities.
Structural and functional diversification of
folds through the elaboration of simple
ancient cores
New structures of nucleic acid polymerases, including the
translesion DNA repair polymerases of the DinB family
[30••,31••], the RNA-dependent RNA polymerase (RDRP)
from the double-stranded (ds) RNA phage φ6 [32••] and
the eukaryotic primase (EP) [33••], became available
during the past year. These structures, combined with
detailed sequence analyses of the respective superfamilies,
throw light on the evolution of complex structures from
relatively simple ancestral cores. The DNA polymerases of
the A, B and Y (DinB-like) superfamilies, the RDRPs of
positive-strand and dsRNA viruses, the reverse transcriptases
and the nucleotide cyclases all share a common catalytic
core — the ‘palm domain’ [9,34]. The core of the palm
domain is a (βαβ)2 unit, with two conserved acidic residues
at the end of strand 1 and in the loop between strands 2
and 3, that chelate two divalent cations (Figure 3). All these
enzymes have similar metal-dependent catalytic mechanisms, suggesting that they evolved from a common
ancestor that already possessed polymerase activity. Most
of the polymerases additionally have extensions to or
insertions into the palm domain, the ‘finger modules’,
which ensure tighter interactions with the template
nucleic acid. The finger modules of different polymerase
superfamilies often appear to be unrelated and typically
have a high α-helical content.
Recent comparative genome analysis resulted in the
prediction of a new family of DNA polymerases that are
distinct from all previously described polymerase superfamilies, are present predominantly in thermophilic archaea
and bacteria, and appear to be key components of a previously undetected, thermophile-specific DNA repair system.
These predicted polymerases have abbreviated finger modules
and, in this respect, resemble the nucleotide cyclases [35•].
The palm domain of DNA and RNA polymerases appears
to be a version of the RNA recognition motif (RRM)-like
fold that is seen in ancient nucleic-acid-binding domains,
such as ribosomal protein S6 [36] and pseudouridine
synthases [37,38•]. These observations suggest that the
ancestral polymerase probably evolved from a nucleic-acid-binding domain that might have functioned as an
accessory to a self-replicating nucleic acid. Subsequently,
the metal-binding active site probably evolved within the
ancestral palm domain, along with the intrinsic catalytic
activity. Finger domains appear to have been independently
inserted, after the divergence of the major polymerase
lineages, as adaptations for fine-tuning the polymerases in
the context of their specific template and biological functions.
In most cases, the finger domains appear to have emerged
from coiled-coil structures that condensed to more complex
globular units. Alternatively, some finger domains apparently
have been derived from the highly mobile zinc ribbon
domain or from other small, metal-binding modules.
The structure of EP [33••] revealed that it had no relationship with the catalytic TOPRIM domain present in
bacterial-type (DnaG-like) primases and in a variety of
topoisomerases and nucleases [39–41]. Also, the catalytic
domain of EP showed no immediately apparent relationship with the palm domain of DNA and RNA polymerases.
However, the detection of highly divergent bacterial
homologs of the EPs helped in defining the conserved core
shared by all members of this superfamily [42]; a structural
comparison of this conserved core with other known
structures revealed a similarity to the classic palm
domains. Two active site motifs of EPs, the N-terminal
DxD motif and the central RXXH motif, perfectly match
the two active site motifs of the palm domain, although the
equivalent of strand 4 in the latter is highly distorted in the
EPs (Figure 3). The bacterial homologs of the EPs contain
only the core domain without any inserts, whereas the
archaeal and eukaryotic versions have prominent inserts,
such as a small metal-chelating module (e.g. zinc flap) and
a large α-helical unit (e.g. finger module). Thus, it appears
likely that the EPs also evolved from an ancestral (βαβ)2
RRM-like nucleic-acid-binding core by distortion and
elaboration through the insertion of α-helical modules. The
active site of the EPs, while occupying a similar spatial
configuration, appears to have evolved differently from the
classic palm domains: the EPs contain an acidic and basic
residue pair in place of the two acidic residues of the palm
domains of other polymerases. In addition to the unusual
structure of the region equivalent to strand 4, the EPs contain
a conserved acidic residue associated with the active site
that has no equivalents in classic palm domains (Figure 3).

Figure 3 [Panel labels: (a) fingers, strands S1–S4; (b) zinc flap, fingers, strands S1–S3 and ‘S4 equivalent’; topology diagrams not reproduced]
Topology of (a) the classic palm domain of DNA and RNA polymerases,
and (b) the version of the palm domain present in EPs. The four core β
strands of the palm domain are labeled S1–S4 and the helices are
depicted as cylinders. The sidechains shown in stick representation are
the active site residues and the coordinated Mg2+ ions are shown as
blue circles. The zinc flap is a small zinc-coordinating module with a
CxHx(n)CxxC signature inserted into the palm domain of EPs.
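
The short signatures mentioned in this section (the DxD and RXXH active site motifs and the CxHxnCxxC zinc flap signature) can be written as simple patterns and scanned for in a sequence. The sketch below is illustrative only: the spacer bounds for the zinc flap (2 to 30 residues) are an assumption, because the text does not specify n, and the sequence is invented rather than taken from any primase.

```python
import re

# Patterns for the signatures named in the text; "." matches any residue.
MOTIFS = {
    "DxD": re.compile(r"D.D"),
    "RxxH": re.compile(r"R..H"),
    # Spacer length n is not given in the text; 2-30 is an assumed range.
    "CxHx(n)CxxC": re.compile(r"C.H.{2,30}?C..C"),
}


def find_motifs(sequence):
    """Return 1-based start positions of non-overlapping motif matches."""
    return {
        name: [match.start() + 1 for match in pattern.finditer(sequence)]
        for name, pattern in MOTIFS.items()
    }


# Hypothetical sequence, made up for this example.
toy_sequence = "MADLDAAKCSHGGSPECAACLLRTVHEE"
for name, positions in find_motifs(toy_sequence).items():
    print(name, positions)
```

Patterns this short match by chance in many unrelated proteins, so a hit of this kind is only a starting point; the relationships discussed here rest on profile searches and structural comparison, not on pattern matching alone.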
Early evolution of proteins: glimpses of the
RNA world
Figure 4 [Tree diagram not reproduced. Leaf labels: vWA, TOPRIM, haloacid dehalogenase, class I aaRS, HNTase, photolyase, USPA, ETFP, PP-ATPase, Receiver, TIR, FtsZ-tubulin, Sir2, oxidoreductase, methylase. Annotations: ATP→AMP; ATP→acyl-P + ADP; H-P→D-acyl-P; GTP→GDP + P; nucleic acid; dinucleotide/bulky nucleotide; Mg2+?; big bang; LUCA]
An evolutionary tree of the ‘Rossmannoid’ nucleotide-binding
domains, with an emphasis on the evolution of the HUP domain.
The letters indicate major clades that were delineated through
sequence and structure comparisons: (a) HUP superclass;
(b) USPA-photolyase-ETFP class within the HUP superclass;
(c) HNTases and class I aaRS; (d) TOPRIM class, including
DnaG-like primases, topoisomerases IA, II and VI, and Old-family
nucleases; (e) haloacid dehalogenase class, including P-type
ATPases and diverse phosphoesterases; (f) a clade comprising the
Receiver and TIR (an adaptor domain involved in eukaryotic
programmed cell death) domains involved in different forms of
signaling; (g) the typical Rossmann fold class, including
FAD/NAD-dependent oxidoreductases, SAM-dependent
methyltransferases and Sir2-like NAD-dependent enzymes. The
earliest stages of evolution leading to the divergence of the major
classes from each other are presently not amenable to reconstruction
and are presented as an unresolved ‘big bang’ event. The yellow
circle represents the LUCA of modern life forms and the branchings
shown in the figure roughly depict the diversity of each class of
domain that is traceable to LUCA. vWA, von Willebrand A domain.

A growing number of recently determined structures of
nucleotide-binding domains appear to be variations of
the basic ‘Rossmannoid’ domain, which consists of
multiple α/β units arranged as a three-layered sandwich.
The Rossmannoids include NAD- and FAD-dependent
oxidoreductases (enzymes with the prototype Rossmann
fold), SAM-dependent methyltransferases [43], the Sir2-like
NAD-dependent enzymes [44•,45•], the FtsZ-tubulin family
of GTPases [46], the TOPRIM domains [39–41], the haloacid
dehalogenase-type hydrolases [47,48], PP-loop-containing
ATPases (PP-ATPases) [49], USPA-like domains [50,51,52••],
photolyases, and class I aminoacyl-tRNA synthetases (aaRS)
and the related HIGH nucleotidyl transferases (HNTases)
[53]. Given their similar topologies and equivalent
spatial locations of the nucleotide-binding sites, it appears
likely that all these domain superfamilies evolved from a
primordial, generic, nucleotide-binding domain.
Attempts were undertaken to detect footprints of some of
the divergence events that resulted in the emergence of
clearly recognizable extant domain superfamilies from
the hypothetical ancestral nucleotide-binding protein. A
combination of sequence and structure comparison
methods, along with formal cladistic analysis of structural
and sequence features, showed that the PP-ATPases,
USPA-like domains, photolyases, electron transfer flavoproteins (ETFPs) and class I aaRS-HNTase superfamilies
comprised a monophyletic assemblage, to the exclusion of
all other Rossmannoid domains [54••]. It appears that,
within this HUP domain class (after HNTase, USPA-like,
PP-ATPase), the PP-ATPases diverged first, followed by
the USPA-photolyase-ETFP assemblage and finally aaRS
and HNTases (Figure 4). The ancestor of the HUP class
probably was a generic enzyme that bound ATP and
hydrolyzed it to AMP. Comparative genomics reveals that
LUCA already encoded 15–18 members of the HUP
domain class, including nine aaRS and one RNA-modifying
enzyme (thiouridine synthase) [54••]. A corollary of these
observations is that several rounds of duplication and
divergence of the HUP domains antedate LUCA, and
that, during this phase of early evolution, the ancestral,
nonspecific HUP proteins performed multiple nucleotide-dependent functions, currently allotted to their distinct
descendants. However, it seems unlikely, if not outright
impossible, that the ancestral HUP domains with generic
ATP pyrophosphatase and/or nucleotidyltransferase
activity, could, all on their own, catalyze reactions
requiring highly specific molecular interactions, such as
tRNA aminoacylation or thiouridylation of unique bases in
RNA. Furthermore, before the divergence of the aaRS,
which form just a terminal branch in the evolutionary tree
of HUP domains (Figure 4), HUP domain proteins could
not confer specificity on the translation process in the
fashion that modern aaRS do. The hypothesis that, in the
ancestral translation system, many functions currently
performed by proteins relied on RNAs, including the
ancestors of rRNA and tRNA [55], offers a plausible
solution to this conundrum. Given that, even at this
early stage of evolution, the HUP domains already had
catalytic and ligand-binding capabilities, the RNAs were
probably mainly responsible for the specificity of the
corresponding reactions.
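
The branching order proposed above for the HUP class (PP-ATPases diverging first, then the USPA-photolyase-ETFP assemblage, then class I aaRS and HNTases) can be written down as a nested tree. The sketch below encodes only that topology, with no branch lengths or support values; the tuple encoding, the leaf labels and the helper name are choices made for this example.

```python
def to_newick(node):
    """Serialize a nested tuple of leaf names as a Newick subtree."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Topology as described in the text: the earliest-diverging lineage is
# the outermost leaf; aaRS and HNTases form the innermost pair.
hup_tree = (
    "PP-ATPase",
    ("USPA-photolyase-ETFP", ("class_I_aaRS", "HNTase")),
)

print(to_newick(hup_tree) + ";")
# (PP-ATPase,(USPA-photolyase-ETFP,(class_I_aaRS,HNTase)));
```

Read from the outside in, the nesting makes the argument in the text explicit: the aaRS sit on a terminal branch, so any function specific to them postdates at least two earlier divergence events within the HUP class.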
Several additional rounds of duplication and divergence
appear to separate the ancestral HUP domain from the
deeper and even less specific common ancestors of large
groups of Rossmannoid domains (Figure 4). Thus, it
appears that a fairly efficient and accurate translation
system existed in the ancient RNA world, with, at best,
assistance from proteins that had low biochemical specificity.
From a complementary perspective, this and similar
analyses of other protein classes suggest that much of the
diversity of protein domains had already evolved within
the RNA world. With their functional specificity increasing
via multiple rounds of duplication and divergence, proteins
gradually displaced the ancient RNAs from most of their
functions, except for a few truly indispensable ones, such
as translation elongation.
Conclusions
Recent analyses of sequences and structures suggest
several generalizations regarding the mode of protein
evolution. The same functionally versatile protein fold
often includes both highly populated, diverse superfamilies
and sparse ones. It appears that regulatory functions
and highly diversified protein superfamilies co-evolve:
duplications within a superfamily generate new functions
and these new functional niches, via a positive feedback
loop, allow further selection of additional members of the
same superfamily. Structural and sequence comparisons
also begin to clarify how some of the most complex
cellular machines, such as nucleic acid polymerases, have
evolved on multiple occasions by the elaboration of simple
ancestral cores through inserts. These inserts appear to
have been independently acquired by different polymerases
and have converged to similar functions related to template
interaction, despite their distinct structures. Finally,
comparative analysis of deep relationships among protein
families may throw light on the early evolution of proteins
in the RNA world. Ancestors of a large number of domains
in modern proteins appear to have evolved well before the
extant, predominantly protein-based translation system
was established. Thus, proteins with generic biochemical
properties probably collaborated with catalytic RNAs early
in evolution, with the latter responsible for the specificity
of molecular interactions. This primordial stage of evolution
of the protein world was followed by functional diversification
of proteins and the gradual displacement of RNAs from
enzymatic functions. Future developments in comparative
genomics and structural analysis are likely to shed even more
light on the actual details surrounding these fundamental
evolutionary events.
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
1. Koonin EV, Aravind L, Kondrashov AS: The impact of comparative
genomics on our understanding of evolution. Cell 2000,
101:573-576.
2. International Human Genome Consortium: Initial sequencing and
•• analysis of the human genome. Nature 2001, 409:860-921.
The draft sequence of the human genome is accompanied by a fairly detailed
analysis of the encoded proteins. Probably the greatest surprise brought
about by the human genome sequence is the relatively small number of
predicted protein-coding genes, 30 000–40 000, which is only about 50%
more than the nematode Caenorhabditis elegans. However, comparison of
domain organizations of orthologous proteins shows that domain accretion
(accumulation of additional domains), particularly in signaling and regulatory
proteins, makes a substantial contribution to the greater complexity of
vertebrate proteomes compared to those of simpler eukaryotes.
3. Thornton JM, Todd AE, Milburn D, Borkakoti N, Orengo CA: From
structure to function: approaches and limitations. Nat Struct Biol
2000, 7(suppl):991-994.
4. Montelione GT, Zheng D, Huang YJ, Gunsalus KC, Szyperski T:
Protein NMR spectroscopy in structural genomics. Nat Struct Biol
2000, 7(suppl):982-985.
5. Brenner SE: Target selection for structural genomics. Nat Struct
Biol 2000, 7(suppl):967-969.
6. Sanchez R, Pieper U, Melo F, Eswar N, Marti-Renom MA,
Madhusudhan MS, Mirkovic N, Sali A: Protein structure modeling
for structural genomics. Nat Struct Biol 2000, 7(suppl):986-990.
7. Koonin EV, Wolf YI, Aravind L: Protein fold recognition using
sequence profiles and its application in structural genomics. Adv
Protein Chem 2000, 54:245-275.
22. Ho YS, Burden LM, Hurley JH: Structure of the GAF domain, a
ubiquitous signaling motif and a new class of cyclic GMP
receptor. EMBO J 2000, 19:5288-5299.
8. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural
classification of proteins database for the investigation of
sequences and structures. J Mol Biol 1995, 247:536-540.
23. Anantharaman V, Aravind L: Cache - a signaling domain common to
animal Ca(2+)-channel subunits and a class of prokaryotic
chemotaxis receptors. Trends Biochem Sci 2000, 25:535-537.
9. Murzin AG: How far divergent evolution goes in proteins. Curr
Opin Struct Biol 1998, 8:380-387.
24. Goffin C, Ghuysen JM: Multimodular penicillin-binding proteins: an
enigmatic family of orthologs and paralogs. Microbiol Mol Biol Rev
1998, 62:1079-1093.
10. Doolittle RF: Convergent evolution: the need to be explicit. Trends
Biochem Sci 1994, 19:15-18.
11. Galperin MY, Walker DR, Koonin EV: Analogous enzymes:
independent inventions in enzyme evolution. Genome Res 1998,
8:779-790.
12. Holm L, Sander C: The FSSP database: fold classification based
on structure-structure alignment of proteins. Nucleic Acids Res
1996, 24:206-209.
13. Orengo CA, Bray JE, Buchan DW, Harrison A, Lee D, Pearl FM,
Sillitoe I, Todd AE, Thornton JM: The CATH protein family database:
a resource for structural and functional annotation of genomes.
Proteomics 2002, 2:11-21.
14. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP
database in 2002: refinements accommodate structural
genomics. Nucleic Acids Res 2002, 30:264-267.
15. Todd AE, Orengo CA, Thornton JM: Evolution of function in protein
• superfamilies, from a structural perspective. J Mol Biol 2001,
307:1113-1143.
Analysis of the functional variation of homologous enzyme superfamilies
containing two or more enzymes, as defined by the CATH protein structure
classification, using the Enzyme Commission (EC) scheme revealed that the
majority of superfamilies display variation in enzyme function. In many cases,
this functional diversity could be correlated with local sequence variation and
domain shuffling. Typically, substrate specificity is diverse across a superfamily,
although the reaction chemistry is usually conserved.
16. Aravind L, Koonin EV: SAP - a putative DNA-binding motif involved
in chromosomal organization. Trends Biochem Sci 2000,
25:112-114.
17. Aravind L, Koonin EV: Prokaryotic homologs of the eukaryotic
DNA-end-binding protein Ku, novel domains in the Ku protein and
prediction of a prokaryotic double-strand break repair system.
Genome Res 2001, 11:1365-1374.
18. Cai M, Huang Y, Ghirlando R, Wilson KL, Craigie R, Clore GM:
Solution structure of the constant region of nuclear envelope
protein LAP2 reveals two LEM-domain structures: one binds BAF
and the other binds DNA. EMBO J 2001, 20:4399-4407.
25. Tochio H, Tsui MM, Banfield DK, Zhang M: An autoinhibitory
• mechanism for nonsyntaxin SNARE proteins revealed by the
structure of Ykt6p. Science 2001, 293:698-702.
This paper shows that the structure of the N-terminal domain of the nonsyntaxin
SNARE protein Ykt6p is unrelated to the syntaxin structure, but instead
resembles the fold of the actin regulatory protein profilin. However, similar to
syntaxins, Ykt6p forms a fold-back conformation, whereby the N-terminal
domain binds the C terminus of the protein. The N-terminal domain of Ykt6p
is shown to contribute to the assembly of SNARE complexes.
26. Gonzalez LC, Weis WI, Scheller RH: A novel SNARE N-terminal
• domain revealed by the crystal structure of Sec22b. J Biol Chem
2001, 276:24203-24211.
In this paper, the structure of the N-terminal domain of Sec22b, a nonsyntaxin
SNARE involved in endoplasmic reticulum/Golgi membrane trafficking, is
described as an apparent circular permutation of a profilin-like domain. The
analysis of conserved residues in the N-terminal domain of Sec22b led to
the prediction of the binding site.
27. Nodelman IM, Bowman GD, Lindberg U, Schutt CE: X-ray structure
determination of human profilin II: a comparative structural
analysis of human profilins. J Mol Biol 1999, 294:1271-1285.
28. Koonin EV, Aravind L: Dynein light chains of the Roadblock/LC7
group belong to an ancient protein superfamily implicated in
NTPase regulation. Curr Biol 2000, 10:R774-R776.
29. Taylor BL, Zhulin IB: PAS domains: internal sensors of oxygen,
redox potential, and light. Microbiol Mol Biol Rev 1999, 63:479-506.
30. Silvian LF, Toth EA, Pham P, Goodman MF, Ellenberger T: Crystal
•• structure of a DinB family error-prone DNA polymerase from
Sulfolobus solfataricus. Nat Struct Biol 2001, 8:984-989.
This paper, together with [31••], reports the crystal structure of the
Sulfolobus solfataricus Dbh protein (Dpo4), which provides insights into the
mechanism and evolutionary relationships of the Y (DinB-like) family of DNA
polymerases, which includes error-prone translesion polymerases. The Dbh
protein has a palm domain similar to those of other polymerases but, unlike
high-fidelity polymerases, has small fingers, which gives the active site an
open appearance, suggesting fewer limitations on mispairing.
19. Wolff N, Gilquin B, Courchay K, Callebaut I, Worman HJ, Zinn-Justin S:
• Structural analysis of emerin, an inner nuclear membrane protein
mutated in X-linked Emery-Dreifuss muscular dystrophy. FEBS
Lett 2001, 501:171-176.
Structural analysis of the N-terminal domain of emerin, which is involved in
binding lamin A/C, reveals that this domain has the LEM fold. A conserved
solvent-exposed surface was detected in different LEM domains. Further
structural comparisons allowed the unification of LEM with HEH domains.
31. Ling H, Boudsocq F, Woodgate R, Yang W: Crystal structure of a
•• Y-family DNA polymerase in action: a mechanism for error-prone
and lesion-bypass replication. Cell 2001, 107:91-102.
In this work, crystal structures of Dpo4 in ternary complexes with DNA and
a correct or incorrect incoming nucleotide are analyzed. It is shown that,
although Dpo4 has a palm domain similar to those of other polymerases, it
makes only limited and nonspecific contacts with the replicating base pair,
which results in the relaxed selection of the base to be incorporated. In addition,
the crystal structure of Dpo4 translocating two template bases to the active
site at once suggests a mechanism for bypassing thymine dimers.
20. Laguri C, Gilquin B, Wolff N, Romi-Lebrun R, Courchay K, Callebaut I,
• Worman HJ, Zinn-Justin S: Structural characterization of the LEM
motif common to three human inner nuclear membrane proteins.
Structure 2001, 9:503-511.
The solution of the three-dimensional structures of the LEM and LEM-like
domains of lamina-associated protein 2 showed that both domains adopt the
same fold (largely composed of two parallel α helices). The authors hypothesize
that LEM is a protein–protein interaction domain and that the conserved
region of LEM domains present at the surface of helix 2 could mediate
interaction between LEM and a common protein partner.
32. Butcher SJ, Grimes JM, Makeyev EV, Bamford DH, Stuart DI: A
•• mechanism for initiating RNA-dependent RNA polymerization.
Nature 2001, 410:235-240.
Analysis of the crystal structure of the RDRP from the dsRNA bacteriophage
φ6 showed a specific relationship with the RDRP from hepatitis C virus
(HCV), supporting the monophyly of dsRNA viruses and positive-strand
RNA viruses. In addition, this paper describes structures of the complexes
between φ6 RDRP and the RNA template and substrates, revealing unusual
details of the polymerization mechanism. Initiation of RNA synthesis by this
enzyme involves two molecules of NTPs, one of which functions as the primer.
21. Anantharaman V, Koonin EV, Aravind L: Regulatory potential,
• phyletic distribution and evolution of ancient, intracellular
small-molecule-binding domains. J Mol Biol 2001,
307:1271-1292.
Intracellular SMBDs show notable evolutionary mobility and substantially
contribute to the generation of lineage-specific domain architectures.
Analysis of 21 SMBDs described in this paper revealed numerous instances
of the re-invention of similar domain architectures involving functionally related
but not homologous domains. This suggests that similar selective forces
have acted on various SMBDs, which has resulted in the formation of
multidomain proteins that fit a limited number of functional stereotypes.
Analysis of protein domain architecture resulted in the prediction of the functions
and modes of regulation of a variety of uncharacterized proteins.
33. Augustin MA, Huber R, Kaiser JT: Crystal structure of a
•• DNA-dependent RNA polymerase (DNA primase). Nat Struct Biol
2001, 8:57-61.
The determination of the crystal structure of the catalytic subunit of DNA
primase from the archaeon Pyrococcus furiosus described in this paper
confirmed the conclusion, previously reached on the basis of sequence
comparisons, that there is no significant similarity and, accordingly, no
evolutionary relationship between archaeo-eukaryotic and bacterial primases.
It was noticed that, although the constellation of amino acid residues in the
active center of the archaeal primase resembles those in the active centers
of other polymerases, their core folds appear to be unrelated. However, more
detailed comparisons described in this review reveal the presence of a palm
domain in the archaeo-eukaryotic primases.
34. Wang J, Sattar AK, Wang CC, Karam JD, Konigsberg WH, Steitz TA:
Crystal structure of a pol alpha family replication DNA polymerase
from bacteriophage RB69. Cell 1997, 89:1087-1099.
35. Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV: A DNA
• repair system specific for thermophilic archaea and bacteria predicted
by genomic context analysis. Nucleic Acids Res 2002, 30:482-496.
A comparison of gene order in prokaryotic genomes using a new algorithm
for the identification of conserved gene arrays, combined with protein
sequence and structure analysis, resulted in the detection of a conserved
gene neighborhood that is present chiefly in hyperthermophilic archaea and
bacteria, and is predicted to encode a DNA repair system specific to
(hyper)thermophiles. One of the key components of the potential thermophile-specific repair system is a predicted DNA polymerase from a new
superfamily that shows only limited sequence similarity to other polymerases,
but appears to contain all the hallmarks of the palm domain and the predicted
catalytic residues. This polymerase, however, is predicted to have short finger
modules resembling those in translesion polymerases of the Y family. Along
with the absence of Y-family polymerases in most thermophiles, this suggests
that the predicted novel repair system might be a thermophile-specific
functional analog of the translesion repair system, which, in mesophiles, is
centered on the Y-family polymerases.
36. Agalarov SC, Sridhar Prasad G, Funke PM, Stout CD, Williamson JR:
Structure of the S15,S6,S18-rRNA complex: assembly of the 30S
ribosome central domain. Science 2000, 288:107-113.
37. Foster PG, Huang L, Santi DV, Stroud RM: The structural basis for
tRNA recognition and pseudouridine formation by pseudouridine
synthase I. Nat Struct Biol 2000, 7:23-27.
38. Hoang C, Ferre-D’Amare AR: Cocrystal structure of a tRNA Psi55
• pseudouridine synthase: nucleotide flipping by an RNA-modifying
enzyme. Cell 2001, 107:929-939.
A comparison of the crystal structure of the TruB pseudouridine synthase
with the structure of TruA, combined with earlier sequence comparisons,
leads to the conclusion that all pseudouridine synthases evolved from a
common ancestor. It is shown that the recognition of the T loop of the
substrate tRNAs by TruB is driven by shape complementarity. The reaction
mechanism involves flipping of the modified uracil, a feature that is common
to many nucleic acid modification enzymes.
39. Aravind L, Leipe DD, Koonin EV: Toprim-a conserved catalytic domain
in type IA and II topoisomerases, DnaG-type primases, OLD family
nucleases and RecR proteins. Nucleic Acids Res 1998, 26:4205-4213.
40. Podobnik M, McInerney P, O’Donnell M, Kuriyan J: A TOPRIM domain
in the crystal structure of the catalytic core of Escherichia coli
primase confirms a structural link to DNA topoisomerases. J Mol
Biol 2000, 300:353-362.
41. Keck JL, Roche DD, Lynch AS, Berger JM: Structure of the RNA
polymerase domain of E. coli primase. Science 2000, 287:2482-2486.
42. Koonin EV, Wolf YI, Kondrashov AS, Aravind L: Bacterial homologs
of the small subunit of eukaryotic DNA primase. J Mol Microbiol
Biotechnol 2000, 2:509-512.
43. Bujnicki JM: Comparison of protein structures reveals monophyletic
origin of the AdoMet-dependent methyltransferase family and
mechanistic convergence rather than recent differentiation of
N4-cytosine and N6-adenine DNA methylation. In Silico Biol 1999,
1:175-182.
44. Min J, Landry J, Sternglanz R, Xu RM: Crystal structure of a SIR2
• homolog-NAD complex. Cell 2001, 105:269-279.
The high-resolution crystal structure of the NAD-dependent protein deacetylase
SIR2 revealed two domains, a typical Rossmann fold nucleotide-binding
domain and a small domain containing a zinc ribbon. A distinct mode of NAD
binding, in a pocket between the two domains, is described.
45. Finnin MS, Donigian JR, Pavletich NP: Structure of the histone
• deacetylase SIRT2. Nat Struct Biol 2001, 8:621-625.
Analysis of the crystal structure of the human NAD-dependent histone
deacetylase SIRT2 reveals the Rossmann fold nucleotide-binding domain
and a smaller domain that consists of a helical module and a zinc ribbon.
The probable catalytic site is identified as the groove between these two
domains (see also [44••]). Another pocket formed by the helical module and
lined with hydrophobic residues is proposed as the likely protein-binding site.
46. Nogales E, Downing KH, Amos LA, Lowe J: Tubulin and FtsZ form a
distinct family of GTPases. Nat Struct Biol 1998, 5:451-458.
47. Wang W, Kim R, Jancarik J, Yokota H, Kim SH: Crystal structure
of phosphoserine phosphatase from Methanococcus jannaschii,
a hyperthermophile, at 1.8 Å resolution. Structure 2001,
9:65-71.
48. Cho H, Wang W, Kim R, Yokota H, Damo S, Kim SH, Wemmer D,
Kustu S, Yan D: BeF(3)(-) acts as a phosphate analog in proteins
phosphorylated on aspartate: structure of a BeF(3)(-) complex
with phosphoserine phosphatase. Proc Natl Acad Sci USA 2001,
98:8525-8530.
49. Bork P, Koonin EV: A P-loop-like motif in a widespread ATP
pyrophosphatase domain: implications for the evolution of
sequence motifs and enzyme activity. Proteins 1994,
20:347-355.
50. Zarembinski TI, Hung LW, Mueller-Dieckmann HJ, Kim KK,
Yokota H, Kim R, Kim SH: Structure-based assignment of the
biochemical function of a hypothetical protein: a test case of
structural genomics. Proc Natl Acad Sci USA 1998,
95:15189-15193.
51. Makarova KS, Aravind L, Galperin MY, Grishin NV, Tatusov RL,
Wolf YI, Koonin EV: Comparative genomics of the Archaea
(Euryarchaeota): evolution of conserved protein families,
the stable core, and the variable shell. Genome Res 1999,
9:608-628.
52. Sousa MC, McKay DB: Structure of the universal stress protein of
•• Haemophilus influenzae. Structure 2001, 9:1135-1141.
The crystal structure of the bacterial UspA protein, although generally similar
to the previously reported structure of the MJ0577 protein from the archaeon
Methanococcus jannaschii, differs in that UspA does not seem to have an
ATP-binding pocket and, in fact, has not been shown to bind ATP. It is
hypothesized that the UspA protein superfamily, which has been identified
on the basis of sequence conservation, comprises two distinct subsets, only
one of which consists of nucleotide-binding proteins.
53. Bork P, Holm L, Koonin EV, Sander C: The cytidylyltransferase
superfamily: identification of the nucleotide-binding site and fold
prediction. Proteins 1995, 22:259-266.
54. Aravind L, Anantharaman V, Koonin EV: Monophyly of class I
•• aminoacyl tRNA synthetase, USPA, ETFP, photolyase, and
PP-ATPase nucleotide-binding domains: implications for protein
evolution in the RNA world. Proteins 2002, 48:1-14.
The monophyly of the catalytic domains of the class I aaRS and the
HNTases, nucleotide-binding domains related to the UspA protein (USPA
domains), photolyases, ETFPs and PP-ATPases was discerned by detailed
structure/sequence analysis. Phyletic distribution patterns within the major
lineages of the HUP (after HIGH, USPA and PP-loop) class proteins
suggest that the LUCA of all modern life forms already encoded several
enzymes of this class. Comparative analysis of the HUP domain proteins
also allows the reconstruction of a series of early evolutionary events
antedating LUCA.
55. Crick FH: The origin of the genetic code. J Mol Biol 1968,
38:367-379.