ARTS: alignment of RNA tertiary structures

Vol. 21 Suppl. 2 2005, pages ii47–ii53
doi:10.1093/bioinformatics/bti1108
BIOINFORMATICS
Protein Structure and Function
Oranit Dror1,∗, Ruth Nussinov2,3 and Haim Wolfson1
1 School of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences,
Tel Aviv University, Tel Aviv 69978, Israel, 2 Sackler Institute of Molecular Medicine, Sackler Faculty of Medicine,
Tel Aviv University, Tel Aviv 69978, Israel and 3 Basic Research Program, SAIC-Frederick, Inc., Laboratory of
Experimental and Computational Biology, Building 469, Room 151, Frederick, MD 21702, USA
ABSTRACT
Motivation: A fast-growing number of non-coding RNAs have recently
been discovered to play essential roles in many cellular processes.
Similar to proteins, understanding the functions of these active RNAs
requires methods for analyzing their tertiary structures. However, in
contrast to the wide range of structure-based approaches available for
proteins, there is still a lack of methods for studying RNA structures.
Results: We present a new computational method named ARTS
(alignment of RNA tertiary structures). The method compares two
nucleic acid structures (RNAs or DNAs) and detects a priori unknown
common substructures. These substructures can be either large
global folds containing hundreds and even thousands of nucleotides
or small local tertiary motifs with at least two successive base pairs.
To the best of our knowledge, this is the first method of this type. The
method is highly efficient and was used to conduct an all-against-all
comparison of all the RNA structures currently available in the Protein
Data Bank.
Availability: The program, a web server and supplementary
information are available at http://bioinfo3d.cs.tau.ac.il/ARTS
Contact: [email protected]
1 INTRODUCTION
There is rapidly growing evidence that RNA molecules are not solely
carriers of genetic information, but key players in a wide range of
cellular processes, such as protein synthesis and transport, RNA
processing and splicing, and chromosome replication (Storz, 2002;
Eddy, 2001; Doudna and Cech, 2002). Similar to proteins, the function of these non-coding RNAs is inferred from the 3D structures that
they form, rather than their primary sequence per se. Thus, methods for analyzing RNA structures, comparing them with one another
and identifying new motifs and folds are of utmost importance for
deciphering their biological activities (Doudna, 2000).
In the absence of RNA tertiary structures, many computational
methods for structure prediction and analysis of RNAs have been
developed to work at the secondary structure level, that is the level
of base pairing (Hofacker et al., 1994; Zuker et al., 1999; Akutsu,
2000; Gan et al., 2003). Although such methods provide excellent
starting points for exploring RNA structures, their inherent limitation
is that they are incapable of predicting tertiary motifs. These motifs
mainly involve interactions between secondary structure elements,
and are key players in establishing the global fold of an RNA (Batey
et al., 1999; Moore, 1999; Tamura et al., 2004).
∗ To whom correspondence should be addressed.
In the past few years both the number and the size of solved RNA tertiary structures have increased dramatically. This has stimulated the
development of computational methods for 3D structural analysis of
RNAs (Duarte and Pyle, 1998; Lu and Olson, 1999; Lu et al., 1999;
Gendron et al., 2001; Yang et al., 2003; Lu and Olson, 2003). Nevertheless, only a few RNA structural comparison methods are currently
available. The NASSAM method is a graph theoretic method that
searches nucleic acid structures for a given 3D pattern. It represents
each base by two vectors and a whole nucleic acid structure as a
labeled graph, where the nodes are the vector representations of the
bases and the edges are the distances between them. After representing the given 3D pattern and the structure to be searched as graphs,
the exponential-time Ullmann algorithm for subgraph isomorphism is
used to locate the pattern within the structure (Harrison et al., 2003).
Another method is PRIMOS (Duarte et al., 2003). This method
represents each nucleotide by two pseudo-torsion angles (η and θ )
as calculated by the AMIGOS program (Duarte and Pyle, 1998) and
a whole structure as a sequence of η − θ values, which is called
an RNA worm. PRIMOS detects structural differences between
molecules with the same number of nucleotides by comparing their
worms. It is mainly suitable for examining different conformations
of the same molecule and searching structures for a specified continuous motif (by comparing the motif’s worm with every possible
worm segment of the structures).
Both NASSAM and PRIMOS are useful motif searching methods,
but they are unsuitable for detecting new motifs that are not specified
in advance. To partially overcome this limitation, the COMPADRES
method employs PRIMOS’s worm representation and searches structures for new motifs consisting of at least five continuous nucleotides
(Wadley and Pyle, 2004). Specifically, for a given dataset of RNA
structures, COMPADRES generates an RNA worm representation
for the entire dataset by concatenating the worms of all chains. The
resulting worm is then plotted against itself so that the η and θ
values of each nucleotide are compared to those of the other nucleotides. Finally, the plot is scanned and all diagonals with at least
five continuous matches of nucleotides are considered as new motif
candidates. Indeed, COMPADRES has been successfully applied
by the authors to discover new conformationally recurring motifs.
However, an inherent limitation is that the discovered motifs are
restricted to be sequential.
Here, we describe a new method for comparing 3D nucleic acid
structures and identifying a priori unknown common substructures.
The common substructures identified are non-sequential and can be
either large global folds containing hundreds and even thousands of
nucleotides or small local tertiary motifs with at least two successive
base pairs. The method is highly efficient and suitable for searching large 3D structure databases, such as the Protein Data Bank (PDB)
(Berman et al., 2000).
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2 APPROACH
2.1 Problem definition
The input consists of two nucleic acid structures represented by the
3D coordinates of their atoms in PDB format (Berman et al., 2000).
We single out the phosphate atoms as critical points and treat the input
structures as sets of points in 3D space, where each point is associated
with a position of a phosphate atom: $A = \{a_i\}$ and $B = \{b_i\}$ (the
indices are according to the primary structure order).
The task can be formulated as finding a rigid transformation
(rotation and translation) that superimposes the largest number of
phosphate atoms of one structure onto the phosphate atoms of the
other structure within a predefined distance error. More precisely, for
a given $\varepsilon \ge 0$ the goal is to find a 3D rigid transformation $T$ and two
corresponding point sets $A' = \{a'_i\}_{i=1}^{k} \subseteq A$ and $B' = \{b'_i\}_{i=1}^{k} \subseteq B$
of maximal cardinality ($k$) so that the distance between $A'$ and $T(B')$
is at most $\varepsilon$. We use two distance functions: (1) the root mean
square deviation (RMSD), which is defined for $A'$ and $T(B')$ as
$\left(\left(\sum_{i=1}^{k} \|T(b'_i) - a'_i\|^2\right)/k\right)^{0.5}$ (Kaindl and Steipe, 1997) and (2) the
bottleneck matching metric, which is defined for $A'$ and $T(B')$ as
$\max_i \|T(b'_i) - a'_i\|$ (Efrat et al., 2001).
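As a minimal illustration of the two distance functions, the following Python sketch evaluates them for two lists of phosphate coordinates that are assumed to be already matched index by index and already superimposed. The function names `rmsd` and `bottleneck` and the toy coordinates are illustrative assumptions and are not part of ARTS.

```python
import math

def rmsd(a, b):
    """Root mean square deviation between matched 3D point lists a and b."""
    if len(a) != len(b) or not a:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

def bottleneck(a, b):
    """Largest distance between any two matched points."""
    if len(a) != len(b) or not a:
        raise ValueError("point lists must be non-empty and of equal length")
    return max(math.dist(p, q) for p, q in zip(a, b))

# Toy example: the first pair coincides, the second pair is 2 units apart.
A_matched = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
TB_matched = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(rmsd(A_matched, TB_matched), bottleneck(A_matched, TB_matched))
```

For the same correspondence the RMSD can never exceed the bottleneck distance, since a mean of squared distances is at most the largest squared distance.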
The pure geometric formulation of the above biological problem
is known as the largest common point set (LCP) problem in computational geometry (Alt and Guibas, 2000). Although this problem
has been extensively studied, there is neither an exact algorithm nor
an approximate one for the RMSD metric. For the bottleneck matching metric, there is an exact algorithm, but it is impractical since it
requires $O(n^{32.5})$ time, where $n = \max(|A|, |B|)$ (Ambühl et al.,
2000). There is also an approximate algorithm that guarantees finding the LCP with $8\varepsilon$ distance error and cardinality at least as large as
that of the LCP with distance error $\varepsilon$ (Goodrich et al., 1994; Akutsu,
1996). The time complexity of this algorithm is $O(n^{8.5})$. However,
since for nucleic acid structures $n$ can be thousands of nucleotides
(as when comparing two ribosomal structures), this
algorithm might also be impractical.
From a biological standpoint, solving the LCP problem for the
two input nucleic acid structures as if they were pure sets of 3D
points might yield false-positive alignments for large, compact, dense
structures like the ribosome. The reason is that almost any arbitrary
pair of large compact structures can be aligned so that many of their
nucleotides are superimposed, resulting in biologically meaningless
alignments. To filter out these noisy alignments in advance, it is
thus preferable to maximize not only the number of corresponding
nucleotides, but also the number of corresponding base pairs. The
rationale is that more than half of the nucleotides in an average noncoding RNA participate in base-pairing and the double helices that
they form are structurally more conserved than loops (Moore, 1999).
If we denote the sets of base pairs for structures $A$ and $B$ as $A_{bp} =
\{(a_i, a_j) : a_i, a_j \in A, i \neq j\}$ and $B_{bp} = \{(b_s, b_t) : b_s, b_t \in B, s \neq t\}$
respectively, then the task can be formulated as follows: for a given
$\varepsilon \ge 0$ we are interested in finding a 3D rigid transformation $T$, two
corresponding point sets $A' = \{a'_i\}_{i=1}^{k} \subseteq A$ and $B' = \{b'_i\}_{i=1}^{k} \subseteq B$,
and two corresponding sets of $l$ base pairs $A'_{bp} = \{(a_i, a_j) : a_i, a_j \in
A', i \neq j\} \subseteq A_{bp}$ and $B'_{bp} = \{(b_s, b_t) : b_s, b_t \in B', s \neq t\} \subseteq B_{bp}$ so
that the scoring function $F(l, k) = w_1 \cdot k + w_2 \cdot l$ is maximized and the
bottleneck matching distance between $A'$ and $T(B')$ is at most $\varepsilon$. The
weights $w_1$ and $w_2$ have been set to 1 and 2 respectively.
This means that the score for a nucleotide participating in a spatially
conserved base pair is twice the score for an unpaired conserved
nucleotide. Note that when $w_2 = 0$ the task is to find the LCP.
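The scoring function can be checked against the values reported in Table 1. The short Python sketch below uses the weights stated above; the function name `score` is an illustrative assumption.

```python
def score(l, k, w1=1, w2=2):
    """F(l, k) = w1*k + w2*l for k matched nucleotides and l matched base pairs."""
    return w1 * k + w2 * l

# First row of Table 1 (1u6b vs 1y0q): 34 matched base pairs, core size 116.
print(score(l=34, k=116))  # 184, the score reported in Table 1
```

Each matched base pair contributes its two nucleotides to k and one unit to l, so a paired nucleotide is worth (2·w1 + w2)/2 = 2 and an unpaired one is worth w1 = 1.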
Here, we present a heuristic algorithm for the problem stated
above. The theoretical worst case complexity of the algorithm is
$O(n^3)$ and, as demonstrated in the Results section, it yields good
solutions with very short running times.
2.2 Method
The key idea is to construct all the possible hypothetical correspondences between two successive base pairs of structure A and two
successive base pairs of structure B. We call such a correspondence
a seed match. Since the two structures are considered rigid bodies,
each seed match uniquely defines a rigid transformation that superimposes the two base pairs of B onto the corresponding base pairs
of A. We use this transformation to superimpose the whole structure
B onto structure A and extend the seed match by finding additional
coinciding base pairs and unpaired nucleotides. Finally, the matches
are ranked according to the scoring function and the highest scoring
matches are reported.
2.2.1 Seed match construction The set of base pairs (canonical as
well as non-canonical) of each structure is identified in an initial stage
using the 3DNA software (Lu and Olson, 2003): $A_{bp} = \{(a_i, a_j) :
a_i, a_j \in A, i \neq j\}$ and $B_{bp} = \{(b_s, b_t) : b_s, b_t \in B, s \neq t\}$.
Then, we construct the hypothetical seed matches as follows. For
each structure we form the set of all base quadrats, where a base
quadrat is a 4-tuple of phosphate atoms located on two successive
base pairs: $\hat{A} = \{(a_i, a_{i+1}, a_j, a_{j+1}) \mid (a_i, a_{j+1}) \in A_{bp} \wedge (a_{i+1}, a_j) \in A_{bp}\}$
and $\hat{B} = \{(b_s, b_{s+1}, b_t, b_{t+1}) \mid (b_s, b_{t+1}) \in B_{bp} \wedge (b_{s+1}, b_t) \in B_{bp}\}$. A
seed match is then defined as a correspondence between the phosphate atoms of two base quadrats, one in $\hat{A}$ and one in $\hat{B}$, that is
$\{(a_i, b_s), (a_{i+1}, b_{s+1}), (a_j, b_t), (a_{j+1}, b_{t+1})\}$. In fact, we are not interested in all the possible matches defined by $\hat{A} \times \hat{B}$, but only in
seed matches of $\varepsilon$-congruent base quadrats. These are base quadrats
that can be superimposed by a transformation $T$ so that the bottleneck matching distance between them is at most $\varepsilon$:
$\max\{\|a_i - T(b_s)\|, \|a_{i+1} - T(b_{s+1})\|, \|a_j - T(b_t)\|, \|a_{j+1} - T(b_{t+1})\|\} \le \varepsilon$.
Two base quadrats are $\varepsilon$-congruent only if the maximal difference
of their inter-distances is at most $2\varepsilon$. Based on this observation, the
search space is restricted and all the $\varepsilon$-congruent base quadrats are
efficiently detected by hashing. Specifically, each base quadrat is
represented by a rigid motion invariant signature consisting of its six
inter-distances. Then, all the base quadrats of structure $A$ are stored
in a 6D hash table addressed by their signatures. The base quadrats of
structure $B$ are used to query the hash table and for each of them all
the base quadrats stored within radius $2\varepsilon$ of its signature are extracted.
Note that for each base quadrat of $B$ the extracted set of base quadrats
necessarily contains all its $\varepsilon$-congruent base quadrats in $A$.
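The quadrat construction and the signature hashing described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the ARTS implementation: nucleotides are identified by consecutive integer indices, each base pair is an index pair, the hash table is a dictionary keyed by signatures quantized into cells of width 2ε, and "within radius 2ε" is read as every one of the six distances differing by at most 2ε. All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations, product

def base_quadrats(base_pairs):
    """Index 4-tuples (i, i+1, j, j+1) with (i, j+1) and (i+1, j) both paired."""
    pairs = {(min(p), max(p)) for p in base_pairs}
    quadrats = []
    for i, partner in sorted(pairs):
        j = partner - 1
        if i + 1 < j and (i + 1, j) in pairs:
            quadrats.append((i, i + 1, j, j + 1))
    return quadrats

def signature(coords, quadrat):
    """The six pairwise distances of a quadrat; invariant under rigid motion."""
    points = [coords[idx] for idx in quadrat]
    return tuple(math.dist(p, q) for p, q in combinations(points, 2))

def cell_of(sig, width):
    """Quantize a signature into an integer cell of the 6D hash table."""
    return tuple(math.floor(d / width) for d in sig)

def build_table(coords, quadrats, eps):
    """Store the quadrats of structure A under their quantized signatures."""
    table = defaultdict(list)
    for quad in quadrats:
        sig = signature(coords, quad)
        table[cell_of(sig, 2 * eps)].append((quad, sig))
    return table

def query(table, sig, eps):
    """Stored quadrats whose six distances all differ from sig by at most 2*eps."""
    width = 2 * eps
    centre = cell_of(sig, width)
    hits = []
    for offset in product((-1, 0, 1), repeat=6):  # the 3^6 neighbouring cells
        key = tuple(c + o for c, o in zip(centre, offset))
        for quad, other in table.get(key, ()):
            if max(abs(x - y) for x, y in zip(sig, other)) <= width:
                hits.append(quad)
    return hits

# Toy data: structure B is a translated copy of structure A, so the
# signatures of corresponding quadrats agree up to rounding error.
coords_a = [(float(i), float(i * i % 7), float(i % 3)) for i in range(10)]
coords_b = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in coords_a]
pairs = [(0, 9), (1, 8), (2, 7)]
eps = 1.0  # arbitrary tolerance chosen for the example
table = build_table(coords_a, base_quadrats(pairs), eps)
for quad in base_quadrats(pairs):
    print(quad, "->", query(table, signature(coords_b, quad), eps))
```

Because the cell width equals the query radius, any stored signature within 2ε of the query falls in the query's cell or one of its immediate neighbours, so no ε-congruent quadrat is missed; the explicit distance check then discards the false hits from those cells.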
A seed match is constructed between each base quadrat of $B$ and
each base quadrat of $A$ extracted for it from the hash table. Then,
for each seed match we use the least-squares fitting technique to
compute the transformation that aligns the two base quadrats with
minimum RMSD (Kabsch, 1978). Seed matches with RMSD greater
than $\varepsilon$ are ignored, since for them the bottleneck matching distance is
also greater than $\varepsilon$. The least-squares fitting technique works in $O(n)$
time, where $n$ is the size of the given correspondence list. For each
seed match there are four corresponding pairs. Thus, the runtime for
computing its transformation is $O(1)$.
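The filter on seed matches rests on a short inequality, spelled out here for the four matched phosphate pairs of a seed match. The symbol $d_i$ is introduced only for this derivation and denotes the distance between the $i$-th matched pair after applying a transformation $T$.

```latex
\mathrm{RMSD}_T
  = \Bigl(\tfrac{1}{4}\sum_{i=1}^{4} d_i^{2}\Bigr)^{1/2}
  \le \Bigl(\tfrac{1}{4}\cdot 4\,\max_i d_i^{2}\Bigr)^{1/2}
  = \max_i d_i
```

The right-hand side is the bottleneck matching distance under $T$. Since the least-squares fit minimizes the RMSD over all rigid transformations, a minimum RMSD above $\varepsilon$ implies that the bottleneck distance exceeds $\varepsilon$ for every $T$, so the two base quadrats cannot be $\varepsilon$-congruent and the seed match can be discarded safely.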
2.2.2 Global extension In this stage we extend all the seed
matches. Specifically, for a given seed match with a transformation
$T$, the goal is to find two corresponding subsets $A' = \{a'_i\}_{i=1}^{k} \subseteq A$
and $B' = \{b'_i\}_{i=1}^{k} \subseteq B$ with $l$ corresponding base pairs so that the
scoring function $F(l, k) = w_1 \cdot k + w_2 \cdot l$ is maximized and $\forall i :
\|a'_i - T(b'_i)\| \le \varepsilon$. In case one of the weights in the scoring function is
zero (meaning that the goal is to maximize the number of corresponding nucleotides or the number of corresponding base pairs, but not
both), the optimization problem can be solved by an exact algorithm
for finding maximal matching in a bipartite graph (Mehlhorn, 1999;
Akutsu, 1996). For example, if $w_1 = 0$, the bipartite graph is a
graph $G = (A_{bp} \cup B_{bp}, E)$, where $e = ((a_i, a_j), (b_s, b_t)) \in E$ if,
and only if, the bottleneck matching distance between the two base
pairs it connects after applying the transformation is at most $\varepsilon$, that
is $\|a_i - T(b_s)\| \le \varepsilon$ and $\|a_j - T(b_t)\| \le \varepsilon$. The maximal cardinality
matching in this graph is equivalent to the maximal number of base
pairs of $B$ that coincide with base pairs of $A$ after applying $T$. The
time required for constructing the graph and solving the maximal
bipartite matching problem is $O(|G| + \sqrt{|A_{bp}| + |B_{bp}|} \cdot |E|)$. For
$n = \max(|A|, |B|)$, there are $O(n)$ base pairs and, as described
below, the number of possible matches for each base pair of $A$ is
constant. Thus, $|E| = O(n)$ and the total time complexity is $O(n^{1.5})$.
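The exact single-weight case can be illustrated with a small maximum bipartite matching routine. The Python sketch below deliberately swaps in the simple augmenting-path algorithm (often attributed to Kuhn), which runs in O(|V|·|E|) rather than the bound quoted above, because it fits in a few lines. The input format (a dictionary from base pairs of A to the base pairs of B lying within ε of them after applying T) and all names are assumptions made for the illustration.

```python
def max_bipartite_matching(edges):
    """Maximum-cardinality matching in a bipartite graph.

    edges maps each left node (a base pair of A) to the right nodes
    (base pairs of B) it may be matched with.
    Returns a dict mapping each matched right node to its left node.
    """
    match_of_right = {}

    def try_augment(left, visited):
        # Search for an augmenting path that starts at `left`.
        for right in edges.get(left, ()):
            if right in visited:
                continue
            visited.add(right)
            if (right not in match_of_right
                    or try_augment(match_of_right[right], visited)):
                match_of_right[right] = left
                return True
        return False

    for left in edges:
        try_augment(left, set())
    return match_of_right

# Hypothetical candidates: base pair a1 of A may match b1 or b2 of B,
# while a2 may only match b1.
candidates = {"a1": ["b1", "b2"], "a2": ["b1"]}
print(max_bipartite_matching(candidates))
```

In the toy input a1 is first matched to b1 and is then re-routed to b2 when a2 is processed, which is the augmenting step a purely greedy pass would not perform.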
An exact algorithm for maximizing the scoring function in case
the two weights are non-zero is currently unknown. Instead, we use
the following two-stage greedy approach. We place structure $B$ on a
3D grid. Then, for each base pair $(a_i, a_j)$ of $A$, we access the grid and
examine two balls of radius $\varepsilon$ around $a_i$ and $a_j$: $\delta(a_i, \varepsilon)$ and $\delta(a_j, \varepsilon)$.
If there is a base pair $(b_s, b_t)$ of $B$ such that $T(b_s) \in \delta(a_i, \varepsilon)$ and
$T(b_t) \in \delta(a_j, \varepsilon)$, then we match it to $(a_i, a_j)$. In case more than one
base pair of $B$ exists in the two balls, the one with the smallest bottleneck matching distance from $(a_i, a_j)$ is selected. For a small $\varepsilon$, the
number of phosphate atoms in a ball of radius $\varepsilon$ is bounded. Thus,
for each base pair of $A$ we examine a constant number of possible
matches. Since there are $O(n)$ base pairs the overall running time
is $O(n)$. In the next stage, we find matches for all the remaining
unmatched nucleotides of $A$ in a similar way. The runtime complexity
is also $O(n)$. Finally, after extending the seed matches, we refine
their transformations by applying the least-squares fitting technique
(Kabsch, 1978).
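The two-stage greedy extension can be sketched in Python as below. This is an illustrative sketch under several assumptions that the text does not state: coordinates of B are given after the seed transformation has been applied, the grid cell width equals ε, each nucleotide takes part in at most one base pair, and a nucleotide of B is used at most once. The function names and toy coordinates are hypothetical, and the final least-squares refinement is omitted.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bucket point indices by the integer grid cell that contains them."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / cell) for c in p)].append(idx)
    return grid

def within(grid, points, centre, eps):
    """Indices of points at most eps from centre (scans the 27 adjacent cells)."""
    cx, cy, cz = (math.floor(c / eps) for c in centre)
    return [idx
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
            for idx in grid.get((cx + dx, cy + dy, cz + dz), ())
            if math.dist(points[idx], centre) <= eps]

def greedy_extend(a, a_pairs, tb, b_pairs, eps):
    """a: phosphate coordinates of A; tb: coordinates of B after applying T."""
    grid = build_grid(tb, eps)
    partner = {}
    for s, t in b_pairs:
        partner[s], partner[t] = t, s
    match, used, n_pairs = {}, set(), 0
    # Stage 1: match base pairs, keeping the candidate with the smallest
    # bottleneck distance when several base pairs of B fall in the two balls.
    for i, j in a_pairs:
        if i in match or j in match:
            continue
        best = None
        for s in within(grid, tb, a[i], eps):
            t = partner.get(s)
            if t is None or s in used or t in used:
                continue
            d_j = math.dist(a[j], tb[t])
            if d_j > eps:
                continue
            cost = max(math.dist(a[i], tb[s]), d_j)
            if best is None or cost < best[0]:
                best = (cost, s, t)
        if best is not None:
            _, s, t = best
            match[i], match[j] = s, t
            used.update((s, t))
            n_pairs += 1
    # Stage 2: match the remaining nucleotides of A to the nearest free
    # nucleotide of B within eps.
    for i, p in enumerate(a):
        if i in match:
            continue
        free = [s for s in within(grid, tb, p, eps) if s not in used]
        if free:
            s = min(free, key=lambda idx: math.dist(p, tb[idx]))
            match[i] = s
            used.add(s)
    return match, n_pairs

# Toy example with one base pair per structure and eps chosen arbitrarily.
a = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
tb = [(0.5, 0.0, 0.0), (10.2, 0.0, 0.0), (20.1, 0.0, 0.0), (50.0, 0.0, 0.0)]
match, n_pairs = greedy_extend(a, [(0, 1)], tb, [(0, 1)], eps=1.0)
print(match, n_pairs, len(match) + 2 * n_pairs)
```

With a cell width of ε, every point within ε of a query position lies in the query's cell or an adjacent one, which is what keeps the number of candidates per lookup bounded.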
Empirically, this greedy algorithm works quite well. We extensively
tested two versions of it, one with the first stage only for maximizing
the number of matched base pairs and the other with the second stage
only for maximizing the number of matched nucleotides. We compared these two versions to the equivalent exact maximal bipartite
matching algorithms and the results were almost the same, but were
obtained with substantially shorter runtime.
2.2.3 Scoring and ranking The extended matches are sorted by
lexicographic order defined on their score and the RMSD between
the matched phosphate atoms. Since two nucleic acids may share
more than one common substructure, we are interested in reporting
the $k$ top-ranking matches and not only the first one, where $k$ is a
user-defined parameter (ten by default). However, we cannot simply
take the first $k$ matches in the sorted list, since some of the matches
may be redundant. These are matches with approximately the same
transformation and a slightly different core. To report the $k$ non-redundant top-ranking matches, we apply iterative clustering. In each
iteration, we extract the top-ranking match from the sorted list and
add it to a final non-redundant list of top-ranking matches if it differs
from all the current matches in this list. To check if two matches are
different, we apply their transformations on three phosphate points
forming the largest triangle in structure $B$ and compute the RMSD
of the two transformed triangles. If the distance is below $\varepsilon$,
the two matches are considered similar. Since it takes $O(1)$ time to
check if two transformations are similar, the total time required for
ranking and reporting the $k$ non-redundant top-ranking matches from
$m$ initial matches is $O(m \log m + mk)$. This is equal to $O(m \log m)$
for a small $k$.
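The ranking and redundancy filtering can be sketched in Python as follows. The sketch assumes that each match is a tuple (score, RMSD, transformation), that a transformation is any callable mapping a 3D point to a 3D point, that the reference triangle of structure B has been chosen beforehand, and that the lexicographic order means higher score first with ties broken by lower RMSD. These representations and the function names are illustrative assumptions.

```python
import math

def triangle_rmsd(t1, t2, triangle):
    """RMSD between the images of three reference points under t1 and t2."""
    squared = sum(math.dist(t1(p), t2(p)) ** 2 for p in triangle)
    return math.sqrt(squared / len(triangle))

def top_k_non_redundant(matches, triangle, eps, k=10):
    """Keep the k best matches whose transformations differ pairwise.

    matches: iterable of (score, rmsd, transform) tuples.
    """
    ranked = sorted(matches, key=lambda m: (-m[0], m[1]))
    selected = []
    for candidate in ranked:
        # A candidate is redundant if its transformation moves the reference
        # triangle to (almost) the same place as an already selected match.
        if all(triangle_rmsd(candidate[2], kept[2], triangle) >= eps
               for kept in selected):
            selected.append(candidate)
            if len(selected) == k:
                break
    return selected

def shift(dx):
    """Hypothetical stand-in for a rigid transformation: translation along x."""
    return lambda p: (p[0] + dx, p[1], p[2])

triangle = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
matches = [(98, 1.4, shift(0.1)), (100, 1.5, shift(0.0)), (60, 1.2, shift(25.0))]
best = top_k_non_redundant(matches, triangle, eps=1.0)
print([(score, rmsd) for score, rmsd, _ in best])
```

Each candidate is compared with at most k selected matches, which gives the O(mk) term on top of the O(m log m) sort.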
2.3 Complexity
Here, we describe the worst case complexity of the algorithm. Let $n$
be the maximal number of nucleotides in the given nucleic acids, that
is $n = \max(|A|, |B|)$. The number of base pairs participating in stems
for each structure is $O(n)$. Thus, the number of possible base quadrats
for each structure is also $O(n)$. In the worst case, all the base quadrats
of $A$ will be extracted from the hash table for each base quadrat of $B$.
As a result, we will construct, extend and score $O(n^2)$ seed matches.
This takes an overall $O(n^2) \cdot O(1) + O(n^2) \cdot O(n) + O(n^2 \log n) =
O(n^3)$ time. In practice, the runtime complexity for typical RNA
molecules may be lower than the theoretical one due to the usage of
the hash table.
3 RESULTS
Using our method we have conducted an exhaustive all-against-all
comparison of all the 770 RNA 3D structures currently available in
the PDB (Berman et al., 2000). Condor, a distributed job scheduler,
was used to launch this compute-intensive execution on a cluster of
standard PC nodes (Tannenbaum et al., 2001). The size of the PDB
structures ranges from a few nucleotides to thousands of nucleotides
(4–4599 nt). The runtime for a pairwise comparison varied from a few
milliseconds to <40 min on a standard PC workstation (Pentium 4,
2.60 GHz processor with 2 GB RAM). The most time-consuming
executions were between ribosomal structures with thousands of nucleotides. For a pair of average-size RNA structures with hundreds of
nucleotides the runtime was a few seconds. For each structure, the
top-ranking alignments with each of the other structures were sorted
by lexicographic order defined on their score and the RMSD between
the matched phosphate atoms. All the results can be accessed via the
web site. We present below several interesting cases supported in the
literature.
3.1 Self-splicing group I introns
RNA splicing is the process of removing introns from a precursor
RNA transcript and combining the flanking exons into a mature RNA.
While the splicing of most introns is carried out by a large ribonucleoprotein complex called the spliceosome, there are some self-splicing
introns that catalyze both their excision and the exon ligation. Crystal
structures for three self-splicing group I introns are currently available: the Azoarcus pre-tRNA$^{Ile}$ intron with both exons (PDB:1u6b;
Adams et al., 2004), the Tetrahymena pre-rRNA intron (PDB:1x8w;
Guo et al., 2004) and the Twort ribozyme intron (PDB:1y0q; Golden
et al., 2005). Like other group I self-splicing introns, these three
introns possess a single active site in which a two-stage splicing process occurs. In the first stage, an exogenous guanosine binds to the
intron’s active site, attacks the 5′ splice site and cleaves the 5′ exon.
In the second stage, the 3′-terminal guanosine binds to the active site
instead of the exogenous one, its 3′ phosphoryl group is attacked by
the 3′-hydroxyl of the 5′ exon, the two exons are ligated and the
intron is released.
Table 1. Self-splicing group I introns
PDB 1 (size)   PDB 2 (size)   Score   Matched base-pairs   Core size   RMSD   Core base identity   Run time (s)
1u6b (217)     1y0q (230)     184     34                   116         1.40   0.50                 0.36
1u6b (217)     1x8w (964)     113     16                   81          1.68   0.58                 3.66
1x8w (964)     1y0q (230)     94      14                   66          1.65   0.62                 3.82
1x8w (964)     1grz (492)     382     55                   272         1.82   0.81                 6.12
1x8w (964)     1hr2 (313)     180     31                   118         1.62   0.97                 9.16
The top-ranking alignments found by an all-against-all PDB comparison for the three currently available self-splicing group I intron structures (PDB:1u6b, 1y0q and 1x8w). For each
alignment the data appearing in the columns are as follows: (1–2) the PDB code and the number of nucleotides of each structure; (3) the score of the alignment; (4) the number of
base pairs in the core; (5) the number of nucleotides in the core; (6) the RMSD between the matched nucleotides; (7) the fraction of matched nucleotides with the same base
identity and (8) the runtime in seconds. The representations of the alignments are available on the website.
Fig. 1. Self-splicing group I introns. Left: The alignment between the Azoarcus Ile-tRNA intron (PDB:1u6b, red) and the Tetrahymena pre-rRNA intron
(PDB:1x8w, chain A only, blue). The core consists of 81 nt with an RMSD of 1.68 Å (depicted in yellow) and contains the guanosine-binding site (green).
Right: Zoom in on the spatially conserved guanosine-binding site. This G-site consists of four adjacent base-triples, that is (A263-G312-C262, G264-C311-G414,
A265-U310-A261, C266-G309-A306) in the Tetrahymena structure superimposed onto (A129-C178-G128, G130-C177-G206, A131-U176-A127,
C132-G175-A172) in the Azoarcus structure. The G-site’s backbone is depicted in green, where the bases are in red and blue respectively. The backbone of the
rest of the core is depicted in yellow and the backbone of the structurally diverse region is in gray. This figure and all subsequent figures were prepared using
PyMOL (DeLano, 2002, http://www.pymol.org).
Despite the fact that the catalytic states of the three introns are different, ARTS has indicated that the 3D structure of their catalytic core
is conserved. Furthermore, in the all-against-all PDB comparison
the first top-ranking structure for PDB:1u6b was PDB:1y0q and the
second one was PDB:1x8w. The results for PDB:1y0q were similar,
that is PDB:1u6b was ranked first and PDB:1x8w was ranked second.
For PDB:1x8w the first two top-ranking alignments were with other
Tetrahymena group I intron structures (PDB:1grz and 1hr2). See
Table 1.
Figure 1 presents the structural alignment between the Azoarcus
and the Tetrahymena introns. The core region consists of 81 nt
with an RMSD of 1.68 Å and 0.37 base identity. The spatially
conserved nucleotides are located on helices P3, P4, P6 and P7
that form a molecular cleft in which the catalytic site is located.
In particular, the guanosine-binding site, including the 3′-terminal
guanosine, was found to be spatially conserved, that is, the four
adjacent base-triples (A263-G312-C262, G264-C311-G414, A265-U310-A261, C266-G309-A306) in the Tetrahymena structure were
superimposed onto (A129-C178-G128, G130-C177-G206, A131-U176-A127, C132-G175-A172) in the Azoarcus structure, where
the 3′-terminal guanosine nucleotides are G414 and G206. These
observations are consistent with those of Adams et al. (2004)
and Guo et al. (2004). The alignments between the other introns are
available on the web site.
Fig. 2. Ribosomal protein S8–RNA complexes. Left: The alignment between the spc operon mRNA bound to protein S8 (PDB:1s03, 47 nt, red) and the small
ribosomal subunit (PDB:1n33, 1523 nt, only 16S rRNA and protein S8 are displayed in gray and blue respectively). The core consists of 45 nt with an RMSD
of 1.54 Å (depicted in yellow). Right: Zoom in on the spatially conserved binding site. The binding site’s backbone is depicted in green, where the bases of the
four interacting nucleotides, that is (A12, A14, C15, G37) of the spc mRNA superimposed on nucleotides (A640, A642, C643, G597) of the 16S rRNA, are in
red and blue, respectively.
3.2 Ribosomal protein S8–RNA complexes
In Escherichia coli, ribosomal protein S8 not only plays a key role
in the small ribosomal subunit assembly through its interaction with
16S rRNA, but also regulates its own synthesis and the synthesis of other
ribosomal proteins by binding to the spc operon mRNA (Cerretti
et al., 1988). A crystal structure for E.coli S8 bound to the spc operon
mRNA is available (PDB:1s03, 47 nt; Merianos et al., 2004) and in
our all-against-all PDB comparison the top-ranking alignment for
this structure was with the Thermus thermophilus small ribosomal
subunit (PDB:1n33, 1523 nt). The runtime for obtaining this alignment was 10 s and according to it the spc mRNA is structurally
similar to Helix 21 of the 16S rRNA (45 spatially conserved nucleotides with an RMSD of 1.54 Å), despite their low sequence identity
(0.42 base identity). Figure 2 shows the alignment. One can see that
protein S8 binds to the two RNAs in a similar manner. Specifically,
the two binding sites are superimposed, that is nucleotides A12-G17, C34-U38 of the spc mRNA are superimposed on nucleotides
A640-C645, G594-U598 of the 16S rRNA, where the interacting nucleotides are (A12, A14, C15, G37) and (A640, A642, C643, G597)
(Merianos et al., 2004). This alignment reaffirms the theory that the
synthesis of ribosomal proteins is regulated by a competition between
rRNA and mRNA for some of the ribosomal proteins.
3.3 Ribonuclease P RNAs
Ribonuclease P (RNase P) is an enzyme that plays a key role in
the biosynthesis of transfer RNAs (tRNAs). It is an endonuclease
that removes the 5′ precursor sequences of tRNAs and generates
their mature 5′ terminus. In bacteria, RNase P is a ribonucleoprotein
consisting of a large RNA and a small protein, where the RNA subunit
(RNase P RNA) is the catalytic component (Pace and Brown, 1995;
Chen and Pace, 1997). Based on comparative sequence analysis,
the bacterial RNase P RNAs are classified into two major types, A
and B. Despite the significant difference in their secondary structure,
both types consist of two independent folding domains: a specificity
domain (S-domain) and a catalytic domain (C-domain). The S-domain is responsible for binding the pre-tRNA substrate,
while the C-domain contains the active site (Loria and Pan, 1996).
Table 2. Ribonuclease P RNAs
PDB 1 (size)   PDB 2 (size)   Score   Matched base-pairs   Core size   RMSD   Core base identity   Run time (s)
1u9s (155)     1njn (2765)    66      9                    48          1.75   0.38                 29.56
1u9s (155)     1hnx (1511)    66      8                    50          2.02   0.28                 7.30
1u9s (155)     1hr2 (313)     59      7                    45          1.85   0.29                 0.57
1u9s (155)     1nbs (270)     53      6                    41          1.80   0.56                 0.33
1nbs (270)     1nwx (2883)    85      8                    69          2.06   0.30                 26.12
1nbs (270)     1ibk (1510)    76      5                    66          2.20   0.27                 12.32
1nbs (270)     1u6b (217)     70      9                    52          1.65   0.37                 0.48
1nbs (270)     1foq (545)     61      10                   41          1.75   0.32                 1.90
1nbs (270)     1d4r (83)      57      11                   35          1.63   0.34                 0.13
1nbs (270)     1gax (150)     55      10                   35          1.54   0.34                 0.19
1nbs (270)     1m5k (222)     55      8                    39          1.83   0.31                 0.60
The highest scoring non-redundant alignments detected by an all-against-all PDB comparison for both the A-type (PDB:1u9s) and B-type (PDB:1nbs) RNase P RNAs. For each
alignment the data appearing in the columns are as follows: (1–2) the PDB code and the number of nucleotides of each structure; (3) the score of the alignment; (4) the number of
base pairs in the core; (5) the number of nucleotides in the core; (6) the RMSD between the matched nucleotides; (7) the fraction of matched nucleotides with the same base
identity and (8) the runtime in seconds. The representations of alignments are available on the web site.
Recently, the 3D structure of the specificity domain has been
determined both for the A-type T.thermophilus (PDB:1u9s) and for
the B-type B.subtilis (PDB:1nbs) RNase P RNAs (Krasilnikov et al.,
2003, 2004). Despite the different overall fold of these two structures, ARTS has indicated that they share a common substructure.
Figure 3 shows their alignment. The core consists of 41 nt with
an RMSD of 1.80 Å and 0.56 base identity. Most of the spatially
conserved nucleotides are located in the four-way junction (especially on stems P7, P10 and P11) and the non-helical module formed
by J11/12 and J12/11, near the pre-tRNA binding site. This result
agrees with the comprehensive structural analysis of Chen and Pace
(1997). Moreover, in the 3D alignment obtained by ARTS the main
interacting nucleotides of the two S-domains with the pre-tRNA substrate, A108-A172-A226 in T.thermophilus and A130-G220-A230
in B.subtilis, are located in similar positions, as reported by Chen and
Pace (1997). However, the triplet A108-A172-A226 is superimposed
onto the triplet A130-A221-A230, that is, A172 is superimposed onto
A221 and not onto G220. This suggests a slightly different interacting
triplet for the B.subtilis structure that shares the same 3D configuration and base identity as the interacting triplet of the T.thermophilus
structure.
In the all-against-all PDB comparison, the alignment between
the two RNase P specificity domains was ranked below some
other interesting alignments with different structures. Specifically,
the top-ranking alignments for PDB:1u9s were with the large and
small ribosomal subunits (e.g. PDB:1njn and PDB:1hnx respectively) and with Tetrahymena group I introns (e.g. PDB:1hr2). For
PDB:1nbs the top-ranking alignments were with the large and small
ribosomal subunits (e.g. PDB:1nwx and PDB:1ibk respectively), with
group I introns (e.g. PDB:1u6b), with a pentameric model of the
bacteriophage φ29 prohead RNA (PDB:1foq), with a 29-mer fragment
of human signal recognition particle RNA (PDB:1d4r), with tRNA
(e.g. PDB:1gax) and with a hairpin ribozyme (PDB:1m5k). For
details see Table 2 and the website.
Fig. 3. Ribonuclease P (RNase P) RNAs. The alignment between the
specificity domains of the A-type T.thermophilus (PDB:1u9s, blue) and the
B-type B.subtilis (PDB:1nbs, chain B, red) RNase P RNAs. The core consists
of 41 nt with an RMSD of 1.80 Å (depicted in yellow). The phosphates of
the main interacting nucleotides of the two S-domains with the pre-tRNA
substrate are depicted as spheres, where the triplet 1u9s:A108-A172-A226 is
superimposed onto the triplet 1nbs:A130-A221-A230.
4 CONCLUSION
We have described a novel method for aligning nucleic acid tertiary
structures and detecting their common substructures. The common
substructures identified can be large global folds with hundreds and
even thousands of nucleotides (as when comparing two
ribosomal structures) as well as small local motifs with two consecutive base pairs. The method is highly efficient and fully automated;
a typical comparison of two nucleic acids takes a few seconds
on a standard PC. The method was used to conduct an all-against-all comparison of all the RNA 3D structures currently available
in the PDB. All the alignments can be accessed via the web site.
Here, we have presented some interesting biological cases that are
supported by the literature. A thorough analysis of the results is
still required. This includes filtering redundant alignments, defining the expect (E) value and compiling a dataset of known and
novel tertiary motifs. Such an analysis can aid in discovering and
classifying RNA folds and tertiary motifs. It may shed new light
on the evolutionary relationship between RNAs and reveal their
functional properties. Possible future extensions for the alignment
method include considering more than the phosphorus positions as
critical points, considering sequence requirements and the identity
of nucleotides and/or base pairs in the scoring function, dealing with
RNA flexibility and aligning multiple nucleic acid structures instead
of only a pair.
ACKNOWLEDGEMENTS
We thank Maxim Shatsky for stimulating discussions. The research
of H.J.W. is partially supported by the Hermann Minkowski-Minerva
Center for Geometry at TAU. The research of R.N. has been funded
in whole or in part with Federal funds from the NCI, NIH, under
contract number NO1-CO-12400. The content of this publication
does not necessarily reflect the view or policies of the Department
of Health and Human Services, nor does mention of trade names,
commercial products, or organization imply endorsement by the US
Government.
The publisher or recipient acknowledges right of the US
Government to retain a nonexclusive, royalty-free license in and to
any copyright covering the article.
Conflict of Interest: none declared.
REFERENCES
Adams,P.L. et al. (2004) Crystal structure of a self-splicing group I intron with both
exons. Nature, 430, 45–50.
Akutsu,T. (1996) Protein structure alignment using dynamic programming and iterative
improvement. IEICE Trans. Inform. Syst., E79D, 1629–1636.
Akutsu,T. (2000) Dynamic programming algorithms for RNA secondary structure
prediction with pseudoknots. Discrete Appl. Math., 104, 45–62.
Alt,H. and Guibas,L.J. (2000) Discrete geometric shapes: matching, interpolation,
and approximation. In Sack,J.-R. and Urrutia,J. (eds), Handbook of Computational
Geometry. Elsevier, pp. 121–153.
Ambühl,C., Chakraborty,S. and Gärtner,B. (2000) Computing largest common point
sets under approximate congruence. In Proceedings of the 8th Annual European
Symposium on Algorithms (ESA), LNCS 1879, 52–63. Springer Verlag.
Batey,R. et al. (1999) Tertiary motifs in RNA structure and folding. Angew. Chem. Int.
Ed. Engl., 38, 2326–2343.
Berman,H. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Cerretti,D.P. et al. (1988) Translational regulation of the spc operon in Escherichia coli:
identification and structural analysis of the target site for S8 repressor protein. J. Mol.
Biol., 204, 309–325.
Chen,J.L. and Pace,N.R. (1997) Identification of the universally conserved core of
ribonuclease P RNA. RNA, 3, 557–560.
DeLano,W. (2002) The PyMOL molecular graphics system. DeLano Scientific, San
Carlos, CA.
Doudna,J.A. (2000) Structural genomics of RNA. Nat. Struct. Biol., 7 (suppl.),
954–956.
Doudna,J.A. and Cech,T.R. (2002) The chemical repertoire of natural ribozymes.
Nature, 418, 222–228.
Duarte,C.M. and Pyle,A.M. (1998) Stepping through an RNA structure: a novel approach
to conformational analysis. J. Mol. Biol., 284, 1465–1478.
Duarte,C.M. et al. (2003) RNA structure comparison, motif search and discovery using
a reduced representation of RNA conformational space. Nucleic Acids Res., 31,
4755–4761.
Eddy,S.R. (2001) Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet.,
2, 919–929.
Efrat,A. et al. (2001) Geometry helps in bottleneck matching and related problems.
Algorithmica, 31, 1–28.
Gan,H.H. et al. (2003) Exploring the repertoire of RNA secondary motifs using
graph theory with implications for RNA design. Nucleic Acids Res., 31,
2926–2943.
Gendron,P. et al. (2001) Quantitative analysis of nucleic acid three-dimensional
structures. J. Mol. Biol., 308, 919–936.
Golden,B.L. et al. (2005) Crystal structure of a phage Twort group I ribozyme-product
complex. Nat. Struct. Mol. Biol., 12, 82–89.
Goodrich,M.T., Mitchell,J.S.B. and Orletsky,M.W. (1994) Practical methods for approximate geometric pattern matching under rigid motions: (preliminary version).
In Proceedings of the 10th Annual Symposium on Computational Geometry,
pp. 103–112. ACM Press, Stony Brook, New York, United States.
Guo,F. et al. (2004) Structure of the Tetrahymena ribozyme: base triple sandwich and
metal ion at the active site. Mol. Cell, 16, 351–362.
Harrison,A.-M. et al. (2003) Representation, searching and discovery of patterns of bases in complex RNA structures. J. Comput. Aided Mol. Des., 17,
537–549.
Hofacker,I.L. et al. (1994) Fast folding and comparison of RNA secondary structures
(the Vienna RNA package). Monatsh. Chem., 125, 167–188.
Kabsch,W. (1978) A discussion of the solution for the best rotation to relate two sets of
vectors. Acta Crystallogr., A34, 827–828.
Kaindl,K. and Steipe,B. (1997) Metric properties of the root-mean-square deviation of
vector sets. Acta Crystallogr., A53, 809.
Krasilnikov,A.S. et al. (2003) Crystal structure of the specificity domain of ribonuclease P. Nature, 421, 760–764.
Krasilnikov,A.S. et al. (2004) Basis for structural diversity in homologous RNAs.
Science, 306, 104–107.
Loria,A. and Pan,T. (1996) Domain structure of the ribozyme from eubacterial
ribonuclease P. RNA, 2, 551–563.
Lu,X.-J. and Olson,W.K. (1999) Resolving the discrepancies among nucleic acid
conformational analyses. J. Mol. Biol., 285, 1563–1575.
Lu,X.-J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding
and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res.,
31, 5108–5121.
Lu,X.-J. et al. (1999) Overview of nucleic acid analysis programs. J. Biomol. Struct.
Dyn., 16, 833–843.
Mehlhorn,K. (1999) The LEDA Platform of Combinatorial and Geometric Computing.
Cambridge University Press, UK.
Merianos,H.J. et al. (2004) The structure of a ribosomal protein S8/spc operon mRNA
complex. RNA, 10, 954–964.
Moore,P.B. (1999) Structural motifs in RNA. Annu. Rev. Biochem., 68, 287–300.
Pace,N.R. and Brown,J.W. (1995) Evolutionary perspective on the structure and function
of ribonuclease P, a ribozyme. J. Bacteriol., 177, 1919–1928.
Storz,G. (2002) An expanding universe of noncoding RNAs. Science, 296,
1260–1263.
Tamura,M. et al. (2004) SCOR: structural classification of RNA, version 2.0. Nucleic
Acids Res., 32, D182–D184.
Tannenbaum,T., Wright,D., Miller,K. and Livny,M. (2001) Condor—a distributed
job scheduler. In Sterling,T. (ed.), Beowulf Cluster Computing with Linux.
MIT Press.
Wadley,L.M. and Pyle,A.M. (2004) The identification of novel RNA structural motifs
using COMPADRES: an automated approach to structural discovery. Nucleic Acids
Res., 32, 6650–6659.
Yang,H. et al. (2003) Tools for the automatic identification and classification of RNA
base pairs. Nucleic Acids Res., 31, 3450–3460.
Zuker,M., Mathews,D. and Turner,D. (1999) Algorithms and thermodynamics for
RNA secondary structure prediction: a practical guide. In RNA Biochemistry and
Biotechnology. NATO ASI Series, Kluwer Academic Publishers, pp. 11–43.