Vol. 21 Suppl. 2 2005, pages ii47–ii53 doi:10.1093/bioinformatics/bti1108 BIOINFORMATICS Protein Structure and Function ARTS: alignment of RNA tertiary structures Oranit Dror1,∗ , Ruth Nussinov2,3 and Haim Wolfson1 1 School of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel, 2 Sackler Institute of Molecular Medicine, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel and 3 Basic Research Program, SAIC-Frederick, Inc., Laboratory of Experimental and Computational Biology, Building 469, Room 151, Frederick, MD 21702, USA ABSTRACT Motivation: A fast growing number of non-coding RNAs have recently been discovered to play essential roles in many cellular processes. Similar to proteins, understanding the functions of these active RNAs requires methods for analyzing their tertiary structures. However, in contrast to the wide range of structure-based approaches available for proteins, there is still a lack of methods for studying RNA structures. Results: We present a new computational method named ARTS (alignment of RNA tertiary structures). The method compares two nucleic acid structures (RNAs or DNAs) and detects a-priori unknown common substructures. These substructures can be either large global folds containing hundreds and even thousands of nucleotides or small local tertiary motifs with at least two successive base pairs. To the best of our knowledge, this is the first method of this type. The method is highly-efficient and was used to conduct an all-against-all comparison of all the RNA structures currently available in the Protein Data Bank. Availability: The program, a web-server and supplementary information are available on http://bioinfo3d.cs.tau.ac.il/ARTS Contact: [email protected] 1 INTRODUCTION There is a rapid growing evidence that RNA molecules are not solely carriers of genetic information, but key players in a wide range of cellular processes, such as protein synthesis and transport, RNA processing and splicing, and chromosome replication (Storz, 2002; Eddy, 2001; Doudna and Cech, 2002). Similar to proteins, the function of these non-coding RNAs is inferred from the 3D structures that they form, rather than their primary sequence per se. Thus, methods for analyzing RNA structures, comparing them with one another and identifying new motifs and folds are of utmost importance for deciphering their biological activities (Doudna, 2000). In the absence of RNA tertiary structures, many computational methods for structure prediction and analysis of RNAs have been developed to work at the secondary structure level, that is the level of base pairing (Hofacker et al., 1994; Zuker et al., 1999; Akutsu, 2000; Gan et al., 2003). Although such methods provide excellent starting points for exploring RNA structures, their inherent limitation is that they are incapable of predicting tertiary motifs. These motifs mainly involve interactions between secondary structure elements, and are key players in establishing the global fold of an RNA (Batey et al., 1999; Moore, 1999; Tamura et al., 2004). ∗ To whom correspondence should be addressed. In the past few years both the number and size of solved RNA tertiary structures has dramatically increased. This has stimulated the development of computational methods for 3D structural analysis of RNAs (Duarte and Pyle, 1998; Lu and Olson, 1999; Lu et al., 1999; Gendron et al., 2001; Yang et al., 2003; Lu and Olson, 2003). Nevertheless, only a few RNA structural comparison methods are currently available. The NASSAM method is a graph theoretic method that searches nucleic acid structures for a given 3D pattern. It represents each base by two vectors and a whole nucleic acid structure as a labeled graph, where the nodes are the vector representations of the bases and the edges are the distances between them. After representing the given 3D pattern and the structure to be searched as graphs, the exponential-time Ullman algorithm for subgraph isomorphism is used to locate the pattern within the structure (Harrison et al., 2003). Another method is PRIMOS (Duarte et al., 2003). This method represents each nucleotide by two pseudo-torsion angles (η and θ ) as calculated by the AMIGOS program (Duarte and Pyle, 1998) and a whole structure as a sequence of η − θ values, which is called an RNA worm. PRIMOS detects structural differences between molecules with the same number of nucleotides by comparing their worms. It is mainly suitable for examining different conformations of the same molecule and searching structures for a specified continuous motif (by comparing the motif’s worm with every possible worm segment of the structures). Both NASSAM and PRIMOS are useful motif searching methods, but they are unsuitable for detecting new motifs that are not specified in advance. To partially overcome this limitation, the COMPADRES method employs PRIMOS’s worm representation and searches structures for new motifs consisting of at least five continuous nucleotides (Wadley and Pyle, 2004). Specifically, for a given dataset of RNA structures, COMPADRES generates an RNA worm representation for the entire dataset by concatenating the worms of all chains. The resulting worm is then plotted against itself so that the η and θ values of each nucleotide are compared to those of the other nucleotides. Finally, the plot is scanned and all diagonals with at least five continuous matches of nucleotides are considered as new motif candidates. Indeed, COMPADRES has been successfully applied by the authors to discover new conformationally recurring motifs. However, an inherent limitation is that the discovered motifs are restricted to be sequential. Here, we describe a new method for comparing 3D nucleic acid structures and identifying a-priori unknown common substructures. The common substructures identified are non-sequential and can be either large global folds containing hundreds and even thousands of © The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] ii47 O.Dror et al. nucleotides or small local tertiary motifs with at least two successive base pairs. The method is highly-efficient and suitable for searching large 3D structure databases, like the protein data bank (PDB) (Berman et al., 2000). 2 APPROACH 2.1 Problem definition The input consists of two nucleic acid structures represented by the 3D coordinates of their atoms in PDB format (Berman et al., 2000). We single out the phosphate atoms as critical points and treat the input structures as sets of points in 3D space, where each point is associated with a position of a phosphate atom: A = {ai } and B = {bi } (the indices are according to the primary structure order). The task can be formulated as finding a rigid transformation (rotation and translation) that superimposes the largest number of phosphate atoms of one structure onto the phosphate atoms of the other structure within a predefined distance error. More precisely, for a given ≥ 0 the goal is to find a 3D rigid transformation T and two corresponding point sets A = {ai }ki=1 ⊆ A and B = {bi }ki=1 ⊆ B of maximal cardinality (k) so that the distance between A and T (B ) is at most . We use two distance functions: (1) the root mean square deviation (RMSD), which is defined for A and T (B ) as (( ki=1 T (bi ) − ai 2 )/k)0.5 (Kaindl and Steipe, 1997) and (2) the bottleneck matching metric, which is defined for A and T (B ) as maxi T (bi ) − ai (Efrat et al., 2001). The pure geometric formulation of the above biological problem is known as the largest common point set (LCP) problem in computational geometry (Alt and Guibas, 2000). Although this problem has been extensively studied, there is neither an exact algorithm nor an approximate one for the RMSD metric. For the bottleneck matching metric, there is an exact algorithm, but it is impractical since it requires O(n32.5 ) time, where n = max(|A|, |B|) (Ambühl et al., 2000). There is also an approximate algorithm that guarantees finding the LCP with 8 distance error and cardinality at least as large as that of the LCP with distance error (Goodrich et al., 1994; Akutsu, 1996). The time complexity of this algorithm is O(n8.5 ). However, since for nucleic acid structures n can be thousands of nucleotides (like in the case when comparing two ribosomal structures), this algorithm might also be impractical. From a biological standpoint, solving the LCP problem for the two input nucleic acid structures as if they were pure sets of 3D points might yield false-positive alignments for compact large dense structures like the ribosome. The reason is that almost any arbitrary pair of large compact structures can be aligned so that many of their nucleotides are superimposed, resulting in biologically meaningless alignments. To filter out beforehand these noisy alignments, it is thus preferable to maximize not only the number of corresponding nucleotides, but also the number of corresponding base pairs. The rationale is that more than half of the nucleotides in an average noncoding RNA participate in base-pairing and the double helices that they form are structurally more conserved than loops (Moore, 1999). If we denote the sets of base pairs for structures A and B as Abp = {(ai , aj ) : ai , aj ∈ A, i = j } and Bbp = {(bs , bt ) : bs , bt ∈ B, s = t} respectively, then the task can be formulated as follows: for a given ≥ 0 we are interested in finding a 3D rigid transformation T , two corresponding point sets A = {ai }ki=1 ⊆ A and B = {bi }ki=1 ⊆ B, and two corresponding sets of l base pairs Abp = {(ai , aj ) : ai , aj ∈ = {(bs , bt ) : bs , bt ∈ B , s = t} ⊆ Bbp so A , i = j } ⊆ Abp and Bbp ii48 that the scoring function F (l, k) = w1 ·k+w2 ·l is maximized and the bottleneck matching distance between A and T (B ) is at most . The weights w1 and w2 have been determined to be 1 and 2 respectively. This means that the score for a nucleotide participating in a spatially conserved base pairing is twice the score for an unpaired conserved nucleotide. Note that when w2 = 0 the task is to find the LCP. Here, we present a heuristic algorithm for the problem stated above. The theoretical worst case complexity of the algorithm is O(n3 ) and, as demonstrated in the Results section, it yields good solutions with very short running times. 2.2 Method The key idea is to construct all the possible hypothetical correspondences between two successive base pairs of structure A and two successive base pairs of structure B. We call such a correspondence a seed match. Since the two structures are considered rigid bodies, each seed match uniquely defines a rigid transformation that superimposes the two base pairs of B onto the corresponding base pairs of A. We use this transformation to superimpose the whole structure B onto structure A and extend the seed match by finding additional coinciding base pairs and unpaired nucleotides. Finally, the matches are ranked according to the scoring function and the highest scoring matches are reported. 2.2.1 Seed match construction The set of base pairs (canonical as well as non-canonical) of each structure is identified in an initial stage using the 3DNA software (Lu and Olson, 2003): Abp = {(ai , aj ) : ai , aj ∈ A, i = j } and Bbp = {(bs , bt ) : bs , bt ∈ B, s = t}. Then, we construct the hypothetical seed matches as follows. For each structure we form the set of all base quadrats, where a base quadrat is a 4-tuple of phosphate atoms located on two successive base pairs:  = {(ai , ai+1 , aj , aj +1 ) | (ai , aj +1 ) ∧ (ai+1 , aj ) ∈ Abp } and B̂ = {(bs , bs+1 , bt , bt+1 ) | (bs , bt+1 ) ∧ (bs+1 , bt ) ∈ Bbp }. A seed match is then defined as a correspondence between the phosphate atoms of two base quadrats, one in  and one in B̂, that is {(ai , bs ), (ai+1 , bs+1 ), (aj , bt ), (aj +1 , bt+1 )}. In fact, we are not interested in all the possible matches defined by Â × B̂, but only in seed matches of -congruent base quadrats. These are base quadrats that can be superimposed by a transformation T so that the bottleneck matching distance between them is at most : max{ai − T (bs ), ai+1 − T (bs+1 ), aj − T (bt ), aj +1 − T (bt+1 )} ≤ . Two base quadrats are -congruent only if the maximal difference of their inter-distances is at most 2. Based on this observation, the search space is restricted and all the -congruent base quadrats are efficiently detected by hashing. Specifically, each base quadrat is represented by a rigid motion invariant signature consisting of its six inter-distances. Then, all the base quadrats of structure A are stored in a 6D hash table addressed by their signatures. The base quadrats of structure B are used to query the hash table and for each of them all the base quadrats stored within radius 2 of its signature are extracted. Note that for each base quadrat of B the extracted set of base quadrats necessarily contains all its -congruent base quadrats in A. A seed match is constructed between each base quadrat of B and each base quadrat of A extracted for it from the hash table. Then, for each seed match we use the least-squares fitting technique to compute the transformation that aligns the two base quadrats with minimum RMSD (Kabsch, 1978). Seed matches with RMSD greater than are ignored, since for them the bottleneck matching distance is also greater than . The least-squares fitting technique works in O(n) ARTS: Alignment of RNA tertiary structures time, where n is the size of the given corresponding list. For each seed match there are four corresponding pairs. Thus, the runtime for computing its transformation is O(1). 2.2.2 Global extension In this stage we extend all the seed matches. Specifically, for a given seed match with a transformation T , the goal is to find two corresponding subsets A = {ai }ki=1 ⊆ A and B = {bi }ki=1 ⊆ B with l corresponding base pairs so that the scoring function F (l, k) = w1 · k + w2 · l is maximized and ∀i : ai −T (bi ) ≤ . In case one of the weights in the scoring function is zero (meaning that the goal is to maximize the number of corresponding nucleotides or the number of corresponding base pairs, but not both), the optimization problem can be solved by an exact algorithm for finding maximal matching in a bipartite graph (Mehlhorn, 1999; Akutsu, 1996). For example, if w1 = 0, the bipartite graph is a graph G = (Abp ∪ Bbp , E), where e = ((ai , aj ), (bs , bt )) ∈ E if, and only if, the bottleneck matching distance between the two base pairs it connects after applying the transformation is at most , that is ai − T (bs ) and aj − T (bt ) ≤ . The maximal cardinality matching in this graph is equivalent to the maximal number of base pairs of B that coincide with base pairs of A after applying T . The time required for constructing the graphand solving the maximal bipartite matching problem is O(|G| + |Abp | + |Bbp | · |E|). For n = max(|A|, |B|), there are O(n) base pairs and, as described below, the number of possible matches for each base pair of A is constant. Thus, E = O(n) and the total time complexity is O(n1.5 ). An exact algorithm for maximizing the scoring function in case the two weights are non-zero is currently unknown. Instead, we use the following two-stage greedy approach. We place structure B on a 3D grid. Then, for each base pair (ai , aj ) of A, we access the grid and examine two balls of radius from ai and aj : δ(ai , ) and δ(aj , ). If there is a base pair (bs , bt ) of B such that T (bs ) ∈ δ(ai , ) and T (bt ) ∈ δ(aj , ), then we match it to (ai , aj ). In case more than one base pair of B exists in the two balls, the one with the smallest bottleneck matching distance from (ai , aj ) is selected. For a small , the number of phosphate atoms in a ball of radius is bounded. Thus, for each base pair of A we examine a constant number of possible matches. Since there are O(n) base pairs the overall running time is O(n). In the next stage, we find matches for all the remaining unmatched nucleotides of A in a similar way. The runtime complexity is also O(n). Finally, after extending the seed matches, we refine their transformations by applying the least-squares fitting technique (Kabsch, 1978). Empirically, this greedy algorithm works quite well. We massively tested two versions of it, one with the first stage only for maximizing the number of matched base pairs and the other with the second stage only for maximizing the number of matched nucleotides. We compared these two versions to the equivalent exact maximal bipartite matching algorithms and the results were almost the same, but were obtained with substantially shorter runtime. 2.2.3 Scoring and ranking The extended matches are sorted by lexicographic order defined on their score and the RMSD between the matched phosphate atoms. Since two nucleic acids may share more than one common substructure, we are interested in reporting the k top-ranking matches and not only the first one, where k is a user defined parameter (ten by default). However, we cannot simply take the first k matches in the sorted list, since some of the matches may be redundant. These are matches with approximately the same transformation and a slightly different core. To report the k nonredundant top-ranking matches, we apply iterative clustering. In each iteration, we extract the top-ranking match from the sorted list and add it to a final non-redundant list of top-ranking matches if it differs from all the current matches in this list. To check if two matches are different, we apply their transformations on three phosphate points forming the largest triangle in structure B and compute the RMSD distance of the two transformed triangles. If the distance is below , the two matches are considered similar. Since it takes O(1) time to check if two transformations are similar, the total time required for ranking and reporting the k non-redundant top-ranking matches from m initial matches is O(m log m + mk). This is equal to O(m log m) for a small k. 2.3 Complexity Here, we describe the worst case complexity of the algorithm. Let n be the maximal number of nucleotides in the given nucleic acids, that is n = max(|A|, |B|). The number of base pairs participating in stems for each structure is O(n). Thus, the number of possible base quadrats for each structure is also O(n). In the worst case, all the base quadrats of A will be extracted from the hash table for each base quadrat of B. As a result, we will construct, extend and score O(n2 ) seed matches. This takes an overall O(n2 ) · O(1) + O(n2 ) · O(n) + O(n2 log n) = O(n3 ) time. In practice, the runtime complexity for typical RNA molecules may be lower than the theoretical one due to the usage of the hash table. 3 RESULTS Using our method we have conducted an exhaustive all-against-all comparison of all the 770 RNA 3D structures currently available in the PDB (Berman et al., 2000). Condor, a distributed job schedule, was used to launch this compute-intensive execution on a cluster of standard PC nodes (Tannenbaum et al., 2001). The size of the PDB structures ranges from a few nucleotides to thousands of nucleotides (4–4599 nt). The runtime for a pairwise comparison varied from a few milliseconds to <40 min on a standard PC workstation (Pentium© 4, 2.60 GHz processor with 2 GB RAM). The most time-consuming executions were between ribosomal structures with thousands of nucleotides. For a pair of average-size RNA structures with hundreds of nucleotides the runtime was a few seconds. For each structure, the 1st top-ranking alignments with all the other structures were sorted by lexicographic order defined on their score and the RMSD between the matched phosphate atoms. All the results can be accessed via the web site. We present below several interesting cases supported in the literature. 3.1 Self-splicing group I introns RNA splicing is the process of removing introns from a precursor RNA transcript and combining the flanking exons into a mature RNA. While the splicing of most introns is carried out by a large ribonucleoprotein complex called the spliceosome, there are some self-splicing introns that catalyze both their excision and the exon ligation. Crystal structures for three self-splicing group I introns are currently available: the Azoarcus pre-tRNAI le intron with both exons (PDB:1u6b; Adams et al., 2004), the Tetrahymena pre-rRNA intron (PDB:1x8w; Guo et al., 2004) and the Twort ribozyme intron (PDB:1y0q; Golden et al., 2005). Like other group I self-splicing introns, these three introns possess a single active site in which a two-stage splicing process occurs. In the first stage, an exogenous guanosine binds to the ii49 O.Dror et al. Table 1. Self-splicing group I introns PDB 1 (size) PDB 2 (size) Score Matched base-pairs Core size RMSD Core base identity Run time (s) 1u6b (217) 1u6b (217) 1x8w (964) 1x8w (964) 1x8w (964) 1y0q (230) 1x8w (964) 1y0q (230) 1grz (492) 1hr2 (313) 184 113 94 382 180 34 16 14 55 31 116 81 66 272 118 1.40 1.68 1.65 1.82 1.62 0.50 0.58 0.62 0.81 0.97 00.36 03.66 03.82 06.12 09.16 The top-ranking alignments found by an all-against-all PDB comparison for the three currently available self-splicing group I intron structures (PDB:1u6b, 1y0q and 1x8w). For each alignment the data appearing in the columns are as follows: (1–2) the PDB code and the number of nucleotides of each structure; (3) the score of the alignment; (4) the number of base pairs in the core; (5) the number of nucleotides in the core; (6) the RMSD distance between the matched nucleotides; (7) the number of matched nucleotides with the same base identity and (8) the runtime in seconds. The representations of the alignments are available on the website. Fig. 1. Self-splicing group I introns. Left: The alignment between the Azoarcus Ile-tRNA intron (PDB:1u6b, red) and the Tetrahymena pre-rRNA intron (PDB:1x8w, chain A only, blue). The core consists of 81 nt with an RMSD of 1.68 Å (depicted in yellow) and contains the guanosine-binding site (green). Right: Zoom in on the spatially conserved guanosine-binding site. This G-site consists of four adjacent base-triples, that is (A263-G312-C262, G264-C311G414, A265-U310-A261, C266-G309-A306) in the Tetrahymena structure superimposed onto (A129-C178-G128, G130-C177-G206, A131-U176-A127, C132-G175-A172) in the Azoarcus structure. The G-site’s backbone is depicted in green, where the bases are in red and blue respectively. The backbone of the rest of the core is depicted in yellow and the backbone of the structurally diverse region is in gray. This figure and all subsequent figures were prepared using PyMOL (DeLano, 2002, http://www.pymol.org). intron’s active site, attacks the 5 splice site and cleaves the 5 exon. In the second stage, the 3 -terminal guanosine binds to the active site instead of the exogenous one, its 3 phosphoryl group is attacked by the 3 - hydroxyl of the 5 exon, the two exons are ligated and the intron is released. Despite the fact that the catalytic states of the three introns are different, ARTS has indicated that the 3D structure of their catalytic core is conserved. Furthermore, in the all-against-all PDB comparison ii50 the first top-ranking structure for PDB:1u6b was PDB:1y0q and the second one was PDB:1x8w. The results for PDB:1y0q were similar, that is PDB:1u6b was ranked first and PDB:1x8w was ranked second. For PDB:1x8w the first two top-ranking alignments were with other Tetrahymena group I intron structures (PDB:1grz and 1hr2). See Table 1. Figure 1 presents the structural alignment between the Azoarcus and the Tetrahymena introns. The core region consists of 81 nt ARTS: Alignment of RNA tertiary structures Fig. 2. Ribosomal protein S8–RNA complexes. Left: The alignment between the spc operon mRNA bound to protein S8 (PDB:1s03, 47 nt, red) and the small ribosomal subunit (PDB:1n33, 1523 nt, only 16S rRNA and protein S8 are displayed in gray and blue respectively). The core consists of 45 nt with an RMSD of 1.54 Å (depicted in yellow). Right: Zoom in on the spatially conserved binding site. The binding site’s backbone is depicted in green, where the bases of the four interacting nucleotides, that is (A12, A14, C15, G37) of the spc mRNA superimposed on nucleotides (A640, A642, C643, G597) of the 16S rRNA, are in red and blue, respectively. with an RMSD of 1.68 Å and 0.37 base identity. The spatially conserved nucleotides are located on helices P3, P4, P6 and P7 that form a molecular cleft in which the catalytic site is located. In particular, the guanosine-binding site, including the 3 -terminal guanosine, was found to be spatially conserved, that is, the four adjacent base-triples (A263-G312-C262, G264-C311-G414, A265U310-A261, C266-G309-A306) in the Tetrahymena structure were superimposed onto (A129-C178-G128, G130-C177-G206, A131U176-A127, C132-G175-A172) in the Azoarcus structure, where the 3 -terminal guanosine nucleotides are G414 and G206. These observations are consistent with the ones of Adams et al. (2004) and Guo et al. (2004). The alignments between the other introns are available on the web site. 3.2 Ribosomal protein S8–RNA complexes In Escherichia coli, ribosomal protein S8 does not only play a key role in the small ribosomal subunit assembly through its interaction with 16S rRNA, but also regulates its synthesis and the synthesis of other ribosomal proteins by binding to the spc operon mRNA (Cerretti et al., 1988). A crystal structure for E.coli S8 bound to the spc operon mRNA is available (PDB:1s03, 47 nt; Merianos et al., 2004) and in our all-against-all PDB comparison the top-ranking alignment for this structure was with the Thermus thermophilus small ribosomal subunit (PDB:1n33, 1523 nt). The runtime for obtaining this alignment was 10 s and according to it the spc mRNA is structurally similar to Helix 21 of the 16S rRNA (45 spatially conserved nucleotides with an RMSD of 1.54 Å), despite their low sequence identity (0.42 base identity). Figure 2 shows the alignment. One can see that protein S8 binds to the two RNAs in a similar manner. Specifically, the two binding sites are superimposed, that is nucleotides A12G17, C34-U38 of the spc mRNA are superimposed on nucleotides A640-C645,G594-U598 of the 16S rRNA, where the interacting nucleotides are (A12, A14, C15, G37) and (A640, A642, C643, G597) (Merianos et al., 2004). This alignment reaffirms the theory that the synthesis of ribosomal proteins is regulated by a competition between rRNA and mRNA for some of the ribosomal proteins. 3.3 Ribonuclease P RNAs Ribonuclease P (RNase P) is an enzyme that plays a key role in the biosynthesis of transfer RNAs (tRNAs). It is an endonuclease that removes the 5 precursor sequences of tRNAs and generates their mature 5 terminus. In bacteria, RNase P is a ribonucleoprotein ii51 O.Dror et al. Table 2. Ribonuclease P RNAs PDB 1 (size) PDB 2 (Size) Score Matched base-pairs Core size RMSD Core base identity Run time (s) 1u9s (155) 1u9s (155) 1u9s (155) 1u9s (155) 1nbs (270) 1nbs (270) 1nbs (270) 1nbs (270) 1nbs (270) 1nbs (270) 1nbs (270) 1njn (2765) 1hnx (1511) 1hr2 (313) 1nbs (270) 1nwx (2883) 1ibk (1510) 1u6b (217) 1foq (545) 1d4r (83) 1gax (150) 1m5k (222) 66 66 59 53 85 76 70 61 57 55 55 9 8 7 6 8 5 9 10 11 10 8 48 50 45 41 69 66 52 41 35 35 39 1.75 2.02 1.85 1.80 2.06 2.20 1.65 1.75 1.63 1.54 1.83 0.38 0.28 0.29 0.56 0.30 0.27 0.37 0.32 0.34 0.34 0.31 29.56 07.30 00.57 00.33 26.12 12.32 00.48 01.90 00.13 00.19 00.60 The highest scoring non-redundant alignments detected by an all-against-all PDB comparison for both the A-type (PDB:1u9s) and B-type (PDB:1nbs) RNase P RNAs. For each alignment the data appearing in the columns are as follows: (1–2) the PDB code and the number of nucleotides of each structure; (3) the score of the alignment; (4) the number of base pairs in the core; (5) the number of nucleotides in the core; (6) the RMSD distance between the matched nucleotides; (7) the number of matched nucleotides with the same base identity and (8) the runtime in seconds. The representations of alignments are available on the web site. consisting of a large RNA and a small protein, where the RNA subunit (RNase P RNA) is the catalytic component (Pace and Brown, 1995; Chen and Pace, 1997). Based on comparative sequence analysis, the bacterial RNase P RNAs are classified into two major types, A and B. Despite the significant difference in their secondary structure, both types consist of two independent folding domains: a specificity domain (S-domain) and a catalytic domain (C-domain). The specificity domain is responsible for binding the pre-tRNA substrate, while the C-domain contains the active site (Loria and Pan, 1996). Recently, the 3D structure of the specificity domain has been determined both for the A-type T.thermophilus (PDB:1u9s) and for the B-type B.subtilis (PDB:1nbs) RNase P RNAs (Krasilnikov et al., 2003, 2004). Despite the different overall fold of these two structures, ARTS has indicated that they share a common substructure. Figure 3 shows their alignment. The core consists of 41 nt with an RMSD of 1.80 Å and 0.56 base identity. Most of the spatially conserved nucleotides are located in the four-way junction (especially on stems P7, P10 and P11) and the non-helical module formed by J11/12 and J12/11, near the pre-tRNA binding site. This result agrees with the comprehensive structural analysis of Chen and Pace (1997). Moreover, in the 3D alignment obtained by ARTS the main interaction nucleotides of the two S-domains with the pre-tRNA substrate, A108-A172-A226 in T.thermophilus and A130-G220-A230 in B.subtilis, are located in similar positions, as reported by Chen and Pace (1997). However, the triplet A108-A172-A226 is superimposed onto the triplet A130-A221-A230, that is, A172 is superimposed onto A221 and not onto G220. This suggests a slightly different interacting triplet for the B.subtilis structure that shares the same 3D configuration and base identity as the interacting triplet of the T.thermophilus structure. In the all-against-all PDB comparison, the alignment between the two RNase P specificity domains was ranked below some other interesting alignments with different structures. Specifically, the top-ranking alignments for PDB:1u9s were with the large and small ribosomal subunits (e.g. PDB:1njn and PDB:1hnx respectively) and with Tetrahymena group I introns (e.g. PDB:1hr2). For PDB:1nbs the top-ranking alignments were with the large and small ribosomal subunits (e.g. PDB:1nwx and PDB:ibk respectively), with group I introns (e.g. PDB:1u6b), with a pentameric model of the ii52 Fig. 3. Ribonuclease P (RNase P) RNAs. The alignment between the specificity domains of the A-type T.thermophilus (PDB:1u9s, blue) and the B-type B.subtilis (PDB:1nbs, chain B, red) RNase P RNAs. The core consists of 41 nt with an RMSD of 1.80 Å (depicted in yellow). The phosphates of the main interacting nucleotides of the two S-domains with the pre-tRNA substrate are depicted as spheres, where the triplet 1u9s:A108-A172-A226 is superimposed onto the triplet 1nbs:A130-A221-A230. bacteriophage φ29 prohead RNA (PDB:1foq), with 29mer fragment of human signal recognition particle RNA (PDB: 1d4r), with tRNA (e.g. PDB:1gax) and with a hairpin ribozyme (PDB:1m5k). For details see Table 2 and the website. ARTS: Alignment of RNA tertiary structures 4 CONCLUSION We have described a novel method for aligning nucleic acid tertiary structures and detecting their common substructures. The common substructures identified can be large global folds with hundreds and even thousands of nucleotides (like in the case when comparing two ribosomal structures) as well as small local motifs with two consecutive base-pairs. The method is highly efficient and fully automated, where a typical comparison of two nucleic acids takes a few seconds on a standard PC. The method was used to conduct an all-againstall comparison of all the RNA 3D structures currently available in the PDB. All the alignments can be accessed via the web site. Here, we have presented some interesting biological cases that are supported by the literature. A thorough analysis of the results is still required. This includes filtering redundant alignments, defining the expect (E) value and compiling a dataset of known and novel tertiary motifs. Such an analysis can aid in discovering and classifying RNA folds and tertiary motifs. It may shed new light on the evolutionary relationship between RNAs and reveal their functional properties. Possible future extensions for the alignment method include considering more than the phosphorus positions as critical points, considering sequence requirements and the identity of nucleotides and/or base pairs in the scoring function, dealing with RNA flexibility and aligning multiple nucleic acid structures instead of only a pair. ACKNOWLEDGEMENTS We thank Maxim Shatsky for stimulating discussions. The research of H.J.W. is partially supported by the Hermann Minkowski-Minerva Center for Geometry at TAU. The research of R.N. has been funded in whole or in part with Federal funds from the NCI, NIH, under contract number NO1-CO-12400. The content of this publication does not necessarily reflect the view or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organization imply endorsement by the US Government. The publisher or recipient acknowledges right of the US Government to retain a nonexclusive, royalty-free license in and to any copyright covering the article. Conflict of Interest: none declared. REFERENCES Adams,P.L. et al. (2004) Crystal structure of a self-splicing group I intron with both exons. Nature, 430, 45–50. Akutsu,T. (1996) Protein structure alignment using dynamic programming and iterative improvement. IEICE Trans. Inform. Syst., E79D, 1629–1636. Akutsu,T. (2000) Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Appl. Math., 104, 45–62. Alt,H. and Guibas,L.J. (2000) Discrete geometric shapes: matching, interpolation, and approximation. In Sack,J.-R. and Urrutia,J. (eds), Handbook of Computational Geometry. Elsevier, pp. 121–153. Ambühl,C., Chakraborty,S. and Gärtner,B. (2000) Computing largest common point sets under approximate congruence. In Proceedings of the 8th Annual European Symposium on Algorithms (ESA), LNCS 1879, 52–63. Springer Verlag. Batey,R. et al. (1999) Tertiary motifs in RNA structure and folding. Angew. Chem. Int. Ed. Engl., 38, 2326–2343. Berman,H. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242. Cerretti,D.P. et al. (1988) Translational regulation of the spc operon in Escherichia coli: identification and structural analysis of the target site for S8 repressor protein. J. Mol. Biol., 204, 309–325. Chen,J.L. and Pace,N.R. (1997) Identification of the universally conserved core of ribonuclease P RNA. RNA, 3, 557–560. DeLano,W. (2002) The PyMOL molecular graphics system. DeLano Scientific, San Carlos, CA. Doudna,J.A. (2000) Structural genomics of RNA. Nat. Struct. Biol., 7 (suppl.), 954–956. Doudna,J.A. and Cech,T.R. (2002) The chemical repertoire of natural ribozymes. Nature, 418, 222–228. Duarte,C.M. and Pyle,A.M. (1998) Stepping through an RNA structure: a novel approach to conformational analysis. J. Mol. Biol., 284, 1465–1478. Duarte,C.M. et al. (2003) RNA structure comparison, motif search and discovery using a reduced representation of RNA conformational space. Nucleic Acids Res., 31, 4755–4761. Eddy,S.R. (2001) Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet., 2, 919–929. Efrat,A. et al. (2001) Geometry helps in bottleneck matching and related problems. Algorithmica, 31, 1–28. Gan,H.H. et al. (2003) Exploring the repertoire of RNA secondary motifs using graph theory with implications for RNA design. Nucleic Acids Res., 31, 2926–2943. Gendron,P. et al. (2001) Quantitative analysis of nucleic acid three-dimensional structures. J. Mol. Biol., 308, 919–936. Golden,B.L. et al. (2005) Crystal structure of a phage Twort group I ribozyme-product complex. Nat. Struct. Mol. Biol., 12, 82–89. Goodrich,M.T., Mitchell,J.S.B. and Orletsky,M.W. (1994) Practical methods for approximate geometric pattern matching under rigid motions: (preliminary version). In Proceedings of the 10th Annual Symposium on Computational Geometry, pp. 103–112. ACM Press, Stony Brook, New York, United States. Guo,F. et al. (2004) Structure of the Tetrahymena ribozyme: base triple sandwich and metal ion at the active site. Mol. Cell, 16, 351–362. Harrison,A.-M. et al. (2003) Representation, searching and discovery of patterns of bases in complex RNA structures. J. Comput. Aided Mol. Des., 17, 537–549. Hofacker,I.L. et al. (1994) Fast folding and comparison of RNA secondary structures (the Vienna RNA package). Monatsh. Chem., 125, 167–188. Kabsch,W. (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr., A 34, 827–828. Kaindl,K. and Steipe,B. (1997) Metric properties of the root-mean-square deviation of vector sets. Acta Crystallogr., A53, 809. Krasilnikov,A.S. et al. (2003) Crystal structure of the specificity domain of ribonuclease P. Nature, 13, 760–764. Krasilnikov,A.S. et al. (2004) Basis for structural diversity in homologous RNAs. Science, 1, 104–107. Loria,A. and Pan,T. (1996) Domain structure of the ribozyme from eubacterial ribonuclease p. RNA, 2, 551–563. Lu,X.-J. and Olson,W.K. (1999) Resolving the discrepancies among nucleic acid conformational analyses. J. Mol. Biol., 285, 1563–1575. Lu,X.-J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res., 31, 5108–5121. Lu,X.-J. et al. (1999) Overview of nucleic acid analysis programs. J. Biomol. Struct. Dyn., 16, 833–843. Mehlhorn,K. (1999) The LEDA Platform of Combinatorial and Geometric Computing. Cambridge University Press, UK. Merianos,H.J. et al. (2004) The structure of a ribosomal protein S8/spc operon mRNA complex. RNA, 10, 954–964. Moore,P.B. (1999) Structural motifs in RNA. Annu. Rev. Biochem., 68, 287–300. Pace,N.R. and Brown,J.W. (1995) Evolutionary perspective on the structure and function of ribonuclease P, a ribozyme. J. Bacteriol., 177, 1919–1928. Storz,G. (2002) An expanding universe of noncoding RNAs. Science, 296, 1260–1263. Tamura,M. et al. (2004) SCOR: structural classification of RNA, version 2.0. Nucleic Acids Res., 32, D182–D184. Tannenbaum,T., Wright,D., Miller,K. and Livny,M. (2001) Condor—a distributed job scheduler. In Sterling,T. (ed.), Beowulf Cluster Computing with Linux. MIT Press. Wadley,L.M. and Pyle,A.M. (2004) The identification of novel RNA structural motifs using COMPADRES: an automated approach to structural discovery. Nucleic Acids Res., 32, 6650–6659. Yang,H. et al. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res., 31, 3450–3460. Zuker,M., Mathews,D. and Turner,D. (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In RNA Biochemistry and Biotechnology. NATO ASI Series, Kluwer Academic Publishers, pp. 11–43. ii53
© Copyright 2026 Paperzz