Vol. 00 no. 00 2011 Pages 1–7 BIOINFORMATICS An efficient RNA pairwise structure comparison by SETTER method. David Hoksza 1,2,∗ and Daniel Svozil 2,∗ 1 SIRET Research Group, Department of Software Engineering, FMP, Charles University in Prague, Czech Republic 2 Laboratory of Informatics and Chemistry, Institute of Chemical Technology Prague, Czech Republic Received on XXXXX; revised on XXXXX; accepted on XXXXX Associate Editor: XXXXXXX ABSTRACT Motivation: Understanding the architecture and function of RNA including the large molecules such as ribosomes requires methods for comparing and analyzing their tertiary and quarternary structures. While structural alignment of short RNAs is achievable in a reasonable amount of time, large structures represent much bigger challenge. However, the growth of structures of large RNAs deposited in the PDB database calls for the development of fast and accurate methods for analyzing their structures, as well as for rapid similarity searches in databases. Results: In this article a novel algorithm for RNA structural comparison SETTER (SEcondary sTructure-based TERtiary Structure Similarity Algorithm) is introduced. SETTER utilizes pairwise comparison method based on 3D similarity of the so-called general secondary structure units (GSSU) resembling secondary structure motifs. The presented algorithm is both accurate and very fast, and does not impose limits on the size of aligned RNA structures. Availability: The SETTER program, as well as all datasets, are freely available from http://siret.cz/hoksza/projects/setter/. Contact: [email protected], [email protected] Supplementary information: Supplementary information is available at Bioinformatics online. 1 INTRODUCTION Though the main biological function of RNA is the transfer of genetic information, the evidence shows that RNA molecules also play key roles in variety of cellular processes (Eddy, 2001). RNA shows, among others, enzymatic activity in ribozymes (Scott, 2007), takes part in transcription regulation (Bartel, 2004) or is involved in chromatin modeling (Kelley and Kuroda, 2000). RNA 3D structure is hierarchical (Tinoco, 1999), and can be divided into primary, secondary, tertiary and quaternary levels. RNA secondary structure motifs (Hendrix et al., 2005) can be defined as double helices interconnected by various types of loop structures, and are stable independently of their 3D folds. Tertiary interactions (Holbrook, 2008) stabilize the overall arrangement and packing of double helices in large RNA structures. The first resolved RNA crystal structure was that for the yeast phenylalanine tRNA (Kim et al., 1974). This achievement was followed by solving structures of other naturally occuring tRNAs (Arnez and Steitz, 1994), as well as of a variety of oligomeric RNA model structures (Holbrook et al., 1991). With the improvements in molecular biological methods (Kim et al., 1995) and in crystallographic techniques (Garman, 2003) the size of solved RNA structures has increased later dramatically. Between the largest RNAs the prominent place is occupied by the structures of small and large ribosomal subunits of different organisms (Ban et al., 2000; Wimberly et al., 2000). Large RNAs1 allowed for the first time detailed studies of RNA structure elements not found in smaller molecules, such as, e.g., continuous interhelical base stacking, RNA domain structure, and helical packing. The function of an RNA molecule is largely determined by its 3D structure that is typically more evolutionary conserved than sequence. Thus, methods for comparative RNA function annotation based on the structural similarity usually yield much better results than sequence based approaches. Though detecting optimal structural similarity between two biomolecules at the tertiary level has been shown to be NP-hard (Kolodny and Linial, 2004), the development of automatic tools capable of efficient and accurate RNA structural alignment and comparison has become an important part of structural bioinformatics. The RNA structural alignment software should be able to work not only with small molecules, but also with large RNAs to allow researchers to study in detail how combinations of basic building blocks result in tertiary and quarternary structures. To be computationally tractable, currently available software tools for comparing two RNA 3D structures, such as ARTS (Dror et al., 2005, 2006), DIAL (Ferrè et al., 2007), iPARTS (Wang et al., 2010), SARA (Capriotti and Marti-Renom, 2008, 2009) or SARSA (Chang et al., 2008) are thus based on heuristic approaches. ARTS (Dror et al., 2005, 2006) detects maximum common substructure using backbone phosphate atoms between two RNA 3D structures, and its theoretical worst complexity is O(N3 ). ARTS is a good tool for detecting RNA structural motifs, however, due to its cubic time complexity, it is not practical for comparison of large RNA molecules. To overcome this problem DIAL server using dynamic programming algorithm 1 ∗ to whom correspondence should be addressed c Oxford University Press 2011. An arbitrary limit of 100 residues is used to define large RNAs (Holbrook, 2008). 1 D. Hoksza and D. Svozil and running in quadratic time was developed (Ferrè et al., 2007). DIAL alignment algorithm is based on the torsion and/or pseudotorsion similarity, sequence similarity, and base-pairing similarity, and it provides access to global, local and semi-global structural alignments. The improvement in speed over DIAL algorithm was later brought by PARTS (Chang et al., 2008), an algorithm based on the so-called structural alphabet (SA). Structural alphabet is an emerging concept in the structural biology of proteins, in which a protein structure is represented as a limited series of ”letters” each assigned to the well-characterized conformation (de Brevern et al., 2000; Camproux et al., 2004). PARTS uses the vector quantization approach to derive an RNA structural alphabet of 23 letters representing the most common backbone conformations. The RNA structures are then reduced to 1D sequences of SA letters, and these are aligned utilizing classical methods for pairwise and multiple sequence alignments. A new set of SA letters was derived for the improved version of PARTS called iPARTS (Wang et al., 2010). iPARTS outperforms its previous version PARTS, as well as (in some aspects) another highly efficient algorithm SARA (Capriotti and Marti-Renom, 2008, 2009). SARA program represents distances among selected atoms as unit vectors existing on the unit spheres (Chew et al., 1999). All-to-all unit-vector RMSD distances of consecutive unit spheres are computed and used as scoring matrix for subsequent dynamic programming based global alignment. In the presented paper a novel pairwise RNA comparison method SETTER (SEcondary sTructure-based TERtiary Structure Similarity Algorithm) is proposed. The method divides the whole RNA structure into the set of non-overlapping generalized secondary structure units (GSSUs), and the distance measure based on RMSD transformations between all possible pairs of GSSUs is utilized to get the resulting RNA structure comparison. The comparison algorithm scales as O(n2 ) with the size of GSSU, and as O(n) with the number of GSSUs in the structure. The segmentation to GSSUs offers the advantage of an exemplary runtimes even for the largest structures. In these only the number of GSSUs increases while their sizes remain practically constant irrespective of the size of the structure. In addition, higher speed of the algorithm does not compromise its accuracy as it is demonstrated by comparing with SARA and iPARTS approaches. 2 For the purpose of the algorithm, each nucleotide in GSSU is represented by its C40 atom, and Watson-Crick (WC) hydrogen bonds are identified using 3DNA (Lu and Olson, 2008). The process iteratively applies following two steps for each GSSU present in the RNA structure. The RNA structure is processed in sequence order and in the first step nucleotides are stored on a stack. This process stops by encountering a nucleotide nti WC bonded with a nucleotide ntj already in the stack. Then, in the second step, a new GSSU G starts to be formed from the pair {nti , ntj } (i.e., the neck) and from all non-hydrogen bonded nucleotides found between nti and ntj (i.e., the loop). Finally, the stem is added to the GSSU by identifying all nucleotides found between the neck and the first nucleotide WC bonded to the residue not stored in the stack. By repeating these two steps, the algorithm iteratively searches for GSSUs, and it stops when the end of the sequence is reached. The GSSU’s extraction algorithm is described in detail in Supplementary Information. Note that even a structure without a single Watson-Crick pair has a GSSU which is identical with the structure itself. 2.2 Comparing Two GSSUs GSSU pairwise comparison lies in the heart of the method. It is employed even when SETTER compares structures consisting of multiple GSSUs. Each GSSU is represented by an ordered set of 3D coordinates enhanced with bonding and nucleotide/atom type information. The common way how to assess similarity between two sets of points is to define a pairing between them. The sets are then superposed by finding translation and rotation of one structure over the other minimizing the mutual distances of the respective paired points. Usually, the root mean square deviation (RMSD) is used as the distance measure, and two structures can be superposed given a pairing/alignment in polynomial time (Kabsch, 1976). However, finding the optimal alignment (which in turns minimizes the distance measure) is an NP-hard problem (Kolodny and Linial, 2004). The optimal solution can be found by exhaustive search, which is, however, computationally not feasible. This problem can be resolved by identifying suboptimal alignments that will likely participate in the optimal alignment. This is the principle idea behind SETTER’s structure comparison process. SETTER generates a set of short alignments which quality is evaluated by Kabsch (Kabsch, 1976) RMSD algorithm. Working with relatively short alignments allows to superpose even the largest RNA structures in a reasonable amount of time. To superpose two GSSUs means to match their loops which in turn implies matching their necks (see Fig. 1). To define the superposition unambiguously in three dimensional space at least three pairs of points are needed. A set of these points is called triplet, and the alignment is formed by matching triplets U METHODS 2.1 A GSSU Identification GSSU usually resembles a hairpin motif possibly containing bulges and/or internal loops in its stem part. Three important elements are recognized in the GSSU: loop, neck, and stem (see Fig. 1). The formal description of the GSSU is given by the following definition. D EFINITION 1. Let R be an RNA structure with nucleotide sequence {ni }n i=1 and let WC ⊂ R denotes the subset of ni participating in a Watson-Crick base pair. By generalized secondary structure unit (GSSU) 2 2 G, we understand a pair of substrings of R, {ni }ii=i and {nj }jj=j 1 1 (i1 ≤ i2 < j1 ≤ j2 , i2 = j1 − 1) of maximum lengths such that each nucleotide nx ∈ G: • i1 ≤ x ≤ i2 : nx ∈ / WC or nx is paired with ny where j1 ≤ y ≤ j2 • j1 ≤ x ≤ j2 : nx ∈ / WC or nx is paired with ny where i1 ≤ y ≤ i2 2 Let imax and jmin be the highest/lowest indexes of the Watson-Crick jmin −1 paired bases in G. We define loop as L = {ni }i=i ⊂ G, stem as max +1 G \ L and neck as the pair {nimax , njmin }. A C A 5' A A C G 3' U 3 1 A C neck G G C G A A C A 2 G C C U G stem C U A loop Fig. 1: Three GSSUs extracted from an RNA structure. The sequence starts at the 5’ end and the numbers denote order of GSSU generation. SETTER 3 G C A 52 922 1 U U G C C A G 2 3AG 916 U C C A 916 G C G C A rotation translation C 911 52 G U 47 U G A G G C G A U 47 U G U A UA 1 CG 922 CG C G U A C G 2 CG911 G Fig. 2: Alignment of GSSU from tRNA domain of transfermessenger RNA (PDB code 1P6V) with GSSU from glutamine tRNA (PDB code 1EXD). The final structural alignment is defined by three nucleotide pairs forming a triplet (the lines 1, 2, and 3). To find the optimal superposition for the given neck pairs (lines 1 and 2), the position of the middle pair is varied (line 3). of a GSSU B it is not clear in which direction the loops are oriented in B A B 3D space (whether the correct alignment is {{nA 1 , n1 }, {n2 , n2 }} or B A B A {{n1 , n2 }, {n2 , n1 }}), and therefore both possibilities are investigated. These tweakings are necessary for accurate GSSU comparison, however they slightly increase the running time of SETTER. Though in most cases the GSSU consists of a stem and a loop, it is not a strict rule, as demonstrated by GSSU number 3 in Fig. 1. Two particular situations can occur — GSSU has a zero-sized loop or the RNA does not have a single hydrogen bond (i.e. it does not have a secondary structure at all). In case of GSSU without the loop the third nucleotide for triplet alignment is selected from the stem and its position within the stem is varied to get the lowest S-distance of the superposed GSSUs. When dealing with a GSSU having no secondary structure, several triplets covering whole structure are formed and used as a basis for the alignment. Such a simplified comparison might not lead to the best possible result but since SETTER is intended to compare mostly large structures the probability of encountering a GSSU with no secondary structure is relatively small. 2.3 between two given structures. SETTER first aligns necks and then tries to identify the final pair in the triplet by aligning each possible pair of loops’ nucleotides. For example, if two GSSUs with loops consisting of n and m nucleotides are to be aligned, n × m alignments are generated (see Fig. 2). For each alignment, a rotation matrix and a translation vector defining optimal superposition of the triplets is calculated, and though these are optimal for given triplet only, they are subsequently used to superpose the whole GSSUs. This possible inaccuracy is a trade-off for efficiency. After the superposition, a 3D nearest neighbors are found for each nucleotide in the aligned GSSUs, and their distances are added to the overall distance of the two GSSUs. Finally, the distance is normalized resulting in a measure referred to as S-distance. The whole process is formalized by Eq. 1. N Nζ (x, G) = γ(G A , G B ) = min1≤i≤|G| {d(x, Gi )} × ζ min1≤i≤|G| {d(x, Gi )} |G A | X i=1 1 0 if x = Gimin otherwise if N N1 (G A i , G B ) ≤ otherwise A δ(G A , G B ) = min t∈T A B A B n |G X| o N Nα (G A i , τ (G B , t))} i=1 S(R , R ) = S(G , G ) = δ(G A ,G B ) ||G A |−|G A || × (1 + min {|G A |,|G B |} ) min {|G A |,|G B |} γ(G A , τ (G B , topt )) (1) where G A (found in an RNA structure RA ) and G B (found in an RNA structure RB ) represent the GSSUs to be compared, Gi stands for i-th nucleotide in the nucleotide sequence of G and |G| for its length. N N (x, G) is the Euclidean distance from the nucleotide x to its nearest neighbor in G. If x and its nearest neighbor share the same nucleotide type, the distance is modified by factor ζ. δ computes the raw distance - T is a set of transpositions resulting from the candidate triplet alignments and τ (G, t) transposes GSSU G using the transposition t. The S-distance is then normalized by function γ counting the number of nearest neighbors within the distance after the optimal transposition topt . Since hydrogen bonds are identified using simple geometric criteria, their detection sometimes may not be correct. This leads to the shift of the neck position within a GSSU. Therefore, SETTER tries to simulate the neck shift by aligning also the residues next to (under) the necks. Moreover, A B B when aligning the neck {nA 1 , n2 } of a GSSU A with the neck {n1 , n2 } Comparing More Than Two GSSUs In case when one of the structures contains more than one GSSU the following three-step multiple GSSU comparison process is implemented (see Fig. 3): 1. Perform all-to-all pairwise GSSU comparisons. 2. Few best GSSU pairs (with low S-distances) are used as seeds for alignment of the rest of the GSSUs. 3. The structures’ GSSUs are aligned (each treated as a whole) and the multiple GSSU alignment with the lowest aggregated S-distance is declared as the overall pairwise structure similarity score. Let us compare structures RA and RB having the distance between them denoted as S. Each GSSU from RA is compared to each GSSU from RB , but only top κ pairs with the minimum distance is processed further. For each of the κ selected pairs {GiA , GjB } the value of S is set to S(GiA , GjB ). In the second step, S-distances of the neighboring GSSU A , GB } pairs are iteratively added to the S. For the GSSU pair {Gi+1 j+1 B A the value of S(Gi+1 , Gj+1 ) and the penalty for the rotation needed to transform the structures from the state corresponding to S(GiA , GjB ) to the A , G B ) are added to S. This process goes state corresponding to S(Gi+1 j+1 from {i + 1, j + 1} until either i + 1 or j + 1 reaches the number of GSSUs A B in R or R . Similarly, the other ends of the structures need to be aligned and so the process is repeated for {i − 1, j − 1}. However, the case when GSSUs in the RNA structure are oriented in opposite direction must also be considered, and another κ alignments must be performed aligning i − 1 residues with j + 1 residues (not shown in Fig. 3). The rotation between two GSSUs imposes a penalty to the S. This penalty is calculated as a distance between the rotation matrices defining the two consequent GSSU superpositions (see Fig. 3c). The penalty for translation is, on the other hand, not included explicitly, as it is already present in the pairwise GSSU S-distance where the difference in GSSUs’ sizes is used for normalization (see section 2.2). This approach to multiple GSSU comparison allows to match any pair of structures including the largest ones in a reasonable amount of time (section 2.6). 2.4 Early termination The nearest neighbor search, which is part of the S-distance computation, has O(n2 ) time complexity with respect to the GSSU’s length n. In addition, the search has to be performed for each of the candidate alignments noticeably decreasing efficiency of SETTER. To increase the algorithm’s speed a simple early termination condition was implemented into the SETTER’s GSSU comparison process. Alignments that are not likely to be the part of the optimal superposition are identified, and for these alignments 3 D. Hoksza and D. Svozil shows that S-distance follows a right-skewed distribution. The right-skewed distributions are commonly fitted with one of the following families: the Weibull distribution, the gamma distribution and its special case the chisquare distribution, the log-normal distribution, the power log-normal distribution and the Gumbel extreme value distribution. To show how well data follow the fitted distribution visual inspection of quantile-quantile plots (QQ-plots) can be used, or the fit can be tested by two-sample KolmogorovSmirnov test. In the case of S-distance the statistical significance of a given alignment can be derived using the cumulative distribution function of the log-normal distribution. The probability density function ρ(x) of log-normal distribution of a random variable x is given as G2A G1B G4A RA G2A G5A G3A G3A G2B G1A B R G1B G4A G3B G2B G3B G4B G4B a) b) c) Fig. 3: Multiple GSSU structure comparison works in four steps. The figure represents a situation where only the optimal GSSU pair is considered (κ = 1), and the direction of the alignment is given by the order of GSSUs. a) A schematic representation of two RNA structures RA and RB with 5 and 4 GSSUs, respectively. The most similar GSSUs pair is {G3A , G2B } (shaded rectangle). Structures RA and RB are aligned based on the rotation and translation of this pair. The superposition needed to optimally align G4A and G3B was obtained during the all-to-all GSSU pairwise comparison stage, and the rotation angle needed to get from the state {G3A , G2B } to {G4A , G3B } (first down arrow in the figure) is thus known. The current state is changed to {G4A , G3B }, and the process is repeated for the pair {G5A , G4B } (second down arrow in the figure). Similarly, the algorithm must also process in the opposite direction from the position of G2A and G1B (up arrow in the figure). c) The rotation angles ξi from the previous step are used in the penalty function π() which represents a weight function for the GSSU distances. To get the final S-distance, the sum of weighted GSSU S-distances is normalized by the difference in number (β) of the GSSUs in RA and RB structures. the nearest neighbor search is skipped. These alignments will very likely have the triplet S-distance higher than the lowest GSSU distance obtained up to that time. Specifically, S of the triplet-based GSSUs will be probably lower than S of the GSSU they come from. If a triplet T A ⊂ G A is aligned with a triplet T B ⊂ G B with S(G A , G B ) = χ being the best result so far, the comparison computation can be terminated if S(T A , T B ) × 1/λ > χ. Since the early termination is a heuristic (S(T A , T B ) < S(G A , G B ) does not have to be valid), the early termination condition is strengthened by introducing the parameter λ ≥ 1. By varying the λ parameter, the trade-off between accuracy and speed can be set. The higher the λ is, the less often the early termination occurs, and the more accurate and slower the algorithm is. 2.5 Statistical Significance of Structural Alignment The statistical significance of a particular alignment can be derived using the cumulative distribution function of the underlying distribution. Fig. 4 4 ρ(x) = G5A √ x 1 2πσ 2 e − ln x−µ 2 2σ where parameters µ and σ are the mean and standard deviation, respectively, of the variable’s natural logarithm that is, by definition, normally distributed. On a non-logarithmized scale µ is called the location parameter and σ the scale parameter. These parameters must be determined, and once they are known, they can be used to derive the statistical significance of a particular alignment given as its p-value. p-value corresponds to the probability P (x ≤ X) that the variable X takes a value lower or equal to x P (X ≤ x) = 1 1 ln(X) − µ + erf ( √ ) 2 2 2σ 2 where erf (x) is the error function defined as Z x 2 2 e−t dt erf (x) = + π 0 The error function can be approximated (Abramowitz and Stegun, 1970) as 2 erf (x) ≈ 1 − (a1 t + a2 t2 + . . . + a5 t5 )e−x , t = 1 1+p×x where p = 0.3275911, a1 = 0.254829592, a2 = −0.284496736, a3 = 1.421413741, a4 = −1.453152027 and a5 = 1.061405429. The smaller the p-value, the more statistically significant the S-distance is. For the determination of parameters µ and σ a set of unrelated structure pairs based on their sequence similarity (sequence similarity threshold of 80% was used) was prepared. Because the µ and σ parameters depend on the length of the shorter structure N in the structure pair they must be determined for different lengths separately. For each length a dataset containing 50,000 structure pairs was generated by randomly cutting the regions of the given length from structures longer than N . The structure pair data sets of lengths 6, 9, 10, 12, 15, 18, 20, 21, 25, 30, 35, . . . 300 residues were prepared. The S-distance was determined for each alignment in the given structure pairs data set, and the parameters µ and σ of the log-normal distribution were found by maximum likelihood fitting. The distribution was fitted using only the lower half of the data (i.e. data lying below their median). Removing data with high S-distances does not compromise the quality of calculations because we are generally not interested in highly dissimilar structures. In addition, the length dependence of µ and σ is smoothened by excluding dissimilar structures bacause of a significant noise reduction. All statistical calculations were performed using the R system version 2.13.1 (R Development Core Team, 2011) with the package MASS (version 7.3-14). 2.6 Effectiveness Evaluation RNA motifs have been classified according to function, 3D structure or tertiary interaction in the SCOR (Structural classification of RNA) database (Klosterman et al., 2002; Tamura et al., 2004). The SCOR classification system is based on the structure called directed acyclic graph to reflect the fact, that RNA structural elements can have several distinct features and may belong to multiple classes. SETTER was evaluated and compared to other RNA structural alignment algorithms by using three datasets based on functional classification SETTER 3 3.1 RESULTS AND DISCUSSION Statistical Parameters Evaluation QQ-plots show that data follow the fitted log-normal distribution very closely except for the region of high S-distances (see Fig. 4). However, poor fitting in this region will not seriously influence the results of database searching, as we are generally not interested in highly dissimilar structures. The quality of the fit was further confirmed by Kolmogorov-Smirnov test that provided p-values close to zero for all sequence lengths. The location and scale parameters (µ and σ) of the log-normal distribution depend on the length of the sequence N (Capriotti and Marti-Renom, 2008) in a non-trivial way (see Fig. 5). Location parameter µ can be fitted by one polynomial curve µ = a + b × N + c × n2 (2) where a = −4.8779 × 10−1 , b = 1.8993 × 10−2 , and c = −3.2770 × 10−5 . On the other hand, scale parameter σ decreases with increasing length in the region of lower lengths (up to 18 residues), and increases with length for longer sequences. Thus, the dependence of the σ parameter on the sequence length N was fitted separately for these two regions, each fit following the power law. ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2.5 ● ● ● ● 1.3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● σ 2.0 1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● µ Scale parameter ● ● ● ● ● 1.4 3.0 ● ● 1.5 3.5 Location parameter ● ● ● ● 1.0 ● ● ● ● 1.1 ● 0.5 obtained from the SCOR, version 2.0.3. — FSCOR, T-FSCOR, and RFSCOR (Capriotti and Marti-Renom, 2008). The FSCOR contains all RNA chains with more than three nucleotides with unique functional classification, the R-FSCOR is a structurally dissimilar subset of the FSCOR, and the T-FSCOR set contains structures from the FSCOR set not present in the R-FSCOR set. Using these datasets, RNA similarity methods were evaluated by their ability to correctly assign the functional (i.e. SCOR) classification to the query RNA structure by comparison with a database of classified RNA structures. Two RNA structures can be either functionally identical (the so-called exact classification, i.e., they have the same deepest SCOR classification) or functionally similar (the socalled similar classification, i.e., they do not agree at the deepest level but share classification at the parent level). Specifically, two experiments were performed — a leave-one-out test on the FSCOR dataset and a test assigning functions to structures from the T-FSCOR with the R-FSCOR serving as the database set. Two quality measures were utilized to compare various classification algorithms — classification accuracy (ACC), and area under the ROC curve (AUC). ACC is defined as the ratio of correctly classified structures to the total number of queries. The ROC curve is obtained by identifying the most similar database structure (nearest neighbor) and its p-value for each query. The nearest neighbors for all queries are sorted according to their p-values, and the p-value threshold is varied from the most similar (with the lowest p-value) to the most dissimilar pair. For each threshold, the number of true positives (T P ) is calculated as the number of structure pairs with identical/similar function lying above this threshold. Rest of the structures lying above the threshold then represents false positives (F P ). If P (positives) is the number of pairs with identical/similar function in the whole result set and N (negatives) the number of pairs with different functions, then FNP is called false positive ratio (F P R) and TPP true positive ratio (T P R). The ROC curve is given as T P R (y-axis) against F P R (x-axis) generated by varying the p-value threshold. The area under the ROC curve (AUC) can then be calculated. High AUC means that correct classifications are present mostly at the top of the list of nearest neighbors sorted by their p-values. It must also be stressed, that both ACC and AUC should be reported to objectively compare classification methods, because with high ACC it is more difficult to get high AUC (see Supplementary Information). ● ● ● ● ● ● ●● ● ●● 0 50 100 150 Length 200 250 300 0 50 100 150 Length 200 250 300 Fig. 5: Length dependence of log-normal parameters. While location parameter µ was fitted by one curve (left plot), two curves were needed for scale parameter σ (right plot). σ = a × Nb (3) For sequence lengths up to 18 a and b parameters are equal to a = 1.0239 and b = − 3.1648 × 10−1 , and for longer sequences a = 2.0280 × 10−1 and b = − 2.3256 × 10−1 . The relations (2) and (3) provide a simple way to calculate µ and σ parameters for any sequence length. 3.2 Effectiveness Comparison SETTER was compared with two most effective approaches SARA and iPARTS using published AUC (SARA, iPARTS) and ACC (SARA only) values on the FSCOR and T/R-FSCOR datasets. Following settings were used in SETTER: α = 0.2, β = 2, = 4 Å κ = 3, and λ = 1 (see sections 2.2, 2.3 and 2.4 for details). The results are summarized in Table 1. Three sets of results with varying p-value threshold are presented for SETTER. The percentage of classified structures for the given p-value threshold characterizes each of these sets and is called coverage. ST RpV =1 corresponds to the classification where the structures are sorted according to their p-values but no threshold is applied (FSCOR coverage equals to 100%). For ST RpV =0.01 the query is classified only if the p-value of the S-distance between the query and the closest structure is less than 1%. In this case 63 structures were not classified corresponding to the FSCOR coverage of 85%. Similarly, ST RpV =4.4E−07 corresponds to the FSCOR coverage equaling to 58.7% at which level SARA was evaluated (Capriotti and MartiRenom, 2009). However, while SARA coverage of T/R-FSCOR dataset equals to 56%, SETTERs’ coverage at this p-value increases to 64.3%. Another issue influencing the accuracy is the effect of the score (p-value in SETTER) rounding. SARA rounds its score to two digits which always leads to artificial increase in ACC, and sometimes may lead to artificial increase in AUC as well. The reason is that correctly classified structures with the same score are used preferrentially over the wrongly classified structures with the same score. However, SETTER rounds to more digits, and is much less susceptible to this behavior. Since iPARTS does not employ any filtering, SETTER and iPARTS were compared at the same coverage 100%, i.e., no filtering by p-value was applied (p-value = 1). Results in Table 1 show, that SETTER outperforms iPARTS in AUC for both FSCOR and T/RFSCOR datasets, respectively. To compare the accuracy of SETTER and SARA, same level of coverage should be utilized. SARA’s coverage of 58.7% for the FSCOR dataset corresponds to the pvalue of 4.4 × 10−7 in SETTER. At this siginificance level SARA’s 5 D. Hoksza and D. Svozil 5 10 20 10 15 4000 0 0 10 20 5 30 40 0 Theoretical quantiles 20 30 40 50 0 20 40 Distance ● ● 40 40 ●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 30 10 15 Distance 0 Data quantiles 0 10000 8000 0 0 60 Theoretical quantiles 80 ●● ● ●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 50 100 60 80 Distance ● ● ● 60 7 ● 40 6 20 5 0 4 ● 20 3 Distance 10 2 ●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1 Length = 300 4000 4000 0 5000 0 0 1 2 3 4 5 6 7 Length = 200 8000 Length = 100 0 Frequency 15000 Length = 10 150 200 Theoretical quantiles ●●●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 100 ● ● ● 300 400 200 Theoretical quantiles Fig. 4: The histograms of frequencies of S-distances were fitted by log-normal distribution (upper plots, the curves). Each of the sequence length must be fitted separately (histograms for only four sequence lengths are shown). The quality of fit can be visually inspected from the quantile-quantile (Q-Q) plots (lower plots). If the distribution of data is similar to the fitted log-normal distribution, the points in the Q-Q plot lie on the straight line. coverage on the T/R-FSCOR dataset equals to 56%. Using the pvalue threshold of 4.4 × 10−7 SETTER performs better than SARA in terms of AUC both for FSCOR and T/R-FSCOR datasets (see Table 1). For iPARTS, ACC was not reported, and necessary tests can not be performed using the iPARTS web interface. However, SETTER can be compared with SARA for which ACC values are avilable. Rather high AUC of SETTER should be reflected in lowered ACC. Still, SETTER’s ACC is higher than that of SARA in exact classification both for FSCOR (by 3.6%) and T/R-FSCOR (by 7.6%) datasets, but lower in similar classification by 4.2% for FSCOR, and by 6.8% for T/F-SCOR dataset. However, when comparing SETTER with SARA on the T/R-FSCOR dataset it must be taken into account that at the p-value threshold equaling to 4.4 × 10−7 SARA’s coverage is 56% while SETTER’s coverage is 64.3%. Reducing higher coverage of SETTER to the 56% of SARA would lead to increase in SETTER’s accuracy. Thus, it can be concluded that SETTER performs better than iPARTS and SARA in terms of AUC and is comparable with SARA in terms of ACC. Table 1. ACC and AUC comparison of iPARTS, SARA, SETTER on the FSCOR and T/R-FSCOR datasets. The values are given in % and are reported for exact/similar classification. ROC curves from which AUC was calculated are shown in Supplementary Information. iPARTS SARA ST RpV =1 ST RpV =0.01 ST RpV =4.4E−07 3.3 FSCOR AUC ACC 72/92 ? 61/83 81.4/95.3 85/93 57.8/67.5 90/95 69.1/75.3 83/91 85.0/91.1 T/R-FSCOR AUC ACC 77/90 ? 58/85 78.0/94.5 93/93 67.4/70.5 90/90 77.0/80.1 85/85 85.6/87.7 Speed Comparison Experiments measuring running times of SARA, SETTER and iPARTS were also performed. The runtime of SETTER was measured on Linux machine with 4 Intel(R) Xeon(R) CPUs E7540, 6 2GHz (the algorithm is not parallelized thus only one core per comparison was utilized) and 132 GB of RAM (however, the average memory size needed to store the representations of all RNA structures from the FSCOR set was less than 3.3 MB) running Red Hat Linux. Because neither SARA, nor iPARTS offer the standalone versions for download, their runtimes were obtained from their web interfaces. Thus, not using the same hardware for all benchmarked algorithms the comparison is only qualitative. However, the variations between SETTER and SARA/iPARTS are substantial (see Table 2), and can not be attributed to different hardware setup only. Table 2. Runtime comparison of iPARTS, SARA and SETTER for datasets of RNA structures of various size. The D1 set contains tRNA structures 1EHZ:A, 1H3E:B, 1I9V:A, 2TRA:A and 1YFG:A (average length 76 nucleotides), D2 set contains ribozyme P4-P6 domain 1GID:A, 1HR2:A and 1L8V:A (average length 157 nucleotides), D3 contains domain V of 23S rRNA 1FFZ:A and 1FG0:A (average length 496 nucleotides), D4 contains 16S rRNA 1J5E:A and 2AVY:A (average length 1522 nucleotides), and D5 contains the currently largest RNA structures in PDB – yeast 25S rRNA 3O58:1 and 3O5H:1 (average length 3396 nucleotides). data set D1 D2 D3 D4 D5 iPARTS 1.1 s 2.6 s 17.0 s 2.8 min ? SARA 1.7 s 9.2 s ? ? ? SETTER 0.3 s 1.2 s 3.4 s 19.7 s 57.2 s Table 2 shows times of all-to-all comparisons on four datasets containing RNA structures of various size. While SARA can align only structures having no more than 1000 residues, and iPARTS only structures with no more than 1900 residues, there are no size limits in SETTER. iPARTS utilizes dynamic programming technique for the alignment, and thus the speed grows quadratically with the increasing structure size. SETTER’s nearest neighbor identification procedure scales as O(n2 ) with the size of the GSSU (not with the size of structure!), and employment of the heuristic SETTER speed optimization with λ = 1 further reduces the number of O(n2 ) computations. Comparison with SARA is difficult, because for two of four data sets the server times out, and returns no results. However, from the runtime experiments it can be concluded, that SETTER is much better suited for the alignment of even the largest RNA structures. 4 CONCLUSIONS Fast and accurate algorithm for RNA structural comparison SETTER has been developed and tested on a large benchmark data set. The efficiency of the algorithm is given by the decomposition of the RNA structure into the set of generalized secondary structure motifs (GSSUs). Such a segmentation offers good scalability with respect to the structure size (SETTER scales linearly with the structure size) because the number of residues in GSSUs (SETTER scales quadratically with the GSSU size) generally does not increase with increased size of the RNA structure. The efficiency of the method is clearly demonstrated by comparing with iPARTS, and SARA approaches. SETTER method is capable of aligning even the largest RNA structures deposited in the PDB database in a reasonable amount of time (less than 1 minute), and represents thus an important addition to the portfolio of automatic RNA structural analysis tools. ACKNOWLEDGEMENT This work was supported by the Czech Science Foundation (GAČR) project Nr. 201/09/0683 and by the Ministry of Education of the Czech Republic MSM6046137302. REFERENCES Abramowitz, M. and Stegun, I. A. (1970). Handbook of Mathematical Functions. Dover, New York. Arnez, J. G. and Steitz, T. A. (1994). Crystal structure of unmodified tRNA(Gln) complexed with glutaminyl-tRNA synthetase and ATP suggests a possible role for pseudo-uridines in stabilization of RNA structure. Biochemistry, 33(24), 7560–7567. PMID: 8011621. Ban, N., Nissen, P., Hansen, J., Moore, P. B., and Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 a resolution. Science (New York, N.Y.), 289(5481), 905–920. PMID: 10937989. Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116, 281–297. Camproux, A. C., Gautier, R., and Tufféry, P. (2004). A Hidden Markov Model Derived Structural Alphabet for Proteins. Journal of Molecular Biology, 339(3), 591–605. Capriotti, E. and Marti-Renom, M. A. (2008). Rna structure alignment by a unit-vector approach. Bioinformatics, 24, i112–i118. Capriotti, E. and Marti-Renom, M. A. (2009). Sara: a server for function annotation of rna structures. Nucl. Acids Res., page gkp433. Chang, Y.-F., Huang, Y.-L., and Lu, C. L. (2008). Sarsa: a web tool for structural alignment of rna using a structural alphabet. Nucleic Acids Res, 36(Web-ServerIssue), 19–24. Chew, L. P., Huttenlocher, D., Kedem, K., and Kleinberg, J. (1999). Fast detection of common geometric substructure in proteins. Journal of computational biology : a journal of computational molecular cell biology, 6(3-4), 313–325. de Brevern, A. G., Etchebest, C., and Hazout, S. (2000). Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins, 41(3), 271–287. Dror, O., Nussinov, R., and Wolfson, H. (2005). ARTS: alignment of RNA tertiary structures. Bioinformatics, 21 Suppl 2. Dror, O., Nussinov, R., and Wolfson, H. J. (2006). The ARTS web server for aligning RNA tertiary structures. Nucleic Acids Res, 34(Web Server issue). Eddy, S. R. (2001). Noncoding RNA genes and the modern RNA world . Nature Reviews Genetics, 2(12), 919–929. Ferrè, F., Ponty, Y., Lorenz, W. A., and Clote, P. (2007). Dial: a web server for the pairwise alignment of two rna three-dimensional structures using nucleotide, dihedral angle and base-pairing similarities. Nucleic Acids Res, 35(Web-ServerIssue), 659–668. Garman, E. (2003). ’Cool’ crystals: macromolecular cryocrystallography and radiation damage. Current Opinion in Structural Biology, 13(5), 545–551. PMID: 14568608. Hendrix, D. K., Brenner, S. E., and Holbrook, S. R. (2005). RNA structural motifs: building blocks of a modular biomolecule. Quarterly reviews of biophysics, 38(3), 221–243. Holbrook, S. R. (2008). Structural principles from large RNAs. Annual review of biophysics, 37(1), 445–464. Holbrook, S. R., Cheong, C., Tinoco, I, J., and Kim, S. H. (1991). Crystal structure of an RNA double helix incorporating a track of non-Watson-Crick base pairs. Nature, 353(6344), 579–581. PMID: 1922368. Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32(5), 922–923. Kelley, R. L. and Kuroda, M. I. (2000). Noncoding RNA genes in dosage compensation and imprinting. Cell, 103(1), 9–12. Kim, R., Holbrook, E. L., Jancarik, J., and Kim, S. H. (1995). Synthesis and purification of milligram quantities of short RNA transcripts. BioTechniques, 18(6), 992–994. PMID: 7546725. Kim, S. H., Suddath, F. L., Quigley, G. J., McPherson, A., Sussman, J. L., Wang, A. H., Seeman, N. C., and Rich, A. (1974). Three-dimensional tertiary structure of yeast phenylalanine transfer RNA. Science (New York, N.Y.), 185(149), 435–440. PMID: 4601792. Klosterman, P. S., Tamura, M., Holbrook, S. R., and Brenner, S. E. (2002). SCOR: a Structural Classification of RNA database. Nucleic Acids Res, 30(1), 392–394. Kolodny, R. and Linial, N. (2004). Approximate protein structural alignment in polynomial time. Proc Natl Acad Sci U S A, 101(33), 12201–12206. Lu, X.-J. and Olson, W. K. (2008). 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature Protocols, 3(7), 1213–1227. R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3900051-07-0. Scott, W. G. (2007). Ribozymes. Current Opinion in Structural Biology, 17(3), 280– 286. Tamura, M., Hendrix, D. K., Klosterman, P. S., Schimmelman, N. R., Brenner, S. E., and Holbrook, S. R. (2004). SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Res, 32(Database issue). Tinoco, I. (1999). How RNA folds. Journal of Molecular Biology, 293(2), 271–281. Wang, C.-W., Chen, K.-T., and Lu, C. L. (2010). iPARTS: an improved tool of pairwise alignment of rna tertiary structures. Nucleic Acids Res, 38 Suppl, W340–7. Wimberly, B. T., Brodersen, D. E., Clemons, W M, J., Morgan-Warren, R. J., Carter, A. P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. (2000). Structure of the 30S ribosomal subunit. Nature, 407(6802), 327–339. PMID: 11014182. 7
© Copyright 2026 Paperzz