An efficient RNA pairwise structure comparison by SETTER method.

Vol. 00 no. 00 2011
Pages 1–7
BIOINFORMATICS
David Hoksza 1,2,∗ and Daniel Svozil 2,∗
1 SIRET Research Group, Department of Software Engineering, FMP, Charles University in Prague, Czech Republic
2 Laboratory of Informatics and Chemistry, Institute of Chemical Technology Prague, Czech Republic
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Motivation: Understanding the architecture and function of RNA,
including large molecules such as ribosomes, requires methods
for comparing and analyzing their tertiary and quaternary structures.
While structural alignment of short RNAs is achievable in a
reasonable amount of time, large structures represent a much bigger
challenge. However, the growing number of structures of large RNAs deposited
in the PDB database calls for the development of fast and accurate
methods for analyzing their structures, as well as for rapid similarity
searches in databases.
Results: In this article, a novel algorithm for RNA structural
comparison, SETTER (SEcondary sTructure-based TERtiary Structure
Similarity Algorithm), is introduced. SETTER utilizes a pairwise
comparison method based on the 3D similarity of the so-called generalized
secondary structure units (GSSUs), which resemble secondary structure
motifs. The presented algorithm is both accurate and very fast, and
does not impose limits on the size of aligned RNA structures.
Availability: The SETTER program, as well as all datasets, are freely
available from http://siret.cz/hoksza/projects/setter/.
Contact: [email protected], [email protected]
Supplementary information: Supplementary information is available
at Bioinformatics online.
1
INTRODUCTION
Though the main biological function of RNA is the transfer of
genetic information, the evidence shows that RNA molecules also
play key roles in a variety of cellular processes (Eddy, 2001). Among
others, RNA shows enzymatic activity in ribozymes (Scott, 2007),
takes part in transcription regulation (Bartel, 2004), and is involved in
chromatin modeling (Kelley and Kuroda, 2000).
RNA 3D structure is hierarchical (Tinoco, 1999), and can be
divided into primary, secondary, tertiary and quaternary levels.
RNA secondary structure motifs (Hendrix et al., 2005) can be
defined as double helices interconnected by various types of loop
structures, and are stable independently of their 3D folds. Tertiary
interactions (Holbrook, 2008) stabilize the overall arrangement
and packing of double helices in large RNA structures. The first
resolved RNA crystal structure was that for the yeast phenylalanine
tRNA (Kim et al., 1974). This achievement was followed by
solving structures of other naturally occurring tRNAs (Arnez
and Steitz, 1994), as well as of a variety of oligomeric RNA
model structures (Holbrook et al., 1991). With the improvements
in molecular biological methods (Kim et al., 1995) and in
crystallographic techniques (Garman, 2003), the size of solved RNA
structures has since increased dramatically. Among the largest
RNAs, a prominent place is occupied by the structures of small
and large ribosomal subunits of different organisms (Ban et al.,
2000; Wimberly et al., 2000). Large RNAs¹ allowed, for the first
time, detailed studies of RNA structure elements not found in smaller
molecules, such as continuous interhelical base stacking, RNA
domain structure, and helical packing.
∗ To whom correspondence should be addressed.
¹ An arbitrary limit of 100 residues is used to define large RNAs (Holbrook, 2008).
© Oxford University Press 2011.
The function of an RNA molecule is largely determined by
its 3D structure, which is typically more evolutionarily conserved
than its sequence. Thus, methods for comparative RNA function
annotation based on structural similarity usually yield much
better results than sequence-based approaches. Though detecting
optimal structural similarity between two biomolecules at the
tertiary level has been shown to be NP-hard (Kolodny and Linial,
2004), the development of automatic tools capable of efficient and
accurate RNA structural alignment and comparison has become
an important part of structural bioinformatics. RNA structural
alignment software should be able to work not only with small
molecules, but also with large RNAs, to allow researchers to study
in detail how combinations of basic building blocks result in
tertiary and quaternary structures. To be computationally tractable,
currently available software tools for comparing two RNA 3D
structures, such as ARTS (Dror et al., 2005, 2006), DIAL (Ferrè
et al., 2007), iPARTS (Wang et al., 2010), SARA (Capriotti and
Marti-Renom, 2008, 2009) or SARSA (Chang et al., 2008), are thus
based on heuristic approaches. ARTS (Dror et al., 2005, 2006)
detects the maximum common substructure between two RNA 3D
structures using backbone phosphate atoms, and its theoretical worst-case
complexity is O(N^3). ARTS is a good tool for detecting RNA
structural motifs; however, due to its cubic time complexity, it is
not practical for comparison of large RNA molecules. To overcome
this problem, the DIAL server, which uses a dynamic programming algorithm
and runs in quadratic time, was developed (Ferrè et al., 2007).
The DIAL alignment algorithm is based on torsion and/or pseudo-torsion similarity, sequence similarity, and base-pairing similarity,
and it provides access to global, local and semi-global structural
alignments. An improvement in speed over the DIAL algorithm was
later brought by PARTS (Chang et al., 2008), an algorithm based
on the so-called structural alphabet (SA). The structural alphabet is an
emerging concept in the structural biology of proteins, in which
a protein structure is represented as a limited series of "letters",
each assigned to a well-characterized conformation (de Brevern
et al., 2000; Camproux et al., 2004). PARTS uses a vector
quantization approach to derive an RNA structural alphabet of 23
letters representing the most common backbone conformations.
The RNA structures are then reduced to 1D sequences of SA
letters, and these are aligned utilizing classical methods for pairwise
and multiple sequence alignments. A new set of SA letters was
derived for the improved version of PARTS called iPARTS (Wang
et al., 2010). iPARTS outperforms its previous version, PARTS,
as well as (in some aspects) another highly efficient algorithm,
SARA (Capriotti and Marti-Renom, 2008, 2009). The SARA program
represents distances among selected atoms as unit vectors existing
on the unit spheres (Chew et al., 1999). All-to-all unit-vector RMSD
distances of consecutive unit spheres are computed and used as
a scoring matrix for the subsequent dynamic programming-based global
alignment.
In this paper, a novel pairwise RNA comparison
method, SETTER (SEcondary sTructure-based TERtiary Structure
Similarity Algorithm), is proposed. The method divides the
whole RNA structure into a set of non-overlapping generalized
secondary structure units (GSSUs), and a distance measure based
on RMSD transformations between all possible pairs of GSSUs
is utilized to obtain the resulting RNA structure comparison. The
comparison algorithm scales as O(n^2) with the size of a GSSU,
and as O(n) with the number of GSSUs in the structure. The
segmentation into GSSUs offers the advantage of exemplary
runtimes even for the largest structures, in which only the number
of GSSUs increases while their sizes remain practically constant
irrespective of the size of the structure. In addition, the higher speed of
the algorithm does not compromise its accuracy, as demonstrated
by comparison with the SARA and iPARTS approaches.
2
For the purpose of the algorithm, each nucleotide in a GSSU is represented
by its C4′ atom, and Watson-Crick (WC) hydrogen bonds are identified
using 3DNA (Lu and Olson, 2008). The process iteratively applies the following
two steps for each GSSU present in the RNA structure. The RNA structure is
processed in sequence order, and in the first step nucleotides are stored on a
stack. This step stops upon encountering a nucleotide nt_i that is WC bonded with
a nucleotide nt_j already on the stack. Then, in the second step, a new GSSU
G starts to be formed from the pair {nt_i, nt_j} (i.e., the neck) and from
all non-hydrogen-bonded nucleotides found between nt_i and nt_j (i.e., the
loop). Finally, the stem is added to the GSSU by identifying all nucleotides
found between the neck and the first nucleotide WC bonded to a residue
not stored on the stack. By repeating these two steps, the algorithm iteratively
searches for GSSUs, and it stops when the end of the sequence is reached.
The GSSU extraction algorithm is described in detail in the Supplementary
Information. Note that even a structure without a single Watson-Crick pair
has a GSSU, which is identical to the structure itself.
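The first two steps of this scan can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `partner` list (index of the WC-bonded partner of each nucleotide, or `None`) and the function name `first_neck_and_loop` are assumptions made here, and the stem extension and the repeated search for further GSSUs are left out because their details are given only in the Supplementary Information.

```python
def first_neck_and_loop(partner):
    """Scan nucleotides in sequence order and return the first neck and its loop.

    partner[k] is the index of the nucleotide WC-bonded to k, or None
    (hypothetical input format chosen for this sketch).
    """
    stack = []  # nucleotides seen so far, in sequence order
    for i, j in enumerate(partner):
        if j is not None and j in stack:
            # Step 2: nt_i pairs with nt_j already on the stack -> neck {nt_j, nt_i}.
            between = stack[stack.index(j) + 1:]
            loop = [k for k in between if partner[k] is None]
            return (j, i), loop
        # Step 1: keep storing nucleotides on the stack.
        stack.append(i)
    return None, []  # no WC pair found


# Hairpin "((..))": pairs 0-5 and 1-4, loop nucleotides 2 and 3.
neck, loop = first_neck_and_loop([5, 4, None, None, 1, 0])
print(neck, loop)  # (1, 4) [2, 3]
```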
2.2
Comparing Two GSSUs
GSSU pairwise comparison lies at the heart of the method. It is employed
even when SETTER compares structures consisting of multiple GSSUs.
Each GSSU is represented by an ordered set of 3D coordinates enhanced
with bonding and nucleotide/atom type information. The common way
to assess similarity between two sets of points is to define a pairing between
them. The sets are then superposed by finding the translation and rotation of one
structure over the other that minimize the mutual distances of the respective
paired points. Usually, the root mean square deviation (RMSD) is used
as the distance measure, and two structures can be superposed given a
pairing/alignment in polynomial time (Kabsch, 1976). However, finding the
optimal alignment (which in turn minimizes the distance measure) is an
NP-hard problem (Kolodny and Linial, 2004). The optimal solution can be
found by exhaustive search, which is, however, computationally infeasible.
This problem can be addressed by identifying suboptimal alignments that will
likely participate in the optimal alignment. This is the principal idea behind
SETTER’s structure comparison process. SETTER generates a set of short
alignments whose quality is evaluated by the Kabsch RMSD
algorithm (Kabsch, 1976). Working with relatively short alignments makes it possible to superpose
even the largest RNA structures in a reasonable amount of time.
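For a fixed pairing of points, the optimal superposition mentioned above can be computed with the standard SVD formulation of the Kabsch algorithm. The sketch below assumes NumPy is available; the function names `kabsch` and `rmsd` are chosen here for illustration and are not taken from the SETTER program.

```python
import numpy as np


def kabsch(P, Q):
    """Return (R, t) minimizing the RMSD between R @ p_i + t and q_i for paired rows."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t


def rmsd(P, Q, R, t):
    """RMSD between the transformed points of P and their partners in Q."""
    diff = (np.asarray(P, float) @ R.T + t) - np.asarray(Q, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


# Usage: recover a known 90-degree rotation about z plus a shift.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])
R, t = kabsch(P, Q)
print(rmsd(P, Q, R, t) < 1e-9)
```

With only three paired points, as in the triplets used by SETTER, the same routine applies unchanged.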
To superpose two GSSUs means to match their loops, which in turn implies
matching their necks (see Fig. 1). To define the superposition unambiguously
in three-dimensional space, at least three pairs of points are needed. A set of
these points is called a triplet, and the alignment is formed by matching triplets
METHODS
2.1
GSSU Identification
A GSSU usually resembles a hairpin motif, possibly containing bulges and/or
internal loops in its stem part. Three important elements are recognized in
the GSSU: loop, neck, and stem (see Fig. 1). The formal description of the
GSSU is given by the following definition.
DEFINITION 1. Let $R$ be an RNA structure with nucleotide sequence
$\{n_i\}_{i=1}^{n}$ and let $WC \subset R$ denote the subset of $n_i$ participating in a
Watson-Crick base pair. By generalized secondary structure unit (GSSU)
$G$, we understand a pair of substrings of $R$, $\{n_i\}_{i=i_1}^{i_2}$ and $\{n_j\}_{j=j_1}^{j_2}$
($i_1 \le i_2 < j_1 \le j_2$, $i_2 = j_1 - 1$) of maximum lengths such that
each nucleotide $n_x \in G$ satisfies:
• $i_1 \le x \le i_2$: $n_x \notin WC$ or $n_x$ is paired with $n_y$ where $j_1 \le y \le j_2$
• $j_1 \le x \le j_2$: $n_x \notin WC$ or $n_x$ is paired with $n_y$ where $i_1 \le y \le i_2$
Let $i_{max}$ and $j_{min}$ be the highest/lowest indexes of the Watson-Crick
paired bases in $G$. We define loop as $L = \{n_i\}_{i=i_{max}+1}^{j_{min}-1} \subset G$, stem as
$G \setminus L$ and neck as the pair $\{n_{i_{max}}, n_{j_{min}}\}$.
[Figure 1: schematic of an RNA structure drawn from the 5' end to the 3' end, with three numbered GSSUs and the loop, neck, and stem labeled.]
Fig. 1: Three GSSUs extracted from an RNA structure. The
sequence starts at the 5’ end and the numbers denote the order of GSSU
generation.
[Figure 2: two GSSUs shown before and after rotation and translation; residue numbers 47, 52, 911, 916, and 922 are marked, and lines 1, 2, and 3 connect the paired nucleotides.]
Fig. 2: Alignment of a GSSU from the tRNA domain of transfer-messenger RNA (PDB code 1P6V) with a GSSU from glutamine
tRNA (PDB code 1EXD). The final structural alignment is defined
by three nucleotide pairs forming a triplet (lines 1, 2, and 3). To
find the optimal superposition for the given neck pairs (lines 1 and
2), the position of the middle pair is varied (line 3).
of a GSSU B, it is not clear in which direction the loops are oriented in
3D space (whether the correct alignment is $\{\{n_1^A, n_1^B\}, \{n_2^A, n_2^B\}\}$ or
$\{\{n_1^A, n_2^B\}, \{n_2^A, n_1^B\}\}$), and therefore both possibilities are investigated.
These adjustments are necessary for accurate GSSU comparison; however, they
slightly increase the running time of SETTER.
Though in most cases the GSSU consists of a stem and a loop, this is not
a strict rule, as demonstrated by GSSU number 3 in Fig. 1. Two particular
situations can occur: the GSSU has a zero-sized loop, or the RNA does not
have a single hydrogen bond (i.e. it does not have a secondary structure
at all). In the case of a GSSU without a loop, the third nucleotide for the triplet
alignment is selected from the stem, and its position within the stem is
varied to get the lowest S-distance of the superposed GSSUs. When dealing
with a GSSU having no secondary structure, several triplets covering the whole
structure are formed and used as a basis for the alignment. Such a simplified
comparison might not lead to the best possible result, but since SETTER is
intended to compare mostly large structures, the probability of encountering
a GSSU with no secondary structure is relatively small.
2.3
between two given structures. SETTER first aligns the necks and then tries to
identify the final pair in the triplet by aligning each possible pair of loop
nucleotides. For example, if two GSSUs with loops consisting of n and m
nucleotides are to be aligned, n × m alignments are generated (see Fig. 2).
For each alignment, a rotation matrix and a translation vector defining
the optimal superposition of the triplets are calculated, and though these are
optimal for the given triplet only, they are subsequently used to superpose the
whole GSSUs. This possible inaccuracy is a trade-off for efficiency. After
the superposition, the 3D nearest neighbor is found for each nucleotide in
the aligned GSSUs, and their distances are added to the overall distance of
the two GSSUs. Finally, the distance is normalized, resulting in a measure
referred to as the S-distance. The whole process is formalized by Eq. 1.
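$$
NN_{\zeta}(x, G) =
\begin{cases}
\min_{1 \le i \le |G|} \{ d(x, G_i) \} \times \zeta & \text{if } x = G_{i_{\min}} \\
\min_{1 \le i \le |G|} \{ d(x, G_i) \} & \text{otherwise}
\end{cases}
$$
$$
\gamma(G^A, G^B) = \sum_{i=1}^{|G^A|}
\begin{cases}
1 & \text{if } NN_1(G^A_i, G^B) \le \epsilon \\
0 & \text{otherwise}
\end{cases}
$$
$$
\delta(G^A, G^B) = \min_{t \in T} \Big\{ \sum_{i=1}^{|G^A|} NN_{\alpha}(G^A_i, \tau(G^B, t)) \Big\}
$$
$$
S(R^A, R^B) = S(G^A, G^B) =
\frac{\dfrac{\delta(G^A, G^B)}{\min\{|G^A|, |G^B|\}} \times \left(1 + \dfrac{\big||G^A| - |G^B|\big|}{\min\{|G^A|, |G^B|\}}\right)}{\gamma(G^A, \tau(G^B, t_{opt}))}
\tag{1}
$$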
where $G^A$ (found in an RNA structure $R^A$) and $G^B$ (found in an RNA
structure $R^B$) represent the GSSUs to be compared, $G_i$ stands for the i-th
nucleotide in the nucleotide sequence of $G$ and $|G|$ for its length. $NN_{\zeta}(x, G)$
is the Euclidean distance from the nucleotide x to its nearest neighbor
in G. If x and its nearest neighbor share the same nucleotide type, the
distance is multiplied by the factor ζ. δ computes the raw distance: T is the
set of transpositions resulting from the candidate triplet alignments, and
τ(G, t) transposes GSSU G using the transposition t. The S-distance is then
normalized by the function γ, which counts the number of nearest neighbors within
the distance ε after the optimal transposition $t_{opt}$.
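The quantities of Eq. 1 can be illustrated for one fixed transposition, i.e. with the second GSSU already transformed. The following is a sketch under that simplifying assumption (the minimization over all candidate transpositions in T is omitted); the function names, the coordinate and type lists, and the handling of γ = 0 are choices made here, while the defaults for α and ε follow the settings reported in the Results.

```python
import math


def nearest_neighbor(x, x_type, coords, types, zeta):
    """NN_zeta(x, G): distance to the nearest nucleotide of G, times zeta on a type match."""
    i_min = min(range(len(coords)), key=lambda i: math.dist(x, coords[i]))
    d = math.dist(x, coords[i_min])
    return d * zeta if x_type == types[i_min] else d


def s_distance(a, a_types, b, b_types, alpha=0.2, eps=4.0):
    """S-distance for a single, already applied transposition of GSSU b."""
    delta = sum(nearest_neighbor(x, t, b, b_types, alpha) for x, t in zip(a, a_types))
    gamma = sum(nearest_neighbor(x, t, b, b_types, 1.0) <= eps for x, t in zip(a, a_types))
    if gamma == 0:
        return math.inf  # assumption of this sketch: no neighbor within eps
    shorter = min(len(a), len(b))
    size_term = 1 + abs(len(a) - len(b)) / shorter
    return delta / shorter * size_term / gamma


# Two dinucleotides one unit apart: matching types shrink the raw distance by alpha.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(s_distance(a, "AC", b, "AC"), s_distance(a, "AC", b, "GG"))
```

In a full comparison, this value would be evaluated for every candidate triplet superposition and the smallest one kept.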
Since hydrogen bonds are identified using simple geometric criteria, their
detection may sometimes be incorrect. This leads to a shift of the neck
position within a GSSU. Therefore, SETTER tries to simulate the neck
shift by also aligning the residues next to (under) the necks. Moreover,
when aligning the neck $\{n_1^A, n_2^A\}$ of a GSSU A with the neck $\{n_1^B, n_2^B\}$
Comparing More Than Two GSSUs
In the case when one of the structures contains more than one GSSU, the
following three-step multiple GSSU comparison process is implemented
(see Fig. 3):
1. Perform all-to-all pairwise GSSU comparisons.
2. Use a few of the best GSSU pairs (with low S-distances) as seeds for
the alignment of the rest of the GSSUs.
3. Align the structures’ GSSUs (each treated as a whole) and declare the
multiple GSSU alignment with the lowest aggregated S-distance
the overall pairwise structure similarity score.
Let us compare structures $R^A$ and $R^B$, with the distance between
them denoted as S. Each GSSU from $R^A$ is compared to each GSSU
from $R^B$, but only the top κ pairs with the minimum distance are processed
further. For each of the κ selected pairs $\{G_i^A, G_j^B\}$, the value of S is set
to $S(G_i^A, G_j^B)$. In the second step, the S-distances of the neighboring GSSU
pairs are iteratively added to S. For the GSSU pair $\{G_{i+1}^A, G_{j+1}^B\}$,
the value of $S(G_{i+1}^A, G_{j+1}^B)$ and the penalty for the rotation needed to
transform the structures from the state corresponding to $S(G_i^A, G_j^B)$ to the
state corresponding to $S(G_{i+1}^A, G_{j+1}^B)$ are added to S. This process continues
from {i + 1, j + 1} until either i + 1 or j + 1 reaches the number of GSSUs
in $R^A$ or $R^B$. Similarly, the other ends of the structures need to be aligned,
and so the process is repeated for {i − 1, j − 1}. However, the case when
the GSSUs in the RNA structure are oriented in the opposite direction must also
be considered, and another κ alignments must be performed, aligning the i − 1
residues with the j + 1 residues (not shown in Fig. 3).
The rotation between two GSSUs imposes a penalty on S. This penalty
is calculated as a distance between the rotation matrices defining the two
consecutive GSSU superpositions (see Fig. 3c). The penalty for translation
is, on the other hand, not included explicitly, as it is already present in
the pairwise GSSU S-distance, where the difference in the GSSUs’ sizes is
used for normalization (see section 2.2). This approach to multiple GSSU
comparison allows any pair of structures, including the largest ones, to be matched
in a reasonable amount of time (section 2.6).
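The seed extension can be sketched as a walk along the diagonal through a seed pair, adding each neighboring S-distance plus a rotation penalty. The text does not give the exact form of the penalty, so the sketch below makes two assumptions that are not taken from the source: the distance between rotation matrices is measured as the rotation angle between them, and the penalty is added as a plain function of that angle (`penalty`, defaulting to the angle itself). The opposite-orientation case and the final normalization are left out.

```python
import math


def rotation_angle(r1, r2):
    """Angle in radians of the rotation taking r1 to r2, from trace(r1^T r2)."""
    trace = sum(r1[k][i] * r2[k][i] for i in range(3) for k in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def extend_seed(s, rot, i, j, penalty=lambda angle: angle):
    """Aggregate S-distances along the diagonal through the seed pair (i, j).

    s[a][b]   : S-distance between GSSU a of R^A and GSSU b of R^B
    rot[a][b] : 3x3 rotation matrix of their best superposition
    penalty   : hypothetical stand-in for the paper's penalty function
    """
    total = s[i][j]
    for step in (1, -1):  # towards higher indices, then towards lower ones
        a, b, prev = i + step, j + step, rot[i][j]
        while 0 <= a < len(s) and 0 <= b < len(s[0]):
            total += s[a][b] + penalty(rotation_angle(prev, rot[a][b]))
            prev = rot[a][b]
            a, b = a + step, b + step
    return total


# Usage: two GSSUs per structure; the second pair needs a 90-degree turn about z.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
RZ = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(extend_seed([[1.0, 9.0], [9.0, 2.0]], [[I3, I3], [I3, RZ]], 0, 0))
```

Running this for each of the κ seeds and keeping the smallest total would correspond to step 3 of the list above.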
2.4
Early termination
The nearest neighbor search, which is part of the S-distance computation,
has O(n^2) time complexity with respect to the GSSU’s length n. In
addition, the search has to be performed for each of the candidate alignments,
noticeably decreasing the efficiency of SETTER. To increase the algorithm’s
speed, a simple early termination condition was implemented in
SETTER’s GSSU comparison process. Alignments that are unlikely to be
part of the optimal superposition are identified, and for these alignments
shows that the S-distance follows a right-skewed distribution. Right-skewed
distributions are commonly fitted with one of the following families: the
Weibull distribution, the gamma distribution and its special case the chi-square distribution, the log-normal distribution, the power log-normal
distribution and the Gumbel extreme value distribution. To show how well
the data follow the fitted distribution, visual inspection of quantile-quantile plots
(QQ-plots) can be used, or the fit can be tested by the two-sample Kolmogorov-Smirnov test. In the case of the S-distance, the statistical significance of a given
alignment can be derived using the cumulative distribution function of the
log-normal distribution. The probability density function ρ(x) of the log-normal
distribution of a random variable x is given as
[Figure 3: panels a), b), and c) showing structure $R^A$ with GSSUs $G_1^A$ to $G_5^A$ and structure $R^B$ with GSSUs $G_1^B$ to $G_4^B$.]
Fig. 3: Multiple GSSU structure comparison works in four steps.
The figure represents a situation where only the optimal GSSU pair
is considered (κ = 1), and the direction of the alignment is given
by the order of GSSUs. a) A schematic representation of two RNA
structures $R^A$ and $R^B$ with 5 and 4 GSSUs, respectively. The most
similar GSSU pair is $\{G_3^A, G_2^B\}$ (shaded rectangle). Structures $R^A$
and $R^B$ are aligned based on the rotation and translation of this
pair. The superposition needed to optimally align $G_4^A$ and $G_3^B$ was
obtained during the all-to-all GSSU pairwise comparison stage, and
the rotation angle needed to get from the state $\{G_3^A, G_2^B\}$ to $\{G_4^A, G_3^B\}$
(first down arrow in the figure) is thus known. The current
state is changed to $\{G_4^A, G_3^B\}$, and the process is repeated for
the pair $\{G_5^A, G_4^B\}$ (second down arrow in the figure). Similarly,
the algorithm must also process in the opposite direction from the
position of $G_2^A$ and $G_1^B$ (up arrow in the figure). c) The rotation
angles $\xi_i$ from the previous step are used in the penalty function
π(), which represents a weight function for the GSSU distances. To
get the final S-distance, the sum of weighted GSSU S-distances is
normalized by the difference in number (β) of the GSSUs in the $R^A$
and $R^B$ structures.
the nearest neighbor search is skipped. These alignments will very likely
have a triplet S-distance higher than the lowest GSSU distance obtained
up to that time. Specifically, S of the triplet-based GSSUs will probably be
lower than S of the GSSUs they come from. If a triplet $T^A \subset G^A$ is aligned
with a triplet $T^B \subset G^B$, with $S(G^A, G^B) = \chi$ being the best result so far,
the comparison computation can be terminated if $S(T^A, T^B) \times 1/\lambda > \chi$.
Since the early termination is a heuristic ($S(T^A, T^B) < S(G^A, G^B)$ does
not have to hold), the early termination condition is strengthened by
introducing the parameter λ ≥ 1. By varying the λ parameter, the trade-off
between accuracy and speed can be set. The higher λ is, the less often
the early termination occurs, and the more accurate and slower the algorithm
is.
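The early termination test amounts to a guard placed in front of the expensive full evaluation. The sketch below illustrates only that control flow: `triplet_s` and `full_s` are hypothetical callbacks standing for the triplet S-distance and the full GSSU S-distance of a candidate alignment, and the toy candidates are invented for the example.

```python
def best_s_distance(candidates, triplet_s, full_s, lam=1.0):
    """Return the lowest full S-distance, skipping candidates by early termination."""
    best = float("inf")  # chi: best full S-distance found so far
    for cand in candidates:
        if triplet_s(cand) / lam > best:
            continue  # early termination: skip the nearest neighbor search
        best = min(best, full_s(cand))
    return best


# Toy candidates given as (triplet S-distance, full S-distance).
cands = [(1.0, 5.0), (6.0, 2.0), (3.0, 4.0)]
strict = best_s_distance(cands, lambda c: c[0], lambda c: c[1], lam=1.0)
relaxed = best_s_distance(cands, lambda c: c[0], lambda c: c[1], lam=2.0)
print(strict, relaxed)  # 4.0 2.0
```

In this toy run, λ = 1 skips the second candidate although its full S-distance is the lowest, while λ = 2 evaluates it; this is the accuracy versus speed trade-off described above.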
2.5
Statistical Significance of Structural Alignment
The statistical significance of a particular alignment can be derived using
the cumulative distribution function of the underlying distribution. Fig. 4
$$
\rho(x) = \frac{1}{x \sqrt{2 \pi \sigma^2}} \, e^{-\frac{(\ln x - \mu)^2}{2 \sigma^2}}
$$
where the parameters µ and σ are the mean and standard deviation, respectively,
of the variable’s natural logarithm, which is, by definition, normally distributed.
On a non-logarithmized scale, µ is called the location parameter and σ the
scale parameter. These parameters must be determined, and once they are
known, they can be used to derive the statistical significance of a particular
alignment, given as its p-value. The p-value corresponds to the probability $P(X \le x)$
that the variable X takes a value lower than or equal to x:
$$
P(X \le x) = \frac{1}{2} + \frac{1}{2} \, \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2 \sigma^2}} \right)
$$
where erf(x) is the error function defined as
$$
\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt
$$
The error function can be approximated (Abramowitz and Stegun, 1970) as
$$
\mathrm{erf}(x) \approx 1 - (a_1 t + a_2 t^2 + \ldots + a_5 t^5) \, e^{-x^2}, \quad t = \frac{1}{1 + p \times x}
$$
where p = 0.3275911, a1 = 0.254829592, a2 = −0.284496736, a3 =
1.421413741, a4 = −1.453152027 and a5 = 1.061405429. The smaller
the p-value, the more statistically significant the S-distance is.
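The approximation and the resulting p-value are straightforward to evaluate. The sketch below uses the constants quoted above; the function names are chosen here, and extending the approximation to negative arguments through the odd symmetry erf(−x) = −erf(x) is an addition of this sketch, since the approximation itself is stated for x ≥ 0.

```python
import math

P = 0.3275911
A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)


def erf_approx(x):
    """Polynomial approximation of erf(x), extended to x < 0 by odd symmetry."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + P * x)
    poly = sum(a * t ** (k + 1) for k, a in enumerate(A))
    return sign * (1.0 - poly * math.exp(-x * x))


def lognormal_p_value(s, mu, sigma):
    """P(X <= s) for a log-normal variable with parameters mu and sigma."""
    z = (math.log(s) - mu) / math.sqrt(2.0 * sigma ** 2)
    return 0.5 + 0.5 * erf_approx(z)


# Compare against the standard library on a small grid.
worst = max(abs(erf_approx(k / 10) - math.erf(k / 10)) for k in range(-30, 31))
print(worst < 2e-7)
```

The values of `mu` and `sigma` passed to `lognormal_p_value` are the length-dependent parameters whose determination is described next.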
For the determination of the parameters µ and σ, a set of structure
pairs unrelated according to their sequence similarity (a sequence similarity threshold of
80% was used) was prepared. Because the µ and σ parameters depend
on the length N of the shorter structure in the structure pair, they must
be determined for different lengths separately. For each length, a dataset
containing 50,000 structure pairs was generated by randomly cutting
regions of the given length from structures longer than N. Structure pair
datasets of lengths 6, 9, 10, 12, 15, 18, 20, 21, 25, 30, 35, . . . 300 residues
were prepared. The S-distance was determined for each alignment in the
given structure pair dataset, and the parameters µ and σ of the log-normal
distribution were found by maximum likelihood fitting. The distribution
was fitted using only the lower half of the data (i.e. data lying below their
median). Removing data with high S-distances does not compromise the
quality of the calculations because we are generally not interested in highly
dissimilar structures. In addition, excluding dissimilar structures smooths
the length dependence of µ and σ because of a significant noise
reduction. All statistical calculations were performed using the R system
version 2.13.1 (R Development Core Team, 2011) with the package MASS
(version 7.3-14).
2.6
Effectiveness Evaluation
RNA motifs have been classified according to function, 3D structure
or tertiary interaction in the SCOR (Structural Classification of RNA)
database (Klosterman et al., 2002; Tamura et al., 2004). The SCOR
classification system is based on a structure called a directed acyclic graph
to reflect the fact that RNA structural elements can have several distinct
features and may belong to multiple classes.
SETTER was evaluated and compared to other RNA structural alignment
algorithms by using three datasets based on functional classification
3
RESULTS AND DISCUSSION
3.1
Statistical Parameters Evaluation
QQ-plots show that the data follow the fitted log-normal distribution
very closely except for the region of high S-distances (see Fig. 4).
However, the poor fit in this region will not seriously influence the
results of database searching, as we are generally not interested
in highly dissimilar structures. The quality of the fit was further
confirmed by the Kolmogorov-Smirnov test, which provided p-values
close to zero for all sequence lengths. The location and scale
parameters (µ and σ) of the log-normal distribution depend on the
length of the sequence N (Capriotti and Marti-Renom, 2008) in a
non-trivial way (see Fig. 5). The location parameter µ can be fitted by
one polynomial curve
$$
\mu = a + b \times N + c \times N^2 \tag{2}
$$
where $a = -4.8779 \times 10^{-1}$, $b = 1.8993 \times 10^{-2}$, and $c = -3.2770 \times 10^{-5}$.
On the other hand, the scale parameter σ decreases
with increasing length in the region of lower lengths (up to 18
residues), and increases with length for longer sequences. Thus, the
dependence of the σ parameter on the sequence length N was fitted
separately for these two regions, each fit following a power law.
obtained from the SCOR, version 2.0.3: FSCOR, T-FSCOR, and R-FSCOR (Capriotti and Marti-Renom, 2008). The FSCOR contains all
RNA chains with more than three nucleotides with a unique functional
classification, the R-FSCOR is a structurally dissimilar subset of the
FSCOR, and the T-FSCOR set contains structures from the FSCOR set
not present in the R-FSCOR set. Using these datasets, RNA similarity
methods were evaluated by their ability to correctly assign the functional
(i.e. SCOR) classification to the query RNA structure by comparison with a
database of classified RNA structures. Two RNA structures can be either
functionally identical (the so-called exact classification, i.e., they have
the same deepest SCOR classification) or functionally similar (the so-called similar classification, i.e., they do not agree at the deepest level but
share classification at the parent level). Specifically, two experiments were
performed: a leave-one-out test on the FSCOR dataset and a test assigning
functions to structures from the T-FSCOR with the R-FSCOR serving as the
database set.
Two quality measures were utilized to compare various classification
algorithms: classification accuracy (ACC) and area under the ROC curve
(AUC). ACC is defined as the ratio of correctly classified structures to
the total number of queries. The ROC curve is obtained by identifying
the most similar database structure (nearest neighbor) and its p-value for
each query. The nearest neighbors for all queries are sorted according to
their p-values, and the p-value threshold is varied from the most similar
(with the lowest p-value) to the most dissimilar pair. For each threshold,
the number of true positives (TP) is calculated as the number of structure
pairs with identical/similar function lying above this threshold. The rest of the
structures lying above the threshold then represent false positives (FP).
If P (positives) is the number of pairs with identical/similar function in
the whole result set and N (negatives) the number of pairs with different
functions, then FP/N is called the false positive ratio (FPR) and TP/P the true
positive ratio (TPR). The ROC curve is given as TPR (y-axis) against
FPR (x-axis) generated by varying the p-value threshold. The area under
the ROC curve (AUC) can then be calculated. A high AUC means that correct
classifications are present mostly at the top of the list of nearest neighbors
sorted by their p-values. It must also be stressed that both ACC and AUC
should be reported to objectively compare classification methods, because
with a high ACC it is more difficult to get a high AUC (see Supplementary
Information).
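The construction of the ROC curve and its area can be sketched as follows. The function name and the input format (one p-value and one correct/incorrect flag per query) are assumptions of this illustration; the trapezoidal rule is used for the area, both classes are assumed to be present, and queries with tied p-values are ordered with the incorrect ones first, which is a pessimistic choice made here.

```python
def roc_auc(p_values, is_correct):
    """Return the ROC points (FPR, TPR) and the area under the curve."""
    ranked = sorted(zip(p_values, is_correct))  # most significant first
    positives = sum(1 for ok in is_correct if ok)
    negatives = len(ranked) - positives         # both assumed non-zero
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, ok in ranked:                        # lower the threshold one query at a time
        if ok:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    auc = sum((x1 - x0) * (y0 + y1) / 2.0
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc


# Four queries: correct hits at ranks 1 and 3.
points, auc = roc_auc([0.01, 0.02, 0.03, 0.04], [True, False, True, False])
print(auc)  # 0.75
```

An AUC of 1.0 is obtained only when every correctly classified query has a lower p-value than every incorrectly classified one.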
[Figure 5: two scatter plots with fitted curves, location parameter µ versus length (left) and scale parameter σ versus length (right), for lengths from 0 to 300.]
Fig. 5: Length dependence of log-normal parameters. While the
location parameter µ was fitted by one curve (left plot), two curves
were needed for the scale parameter σ (right plot).
$$
\sigma = a \times N^b \tag{3}
$$
For sequence lengths up to 18, the a and b parameters are equal to
$a = 1.0239$ and $b = -3.1648 \times 10^{-1}$, and for longer sequences
$a = 2.0280 \times 10^{-1}$ and $b = -2.3256 \times 10^{-1}$. The relations
(2) and (3) provide a simple way to calculate the µ and σ parameters for
any sequence length.
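Relations (2) and (3) translate directly into code. In the sketch below, the coefficients are transcribed exactly as printed above; with a negative exponent the second power law decreases with N, so the sign of that exponent may need to be checked against the published article. The function names are chosen here for illustration.

```python
import math


def mu_of_length(n):
    """Location parameter from relation (2)."""
    a, b, c = -4.8779e-1, 1.8993e-2, -3.2770e-5
    return a + b * n + c * n ** 2


def sigma_of_length(n):
    """Scale parameter from relation (3), with separate fits up to and above 18 residues."""
    if n <= 18:
        a, b = 1.0239, -3.1648e-1
    else:
        a, b = 2.0280e-1, -2.3256e-1  # as printed; see the note above
    return a * n ** b


def p_value(s_distance, n):
    """Log-normal p-value of an S-distance for a shorter structure of n residues."""
    mu, sigma = mu_of_length(n), sigma_of_length(n)
    z = (math.log(s_distance) - mu) / math.sqrt(2.0 * sigma ** 2)
    return 0.5 + 0.5 * math.erf(z)


print(mu_of_length(100), sigma_of_length(10))
```

Here `math.erf` from the standard library is used in place of the polynomial approximation.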
3.2
Effectiveness Comparison
SETTER was compared with the two most effective approaches, SARA
and iPARTS, using published AUC (SARA, iPARTS) and ACC
(SARA only) values on the FSCOR and T/R-FSCOR datasets.
The following settings were used in SETTER: α = 0.2, β = 2, ε = 4 Å,
κ = 3, and λ = 1 (see sections 2.2, 2.3 and 2.4 for details).
The results are summarized in Table 1. Three sets of results
with varying p-value threshold are presented for SETTER. The
percentage of classified structures for the given p-value threshold
characterizes each of these sets and is called coverage. $STR_{pV=1}$
corresponds to the classification where the structures are sorted
according to their p-values but no threshold is applied (the FSCOR
coverage equals 100%). For $STR_{pV=0.01}$, the query is classified
only if the p-value of the S-distance between the query and the
closest structure is less than 1%. In this case, 63 structures were not
classified, corresponding to an FSCOR coverage of 85%. Similarly,
$STR_{pV=4.4E-07}$ corresponds to an FSCOR coverage of
58.7%, the level at which SARA was evaluated (Capriotti and Marti-Renom, 2009). However, while the SARA coverage of the T/R-FSCOR
dataset equals 56%, SETTER’s coverage at this p-value increases
to 64.3%. Another issue influencing the accuracy is the effect of
score (p-value in SETTER) rounding. SARA rounds its score
to two digits, which always leads to an artificial increase in ACC, and
sometimes may lead to an artificial increase in AUC as well. The reason
is that correctly classified structures with the same score are used
preferentially over the wrongly classified structures with the same
score. However, SETTER rounds to more digits and is much less
susceptible to this behavior.
Since iPARTS does not employ any filtering, SETTER and
iPARTS were compared at the same coverage 100%, i.e., no filtering
by p-value was applied (p-value = 1). Results in Table 1 show, that
SETTER outperforms iPARTS in AUC for both FSCOR and T/RFSCOR datasets, respectively. To compare the accuracy of SETTER
and SARA, same level of coverage should be utilized. SARA’s
coverage of 58.7% for the FSCOR dataset corresponds to the pvalue of 4.4 × 10−7 in SETTER. At this siginificance level SARA’s
[Fig. 4 plots. Upper row: histograms with panel titles Length = 10, Length = 100, Length = 200 and Length = 300 (axes: Distance, Frequency). Lower row: Q-Q plots (axes: Theoretical quantiles, Data quantiles).]
Fig. 4: The histograms of frequencies of S-distances were fitted by a log-normal distribution (upper plots, the curves). Each sequence
length must be fitted separately (histograms for only four sequence lengths are shown). The quality of the fit can be inspected visually in the
quantile-quantile (Q-Q) plots (lower plots). If the distribution of the data is similar to the fitted log-normal distribution, the points in the Q-Q
plot lie on a straight line.
coverage on the T/R-FSCOR dataset equals 56%. Using the
p-value threshold of 4.4 × 10^−7, SETTER performs better than SARA
in terms of AUC for both the FSCOR and T/R-FSCOR datasets (see
Table 1). For iPARTS, ACC was not reported, and the necessary tests
cannot be performed using the iPARTS web interface. However,
SETTER can be compared with SARA, for which ACC values
are available. The rather high AUC of SETTER should be reflected in
a lowered ACC. Still, SETTER's ACC is higher than that of SARA
in exact classification for both the FSCOR (by 3.6 percentage points)
and T/R-FSCOR (by 7.6 percentage points) datasets, but lower in
similar classification, by 4.2 percentage points for the FSCOR and by
6.8 percentage points for the T/R-FSCOR dataset. However, when
comparing SETTER with SARA on the T/R-FSCOR dataset it must
be taken into account that at the p-value threshold of
4.4 × 10^−7 SARA's coverage is 56% while SETTER's coverage
is 64.3%. Reducing SETTER's higher coverage to the 56% of
SARA would increase SETTER's accuracy. Thus, it can
be concluded that SETTER performs better than iPARTS and SARA
in terms of AUC and is comparable with SARA in terms of ACC.
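The p-value thresholds discussed above rely on the log-normal fits of the S-distance distributions (Fig. 4). The following minimal sketch shows how an S-distance could be converted into a p-value once such a fit is available. It assumes that the p-value is the lower-tail probability of the fitted distribution (small distances indicate similar structures); this definition, the function names and the parameters `mu` and `sigma` are illustrative assumptions, not values or code from SETTER.

```python
import math


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a log-normal distribution with log-scale parameters mu and sigma."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def s_distance_p_value(distance: float, mu: float, sigma: float) -> float:
    """Assumed p-value: probability of a distance this small or smaller
    under the fitted background distribution."""
    return lognormal_cdf(distance, mu, sigma)


# Hypothetical fit for one sequence length: median distance 20, sigma 0.5.
mu, sigma = math.log(20.0), 0.5

for d in (5.0, 10.0, 20.0):
    print(f"S-distance {d:5.1f} -> p-value {s_distance_p_value(d, mu, sigma):.4f}")
```

Because each sequence length is fitted separately, a real implementation would need one (mu, sigma) pair per length; the single pair used here is only a placeholder.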
Table 1. ACC and AUC comparison of iPARTS, SARA and SETTER on the
FSCOR and T/R-FSCOR datasets. The values are given in % and are
reported for exact/similar classification. ROC curves from which AUC was
calculated are shown in Supplementary Information.

                   FSCOR                   T/R-FSCOR
Method             AUC      ACC            AUC      ACC
iPARTS             72/92    ?              77/90    ?
SARA               61/83    81.4/95.3      58/85    78.0/94.5
STRpV=1            85/93    57.8/67.5      93/93    67.4/70.5
STRpV=0.01         90/95    69.1/75.3      90/90    77.0/80.1
STRpV=4.4E−07      83/91    85.0/91.1      85/85    85.6/87.7

3.3
Speed Comparison
Experiments measuring the running times of SARA, SETTER and
iPARTS were also performed. The runtime of SETTER was
measured on a Linux machine with 4 Intel(R) Xeon(R) E7540 CPUs at
2 GHz (the algorithm is not parallelized, thus only one core per
comparison was utilized) and 132 GB of RAM (however, the
average memory size needed to store the representations of all RNA
structures from the FSCOR set was less than 3.3 MB) running
Red Hat Linux. Because neither SARA nor iPARTS offers a
standalone version for download, their runtimes were obtained
from their web interfaces. Since the benchmarked algorithms were
therefore not run on the same hardware, the comparison is only qualitative.
However, the differences between SETTER and SARA/iPARTS are
substantial (see Table 2) and cannot be attributed to the different
hardware setup alone.
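The size of these differences can be made explicit by dividing the runtimes reported in Table 2 by the SETTER runtime on the same dataset. The sketch below does this arithmetic; the dictionary layout and the function name `speedup` are illustrative choices, and entries reported as "?" are stored as `None`.

```python
# All-to-all runtimes in seconds, copied from Table 2 ("?" entries become None).
runtimes = {
    "D1": {"iPARTS": 1.1, "SARA": 1.7, "SETTER": 0.3},
    "D2": {"iPARTS": 2.6, "SARA": 9.2, "SETTER": 1.2},
    "D3": {"iPARTS": 17.0, "SARA": None, "SETTER": 3.4},
    "D4": {"iPARTS": 2.8 * 60, "SARA": None, "SETTER": 19.7},
    "D5": {"iPARTS": None, "SARA": None, "SETTER": 57.2},
}


def speedup(dataset, tool):
    """Runtime of `tool` divided by the SETTER runtime, or None if unavailable."""
    other = runtimes[dataset][tool]
    if other is None:
        return None
    return other / runtimes[dataset]["SETTER"]


for name in runtimes:
    for tool in ("iPARTS", "SARA"):
        ratio = speedup(name, tool)
        text = "n/a" if ratio is None else f"{ratio:.1f}x"
        print(f"{name}: {tool} / SETTER = {text}")
```

Where both values are available, the ratios range from roughly 2x to 9x. As stated above, the tools were run on different hardware, so these ratios are indicative only.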
Table 2. Runtime comparison of iPARTS, SARA and SETTER for
datasets of RNA structures of various sizes. The D1 set contains tRNA
structures 1EHZ:A, 1H3E:B, 1I9V:A, 2TRA:A and 1YFG:A (average length
76 nucleotides), the D2 set contains ribozyme P4-P6 domain 1GID:A, 1HR2:A
and 1L8V:A (average length 157 nucleotides), D3 contains domain V of 23S
rRNA 1FFZ:A and 1FG0:A (average length 496 nucleotides), D4 contains
16S rRNA 1J5E:A and 2AVY:A (average length 1522 nucleotides), and D5
contains the currently largest RNA structures in PDB, yeast 25S rRNA
3O58:1 and 3O5H:1 (average length 3396 nucleotides).

Data set    iPARTS     SARA     SETTER
D1          1.1 s      1.7 s    0.3 s
D2          2.6 s      9.2 s    1.2 s
D3          17.0 s     ?        3.4 s
D4          2.8 min    ?        19.7 s
D5          ?          ?        57.2 s

Table 2 shows the times of all-to-all comparisons on five datasets
(D1 to D5) containing RNA structures of various sizes. While SARA can
align only structures with no more than 1000 residues, and iPARTS
only structures with no more than 1900 residues, there are no
size limits in SETTER. iPARTS uses a dynamic programming
technique for the alignment, and thus its running time grows quadratically
with increasing structure size. SETTER's nearest neighbor
identification procedure scales as O(n^2) with the size of the GSSU
(not with the size of the structure!), and the heuristic
speed optimization with λ = 1 further reduces the number of
O(n^2) computations. Comparison with SARA is difficult because
for several of the data sets the server times out and returns no results.
However, from the runtime experiments it can be concluded that
SETTER is much better suited for the alignment of even the largest
RNA structures.
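The difference between the two scaling regimes can be illustrated with a toy operation count. This is not SETTER's or iPARTS's actual cost function: it assumes two structures of equal length n, a fixed GSSU size, and a constant number of O(g^2) comparisons per GSSU. The function names, the GSSU size of 20 and the parameter `comparisons_per_gssu` are all hypothetical.

```python
def dp_cells(n: int) -> int:
    """Cells of a dynamic-programming table aligning two structures of n residues."""
    return n * n


def gssu_operations(n: int, gssu_size: int, comparisons_per_gssu: int = 1) -> int:
    """Toy count: the structure is cut into GSSUs of a fixed size, and each
    GSSU takes part in a constant number of quadratic-cost comparisons
    (an assumption made for this illustration only)."""
    n_gssus = max(1, n // gssu_size)
    return n_gssus * comparisons_per_gssu * gssu_size ** 2


for n in (100, 200, 400, 800):
    print(f"n = {n:3d}: DP cells = {dp_cells(n):6d}, "
          f"GSSU operations = {gssu_operations(n, gssu_size=20):5d}")
```

Under these assumptions, doubling n from 100 to 200 quadruples the dynamic-programming count (10000 to 40000) but only doubles the GSSU-based count (2000 to 4000), because the quadratic term depends on the GSSU size and not on the structure size.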
4
CONCLUSIONS
A fast and accurate algorithm for RNA structural comparison,
SETTER, has been developed and tested on a large benchmark data
set. The efficiency of the algorithm stems from the decomposition
of the RNA structure into a set of generalized secondary structure
units (GSSUs). Such a segmentation offers good scalability with
respect to the structure size (SETTER scales linearly with the
structure size) because the number of residues in a GSSU (SETTER
scales quadratically with the GSSU size) generally does not increase
with increasing size of the RNA structure. The efficiency of the
method is clearly demonstrated by the comparison with the iPARTS and
SARA approaches. SETTER is capable of aligning even
the largest RNA structures deposited in the PDB database in a
reasonable amount of time (less than 1 minute), and thus represents
an important addition to the portfolio of automatic RNA structural
analysis tools.
ACKNOWLEDGEMENT
This work was supported by the Czech Science Foundation (GAČR)
project Nr. 201/09/0683 and by the Ministry of Education of the
Czech Republic MSM6046137302.
REFERENCES
Abramowitz, M. and Stegun, I. A. (1970). Handbook of Mathematical Functions.
Dover, New York.
Arnez, J. G. and Steitz, T. A. (1994). Crystal structure of unmodified tRNA(Gln)
complexed with glutaminyl-tRNA synthetase and ATP suggests a possible role
for pseudo-uridines in stabilization of RNA structure. Biochemistry, 33(24),
7560–7567. PMID: 8011621.
Ban, N., Nissen, P., Hansen, J., Moore, P. B., and Steitz, T. A. (2000). The complete
atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science (New
York, N.Y.), 289(5481), 905–920. PMID: 10937989.
Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function.
Cell, 116, 281–297.
Camproux, A. C., Gautier, R., and Tufféry, P. (2004). A Hidden Markov Model Derived
Structural Alphabet for Proteins. Journal of Molecular Biology, 339(3), 591–605.
Capriotti, E. and Marti-Renom, M. A. (2008). RNA structure alignment by a unit-vector
approach. Bioinformatics, 24, i112–i118.
Capriotti, E. and Marti-Renom, M. A. (2009). SARA: a server for function annotation of
RNA structures. Nucl. Acids Res., page gkp433.
Chang, Y.-F., Huang, Y.-L., and Lu, C. L. (2008). SARSA: a web tool for structural
alignment of RNA using a structural alphabet. Nucleic Acids Res, 36(Web Server issue), 19–24.
Chew, L. P., Huttenlocher, D., Kedem, K., and Kleinberg, J. (1999). Fast detection of
common geometric substructure in proteins. Journal of computational biology : a
journal of computational molecular cell biology, 6(3-4), 313–325.
de Brevern, A. G., Etchebest, C., and Hazout, S. (2000). Bayesian probabilistic
approach for predicting backbone structures in terms of protein blocks. Proteins,
41(3), 271–287.
Dror, O., Nussinov, R., and Wolfson, H. (2005). ARTS: alignment of RNA tertiary
structures. Bioinformatics, 21 Suppl 2.
Dror, O., Nussinov, R., and Wolfson, H. J. (2006). The ARTS web server for aligning
RNA tertiary structures. Nucleic Acids Res, 34(Web Server issue).
Eddy, S. R. (2001). Noncoding RNA genes and the modern RNA world. Nature
Reviews Genetics, 2(12), 919–929.
Ferrè, F., Ponty, Y., Lorenz, W. A., and Clote, P. (2007). DIAL: a web server for
the pairwise alignment of two RNA three-dimensional structures using nucleotide,
dihedral angle and base-pairing similarities. Nucleic Acids Res, 35(Web Server issue), 659–668.
Garman, E. (2003). ’Cool’ crystals: macromolecular cryocrystallography and radiation
damage. Current Opinion in Structural Biology, 13(5), 545–551. PMID: 14568608.
Hendrix, D. K., Brenner, S. E., and Holbrook, S. R. (2005). RNA structural motifs:
building blocks of a modular biomolecule. Quarterly reviews of biophysics, 38(3),
221–243.
Holbrook, S. R. (2008). Structural principles from large RNAs. Annual review of
biophysics, 37(1), 445–464.
Holbrook, S. R., Cheong, C., Tinoco, I., Jr., and Kim, S. H. (1991). Crystal structure of
an RNA double helix incorporating a track of non-Watson-Crick base pairs. Nature,
353(6344), 579–581. PMID: 1922368.
Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta
Crystallographica Section A, 32(5), 922–923.
Kelley, R. L. and Kuroda, M. I. (2000). Noncoding RNA genes in dosage compensation
and imprinting. Cell, 103(1), 9–12.
Kim, R., Holbrook, E. L., Jancarik, J., and Kim, S. H. (1995). Synthesis and purification
of milligram quantities of short RNA transcripts. BioTechniques, 18(6), 992–994.
PMID: 7546725.
Kim, S. H., Suddath, F. L., Quigley, G. J., McPherson, A., Sussman, J. L., Wang, A. H.,
Seeman, N. C., and Rich, A. (1974). Three-dimensional tertiary structure of yeast
phenylalanine transfer RNA. Science (New York, N.Y.), 185(149), 435–440. PMID:
4601792.
Klosterman, P. S., Tamura, M., Holbrook, S. R., and Brenner, S. E. (2002). SCOR: a
Structural Classification of RNA database. Nucleic Acids Res, 30(1), 392–394.
Kolodny, R. and Linial, N. (2004). Approximate protein structural alignment in
polynomial time. Proc Natl Acad Sci U S A, 101(33), 12201–12206.
Lu, X.-J. and Olson, W. K. (2008). 3DNA: a versatile, integrated software system for the
analysis, rebuilding and visualization of three-dimensional nucleic-acid structures.
Nature Protocols, 3(7), 1213–1227.
R Development Core Team (2011). R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Scott, W. G. (2007). Ribozymes. Current Opinion in Structural Biology, 17(3), 280–
286.
Tamura, M., Hendrix, D. K., Klosterman, P. S., Schimmelman, N. R., Brenner, S. E.,
and Holbrook, S. R. (2004). SCOR: Structural Classification of RNA, version 2.0.
Nucleic Acids Res, 32(Database issue).
Tinoco, I. (1999). How RNA folds. Journal of Molecular Biology, 293(2), 271–281.
Wang, C.-W., Chen, K.-T., and Lu, C. L. (2010). iPARTS: an improved tool of pairwise
alignment of RNA tertiary structures. Nucleic Acids Res, 38 Suppl, W340–7.
Wimberly, B. T., Brodersen, D. E., Clemons, W. M., Jr., Morgan-Warren, R. J., Carter,
A. P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. (2000). Structure of the 30S
ribosomal subunit. Nature, 407(6802), 327–339. PMID: 11014182.