A runtime heuristic for faster probabilistic analysis of RNA abstract shapes
Stefan Janssen∗1
1 Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany
Email: Stefan Janssen - [email protected];
∗ Corresponding author
Abstract
Biologists often deduce the function of a molecule by analysing its structure. Observing RNA structures is nearly impossible, which raises the need for computational predictions. In this research paper we present a runtime heuristic for probabilistic shape analysis, which predicts functional classes of secondary structures for RNA sequences. Thanks to the methodology of algebraic dynamic programming we are able to precisely separate the search space and dynamically compile so-called Thermodynamic Matchers, each of which computes the probability of exactly one shape class. By a clever choice of shape classes for this procedure we achieve a runtime speedup ranging from three-fold for smaller sequences up to 300-fold for sequences with more than 400 nucleotides. This advantage is paid for with an uncertainty about a small part of the complete search space that is no longer explored.
Introduction
In the clockwork of a living cell, single stranded ribonucleic acid (RNA) molecules play a more active role than their close relatives, the double stranded deoxyribonucleic acid (DNA) molecules. Besides its well known function as messenger (mRNA) in the process of protein synthesis, RNA also acts as regulator (OxyS, riboswitches), carries amino acids (tRNA) and even has enzymatic capabilities (ribosome). As for proteins, the function directly depends on the structure of the RNA. Thus biologists can make assumptions on the function or relationship of RNAs by analysing the structure. Observing the real structure of RNA in its natural environment is nearly impossible, which raises the need for computational predictions.

RNA structure prediction
Since 3D structures have too much variety, current prediction algorithms fall back to secondary structure, that is the set of base pairs. Given an RNA sequence $s$ of length $n$ and various thermodynamic energy parameters, previously measured in wet lab studies [1], the RNA folding problem is to find the one secondary structure $x_{MFE}$ that is thermodynamically optimal, i.e. has minimal free energy (MFE), out of the search space $F(s)$ that contains all possible secondary structures for $s$. The dynamic programming methodology allows us to explore the exponentially growing $F(s)$ in polynomial time, thus finding $x_{MFE}$ within $O(n^3)$ time and $O(n^2)$ space. Nebel et al. [2] expect the number of possible candidates in this search space $F(s)$ to be $1.44^n \cdot 3.45 \cdot n^{-3/2}$.
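To get a feeling for how quickly the search space grows, the estimate attributed to Nebel et al. [2] can be evaluated directly. The following is a minimal sketch; the function name is hypothetical, and the computation is done in log space only to keep the numbers printable for long sequences.

```python
import math


def log10_expected_structures(n: int) -> float:
    """log10 of the estimate 1.44^n * 3.45 * n^(-3/2) for a sequence of length n."""
    if n < 1:
        raise ValueError("sequence length must be positive")
    return n * math.log10(1.44) + math.log10(3.45) - 1.5 * math.log10(n)


# Order of magnitude of the search space for a few sequence lengths.
for n in (50, 100, 200, 400):
    print(f"n = {n:3d}: about 10^{log10_expected_structures(n):.1f} candidate structures")
```

Even for moderate sequence lengths the estimate is far beyond what explicit enumeration could handle, which is why the polynomial-time dynamic programming approach is essential.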
Algorithms for optimisation problems that depend on estimated scoring schemes should always consider suboptimal solutions. Small inaccuracies in the scores for different candidates in the search space lead to uncertainties about the choice of the optimal candidate. For RNA folding it is likely that the assigned energy for the true biological secondary structure is slightly worse than for $x_{MFE}$. So it is worthwhile to also take a look at suboptimal candidates, whose energies are within a range $\beta$ of the predicted minimal free energy (MFE). Suboptimal RNA folding [3] leads to an exponential runtime, because a particular part $F_\beta(s)$ of the exponential $F(s)$ must be considered. Unfortunately, many suboptimal candidates are nearly the same. This enforces a larger deviation from the optimal score to gather candidates with crucial differences, which results in higher runtimes and much larger outputs.

Abstract shape analysis
Abstract shape analysis attacks the unmanageable output of calculating suboptimal secondary structures by applying abstraction and classification. A shape function $\pi(x)$ for a secondary structure $x$ abstracts from the position and the length of helical regions and neglects unpaired loop regions.¹ Thus, the focus is on the specific arrangement of helices. Here $x_\pi$ is the shape string, representing the result of the abstraction. All secondary structures with the same shape string build one shape class $h$, which is a well-defined subset of $F(s)$. For example, both (((...)))..((....)) and ..((...))((...)).... map to the same shape string [][], which denotes two adjacent hairpins. Within a shape class $h$, the secondary structure with the best free energy becomes the shape representative $h_{shrep}$, or shrep for short. In comparison to the list of suboptimals, the list of all shreps is much shorter and the differences between two shreps are of more importance. By amalgamating shape abstraction and enumeration of the search space into the same dynamic recurrences, the program RNAshapes [4] is able to decrease the exponential factor from $1.44^n$ to $1.1^n$, because there are fewer shape classes than secondary structures.

¹ There are five different levels of abstraction available, two of which also account for unpaired loop regions. For simplicity, we deal here only with the strongest abstraction level without unpaired loop regions.

Complete probabilistic shape analysis in exponential time
The state of an RNA molecule can be seen as a Boltzmann ensemble of structures. The challenge of probabilistic shape analysis [5] is to determine whether there is some shape class of structures in this ensemble that is internally similar, distinct from the rest, and collectively dominates the probabilities of all other shape classes. The dominating shape class should be the biologically relevant one. For example, shape probabilities have been used to assign significance levels to predicted miRNA precursors [6, 7].

We use the partition function to calculate probabilities for secondary structures. In general, the partition function provides a measure of the total number of states (here secondary structures) weighted by their individual energies at a particular temperature. The partition function of the complete search space $F(s)$ is defined as $Q_s = \sum_{j \in F(s)} e^{-E_j / RT}$, where $E_j$ is the energy of secondary structure $j$, $R$ the universal gas constant (0.00198717 kcal/(mol K)) and $T$ the temperature in Kelvin. The probability of a particular secondary structure $x \in F(s)$ is defined as $\mathrm{Prob}(x) = e^{-E_x / RT} / Q_s$. The probability $\mathrm{Prob}(h_i)$ of a shape class $h_i$ is the sum of the probabilities of its class members, $\mathrm{Prob}(h_i) = \sum_{x \in h_i} \mathrm{Prob}(x)$.

By moving the division by $Q_s$ to the end of the calculation, the runtime of the probabilistic shape analysis remains $O(n^3 \cdot 1.1^n)$. In contrast to the simple shape analysis, where $\beta$ restricts the exponential number of different shape classes, every shape class in $F(s)$ has to be inspected and thus enumerated, resulting in a considerably higher runtime. For example, complete probabilistic shape analysis for a sequence of length 400 requires 12 hours on today's hardware, while abstract shape analysis takes only 38 seconds.

Figure 1: Evaluation of our runtime heuristic. We used approximately 200 randomly generated sequences with a step width of five bases. The blue curve depicts the number of shapes in $P(s)$. The green curve illustrates the perfect number of shape classes to achieve the given combined probability of $\alpha = 0.9$. The red curve shows the results for $R_{energy}(s)$ as a replacement for the oracle. The lower cyan coloured curve stands for the number of used shapes for $R_{sampling}(100, s)$ and the upper one shows the deviation of the given $\alpha$, coloured in black.
[Plot: $|R(s)|$ (log scale) and the deviation $a(R_{sampling}(100, s)) - 0.9$ against sequence length $n = |s|$; legend: $|P(s)| = 1.1^{|s|}$, $|R_{Orakel}(s)|$, $|R_{energy}(s)|$ and $|R_{sampling}(100, s)|$, each for $\alpha = 0.9$.]

Sufficient probabilistic shape analysis in polynomial time
Despite abstraction, probabilistic shape analysis is still exponential regarding the number of answers and runtime. However, the number of different stable conformations of a biologically relevant RNA should be limited to just a few. In an ideal world there should only be a few shape classes with high
probability and many more that are relatively unlikely. Following this principle, the central idea of our proposed heuristic is to accumulate these few very likely shape classes in a list $R(s)$. Instead of computing probabilities for all shape classes $P(s)$ for a sequence $s$, we only perform the calculation for each member in $R(s)$ within $O(n^3)$ time. This results in an overall runtime of $O(n^3 \cdot |R(s)|)$ in contrast to $O(n^3 \cdot |P(s)|)$, where $|P(s)| \approx 1.1^n$. The runtime saving directly depends on the ratio $|R(s)| / |P(s)|$, i.e. the number of shape classes in $R(s)$.

The procedure leads to two questions: First, how to guess promising shape classes for $R(s)$ in advance? Second, how can we ensure that the independently calculated probabilities are exactly the same as in the probabilistic shape approach?

Exact probabilities via Thermodynamic Matchers
Each shape class $h_i$ of the probabilistic shape analysis is a well-defined part of $F(s)$. All secondary structures $x$ are members of exactly one shape class, and the union of all structures of all shape classes equals the search space of secondary structures for $s$: $\bigcup_{i=1}^{|P(s)|} h_i = F(s)$. To independently compute the probability of just a part of $F(s)$, e.g. a shape class, we have to ensure that such a specialised program contains precisely the same secondary structures as the shape class of the probabilistic shape analysis. Every deviation would lead to an incorrect probability.

Due to the required accuracy and the exponential number of different shape classes, all attempts at manually adapting the dynamic programming recurrences are out of the question. To overcome this obstacle we wrote a program that accepts a single shape string as input and outputs an adapted grammar. This adapted grammar enumerates exactly the secondary structures of the input shape class. Since the RNAshapes program follows the methodology of ADP [8], we can replace its grammar for generating $F(s)$ with our adapted grammar, while the used algebras, containing all the thermodynamic energy parameters, remain unchanged. This procedure allows us to dynamically compile specialised programs for each shape class in $R(s)$. These Thermodynamic Matchers (TDM) calculate the partition function for their shape class in $O(n^3)$ time. Division by the partition function $Q_s$ for $F(s)$, which is computed only once, results in the probability of the shape class.
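The principle of the heuristic can be illustrated on a toy ensemble. The following is only a sketch under stated assumptions, not the RNAshapes or TDM implementation: the ensemble, its energies and all function names are hypothetical, the temperature of 310.15 K is assumed, and a brute-force sum over the listed members stands in for the $O(n^3)$ dynamic programming of a real TDM. It shows how a partition function restricted to one shape class, divided by the once-computed $Q_s$, yields the class probability, and how classes are accumulated until a target combined probability is reached.

```python
import math

R_GAS = 0.00198717  # kcal/(mol K), the value quoted in the text
T = 310.15          # K; assumed temperature (37 degrees Celsius)

# Hypothetical toy ensemble: (shape string, free energy in kcal/mol).
ENSEMBLE = [
    ("[]", -5.2), ("[]", -4.9), ("[]", -3.1),
    ("[][]", -4.0), ("[][]", -2.5),
    ("[[][]]", -1.0),
    ("_", 0.0),
]


def partition_function(structures):
    """Sum of Boltzmann weights e^(-E/RT) over the given structures."""
    return sum(math.exp(-energy / (R_GAS * T)) for _, energy in structures)


def tdm_partition_function(shape):
    """Stand-in for a TDM: partition function restricted to one shape class."""
    return partition_function([x for x in ENSEMBLE if x[0] == shape])


def select_shapes(candidates, alpha):
    """Evaluate candidates in order until their combined probability reaches alpha."""
    q_s = partition_function(ENSEMBLE)  # computed once for the whole search space
    selected, combined = {}, 0.0
    for shape in candidates:
        if combined >= alpha:
            break
        prob = tdm_partition_function(shape) / q_s
        selected[shape] = prob
        combined += prob
    return selected, combined


# Candidate shape classes, ordered by the energy of their best member (shrep).
best = {}
for shape, energy in ENSEMBLE:
    best[shape] = min(energy, best.get(shape, math.inf))
candidates = sorted(best, key=best.get)

selected, combined = select_shapes(candidates, alpha=0.9)
for shape, prob in selected.items():
    print(f"{shape:8s} {prob:.4f}")
print(f"combined probability: {combined:.4f}")
```

In this toy ensemble only some of the four shape classes have to be evaluated before the target is met; the remaining classes stay unexplored, which is exactly the uncertainty the heuristic accepts in exchange for runtime.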
Figure 2: Empirically measured runtimes. The colours correspond to Figure 1. In addition, the magenta coloured curve shows the runtime for the sampling method that is integrated in RNAshapes and used by our runtime heuristic. RNAshapes -p stands for the exact probabilistic shape analysis while RNAshapes -i addresses the sampling mode.
[Plot: runtime in seconds (log scale) against sequence length $n = |s|$, with memory marks at 1 GB and 2 GB; legend: $R_{Orakel}(s)$, $R_{energy}(s)$, RNAshapes -i 100, RNAshapes -p, $R_{sampling}(100, s)$.]
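The sampling mode mentioned in the caption approximates shape probabilities by frequencies. The sketch below illustrates that idea on a hand-written toy ensemble. It is not how RNAshapes samples: the structures, energies, temperature and function names are all hypothetical. The shape function assumes balanced dot-bracket input and assumes that, at the strongest abstraction level, helices separated only by bulges or internal loops are merged into one pair of brackets.

```python
import math
import random
from collections import Counter

R_GAS = 0.00198717  # kcal/(mol K), the value quoted in the text
T = 310.15          # K; assumed temperature


def shape_level5(dotbracket):
    """Map a balanced dot-bracket string to a shape string without unpaired regions."""
    stack, close = [], {}
    for i, char in enumerate(dotbracket):
        if char == "(":
            stack.append(i)
        elif char == ")":
            close[stack.pop()] = i  # opening position -> closing position

    def region(i, j):
        """Shapes of the outermost helices that open inside dotbracket[i:j]."""
        shapes, k = [], i
        while k < j:
            if k in close:
                shapes.append(helix(k))
                k = close[k] + 1
            else:
                k += 1
        return shapes

    def helix(k):
        inner = region(k + 1, close[k])
        if len(inner) == 1:  # stacked pair, bulge or internal loop: merge
            return inner[0]
        return "[" + "".join(inner) + "]"  # hairpin (0) or multiloop (2 or more)

    return "".join(region(0, len(dotbracket))) or "_"


# Hypothetical toy ensemble for one sequence of length 20: (structure, kcal/mol).
ENSEMBLE = [
    ("(((...)))...((....))", -4.0),
    ("..((...))((...))....", -2.5),
    ("((((....))))........", -5.2),
    (".(((..((...))..)))..", -4.9),
    ("((.((...))((...)).))", -1.0),
    ("....................", 0.0),
]


def sample_shape_frequencies(ensemble, n_samples, seed=0):
    """Draw structures by Boltzmann weight and count the shape strings."""
    rng = random.Random(seed)
    weights = [math.exp(-energy / (R_GAS * T)) for _, energy in ensemble]
    drawn = rng.choices(ensemble, weights=weights, k=n_samples)
    counts = Counter(shape_level5(structure) for structure, _ in drawn)
    return {shape: count / n_samples for shape, count in counts.items()}


for shape, freq in sample_shape_frequencies(ENSEMBLE, n_samples=100).items():
    print(f"{shape:8s} {freq:.2f}")
```

The shape strings that occur in the sample are the candidates a heuristic can hand to exact evaluation; the frequencies themselves are only estimates and change with the sample size and the random seed.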
Economic selection of shapes for $R(s)$
Unless $|R(s)| = |P(s)|$, the combined probability $\alpha = a(R(s)) := \sum_{h \in R(s)} \mathrm{Prob}(h)$ is always less than $a(P(s)) = 1$. In the case of equality, the heuristic cannot save any runtime and acts just like RNAshapes. The runtime saving is paid for with an uncertainty about the unexplored parts of the search space $P(s) \setminus R(s)$. The success of the heuristic depends on a clever selection of shapes for $R(s)$ to combine the apparently conflicting goals of minimising the runtime and minimising the gap between $\alpha$ and $a(P(s))$.

We investigated a couple of different methods of guessing shape classes for the list $R(s)$. Here we present two of them.

Shape class guessing: $R_{energy}(s)$
First, we compute a simple shape analysis with an adequate choice for $\beta$. Then we go through this list of shape classes, which are ordered by their free energies. Until $a(R_{energy}(s)) \geq \alpha$, we generate a TDM for each shape class and calculate its probability. In case of too few shape classes in $R_{energy}(s)$ we re-run the simple shape analysis with a larger $\beta$. Results for this method are depicted by the red curve in Figure 1.

Shape class guessing: $R_{sampling}(N, s)$
RNAshapes already has an integrated heuristic for the probabilistic shape analysis. After spanning the search space, $N$ secondary structures are randomly picked. The shape class probabilities are then approximated as frequencies by counting all $N$ shape strings. The runtime no longer depends on $|P(s)|$, but on the constant factor $N$. Thus, this sampling method is much faster than our proposed heuristic, but the results are by no means guaranteed to be exact. To combine both advantages, we use the quickly calculated shape classes from the sampling as members for $R(s)$ and precisely determine their probabilities via TDMs. The cyan coloured curve in Figure 1 documents the results for a sample size of $N = 100$. At a certain point, the combined probability of at most $N$ different shape classes is insufficient to fulfil a given $\alpha$, even if we could guess the shape classes perfectly. We then have to lower $\alpha$ or increase $N$.

Results
To gain a lower bound for the sizes of $R(s)$ under the condition $\alpha = 0.9$ we employed an oracle. For evaluation, this wondrous oracle is able to name the shape classes of $P(s)$ in order of decreasing probability. The green curve in Figure 1 is a smoothed visualisation of such perfectly composed sets $R(s)$ for 200 sequences $s$
of different lengths. The blue straight line shows the exponential growth of $P(s)$. The wide discrepancy between the green and the blue curve gives ample room for practical replacements for the oracle.

All results are given by the two graphs shown in this paper. Figure 1 is a comparison of the sizes of $P(s)$ and differently constructed sets $R(s)$ together with their impacts on $a(R(s))$. For the method $R_{sampling}$ in Figure 1 we decided to fix $N$ to 100. The upper cyan coloured curve illustrates the deviation of $a(R_{sampling}(100, s))$ from $\alpha$. Figure 2 shows our analysis of empirically measured runtimes. The measured runtimes include the overhead of compiling the TDMs and running the sampling method to get $R_{sampling}(N, s)$. Different shape strings result in different TDMs, i.e. different grammar sizes. The larger the grammar, the higher the compile time.

Conclusions
Our presented heuristic decreases the exponential runtime $O(n^3 \cdot |P(s)|)$ for the probabilistic shape analysis to a polynomial runtime of $O(n^3 \cdot N)$. $N$ is the constant number of secondary structures from the sampling method used to guess shapes for $R_{sampling}(N, s)$. In contrast to the sampling method of RNAshapes, our proposed heuristic provides exact probabilities for the shape classes. The tradeoff between speedup and size of the unexplored search space can be set by the user via the choice of $\alpha$ or $N$. The use of TDMs allows the calculation of significantly larger input sequences compared to RNAshapes -p, which exceeds 2 GB of memory for sequence lengths of approximately 400 nucleotides. TDMs are also a good target for parallelisation and thus another source of further speedups. Figure 2 shows that the speedup grows with growing sequence length, ranging from three-fold for $n = 200$ up to 300-fold for $n = 400$. We suggest using our heuristic for sequences longer than 200 nucleotides, where the overhead of compiling the TDMs pays off and achieves a significant runtime saving compared to RNAshapes.

References
1. Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology 1999, 288(5):911-940.
2. Nebel ME, Scheid A: On Abstract Shapes of RNA. Submitted 2008, [http://wwwagak.informatik.uni-kl.de/staff/nebel/www_shapes/www_RNAShapes.pdf].
3. Wuchty S, Fontana W, Hofacker IL, Schuster P: Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 1999, 49(2):145-165.
4. Giegerich R, Voß B, Rehmsmeier M: Abstract shapes of RNA. Nucleic Acids Research 2004, 32(16):4843.
5. Voß B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes. BMC Biol 2006, 4.
6. Berezikov E, van Tetering G, Verheul M, van de Belt J, van Laake L, Vos J, Verloop R, van de Wetering M, Guryev V, Takada S, van Zonneveld AJ, Mano H, Plasterk R, Cuppen E: Many novel mammalian microRNA candidates identified by extensive cloning and RAKE analysis. Genome Res 2006, 16(10):1289-1298.
7. Lu J, Shen Y, Wu Q, Kumar S, He B, Shi S, Carthew RW, Wang SM, Wu CI: The birth and death of microRNA genes in Drosophila. Nat Genet 2008, 40(3):351-355.
8. Giegerich R, Meyer C, Steffen P: A discipline of dynamic programming over sequence data. Science of Computer Programming 2004, 51(3):215-263.