A runtime heuristic for faster probabilistic analysis of RNA abstract shapes

Stefan Janssen*1

1 Faculty of Technology, Bielefeld University, 33615 Bielefeld, Germany

Email: Stefan Janssen - [email protected];
* Corresponding author

Abstract

Biologists often deduce the function of a molecule by analysing its structure. Observing RNA structures is nearly impossible, which raises the need for computational predictions. In this research paper we present a runtime heuristic for the probabilistic shape analysis, which predicts functional classes of secondary structures for RNA sequences. Due to the methodology of algebraic dynamic programming we are able to precisely separate the search space and dynamically compile so-called Thermodynamic Matchers, which compute the probability of exactly one shape class. By a clever choice of shape classes for this procedure we achieve a runtime speedup ranging from three-fold for smaller sequences up to 300-fold for sequences with more than 400 nucleotides. This advantage must be bought by an uncertainty about a small part of the complete search space that is not explored any more.

Introduction

In the clockwork of a living cell, single stranded ribonucleic acid (RNA) molecules play a more active role than their close relatives, the double stranded deoxyribonucleic acid (DNA) molecules. Besides its well known function as messenger (mRNA) in the process of protein synthesis, RNA also acts as a regulator (OxyS, riboswitches), carries amino acids (tRNA) and even has enzymatic capabilities (ribosome). As for proteins, the function directly depends on the structure of the RNA. Thus biologists can make assumptions on the function or relationship of RNAs by analysing the structure. Observing the real structure of RNA in its natural environment is nearly impossible, which raises the need for computational predictions.

RNA structure prediction

Since 3D structures have too much variety, current prediction algorithms fall back to secondary structure, that is, the set of base pairs. Given an RNA sequence $s$ of length $n$ and various thermodynamic energy parameters, previously measured in wet lab studies [1], the RNA folding problem is to find the one secondary structure $x_{MFE}$ that is thermodynamically optimal, i.e. has minimal free energy (MFE), out of the search space $F(s)$ that contains all possible secondary structures for $s$. The dynamic programming methodology allows us to explore the exponentially growing $F(s)$ in polynomial time, thus finding $x_{MFE}$ within $O(n^3)$ time and $O(n^2)$ space. Nebel et al. [2] expect the number of possible candidates in this search space $F(s)$ to be $1.44^n \cdot 3.45 \cdot n^{-3/2}$.

Algorithms for optimisation problems that depend on estimated scoring schemes should always consider suboptimal solutions. Small inaccuracies in the scores for different candidates in the search space lead to uncertainties about the choice of the optimal candidate. For RNA folding it is likely that the assigned energy for the true biological secondary structure is slightly worse than for $x_{MFE}$. So it is worthwhile to also look at suboptimal candidates, whose energies are within a range $\beta$ of the predicted minimal free energy. Suboptimal RNA folding [3] leads to an exponential runtime, because a particular part $F^{\beta}(s)$ of the exponential $F(s)$ must be considered. Unfortunately, many suboptimal candidates are nearly the same. This enforces a larger deviation from the optimal score to gather candidates with crucial differences, which results in higher runtimes and much larger outputs.

Abstract shape analysis

The abstract shape analysis attacks the unmanageable output of calculating suboptimal secondary structures by applying abstraction and classification. A shape function $\pi(x)$ for a secondary structure $x$ abstracts from the position and the length of helical regions and neglects unpaired loop regions (see footnote 1). Thus, the focus is on the specific arrangement of helices. Here $\pi(x)$ is the shape string, representing the result of the abstraction. For example, both (((...)))..((....)) and ..((...))((...)).... map to the same shape string [][], which denotes two adjacent hairpins. All secondary structures with the same shape string build one shape class $h$, which is a well-defined subset of $F(s)$. Within a shape class $h$, the secondary structure with the best free energy becomes the shape representative $h_{shrep}$, or shrep for short.

In comparison to the list of suboptimals, the list of all shreps is much shorter and the differences between two shreps are of more importance. By amalgamating shape abstraction and enumeration of the search space into the same dynamic programming recurrences, the program RNAshapes [4] is able to decrease the exponential factor from $1.44^n$ to $1.1^n$, because there are fewer shape classes than secondary structures.

Footnote 1: There are five different levels of abstraction available, two of them also account for unpaired loop regions. For simplicity we deal here only with the strongest abstraction level, without unpaired loop regions.

Complete probabilistic shape analysis in exponential time

The state of an RNA molecule can be seen as a Boltzmann ensemble of structures. The challenge of probabilistic shape analysis [5] is to determine whether there is some shape class of structures in this ensemble that is internally similar, distinct from the rest, and collectively dominates the probabilities of all other shape classes. The dominating shape class should be the biologically relevant one. For example, shape probabilities have been used to assign significance levels to predicted miRNA precursors [6, 7].

We use the partition function to calculate probabilities for secondary structures. In general, the partition function provides a measure of the total number of states (here secondary structures) weighted by their individual energies at a particular temperature. The partition function of the complete search space $F(s)$ is

$Q_s = \sum_{j \in F(s)} e^{-E_j / RT}$,

where $E_j$ is the energy of secondary structure $j$, $R$ the universal gas constant (0.00198717 kcal/(mol·K)) and $T$ the temperature in Kelvin. The probability of a particular secondary structure $x \in F(s)$ is defined as

$Prob(x) = e^{-E_x / RT} / Q_s$.

The probability $Prob(h_i)$ of a shape class $h_i$ is the sum of the probabilities of its class members:

$Prob(h_i) = \sum_{x \in h_i} Prob(x)$.

By moving the division by $Q_s$ to the end of the calculation, the runtime of the probabilistic shape analysis remains $O(n^3 \cdot 1.1^n)$. In contrast to the simple shape analysis, where $\beta$ restricts the exponential number of different shape classes, every shape class in $F(s)$ has to be inspected and thus enumerated, resulting in a considerably higher runtime. For example, complete probabilistic shape analysis for a sequence of length 400 requires 12 hours on today's hardware, while abstract shape analysis takes only 38 seconds.

Sufficient probabilistic shape analysis in polynomial time

Despite abstraction, probabilistic shape analysis is still exponential regarding the number of answers and the runtime. However, the number of different stable conformations of a biologically relevant RNA should be limited to just a few. In an ideal world there should only be a few shape classes with high probability and many more that are relatively unlikely.

Following this principle, the central idea of our proposed heuristic is to accumulate these few very likely shape classes in a list $R(s)$. Instead of computing probabilities for all shape classes $P(s)$ for a sequence $s$, we only perform the calculation for each member of $R(s)$, each within $O(n^3)$ time. This results in an overall runtime of $O(n^3 \cdot |R(s)|)$ in contrast to $O(n^3 \cdot |P(s)|)$, where $|P(s)| \approx 1.1^n$. The runtime saving directly depends on the ratio $|R(s)| / |P(s)|$, i.e. on the number of shape classes in $R(s)$.

The procedure leads to two questions: First, how to guess promising shape classes for $R(s)$ in advance? Second, how can we ensure that the independently calculated probabilities are exactly the same as in the probabilistic shape approach?

Figure 1: Evaluation of our runtime heuristic. We used approximately 200 randomly generated sequences with a step width of five bases. The blue curve depicts the number of shapes in $P(s)$. The green curve illustrates the perfect number of shape classes to achieve the given combined probability of $\alpha = 0.9$. The red curve shows the results for $R_{energy}(s)$ as a replacement for the oracle. The lower cyan coloured curve stands for the number of used shapes for $R_{sampling}(100, s)$ and the upper one shows the deviation of the given $\alpha$, coloured in black.
[Plot not reproduced. x-axis: sequence length n = |s|; left y-axis: |R(s)| (log scale); right y-axis: a(R_sampling(100, s)) - 0.9. Legend: |P(s)| = 1.1^|s|; |R_Orakel(s)| for α = 0.9; |R_energy(s)| for α = 0.9; |R_sampling(100, s)| for α = 0.9.]

Exact probabilities via Thermodynamic Matchers

Each shape class $h_i$ of the probabilistic shape analysis is a well-defined part of $F(s)$. All secondary structures $x$ are members of exactly one shape class, and the union of all shape classes equals the search space of secondary structures for $s$:

$\bigcup_{i=1}^{|P(s)|} h_i = F(s)$.

To independently compute the probability of just a part of $F(s)$, e.g. a shape class, we have to ensure that such a specialised program contains precisely the same secondary structures as the shape class of the probabilistic shape analysis. Every deviation would lead to an incorrect probability. Due to the required accuracy and the exponential number of different shape classes, all attempts at manually adapting the dynamic programming recurrences are out of the question.

To counter this obstacle we wrote a program that accepts a single shape string as input and outputs an adapted grammar. This adapted grammar enumerates exactly the secondary structures of the input shape class. Since the RNAshapes program follows the methodology of ADP [8], we can replace its grammar for generating $F(s)$ with our adapted grammar, while the used algebras, containing all the thermodynamic energy parameters, remain unchanged.

This procedure allows us to dynamically compile specialised programs for each shape class in $R(s)$. These Thermodynamic Matchers (TDMs) calculate the partition function for their shape class in $O(n^3)$ time. Division by the uniquely computed partition function $Q_s$ for $F(s)$ results in the probability of the shape class.

Figure 2: Empirically measured runtimes. The colours correspond to Figure 1. In addition, the magenta coloured curve shows the runtime for the sampling method that is integrated in RNAshapes and used by our runtime heuristic. RNAshapes -p stands for the exact probabilistic shape analysis while RNAshapes -i addresses the sampling mode.
[Plot not reproduced. x-axis: sequence length n = |s|; y-axis: runtime in seconds (log scale); memory marks at 1 GB and 2 GB. Legend: R_Orakel(s); R_energy(s); RNAshapes -i 100; RNAshapes -p; R_sampling(100, s).]

Economic selection of shapes for R(s)

Unless $|R(s)| = |P(s)|$, the combined probability $\alpha = a(R(s)) := \sum_{h \in R(s)} Prob(h)$ is always less than $a(P(s)) = 1$. If $|R(s)| = |P(s)|$, the heuristic cannot save any runtime and acts just as RNAshapes. The runtime saving must be bought by an uncertainty about the unexplored parts of the search space, $P(s) \setminus R(s)$. The success of the heuristic depends on a clever selection of shapes for $R(s)$ to combine the apparently conflicting goals of minimising the runtime and minimising the gap between $\alpha$ and $a(P(s))$.

We investigated a couple of different methods of guessing shape classes for the list $R(s)$. Here we present two of them.

Shape class guessing: R_sampling(N, s)

RNAshapes already has an integrated heuristic for the probabilistic shape analysis. After spanning the search space, $N$ secondary structures are randomly picked. The shape class probabilities are then approximated as frequencies by counting all $N$ shape strings. The runtime no longer depends on $|P(s)|$, but on the constant factor $N$. Thus, this sampling method is much faster than our proposed heuristic, but the results are by no means guaranteed to be exact.

To combine both advantages, we use the quickly calculated shape classes from the sampling as members of $R(s)$ and precisely determine their probabilities via TDMs. The cyan coloured curve in Figure 1 documents the results for a sample size of $N = 100$. At a certain point, the combined probability of at most $N$ different shape classes is insufficient to fulfil a given $\alpha$, even if we could guess the shape classes perfectly. We then have to lower $\alpha$ or increase $N$.

Shape class guessing: R_energy(s)

First, we compute a simple shape analysis with an adequate choice for $\beta$. Then we go through this list of shape classes, which is ordered by free energy. Until $a(R_{energy}(s)) \geq \alpha$, we generate a TDM for each shape class and calculate its probability. In case of too few shape classes in $R(s)$ we re-run the simple shape analysis with a larger $\beta$. Results for this method are depicted by the red curve in Figure 1.

Results

To gain a lower bound for the sizes of $R(s)$ under the condition $\alpha = 0.9$ we employed an oracle. For evaluation, this wondrous oracle is able to name the shape classes of $P(s)$ in order of decreasing probability. The green curve in Figure 1 is a smoothed visualisation of such perfectly composed $R(s)$ sets for 200 sequences $s$ of different lengths. The blue straight line shows the exponential growth of $P(s)$. The wide discrepancy between the green and the blue curve leaves ample room for practical replacements for the oracle.

All results are given by the two graphs shown in this paper. Figure 1 is a comparison of the sizes of $P(s)$ and differently constructed sets $R(s)$, together with their impacts on $a(R(s))$. For the method $R_{sampling}$ in Figure 1 we decided to fix $N$ to 100. The upper cyan coloured curve illustrates the deviation of $a(R_{sampling}(100, s))$ from $\alpha$.

Different shape strings result in different TDMs, i.e. different grammar sizes. The larger the grammar, the higher the compile time. To include the overhead of compiling the TDMs and running the sampling method to get $R_{sampling}(N, s)$, Figure 2 shows our analysis of empirically measured runtimes.

Conclusions

Our presented heuristic decreases the exponential runtime $O(n^3 \cdot |P(s)|)$ for the probabilistic shape analysis to a polynomial runtime of $O(n^3 \cdot N)$. $N$ is the constant number of secondary structures drawn by the sampling method to guess shapes for $R_{sampling}(N, s)$. In contrast to the sampling method of RNAshapes, our proposed heuristic provides exact probabilities for the shape classes. The tradeoff between speedup and size of the unexplored search space can be set by the user via the choice of $\alpha$ or $N$. The use of TDMs allows the calculation of significantly larger input sequences compared to RNAshapes -p, which exceeds 2 GB of memory for sequence lengths of approximately 400 nucleotides. TDMs are also a good target for parallelisation and thus another source of further speedups.

Figure 2 shows that the speedup grows with growing sequence length, ranging from three-fold for $n = 200$ up to 300-fold for $n = 400$. We suggest using our heuristic for sequences longer than 200 nucleotides, where the overhead of compiling the TDMs pays off and achieves a significant runtime saving compared to RNAshapes.

References

1. Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology 1999, 288(5):911-940.
2. Nebel ME, Scheid A: On Shapes of RNA. Submitted 2008, [http://wwwagak.informatik.uni-kl.de/sta/nebel/www_shapes/www_RNAShapes.pdf].
3. Wuchty S, Fontana W, Hofacker IL, Schuster P: Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 1999, 49(2):145-165.
4. Giegerich R, Voß B, Rehmsmeier M: Abstract shapes of RNA. Nucleic Acids Research 2004, 32(16):4843.
5. Voß B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes. BMC Biol 2006, 4.
6. Berezikov E, van Tetering G, Verheul M, van de Belt J, van Laake L, Vos J, Verloop R, van de Wetering M, Guryev V, Takada S, van Zonneveld AJ, Mano H, Plasterk R, Cuppen E: Many novel mammalian microRNA candidates identified by extensive cloning and RAKE analysis. Genome Res 2006, 16(10):1289-1298.
7. Lu J, Shen Y, Wu Q, Kumar S, He B, Shi S, Carthew RW, Wang SM, Wu CI: The birth and death of microRNA genes in Drosophila. Nat Genet 2008, 40(3):351-355.
8. Giegerich R, Meyer C, Steffen P: A discipline of dynamic programming over sequence data. Science of Computer Programming 2004, 51(3):215-263.
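As an illustration of the shape abstraction and of the shape class probabilities defined in the section on complete probabilistic shape analysis, the following minimal Python sketch maps dot-bracket structures to shape strings of the strongest abstraction level and accumulates Boltzmann weights per shape class, dividing by the partition function only at the end. It is not the RNAshapes implementation: the function names, the temperature of 310.15 K, the rule "emit brackets only for a pair that does not enclose exactly one other pair", and the toy energies are assumptions made for this example, and the search space is given as an explicit list instead of being explored by dynamic programming.

```python
import math
from collections import defaultdict

R_GAS = 0.00198717     # universal gas constant in kcal/(mol*K), as quoted in the text
TEMPERATURE = 310.15   # Kelvin (37 degrees Celsius); an assumption of this sketch


def shape_string(dotbracket):
    """Map a dot-bracket structure to a shape string (strongest abstraction).

    Assumed rule: a base pair enclosing exactly one other pair belongs to the
    same helix (stacking, bulge or internal loop) and emits nothing; any other
    pair (hairpin or multiloop closing pair) emits '[' ... ']'.
    Unpaired bases are ignored. A structure without pairs yields ''.
    """
    root = []        # children of a virtual root node
    stack = [root]   # stack of child lists, one per currently open pair
    for ch in dotbracket:
        if ch == "(":
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            stack.pop()
        elif ch != ".":
            raise ValueError("unexpected character: %r" % ch)
    if len(stack) != 1:
        raise ValueError("unbalanced structure")

    def render(children):
        parts = []
        for child in children:
            inner = render(child)
            # exactly one enclosed pair: same helix, no new bracket
            parts.append(inner if len(child) == 1 else "[" + inner + "]")
        return "".join(parts)

    return render(root)


def shape_probabilities(structures):
    """structures: iterable of (dot-bracket string, energy in kcal/mol)."""
    weights = defaultdict(float)
    for dotbracket, energy in structures:
        # Boltzmann weight of one structure, added to its shape class
        weights[shape_string(dotbracket)] += math.exp(-energy / (R_GAS * TEMPERATURE))
    q_s = sum(weights.values())   # partition function of the whole toy ensemble
    # division by the partition function happens once, at the very end
    return {shape: weight / q_s for shape, weight in weights.items()}


# Toy ensemble with made-up energies (hypothetical values, for illustration only).
toy_ensemble = [
    ("(((...)))..((....)).", -3.0),
    ("..((...))((...))....", -2.5),
    ("(((...)))...........", -1.0),
    ("....................", 0.0),
]
toy_probabilities = shape_probabilities(toy_ensemble)
```

In this toy ensemble the two structures with two adjacent hairpins fall into the same shape class [][] and their weights add up, which is why that class dominates although no single structure has to.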
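The selection of shape classes for R(s) described in the two "Shape class guessing" sections can be sketched in a few lines of Python. The sketch assumes that candidate shape strings arrive either as sampled shape strings (ordered here by sampling frequency) or already ordered by free energy, and that a callable returns the exact probability of one shape class; in the paper this role is played by a compiled Thermodynamic Matcher followed by division by Q_s, whereas here a plain dictionary lookup with made-up numbers stands in for it. All function and variable names are hypothetical.

```python
from collections import Counter


def candidates_from_sample(sampled_shapes):
    """Order the distinct shape strings of a sample by decreasing frequency."""
    return [shape for shape, _count in Counter(sampled_shapes).most_common()]


def select_shapes(candidates, exact_probability, alpha=0.9):
    """Accumulate shape classes until their combined probability reaches alpha.

    candidates:        shape strings in the order in which they are tried
    exact_probability: callable returning the exact probability of one shape
                       class (stand-in for running a TDM and dividing by Q_s)
    Returns the selected (shape, probability) pairs and the combined
    probability; if the latter stays below alpha, the candidate list was too
    short (larger N or larger beta needed).
    """
    selected = []
    combined = 0.0
    for shape in candidates:
        probability = exact_probability(shape)
        selected.append((shape, probability))
        combined += probability
        if combined >= alpha:
            break
    return selected, combined


# Hypothetical exact shape probabilities, used only to exercise the functions.
exact = {"[]": 0.6, "[][]": 0.25, "[[][]]": 0.1, "[][][]": 0.05}
sample = ["[]", "[][]", "[]", "[]", "[[][]]", "[][]"]

guessed = candidates_from_sample(sample)
r_s, a_r_s = select_shapes(guessed, exact.get, alpha=0.9)
```

With these made-up numbers three shape classes suffice to pass α = 0.9, so only three of the four classes would need an exact computation; a candidate list that ends before α is reached corresponds to the case where α has to be lowered or N (respectively β) increased.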