Mol. Based Math. Biol. 2017; 5:1–20 Review Article Open Access Fenix Huang, Christian Reidys, and Reza Rezazadegan* Fatgraph models of RNA structure DOI 10.1515/mlbmb-2017-0001 Received October 23, 2016; accepted December 25, 2016 Abstract: In this review paper we discuss fatgraphs as a conceptual framework for RNA structures. We discuss various notions of coarse-grained RNA structures and relate them to fatgraphs. We motivate and discuss the main intuition behind the fatgraph model and showcase its applicability to canonical as well as noncanonical base pairs. Recent discoveries regarding novel recursions of pseudoknotted (pk) configurations as well as their translation into context-free grammars for pk-structures are discussed. This is shown to allow for extending the concept of partition functions of sequences w.r.t. a fixed structure having non-crossing arcs to pk-structures. We discuss minimum free energy folding of pk-structures and combine these above results outlining how to obtain an inverse folding algorithm for PK structures. Keywords: RNA, pseudoknot, fatgraph, genus, context free grammar 1 Introduction RNA molecules are polymers in nucleotides guanine (G), uracil (U), adenine (A) and cytosine (C). They are essential not only in the production of proteins from genes in DNA (coding RNA) but also as noncoding RNA molecules which have diverse functions in living organisms such as translation of proteins in the ribosome, gene regulation and as enzymes [16, 23]. The first noncoding RNA molecules to be discovered were ribozymes. These molecules work as enzymes and were discovered by Cech and Altman in the early 1980s [2, 11]. This discovery supported the RNA World Hypothesis [26] that the earliest forms of life relied solely on RNA molecules for both the storage of genetic material and the catalysis of chemical reactions. Other types of noncoding RNA include transfer RNA (tRNAs) that provide a physical link between messenger RNA and the amino acids of the resulting protein, long noncoding RNA and small nuclear RNA snoRNA. In fact only one fifth of the transcription in the human genome comes from genes that encode proteins [35]. The main difference between RNA and DNA, apart from the existence of the nucleotide U instead of T (thymine), is that RNA is mostly single stranded and hence its nucleotides can form bonds with each other. This results in the molecule folding to form a 3-dimensional structure. The function of a noncoding RNA molecule is determined by this 3-dimensional spatial configuration, called the tertiary structure. Therefore determining this structure i.e. measuring the coordinates of each atom in the molecule at near Ångstrom resolution is of utmost importance. In Figure 1 we see a depiction of this three dimensional structure of the yeast phenylalanine transfer RNA. The experimental determination of RNA tertiary structure uses techniques such as X-ray crystallography [31] and NMR [45]. These methods are expensive and time consuming; for example crystallization of RNA is challenging because of its flexibility. Therefore one is interested in the algorithmic prediction of RNA struc- *Corresponding Author: Reza Rezazadegan: Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, VA 24060, Blacksburg, U.S.A., E-mail: [email protected] Fenix Huang: Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, VA 24060, Blacksburg, U.S.A., E-mail: [email protected] Christian Reidys: Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, VA 24060, Blacksburg, U.S.A., E-mail: [email protected] © 2017 Fenix Huang et al., licensee De Gruyter Open. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License. Unauthenticated Download Date | 6/16/17 9:06 PM 2 | Fenix Huang, Christian Reidys, and Reza Rezazadegan Figure 1: The tertiary structure of the yeast phenylalanine tRNA with PDB ID 1EHZ. The three dimensional structure of this molecule was determined in [70]. Graphics generated using NGL [66, 67]. ture. There are currently several programs available for predicting RNA 3D structure such as MC-fold [53], Vfold [81], FARNA [18], etc. See [40] for a more comprehensive list. However the existing algorithms are computationally expensive and don’t yield precise results [40, Chapter 4]. For this reason most of the effort in the prediction of RNA structure has been focused on predicting the coarse-grained RNA structure. This means that instead of the coordinates of each one of the many atoms in each base one is interested in determining the position of one or more precisely defined points in each nucleotide such as the center of mass or the positions of a few specific atoms. 76 5’ (A) (B) 3’ A G C G 70 G A U 60 U C U A G A C A C A G U G A C U C G10 C U G U G C U U U C 50 G G A G C U GG A G A C G G G 20 C G A U 30 G C 40 A U C A G U G A A C G C U U A U A A 10 C G C C A G C U U U C C A 20 C C C G A U U G U U G G A A G A A A C C 20 10 60 50 40 30 70 5’ G 30 A G 40 C U G G A A G G C U G A C 1 1 10 20 30 40 48 3’ 48 Figure 2: Examples of RNA secondary structures with and without crossings and their associated graphs An even simpler structural measure is given by the secondary structure of RNA [4, 79]. Note that both the tertiary and coarse-grained structure of RNA are concerned with how the molecule embeds into the space. The secondary structure on the other hand is only concerned with knowing which bases pair with which. As such it can be presented as an abstract graph (i.e. one that does not come with an embedding into space). The nodes of this graph are the bases and edges are either given by the covalent bonds in the backbone or the base pairs between distant bases. Two examples of such structures are given in Fig. 2. Secondary structure is much simpler than the tertiary one but it still retains a great deal of information about molecule’s structure. In the literature the term secondary structure is usually used for the structures in whose graph, the edges do not intersect, i.e. noncrossing structures, but in this article we use this term to refer to any graph of pairings between bases. Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 3 The graph presentation of RNA structure [73] has become a ubiquitous tool in studying RNA. However in recent years it has been noted that this presentation can be further enhanced into a fatgraph [7, 55] (Fig. 5). Recall that secondary structure is a coarse-grained one in which each base is regarded as a point. An intermediate step between this and an all atom model of RNA is to consider the geometry of different types of bonds between bases. The most stable type of bonds between RNA bases are the Watson-Crick bonds A-U and C-G. However there is a variety of noncanonical base pairs that can be formed between nucleotides. Geometrically each nucleotide is modeled by a triangle whose three edges are distinguished as the Watson-Crick edge, the Hoogsteen edge and the sugar edge. Each side of one base can possibly bond with any side of another [39]. See Figure 3. SU H W- W- G C H W- C SU G H W- C SU G H SUG G H C H W- SU G W- C W- C 5’ 3’ 3’ 5’ H W- C H G SU SUG SUG H SU C H H H SUG H W-C W-C H C C SUG W- W- SUG G W- C SUG SU H W- G SUG H SU G H C H SU C W-C W- H W- H G G W-C G C SU G SU G W-C SU W- H SU H SU G W-C G W-C (B) SU W-C (A) SU H C 5’ 5’ 3’ 3’ H W- C Figure 3: Twelve different types of noncanonical base pairs and the ribbons associated to them in the fatgraph. The labels WC, SUG and H denote the Watson-Creek, Hoogsteen and sugar edges of the nucleotide, see section 4 for details. The first row shows the cis base pairs and their associated straight ribbon. The second row depicts the trans base pairs and their associated twisted ribbon. W H -C Accordingly one can enhance the graph node associated to a base into a triangle implying that the edges can also be enhances into ribbons. This turns out to be a useful byproduct since the bonds between two edges can be either in the cis or trans orientation and this can be encoded by the representing ribbons being straight or twisted. An enhanced graph whose edges are given by ribbons that can be either twisted or straight is called ribbon graph or a fatgraph [57]. In Fig. 4 we present two examples of ribbon graphs associated to RNA structures with noncanonical base pairs. SUG -C H W SUG (A) (B) (C) Figure 4: A: The three edges of a nucleotide base in cis and trans orientations. B: For a trans base pair the associated edge in the fatgraph is twisted. C: For a cis base pair the ribbon is untwisted. Even when considering bases as simple points, fatgraphs are still relevant to RNA structure. One can draw the backbone of the secondary structure on the x-axis and the arcs in the upper half plane. This way we get a counterclockwise order on the edges incident on each vertex which we can use to traverse the graph. The loops we get in such traversal are the same as the (hairpin, bulge, multi-, etc. ) loops in the RNA structure. It can be easily seen (see Section 4) that a graph with a cyclic order on the the edges incident on each vertex is equivalent to a ribbon graph. This presentation (with bases as single points) is useful in studying and Unauthenticated Download Date | 6/16/17 9:06 PM 4 | Fenix Huang, Christian Reidys, and Reza Rezazadegan predicting pseudoknotted structures as we will see below. Pseudoknots are RNA structure motives that involve base pairs between bases that are neither nested nor concatenated [10, 14, 69, 80]. This means that the contact structure of a pseudoknotted (PK for short) structure contains arcs that cross each other. They were, and still are, mostly ignored because i. the vast majority of experimentally determined RNA structures do not involve pseudoknots and ii. predicting and even classifying pseudoknotted structures is theoretically and computationally much more demanding compared to noncrossing structures. In fact RNA secondary structure prediction problem with generalized pseudoknots, where the energy function depends on adjacent base pairs, is NP-hard [1, 44]. For this reason, commonly used RNA secondary structure prediction tools, including mfold [82], Vienna RNA [30] and NUPAK [21] exclude pseudoknots. However the importance of PK structures is becoming increasingly more evident. Pseudoknots are for example functionally important in tRNAs [43], telomerase RNA [72], and ribosomal RNAs [37]. The prediction of RNA pseudoknotted structures can be seen as an intermediate between secondary and tertiary structure prediction. In vitro selection experiments have produced pseudoknotted RNA families that bind to the HIV-1 reverse transcriptase [75]. There have been attempts to use the number of crossings to filter the set of PK structures compatible with a given sequence [63]. However determining the class of structures that the algorithm recognizes is not easy and in fact in the case of [63] this class was characterized in a subsequent paper [64]. Another issue with this approach is that a set of parallel arcs intersecting another arc can result in a high number of crossings but this does not result in a more complicated structure. 1 2 3 4 5 6 34 2 5 1 6 Figure 5: A fatgraph and its associated surface which has genus one Fatgraphs turn out to be a powerful framework to describe PK structures. In particular they come with a new numerical invariant called genus, i.e. the number of holes in the ribbon surface associated to it. Figure 5 depicts a fatgraph and its associated surface having genus one. Genus can be simply expressed in terms of the number of the vertices, edges and boundary components of the fatgraph, see Section 4 for details. One can use genus to filter the set of PK structures, in other words use genus as a measure of the complexity of such a structure. As opposed to the number of crossings, genus is not changed when we collapse a stack. The first application of genus to the study of PK structures was given in [7]. The authors study databases of RNA structures and show that even for large RNA molecules of about 3 kilobases the genus remains smaller than 18 but for random sequences having the same length it can go up to 400. In [3] the authors use the relation between fatgraphs and the moduli space of Riemann surfaces [54] to compute the generating function for the number of fatgraphs with a given genus and minimum number of parallel arcs. This is used to imply the existence of neutral networks of distinct molecules with the same structures of any genus. However neither of these two papers provide a grammar for decomposing irreducible (primitive) pseudoknots. This task was first done in [59]. In that paper genus is used to obtain a recursion and hence a (multiple context free) grammar for decomposing PK structures. One important contribution of that paper is defining Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 5 the right class of PK structures, called 𝛾 -structures, whose numbers are manageable for a given genus. These are PK structures without parallel arcs (called shadows) all whose irreducible components have genus ≤ 𝛾 . A shadow is called irreducible if for each two arcs in it there is a sequence of mutually intersecting arcs from one to the other. It is shown in [59] that there are finitely many irreducible 𝛾 -structures for a given 𝛾 . It is also shown that there are exactly four irreducible 1-structures. Then a grammar for decomposing any 1-structure is given. As a result, a 1-structure can have arbitrarily high genus (and the number of crossings) but it is composed of blocks of small complexity. The resulting DP algorithm and software gfold can fold an RNA sequence of length N in O(N 6 ) time. In a recent work of Huang and Reidys [32] a genus recursion due to Chapuy [12] is used to obtain a context free grammar (CFG) for decomposing any PK structure. Chapuy’s method adeptly finds vertices (called trisections) at which the genus of a fatgraph is located. These are the vertices at which the order on the edges coming from the cyclic order and the one coming from the traversal of the boundary do not match. Chapuy introduces a slicing operation on such vertices that decreases genus by at least one. There is also the inverse operation of gluing of vertices which increases genus. These operations are employed to obtain a combinatorial recursion for the number of fatgraphs with a given genus and number of edges, having only one boundary component. The algorithm of Huang and Reidys uses this recursion to convert any PK structure (which has nonzero genus) to a noncrossing one (which has genus zero) together with a labeling of its arcs. The labeling can be used to re-obtain the PK structure. The resulting noncrossing structure depends on a choice of trisections and in it the backbone is permuted. CFG’s are simpler than MCFG’s and so the passage from MCFG to CFG has implications for time and space complexity. Thus it is conceivable that a folding algorithm based on [32] would be more efficient. Inverse folding involves finding an RNA sequence that folds into a given structure S. It is useful in designing synthetic RNA molecules such as new drugs [30]. Even though local search algorithms for inverse folding noncrossing structures have been around for a while [8, 29, 30], such algorithms for PK structures are new [25]. The difficulty again arises from decomposing a PK structure into simple pieces. The local search inverse folding algorithms are comprised of two steps. The first step involves finding a heuristically suitable seed sequence to initiate the search and the second a walk in sequence space that takes one closer and closer to a candidate sequence. In [8] the seed is taken to be the sequence σ for which the energy E(σ, S) is minimal (called the MFE sequence). The first inverse folding algorithm for PK structures was given in [25]. It uses a random sequence as seed and is able to inverse fold PK structures with at most two mutually crossing arcs. In Section 4.3 we outline upcoming work in which we use the novel grammar of [32] to obtain an algorithm that can find the MFE sequence for a given PK structure. This will in turn be utilized by a stochastic local search algorithm, similar to the one in [8], to inverse fold such a PK structure. Global search algorithms for inverse folding, also called Inverse sampling [62], involve assigning probabilities to the sequences compatible with a given structure S as how likely they are to fold into S. The grammar of [32] makes it possible to extend inverse sampling to PK structures. 2 Preliminaries In this section we recall the basic notions and terminology in RNA secondary structure folding. Unauthenticated Download Date | 6/16/17 9:06 PM 6 | Fenix Huang, Christian Reidys, and Reza Rezazadegan 2.1 RNA secondary structures An RNA molecule consists of a sequence of nucleotides A, U, C and G. These nucleotides can form WatsonCrick A-U, C-G as well as wobble G-U base pairs. RNA nucleotides can form noncanonical base pairs as well (Section 4) however for the purpose of RNA secondary structure prediction they are usually ignored. These base pairs make the RNA molecule fold into a 3-dimensional shape and this shape is essential to the functions of the RNA molecule. Since an RNA molecule is directed, from the 3’ to the 5’ end, bases can be enumerated from 1 to N and hence a base pair can be represented by a pair (i, j) where j > i. Geometrically a secondary structure can be represented by its nucleoltides as points on the x-axis and the base pairs as arcs in the upper half plane. This representation is called a contact structure [60]. We say a structure S is the concatenation of two substructures S1 , S2 if each arc in S belongs to one of S1 or S2 and there is a base k such that the arcs of S1 are on one side of k and the same for S2 . Similarly S is the result of nesting S1 in S2 if all the arcs in S1 are underneath a single arc in S2 . A contact structure is called noncrossing if none of the arcs in it intersect or in other words if there are no base pairs (i, j), (k, l) such that i < j < k < l. Structures which contain intersecting arcs are called pseudoknotted (PK). 2.2 Folding noncrossing structures For a given RNA sequence there are exponentially many secondary strcuctures compatible with it [68] and therefore a naive approach to finding the MFE structure would be computationally impossible, even for sequences of moderate length. The dynamical programming (DP) approach to folding [51] uses the fact that structures can be nested in one another and that the energy is additive w.r.t. the nesting of structures. So a recursion that can exhaustively express secondary structures as combinations of simpler ones can be used together with DP to obtain an algorithm for finding MFE or computing the partition function. It is easy to see that a noncrossing structure can be recursively decomposed into the concatenation and nesting of its constituent arcs. Conversely for a given N all such structures can be obtained in polynomial time by nesting and concatenating simpler structures [71]. This can be formalized using the language of context free grammars (CFG’s). Recall that a CFG consists of sets V , Σ of nonterminal resp. terminal symbols and a set R of production rules which are maps from V to (V ∪ Σ)* . Here * is the Kleene star and denotes the set of all words in the alphabet given by V ∪ Σ. For RNA secondary structures the grammar goes back to the work of Smith and Waterman [71]. The nonterminal symbols are L and S which respectively denote a structure with a rainbow arc (maximal arc) and an arbitrary structure. The production rules are given by S → LS | • S | ϵ L → aSb where a and b denote the bases of an arc, • denotes an unpaired base and t is the empty structure without any crossings. These production rules are depicted in Figure 6. 3 Fatgraphs In this section we recall the basic theory of fatgraphs. Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 7 Figure 6: The grammar for decomposing secondary structures without crossings 3.1 Basic facts The contact structure from a secondary structure has an additional structure i.e. a cyclic counterclockwise order on the edges incident on each vertex. This order comes from the orientation of the plane. A graph that is equipped with a cyclic order on the edges incident on each vertex is called a fatgraph. More details on the concepts and proofs presented in this and the next subsection can be found in e.g. [38] and [57]. To be precise the cyclic order < has to be defined on half edges in the graph. This is because if two (possibly identical) vertices u, v share edges e, e′ then e and e′ may appear in different orders on u compared to v. For example on the left hand side of Figure 7 the edge 2 comes before the edge 3 in the counterclockwise order however the other halves of these two edges, i.e. 7 and 8 come in reverse order. Let H be the set of half edges in a fatgraph G so the cardinality of H is twice the number of edges in G. To a fatgraph G we can associate a surface S in a canonical way. It is obtained by thickening the edges of the fatgraph into ribbons and the vertices into discs. The order < on the half edges is used to resolve the singularity when the ribbons associated to two edges meet. Equivalently, S can be characterized by the fact that G is embedded in it in such a way that it doesn’t intersect itself and the cyclic order on its edges is compatible with the orientation of the surface [57]. Despite their topological nature, fatgraphs can be described combinatorially and concisely by a pair of permutations [12]. Recall that a permutation on the set {1, . . . , n} is a one to one and onto function f from the set to itself. A cycle of f for an i ∈ {1, . . . , n} is given by (i, f (i), f (f (i)), . . . , f k (i)) such that k ≤ n and f k+1 (i) = i. Cycles are considered up to cyclic rearrangement of their elements and each permutation f can be decomposed as a product of its cycles. This decomposition is unique up to conjugation and permutation of the cycles. For each vertex v let e1 < e2 < · · · < e k be the half edges incident on it, ordered by the cyclic order <. We can consider the cyclic permutation (e1 , e2 , . . . , e k ) from which we can recover the cyclic order around v. Composing these permutations for all the vertices in G we get a permutation on H which we denote by σ. Note that since each half edge is incident on only one vertex, it appears in only one of the cyclic permutations that make σ so the order in which we compose the cycles is immaterial. It is easy to see that each vertex in G corresponds to a cycle in σ. For half edges incident on a vertex of degree n we have e ≤ e′ ⇔ ∃ k < n σ k (e) = e′ . (1) To be able to recover G, in addition to σ we need to specify which half edge pairs with which. This can be done with an additional permutation α on H which sends each half edge to its couple and vice versa. So α is fixed-point free and satisfies α2 = α. In this way each fatgraph G can be represented combinatorially by the triple (H, σ, α). The composition 𝛾 := α ∘ σ is a permutation on H whose cycles give us the boundary components of the surface associated to G [38]. By abuse of terminology we call these boundary components the boundary components of G itself. Example 3.1. For example for the fatgraph in Figure 5 we have H = {1, 2, . . . , 16}, α = (1, 2)(3, 4)(5, 6)(7, 8)(9, 10)(11, 12)(13, 14)(15, 16), σ = (3, 1)(7, 5, 4)(9, 2, 8)(13, 11, 10)(15, 6, 14)(12, 16). Unauthenticated Download Date | 6/16/17 9:06 PM 8 | Fenix Huang, Christian Reidys, and Reza Rezazadegan Here we have written the cycles of σ so that the i’th cycle from left denotes the i’th vertex of the fatgraph. We have 𝛾 = α ∘ σ = (1, 4, 8, 10, 14, 16, 11, 9)(2, 7, 6, 13, 12, 15, 5, 3) so the surface associated to this fatgraph has two boundary components. The set of fatgraphs enjoys a duality map called Poincare Duality [60]. The Poincare dual of the fatgraph G = (H, σ, α) is given by G′ = (H, α ∘ σ, α). As Poincare duality swaps σ with 𝛾 , the vertices of G (cycles of σ) are in a 1-1 correspondence with the boundary components of G′ (cycles of 𝛾 ) and the boundary components of G are in bijection with the vertices of G′ . Also because α2 = id, we have (G′ )′ = G. In Figure 7 you can see an example of this duality. 2 6 5 dual 4 3 7 10 2 8 9 5 3 7 9 (B) 8 dual 4 6 1 (A) 2 8 4 6 1 10 5 3 3 7 9 1 10 (C) Figure 7: (A): a fatgraph represented by permutations. We have σ = (1, 2, 3, 4, 5, , 6, 7, 8, 9, 10), the unique vertex obtained by collapsing the backbone, α = (1, 10)(2, 5)(3, 8)(4, 7)(6, 9) the fixed-point free involution representing the ribbons, and 𝛾 = (1, 5, 9)(2, 8, 6, 4)(3, 7)(10). (B) We consider the σ′ = 𝛾 = (1, 5, 9)(2, 8, 6, 4)(3, 7)(10) as vertices. (C): Connecting halfedges by ribbons following the fix-point free involution α = (1, 10)(2, 5)(3, 8)(4, 7)(6, 9), we construct the Poincaré dual (A). The dual interchanges the boundary components by vertices, hence producing a unicellular map. Fatgraphs which have only one boundary component, are called unicellular maps. By construction the Poincare dual of a unicellular map has only one vertex. In the case of the contact structure of an RNA secondary structure if we collapse the backbone into a point we get a fatgraph with only one vertex and hence its Poincare dual will be a unicellular map. See Fig. 8. This unicellular map is what we mean by the fatgraph associated to an RNA secondary structure. Figure 8: Collapsing the backbone into a single vertex. The Poincare dual of the resulting fatgraph is a unicellular map. For a unicellular map, the permutation 𝛾 consists of only one cycle and hence it induces a cyclic order <𝛾 on H which gives a traversal of all the half-edges. This order can be made into a linear order by choosing a root edge. For an RNA secondary structure this edge is chosen to be the rainbow arc which is given by the pair (1, N). If this arc does not exist in the structure it is added as a virtual base pair. Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 9 3.2 Genus as a measure of complexity One of the main advantages of fatgraphs over ordinary graphs is that they come with a new numerical invariant i.e. the genus of the surface associated to them. Any orientable surface is topologically equivalent to the connected sum of a number of tori [28] and this number is called the genus of the surface. So genus is roughly speaking the number of holes (but not punctures) in the surface. The Euler characteristic of a fatgraph G with v vertices, e edges and r boundary components is defined to be χ(G) = v − e + r and from the equivalence of simplicial and singular homology [28] we know that the Euler characteristic is related to genus by χ(G) = 2 − 2g. So we have 1 g(G) = 1 + (e − v − r). (2) 2 Since Poincare duality swaps vertices with boundary components v ↔ r and preserves the number of the edges, it preserves genus. In the previous subsection we associated to each RNA secondary structure a unicellular map. We call the genus of this unicellular map, the genus of the secondary structure. Because the plane has genus zero, a planar fatgraph has genus zero too. So each noncrossing secondary structure has genus zero, because it is embedded into the plane. So we can use genus as a measure of the complexity of a PK structure. In the rest of this section we will see how genus is useful in obtaining a recursion for decomposing PK structure into simpler ones. One advantage of genus as a measure of complexity is that, unlike other measures such as the number of crossings, it is invariant under collapsing stacks [59]. This is important because computing the energy of helices is less demanding. This means that if we have a set of parallel arcs in a structure and we remove all but one of them, genus is unchanged. The structure obtained from S by collapsing all the stacks in it is called the shadow of S [59]. 3.3 Decomposing fatgraphs As mentioned in the last subsection, it is a well known fact in topology that if the surface associated to a fatgraph has genus g then it can be decomposed into a connected sum of g tori [28]. If the surface in question is the surface associated to a fatgraph then it would be useful to be able to do this decomposition in a way that is compatible with the structure of the fatgraph itself. This means finding a composition operation on the set of fatgraphs so that all genus g fatgraphs can be expressed as compositions of fatgraphs of lower genus. One such decomposition for unicellular maps was given by Chapuy [12] which we recall in this subsection. In Section 4 we will see how this decomposition can be used to obtain a grammar for decomposing RNA PK structures. For a planar fatgraph, which by definition has genus zero, the two orders <𝛾 , < on the half-edges incident on each vertex coincide because both of them are induced by the orientation of the plane. This means that one can use discrepancies between these two orders to detect genus. Let G = (H, σ, α) be a unicellular map of genus g and let e1 < σ e2 < σ e3 are three edges incident on the same vertex v that do not appear in the same order in <𝛾 . Chapuy [12] calls such a triple of edges intertwined. He defines two operations on intertwined edges, called slicing and gluing, which reduce or increase the genus of a fatgraph respectively. Slicing. With the same assumptions as in the last paragraph, can write the cycle of σ corresponding to v as (e1 , a1 , . . . , a k1 , e2 , b1 , . . . , b k2 , e3 , c1 , . . . , c k3 ). (3) ′ Let σ be the permutation of H which is the same as σ except for having the cycles (e1 , a1 , . . . , a k1 ), (e2 , b1 , . . . , b k2 ), (e3 , c1 , . . . , c k3 ) ′ ′ (4) ′ instead of v. The fatgraph G = (H, σ , α) can easily be seen to be a unicellular map [12]. Because σ has two cycles more than σ, G′ has two vertices more than G, so by formula (2) its genus is g − 1. Unauthenticated Download Date | 6/16/17 9:06 PM 10 | Fenix Huang, Christian Reidys, and Reza Rezazadegan Gluing. This is the reverse of the slicing map. If we have three vertices in a fatgraph given by cycles (a1 , . . . , a k ), (b1 , . . . , b k′ ), (c1 , . . . , c k′′ ) then we can take a new permutation which has the single cycle (a1 , b2 , . . . , b k′ , b1 , c2 , . . . , c k′′ , c1 , a2 , . . . , a k ) in place of the above three. The corresponding fatgraph is still a unicellular map and has two vertices less than the original one so its genus is one less. See Fig. 9. In Out a3 glue a1 a2 B ... A II g= ( ... a1, A , a2, B , a3, ... ) s= ... ( a1, I )( a2,II )( a3,III ) ... a’1 slice a’3 ... I ... ... ... III a’2 ... B A g= ( ... a1, B , a3, A , a2 ,... ) s= ... ( a1, II, a2, III, a3 , I, ) ... Figure 9: Gluing and slicing A trisection is a half-edge e such that σ(e) <𝛾 e and e is not the minimum half edge of the vertex w.r.t. <𝛾 . Chapuy proves that a unicellular map of genus g has exactly 2g trisections. A trisection can be sliced in a special way. Let e3 := e be a trisection incident on a vertex v and let e1 denote the minimum half-edge of v, let e2 denote the <𝛾 -minimum among half-edges which are between e1 and e3 (w.r.t. < σ ) but larger than e3 with respect to <𝛾 . We have e1 <𝛾 e3 <𝛾 e2 and so these three edges are intertwined and we can use them to slice v. After the slicing e now belongs to a new vertex and it may or may not be the minimum half edge. If not it still satisfies σ′ (e) <𝛾 ′ e and so is a trisection. Therefore we can iterate the above procedure until e becomes the minimum half-edge of the vertex to which it belongs. 4 Modeling RNA structure with fatgraphs Throughout the history of science, mathematics has been used to model natural phenomena. The purpose of such models is primarily to enable us to predict the result of a process without having to experiment each time. For example calculus and the theory of Hilbert spaces were used to model classical and quantum mechanics respectively. As history tells us a model that seems to be sufficient at predicting natural processes at first may later prove to be inefficient, as demonstrated by the example of the classical and quantum mechanics. The purpose of a mathematical model for RNA is to enable us to classify and communicate RNA structure patterns and to predict RNA folding. The first application of fatgraphs to studying RNA was by Penner and Waterman [56] in which the authors enhanced the space of secondary structures into a simplicial complex and used fatgraphs and their relation to the Teichmüller theory of surfaces to determine its topological type. More recently Penner et al. [55] associated a fatgraph model to proteins. They use the genus and the number of boundary components of the fatgraphs to obtain robust descriptors for proteins. In this section we present how fatgraphs, whose mathematical theory was presented in the last section make it possible to efficiently classify, fold and inverse fold PK structures. We begin by describing how fatgraphs are essential to the modeling of noncanonical base pairs in RNA. Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 11 As mentioned in the introduction, in addition to the Watson-Creek and wobble pairs, RNA structures exhibit a plethora of additional, non-canonical base-pairs. These base pairs mediate specific interactions responsible for RNA-RNA self-assembly and RNA-protein recognition. RNA purine and pyrimidine bases present three edges for hydrogen-bonding interactions: the Watson-Crick edge, the Hoogsteen edge, and the Sugar edge denoted by W-C, H and SUG respectively [39]. In this section we outline future work in which we use fatgraphs to model noncanonical base pairs in RNA. A given edge of one base can potentially interact in a plane with any one of the three edges of a second base, and can do so in either cis or trans orientation of the glycosidic bonds. This gives rise to 12 basic geometric types with at least two H bonds connecting the bases, see Figure 3. The canonical A-U, G-C pairs and G-U wobble pairs belong to the cis Watson-Crick/Watson-Crick geometry. RNA folding, as well as interactive three-dimensional modeling, requires keeping track of the relative orientations of the strands to which the interacting bases belong. A triangle has two sides, for one side the three edges follow (H, SUG, W-C) counterclockwise, and for another side the three edges follow (H, W-C, SUG) counterclockwise. A base pair is parallel if the two triangles are on the same side and anti-parallel, otherwise. This leads to considering non-orientable fatgraphs as detailed in Figure 4. Roughly speaking a non-orientable fatgraph is one whose associated surface can be non-orientable. More precisely such a fatgraph comes with additional information as to whether the ribbon associated to each edge should be twisted or straight. To keep track of the boundary components of a nono-orientable fatgraph it is preferred to consider sectors instead of ribbons. A sector is the wedge between the two ribbons incident on a vertex and for non-orientable fatgraphs with n edges we can take the set H of cardinality 2n to be the set of the sectors. So a possibly non-orientable fatgraph is a triple (H, σ, 𝛾 ) such that for each pair (e, σ(e)) there is another pair (f , σ(f )) such that we have either σ e / σ(e) 𝛾 σ(f ) o 𝛾 σ or σ e 𝛾 f / σ(e) 𝛾 f σ / σ(f ). In the first case we have an untwisted ribbon and in the second case a twisted one. See Fig. 10 for two examples of such fatgraphs. Note that for a non-orientable fatgraph it is not true that α = 𝛾 ∘ σ−1 satisfies α2 = id. The notion of Poincare duality can be defined for these fatgraph by swapping σ and 𝛾 . 5 -3 -4 6 2 1 8 7 (A) 5 -3 -7 1 8 4 6 -2 (B) Figure 10: Two examples of nonorientable fatgraphs. The plus and minus signs on sectors denote counterclockwise and clockwise orientations respectively. Unauthenticated Download Date | 6/16/17 9:06 PM 12 | Fenix Huang, Christian Reidys, and Reza Rezazadegan 4.1 A context free grammar for decomposing PK structures We saw in the subsection 3.3 that a genus g PK structure together with a blueprint is equivalent to a labeled tree (G N , κ). The Poincare dual of this tree is a genus zero fatgraph S with one vertex which has the same edges as the tree. So an edge e of S can be regarded as an edge of G N and to which we assign the label of its child. Since the root of the tree is not involved in any slicing it gets the zero label. We then add the unpaired bases to S and what we obtain is a noncrossing secondary structure with labeled arcs. We then apply the classical grammar for secondary structures to (S, σ) to decompose the original PK structure. Whereas the old grammar applied to arcs, the new grammar applies to labeled arcs and it either removes an arc or reduces its label. We now describe the resulting grammar in more detail but it’s worth mentioning that different choices of blueprints result in different labeled noncrossing structures. ∑︀ This grammar has new nonterminals S(l(ti ))i and L(l(ti ))i for i = 1, . . . , N. Here l i = σ a σ a |i where σ a runs i i i i over all the edge labels in the noncrossing structure G N that belong to S. We set t i = 1 if and only if the corresponding nonterminal contains a transitional arc for i (meaning it contains an arc a such that σ a |i = 1) and t i = 0 otherwise. The new terminals are •, d σ a , d′σ a where • denotes an isolated base and d σ a , d′σ a denote the two bases of a labeled arc. There are two production rules in this grammar. i )i Decomposition: this means to decompose a nonterminal symbol S(ℓ into one left- and one right-symbol, (t i )i together with consistent labelings. This is formalized as i )i (ν i )i | • S(l(ti ))i . S S(l(ti ))i → ϵ | S(µ (p ) (q ) i i i i i i (5) i i The new labels µ i , ν i , p i , q i are derived using the fact that the resulting structure must be a λ-structure i.e. a labeled noncrossing structure obtained from a PK structure and a blueprint. Detailed formulas for these labels are given in [32]. i )i . The key point here is to Arc-removal: this means to remove the outer arc, a, of the nonterminal L(ℓ (t i )i provide this arc with a label, σ a consistent with the label associated with the generated nonterminals. (l i )i −σ a ′ d σa L(l(ti ))i → d σ a S(p ) i i (6) i i (l′ ) Given L(l(ti ))i any arc-removal based production rule generates the nonterminal S(ti′ )i , where t′i ≤ t i and i i i i l i − l′i ≤ 1. This implies for the label of the removed arc σ a |h = (l h − l′h ), 1 ≤ h ≤ r. It is proved in [32] that this grammar is context free. Slicing can be used to convert a crossing structure to a noncrossing one. Then one can apply the classic grammar, mentioned in Section 2.2, to the latter to decompose the former. If the structure has genus g then we will need exactly g slicing operations. A blueprint [32] for a unicellular map G is a sequence {(G i , h i )}i∈{0,N} of unicellular maps G i and a trisection h i for G i such that G0 = G, G i+1 is obtained from G i by slicing the latter at h i and the genus of G N is zero (so h N is empty). Note that N is not necessarily determined by g. Note also that all G i have the same number of edges. Since G N has genus zero, it is a planar tree. We now recall from [32] a way of adding labels to the vertices of this tree so that we can recover G from it (without having to know the other G i in the blueprint). The labels {κ v }v are elements of Z/2N and are assigned by backtracking the vertices through the slicings. We first assign unreduced labels κ′v . Starting from a vertex v =: v N of G N , it descends from a unique vertex v N−1 of G N−1 . Passing from G N−1 to G N , v N−1 was either sliced or not. If so then the (N − 1)’st coordinate of κ′v would be 1 and otherwise 0. We then do the same for v N−1 and if it was the result of a slicing of a vertex v N−2 on G N−2 then the (N − 2)’nd coordinate of κ′v would be one. In general the i’th coordinate of κ′v is 1 if and only if the vertex v is the descendant of a vertex which was involved in a slicing in G i . Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 13 We then reduce the κ′v to obtain κ v . Among the vertices that result from the slicing of the same vertex at stage i only one of them has the i’th coordinate of its label equal to 1 and this vertex is chosen to be the one which is minimal in the <𝛾 order. As a result the sum of all 1’s in all the labels κ v equals 2g + N. The noncrossing graph G N together with the labeling κ v is called the λ-structure associated to the blueprint. In Figure 11 we display an example of a blueprint for a PK structure and the associated λ-structure. Figure 11: A PK structure (left), a blue print of it (center) and the associated λ-structure (right). The blueprint can be reconstructed from G N and the labeling as follows. We glue the set of all vertices v of G N for which κ v |N = 1. This gives us a unicellular map which is the same as G N−1 . We label it by setting the last coordinate of κ v to zero if v was not involved in the aforementioned gluing and the sum of the labels of the glued vertices (sans the last coordinate) otherwise. We can then iterate this process until we obtain G0 of genus g. 4.2 Folding pseudoknotted structures The main difficulty in folding PK structures is that, because of the intersections between the arcs a useful recursion for exhausting these structures (similar to the one given in Section 2.2 for noncrossing structures) eluded researchers for a long time. For certain restricted classes of pseudoknots, Polynomial-time dynamic programming (DP) algorithms can be devised. In contrast to the O(N 2 ) space and O(N 3 ) time solution for simple secondary structures [51, 79, 83], however, most of these approaches are computationally much more demanding. The design of pseudoknot folding algorithms thus has been governed more by the need to limit computational cost and achieve a manageable complexity of the recursion rather than the conscious choice of a particularly natural search space of RNA structures. The following references provide a list of DP approaches to RNA structure prediction using different structure classes characterized in terms of recursion equations and/or stochastic grammars: [1, 9, 13, 19, 21, 36, 42, 44, 47, 58, 63, 76]. The inter-relationships of some of these classes of RNA structures have been clarified in part by [17] and [65]. In addition to these exact algorithms, a plethora of heuristic approaches to pseudoknot prediction have been proposed in the literature; see e.g., [15, 49] and the references therein. Prior to [59] there had been a few attempts at classifying PK structures for example [27] suggested using book-embeddings, [34] focused on the maximal set of pairwise crossing base pairs, and [7] based the classification on topological embeddings. While these classifications have in common that simple secondary structure forms the most primitive class of structures, they differ already in the construction of the first nontrivial class of pseudoknots. A workable approach to 3-noncrossing structures requires the enumeration of an exponentially growing number of diagrams which are then “filled in” by means of DP [33]; a Monte-Carlo approach utilizing the topological approach with a very simple matching-like energy model was explored by [77]. Unauthenticated Download Date | 6/16/17 9:06 PM 14 | Fenix Huang, Christian Reidys, and Reza Rezazadegan The first polynomial time algorithm for folding a class of PK structures was given in [59]. It relies on an efficient topological recursion for enumerating PK structures. One considers the shadow of an RNA structure which is given by removing all unpaired bases and collapsing each set of parallel arcs (called a stack) into a single arc. The shadow of a structure has the same genus and number of loops as the structure itself. A shadow is called irreducible if for each two arcs in it there is a sequence of mutually intersecting arcs from one to the other. Each shadow can be decomposed into a composition of irreducible pieces by iteratively removing nested or concatenated components. A shadow is called a 𝛾 -structure if all its irreducible components have genus ≤ 𝛾 . It is shown in [59] that there are finitely many irreducible 𝛾 structures for a given 𝛾 . It is also shown that there are exactly four irreducible 1-structures, see Fig. 12. H K L M Figure 12: The four irreducible genus one shadows Then a MCFG, called R1 , for decomposing any 1-structure is given [59]. This grammar is as follows. – Non-terminal S, representing noncrossing secondary structure elements according to the rules given above. – Non-terminals I and T, representing an arbitrary 1-structure where I can have genus zero but the genus of T is positive. − → – Non-terminals X = [X1 , X2 ] with two components used to represent a fragment-pair with nested arcs, X ∈ {H, K, L, M }. – Terminals (X, )X denoting the opening and closing of a base pair, respectively, where X is one of the types H, K, L or M. Different brackets as well as the different non-terminals of pattern X are used to distinguish nestings of the various types of shadows. Finally, we specify the production rules of the unambiguous MCFG R1 . I → S|T (7) S → (S)S| • S|ϵ (8) T → I(T)S (9) T → IA1 IB1 IA2 IB2 S (10) T → IA1 IB1 IA2 IC1 IB2 IC2 S (11) T → IA1 IB1 IC1 IA2 IB2 IC2 S (12) T → IA1 IB1 IC1 IA2 ID1 IB2 IC2 ID2 S (13) → [(XIX1 , X2 I)X] | [(X, )X] (14) − → X Here X ∈ {H, K, L, M } indicates one of the four types of irreducible 1-structures. In the above formulas a pair of parenthesis indicates a base pair and a bullet is an unpaired base. In words the equations (9)-(13) mean that an arbitrary 1-structure may be decomposed into an irreducible 1-structure with other 1-structures nested under its arcs (but not intersecting them). It is proved in [59] that any RNA 1-structure can be uniquely decomposed via R1 and any diagram generated via R1 is a 1-structure. The resulting DP algorithm and software gfold can fold an RNA sequence of length N in O(N 6 ) time. It is worth noting that there is currently no consensus for the energy contribution of different types of pseudoknotted structures. gfold is capable of Boltzmann sampling PK structures as well. Recall that the Boltzmann ensemble [22] of a given RNA sequence is the probability space of all the structures compatible with it. The central notion Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | here is the partition function of an RNA sequence [48]. For a sequence σ it is given by ∑︁ Q(σ) = exp(−E(S, σ)/RT) 15 (15) S where the sum ranges over all the structures S compatible with σ and R, T are the universal gas constant and the temperature respectively. The probability that σ will fold to a specific structure S is then given by exp(−E(S, σ)/RT)/Q(σ). In [48] McCaskill observed that the dynamic programming routines folding MFE structures [48] allows one to compute the partition function of all possible noncrossing structures for a given sequence. Predictions such as base pairing probabilities are obtained in [29, 30] and are parallelized in [24]. In [20, 74] a statistically valid sampling of secondary structures in the Boltzmann ensemble is derived and is used to calculate the sampling statistics of structural features. The first algorithm for computing a partition function that includes PK structures was given in [21]. The advantage of gfold over this algorithm is that it uses different penalties for different types of irreducible 1-structures. One reason that Boltzmann sampling is useful is that because of the inaccuracies in the energy parameters in the thermodynamic model, the MFE structure may not coincide with the natural structure to which a sequence folds. So instead of focusing on the MFE structures one can sample structures from the Boltzmann ensemble. In the next subsection we will see how fatgraphs can be useful in the closely related method of inverse sampling. 4.3 Inverse folding of PK structures using fatgraphs Given a structure S, inverse folding involves finding an RNA sequence σ which folds into S. Inverse folding is useful for example in designing synthetic noncoding RNA molecules which may be used as drugs or for regulating gene expression. In this section, after recalling the basics of inverse folding RNA sequences to noncrossing structures, we outline upcoming work that uses the CFG from subsection 4.1 to inverse fold such sequences to PK structures. In this article we use the thermodynamic model for RNA folding so we are looking for a sequence whose minimum free energy (MFE) structure in the Turner model equals S. Note that due to the fact that a lot of sequences fold to the same structure, the existence of such σ is not guarantied for all structures S [68]. So one may weaken the condition on σ to be a sequence whose MFE structure has minimum distance to S, where distance is measured using some secondary structure metric [50] which can simply be the number of the bases in which the two structures do not agree. However even this relaxed criterion can be computationally quite expensive and so the existing algorithms use heuristic methods to find a sequence which is reasonably close to being the MFE sequence. The existing algorithms fall into two general categories: local search and global search. Local search algorithms [30] involve two steps. Given a structure, S, the first step is to find a “suitable” sequence σ0 which is compatible with S and is usually chosen heuristically so that it does not lie very far from the desired inverse fold sequence. The second step involves a walk in the sequence space that given a sequence σ i (with σ0 as starting point) finds another σ i+1 in the neighborhood of it such that the distance of σ i+1 ’s MFE structure to S (computed using a secondary structure metric) is smaller than that of σ i . The algorithm stops when there is no such neighbor to be found. (Recall that the neighbors of a sequence are the ones which differ with it only in one nucleotide i.e. point mutations.) In the original algorithm for inverse folding RNAInverse [30] the initial sequence σ0 was chosen at random. Later in [8] this sequence was taken to be the one which minimizes energy among all compatible pairs (σ0 , S). This is called the MFE sequence. The MFE sequence is a good choice because the inverse fold is expected to have a low energy w.r.t. S [41, 61]. This sequence can be found in linear time (in the length of S) using a DP algorithm, due to Busch and Backofen [8]. The main rule of thumb for computing the MFE sequence of a secondary structure in [8] is that if we know the MFE sequence of a closed substructure (i.e. one without any arcs exiting it and has an arc between its outermost bases) for all the six possible assignments of base pairs to its outer arc then for computing the MFE Unauthenticated Download Date | 6/16/17 9:06 PM 16 | Fenix Huang, Christian Reidys, and Reza Rezazadegan sequence of the whole structure we don’t need to know anything more about this substructure. The algorithm starts with computing the energies of the base pairs which are maximally nested under other arcs, for all the 6 possible combinations for each. Having computed the MFE of a substructure S i for all the allowed base assignments it then proceeds to compute the MFEs of the substructure S i+1 which is obtained from S i by adding the base pair which is immediately larger than the largest base pair in S i . If each base pair had only one predecessor (w.r.t. <) then the time complexity of this algorithm would be proportional to the number of base pair times the number of allowed pairs i.e. linear in the length of the structure. However the closing arc of a multiloop has more than one predecessor and so the time complexity of this algorithm would be exponential in the number of loops inside the multiloop. This problem can be resolved by assigning a DP matrix to each multiloop. This basically means that instead of taking into account all the possible assignments of bases to the loops in a multiloop, one computes the MFE, for each one and for each assignment of its closing base pair. The first algorithm for inverse folding pseudoknotted structures was given in [25]. It can find the MFE sequence for a structure with at most two crossings and it uses a random sequence for the first step. The difficulty of finding the MFE sequence for a PK structure lies in the fact that up until recently there was no grammar for decomposing PK structures. Such a grammar was given in [32] and was described in Section 4.1. In a future work we use this grammar to decompose a given PK structure and then find its MFE sequence. This procedure is somewhat similar to that of Busch and Backofen but with the CFG from [32] used instead of the nestedness order to decompose a structure. Once the MFE sequence is found, a stochastic local search algorithm is used to find the inverse fold. The following is a list of different modules involved in this line of work. – The context free grammar of [32]. Given a PK structure S, we pick a blueprint for its associated fatgraph and using the method of [32] obtain a labeled noncrossing structure (G, κ v ). Then the grammar from the aforementioned paper is used to decompose this latter structure into a set of labeled noncrossing arcs. – An algorithm for finding the MFE sequence of a PK structure. After performing the above step one can obtain the MFE sequence for S by finding the minimum free energy sequence for the labeled noncrossing structure (G, κ v ). This is done in a similar manner to the Busch-Backofen algorithm described above. – A stochastic local search algorithm. This step is very similar to the corresponding steps in [8] and [25] i.e. starting with the MFE sequence it recursively finds sequences the distance of whose MFE structure (in base pair metric) is closer and closer to S. – gfold. The local search step above involves folding the sequence obtained in each step to measure the distance of its MFE structure with S. Since in our case S is pseudoknotted, we use gfold to fold σ to see whether its MFE structure is close to S. Just as in the case of [8] we expect using the MFE sequence instead of a random sequence (as done in [25]) will dramatically decrease the number of the iterations needed in the stochastic local search step. This is significant in the light of the fact that folding for PK structures takes much longer time than for the noncrossing structures considered in [8]. A problem closely related to inverse folding is inverse sampling. Given an RNA secondary structure S inverse sampling for S involves assigning a probability to each sequence compatible with it. The probability measures the likelihood of the sequence to fold into S. In [62] and independently in [5] a dual partition function is assigned to the ensemble of sequences σ compatible with a given structure S. (See also [78].) Let E(S, σ) denote the energy of the pair (S, σ). The dual partition function is given by ∑︁ Q(S) = exp(−E(S, σ)/RT) (16) σ where the sum is over all sequences compatible with S. This partition function, for noncrossing structures, can be computed using a DP algorithm similar to that for the usual structure-space partition function. Since probability is in reverse proportion with energy, the MFE sequence is the one with the highest probability. This Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 17 means that if we randomly sample sequences compatible with S, the sequence with the highest frequency is likely to be the MFE sequence. In [62] this partition function is used to obtain an inverse sampling algorithm for noncrossing RNA structures. This inverse sampling can be viewed as a global search for the MFE sequence. Another advantage of the algorithm in [62] is that it can control the number of C-G base pairs in the sampled sequences. In [5] inverse sampling of noncrossing structures is used to assign a mutual information to a sequence-structure pair. For a given structure compatible sequences whose mutual information is close to each other are believed to contain similar genetic data. The substructures with fewest inverse samples are where the hidden patterns exist. The fatgraph driven CFG from [32] makes it possible to extent this paradigm to pseudoknotted structures. 5 Conclusion In this article we recalled the importance of predicting RNA structure and how the innovative use of fatgraphs in this realm can help us understand and predict RNA structure better, specially when taking pseudoknotted structures or non-canonical base pairs into account. Fatgraphs have more structure than ordinary graphs and this extra structure enables one to use tools from topology for studying them augmenting the combinatorial methods. Classically, when considering only Watson-Crick and wobble base pairs, RNA secondary structures have been represented by graphs. In this model the backbone is drawn as a horizontal line and base pairs are denoted by arcs drawn in the upper half-plane. Later studies [39] show that modeling noncanonical base pairs between nucleotides requires keeping track of the relative orientations of the strands to which the interacting bases belong. In [39] a nucleotide is modeled as a triangle having two sides (faces). For one side the three edges of the nucleotide (Watson-Creek, Hoogsteen and sugar) follow counterclockwise, and for the other they follow clockwise. The graph model of RNA secondary structure is not capable of describing these newly discovered features, while fatgraph model can achieve this easily. Furthermore, cis and trans base pairs can be represented by untwisted and twisted ribbons respectively in the fatgraph model. Fatgraphs bring a number of other advantages to the study of RNA. First of all the genus of fatgraphs can be used to filter and classify PK structures. Classifying RNA structures including pseudoknots by topological genus was studied in [6, 52, 59]. Although there are a lot of different classification schemes for pseudoknots, the fatgraph model seems to be the best of them. For one thing there are a finite number of shapes for any given genus [59] (see Section 4.2) and also adding parallel arcs does not increase genus. This makes biological sense since adding parallel arcs only increases the size of a helix and does not increase the complexity of the structure. For another reason, in the fatgraph model, various loops to which a structure is decomposed in the nearest neighbor thermodynamic model [46], naturally correspond to the boundary components of the fatgraph associated to the structure. The use of fatgraphs also makes it possible to effectively decompose more complicated PK structures into simpler ones as exemplified by the CFG from [32] (Section 4.1). In that paper a method is developed to transfer a PK structure into a labeled secondary structure, called a λ-structure. This method gives us the first contextfree grammar for PK structures. One future application of this grammar is in inverse folding RNA structures. That is to find a sequence which folds to a given structure. This problem arises for example in designing new drugs or gene regulators [30]. If the given structure is noncrossing, one approach is studied in [30] and refined in [8]. It consists of two parts. One first finds a sequence σ0 which is compatible with the given structure and in the second step one uses a stochastic local search algorithm to find sequences σ i which fold closer and closer to the desired structure. The second step involves folding each individual σ i and so is time consuming. Therefore it was shown in [8] that if instead of picking σ0 at random as in [30], one uses the minimum free energy sequence for the structure, the number of iterations is reduced significantly. Despite their importance in biological processes [37, 43, 72], there has been no efficient algorithm for inverse folding pseudoknotted RNA structures. The only existing algorithm was given in [25] and it uses a two Unauthenticated Download Date | 6/16/17 9:06 PM 18 | Fenix Huang, Christian Reidys, and Reza Rezazadegan step method similar to [30] with a random sequence as the seed. Since folding sequences to PK structures is much more time consuming [59], if we can reduce the number of iterations in the second step by finding the MFE sequence of a PK structure, it will improve the inverse folding algorithm dramatically. The fatgraph model for RNA and specifically the CFG in [32] (Section 4.1) makes it possible to effectively decompose the λ-structure associated to a PK structure to find the MFE sequence. Using this method we can also extend the results in [5] which samples sequences for a given secondary structure to include PK structures. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discr. Appl. Math., 104:45–62, 2000. Sidney Altman. Enzymatic cleavage of rna by rna. Bioscience Reports, 10(4):317–337, 1990. Jørgen E. Andersen, Robert C. Penner, Christian M. Reidys, and Michael S. Waterman. Topological classification and enumeration of RNA structrues by genus. J. Math. Biol., 2012. prepreint. Maximillian H. Bailor, Xiaoyan Sun, and Hashim M. Al-Hashimi. Topology links RNA secondary structure with global conformation, dynamics, and adaptation. Science, 327:202–206, 2010. Christopher Barrett, Fenix Huang, and Christian Reidys. Sequence-structure relations of biopolymers. M. Bon, C. Micheletti, and H. Orland. Mcgenus: a monte carlo algorithm to predict RNA secondary structures with pseudoknots. Nucleic Acids Res., 41(3):1895–900, 2013. Michael Bon, Graziano Vernizzi, Henri Orland, and A. Zee. Topological classification of RNA structures. J. Mol. Biol., 379:900– 911, 2008. A. Busch and R. Backofen. INFO-RNA–a fast approach to inverse rna folding. Bioinformatics, 22(15):1823–31, 2006. Liming Cai, Russell L. Malmberg, and Yunzhou Wu. Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics, 19 S1:i66–i73, 2003. S. Cao and S. J. Chen. Predicting RNA pseudoknot folding thermodynamics. Nucl. Acids. Res., 34(9):2634–2652, 2006. Thomas R. Cech. Self-splicing and enzymatic activity of an intervening sequence rna from tetrahymena. Bioscience Reports, 10(3):239–261, 1990. G. Chapuy. A new combinatorial identity for unicellular maps, via a direct bijective approach. Adv. Appl. Math., 47(4):874– 893, 2011. Ho-Lin Chen, Anne Condon, and Hosna Jabbari. An O(n5 ) algorithm for MFE prediction of kissing hairpins and 4-chains in nucleic acids. J. Comp. Biol., 16:803–815, 2009. J. L. Chen and C. W. Greider. Functional analysis of the pseudoknot structure in human telomerase RNA. Proc. Natl. Acad. Sci. USA, 102(23):8080–5, 2005. S. J. Chen. RNA folding: conformational statistics, folding kinetics, and ion electrostatics. Annu Rev Biophys, 37:197–214, 2008. J. Cheng, P. Kapranov, J. Drenkow, S. Dike, S. Brubaker, S. Patel, J. Long, D. Stern, H. Tammana, G. Helt, V. Sementchenko, A. Piccolboni, S. Bekiranov, D. K. Bailey, M. Ganesh, S. Ghosh, I. Bell, D. S. Gerhard, and T. R. Gingeras. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science, 308(5725):1149–54, 2005. Anne Condon, Beth Davy, Baharak Rastegari, Shelly Zhao, and Finbarr Tarrant. Classifying RNA pseudoknotted structures. Theor. Comp. Sci., 320:35–50, 2004. Rhiju Das and David Baker. Automated de novo prediction of native-like rna tertiary structures. Proceedings of the National Academy of Sciences, 104(37):14664–14669, 2007. Jitender S. Deogun, Ruben Donis, Olga Komina, and Fangrui Ma. RNA secondary structure prediction with simple pseudoknots. In Proceedings of the second conference on Asia-Pacific bioinformatics (APBC 2004), pages 239–246. Australian Computer Society, 2004. Y. Ding and C. E. Lawrence. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res., 31:7280–7301, 2003. R. M. Dirks and N. A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Comput. Chem., 24:1664–1677, 2003. Philippe Duchon, Philippe Flajolet, Guy Louchard, and Gilles Schaeffer. Boltzmann samplers for the random generation of combinatorial structures. Combinatorics, Probability and Computing, 13:2004, 2004. S. R. Eddy. Non-coding RNA genes and the modern rna world. Nat. Rev. Genet., 2(12):919–29, 2001. M. Fekete, I.L. Hofacker, and P.F. Stadler. Prediction of RNA base pairing probabilities on massively parallel computers. J. Comput. Biol., 7:171–182, 2000. J. Z Gao, L. Y. Li, and C. M. Reidys. Inverse folding of RNA pseudoknot structures. Algorithms Mol Biol., 5:R27, 2010. Walter Gilbert. Origin of life: The rna world. Nature, 319(6055):618–618, Feb 1986. Unauthenticated Download Date | 6/16/17 9:06 PM Fatgraph models of RNA structure | 19 [27] Christian Haslinger and Peter F. Stadler. RNA structures with pseudo-knots: Graph-theoretical and combinatorial properties. Bull. Math. Biol., 61:437–467, 1999. [28] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002. [29] I. L. Hofacker. The vienna RNA secondary structure server. Nucl. Acids Res., 31:3429–3431, 2003. [30] Ivo L Hofacker, Walter Fontana, Peter F Stadler, L Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster. Fast folding and comparison of RNA secondary structures. Monatsh. Chem., 125:167–188, 1994. [31] Stephen R. Holbrook and Sung-Hou Kim. Rna crystallography. Biopolymers, 44(1):3–21, 1997. [32] F. Huang and C. M. Reidys. Topological language for RNA. arXiv:1605.02628, 2016. [33] Fenix Huang, Wade Peng, and C. M. Reidys. Folding 3-noncrossing RNA pseudoknot structures. J. Comp. Biol., 16:1549–1575, 2009. [34] Emma Y. Jin, Jing Qin, and C. M. Reidys. Combinatorics of RNA structures with pseudoknots. Bull. Math. Biol., 70:45–67, 2008. [35] Philipp Kapranov, Jill Cheng, Sujit Dike, David A. Nix, Radharani Duttagupta, Aarron T. Willingham, Peter F. Stadler, Jana Hertel, Jörg Hackermüller, Ivo L. Hofacker, Ian Bell, Evelyn Cheung, Jorg Drenkow, Erica Dumais, Sandeep Patel, Gregg Helt, Madhavan Ganesh, Srinka Ghosh, Antonio Piccolboni, Victor Sementchenko, Hari Tammana, and Thomas R. Gingeras. Rna maps reveal new rna classes and a possible function for pervasive transcription. Science, 316(5830):1484–1488, 2007. [36] Yuki Kato, Hiroyuki Seki, and Tadao Kasami. RNA pseudoknotted structure prediction using stochastic multiple context-free grammar. IPSJ Digital Courier, 2:655–664, 2006. [37] D. Konings and R. Gutell. A comparison of thermodynamic foldings with comparatively derived structures of 16s and 16s-like rRNAs. RNA, 1:559–574, 1995. [38] Seigei Lando and Alexander Zvonkin. Graphs on surfaces and applications. Springer, 2004. [39] N B Leontis and E Westhof. Geometric nomenclature and classification of rna base pairs. RNA, 7(4):499–512, 2001. [40] Neocles Leontis and Eric Westhof. RNA 3D Structure Analysis and Prediction (Nucleic Acids and Molecular Biology). Springer, 2012. [41] A. Levin, M. Lis, Y. Ponty, C. W. O’Donnell, S. Devadas, B. Berger, and J. Waldispühl. A global sampling approach to designing and reengineering RNA secondary structures. Nucl. Acids Res., 40(20):10041–10052, 2012. [42] Hengwu Li and Daming Zhu. A new pseudoknots folding algorithm for RNA structure prediction. In L. Wang, editor, COCOON 2005, volume 3595, pages 94–103, Berlin, 2005. Springer. [43] A. Loria and T. Pan. Domain structure of the ribozyme from eubacterial ribonuclease. RNA, 2:551–563, 1996. [44] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. J. Comp. Biol., 7:409–427, 2000. [45] Alexander Marchanka, Bernd Simon, Gerhard Althoff-Ospelt, and Teresa Carlomagno. Rna structure determination by solidstate nmr spectroscopy. Nature Communications, 6:7024 EP –, May 2015. Article. [46] D Mathews, M Disney, J Childs, S Schroeder, M Zuker, and D. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci, 101:7287–7292, 2004. [47] Hiroshi Matsui, Kengo Sato, and Yasubumi Sakakibara. Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics, 21:2611–2617, 2005. [48] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990. [49] Dirk Metzler and Markus E. Nebel. Predicting RNA secondary structures with pseudoknots by MCMC sampling. J. Math. Biol., 56:161–181, 2008. [50] Vincent Moulton, Michael Zuker, Michael Steel, Robin Pointon, and David Penny. Metrics on rna secondary structures. Journal of Computational Biology, 7(1-2):277–292, 2004. [51] R. Nussinov, G. Piecznik, J. R. Griggs, and D. J. Kleitman. Algorithms for loop matching. SIAM J. Appl. Math., 35(1):68–82, 1978. [52] H. Orland and A. Zee. RNA folding and large n matrix theory. Nuclear Physics B, 620:456–476, 2002. [53] M. Parisien and F. Major. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature, 452:51–55, 2008. [54] R. C. Penner. Cell decomposition and compactification of Riemann’s moduli space in decorated Teichmüller theory. In Nils Tongring and R. C. Penner, editors, Woods Hole Mathematics-perspectives in math and physics, pages 263–301. World Scientific, Singapore, 2004. arXiv: math. GT/0306190. [55] R. C. Penner, Michael Knudsen, Carsten Wiuf, and Jørgen Ellegaard Andersen. Fatgraph models of proteins. Comm. Pure Appl. Math., 63:1249–1297, 2010. [56] R. C. Penner and M. S. Waterman. Spaces of RNA secondary structures. Adv. Math., 101:31–49, 1993. [57] Robert C. Penner. Decorated Teichmüller theory. QGM Master Class Series. European Mathematical Society (EMS), Zürich, 2012. With a foreword by Yuri I. Manin. [58] Jens Reeder and Robert Giegerich. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5:104, 2004. [59] C. M. Reidys, F. Huang, J. E. Andersen, R. C. Penner, P. F. Stadler, and M. E. Nebel. Topology and prediction of RNA pseudoknots. Bioinformatics, 27:1076–1085, 2011. [60] Christian Reidys. Combinatorial Computational Biology of RNA. Springer, 2011. Unauthenticated Download Date | 6/16/17 9:06 PM 20 | Fenix Huang, Christian Reidys, and Reza Rezazadegan [61] V. Reinharz, Y. Ponty, and J. Waldispühl. A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution. Bioinformatics, 29(13):308–315, 2013. [62] Vladimir Reinharz, Yann Ponty, and Jérôme Waldispühl. A weighted sampling algorithm for the design of rna sequences with targeted secondary structure and nucleotide distribution. Bioinformatics, 29(13):i308–i315, 2013. [63] Elena Rivas and Sean R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285:2053–2068, 1999. [64] Elena Rivas and Sean R. Eddy. The language of RNA: A formal grammar that includes pseudoknots. Bioinformatics, 16:334– 340, 2000. [65] Einar Andreas Rødland. Pseudoknots in RNA secondary structures: Representation, enumeration, and prevalence. J. Comp. Biol., 13:1197–1213, 2006. [66] Alexander S. Rose, Anthony R. Bradley, Yana Valasatava, Jose M. Duarte, Andreas Prlić, and Peter W. Rose. Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology, Web3D ’16, pages 185–186, New York, NY, USA, 2016. ACM. [67] Alexander S. Rose and Peter W. Hildebrand. Ngl viewer: a web application for molecular visualization. Nucleic Acids Research, 43(W1):W576–W579, 2015. [68] Peter Schuster. How to search for rna structures theoretical concepts in evolutionary biotechnology. Journal of Biotechnology, 41(2):239 – 257, 1995. [69] David B. Searls. The language of genes. Nature, 420(6912):211–217, Nov 2002. [70] H Shi and P B Moore. The crystal structure of yeast phenylalanine trna at 1.93 a resolution: a classic structure revisited. RNA, 6(8):1091–1105, 2000. [71] T.F. Smith and M.S. Waterman. RNA secondary structure. Math. Biol., 42:31–49, 1978. [72] David W Staple and Samuel E Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biol., 3:e213, 2005. [73] J E Tabaska, R B Cary, H N Gabow, and G D Stormo. An RNA folding method capable of identifying pseudoknots and base triples. Bioinformatics, 14:691–699, 1998. [74] Manfred Tacker, Peter F. Stadler, Erich G. Bornberg-Bauer, Ivo L. Hofacker, and Peter Schuster. Algorithm independent properties of RNA structure prediction. Eur. Biophy. J., 25:115–130, 1996. [75] C. Tuerk, S. MacDougal, and L. Gold. RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase. Proc. Natl. Acad. Sci. USA, 89(15):6988–6992, 1992. [76] A. Uemura Y., Hasegawa, S. Kobayashi, and T. Yokomori. Tree adjoining grammars for RNA structure prediction. Theor. Comp. Sci., 210:277–303, 1999. [77] Graziano Vernizzi and Henri Orland. Large-N random matrices for RNA folding. Acta Phys. Polon., 36:2821–2827, 2005. [78] J. Waldisp¨ uhl, B. Behzadi, and J.-M. Steyaert. An approximate matching algorithm for finding (sub-)optimal sequences in s-attributed grammars. Bioinformatics, 18(suppl 2):S250–S259, 2002. [79] M. S. Waterman. Secondary structure of single-stranded nucleic acids. Adv. Math. (Suppl. Studies), 1:167–212, 1978. [80] E. Westhof and L. Jaeger. RNA pseudoknots. Curr. Opin. Struct. Biol., 2:327–333, 1992. [81] Xiaojun Xu, Peinan Zhao, and Shi-Jie Chen. Prediction of RNA base pairing probabilities on massively parallel computers. PLoS ONE, 2014. [82] M Zuker. On finding all suboptimal foldings of an RNA molecule. Science, 244:48–52, 1989. [83] M. Zuker and P. Stiegler. Optimal computer folding of larger RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9:133–148, 1981. Unauthenticated Download Date | 6/16/17 9:06 PM
© Copyright 2026 Paperzz