Fatgraph models of RNA structure

Mol. Based Math. Biol. 2017; 5:1–20
Review Article
Open Access
Fenix Huang, Christian Reidys, and Reza Rezazadegan*
Fatgraph models of RNA structure
DOI 10.1515/mlbmb-2017-0001
Received October 23, 2016; accepted December 25, 2016
Abstract: In this review paper we discuss fatgraphs as a conceptual framework for RNA structures. We discuss various notions of coarse-grained RNA structures and relate them to fatgraphs. We motivate and discuss
the main intuition behind the fatgraph model and showcase its applicability to canonical as well as noncanonical base pairs. Recent discoveries regarding novel recursions of pseudoknotted (pk) configurations as
well as their translation into context-free grammars for pk-structures are discussed. This is shown to allow
for extending the concept of partition functions of sequences w.r.t. a fixed structure having non-crossing arcs
to pk-structures. We discuss minimum free energy folding of pk-structures and combine these above results
outlining how to obtain an inverse folding algorithm for PK structures.
Keywords: RNA, pseudoknot, fatgraph, genus, context free grammar
1 Introduction
RNA molecules are polymers in nucleotides guanine (G), uracil (U), adenine (A) and cytosine (C). They are
essential not only in the production of proteins from genes in DNA (coding RNA) but also as noncoding RNA
molecules which have diverse functions in living organisms such as translation of proteins in the ribosome,
gene regulation and as enzymes [16, 23]. The first noncoding RNA molecules to be discovered were ribozymes.
These molecules work as enzymes and were discovered by Cech and Altman in the early 1980s [2, 11]. This discovery supported the RNA World Hypothesis [26] that the earliest forms of life relied solely on RNA molecules
for both the storage of genetic material and the catalysis of chemical reactions. Other types of noncoding
RNA include transfer RNA (tRNAs) that provide a physical link between messenger RNA and the amino acids
of the resulting protein, long noncoding RNA and small nuclear RNA snoRNA. In fact only one fifth of the
transcription in the human genome comes from genes that encode proteins [35].
The main difference between RNA and DNA, apart from the existence of the nucleotide U instead of T
(thymine), is that RNA is mostly single stranded and hence its nucleotides can form bonds with each other.
This results in the molecule folding to form a 3-dimensional structure. The function of a noncoding RNA
molecule is determined by this 3-dimensional spatial configuration, called the tertiary structure. Therefore
determining this structure i.e. measuring the coordinates of each atom in the molecule at near Ångstrom
resolution is of utmost importance. In Figure 1 we see a depiction of this three dimensional structure of the
yeast phenylalanine transfer RNA.
The experimental determination of RNA tertiary structure uses techniques such as X-ray crystallography
[31] and NMR [45]. These methods are expensive and time consuming; for example crystallization of RNA is
challenging because of its flexibility. Therefore one is interested in the algorithmic prediction of RNA struc-
*Corresponding Author: Reza Rezazadegan: Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, VA 24060,
Blacksburg, U.S.A., E-mail: [email protected]
Fenix Huang: Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, VA 24060, Blacksburg, U.S.A., E-mail:
[email protected]
Christian Reidys: Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, VA 24060, Blacksburg, U.S.A., E-mail:
[email protected]
© 2017 Fenix Huang et al., licensee De Gruyter Open.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.
Unauthenticated
Download Date | 6/16/17 9:06 PM
2 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
Figure 1: The tertiary structure of the yeast phenylalanine tRNA with PDB ID 1EHZ. The three dimensional structure of this
molecule was determined in [70]. Graphics generated using NGL [66, 67].
ture. There are currently several programs available for predicting RNA 3D structure such as MC-fold [53],
Vfold [81], FARNA [18], etc. See [40] for a more comprehensive list. However the existing algorithms are computationally expensive and don’t yield precise results [40, Chapter 4]. For this reason most of the effort in the
prediction of RNA structure has been focused on predicting the coarse-grained RNA structure. This means
that instead of the coordinates of each one of the many atoms in each base one is interested in determining the position of one or more precisely defined points in each nucleotide such as the center of mass or the
positions of a few specific atoms.
76
5’
(A)
(B)
3’
A
G
C
G 70
G
A
U
60
U
C U
A
G A C A C
A
G
U G A
C U C G10
C U G U G
C
U
U U
C 50
G
G A G C
U
GG A
G
A
C G
G G
20
C G
A U
30 G C 40
A U
C
A
G
U
G A A
C
G
C
U
U
A
U A
A
10
C G C
C
A G C U U U C
C
A
20
C C C G A U U G U U G G A A G
A
A
A
C
C
20
10
60
50
40
30
70
5’
G
30 A
G
40 C U G G A A
G G C U G A C
1
1
10
20
30
40
48
3’ 48
Figure 2: Examples of RNA secondary structures with and without crossings and their associated graphs
An even simpler structural measure is given by the secondary structure of RNA [4, 79]. Note that both the
tertiary and coarse-grained structure of RNA are concerned with how the molecule embeds into the space.
The secondary structure on the other hand is only concerned with knowing which bases pair with which. As
such it can be presented as an abstract graph (i.e. one that does not come with an embedding into space).
The nodes of this graph are the bases and edges are either given by the covalent bonds in the backbone or the
base pairs between distant bases. Two examples of such structures are given in Fig. 2. Secondary structure is
much simpler than the tertiary one but it still retains a great deal of information about molecule’s structure.
In the literature the term secondary structure is usually used for the structures in whose graph, the edges do
not intersect, i.e. noncrossing structures, but in this article we use this term to refer to any graph of pairings
between bases.
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
3
The graph presentation of RNA structure [73] has become a ubiquitous tool in studying RNA. However in
recent years it has been noted that this presentation can be further enhanced into a fatgraph [7, 55] (Fig. 5).
Recall that secondary structure is a coarse-grained one in which each base is regarded as a point. An intermediate step between this and an all atom model of RNA is to consider the geometry of different types of bonds
between bases. The most stable type of bonds between RNA bases are the Watson-Crick bonds A-U and C-G.
However there is a variety of noncanonical base pairs that can be formed between nucleotides. Geometrically
each nucleotide is modeled by a triangle whose three edges are distinguished as the Watson-Crick edge, the
Hoogsteen edge and the sugar edge. Each side of one base can possibly bond with any side of another [39].
See Figure 3.
SU
H
W-
W-
G
C
H
W-
C
SU
G
H
W-
C
SU
G
H
SUG
G
H
C
H
W-
SU
G
W-
C
W-
C
5’
3’
3’
5’
H
W-
C
H
G
SU
SUG
SUG
H
SU
C
H
H
H
SUG
H
W-C
W-C
H
C
C
SUG
W-
W-
SUG
G
W-
C
SUG
SU
H
W-
G
SUG
H
SU
G
H
C
H
SU
C
W-C
W-
H
W-
H
G
G
W-C
G
C
SU
G
SU
G
W-C
SU
W-
H
SU
H
SU
G
W-C
G
W-C
(B)
SU
W-C
(A)
SU
H
C
5’
5’
3’
3’
H
W-
C
Figure 3: Twelve different types of noncanonical base pairs and the ribbons associated to them in the fatgraph. The labels WC, SUG and H denote the Watson-Creek, Hoogsteen and sugar edges of the nucleotide, see section 4 for details. The first row
shows the cis base pairs and their associated straight ribbon. The second row depicts the trans base pairs and their associated
twisted ribbon.
W
H
-C
Accordingly one can enhance the graph node associated to a base into a triangle implying that the edges
can also be enhances into ribbons. This turns out to be a useful byproduct since the bonds between two
edges can be either in the cis or trans orientation and this can be encoded by the representing ribbons being
straight or twisted. An enhanced graph whose edges are given by ribbons that can be either twisted or straight
is called ribbon graph or a fatgraph [57]. In Fig. 4 we present two examples of ribbon graphs associated to
RNA structures with noncanonical base pairs.
SUG
-C
H
W
SUG
(A)
(B)
(C)
Figure 4: A: The three edges of a nucleotide base in cis and trans orientations. B: For a trans base pair the associated edge in
the fatgraph is twisted. C: For a cis base pair the ribbon is untwisted.
Even when considering bases as simple points, fatgraphs are still relevant to RNA structure. One can
draw the backbone of the secondary structure on the x-axis and the arcs in the upper half plane. This way we
get a counterclockwise order on the edges incident on each vertex which we can use to traverse the graph.
The loops we get in such traversal are the same as the (hairpin, bulge, multi-, etc. ) loops in the RNA structure.
It can be easily seen (see Section 4) that a graph with a cyclic order on the the edges incident on each vertex
is equivalent to a ribbon graph. This presentation (with bases as single points) is useful in studying and
Unauthenticated
Download Date | 6/16/17 9:06 PM
4 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
predicting pseudoknotted structures as we will see below.
Pseudoknots are RNA structure motives that involve base pairs between bases that are neither nested nor
concatenated [10, 14, 69, 80]. This means that the contact structure of a pseudoknotted (PK for short) structure
contains arcs that cross each other. They were, and still are, mostly ignored because
i. the vast majority of experimentally determined RNA structures do not involve pseudoknots and
ii. predicting and even classifying pseudoknotted structures is theoretically and computationally much
more demanding compared to noncrossing structures.
In fact RNA secondary structure prediction problem with generalized pseudoknots, where the energy
function depends on adjacent base pairs, is NP-hard [1, 44]. For this reason, commonly used RNA secondary
structure prediction tools, including mfold [82], Vienna RNA [30] and NUPAK [21] exclude pseudoknots.
However the importance of PK structures is becoming increasingly more evident. Pseudoknots are for
example functionally important in tRNAs [43], telomerase RNA [72], and ribosomal RNAs [37]. The prediction
of RNA pseudoknotted structures can be seen as an intermediate between secondary and tertiary structure
prediction. In vitro selection experiments have produced pseudoknotted RNA families that bind to the HIV-1
reverse transcriptase [75].
There have been attempts to use the number of crossings to filter the set of PK structures compatible with
a given sequence [63]. However determining the class of structures that the algorithm recognizes is not easy
and in fact in the case of [63] this class was characterized in a subsequent paper [64]. Another issue with this
approach is that a set of parallel arcs intersecting another arc can result in a high number of crossings but
this does not result in a more complicated structure.
1
2
3
4
5
6
34
2 5
1 6
Figure 5: A fatgraph and its associated surface which has genus one
Fatgraphs turn out to be a powerful framework to describe PK structures. In particular they come with a
new numerical invariant called genus, i.e. the number of holes in the ribbon surface associated to it. Figure
5 depicts a fatgraph and its associated surface having genus one. Genus can be simply expressed in terms of
the number of the vertices, edges and boundary components of the fatgraph, see Section 4 for details. One
can use genus to filter the set of PK structures, in other words use genus as a measure of the complexity of
such a structure. As opposed to the number of crossings, genus is not changed when we collapse a stack.
The first application of genus to the study of PK structures was given in [7]. The authors study databases
of RNA structures and show that even for large RNA molecules of about 3 kilobases the genus remains smaller
than 18 but for random sequences having the same length it can go up to 400. In [3] the authors use the relation
between fatgraphs and the moduli space of Riemann surfaces [54] to compute the generating function for the
number of fatgraphs with a given genus and minimum number of parallel arcs. This is used to imply the
existence of neutral networks of distinct molecules with the same structures of any genus.
However neither of these two papers provide a grammar for decomposing irreducible (primitive) pseudoknots. This task was first done in [59]. In that paper genus is used to obtain a recursion and hence a (multiple
context free) grammar for decomposing PK structures. One important contribution of that paper is defining
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
5
the right class of PK structures, called 𝛾 -structures, whose numbers are manageable for a given genus. These
are PK structures without parallel arcs (called shadows) all whose irreducible components have genus ≤ 𝛾 .
A shadow is called irreducible if for each two arcs in it there is a sequence of mutually intersecting arcs from
one to the other.
It is shown in [59] that there are finitely many irreducible 𝛾 -structures for a given 𝛾 . It is also shown that
there are exactly four irreducible 1-structures. Then a grammar for decomposing any 1-structure is given.
As a result, a 1-structure can have arbitrarily high genus (and the number of crossings) but it is composed
of blocks of small complexity. The resulting DP algorithm and software gfold can fold an RNA sequence of
length N in O(N 6 ) time.
In a recent work of Huang and Reidys [32] a genus recursion due to Chapuy [12] is used to obtain a context free grammar (CFG) for decomposing any PK structure. Chapuy’s method adeptly finds vertices (called
trisections) at which the genus of a fatgraph is located. These are the vertices at which the order on the edges
coming from the cyclic order and the one coming from the traversal of the boundary do not match. Chapuy
introduces a slicing operation on such vertices that decreases genus by at least one. There is also the inverse
operation of gluing of vertices which increases genus. These operations are employed to obtain a combinatorial recursion for the number of fatgraphs with a given genus and number of edges, having only one boundary
component.
The algorithm of Huang and Reidys uses this recursion to convert any PK structure (which has nonzero
genus) to a noncrossing one (which has genus zero) together with a labeling of its arcs. The labeling can be
used to re-obtain the PK structure. The resulting noncrossing structure depends on a choice of trisections
and in it the backbone is permuted. CFG’s are simpler than MCFG’s and so the passage from MCFG to CFG
has implications for time and space complexity. Thus it is conceivable that a folding algorithm based on [32]
would be more efficient.
Inverse folding involves finding an RNA sequence that folds into a given structure S. It is useful in designing synthetic RNA molecules such as new drugs [30]. Even though local search algorithms for inverse folding
noncrossing structures have been around for a while [8, 29, 30], such algorithms for PK structures are new
[25]. The difficulty again arises from decomposing a PK structure into simple pieces. The local search inverse
folding algorithms are comprised of two steps. The first step involves finding a heuristically suitable seed
sequence to initiate the search and the second a walk in sequence space that takes one closer and closer to
a candidate sequence. In [8] the seed is taken to be the sequence σ for which the energy E(σ, S) is minimal
(called the MFE sequence).
The first inverse folding algorithm for PK structures was given in [25]. It uses a random sequence as seed
and is able to inverse fold PK structures with at most two mutually crossing arcs. In Section 4.3 we outline
upcoming work in which we use the novel grammar of [32] to obtain an algorithm that can find the MFE
sequence for a given PK structure. This will in turn be utilized by a stochastic local search algorithm, similar
to the one in [8], to inverse fold such a PK structure.
Global search algorithms for inverse folding, also called Inverse sampling [62], involve assigning probabilities to the sequences compatible with a given structure S as how likely they are to fold into S. The grammar
of [32] makes it possible to extend inverse sampling to PK structures.
2 Preliminaries
In this section we recall the basic notions and terminology in RNA secondary structure folding.
Unauthenticated
Download Date | 6/16/17 9:06 PM
6 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
2.1 RNA secondary structures
An RNA molecule consists of a sequence of nucleotides A, U, C and G. These nucleotides can form WatsonCrick A-U, C-G as well as wobble G-U base pairs. RNA nucleotides can form noncanonical base pairs as well
(Section 4) however for the purpose of RNA secondary structure prediction they are usually ignored. These
base pairs make the RNA molecule fold into a 3-dimensional shape and this shape is essential to the functions
of the RNA molecule.
Since an RNA molecule is directed, from the 3’ to the 5’ end, bases can be enumerated from 1 to N and
hence a base pair can be represented by a pair (i, j) where j > i. Geometrically a secondary structure can be
represented by its nucleoltides as points on the x-axis and the base pairs as arcs in the upper half plane. This
representation is called a contact structure [60].
We say a structure S is the concatenation of two substructures S1 , S2 if each arc in S belongs to one of S1
or S2 and there is a base k such that the arcs of S1 are on one side of k and the same for S2 . Similarly S is the
result of nesting S1 in S2 if all the arcs in S1 are underneath a single arc in S2 .
A contact structure is called noncrossing if none of the arcs in it intersect or in other words if there are
no base pairs (i, j), (k, l) such that i < j < k < l. Structures which contain intersecting arcs are called pseudoknotted (PK).
2.2 Folding noncrossing structures
For a given RNA sequence there are exponentially many secondary strcuctures compatible with it [68] and
therefore a naive approach to finding the MFE structure would be computationally impossible, even for sequences of moderate length. The dynamical programming (DP) approach to folding [51] uses the fact that
structures can be nested in one another and that the energy is additive w.r.t. the nesting of structures. So a
recursion that can exhaustively express secondary structures as combinations of simpler ones can be used
together with DP to obtain an algorithm for finding MFE or computing the partition function.
It is easy to see that a noncrossing structure can be recursively decomposed into the concatenation and
nesting of its constituent arcs. Conversely for a given N all such structures can be obtained in polynomial time
by nesting and concatenating simpler structures [71]. This can be formalized using the language of context
free grammars (CFG’s). Recall that a CFG consists of sets V , Σ of nonterminal resp. terminal symbols and a
set R of production rules which are maps from V to (V ∪ Σ)* . Here * is the Kleene star and denotes the set of
all words in the alphabet given by V ∪ Σ.
For RNA secondary structures the grammar goes back to the work of Smith and Waterman [71]. The nonterminal symbols are L and S which respectively denote a structure with a rainbow arc (maximal arc) and an
arbitrary structure. The production rules are given by
S → LS | • S | ϵ
L → aSb
where a and b denote the bases of an arc, • denotes an unpaired base and t is the empty structure without
any crossings. These production rules are depicted in Figure 6.
3 Fatgraphs
In this section we recall the basic theory of fatgraphs.
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
7
Figure 6: The grammar for decomposing secondary structures without crossings
3.1 Basic facts
The contact structure from a secondary structure has an additional structure i.e. a cyclic counterclockwise
order on the edges incident on each vertex. This order comes from the orientation of the plane. A graph that
is equipped with a cyclic order on the edges incident on each vertex is called a fatgraph. More details on the
concepts and proofs presented in this and the next subsection can be found in e.g. [38] and [57].
To be precise the cyclic order < has to be defined on half edges in the graph. This is because if two (possibly
identical) vertices u, v share edges e, e′ then e and e′ may appear in different orders on u compared to v. For
example on the left hand side of Figure 7 the edge 2 comes before the edge 3 in the counterclockwise order
however the other halves of these two edges, i.e. 7 and 8 come in reverse order. Let H be the set of half edges
in a fatgraph G so the cardinality of H is twice the number of edges in G.
To a fatgraph G we can associate a surface S in a canonical way. It is obtained by thickening the edges of
the fatgraph into ribbons and the vertices into discs. The order < on the half edges is used to resolve the singularity when the ribbons associated to two edges meet. Equivalently, S can be characterized by the fact that
G is embedded in it in such a way that it doesn’t intersect itself and the cyclic order on its edges is compatible
with the orientation of the surface [57].
Despite their topological nature, fatgraphs can be described combinatorially and concisely by a pair of
permutations [12]. Recall that a permutation on the set {1, . . . , n} is a one to one and onto function f from
the set to itself. A cycle of f for an i ∈ {1, . . . , n} is given by (i, f (i), f (f (i)), . . . , f k (i)) such that k ≤ n and
f k+1 (i) = i. Cycles are considered up to cyclic rearrangement of their elements and each permutation f can be
decomposed as a product of its cycles. This decomposition is unique up to conjugation and permutation of
the cycles.
For each vertex v let e1 < e2 < · · · < e k be the half edges incident on it, ordered by the cyclic order <.
We can consider the cyclic permutation (e1 , e2 , . . . , e k ) from which we can recover the cyclic order around
v. Composing these permutations for all the vertices in G we get a permutation on H which we denote by σ.
Note that since each half edge is incident on only one vertex, it appears in only one of the cyclic permutations
that make σ so the order in which we compose the cycles is immaterial. It is easy to see that each vertex in G
corresponds to a cycle in σ. For half edges incident on a vertex of degree n we have
e ≤ e′ ⇔ ∃ k < n
σ k (e) = e′ .
(1)
To be able to recover G, in addition to σ we need to specify which half edge pairs with which. This can be
done with an additional permutation α on H which sends each half edge to its couple and vice versa. So α
is fixed-point free and satisfies α2 = α. In this way each fatgraph G can be represented combinatorially by
the triple (H, σ, α). The composition 𝛾 := α ∘ σ is a permutation on H whose cycles give us the boundary
components of the surface associated to G [38]. By abuse of terminology we call these boundary components
the boundary components of G itself.
Example 3.1. For example for the fatgraph in Figure 5 we have H = {1, 2, . . . , 16},
α = (1, 2)(3, 4)(5, 6)(7, 8)(9, 10)(11, 12)(13, 14)(15, 16),
σ = (3, 1)(7, 5, 4)(9, 2, 8)(13, 11, 10)(15, 6, 14)(12, 16).
Unauthenticated
Download Date | 6/16/17 9:06 PM
8 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
Here we have written the cycles of σ so that the i’th cycle from left denotes the i’th vertex of the fatgraph. We
have 𝛾 = α ∘ σ = (1, 4, 8, 10, 14, 16, 11, 9)(2, 7, 6, 13, 12, 15, 5, 3) so the surface associated to this fatgraph
has two boundary components.
The set of fatgraphs enjoys a duality map called Poincare Duality [60]. The Poincare dual of the fatgraph
G = (H, σ, α) is given by G′ = (H, α ∘ σ, α). As Poincare duality swaps σ with 𝛾 , the vertices of G (cycles of σ)
are in a 1-1 correspondence with the boundary components of G′ (cycles of 𝛾 ) and the boundary components
of G are in bijection with the vertices of G′ . Also because α2 = id, we have (G′ )′ = G. In Figure 7 you can see
an example of this duality.
2
6
5
dual
4
3
7
10
2
8
9
5
3
7
9
(B)
8
dual
4
6
1
(A)
2
8
4
6
1
10
5
3 3
7
9
1
10
(C)
Figure 7: (A): a fatgraph represented by permutations. We have σ = (1, 2, 3, 4, 5, , 6, 7, 8, 9, 10), the unique vertex obtained
by collapsing the backbone, α = (1, 10)(2, 5)(3, 8)(4, 7)(6, 9) the fixed-point free involution representing the ribbons, and
𝛾 = (1, 5, 9)(2, 8, 6, 4)(3, 7)(10). (B) We consider the σ′ = 𝛾 = (1, 5, 9)(2, 8, 6, 4)(3, 7)(10) as vertices. (C): Connecting halfedges by ribbons following the fix-point free involution α = (1, 10)(2, 5)(3, 8)(4, 7)(6, 9), we construct the Poincaré dual (A).
The dual interchanges the boundary components by vertices, hence producing a unicellular map.
Fatgraphs which have only one boundary component, are called unicellular maps. By construction the
Poincare dual of a unicellular map has only one vertex. In the case of the contact structure of an RNA secondary structure if we collapse the backbone into a point we get a fatgraph with only one vertex and hence
its Poincare dual will be a unicellular map. See Fig. 8. This unicellular map is what we mean by the fatgraph
associated to an RNA secondary structure.
Figure 8: Collapsing the backbone into a single vertex. The Poincare dual of the resulting fatgraph is a unicellular map.
For a unicellular map, the permutation 𝛾 consists of only one cycle and hence it induces a cyclic order
<𝛾 on H which gives a traversal of all the half-edges. This order can be made into a linear order by choosing
a root edge. For an RNA secondary structure this edge is chosen to be the rainbow arc which is given by the
pair (1, N). If this arc does not exist in the structure it is added as a virtual base pair.
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
9
3.2 Genus as a measure of complexity
One of the main advantages of fatgraphs over ordinary graphs is that they come with a new numerical invariant i.e. the genus of the surface associated to them. Any orientable surface is topologically equivalent to
the connected sum of a number of tori [28] and this number is called the genus of the surface. So genus is
roughly speaking the number of holes (but not punctures) in the surface. The Euler characteristic of a fatgraph G with v vertices, e edges and r boundary components is defined to be χ(G) = v − e + r and from the
equivalence of simplicial and singular homology [28] we know that the Euler characteristic is related to genus
by χ(G) = 2 − 2g. So we have
1
g(G) = 1 + (e − v − r).
(2)
2
Since Poincare duality swaps vertices with boundary components v ↔ r and preserves the number of the
edges, it preserves genus.
In the previous subsection we associated to each RNA secondary structure a unicellular map. We call
the genus of this unicellular map, the genus of the secondary structure. Because the plane has genus zero,
a planar fatgraph has genus zero too. So each noncrossing secondary structure has genus zero, because it is
embedded into the plane. So we can use genus as a measure of the complexity of a PK structure. In the rest
of this section we will see how genus is useful in obtaining a recursion for decomposing PK structure into
simpler ones.
One advantage of genus as a measure of complexity is that, unlike other measures such as the number
of crossings, it is invariant under collapsing stacks [59]. This is important because computing the energy of
helices is less demanding. This means that if we have a set of parallel arcs in a structure and we remove all
but one of them, genus is unchanged. The structure obtained from S by collapsing all the stacks in it is called
the shadow of S [59].
3.3 Decomposing fatgraphs
As mentioned in the last subsection, it is a well known fact in topology that if the surface associated to a
fatgraph has genus g then it can be decomposed into a connected sum of g tori [28]. If the surface in question
is the surface associated to a fatgraph then it would be useful to be able to do this decomposition in a way
that is compatible with the structure of the fatgraph itself. This means finding a composition operation on the
set of fatgraphs so that all genus g fatgraphs can be expressed as compositions of fatgraphs of lower genus.
One such decomposition for unicellular maps was given by Chapuy [12] which we recall in this subsection.
In Section 4 we will see how this decomposition can be used to obtain a grammar for decomposing RNA PK
structures.
For a planar fatgraph, which by definition has genus zero, the two orders <𝛾 , < on the half-edges incident
on each vertex coincide because both of them are induced by the orientation of the plane. This means that
one can use discrepancies between these two orders to detect genus. Let G = (H, σ, α) be a unicellular map of
genus g and let e1 < σ e2 < σ e3 are three edges incident on the same vertex v that do not appear in the same
order in <𝛾 . Chapuy [12] calls such a triple of edges intertwined. He defines two operations on intertwined
edges, called slicing and gluing, which reduce or increase the genus of a fatgraph respectively.
Slicing. With the same assumptions as in the last paragraph, can write the cycle of σ corresponding to v
as
(e1 , a1 , . . . , a k1 , e2 , b1 , . . . , b k2 , e3 , c1 , . . . , c k3 ).
(3)
′
Let σ be the permutation of H which is the same as σ except for having the cycles
(e1 , a1 , . . . , a k1 ), (e2 , b1 , . . . , b k2 ), (e3 , c1 , . . . , c k3 )
′
′
(4)
′
instead of v. The fatgraph G = (H, σ , α) can easily be seen to be a unicellular map [12]. Because σ has two
cycles more than σ, G′ has two vertices more than G, so by formula (2) its genus is g − 1.
Unauthenticated
Download Date | 6/16/17 9:06 PM
10 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
Gluing. This is the reverse of the slicing map. If we have three vertices in a fatgraph given by cycles
(a1 , . . . , a k ), (b1 , . . . , b k′ ), (c1 , . . . , c k′′ ) then we can take a new permutation which has the single cycle
(a1 , b2 , . . . , b k′ , b1 , c2 , . . . , c k′′ , c1 , a2 , . . . , a k ) in place of the above three. The corresponding fatgraph is
still a unicellular map and has two vertices less than the original one so its genus is one less. See Fig. 9.
In
Out
a3
glue
a1
a2
B
...
A
II
g= ( ... a1, A , a2, B , a3, ... )
s= ... ( a1, I )( a2,II )( a3,III ) ...
a’1
slice
a’3
...
I
...
...
...
III
a’2
...
B
A
g= ( ... a1, B , a3, A , a2 ,... )
s= ... ( a1, II, a2, III, a3 , I, ) ...
Figure 9: Gluing and slicing
A trisection is a half-edge e such that σ(e) <𝛾 e and e is not the minimum half edge of the vertex w.r.t. <𝛾 .
Chapuy proves that a unicellular map of genus g has exactly 2g trisections.
A trisection can be sliced in a special way. Let e3 := e be a trisection incident on a vertex v and let e1 denote the
minimum half-edge of v, let e2 denote the <𝛾 -minimum among half-edges which are between e1 and e3 (w.r.t.
< σ ) but larger than e3 with respect to <𝛾 . We have e1 <𝛾 e3 <𝛾 e2 and so these three edges are intertwined
and we can use them to slice v. After the slicing e now belongs to a new vertex and it may or may not be the
minimum half edge. If not it still satisfies σ′ (e) <𝛾 ′ e and so is a trisection. Therefore we can iterate the above
procedure until e becomes the minimum half-edge of the vertex to which it belongs.
4 Modeling RNA structure with fatgraphs
Throughout the history of science, mathematics has been used to model natural phenomena. The purpose of
such models is primarily to enable us to predict the result of a process without having to experiment each time.
For example calculus and the theory of Hilbert spaces were used to model classical and quantum mechanics
respectively. As history tells us a model that seems to be sufficient at predicting natural processes at first may
later prove to be inefficient, as demonstrated by the example of the classical and quantum mechanics. The
purpose of a mathematical model for RNA is to enable us to classify and communicate RNA structure patterns
and to predict RNA folding.
The first application of fatgraphs to studying RNA was by Penner and Waterman [56] in which the authors enhanced the space of secondary structures into a simplicial complex and used fatgraphs and their
relation to the Teichmüller theory of surfaces to determine its topological type. More recently Penner et al.
[55] associated a fatgraph model to proteins. They use the genus and the number of boundary components
of the fatgraphs to obtain robust descriptors for proteins.
In this section we present how fatgraphs, whose mathematical theory was presented in the last section
make it possible to efficiently classify, fold and inverse fold PK structures. We begin by describing how fatgraphs are essential to the modeling of noncanonical base pairs in RNA.
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
11
As mentioned in the introduction, in addition to the Watson-Creek and wobble pairs, RNA structures
exhibit a plethora of additional, non-canonical base-pairs. These base pairs mediate specific interactions responsible for RNA-RNA self-assembly and RNA-protein recognition. RNA purine and pyrimidine bases present
three edges for hydrogen-bonding interactions: the Watson-Crick edge, the Hoogsteen edge, and the Sugar
edge denoted by W-C, H and SUG respectively [39]. In this section we outline future work in which we use
fatgraphs to model noncanonical base pairs in RNA.
A given edge of one base can potentially interact in a plane with any one of the three edges of a second
base, and can do so in either cis or trans orientation of the glycosidic bonds. This gives rise to 12 basic geometric types with at least two H bonds connecting the bases, see Figure 3. The canonical A-U, G-C pairs and
G-U wobble pairs belong to the cis Watson-Crick/Watson-Crick geometry. RNA folding, as well as interactive three-dimensional modeling, requires keeping track of the relative orientations of the strands to which
the interacting bases belong. A triangle has two sides, for one side the three edges follow (H, SUG, W-C)
counterclockwise, and for another side the three edges follow (H, W-C, SUG) counterclockwise. A base pair
is parallel if the two triangles are on the same side and anti-parallel, otherwise. This leads to considering
non-orientable fatgraphs as detailed in Figure 4.
Roughly speaking a non-orientable fatgraph is one whose associated surface can be non-orientable. More
precisely such a fatgraph comes with additional information as to whether the ribbon associated to each edge
should be twisted or straight. To keep track of the boundary components of a nono-orientable fatgraph it is
preferred to consider sectors instead of ribbons. A sector is the wedge between the two ribbons incident on a
vertex and for non-orientable fatgraphs with n edges we can take the set H of cardinality 2n to be the set of
the sectors. So a possibly non-orientable fatgraph is a triple (H, σ, 𝛾 ) such that for each pair (e, σ(e)) there is
another pair (f , σ(f )) such that we have either
σ
e
/ σ(e)
𝛾
σ(f ) o
𝛾
σ
or
σ
e
𝛾
f
/ σ(e)
𝛾
f
σ
/ σ(f ).
In the first case we have an untwisted ribbon and in the second case a twisted one. See Fig. 10 for two
examples of such fatgraphs. Note that for a non-orientable fatgraph it is not true that α = 𝛾 ∘ σ−1 satisfies
α2 = id. The notion of Poincare duality can be defined for these fatgraph by swapping σ and 𝛾 .
5
-3
-4
6
2
1
8
7
(A)
5
-3
-7
1
8
4
6
-2
(B)
Figure 10: Two examples of nonorientable fatgraphs. The plus and minus signs on sectors denote counterclockwise and clockwise orientations respectively.
Unauthenticated
Download Date | 6/16/17 9:06 PM
12 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
4.1 A context free grammar for decomposing PK structures
We saw in the subsection 3.3 that a genus g PK structure together with a blueprint is equivalent to a labeled
tree (G N , κ). The Poincare dual of this tree is a genus zero fatgraph S with one vertex which has the same
edges as the tree. So an edge e of S can be regarded as an edge of G N and to which we assign the label of its
child. Since the root of the tree is not involved in any slicing it gets the zero label. We then add the unpaired
bases to S and what we obtain is a noncrossing secondary structure with labeled arcs.
We then apply the classical grammar for secondary structures to (S, σ) to decompose the original PK
structure. Whereas the old grammar applied to arcs, the new grammar applies to labeled arcs and it either
removes an arc or reduces its label. We now describe the resulting grammar in more detail but it’s worth
mentioning that different choices of blueprints result in different labeled noncrossing structures.
∑︀
This grammar has new nonterminals S(l(ti ))i and L(l(ti ))i for i = 1, . . . , N. Here l i = σ a σ a |i where σ a runs
i i
i i
over all the edge labels in the noncrossing structure G N that belong to S. We set t i = 1 if and only if the
corresponding nonterminal contains a transitional arc for i (meaning it contains an arc a such that σ a |i = 1)
and t i = 0 otherwise. The new terminals are •, d σ a , d′σ a where • denotes an isolated base and d σ a , d′σ a denote
the two bases of a labeled arc.
There are two production rules in this grammar.
i )i
Decomposition: this means to decompose a nonterminal symbol S(ℓ
into one left- and one right-symbol,
(t i )i
together with consistent labelings. This is formalized as
i )i (ν i )i
| • S(l(ti ))i .
S
S(l(ti ))i → ϵ | S(µ
(p ) (q )
i i
i i
i i
(5)
i i
The new labels µ i , ν i , p i , q i are derived using the fact that the resulting structure must be a λ-structure i.e. a
labeled noncrossing structure obtained from a PK structure and a blueprint. Detailed formulas for these
labels are given in [32].
i )i
. The key point here is to
Arc-removal: this means to remove the outer arc, a, of the nonterminal L(ℓ
(t i )i
provide this arc with a label, σ a consistent with the label associated with the generated nonterminals.
(l i )i −σ a ′
d σa
L(l(ti ))i → d σ a S(p
)
i i
(6)
i i
(l′ )
Given L(l(ti ))i any arc-removal based production rule generates the nonterminal S(ti′ )i , where t′i ≤ t i and
i i
i i
l i − l′i ≤ 1. This implies for the label of the removed arc σ a |h = (l h − l′h ), 1 ≤ h ≤ r.
It is proved in [32] that this grammar is context free.
Slicing can be used to convert a crossing structure to a noncrossing one. Then one can apply the classic
grammar, mentioned in Section 2.2, to the latter to decompose the former. If the structure has genus g then
we will need exactly g slicing operations.
A blueprint [32] for a unicellular map G is a sequence {(G i , h i )}i∈{0,N} of unicellular maps G i and a
trisection h i for G i such that G0 = G, G i+1 is obtained from G i by slicing the latter at h i and the genus of G N
is zero (so h N is empty). Note that N is not necessarily determined by g. Note also that all G i have the same
number of edges.
Since G N has genus zero, it is a planar tree. We now recall from [32] a way of adding labels to the vertices
of this tree so that we can recover G from it (without having to know the other G i in the blueprint). The labels
{κ v }v are elements of Z/2N and are assigned by backtracking the vertices through the slicings. We first assign
unreduced labels κ′v . Starting from a vertex v =: v N of G N , it descends from a unique vertex v N−1 of G N−1 .
Passing from G N−1 to G N , v N−1 was either sliced or not. If so then the (N − 1)’st coordinate of κ′v would be 1
and otherwise 0. We then do the same for v N−1 and if it was the result of a slicing of a vertex v N−2 on G N−2
then the (N − 2)’nd coordinate of κ′v would be one. In general the i’th coordinate of κ′v is 1 if and only if the
vertex v is the descendant of a vertex which was involved in a slicing in G i .
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
13
We then reduce the κ′v to obtain κ v . Among the vertices that result from the slicing of the same vertex
at stage i only one of them has the i’th coordinate of its label equal to 1 and this vertex is chosen to be the
one which is minimal in the <𝛾 order. As a result the sum of all 1’s in all the labels κ v equals 2g + N. The
noncrossing graph G N together with the labeling κ v is called the λ-structure associated to the blueprint. In
Figure 11 we display an example of a blueprint for a PK structure and the associated λ-structure.
Figure 11: A PK structure (left), a blue print of it (center) and the associated λ-structure (right).
The blueprint can be reconstructed from G N and the labeling as follows. We glue the set of all vertices v
of G N for which κ v |N = 1. This gives us a unicellular map which is the same as G N−1 . We label it by setting the
last coordinate of κ v to zero if v was not involved in the aforementioned gluing and the sum of the labels of
the glued vertices (sans the last coordinate) otherwise. We can then iterate this process until we obtain G0 of
genus g.
4.2 Folding pseudoknotted structures
The main difficulty in folding PK structures is that, because of the intersections between the arcs a useful
recursion for exhausting these structures (similar to the one given in Section 2.2 for noncrossing structures)
eluded researchers for a long time.
For certain restricted classes of pseudoknots, Polynomial-time dynamic programming (DP) algorithms
can be devised. In contrast to the O(N 2 ) space and O(N 3 ) time solution for simple secondary structures
[51, 79, 83], however, most of these approaches are computationally much more demanding. The design of
pseudoknot folding algorithms thus has been governed more by the need to limit computational cost and
achieve a manageable complexity of the recursion rather than the conscious choice of a particularly natural
search space of RNA structures. The following references provide a list of DP approaches to RNA structure
prediction using different structure classes characterized in terms of recursion equations and/or stochastic
grammars: [1, 9, 13, 19, 21, 36, 42, 44, 47, 58, 63, 76]. The inter-relationships of some of these classes of RNA
structures have been clarified in part by [17] and [65]. In addition to these exact algorithms, a plethora of
heuristic approaches to pseudoknot prediction have been proposed in the literature; see e.g., [15, 49] and the
references therein.
Prior to [59] there had been a few attempts at classifying PK structures for example [27] suggested using book-embeddings, [34] focused on the maximal set of pairwise crossing base pairs, and [7] based the
classification on topological embeddings. While these classifications have in common that simple secondary
structure forms the most primitive class of structures, they differ already in the construction of the first nontrivial class of pseudoknots. A workable approach to 3-noncrossing structures requires the enumeration of
an exponentially growing number of diagrams which are then “filled in” by means of DP [33]; a Monte-Carlo
approach utilizing the topological approach with a very simple matching-like energy model was explored by
[77].
Unauthenticated
Download Date | 6/16/17 9:06 PM
14 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
The first polynomial time algorithm for folding a class of PK structures was given in [59]. It relies on an
efficient topological recursion for enumerating PK structures. One considers the shadow of an RNA structure
which is given by removing all unpaired bases and collapsing each set of parallel arcs (called a stack) into
a single arc. The shadow of a structure has the same genus and number of loops as the structure itself. A
shadow is called irreducible if for each two arcs in it there is a sequence of mutually intersecting arcs from one
to the other. Each shadow can be decomposed into a composition of irreducible pieces by iteratively removing
nested or concatenated components. A shadow is called a 𝛾 -structure if all its irreducible components have
genus ≤ 𝛾 . It is shown in [59] that there are finitely many irreducible 𝛾 structures for a given 𝛾 . It is also shown
that there are exactly four irreducible 1-structures, see Fig. 12.
H
K
L
M
Figure 12: The four irreducible genus one shadows
Then a MCFG, called R1 , for decomposing any 1-structure is given [59]. This grammar is as follows.
– Non-terminal S, representing noncrossing secondary structure elements according to the rules given
above.
– Non-terminals I and T, representing an arbitrary 1-structure where I can have genus zero but the genus
of T is positive.
−
→
– Non-terminals X = [X1 , X2 ] with two components used to represent a fragment-pair with nested arcs,
X ∈ {H, K, L, M }.
– Terminals (X, )X denoting the opening and closing of a base pair, respectively, where X is one of the
types H, K, L or M.
Different brackets as well as the different non-terminals of pattern X are used to distinguish nestings of
the various types of shadows. Finally, we specify the production rules of the unambiguous MCFG R1 .
I
→
S|T
(7)
S
→
(S)S| • S|ϵ
(8)
T
→
I(T)S
(9)
T
→
IA1 IB1 IA2 IB2 S
(10)
T
→
IA1 IB1 IA2 IC1 IB2 IC2 S
(11)
T
→
IA1 IB1 IC1 IA2 IB2 IC2 S
(12)
T
→
IA1 IB1 IC1 IA2 ID1 IB2 IC2 ID2 S
(13)
→
[(XIX1 , X2 I)X] | [(X, )X]
(14)
−
→
X
Here X ∈ {H, K, L, M } indicates one of the four types of irreducible 1-structures. In the above formulas a pair
of parenthesis indicates a base pair and a bullet is an unpaired base. In words the equations (9)-(13) mean
that an arbitrary 1-structure may be decomposed into an irreducible 1-structure with other 1-structures nested
under its arcs (but not intersecting them).
It is proved in [59] that any RNA 1-structure can be uniquely decomposed via R1 and any diagram generated via R1 is a 1-structure. The resulting DP algorithm and software gfold can fold an RNA sequence of
length N in O(N 6 ) time. It is worth noting that there is currently no consensus for the energy contribution of
different types of pseudoknotted structures.
gfold is capable of Boltzmann sampling PK structures as well. Recall that the Boltzmann ensemble [22]
of a given RNA sequence is the probability space of all the structures compatible with it. The central notion
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
here is the partition function of an RNA sequence [48]. For a sequence σ it is given by
∑︁
Q(σ) =
exp(−E(S, σ)/RT)
15
(15)
S
where the sum ranges over all the structures S compatible with σ and R, T are the universal gas constant
and the temperature respectively. The probability that σ will fold to a specific structure S is then given by
exp(−E(S, σ)/RT)/Q(σ). In [48] McCaskill observed that the dynamic programming routines folding MFE
structures [48] allows one to compute the partition function of all possible noncrossing structures for a given
sequence. Predictions such as base pairing probabilities are obtained in [29, 30] and are parallelized in [24].
In [20, 74] a statistically valid sampling of secondary structures in the Boltzmann ensemble is derived and is
used to calculate the sampling statistics of structural features. The first algorithm for computing a partition
function that includes PK structures was given in [21]. The advantage of gfold over this algorithm is that it
uses different penalties for different types of irreducible 1-structures.
One reason that Boltzmann sampling is useful is that because of the inaccuracies in the energy parameters in the thermodynamic model, the MFE structure may not coincide with the natural structure to which a
sequence folds. So instead of focusing on the MFE structures one can sample structures from the Boltzmann
ensemble. In the next subsection we will see how fatgraphs can be useful in the closely related method of
inverse sampling.
4.3 Inverse folding of PK structures using fatgraphs
Given a structure S, inverse folding involves finding an RNA sequence σ which folds into S. Inverse folding
is useful for example in designing synthetic noncoding RNA molecules which may be used as drugs or for
regulating gene expression. In this section, after recalling the basics of inverse folding RNA sequences to
noncrossing structures, we outline upcoming work that uses the CFG from subsection 4.1 to inverse fold such
sequences to PK structures.
In this article we use the thermodynamic model for RNA folding so we are looking for a sequence whose
minimum free energy (MFE) structure in the Turner model equals S. Note that due to the fact that a lot of
sequences fold to the same structure, the existence of such σ is not guarantied for all structures S [68]. So one
may weaken the condition on σ to be a sequence whose MFE structure has minimum distance to S, where
distance is measured using some secondary structure metric [50] which can simply be the number of the
bases in which the two structures do not agree. However even this relaxed criterion can be computationally
quite expensive and so the existing algorithms use heuristic methods to find a sequence which is reasonably
close to being the MFE sequence.
The existing algorithms fall into two general categories: local search and global search. Local search
algorithms [30] involve two steps. Given a structure, S, the first step is to find a “suitable” sequence σ0 which
is compatible with S and is usually chosen heuristically so that it does not lie very far from the desired inverse
fold sequence. The second step involves a walk in the sequence space that given a sequence σ i (with σ0 as
starting point) finds another σ i+1 in the neighborhood of it such that the distance of σ i+1 ’s MFE structure to S
(computed using a secondary structure metric) is smaller than that of σ i . The algorithm stops when there is
no such neighbor to be found. (Recall that the neighbors of a sequence are the ones which differ with it only
in one nucleotide i.e. point mutations.)
In the original algorithm for inverse folding RNAInverse [30] the initial sequence σ0 was chosen at random. Later in [8] this sequence was taken to be the one which minimizes energy among all compatible pairs
(σ0 , S). This is called the MFE sequence. The MFE sequence is a good choice because the inverse fold is expected to have a low energy w.r.t. S [41, 61]. This sequence can be found in linear time (in the length of S)
using a DP algorithm, due to Busch and Backofen [8].
The main rule of thumb for computing the MFE sequence of a secondary structure in [8] is that if we know
the MFE sequence of a closed substructure (i.e. one without any arcs exiting it and has an arc between its
outermost bases) for all the six possible assignments of base pairs to its outer arc then for computing the MFE
Unauthenticated
Download Date | 6/16/17 9:06 PM
16 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
sequence of the whole structure we don’t need to know anything more about this substructure. The algorithm
starts with computing the energies of the base pairs which are maximally nested under other arcs, for all the
6 possible combinations for each. Having computed the MFE of a substructure S i for all the allowed base assignments it then proceeds to compute the MFEs of the substructure S i+1 which is obtained from S i by adding
the base pair which is immediately larger than the largest base pair in S i . If each base pair had only one
predecessor (w.r.t. <) then the time complexity of this algorithm would be proportional to the number of base
pair times the number of allowed pairs i.e. linear in the length of the structure. However the closing arc of a
multiloop has more than one predecessor and so the time complexity of this algorithm would be exponential
in the number of loops inside the multiloop. This problem can be resolved by assigning a DP matrix to each
multiloop. This basically means that instead of taking into account all the possible assignments of bases to
the loops in a multiloop, one computes the MFE, for each one and for each assignment of its closing base pair.
The first algorithm for inverse folding pseudoknotted structures was given in [25]. It can find the MFE
sequence for a structure with at most two crossings and it uses a random sequence for the first step. The
difficulty of finding the MFE sequence for a PK structure lies in the fact that up until recently there was no
grammar for decomposing PK structures. Such a grammar was given in [32] and was described in Section 4.1.
In a future work we use this grammar to decompose a given PK structure and then find its MFE sequence.
This procedure is somewhat similar to that of Busch and Backofen but with the CFG from [32] used instead
of the nestedness order to decompose a structure. Once the MFE sequence is found, a stochastic local search
algorithm is used to find the inverse fold. The following is a list of different modules involved in this line of
work.
– The context free grammar of [32]. Given a PK structure S, we pick a blueprint for its associated fatgraph
and using the method of [32] obtain a labeled noncrossing structure (G, κ v ). Then the grammar from
the aforementioned paper is used to decompose this latter structure into a set of labeled noncrossing
arcs.
– An algorithm for finding the MFE sequence of a PK structure. After performing the above step one can
obtain the MFE sequence for S by finding the minimum free energy sequence for the labeled noncrossing structure (G, κ v ). This is done in a similar manner to the Busch-Backofen algorithm described above.
– A stochastic local search algorithm. This step is very similar to the corresponding steps in [8] and [25]
i.e. starting with the MFE sequence it recursively finds sequences the distance of whose MFE structure
(in base pair metric) is closer and closer to S.
– gfold. The local search step above involves folding the sequence obtained in each step to measure the
distance of its MFE structure with S. Since in our case S is pseudoknotted, we use gfold to fold σ to see
whether its MFE structure is close to S.
Just as in the case of [8] we expect using the MFE sequence instead of a random sequence (as done
in [25]) will dramatically decrease the number of the iterations needed in the stochastic local search step.
This is significant in the light of the fact that folding for PK structures takes much longer time than for the
noncrossing structures considered in [8].
A problem closely related to inverse folding is inverse sampling. Given an RNA secondary structure S
inverse sampling for S involves assigning a probability to each sequence compatible with it. The probability
measures the likelihood of the sequence to fold into S. In [62] and independently in [5] a dual partition function is assigned to the ensemble of sequences σ compatible with a given structure S. (See also [78].) Let E(S, σ)
denote the energy of the pair (S, σ). The dual partition function is given by
∑︁
Q(S) =
exp(−E(S, σ)/RT)
(16)
σ
where the sum is over all sequences compatible with S. This partition function, for noncrossing structures,
can be computed using a DP algorithm similar to that for the usual structure-space partition function. Since
probability is in reverse proportion with energy, the MFE sequence is the one with the highest probability. This
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
17
means that if we randomly sample sequences compatible with S, the sequence with the highest frequency is
likely to be the MFE sequence.
In [62] this partition function is used to obtain an inverse sampling algorithm for noncrossing RNA structures. This inverse sampling can be viewed as a global search for the MFE sequence. Another advantage of the
algorithm in [62] is that it can control the number of C-G base pairs in the sampled sequences. In [5] inverse
sampling of noncrossing structures is used to assign a mutual information to a sequence-structure pair. For
a given structure compatible sequences whose mutual information is close to each other are believed to contain similar genetic data. The substructures with fewest inverse samples are where the hidden patterns exist.
The fatgraph driven CFG from [32] makes it possible to extent this paradigm to pseudoknotted structures.
5 Conclusion
In this article we recalled the importance of predicting RNA structure and how the innovative use of fatgraphs
in this realm can help us understand and predict RNA structure better, specially when taking pseudoknotted
structures or non-canonical base pairs into account. Fatgraphs have more structure than ordinary graphs and
this extra structure enables one to use tools from topology for studying them augmenting the combinatorial
methods.
Classically, when considering only Watson-Crick and wobble base pairs, RNA secondary structures have
been represented by graphs. In this model the backbone is drawn as a horizontal line and base pairs are denoted by arcs drawn in the upper half-plane. Later studies [39] show that modeling noncanonical base pairs
between nucleotides requires keeping track of the relative orientations of the strands to which the interacting bases belong. In [39] a nucleotide is modeled as a triangle having two sides (faces). For one side the three
edges of the nucleotide (Watson-Creek, Hoogsteen and sugar) follow counterclockwise, and for the other they
follow clockwise. The graph model of RNA secondary structure is not capable of describing these newly discovered features, while fatgraph model can achieve this easily. Furthermore, cis and trans base pairs can be
represented by untwisted and twisted ribbons respectively in the fatgraph model.
Fatgraphs bring a number of other advantages to the study of RNA. First of all the genus of fatgraphs can
be used to filter and classify PK structures. Classifying RNA structures including pseudoknots by topological
genus was studied in [6, 52, 59]. Although there are a lot of different classification schemes for pseudoknots,
the fatgraph model seems to be the best of them. For one thing there are a finite number of shapes for any
given genus [59] (see Section 4.2) and also adding parallel arcs does not increase genus. This makes biological
sense since adding parallel arcs only increases the size of a helix and does not increase the complexity of the
structure. For another reason, in the fatgraph model, various loops to which a structure is decomposed in
the nearest neighbor thermodynamic model [46], naturally correspond to the boundary components of the
fatgraph associated to the structure.
The use of fatgraphs also makes it possible to effectively decompose more complicated PK structures into
simpler ones as exemplified by the CFG from [32] (Section 4.1). In that paper a method is developed to transfer
a PK structure into a labeled secondary structure, called a λ-structure. This method gives us the first contextfree grammar for PK structures. One future application of this grammar is in inverse folding RNA structures.
That is to find a sequence which folds to a given structure. This problem arises for example in designing new
drugs or gene regulators [30]. If the given structure is noncrossing, one approach is studied in [30] and refined
in [8]. It consists of two parts. One first finds a sequence σ0 which is compatible with the given structure and
in the second step one uses a stochastic local search algorithm to find sequences σ i which fold closer and
closer to the desired structure. The second step involves folding each individual σ i and so is time consuming.
Therefore it was shown in [8] that if instead of picking σ0 at random as in [30], one uses the minimum free
energy sequence for the structure, the number of iterations is reduced significantly.
Despite their importance in biological processes [37, 43, 72], there has been no efficient algorithm for
inverse folding pseudoknotted RNA structures. The only existing algorithm was given in [25] and it uses a two
Unauthenticated
Download Date | 6/16/17 9:06 PM
18 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
step method similar to [30] with a random sequence as the seed. Since folding sequences to PK structures
is much more time consuming [59], if we can reduce the number of iterations in the second step by finding
the MFE sequence of a PK structure, it will improve the inverse folding algorithm dramatically. The fatgraph
model for RNA and specifically the CFG in [32] (Section 4.1) makes it possible to effectively decompose the
λ-structure associated to a PK structure to find the MFE sequence. Using this method we can also extend the
results in [5] which samples sequences for a given secondary structure to include PK structures.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discr. Appl. Math.,
104:45–62, 2000.
Sidney Altman. Enzymatic cleavage of rna by rna. Bioscience Reports, 10(4):317–337, 1990.
Jørgen E. Andersen, Robert C. Penner, Christian M. Reidys, and Michael S. Waterman. Topological classification and enumeration of RNA structrues by genus. J. Math. Biol., 2012. prepreint.
Maximillian H. Bailor, Xiaoyan Sun, and Hashim M. Al-Hashimi. Topology links RNA secondary structure with global conformation, dynamics, and adaptation. Science, 327:202–206, 2010.
Christopher Barrett, Fenix Huang, and Christian Reidys. Sequence-structure relations of biopolymers.
M. Bon, C. Micheletti, and H. Orland. Mcgenus: a monte carlo algorithm to predict RNA secondary structures with pseudoknots. Nucleic Acids Res., 41(3):1895–900, 2013.
Michael Bon, Graziano Vernizzi, Henri Orland, and A. Zee. Topological classification of RNA structures. J. Mol. Biol., 379:900–
911, 2008.
A. Busch and R. Backofen. INFO-RNA–a fast approach to inverse rna folding. Bioinformatics, 22(15):1823–31, 2006.
Liming Cai, Russell L. Malmberg, and Yunzhou Wu. Stochastic modeling of RNA pseudoknotted structures: a grammatical
approach. Bioinformatics, 19 S1:i66–i73, 2003.
S. Cao and S. J. Chen. Predicting RNA pseudoknot folding thermodynamics. Nucl. Acids. Res., 34(9):2634–2652, 2006.
Thomas R. Cech. Self-splicing and enzymatic activity of an intervening sequence rna from tetrahymena. Bioscience Reports,
10(3):239–261, 1990.
G. Chapuy. A new combinatorial identity for unicellular maps, via a direct bijective approach. Adv. Appl. Math., 47(4):874–
893, 2011.
Ho-Lin Chen, Anne Condon, and Hosna Jabbari. An O(n5 ) algorithm for MFE prediction of kissing hairpins and 4-chains in
nucleic acids. J. Comp. Biol., 16:803–815, 2009.
J. L. Chen and C. W. Greider. Functional analysis of the pseudoknot structure in human telomerase RNA. Proc. Natl. Acad.
Sci. USA, 102(23):8080–5, 2005.
S. J. Chen. RNA folding: conformational statistics, folding kinetics, and ion electrostatics. Annu Rev Biophys, 37:197–214,
2008.
J. Cheng, P. Kapranov, J. Drenkow, S. Dike, S. Brubaker, S. Patel, J. Long, D. Stern, H. Tammana, G. Helt, V. Sementchenko,
A. Piccolboni, S. Bekiranov, D. K. Bailey, M. Ganesh, S. Ghosh, I. Bell, D. S. Gerhard, and T. R. Gingeras. Transcriptional
maps of 10 human chromosomes at 5-nucleotide resolution. Science, 308(5725):1149–54, 2005.
Anne Condon, Beth Davy, Baharak Rastegari, Shelly Zhao, and Finbarr Tarrant. Classifying RNA pseudoknotted structures.
Theor. Comp. Sci., 320:35–50, 2004.
Rhiju Das and David Baker. Automated de novo prediction of native-like rna tertiary structures. Proceedings of the National
Academy of Sciences, 104(37):14664–14669, 2007.
Jitender S. Deogun, Ruben Donis, Olga Komina, and Fangrui Ma. RNA secondary structure prediction with simple pseudoknots. In Proceedings of the second conference on Asia-Pacific bioinformatics (APBC 2004), pages 239–246. Australian
Computer Society, 2004.
Y. Ding and C. E. Lawrence. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res.,
31:7280–7301, 2003.
R. M. Dirks and N. A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J.
Comput. Chem., 24:1664–1677, 2003.
Philippe Duchon, Philippe Flajolet, Guy Louchard, and Gilles Schaeffer. Boltzmann samplers for the random generation of
combinatorial structures. Combinatorics, Probability and Computing, 13:2004, 2004.
S. R. Eddy. Non-coding RNA genes and the modern rna world. Nat. Rev. Genet., 2(12):919–29, 2001.
M. Fekete, I.L. Hofacker, and P.F. Stadler. Prediction of RNA base pairing probabilities on massively parallel computers. J.
Comput. Biol., 7:171–182, 2000.
J. Z Gao, L. Y. Li, and C. M. Reidys. Inverse folding of RNA pseudoknot structures. Algorithms Mol Biol., 5:R27, 2010.
Walter Gilbert. Origin of life: The rna world. Nature, 319(6055):618–618, Feb 1986.
Unauthenticated
Download Date | 6/16/17 9:06 PM
Fatgraph models of RNA structure |
19
[27] Christian Haslinger and Peter F. Stadler. RNA structures with pseudo-knots: Graph-theoretical and combinatorial properties.
Bull. Math. Biol., 61:437–467, 1999.
[28] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002.
[29] I. L. Hofacker. The vienna RNA secondary structure server. Nucl. Acids Res., 31:3429–3431, 2003.
[30] Ivo L Hofacker, Walter Fontana, Peter F Stadler, L Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster. Fast folding
and comparison of RNA secondary structures. Monatsh. Chem., 125:167–188, 1994.
[31] Stephen R. Holbrook and Sung-Hou Kim. Rna crystallography. Biopolymers, 44(1):3–21, 1997.
[32] F. Huang and C. M. Reidys. Topological language for RNA. arXiv:1605.02628, 2016.
[33] Fenix Huang, Wade Peng, and C. M. Reidys. Folding 3-noncrossing RNA pseudoknot structures. J. Comp. Biol., 16:1549–1575,
2009.
[34] Emma Y. Jin, Jing Qin, and C. M. Reidys. Combinatorics of RNA structures with pseudoknots. Bull. Math. Biol., 70:45–67,
2008.
[35] Philipp Kapranov, Jill Cheng, Sujit Dike, David A. Nix, Radharani Duttagupta, Aarron T. Willingham, Peter F. Stadler, Jana
Hertel, Jörg Hackermüller, Ivo L. Hofacker, Ian Bell, Evelyn Cheung, Jorg Drenkow, Erica Dumais, Sandeep Patel, Gregg Helt,
Madhavan Ganesh, Srinka Ghosh, Antonio Piccolboni, Victor Sementchenko, Hari Tammana, and Thomas R. Gingeras. Rna
maps reveal new rna classes and a possible function for pervasive transcription. Science, 316(5830):1484–1488, 2007.
[36] Yuki Kato, Hiroyuki Seki, and Tadao Kasami. RNA pseudoknotted structure prediction using stochastic multiple context-free
grammar. IPSJ Digital Courier, 2:655–664, 2006.
[37] D. Konings and R. Gutell. A comparison of thermodynamic foldings with comparatively derived structures of 16s and 16s-like
rRNAs. RNA, 1:559–574, 1995.
[38] Seigei Lando and Alexander Zvonkin. Graphs on surfaces and applications. Springer, 2004.
[39] N B Leontis and E Westhof. Geometric nomenclature and classification of rna base pairs. RNA, 7(4):499–512, 2001.
[40] Neocles Leontis and Eric Westhof. RNA 3D Structure Analysis and Prediction (Nucleic Acids and Molecular Biology). Springer,
2012.
[41] A. Levin, M. Lis, Y. Ponty, C. W. O’Donnell, S. Devadas, B. Berger, and J. Waldispühl. A global sampling approach to designing
and reengineering RNA secondary structures. Nucl. Acids Res., 40(20):10041–10052, 2012.
[42] Hengwu Li and Daming Zhu. A new pseudoknots folding algorithm for RNA structure prediction. In L. Wang, editor, COCOON
2005, volume 3595, pages 94–103, Berlin, 2005. Springer.
[43] A. Loria and T. Pan. Domain structure of the ribozyme from eubacterial ribonuclease. RNA, 2:551–563, 1996.
[44] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. J. Comp. Biol., 7:409–427, 2000.
[45] Alexander Marchanka, Bernd Simon, Gerhard Althoff-Ospelt, and Teresa Carlomagno. Rna structure determination by solidstate nmr spectroscopy. Nature Communications, 6:7024 EP –, May 2015. Article.
[46] D Mathews, M Disney, J Childs, S Schroeder, M Zuker, and D. Turner. Incorporating chemical modification constraints into
a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci, 101:7287–7292, 2004.
[47] Hiroshi Matsui, Kengo Sato, and Yasubumi Sakakibara. Pair stochastic tree adjoining grammars for aligning and predicting
pseudoknot RNA structures. Bioinformatics, 21:2611–2617, 2005.
[48] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990.
[49] Dirk Metzler and Markus E. Nebel. Predicting RNA secondary structures with pseudoknots by MCMC sampling. J. Math.
Biol., 56:161–181, 2008.
[50] Vincent Moulton, Michael Zuker, Michael Steel, Robin Pointon, and David Penny. Metrics on rna secondary structures.
Journal of Computational Biology, 7(1-2):277–292, 2004.
[51] R. Nussinov, G. Piecznik, J. R. Griggs, and D. J. Kleitman. Algorithms for loop matching. SIAM J. Appl. Math., 35(1):68–82,
1978.
[52] H. Orland and A. Zee. RNA folding and large n matrix theory. Nuclear Physics B, 620:456–476, 2002.
[53] M. Parisien and F. Major. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature, 452:51–55,
2008.
[54] R. C. Penner. Cell decomposition and compactification of Riemann’s moduli space in decorated Teichmüller theory. In
Nils Tongring and R. C. Penner, editors, Woods Hole Mathematics-perspectives in math and physics, pages 263–301. World
Scientific, Singapore, 2004. arXiv: math. GT/0306190.
[55] R. C. Penner, Michael Knudsen, Carsten Wiuf, and Jørgen Ellegaard Andersen. Fatgraph models of proteins. Comm. Pure
Appl. Math., 63:1249–1297, 2010.
[56] R. C. Penner and M. S. Waterman. Spaces of RNA secondary structures. Adv. Math., 101:31–49, 1993.
[57] Robert C. Penner. Decorated Teichmüller theory. QGM Master Class Series. European Mathematical Society (EMS), Zürich,
2012. With a foreword by Yuri I. Manin.
[58] Jens Reeder and Robert Giegerich. Design, implementation and evaluation of a practical pseudoknot folding algorithm
based on thermodynamics. BMC Bioinformatics, 5:104, 2004.
[59] C. M. Reidys, F. Huang, J. E. Andersen, R. C. Penner, P. F. Stadler, and M. E. Nebel. Topology and prediction of RNA pseudoknots. Bioinformatics, 27:1076–1085, 2011.
[60] Christian Reidys. Combinatorial Computational Biology of RNA. Springer, 2011.
Unauthenticated
Download Date | 6/16/17 9:06 PM
20 | Fenix Huang, Christian Reidys, and Reza Rezazadegan
[61] V. Reinharz, Y. Ponty, and J. Waldispühl. A weighted sampling algorithm for the design of RNA sequences with targeted
secondary structure and nucleotide distribution. Bioinformatics, 29(13):308–315, 2013.
[62] Vladimir Reinharz, Yann Ponty, and Jérôme Waldispühl. A weighted sampling algorithm for the design of rna sequences
with targeted secondary structure and nucleotide distribution. Bioinformatics, 29(13):i308–i315, 2013.
[63] Elena Rivas and Sean R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J.
Mol. Biol., 285:2053–2068, 1999.
[64] Elena Rivas and Sean R. Eddy. The language of RNA: A formal grammar that includes pseudoknots. Bioinformatics, 16:334–
340, 2000.
[65] Einar Andreas Rødland. Pseudoknots in RNA secondary structures: Representation, enumeration, and prevalence. J. Comp.
Biol., 13:1197–1213, 2006.
[66] Alexander S. Rose, Anthony R. Bradley, Yana Valasatava, Jose M. Duarte, Andreas Prlić, and Peter W. Rose. Web-based
molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology, Web3D
’16, pages 185–186, New York, NY, USA, 2016. ACM.
[67] Alexander S. Rose and Peter W. Hildebrand. Ngl viewer: a web application for molecular visualization. Nucleic Acids Research, 43(W1):W576–W579, 2015.
[68] Peter Schuster. How to search for rna structures theoretical concepts in evolutionary biotechnology. Journal of Biotechnology, 41(2):239 – 257, 1995.
[69] David B. Searls. The language of genes. Nature, 420(6912):211–217, Nov 2002.
[70] H Shi and P B Moore. The crystal structure of yeast phenylalanine trna at 1.93 a resolution: a classic structure revisited.
RNA, 6(8):1091–1105, 2000.
[71] T.F. Smith and M.S. Waterman. RNA secondary structure. Math. Biol., 42:31–49, 1978.
[72] David W Staple and Samuel E Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biol., 3:e213, 2005.
[73] J E Tabaska, R B Cary, H N Gabow, and G D Stormo. An RNA folding method capable of identifying pseudoknots and base
triples. Bioinformatics, 14:691–699, 1998.
[74] Manfred Tacker, Peter F. Stadler, Erich G. Bornberg-Bauer, Ivo L. Hofacker, and Peter Schuster. Algorithm independent
properties of RNA structure prediction. Eur. Biophy. J., 25:115–130, 1996.
[75] C. Tuerk, S. MacDougal, and L. Gold. RNA pseudoknots that inhibit human immunodeficiency virus type 1 reverse transcriptase. Proc. Natl. Acad. Sci. USA, 89(15):6988–6992, 1992.
[76] A. Uemura Y., Hasegawa, S. Kobayashi, and T. Yokomori. Tree adjoining grammars for RNA structure prediction. Theor.
Comp. Sci., 210:277–303, 1999.
[77] Graziano Vernizzi and Henri Orland. Large-N random matrices for RNA folding. Acta Phys. Polon., 36:2821–2827, 2005.
[78] J. Waldisp¨ uhl, B. Behzadi, and J.-M. Steyaert. An approximate matching algorithm for finding (sub-)optimal sequences in
s-attributed grammars. Bioinformatics, 18(suppl 2):S250–S259, 2002.
[79] M. S. Waterman. Secondary structure of single-stranded nucleic acids. Adv. Math. (Suppl. Studies), 1:167–212, 1978.
[80] E. Westhof and L. Jaeger. RNA pseudoknots. Curr. Opin. Struct. Biol., 2:327–333, 1992.
[81] Xiaojun Xu, Peinan Zhao, and Shi-Jie Chen. Prediction of RNA base pairing probabilities on massively parallel computers.
PLoS ONE, 2014.
[82] M Zuker. On finding all suboptimal foldings of an RNA molecule. Science, 244:48–52, 1989.
[83] M. Zuker and P. Stiegler. Optimal computer folding of larger RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9:133–148, 1981.
Unauthenticated
Download Date | 6/16/17 9:06 PM