Finding Consensus Energy Folding Landscapes Between RNA Sequences

University of Central Florida
Electronic Theses and Dissertations
Masters Thesis (Open Access)
Finding Consensus Energy Folding Landscapes
Between RNA Sequences
2015
Joshua Burbridge
University of Central Florida
Part of the Computer Engineering Commons
STARS Citation
Burbridge, Joshua, "Finding Consensus Energy Folding Landscapes Between RNA Sequences" (2015). Electronic Theses and
Dissertations. Paper 5032.
This Masters Thesis (Open Access) is brought to you for free and open access by STARS. It has been accepted for inclusion in Electronic Theses and
Dissertations by an authorized administrator of STARS.
FINDING CONSENSUS ENERGY FOLDING LANDSCAPES
BETWEEN RNA SEQUENCES
by
JOSHUA BURBRIDGE
B.S. University of Central Florida 2013
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Science
in the Department of Electrical Engineering and Computer Science
in the College of Engineering and Computer Science
at the University of Central Florida
Orlando, Florida
Summer Term
2015
Major Professor: Shaojie Zhang
© 2015 Joshua Burbridge
ABSTRACT
In molecular biology, the secondary structure of a ribonucleic acid (RNA) molecule is
closely related to its biological function. One problem in structural bioinformatics is to
determine the two- and three-dimensional structure of RNA using only sequencing
information, which can be obtained at low cost. This entails designing sophisticated
algorithms to simulate the process of RNA folding using detailed sets of thermodynamic
parameters.
The set of all chemically feasible structures an RNA molecule can assume, as well as the
energy associated with each structure, is called its energy folding landscape. This research
focuses on defining and solving the problem of finding the consensus landscape between
multiple RNA molecules. Specifically, we discuss how this problem is equivalent to the
problem of Balanced Global Network Alignment, and what effect a solution to this problem
would have on our understanding of RNA.
Because this problem is known to be NP-hard, we instead define an approximate consensus
on a landscape of reduced size, which dramatically reduces the search space associated
with the problem. We use the program RNASLOpt to enumerate all stable local optimal
secondary structures in multiple landscapes within a certain energy and stability range of
the minimum free energy (MFE) structure. We then encode these using an extended
structural alphabet and perform sequence alignment using a structural substitution matrix
to find and rank the best matches between the sets based on stability, energy, and structural
distance. We apply this method to twenty landscapes from four sets of riboswitches from
Bacillus subtilis in order to predict their native “on” and “off” structures. We find that this
method significantly reduces the size of the list of candidate structures and raises
the ranking of previously obscure secondary structures, resulting in more
accurate predictions overall. Advances in the field of structural bioinformatics can help
elucidate the underlying mechanisms of many genetic diseases.
ACKNOWLEDGMENTS
The author would like to thank his adviser and committee chair, Dr. Shaojie Zhang, for
his invaluable intellectual and moral support during the research process.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
CHAPTER ONE: INTRODUCTION
1.1 General Background
1.1.1 Dynamic Programming
1.1.2 RNA
1.1.3 RNA Secondary Structure
1.1.4 RNA Folding
1.1.5 Folding Landscapes
1.1.6 Folding Pathways
1.2 Consensus Folding Landscapes
1.3 Riboswitches
1.4 Predicting Native Structures of Riboswitches
CHAPTER TWO: PREVIOUS WORKS
2.1 RNASLOpt
2.2 RNAConSLOpt
2.3 GraphClust
CHAPTER THREE: METHODOLOGY
3.1 RNASLOpt
3.2 Brand nEw Alphabet for RNA (BEAR)
3.3 Substitution Matrices and MBR
3.4 Datasets
3.5 Software Pipeline
CHAPTER FOUR: RESULTS AND DISCUSSION
4.1 Ranks of Target Matches in Complete Pairwise Alignment
4.2 Comparison of Ranks to RNASLOpt Prediction
4.3 Analysis of Consensus Landscapes
4.4 Benchmarking
CHAPTER FIVE: CONCLUSIONS AND FUTURE WORK
5.1 Conclusions, Advantages, Disadvantages, and Limitations
5.2 Future Work and Alternative Approaches
APPENDIX: RNA SEQUENCES AND STRUCTURES USED
LIST OF REFERENCES
LIST OF FIGURES
Figure 1 - Example of the Secondary Structure of an RNA Molecule
Figure 2 - Examples of the Six Different Sub-Structural Elements
Figure 3 - MUSHI Pipeline
Figure 4 - Performance of RNASLOpt compared to performance of MUSHI
Figure 5 - Performance of RNASLOpt compared to performance of MUSHI (without adenine)
Figure 6 - Changes in target 'On' structures in adenine (graphed)
Figure 7 - Changes in target 'Off' structures in adenine (graphed)
Figure 8 - Changes in target 'On' structures in lysine (graphed)
Figure 9 - Changes in target 'Off' structures in lysine (graphed)
Figure 10 - Lysine native 'Off' structure
Figure 11 - Top structure returned by RNASLOpt for sequence lysine 3
Figure 12 - Top structure returned by RNASLOpt+MUSHI for lysine 3 after comparison with lysine 1
Figure 13 - Changes in target 'On' structures in TPP (graphed)
Figure 14 - Changes in target 'Off' structures in TPP (graphed)
Figure 15 - Changes in target 'On' structures in FMN (graphed)
Figure 16 - Changes in target 'Off' structures in FMN (graphed)
Figure 17 - Average effect of MUSHI on landscape size
Figure 18 - Structural similarity of RNASLOpt and RNAConSLOpt 'On' target structures to native 'On' structure
Figure 19 - Structural similarity of RNASLOpt and RNAConSLOpt 'Off' target structures to native 'Off' structure
LIST OF TABLES
Table 1 - RNASLOpt Parameters and Landscape Sizes
Table 2 - Results of aligning each landscape with native ‘On’ and ‘Off’ structures
Table 3 - Complete pairwise structural alignment for adenine ‘On’
Table 4 - Complete pairwise structural alignment for adenine ‘Off’
Table 5 - Complete pairwise structural alignment for lysine ‘On’
Table 6 - Complete pairwise structural alignment for lysine ‘Off’
Table 7 - Complete pairwise structural alignment for TPP ‘On’
Table 8 - Complete pairwise structural alignment for TPP ‘Off’
Table 9 - Complete pairwise structural alignment for FMN ‘On’
Table 10 - Complete pairwise structural alignment for FMN ‘Off’
Table 11 - Complete pairwise alignment rankings of ‘On’ target matches for different values of α, β, γ, gap, and bonus respectively
Table 12 - Complete pairwise alignment rankings of ‘Off’ target matches for different values of α, β, γ, gap, and bonus respectively
Table 13 - Effect of MUSHI on adenine landscape size
Table 14 - Changes in target ‘On’ structures in adenine (tabulated)
Table 15 - Changes in target ‘Off’ structures in adenine (tabulated)
Table 16 - Effect of MUSHI on lysine landscape size
Table 17 - Changes in target ‘On’ structures in lysine (tabulated)
Table 18 - Changes in target ‘Off’ structures in lysine (tabulated)
Table 19 - Effect of MUSHI on TPP landscape size
Table 20 - Changes in target ‘On’ structures in TPP (tabulated)
Table 21 - Changes in target ‘Off’ structures in TPP (tabulated)
Table 22 - Effect of MUSHI on FMN landscape size
Table 23 - Changes in target ‘On’ structures in FMN (tabulated)
Table 24 - Changes in target ‘Off’ structures in FMN (tabulated)
Table 25 - Performance of RNAConSLOpt on riboswitches from B. subtilis
LIST OF EQUATIONS
Equation 1 - Needleman-Wunsch recursive function
Equation 2 - Recursive function approximating landscape size
Equation 3 - Asymptotic formula approximating landscape size
Equation 4 - Recursive function for Nussinov’s algorithm
Equation 5 - Recursive function for Zuker-Sankoff’s algorithm
Equation 6 - Conserved topological interactions in global network alignment
Equation 7 - Total similarity score in balanced global network alignment
Equation 8 - Weighted sum in balanced global network alignment
Equation 9 - Covariance and conservation score from RNAalifold
Equation 10 - Bonus given to columns with compensatory mutations in RNAalifold
Equation 11 - Penalty given to columns not following the consensus structure in RNAalifold
Equation 12 - Log-odds score in substitution matrices
Equation 13 - Mattei-variant of the Needleman-Wunsch Algorithm
Equation 14 - Weighted sum ranking pairs of structures in MUSHI
Equation 15 - Stability ranking in MUSHI
Equation 16 - Energy ranking in MUSHI
Equation 17 - Structural similarity ranking in MUSHI
CHAPTER ONE: INTRODUCTION
1.1 General Background
Bioinformatics is one of the fastest-growing academic fields. Since the year 2000, rapid
advances in processing power coupled with dramatic decreases in the cost of DNA
sequencing technology have opened the floodgates for research institutions to carry out
various studies of the human genome. Very often, a new research study will provide insight
into one particular question while inevitably spawning multiple additional questions. Thus,
we expect the exponential explosion of novel problems in bioinformatics to continue for
some time. In this thesis, we formulate and suggest solutions to such a novel problem, one
which has thus far remained largely untouched by other computer scientists: finding the
consensus energy folding landscape between multiple RNA sequences. Before presenting
the problem statement, we briefly review the computational strategies used and provide
the background information necessary for understanding the problem. Because this
background is substantial, and in order to make this document accessible to readers from
both Computer Science and Biology backgrounds, Chapter 1 is somewhat extensive. The
reader is encouraged to skip over sections containing information with which they are
already well-acquainted.
1.1.1 Dynamic Programming
Dynamic programming is a classic computational strategy central to many different
problems in bioinformatics, including the problem of finding a consensus folding
landscape. First formally proposed by Richard Bellman in 1953, “dynamic programming”
is an ambiguous name for a very clever strategy. The characteristic approach for a dynamic
programming solution is to identify the different subproblems within a larger problem,
solve these smaller subproblems, and then combine the smaller solutions into a solution to
the overall problem. A classic example of the power of dynamic programming is its use in
generating Fibonacci numbers. The recursive definition of the Fibonacci sequence is F(n)
= F(n-1) + F(n-2). This definition splits each instance of the problem into two smaller
subproblems, and so a naïve recursive function to compute Fibonacci numbers would result
in O(2^N) time complexity. The power of dynamic programming lies in recognizing that
there are only N distinct subproblems (F_1, F_2, …, F_N), and so we can reduce the exponential
solution to a linear solution simply by memoizing the result of each distinct subproblem.
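To make the contrast concrete, the following minimal Python sketch places the naïve recursion next to a memoized version. The function names and the base cases F(1) = F(2) = 1 are our own assumptions for illustration; they are not taken from any particular library.

```python
def fib_naive(n):
    """Direct translation of F(n) = F(n-1) + F(n-2); exponential time."""
    if n <= 2:  # assumed base cases: F(1) = F(2) = 1
        return 1
    return fib_naive(n - 1) + fib_naive(n - 2)


def fib_memo(n, memo=None):
    """Same recurrence, but each distinct subproblem is solved only once."""
    if memo is None:
        memo = {}
    if n <= 2:
        return 1
    if n not in memo:
        # Store the result so later calls with the same n are O(1) lookups.
        memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]


if __name__ == "__main__":
    print(fib_naive(10), fib_memo(10))
```

Both functions return the same values, but only the memoized version remains practical as n grows, because it performs a number of additions that is linear in n.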
One of the earliest and most well-known uses of dynamic programming in bioinformatics
is the Needleman-Wunsch algorithm for global sequence alignment [1]. Biologists are often
interested in computing the “sequence identity” of two polymers, usually DNA, RNA, or
protein. Informally, the sequence identity specifies the similarity between two given
strings. An alignment algorithm seeks the best possible mapping from characters in the
first sequence to characters in the second, while also allowing insertion of gaps in either
sequence. The algorithm operates by recognizing that an optimal alignment of the two full
sequences must contain optimal alignments of their prefixes. We can proceed by first
examining the last characters of both sequences. There are then three possibilities: they can
be aligned, a gap can be inserted in the first sequence, or a gap can be inserted in the second
sequence. The recursive function for the Needleman-Wunsch algorithm can be written as
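\[
D(i, j) = \max
\begin{cases}
D(i-1, j-1) + s(x_i, y_j) \\
D(i-1, j) + g \\
D(i, j-1) + g
\end{cases}
\qquad
s(x_i, y_j) =
\begin{cases}
x_{\text{match}} & \text{if } x_i = y_j \\
x_{\text{mismatch}} & \text{if } x_i \neq y_j
\end{cases}
\tag{1}
\]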
where g is the penalty for inserting a gap, 𝑥𝑚𝑎𝑡𝑐ℎ is the bonus for aligning two identical
characters, and 𝑥𝑚𝑖𝑠𝑚𝑎𝑡𝑐ℎ is the penalty for aligning two different characters. The value of
these parameters can be adjusted to suit the needs of the specific application.
However, implementing the recursive function alone will not solve the problem. After
transforming D into an iterative procedure, we will end up with an (M+1) by (N+1) matrix,
where M and N are the lengths of the first and second sequences, in which D(m, n) is the
score of an optimal alignment of the first m characters of the first sequence with the first
n characters of the second. To recover the alignment that generated this score, we perform
a traceback procedure. While creating matrix D, we simultaneously create a traceback
matrix T, where each cell T(i, j) corresponds to cell D(i, j). Each cell in T contains three
Booleans, which we call left, up, and diagonal. The value of D(i, j) is the maximum of three
candidate scores derived from D(i-1, j), D(i, j-1), and D(i-1, j-1): the first two plus the gap
penalty, and the third plus the substitution score. For each candidate equal to the final
value of D(i, j), the corresponding Boolean is set to “true”, indicating where the value
D(i, j) was derived from, and ultimately marking the trail through the matrix that will
show us the optimal solution. When matrix D is filled, we begin the traceback procedure
in the lower right cell T(M, N), and check each Boolean in the cell. If we find that
T(M, N).left is true, the last character in the second sequence was matched with a gap in
the first sequence, so we record this as the last column of the alignment and move to
T(M, N-1). If T(M, N).diagonal is true, the last characters of both sequences were matched
with each other, so we record that pair as a column and move to T(M-1, N-1). Finally, if
T(M, N).up is true, we match the last character in the first sequence with an inserted gap
in the second sequence and move to T(M-1, N). Proceeding in this fashion from the lower
right corner of T to the upper left corner of T will generate an optimal alignment between
the two sequences, built from its last column to its first. It is important to note, however,
that there may be more than one optimal alignment. In order to recover all of them, we
need to write a procedure that can enumerate all distinct paths from T(M, N) to T(0, 0).
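The fill and traceback steps described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used later in this thesis: the function name, the default scores (match = 1, mismatch = -1, gap = -2), and the use of a set of move names in place of three Booleans are all assumptions made for the example. It recovers only one optimal alignment, preferring the diagonal move when several moves tie.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    # D[i][j]: best score for aligning x[:i] with y[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # T[i][j]: the moves ("diagonal", "up", "left") that achieve D[i][j].
    T = [[set() for _ in range(n + 1)] for _ in range(m + 1)]

    for i in range(1, m + 1):  # first column: x[:i] aligned against gaps only
        D[i][0] = i * gap
        T[i][0] = {"up"}
    for j in range(1, n + 1):  # first row: y[:j] aligned against gaps only
        D[0][j] = j * gap
        T[0][j] = {"left"}

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = {
                "diagonal": D[i - 1][j - 1] + s,  # align x_i with y_j
                "up": D[i - 1][j] + gap,          # x_i against a gap in y
                "left": D[i][j - 1] + gap,        # y_j against a gap in x
            }
            D[i][j] = max(candidates.values())
            T[i][j] = {mv for mv, sc in candidates.items() if sc == D[i][j]}

    # Traceback from (m, n) to (0, 0); columns are collected last to first.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if "diagonal" in T[i][j]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif "up" in T[i][j]:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return D[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(needleman_wunsch("ACGU", "AGU"))
```

Enumerating every optimal alignment would require following all moves stored in each cell of T instead of only the first one found, for example with a recursive or stack-based path search.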
We sometimes expand upon the penalty for adding a gap into the alignment with an affine
gap function. This is called the affine gap penalty, and it recognizes the fact that, from a
biological standpoint, it may be more difficult for a sequence to mutate and “open” a gap
(that is, to increase the length of the sequence) than to keep the characters aligned but
mismatched, or to extend a gap that is already open.
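As a hedged illustration, one common convention for an affine gap penalty charges a larger penalty for the first position of a gap and a smaller penalty for each additional position. The symbols g_open and g_ext below are introduced only for this example and are not used elsewhere in this thesis; other conventions exist that differ in whether the first position is also charged the extension penalty.

```latex
\[
\gamma(k) = g_{\text{open}} + (k - 1)\, g_{\text{ext}},
\qquad k \geq 1,
\qquad |g_{\text{open}}| > |g_{\text{ext}}|
\]
```

For instance, with g_open = -5 and g_ext = -1, a single gap of length 4 scores -5 + 3(-1) = -8, whereas four separate gaps of length 1 score 4(-5) = -20, so the alignment is encouraged to group gap characters into fewer, longer gaps.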
Clearly, the Needleman-Wunsch algorithm can be modified in many useful ways. This is
important to note, as we will present a modified version of this algorithm in chapter 3.
1.1.2 RNA
The primary focus of this thesis is to apply strategies like dynamic programming to a
particular subfield within bioinformatics called structural bioinformatics. In structural
bioinformatics, we are interested in characterizing the three-dimensional shape of various
biological molecules. The biological molecule of interest in this study is called ribonucleic
acid, also known as RNA. RNA plays a pivotal role in the process known to molecular
biologists as the “central dogma,” which we will briefly explain.
It is now nearly universal knowledge that the physical and behavioral traits organisms pass
down to their offspring are packaged in the form of genes. These genes are nothing more
than complex sequences of the four nucleotides adenine, guanine, cytosine, and thymine,
chemically bonded to create a long, double-stranded polymer known as deoxyribonucleic
acid, or DNA. Sequences of exactly three nucleotides in DNA called codons can be
translated into one amino acid, of which eukaryotes use twenty-one. Long sequences of
amino acids called polypeptides fold into complex shapes and become proteins, which
perform innumerable biological functions within the cell, many of which are still unknown.
However, DNA is not directly translated into protein. It relies on RNA as an intermediary.
In order for a gene to be expressed in the cell, the two strands of its corresponding sequence
in DNA must be pulled apart by an enzyme called RNA Polymerase. This enzyme attaches
to one strand of the DNA and moves down the sequence, pausing at each nucleotide in the
DNA to add the corresponding nucleotide to a growing chain matching the DNA sequence
to be copied. Once finished, RNA Polymerase disconnects itself from the DNA, and
releases the short molecule containing the copy of the sequence. The process of copying
the sequence is called transcription, and the molecule which is a direct copy of the sequence
is called RNA. RNA differs from DNA in three major ways. First, it uses the nucleotide
uracil in place of thymine. Therefore, if a DNA sequence were “GCGCATA”, the
corresponding RNA sequence would be “GCGCAUA”. Second, RNA has a hydroxyl
group attached to the 2’ position of its pentose ring, whereas DNA does not (hence the
prefix “deoxy-”). This makes RNA less stable than DNA. Third and most importantly for
this thesis, RNA is a single-stranded molecule, whereas DNA is double-stranded. This
structural difference makes RNA more flexible and able to fold into complex shapes, which
is often the key determinant of its function.
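The thymine-to-uracil substitution in the example above can be written as a one-line string operation. The sketch below is a deliberate simplification: the function name is hypothetical, and we assume the input is the DNA strand whose sequence matches the RNA product, so that no complementing or reversal is needed.

```python
def transcribe(dna):
    """Return the RNA sequence for a DNA sequence by replacing T with U."""
    dna = dna.upper()
    invalid = set(dna) - set("ACGT")
    if invalid:
        raise ValueError(f"not a DNA sequence, unexpected symbols: {sorted(invalid)}")
    return dna.replace("T", "U")


if __name__ == "__main__":
    print(transcribe("GCGCATA"))
```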
After transcription, the RNA molecule must be translated into an amino acid sequence to
complete the gene expression process. However, before this can happen, some important
pre-processing steps must occur. The most important of these steps is called RNA splicing.
Contrary to what was once popular belief, not all DNA directly codes for an amino acid
sequence. In fact, in the human genome, only about 1% of our DNA will eventually be
translated into protein. Approximately 25% of the remaining DNA has been associated
with regulatory elements and other non-coding portions of genes, but the function of much
of our genome is still unknown. What is clear, however, is that after a gene is transcribed,
molecules called small nuclear ribonucleoproteins (snRNPs) cut the RNA molecule into
fragments that can be classified as either exons or introns. Exons are stretches of RNA that
will be translated into amino acid sequences, whereas introns may perform a variety of
other functions.
Once the process of RNA splicing is complete, the exons are, in essence, “stitched” back
together, so that only the coding portion remains in the RNA molecule. After undergoing
some additional processing, the string of exons, now known as messenger RNA (mRNA),
exits the nucleus and is picked up by a ribosome. The purpose of the ribosome is to scan
through the RNA molecule and translate each codon into a single amino acid, thus building
a chain of amino acids in the same way that RNA Polymerase builds a chain of nucleotides.
This process is called translation. The central dogma refers to the entire process in which
DNA is transcribed into RNA, which is translated into proteins. While this is surely an
oversimplification of the process (indeed, some viruses, known as retroviruses, use an
enzyme called reverse transcriptase to reverse part of this process and insert foreign
sequences into the DNA), this basic explanation is sufficient for understanding the role
that RNA plays in the cell, and why we are interested in predicting its shape.
In general, RNA can be broadly classified into two categories, coding RNA and non-coding
RNA (ncRNA). ncRNA is defined as any RNA sequence that is not directly translated into
a polypeptide. There are many different classes of RNA molecules. Messenger RNA
(mRNA) is the concatenation of exons in a gene that undergo translation. This is coding
RNA. ncRNA contains many smaller groups. We have already discussed small nuclear
ribonucleoproteins (snRNPs), which cleave RNA into exons and introns. As the name
suggests, snRNPs are complexes that consist of RNA and proteins. The RNA portion of
this complex is referred to as small nuclear RNA (snRNA). These molecules are typically
about 150 nucleotides (nt) in length. Similarly, ribosomes are composed of RNA and
protein molecules. The RNA in ribosomes is called ribosomal RNA (rRNA). In order for
the ribosome to match a codon with a particular amino acid, it bonds with transfer RNA
(tRNA). tRNAs are a class of RNA molecules with a highly conserved cloverleaf shape.
There exists a specific tRNA for each possible pairing of one codon with one amino acid.
The tRNA’s structure has a region that allows it to bond to a codon in an mRNA sequence,
as well as a region that bonds to a specific amino acid. Thus, each tRNA acts as a sort of
grammatical rule by adding its particular amino acid to the growing polypeptide in the
ribosome when it recognizes the correct codon in the mRNA. Some RNA molecules can
silence the expression of genes by destroying the corresponding mRNA molecule in a
process called RNA interference (RNAi). The two most important types of these molecules
are called micro RNA (miRNA) and small interfering RNA (siRNA), both of which share
roughly the same function but usually have dramatically different structures, thus affecting
the specific mechanism by which they act. Small nucleolar RNAs (snoRNAs) function
primarily in processing other RNA molecules, such as rRNA.
More classes of RNA exist than have been described here, and still more have yet to be
discovered. Ultimately, there are two important facts that are central to this thesis. First, it
is clear that RNA plays many complex roles in the cell beyond the simple storage of protein
coding information. Second, and most importantly, the various functions of RNA
molecules are differentiated primarily by the structure of the molecule itself. Thus, the
more we understand about how RNA molecules acquire their three-dimensional shape, the
more we can infer about their roles in gene regulation and expression.
1.1.3 RNA Secondary Structure
The structures of RNA and other biological molecules are typically classified in terms of
primary, secondary, tertiary, and sometimes quaternary structure. An RNA molecule’s
primary structure refers to the specific sequence of the nucleotides it contains, and is simply
written as a string: GCGCAUA. The chemical structure of RNA allows it to be very flexible
such that, if it is energetically favorable, the molecule will fold back onto itself, allowing
some nucleotides to form hydrogen bonds with other nucleotides farther downstream. We
refer to these bonds as base pairs. In general, the following base pairs are energetically
favorable, in order from most stable to least stable: G-C, A-U, G-U. The varying stability
between these three possible base pairs is determined by the number of hydrogen bonds
the nucleotides can form. When guanine bonds with cytosine, both components are held
together by three hydrogen bonds, therefore conferring greater stability to the molecule as
a whole than an A-U base pair, which is joined by two hydrogen bonds. Occasionally, it is
possible for three nucleotides to form a more complex configuration in which all three are
bonded to each other, resulting in a base triplet. However, this is an uncommon occurrence,
and it is usually safe to assume that if any two nucleotides form a base pair together, those
nucleotides can no longer form base pairs with any other nucleotides in the sequence.
When describing the secondary structure of an RNA molecule, we typically number each
nucleotide from 1 to N, where N is the length of the sequence. The notation for referring
to a base pair is (i, j), where i and j refer to the i-th and j-th nucleotides in the sequence,
respectively. Graphically, we may represent an RNA sequence and its corresponding
structure as follows:
AAAGCGAAAGCGAAACGCAAAGCGAAACGCAAACGCAAA
...(((...(((...)))...(((...)))...)))...
This string representation corresponds to the following two-dimensional structure:
Figure 1 - Example of the Secondary Structure of an RNA Molecule
The string representation above is called dot-bracket notation. A dot in the i-th location
in the structure indicates that the i-th nucleotide in the sequence is not part of any base
pair, while an open or close parenthesis indicates that the associated nucleotide is paired
with the nucleotide associated with the matching parenthesis. Thus, a
secondary structure can be fully described by a sequence length N and a list of K base pairs
{(i_1, j_1), …, (i_K, j_K)}.
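Converting a dot-bracket string into its list of base pairs takes a single pass with a stack. The following Python sketch is illustrative only; the function name is our own, and positions are numbered from 1 to N to match the convention used in the text.

```python
def parse_dot_bracket(structure):
    """Return the base pairs (i, j), 1-indexed with i < j, of a dot-bracket string."""
    stack = []   # positions of '(' that have not yet been closed
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # pair with the most recent '('
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    sequence = "AAAGCGAAAGCGAAACGCAAAGCGAAACGCAAACGCAAA"
    structure = "...(((...(((...)))...(((...)))...)))..."
    for i, j in parse_dot_bracket(structure):
        print(i, j, sequence[i - 1] + "-" + sequence[j - 1])
```

Applied to the example structure above, this yields nine base pairs, each of which joins a G with a C.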
In addition to ignoring base triplets, we will also ignore pseudoknots. A pseudoknot is
defined as a pair of base pairs (i, j) and (i’, j’) such that i < i’ < j < j’. While such formations
do appear in some RNA molecules, pseudoknots are considered uncommon if not rare [2],
and excluding them from our model simplifies the problem of predicting secondary
structure dramatically. Thus, the set of all valid RNA structures can be described as the
language L ⊂ Σ*, where Σ = { ., (, ) }, consisting of all strings in which the parentheses are
balanced. This is a slight variation of the well-known Dyck language, which is context-free.
Longer, more complicated secondary structures have recurring motifs, which can be
classified into six categories: hairpin loops, stacks, internal loops, bulges, multiloops, and
dangling or unpaired bases. The diagram below shows an example of a structure with each
kind of motif. Because “motif” is a word with special connotations in sequence analysis,
we henceforth refer to these as secondary sub-structural elements (SSEs).
Figure 2 - Examples of the Six Different Sub-Structural Elements
1.1.4 RNA Folding
It is important to note that one sequence may have many different possible structures.
Based on the definition of admissible structures above, the size of the subset of words of
length N+1 in the language L satisfies the recurrence
$$T(N+1) = T(N) + \sum_{k=0}^{N-2} T(k)\, T(N-k-1) \tag{2}$$
where T(N) is the number of possible structures of a sequence of length N [3]. The first
term on the right hand side of the equation accounts for all possible structures where a new
nucleotide N+1 remains unpaired, and the summation accounts for all possible structures
when nucleotide N+1 forms a base pair with nucleotide k+1. We can also specify the base
cases as T(0) = T(1) = T(2) = 1.
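To make the recurrence concrete, the short Python sketch below evaluates it directly with the stated base cases. The function name `count_structures` is ours, chosen for illustration only. Python integers have arbitrary precision, so the counts remain exact even for long sequences.

```python
def count_structures(n_max):
    """Return [T(0), ..., T(n_max)] from recurrence (2) with T(0) = T(1) = T(2) = 1."""
    T = [1, 1, 1]
    for n in range(2, n_max):          # compute T(n + 1) from T(0), ..., T(n)
        total = T[n]                   # nucleotide n+1 remains unpaired
        for k in range(0, n - 1):      # nucleotide n+1 pairs with nucleotide k+1, k = 0..n-2
            total += T[k] * T[n - k - 1]
        T.append(total)
    return T[: n_max + 1]


T = count_structures(150)
print(T[:8])
```

The first few values are 1, 1, 1, 2, 4, 8, 17, 37, which already shows how quickly the number of admissible structures grows.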
This recurrence can be approximated by the formula [4]
$$T(N) \approx \sqrt{\frac{15 + 7\sqrt{5}}{8\pi}} \cdot N^{-\frac{3}{2}} \cdot \left(\frac{3 + \sqrt{5}}{2}\right)^{N} \tag{3}$$
which, for a sequence length of 150, equates to approximately 3 × 10^59, a colossal
number of structures for a relatively short sequence. However, because we normally place
further restrictions on the class of chemically feasible structures, this formula represents a
vastly overestimated upper bound. One such restriction is a
minimum on the number of nucleotides that must occur between the two members of any base pair.
Clearly, two adjacent nucleotides are already bonded, and cannot be considered in the same
base pair. Additionally, hairpin loops typically require at least three nucleotides between
the closing members of the base pair because electromagnetic forces between nucleotides
make it very difficult for the molecule to bend sharply enough to form a loop containing only
one or two nucleotides. Thus, three rules define the class of permissible RNA structures: (i) all
parentheses must be balanced, (ii) we do not consider pseudoknots, and (iii) for any base
pair (i, j), j – i ≥ 4.
Once we begin to consider the actual sequence information, further restrictions are placed
on the set of permissible structures. Most importantly, we only consider the following pairs:
G-C, A-U, and G-U. This property can alter the size of the set of permissible structures,
taking it from only one structure (if, for example, the entire sequence is composed of a
single nucleotide), up to a maximum that depends on the sequence itself. In general, the
constraints imposed by the sequence data reduce the size of the landscape (a term formally
defined in section 1.1.5) dramatically.
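The rules above can be checked mechanically. The following Python sketch tests whether a dot-bracket string is a permissible structure for a given sequence. The names `is_permissible`, `ALLOWED_PAIRS`, and `MIN_SPAN` are our own illustrative choices. Because a single bracket type cannot encode crossing pairs, rule (ii) is satisfied automatically by the stack-based matching.

```python
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}  # G-C, A-U, and G-U in either orientation
MIN_SPAN = 4                                          # rule (iii): j - i >= 4


def is_permissible(sequence, structure):
    """Check balance, minimum pair span, and allowed nucleotide pairs for one structure."""
    if len(sequence) != len(structure):
        return False
    stack = []                                  # 0-indexed positions of unmatched '('
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                return False                    # rule (i): unbalanced close
            i = stack.pop()
            if j - i < MIN_SPAN:
                return False                    # rule (iii): hairpin loop too short
            if sequence[i] + sequence[j] not in ALLOWED_PAIRS:
                return False                    # nucleotides cannot pair
        elif symbol != ".":
            return False
    return not stack                            # rule (i): unbalanced open


print(is_permissible("GGGAAACCC", "(((...)))"))   # three G-C pairs around a 3-nt loop
print(is_permissible("GGGAACCC", "(((..)))"))     # innermost pair spans only 3 positions
print(is_permissible("AAAAAAAAA", "(((...)))"))   # A cannot pair with A
```

Under these assumptions the three calls return True, False, and False, respectively. A sequence composed of a single nucleotide admits only the structure with no base pairs.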
Simple thermodynamics requires that complex molecules conform into energetically stable
structures. Thus, the first attempts at predicting RNA secondary structure focused on
selecting the structure with minimum free energy (MFE) from the set of permissible
structures. The earliest algorithm to achieve a reasonable solution to this problem was
given by Nussinov et al. in 1980 [5]. Because base pairs contribute to the stability of the
overall structure, this dynamic programming solution focuses on finding the structure in
which the number of base pairs is maximized. The recursive function for the algorithm
can be written
$$M(i, j) = \max \left\{ \begin{array}{l} M(i, j-1) \\ \max\limits_{i \le k < j} \left[ M(i, k-1) + M(k+1, j-1) + \delta(k, j) \right] \end{array} \right. \tag{4}$$
where δ(k, j) = 1 if the respective nucleotides can form a base pair and 0 otherwise. Note that
this scoring function does not differentiate base pairs by stability, but it can easily be modified
to do so. Filling this matrix requires O(N^3) time. However, base pairs are not the only elements
within a structure that affect its overall stability. SSEs in various forms such as hairpin
loops or internal loops destabilize the overall structure, but some loops destabilize it more
than others. Additionally, there may be multiple structures with the maximum number of
base pairs, and this solution would be unable to differentiate between them. Therefore, in
order to determine the true MFE structure, a good folding algorithm must account for a
larger set of parameters than Nussinov’s algorithm.
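The base-pair maximization recurrence can be sketched in a few lines of Python. This is an illustrative implementation, not the original code of Nussinov et al.: the function name `nussinov` and the `min_loop` parameter are our own additions. With the default `min_loop=3`, a pair (k, j) is only scored if j − k ≥ 4, matching rule (iii) above; setting `min_loop=0` recovers the recurrence exactly as written in (4).

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return the maximum number of base pairs in any nested structure of seq."""
    n = len(seq)
    if n == 0:
        return 0
    M = [[0] * n for _ in range(n)]

    def get(i, j):
        # Empty or single-nucleotide intervals contain no base pairs.
        return M[i][j] if i < j else 0

    for span in range(1, n):                 # fill by increasing interval length
        for i in range(n - span):
            j = i + span
            best = get(i, j - 1)             # case 1: j is unpaired
            for k in range(i, j - min_loop): # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    best = max(best, get(i, k - 1) + get(k + 1, j - 1) + 1)
            M[i][j] = best
    return M[0][n - 1]


print(nussinov("GGGAAACCC"))   # 3
print(nussinov("AAAA"))        # 0
```

The three nested loops make the O(N^3) running time visible. Note that the function returns only the score; recovering an actual structure would require a traceback through M, which is omitted here.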
A more sophisticated algorithm was developed by Zuker and Stiegler in 1981, refined by
Zuker and Sankoff in 1984, and again by Sankoff in 1985 [2, 3, 6]. This algorithm takes
into account the destabilizing energies of SSEs such as hairpin loops and internal loops. Its
recursive function can be written
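$$W(i, j) = \min \left\{ \begin{array}{l} V(i, j) \\ W(i+1, j) \\ W(i, j-1) \\ \min\limits_{i \le k \le j-1} \left\{ W(i, k) + W(k+1, j) \right\} \end{array} \right.$$

$$V(i, j) = \min \left\{ \begin{array}{l} eh(i, j) \\ V(i+1, j-1) + es(r_i, r_j, r_{i+1}, r_{j-1}) \\ \min\limits_{i < i' < j' < j,\; 2 < i' - i + j - j'} \left\{ V(i', j') + ebi(i, j, i', j') \right\} \\ \min\limits_{i+1 \le k \le j-2} \left\{ W(i+1, k) + W(k+1, j-1) \right\} + a \end{array} \right. \tag{5}$$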
where W(i, j) is the score of the best possible structure on subsequence [i…j], and V(i, j)
is the score of the best possible structure on subsequence [i…j] with the added assumption
that i and j form a base pair in that particular structure. Additionally, eh(i, j) is the energy
penalty of a hairpin loop closed by base pair (i, j), es(ri, rj, ri+1, rj-1) is the energy bonus
given by “stacking” base pairs (i, j) and (i+1, j-1) together, a function that depends
specifically on the nucleotides involved in the stacking, ebi(i, j, i’, j’) is the energy penalty
given by the internal loop or bulge closed by base pairs (i, j) and (i’, j’), and a is the energy
penalty given by opening a new multiloop structure. In simple terms, the matrix W
accounts for four possibilities: i and j form a base pair, i is unpaired, j is unpaired, or both i and
j are in a base pair, but not with each other. Note that the possibility that both i and j are
unpaired can be accounted for by choosing the second and third options consecutively. The
matrix V assumes i and j form a base pair, and then determines the SSE of minimum energy
that could be closed by (i, j). The advantage of this algorithm is that the functions eh, es,
ebi, and a can be continuously updated to reflect parameters given by new experiments.
Michael Zuker’s MFOLD software package uses a version of this algorithm which runs in
O(N^4) time.
However, while this algorithm has been used to successfully compute the MFE structure
of many different sequences, it has been shown that the MFE structure is not often the
native structure assumed by an RNA molecule [7]. This is because the native structure
depends on the dynamic folding process of the molecule, which can depend on how the
molecule interacts with external forces, and is thus extremely difficult to predict. Very
often, the RNA molecule will end up in a structure that is both energetically favorable and
stable, but is not necessarily the structure with the minimum possible energy given its
sequence. Thus, algorithms that find the MFE structure are limited, and we must continue
to search for new and creative ways to visualize and solve the problem.
1.1.5 Folding Landscapes
We discussed earlier the notion of a folding landscape, which is composed of all possible
structures an RNA molecule can take. The landscape can naturally be represented as a
graph in which any two structures that differ only by the addition or subtraction of a base
pair are connected by an edge. These edges can also be directed if it is desired that an edge
represents a transition from a structure with higher free energy to a structure with lower
free energy. The edges can also be weighted if it is desired that an edge should include the
difference in free energy between two structures. However, these are just alternate
representations of the same information. For the purposes of this research, we will represent
the landscape as an unweighted, undirected graph. Each node is encoded with the free
energy of its associated structure.
As previously discussed, the size of the folding landscape is enormous, but finite.
RNAsubopt, introduced by Wuchty et al. in 1999 [8], is capable of enumerating all
suboptimal structures in the energy range from the MFE to some arbitrary upper limit.
Additionally, BARRIERS, introduced by Flamm et al. in 2002 [9], is capable of
constructing the exact energy landscape by establishing that the topology of the landscape
always forms a hierarchy, and so the landscape can be represented in a form called a barrier
tree. Therefore, it is possible to fully realize an actual folding landscape. However, the size
of a landscape for any RNA sequence of sufficient length is overwhelmingly large, so we
are still in essence restricted to working with the portion of the landscape within a certain
percentage of the MFE, rather than the full landscape.
To solve the problem of finding a consensus folding landscape, we must first start with a
rigorous mathematical definition of a landscape as well as its interesting features. Flamm
et al. have already provided such definitions [9], and the relevant ones will be explained
here.
Let a configuration space be a graph G(V, E), where each v ∈ V represents a unique
secondary structure, and each pair of vertices v1, v2 ∈ V whose associated structures in dot-bracket notation differ only by the addition or deletion of a single base pair is connected
by an unweighted, undirected edge e ∈ E. Then a landscape L(G, f) is defined by a
configuration space and a function f: V ⟶ ℝ. In the case of our RNA landscape, this
function f is precisely the function that calculates the free energy of a structure.
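The edge set of the configuration space can be generated on the fly rather than stored. As a sketch, the following Python function enumerates the neighbors of a structure, represented as a set of 0-indexed base pairs, by deleting each existing pair or adding each admissible new pair. The names `neighbors` and `crosses`, and the default `min_span=4`, are our own assumptions, chosen to match the permissibility rules of section 1.1.4. The energy function f is not modeled here.

```python
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def crosses(p, q):
    """True if base pairs p and q form a pseudoknot (i < k < j < l)."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def neighbors(sequence, pairs, min_span=4):
    """Return all structures that differ from `pairs` by exactly one base pair."""
    pairs = frozenset(pairs)
    paired = {x for p in pairs for x in p}      # nucleotides already in a base pair
    result = set()
    for p in pairs:                             # deletion of one base pair
        result.add(pairs - {p})
    n = len(sequence)
    for i in range(n):                          # addition of one base pair
        for j in range(i + min_span, n):
            if i in paired or j in paired:
                continue                        # no base triplets
            if sequence[i] + sequence[j] not in ALLOWED_PAIRS:
                continue
            if any(crosses((i, j), q) for q in pairs):
                continue                        # no pseudoknots
            result.add(pairs | {(i, j)})
    return result


seq = "GGGAAACCC"
print(len(neighbors(seq, set())))        # structures reachable from the open chain
print(len(neighbors(seq, {(0, 8)})))     # one deletion plus four nested additions
```

For this toy sequence the open chain has 9 neighbors, and the structure containing only the pair (0, 8) has 5.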
There are a couple of intuitive assumptions we can make about identifying the properties
of native structures from a folding landscape. The first and most obvious is that a native
structure will be locally optimal, or equivalently a local minimum in the landscape. This
makes sense because any structure that is not a local minimum can spontaneously transform
into a structure that is a local minimum by adding another base pair, much like a ball rolling
down a hill. Thus, a folding RNA molecule will not come to rest in a conformation that is
not a local minimum. The second assumption is that the native structure should be stable.
Specifically, the stability of a structure x refers to the difference in energy between x and the
height of the lowest saddle point between x and any other local minimum y ∈ V. This is an
important feature of the landscape because for RNA molecules to function reliably, they
should be strongly resistant to external forces that may act on the structure, breaking some
base pairs and possibly pushing the molecule into a new energy basin and structure.
Because local optimality and stability are two independent properties, it is important to
note that for two local minima A and B, A may have greater total energy than B, but it may
also be more stable. The third assumption deals with the notion of accessibility of a local
minimum. That is, it may be possible to have a local minimum with low total energy which
is also highly stable, but for which the probability that a random structure in V lies in its
associated basin is far less than the probability of the same event for a different basin. If
we visualize an energy basin as generally assuming the shape of a bowl, the accessibility
property refers to the width of the bowl. Flamm et al. have given us explicit ways to
measure these three properties, as we will now see.
The neighbors of a vertex v ∈ V can be defined by the set ∂v = ∂{v} = {y ∈ V | {v, y} ∈
E}. This definition extends to the neighboring set of a connected component: ∂A = {y ∈ V
\ A | ∃x ∈ A: {x, y} ∈ E}. ∂A is called the boundary of A. The neighborhood of A is the
union of set A and its boundary. That is, N(A) = A ∪ ∂A. A vertex x is a local minimum if
f(x) ≤ f(y) for all y ∈ ∂x. If the inequality is changed to f(x) < f(y), x is called a strict local
minimum. M is the set of all local minima in the landscape. As mentioned, it is assumed
the vertex representing the native structure of an RNA molecule belongs to M. A landscape
is non-degenerate or invertible if f(x) = f(y) → x = y ∀x,y ∈ V. Some degenerate RNA
folding landscapes exist. A walk of length k in the landscape is a sequence of vertices p =
{x1, x2, …, xk} such that each xi ∈ V and {xi, xi+1} ∈ E. The set of all possible walks
between vertices x and y is denoted by Pxy. Vertices x and y are mutually accessible at height n if
there exists a walk p ∈ Pxy such that f(z) ≤ n for all z ∈ p. Finally, the saddle height f̂(x,
y) between two vertices x and y is the minimum height at which they are mutually
accessible. That is, $\hat{f}(x, y) = \min_{p \in P_{xy}} \max_{z \in p} f(z)$.
With these definitions, we can measure the three characteristics discussed previously. A
structure is locally optimal if its associated vertex is a local minimum in the landscape. The
stability of a structure x can be represented as the difference between f(x) and the minimal
saddle height between x and any other local minimum. When discussing the saddle height between two adjacent local minima,
the number represents the energy required to transform one structure into the other (for the
purposes of this thesis, “adjacent” means that the shortest walk between two local minima
can be divided into one ascending phase followed by one descending phase).
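The definitions above translate directly into code on a small explicit graph. The Python sketch below is illustrative only: the five-vertex landscape and its energies are invented, and the function names are ours. The saddle height is computed with a Dijkstra-style search in which the cost of a walk is the maximum of f along it rather than a sum, which is one standard way to solve such a min-max path problem. It is not the flooding algorithm used by BARRIERS.

```python
import heapq


def local_minima(graph, f):
    """Vertices x with f(x) <= f(y) for every neighbor y."""
    return {x for x in graph if all(f[x] <= f[y] for y in graph[x])}


def saddle_height(graph, f, x, y):
    """Minimum over walks from x to y of the maximum energy along the walk."""
    best = {x: f[x]}
    heap = [(f[x], x)]
    while heap:
        height, v = heapq.heappop(heap)
        if v == y:
            return height
        if height > best[v]:
            continue                      # stale heap entry
        for w in graph[v]:
            h = max(height, f[w])         # highest point seen on the walk so far
            if h < best.get(w, float("inf")):
                best[w] = h
                heapq.heappush(heap, (h, w))
    return float("inf")                   # x and y are not connected


def stability(graph, f, x):
    """Lowest saddle between x and any other local minimum, relative to f(x)."""
    others = local_minima(graph, f) - {x}
    return min(saddle_height(graph, f, x, y) for y in others) - f[x]


# Hypothetical toy landscape: a path a - b - c - d - e with made-up energies.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
f = {"a": -5, "b": -2, "c": -4, "d": -1, "e": -6}

print(sorted(local_minima(graph, f)))                   # ['a', 'c', 'e']
print(saddle_height(graph, f, "a", "e"))                # -1
print({x: stability(graph, f, x) for x in "ace"})       # {'a': 3, 'c': 2, 'e': 5}
```

In this toy example, c has lower energy than the saddle b by only 2 units, so it is the least stable of the three minima even though it is not the highest in energy, illustrating that local optimality and stability are independent properties.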
1.1.6 Folding Pathways
Assuming that the dynamic folding process occurs by the discrete addition or subtraction
of individual base pairs, it can be modeled as a path through the landscape. Here, discrete
means that no two base pairs are altered at precisely the same time, so a path denotes the
series of transformations an RNA molecule assumes during the folding process. Naturally,
we are primarily interested in the end point of this path, which we assume lies in a stable
energy basin. However, this can depend on the molecule’s starting position. If we assume
in vitro folding, the molecule begins in a “flat” conformation, containing no base pairs.
Then, small fluctuations in thermal noise allow the molecule to begin folding, travelling a
path in the landscape that is almost entirely determined by sequence information. However,
this is not necessarily the case in vivo. In the cell, there may be many external forces that
can interfere with the folding process. More importantly, when RNA molecules are
assembled in the transcription process, the free-floating portion of the molecule may begin
to fold before the rest of the molecule is assembled. Because each nucleotide added to the
sequence adds a new set of structures to the folding landscape, the RNA molecule is
constantly jumping between different landscapes until it arrives at the final one. The
structure of the molecule at this point is then considered its starting point in the final
landscape, and in the absence of any external stabilizing forces can drastically alter the end
point of the path, or the final structure. This is akin to dropping a marble directly above
different locations on a complex topological surface. Assuming the marble can only roll
downhill (that is, there are no external forces acting upon the system), choosing an arbitrary
starting location for the marble necessarily excludes a set of final locations. Thus, even if
we could perfectly determine the step an RNA molecule may take at each point in the
landscape, we still cannot predict the final vertex without knowing the starting vertex.
The accessibility property discussed in the previous section has important implications for
folding pathways. If we assume that the folding pathway for an RNA molecule ends in a
local minimum that is highly inaccessible, there is a greater probability that some external
force may act on the molecule before it reaches the basin, and because the basin is so
inaccessible, the molecule may never recover from this error and will end up misfolded.
Thus, it seems clear that pressure from natural selection would also drive the creation of
nucleotide sequences for which the target structure in the folding landscape lies in a highly
accessible basin, therefore minimizing the occurrences of a misfold.
1.2 Consensus Folding Landscapes
At last, we are ready to move to the primary focus of this thesis: defining and solving the
problem of finding the consensus folding landscape between two or more RNA sequences.
To the best of our knowledge, Li et al. [10] are the only group who have researched this
subject, but even in that paper, the problem is not formally defined. This thesis will give a
formal definition of the problem, explain the efficacy of some solutions on constrained
versions of the problem, and discuss what implications solving this problem could have for
the field of structural bioinformatics.
Because we have already defined a folding landscape as a graph, the intuitive notion for a
consensus folding landscape is that it should be an alignment of two graphs that conserves
a maximal amount of correspondence between them. If two landscapes can be said to have
a consensus in some region, then both of the associated RNA molecules should be governed
by the same set of folding dynamics when they are in those regions. How can we formulate
this problem? Clearly, one aspect should involve comparing the nodes of each graph, and
finding the best correspondence between the two sets. The best correspondence could be
defined in terms of structural similarity, total energy level, stability, or some combination
of these factors. In fact, this is exactly the approach we take in Chapter 3. However, finding
such a correspondence is not enough. Because a folding landscape defines how a molecule
behaves during the folding process, a consensus landscape should also define how both
molecules behave when they are in the consensus zone. Therefore, a consensus landscape
should maximize the size of a correspondence between the vertices of two networks while
also conserving topological information.
Although no one has yet defined the problem of finding a consensus folding landscape, it
is somewhat obvious from the previous discussion that this problem is identical to the
Global Network Alignment (GNA) problem. This problem is defined rigorously in relation
to protein-protein interaction networks by Zaslavskiy et al. [11]. The definition we are about
to give is more or less a transcription of their definition.
Assume we are trying to find some global alignment between graphs G and H. We begin
by assuming G and H have the same number of vertices. Even though this is rarely the
case, we can “simulate” this property by adding some number of “dead” vertices to each
graph so that they have the same order. If any particular alignment maps a “live” vertex in
G to a dead vertex in H, this means that the associated structure in G does not map to any
real structure in H. Now that the vertex sets are the same size, we are looking for a bijection
between the two sets that maximizes the number of conserved topological interactions.
Such a mapping is given by a permutation π of {1, 2, …, N}, which are the vertices of G.
In each permutation, the i-th vertex of G is mapped to the π(i)-th vertex of H. Each
permutation π can also be represented by a permutation matrix P, where $P_{ij} = 1$ if and only
if π(i) = j. Then, the set of all possible permutations is defined by
$\mathbf{P} = \{P \in \{0,1\}^{N \times N} \mid P\mathbf{1}_N = \mathbf{1}_N,\ P^T\mathbf{1}_N = \mathbf{1}_N\}$,
where $\mathbf{1}_N$ is the column vector with N entries all equal to 1. For clarity,
the identity matrix $I_{N \times N}$ represents the permutation where the first vertex of G is mapped to
the first vertex of H, the second vertex of G is mapped to the second vertex of H, and
so on. The two conditions on the right side of the definition of P mean that each distinct
row and each distinct column in a permutation matrix P must sum to exactly one. This is
because each vertex in G must map to exactly one vertex in H, and the relationship is
bidirectional.
We can score each permutation by recording the number of interactions it conserves. That
is, we are interested in the number of vertex pairs (i, j) that are connected in G where π(i)
and π(j) are also connected in H. We denote this number by J(P), and it is clear that we
seek to maximize this quantity. If we apply the permutation encoded by P to the graph H,
we can obtain a new graph isomorphic to H, denoted by P(H). This is because the
permutation simply shuffles the vertex labels. Note that this is not equivalent to multiplying
$P A_H$. Instead, the adjacency matrix $A_{P(H)}$ of the permuted graph can be obtained by
multiplying $P A_H P^T$ [12]. Because adjacency matrices are symmetric, J(P) is then equivalent
to half the number of entries in both $A_G$ and $A_{P(H)}$ that are simultaneously equal to 1.
Formally,
$$J(P) = \frac{1}{2} \sum_{i,j=1}^{N} [A_G]_{ij}\, [A_{P(H)}]_{ij} \tag{6}$$
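As a small worked example of equation (6), the Python sketch below builds the permutation matrix for a mapping π, forms the permuted adjacency matrix by the product P A_H P^T, and counts the conserved interactions. It uses plain nested lists so that it has no dependencies. The helper names and the two three-vertex graphs are our own, and vertices are 0-indexed for convenience.

```python
def matmul(A, B):
    """Product of two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def permutation_matrix(pi):
    """P[i][j] = 1 if and only if pi[i] == j."""
    n = len(pi)
    return [[1 if pi[i] == j else 0 for j in range(n)] for i in range(n)]


def conserved_interactions(A_G, A_H, pi):
    """J(P): number of edges (i, j) of G whose images (pi[i], pi[j]) are edges of H."""
    P = permutation_matrix(pi)
    A_PH = matmul(matmul(P, A_H), transpose(P))   # entry (i, j) equals A_H[pi[i]][pi[j]]
    n = len(pi)
    return sum(A_G[i][j] * A_PH[i][j] for i in range(n) for j in range(n)) // 2


# G is the path 0 - 1 - 2; H is the path 0 - 2 - 1 (vertex 2 in the middle).
A_G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_H = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

print(conserved_interactions(A_G, A_H, [0, 1, 2]))   # identity mapping conserves 1 edge
print(conserved_interactions(A_G, A_H, [0, 2, 1]))   # mapping the centers together conserves 2
```

The identity mapping conserves only one edge, while the mapping that sends the middle vertex of G to the middle vertex of H conserves both.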
Zaslavskiy et al. [11] define two formulations of this problem. The first is referred to as the
Constrained Global Network Alignment problem, and can be applied in the case where we
have a list of candidate matchings between vertices, and we simply want to disambiguate
the set of matchings by disallowing some correspondences, which allows us to rule out
large numbers of permutations simultaneously. For the purposes of finding a consensus
folding landscape, however, we can make no such obvious constraints on which structures
can match. Therefore, we will focus on the second formulation of the problem given by
Zaslavskiy et al. [11], which is called the Balanced Global Network Alignment problem.
This formulation takes into account the degrees of similarity between all vertices in the
network, and allows for the trade-off or balancing between vertex similarity and
topological information. Assuming we have an N × N similarity matrix C in which $C_{ij}$
denotes the similarity score between vertex i ∈ G and j ∈ H, the total similarity for any
permutation P can be denoted
$$S(P) = \sum_{i=1}^{N} C_{i,\pi(i)} \tag{7}$$
Finally, we can state the BGNA problem as the optimization of the following weighted
sum
$$\max_{P \in \mathbf{P}} \; \lambda J(P) + (1 - \lambda) S(P) \tag{8}$$
where λ is a weighting factor determining the relative importance of topological information
and vertex similarity, respectively.
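For very small graphs, the weighted objective can be maximized by exhaustive search, which makes the trade-off controlled by λ easy to see. The sketch below is purely illustrative: it enumerates all N! permutations, so it is only usable for a handful of vertices, and it does not model the padding with dead vertices. The function name, the toy graphs, and the similarity matrix are our own assumptions.

```python
from itertools import permutations


def bgna_brute_force(A_G, A_H, C, lam):
    """Return (pi, score) maximizing lam * J + (1 - lam) * S over all permutations."""
    n = len(A_G)
    best_pi, best_score = None, float("-inf")
    for pi in permutations(range(n)):
        # J: conserved edges, counted as half of the matching adjacency entries.
        J = sum(A_G[i][j] * A_H[pi[i]][pi[j]] for i in range(n) for j in range(n)) / 2
        # S: total similarity of the matched vertex pairs.
        S = sum(C[i][pi[i]] for i in range(n))
        score = lam * J + (1 - lam) * S
        if score > best_score:
            best_pi, best_score = pi, score
    return best_pi, best_score


# G is the path 0 - 1 - 2; H is the path 0 - 2 - 1.
A_G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_H = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
# Hypothetical similarity matrix that rewards matching vertices with equal labels.
C = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(bgna_brute_force(A_G, A_H, C, lam=1.0))   # topology only
print(bgna_brute_force(A_G, A_H, C, lam=0.0))   # vertex similarity only
```

With λ = 1 the search returns the mapping (0, 2, 1), which conserves both edges, whereas with λ = 0 it returns the identity mapping, which maximizes vertex similarity but conserves only one edge.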
If the Balanced GNA problem could be solved exactly, it would illuminate many aspects
of RNA folding landscapes. Currently, little is known about quantifying the change enacted
on a landscape by a change in the RNA sequence. Two identical RNA sequences also have
identical folding landscapes, but if we change a single nucleotide in the second sequence,
how dramatically does its landscape change? We could quantify this using the change in
the best alignment given by a solution to BGNA. We could then measure the point at which
two landscapes “diverge” from each other due to differences in the primary sequence. This
could establish a sequence identity threshold between two RNA molecules, below which
any consensus folding landscape of sufficient size could be said to be statistically
significant. In clearer terms, we mean that the consensus landscape between two sequences
that differ only by a single nucleotide is more likely to be a result of high sequence identity
between the two molecules, rather than having any biological significance. However, if
two molecules with lower sequence identity share a relatively large consensus, this
information is likely to clue us in on significant biological structures, reducing the list of
candidate structures, and improving native structure prediction significantly.
Unfortunately, we can currently only speak of these advantages in general terms, because
the intractability of BGNA prohibits us from quantifying exactly how much a solution
would help us.
The Balanced GNA problem is known to be NP-hard. The number of possible permutations
is N!, where N is even greater than the estimated size of the landscape discussed in section
1.1.4 due to the additional dead vertices required by the problem formulation. A number
of approximate solutions to the problem have been proposed [11, 13, 14], but even these
are incapable of processing RNA folding landscapes, which are enormous in size. Thus,
we will also propose an approximate solution specific to RNA folding landscapes, in which
we perform a similar matching on a landscape whose size has been dramatically reduced
by predicting and only considering the stable, locally optimal structures. The method for
predicting these “points of interest” on the landscape is discussed in section 2.1, and
methods for redefining and finding a consensus are discussed in chapter 3.
1.3 Riboswitches
The transcription and translation of mRNA is largely regulated by ncRNA. However,
mRNA molecules are also capable of self-regulation. Riboswitches are cis-regulatory
elements usually found in the 5’ UTR of mRNA molecules. Most known riboswitches exist
in the RNA of bacteria, though the existence of at least one riboswitch has been verified in
eukaryotes [15]. Riboswitch sequences are relatively short (usually around 150 nt long)
and consist of two primary domains. The aptamer domain is responsible for binding a very
specific ligand, whose presence in the cellular solution triggers changes in mRNA
processing. The second domain is the expression platform, which is the portion of the
sequence that can act as an interface between mRNA and ribosomes or other cellular
machinery. The aptamer domain and expression platform overlap in the sequence. The
overlapping portion is called the switching sequence, and it undergoes structural changes
in response to the binding of a ligand to the aptamer domain. The result is that the
riboswitch can assume two native structures, commonly referred to as the “on” and “off”
positions. Thus, while most ncRNA molecules assume one stable, locally optimal structure
in the folding landscape, riboswitches assume two native structures, and the binding of the
ligand to the aptamer domain provides a mechanism for the riboswitch to follow a path
from one energy basin to another and back. The exact mechanism by which this structural
change occurs depends on the riboswitch, and requires detailed knowledge of the atomic
structure of the molecule.
There are three primary mechanisms by which riboswitches can regulate RNA. The first is
transcription termination. This occurs after the riboswitch has been transcribed from a
DNA sequence but before the rest of the gene has been transcribed by RNA polymerase.
In the presence of the appropriate ligand, the switching sequence may form what is known
as a rho-independent hairpin loop, which causes the transcription process to stall. The
hairpin loop is then followed by a polyuracil chain, which destabilizes the transcription
complex, letting it detach from the DNA strand and ending the process prematurely.
Because this step occurs before the alternative splicing of the gene, the riboswitch also
disallows the introns from being transcribed, and therefore it also regulates ncRNA. The
second mechanism is translation inhibition. In this case, the expression platform coincides
with the ribosome-binding site on the mRNA, and the switching sequence partially alters
its structure. This makes the mRNA unable to bind a ribosome, prohibiting translation
from occurring. The third mechanism is ribozyme activation. In this case, the riboswitch
acts as a ribozyme, and when activated by a ligand, automatically cleaves itself, effectively
destroying the mRNA in the process. Riboswitches perform regulatory duties in many other
ways, but these are the most common.
1.4 Predicting Native Structures of Riboswitches
In this thesis, we are interested in using an approximate solution to the global network
alignment problem to predict the native structures of different riboswitches. The rationale
behind this is that different RNA sequences may regulate the same genes and bind the same
ligands, and are therefore considered the same riboswitch, even though the primary
structure of these sequences may differ somewhat. Because each riboswitch must bind a
single particular ligand, the structure of the aptamer domain must be very strongly
conserved among all sequences comprising the same riboswitch. The riboswitch would not
function correctly if it were to accidentally bind the wrong ligand. Thus, though the
different manifestations of the same riboswitch may differ in sequence and therefore
folding landscape, we can make a strong assumption that there must be two structures
common to both landscapes: the native “on” and “off” structures of the riboswitch. Current
methods of secondary structure prediction tend to focus on the candidate structures derived
from a single sequence, and the result is that viable theoretical candidates may be returned,
but without having any biological significance. However, if we compare the points of
interest across multiple landscapes, we may be able to narrow our pool of candidate
structures.
One possible limitation to this approach is that the sequences for a riboswitch may have
high (>90%) sequence identity, which may mean that the landscapes for each sequence are
very similar, and therefore finding a consensus among them may not yield much useful
information. Essentially, if we are interested in the intersection between two sets, and the
sets happen to be very similar, the intersection will be large, which may not help us in our
search. However, if the folding landscapes of two sequences diverge rapidly with
decreasing sequence identity, the intersection is likely to be very small, which would
improve structure prediction. The validity of such speculation remains to be seen.
CHAPTER TWO: PREVIOUS WORKS
There are a number of previous related works, some of which form a strong basis for this
research. However, because of the novelty of the problem, very little work has been
performed to directly solve the problem of finding a consensus landscape, except for
RNAConSLOpt, discussed in section 2.2.
2.1 RNASLOpt
RNASLOpt is a program published in 2011 by Dr. Yuan Li and Dr. Shaojie Zhang at the
University of Central Florida [7]. The problem that their research seeks to address is how
to deal with the prohibitively large size of the energy folding landscape. RNASLOpt filters
out the vast majority of these extraneous structures by computing the Stable, Locally
Optimal structures in the landscape, hence the acronym RNASLOpt. The assumptions
underlying the methodology are simple: the native structure should not spontaneously
change into another structure that is thermodynamically more favorable (it should be
locally optimal), and the barrier energy of the native structure should be sufficiently high
that the secondary structure cannot be altered by thermal noise or other unpredictable
events (it should be stable).
RNASLOpt takes an RNA sequence and uses Bafna’s algorithm [16] to compute all
possible putative stacks in O(n^2) time within user-defined parameters. In order to
significantly reduce the time of the computation, Li et al. do not consider stacks with length
less than 4, because Bafna showed that the fraction of stacks missed with this cutoff is less
than 10%. It is important to note that the native structures of the riboswitches we use in
this thesis contain a small number of stacks of length 2 and 3, as well as isolated base pairs,
so we will have to accept this as merely an approximation.
Using this list of putative stacks, RNASLOpt can use two separate algorithms to compute
optimal configurations of these stacks, one based on the Nussinov model [5], and the other
based on the Turner energy model [17], and also partially based on the Zuker-Sankoff
algorithm [3]. A configuration is considered locally optimal when no new stacks can be
added to the structure without conflicting with another stack or forming a pseudoknot.
RNASLOpt returns the list of locally optimal structures. It then computes the stability of
each of these structures, determined by the height of the lowest saddle point adjacent to
each energy basin. If this energy barrier is less than ΔB, the structure is discarded. Finally,
the remaining structures are ranked according to stability.
Li et al. benchmark RNASLOpt using seven riboswitches (Adenine-BS, Adenine-VV,
Guanine, SAM, C-di-GMP, Lysine, and TPP) against other RNA secondary structure
prediction software, including mfold, RNAShapes, and RNAlocopt. In all cases except
Lysine, RNASLOpt ranked the correct native structure higher than its competitors,
indicating that its unique way of estimating the “points of interest” in the landscape is
sound. Therefore, the methodology of this thesis will begin with results returned from
RNASLOpt and attempt to optimize them, although the data sets will come from the paper
summarized in the next section.
2.2 RNAConSLOpt
Li et al. expanded on RNASLOpt by creating RNAConSLOpt [10], which, based on a
thorough review of the literature, is currently the only software package that addresses the
problem of finding a consensus folding landscape. The input to the program is a multiple
sequence alignment. RNAConSLOpt then analyzes the alignment using the covariance and
conservation score introduced in RNAalifold by Hofacker et al. [18]. Specifically, if we
visualize the multiple sequence alignment as a matrix in which each row contains a
sequence and each column contains a set of nucleotides that have been aligned, we can
calculate the covariance and conservation score between columns i and j, denoted by γij,
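using the following formula:
\[ \gamma_{ij} = \frac{1}{n}\left(C_{ij} - \phi_1 q_{ij}\right) \tag{9} \]
where ϕ1 is a weighting factor set to 1 by default, n is the number of sequences in the
alignment, and Cij is defined by
\[ C_{ij} = \frac{2}{n-1} \sum_{1 \le k < l \le n} \begin{cases} d(a_k^i, a_l^i) + d(a_k^j, a_l^j) & \text{if } (a_k^i \cdot a_k^j) \text{ and } (a_l^i \cdot a_l^j) \\ 0 & \text{otherwise} \end{cases} \tag{10} \]
where a_k^i refers to the ith nucleotide in sequence k, (a_k^i ∙ a_k^j) means the associated
nucleotides can form a base pair, and d is the Hamming distance used in RNAalifold [18]:
d(a_k^i, a_l^i) = 0 if the nucleotides are equal and 1 if they are not. Finally, qij is defined by
\[ q_{ij} = \sum_{1 \le k \le n} \begin{cases} 0 & \text{if } a_k^i \cdot a_k^j \\ 0.25 & \text{if both } a_k^i \text{ and } a_k^j \text{ are gaps} \\ 1 & \text{otherwise} \end{cases} \tag{11} \]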
In other words, Cij represents a bonus given to columns with compensatory mutations that
keep the consensus structure, while qij is a penalty assessed to columns that do not follow
the consensus structure. RNAConSLOpt computes γij for all possible pairs of distinct columns
in the alignment, and uses this value in a modified version of the recursive function from
[7] to compute locally optimal configurations of consensus stacks. Using the same heuristic
from the 2011 paper, RNAConSLOpt calculates the barrier energy of each ConLOpt
structure to determine the ConSLOpt structures. Because this methodology takes into
account stacks shared between multiple sequences, the number of structures is reduced in
comparison to RNASLOpt, essentially cutting out most of the structures that are simply a
consequence of the RNA sequence, and leaving the biologically relevant structures. It is
important to note that while other tools such as LocARNA [19-21] are already capable of
computing the consensus structure of a multiple sequence alignment, these tools focus
exclusively on the best possible consensus structure, and are therefore inappropriate to
apply to RNA molecules that have alternate functional structures, such as riboswitches.
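The score in equations (9) through (11) can be sketched for a single pair of alignment columns as follows. This is only an illustration, not the RNAConSLOpt implementation: the function names are hypothetical, G-U wobble pairs are assumed to count as valid base pairs, and d is taken to be the Hamming distance (1 for differing nucleotides), which is an assumption of the sketch.

```python
from itertools import combinations

# Assumption: canonical pairs plus G-U wobble pairs count as pairable.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    return (x, y) in PAIRS


def hamming(x, y):
    # Assumption: d is the Hamming distance (0 if equal, 1 otherwise).
    return 0 if x == y else 1


def gamma(col_i, col_j, phi1=1.0):
    """Score two alignment columns, each given as a string with one character per sequence."""
    n = len(col_i)
    # Equation (10): bonus for consistent and compensatory mutations.
    c = 0.0
    for k, l in combinations(range(n), 2):
        if can_pair(col_i[k], col_j[k]) and can_pair(col_i[l], col_j[l]):
            c += hamming(col_i[k], col_i[l]) + hamming(col_j[k], col_j[l])
    c *= 2.0 / (n - 1)
    # Equation (11): penalty for sequences that cannot form the pair.
    q = 0.0
    for k in range(n):
        if can_pair(col_i[k], col_j[k]):
            continue
        q += 0.25 if col_i[k] == "-" and col_j[k] == "-" else 1.0
    # Equation (9).
    return (c - phi1 * q) / n
```

Under these assumptions, two columns that pair in every sequence but never vary score 0, columns with compensatory mutations score above 0, and columns that cannot pair at all score below 0.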
In a sense, MUSHI represents the converse of RNAConSLOpt. While RNAConSLOpt
takes an alignment first and computes optimal configurations of stacks (the consensus
landscape) second, MUSHI takes input from RNASLOpt, which computes the landscapes
first, and performs structural alignment second. In section 4.4, we compare the results of
RNASLOpt + MUSHI to RNAConSLOpt, explaining the advantages and disadvantages of
each.
2.3 GraphClust
In 2012, Heyne et al. [22] published a paper describing GraphClust, a tool for ncRNA
annotation. As discussed previously, recent studies have begun to indicate that ncRNA
plays a crucial role in gene expression [23], and yet we know relatively little about how
such molecules function. One reason for this is that the classes of ncRNA are quite diverse,
and ncRNAs sharing a common structure do not necessarily have high sequence identity,
so sequencing information alone is not as useful in ncRNA clustering as it is in mRNAs.
Thus, ncRNA annotation requires clustering using sequence-structure information, which
is usually computationally expensive. GraphClust solves this problem by decomposing the
landscape and using fast heuristics, many of which operate in constant time. Because of
this speedup, GraphClust can scale to hundreds of thousands of sequences.
GraphClust’s relevance to this thesis is due to its methodology, which is similar but not
equivalent to finding the consensus landscape between two sequences. First, the sequences
are analyzed and many suboptimal structures are enumerated. The structures are then
represented by a graph, where the vertices are nucleotides, and an edge exists between
vertices if and only if the corresponding nucleotides are adjacent in the sequence, or they
form a base pair. Furthermore, an additional node is added in the middle of each pair of
stacking base pairs, which induces important features into the graph. Because in practice
one often only has a partial transcript of the RNA sequence, the authors further consider
subsequences of the original sequence. In this way, each sequence is represented by a set
of disconnected graphs. The authors can then compare these representative graphs using a
graph kernel, which is a simple way of computing the similarity between two graphs.
Specifically, they use the neighborhood subgraph pairwise distance kernel, which is a
decomposition kernel introduced by Costa and Grave in 2010 [24]. Furthermore, Heyne et al.
propose a fast method for testing for graph isomorphism, which reduces two isomorphic
graphs to an identical string, which can then be mapped to an integer using an iterative
hashing procedure. At this point, determining isomorphism between two graphs simply
amounts to checking whether or not these integers are equal.
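As a rough illustration of the graph representation described above, the following sketch builds the edge set for a single structure in dot-bracket notation. It is not GraphClust code: the function name is hypothetical, and the additional stacking nodes and the subsequence windows are deliberately left out.

```python
def structure_graph(structure):
    """Return the undirected edges (i, j), with i < j, over nucleotide indices."""
    edges = set()
    stack = []
    for i, ch in enumerate(structure):
        if i > 0:
            edges.add((i - 1, i))        # adjacent in the sequence
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            edges.add((stack.pop(), i))  # base pair
    return edges
```

For a pseudoknot-free structure of length n with p base pairs, this yields at most (n - 1) + p edges, since a pair between neighboring positions would coincide with a backbone edge.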
Using these special distance measures with the graph kernel, the authors are able to cluster
ncRNAs based on the similarity of their representative structures, and thereby detect new
classes of ncRNAs. This methodology is important to us because it shows us one method
of comparing elements of a sequence’s landscape to find some sort of consensus which can
be used to cluster the sequences. However, our desired outcomes are different. In our case,
we already start with a group of ncRNAs that we know are part of the same class, and we
want to analyze their landscapes in order to find common structures. GraphClust, by
contrast, only seeks to cluster the ncRNAs, showing that they are part of the same class,
but not necessarily making any definitive statements on their secondary structure.
These three papers will prove to be the most relevant to our discussion of consensus
landscapes. In order to analyze various riboswitches and the folding landscapes, we have
written a custom piece of software in Java capable of performing the tasks explained in
detail in the next two chapters. Originally, we envisioned a structural alphabet similar to
the BEAR alphabet before actually learning of BEAR’s existence. The original structural
alphabet was rudimentary, dividing each SSE into only five categories: multiloop,
unpaired, stack, hairpin, internal loop/bulge. Thus, MUSHI was chosen as an acronym.
CHAPTER THREE: METHODOLOGY
Clearly, the problem of global network alignment is exceedingly difficult, especially for
energy folding networks, which are enormous in size. Current methods cannot solve the
problem directly. Therefore, it is necessary to find the best approximate solution by
working with an abridged data set. One observation of note is that the folding landscape
includes all possible structures on any particular sequence, and necessarily includes
structures that are clearly not of any biological significance. If we can devise some method
of filtering out these uninteresting data points before any sort of alignment is attempted, it
may be possible to find a solution. The primary method of investigation in this thesis
combines the approaches of RNASLOpt, a structural alphabet called BEAR, and a
structural substitution matrix called MBR. In the following chapter, we explain how these
approaches are combined to provide insight on our new problem, and in Chapter 4 present
the results of applying them to predict the native structures of riboswitches.
3.1 RNASLOpt
As discussed in the literature review, RNASLOpt [7] takes an RNA sequence, finds the
putative stacks predicted using the methods of Bafna et al. [16], and enumerates all possible
locally optimal structures that can be constructed using the stacks. These structures are then
filtered by stability, so the output of the program is a list of stable local optimal structures.
Because these are the points in the landscape that are most likely to be the native structure,
the input to MUSHI will be the output of RNASLOpt. The goal of this methodology is to
improve native structure prediction by both reducing the size of the list of candidates output
by RNASLOpt and reordering them such that the structures most closely resembling the
native structures move toward the top.
Our primary thesis is that this can be done by solving, exactly or approximately, the
consensus folding landscape problem. Therefore, given two lists of SLOpt structures for
two RNA sequences, we should attempt to find an approximate consensus between them.
The most thorough way to achieve this goal is to perform a pairwise comparison of every
possible pair consisting of one structure from each landscape. Much thought was given to
the problem of how to compare two structures. The most naïve method would be to perform
an alignment using the Needleman-Wunsch algorithm. However, this method has many
limitations. Consider the following two structures in dot-bracket notation:
...((((.....))))...(((....(((...))).....))).....
...((((.....(((....(((...)))........))).....))))
Particularly, if we examine the first 12 characters in each structure, we can see that an
alignment based on the edit distance of these two sequences would treat both sets of
characters as a perfect alignment. However, in the first structure, characters 8 through 12
are part of a hairpin loop, whereas the same characters in the second structure are actually
part of an internal loop. While, in DNA sequence alignment, each nucleotide is an atomic
unit of the sequence whose identity does not depend on the surrounding nucleotides, each
character in a dot-bracket string represents something different depending not only on the
identity of the character, but the context of the surrounding characters. Thus, performing
structural alignment based solely on the edit distance of these structural representations
eliminates contextual information that is critical to understanding the information the
sequence is meant to convey. Clearly, we must devise some way to overcome the
limitations of such a primitive alphabet.
In order to best understand the current methodology, it may help to explain some of the
ideas that were discarded along the way, and why they were abandoned. When discussing
the best way to compare two sets of SLOpt structures, many options were debated. One
option was to compute the pairing probability matrix for each landscape. For a sequence
of length N, the pairing probability matrix is a matrix of length and width N. Each cell (i,j)
stores the probability that nucleotides i and j form a base pair. This probability is calculated
from the frequency of such occurrences in the actual landscape. We calculated pairing
probability matrices for two landscapes, scaled them so that the values in cells ranged from
0 to 255, and used MATLAB to create a greyscale visualization of each matrix, where each
cell was assigned a color in the spectrum from black to white, depending on its numerical
value. While this method provided interesting insight into the most common stack structures
shared between both landscapes, aligning these matrices in a way that yielded valuable
information proved to be difficult. Additionally, because this methodology was a direct
translation of dot-bracket notation, it suffered from the same drawbacks described earlier.
Particularly, critical information about the non-stack structures such as internal loops is not
conveyed in a pairing probability matrix, which makes it a poor method of comparing
landscapes.
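A minimal sketch of the pairing probability matrix described above is given below, in Python rather than MATLAB. The function names are hypothetical, and the sketch assumes that every structure in the landscape is a pseudoknot-free dot-bracket string over the same sequence.

```python
def pairing_frequency(structures):
    """Return an N x N matrix whose cell (i, j) is the fraction of structures pairing i with j."""
    n = len(structures[0])
    counts = [[0] * n for _ in range(n)]
    for s in structures:
        stack = []
        for j, ch in enumerate(s):
            if ch == "(":
                stack.append(j)
            elif ch == ")":
                i = stack.pop()
                counts[i][j] += 1
                counts[j][i] += 1
    total = len(structures)
    return [[c / total for c in row] for row in counts]


def to_greyscale(matrix):
    """Scale frequencies in [0, 1] to integer grey levels in [0, 255]."""
    return [[round(255 * p) for p in row] for row in matrix]
```

Note that the matrix records only which positions pair; unpaired positions in a hairpin and in an internal loop both contribute nothing, which is the drawback discussed above.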
Because the shortcomings of dot-bracket notation were now apparent, we next attempted
to create a structural encoding for each structure that would convey a greater amount of
information. Initially, we decided to use five possible characters: m (multiloop), u
(unpaired), s (stack), h (hairpin), i (internal loop), which inspired the acronym MUSHI.
The example below shows how a structure would be encoded using this scheme:
...(((...(((...(((...)))...)))...(((...)))...)))...
uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu
Using this encoding, we then attempted to extract additional information from the
landscape. We defined a condensed structural encoding as a shortened version of a
structural encoding that retained transitions between different characters, but removed
repeat characters. The example below shows the transformation from a structural encoding
to a condensed structural encoding:
uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu
umusishsisushsumu
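The condensing step amounts to collapsing each run of repeated characters into a single character. A short sketch (the function name is hypothetical):

```python
from itertools import groupby


def condense(encoding):
    """Collapse each run of identical characters into one character."""
    return "".join(symbol for symbol, _ in groupby(encoding))
```

Applied to the structural encoding above, this reproduces the condensed string shown in the example.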
The rationale behind using the condensed representation was to preclude length from being
a factor in the alignments. It is possible for two structures to strongly resemble each other
even though they may differ in length. Additionally, we attempted to gather data on
frequency at which each nucleotide in the sequence was involved in a particular SSE. For
each sequence of length N, this required creating a 5xN matrix, where each of the five rows
recorded the frequency that its associated nucleotide was involved in that SSE. This process
is similar to assigning a color to each nucleotide in a consensus structure, such as that
computed by LocARNA, where the color denotes the sequence conservation of the
nucleotide. We then attempted to visualize these in MATLAB by mapping three of the five
rows to one of the three standard colors in the RGB additive color model, again scaling the
range [0,1] to [0, 255]. While this resulted in some interesting patterns and identified a few
important analogous nucleotides from each landscape, no meaningful consensus could be
extracted from this information, so we again abandoned it.
3.2 Brand nEw Alphabet for RNA (BEAR)
After some research, we came across an article in Nucleic Acids Research published in
March 2014 by a group of researchers from the University of Rome [25]. In this article,
Mattei et al. describe a structural alphabet similar to our earlier version, but far more
sophisticated. First, they divide the class of possible structures into four groups: loops
(L), internal loops (I), stems/stacks (S), and bulges (B). Each group contains a series of
characters denoting an SSE of different length. For example, S = {S1, S2, S3, …}, where
S1 denotes that the nucleotide is a member of a stem of length 1, and so on. Additionally,
the alphabets S, I, and B are divided into characters denoting branching and non-branching structures (S = Sn ∪ Sb). A non-branching structure is defined by the maximal
boundary [i, j] such that if a hairpin loop exists within the boundary, it is the only such
loop. In other words, there does not exist a multiloop structure in [i, j]. This distinction
was necessary because they observed different transition rates between SSEs when
building the substitution matrix discussed in the next section. Furthermore, the alphabets
I and B are divided again into left and right internal loops and bulges. Thus, I = ILn ∪ ILb
∪ IRn ∪ IRb. In order to determine the upper limit on the size of each alphabet, Mattei et
al. use a set of carefully selected RNA structures and use the 95th percentile of the length
distribution as the limit for each SSE. Then, the final BEAR alphabet β = L ∪ I ∪ S ∪ B.
One additional character, ‘:’, is added to the alphabet to denote unpaired nucleotides not
belonging to any other SSE. The full alphabet used in this methodology is detailed in the
following breakdown, where the first character in each set represents an SSE of length 1,
the second represents an SSE of length 2, and so on:
Hairpin Loop                    = {j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,^}
Stack                           = {a,b,c,d,e,f,g,h,i,=}
Stack (Branching)               = {A,B,C,D,E,F,G,H,I,J}
Left Internal Loop              = {?,!,”,#,$,%,&,’,(,),+}
Left Internal Loop (Branching)  = {?,K,L,M,N,O,P,Q,R,S,T,U,V,W}
Right Internal Loop             = {?,2,3,4,5,6,7,8,9,0,>}
Right Internal Loop (Branching) = {?,Y,Z,~,?,_,|,/,\,@}
Left Bulge                      = {[}
Left Bulge (Branching)          = {{}
Right Bulge                     = {]}
Right Bulge (Branching)         = {}}
Unpaired                        = {:}
As an example of this new encoding, consider the following sequence, followed by the
traditional dot-bracket representation of its structure, followed by the BEAR
representation:
AAAGCGCAAAGCGCAAACGCGAAAGCGCAAACGCGAAACGCGAAA
...((((...((((...))))...((((...))))...))))...
:::dddd:::ddddllldddd:::ddddllldddd:::dddd:::
3.3 Substitution Matrices and MBR
In addition to creating a new expressive alphabet, Mattei et al. also construct a structural
substitution matrix which can be used for a variety of purposes, such as structural
alignment. Substitution matrices were originally introduced by Margaret Dayhoff in 1978
[26]. Their purpose is to measure rates of evolutionary mutations between two polypeptides
by expressing the relative probability that one amino acid in a multiple sequence alignment
will mutate into another. Such matrices have dimensions 20x20, and the value S(i, j)
corresponds to the likelihood that, over a certain period of time, residue i in the sequence
will mutate into residue j. The two most common substitution matrices in use are PAM and
BLOSUM. PAM (Point Accepted Mutations or Percent Accepted Mutations, depending on
the literature) is the matrix introduced by Dayhoff, and is based on global alignments of
very closely related amino acid sequences (i.e. <1% divergence). The matrix PAM1 derives
the effect on the sequence after 1% of the amino acids have changed. The commonly used
PAM250 is equivalent to PAM1 raised to the 250th power (PAM1^250). BLOSUM (Blocks Substitution Matrix) was
introduced by Henikoff in 1992 [27]. This matrix is based on local alignments that are more
distantly related.
A substitution matrix is calculated from a multiple sequence alignment by counting the
number of occurrences of each residue, as well as the number of times a specific residue is
aligned with another specific residue, including an identical one. The result is then
expressed as what is known as a log-odds score:
\[ S(i, j) = \log\left(\frac{\text{observed frequency}}{\text{expected frequency}}\right) \tag{12} \]
The choice of base for this logarithm is unimportant, since changing it only rescales the scores. A score less than zero indicates that residues i and
j were aligned less often than would be expected simply by chance, while a score greater
than zero indicates the converse. In this way, we can use substitution matrices to predict
how an ancestral sequence might mutate over millions of years, which can help us establish
genetic phylogenies.
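The log-odds calculation in equation (12) can be sketched as follows. This is a simplified illustration and not Dayhoff's full procedure: the function name is hypothetical, aligned pairs are treated as unordered, and the expected frequency is assumed to come from the background residue frequencies alone.

```python
import math
from collections import Counter


def log_odds(aligned_pairs, base=2.0):
    """Return {(a, b): score} for each unordered residue pair seen in the alignment."""
    pair_counts = Counter(tuple(sorted(p)) for p in aligned_pairs)
    residue_counts = Counter(r for p in aligned_pairs for r in p)
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        pa = residue_counts[a] / total_residues
        pb = residue_counts[b] / total_residues
        # An unordered mismatch (a, b) can arise in two orders.
        expected = pa * pa if a == b else 2 * pa * pb
        scores[(a, b)] = math.log(observed / expected, base)
    return scores
```

With an alignment in which identical residues are paired more often than chance would predict, the identity scores come out positive and the mismatch scores negative, as described above.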
Mattei et al. use the concept of a substitution matrix in a completely new context.
Beginning with a set of highly structured RNA families in the work by Meyer et al. [28],
they searched through the database Rfam to find additional highly structured RNA families.
Each structure in the RNA families was folded, its sequence was converted to the BEAR
alphabet, and each character in its BEAR representation was mapped to its corresponding
nucleotide in the multiple sequence alignment. Then, using the method created by Dayhoff
described above, they derived an 83x83 matrix where each value (i, j) denotes the relative
probability that a nucleotide involved in a substructure i may, in another structure, be
involved in a substructure j. In essence, this creates a completely new kind of substitution
matrix that can be used to analyze different structures for many purposes. Mattei et al.
explain such uses, as well as validating this matrix, in their paper [25]. This matrix is called
the Matrix of BEAR-encoded RNA secondary structures, or MBR.
3.4 Datasets
Because we are interested in comparing this methodology to that of RNAConSLOpt, we
selected the same four riboswitches used in the benchmarking process in Li’s paper [10].
Specifically, we examined the following riboswitches: (1) the adenine riboswitch from the
ydhL gene of Bacillus subtilis, (2) the lysine riboswitch from the lysC gene of Bacillus subtilis,
(3) the thiamine pyrophosphate (TPP) riboswitch from the thiamin gene of Bacillus subtilis,
and (4) the flavin mononucleotide (FMN) riboswitch from the ribD gene of Bacillus subtilis.
For each riboswitch, we obtained five RNA sequences in the family, as well as the
canonical ‘on’ and ‘off’ structures and sequences. These were used as input to the software
pipeline described in the next section.
3.5 Software Pipeline
We began by creating four files – one for each of the riboswitches. Each file contained five
RNA sequences belonging to that family. Each file was given to RNASLOpt as input to
find the stable local optimal structures of each sequence. RNASLOpt also requires
parameters specifying the acceptable range of structures it should return. Specifically, the
user should select a Δp, which specifies the boundary for which no structure having a free
energy value greater than p percent away from the MFE structure will be returned, and a
ΔB, which specifies the boundary for which no structure having stability less than B
kcal/mol will be returned. The full software pipeline is detailed in Figure 3. The parameters
used for each sequence, as well as the number of structures generated by RNASLOpt, are
detailed in Table 1.
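The effect of the two thresholds can be sketched as a simple filter. This is not RNASLOpt code: the function name and the tuple layout are hypothetical, and the sketch assumes that "p percent away from the MFE" is measured relative to the magnitude of the MFE energy.

```python
def filter_landscape(structures, delta_p, delta_b):
    """Keep structures within delta_p percent of the MFE and with barrier >= delta_b.

    structures: list of (name, free_energy, barrier) tuples, energies in kcal/mol.
    Returns the surviving structures ranked by stability (highest barrier first).
    """
    mfe = min(energy for _, energy, _ in structures)
    cutoff = mfe + abs(mfe) * delta_p / 100.0
    kept = [s for s in structures if s[1] <= cutoff and s[2] >= delta_b]
    return sorted(kept, key=lambda s: s[2], reverse=True)
```

For example, with an MFE of -20 kcal/mol, Δp = 55 and ΔB = 12, a structure survives only if its free energy is at most -9 kcal/mol and its barrier is at least 12 kcal/mol.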
Figure 3 - MUSHI Pipeline
RNASLOpt can process multiple sequences at once, and will combine all the results into a
single file. Our output was one large file for each riboswitch, with each file containing the
local optimal structures for each sequence ranked by free energy, as well as the stable local
optimal structures (a subset of the LOpt structures) ranked by stability. At the top of each
of these files, we prepended the canonical sequence for each riboswitch, as well as the
native on and off structures.
Table 1 - RNASLOpt Parameters and Landscape Sizes
Landscape    Δp (%)    ΔB (kcal/mol)    Number of SLOpt structures
Adenine 0    55        12               4
Adenine 1    55        12               3
Adenine 2    55        12               6
Adenine 3    55        12               3
Adenine 4    55        12               5
Lysine 0     20        12               346
Lysine 1     20        12               728
Lysine 2     20        12               222
Lysine 3     20        12               523
Lysine 4     20        12               96
TPP 0        20        12               91
TPP 1        20        12               135
TPP 2        20        12               18
TPP 3        20        12               14
TPP 4        20        12               36
FMN 0        20        12               316
FMN 1        20        12               523
FMN 2        20        12               666
FMN 3        20        12               216
FMN 4        20        12               718
MUSHI is designed to accept a series of landscapes from within a single riboswitch family.
It parses the output of a single run of RNASLOpt on a set of sequences, and stores the
native sequence and the on and off structures of the riboswitch, each sequence analyzed by
RNASLOpt, and every SLOpt structure as well as their energy and stability rankings. After
loading all of these landscapes, MUSHI then translates the native structures into BEAR,
and performs the structural alignment algorithm described below on each possible pair
(a,b), where a is either the on or off native structure, and b is a structure in one of the
landscapes given in the file. For each landscape, MUSHI returns the two (possibly
identical) structures most closely resembling the on and off native structures, as well as
both alignments and alignment scores. The two structures returned for each landscape are
referred to as the target structures.
Next, the user can specify the two landscapes on which to perform structural alignment.
Once this information has been specified, MUSHI writes each structure in the landscape
into a file in FASTA format. It then sends an instruction to the command prompt to run a
program called the BEAR encoder, created by Mattei et al., to translate each structure into
the BEAR alphabet. The BEAR encoder performs this translation, and outputs the result
into another file in FASTA format. MUSHI sleeps for a sufficient amount of time to allow
the BEAR encoder to finish the process, and then accesses the new file containing the
BEAR sequences. Then, for each possible pair (a, b) where a ∈ Landscape 1 and b ∈
Landscape 2, it runs the structural alignment algorithm defined by the following recursive
function:
\[ S(i, j) = \max \begin{cases} S(i-1, j-1) + MBR(i, j) + BONUS(N_i, N_j) \\ S(i-1, j) + gap \\ S(i, j-1) + gap \end{cases} \qquad BONUS(N_i, N_j) = \begin{cases} bonus & \text{if } N_i = N_j \\ 0 & \text{otherwise} \end{cases} \tag{13} \]
This very simple algorithm is just a slight modification of Needleman-Wunsch to include
the MBR as a method for scoring matches and mismatches. Additionally, it incorporates
information about the sequences themselves by adding a bonus score when the nucleotide
associated with BEAR character i is identical to the nucleotide associated with BEAR
character j. Mattei et al. use the exact same algorithm in their paper, but for the purpose of
creating multiple sequence alignments. Conversely, we are not necessarily interested in the
structural alignment itself, but rather the score of the optimal alignment as an estimation
for structural distance.
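The recursion in equation (13) can be sketched as a standard dynamic program that returns only the optimal score. This is not the MUSHI source code: the function names are hypothetical, the substitution function is passed in as a stand-in for the MBR lookup, and the boundary rows are assumed to accumulate a linear gap penalty, which the recursion itself does not specify. The default gap and bonus values are those used in Chapter 4.

```python
def align_score(a, b, seq_a, seq_b, sub, gap=-0.75, bonus=0.2):
    """Score of the best global alignment of structural strings a and b.

    seq_a and seq_b are the nucleotide sequences underlying a and b;
    sub(x, y) returns the substitution score for structural characters x and y.
    """
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + gap  # assumed boundary condition
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + gap  # assumed boundary condition
    for i in range(1, rows):
        for j in range(1, cols):
            extra = bonus if seq_a[i - 1] == seq_b[j - 1] else 0.0
            S[i][j] = max(
                S[i - 1][j - 1] + sub(a[i - 1], b[j - 1]) + extra,
                S[i - 1][j] + gap,
                S[i][j - 1] + gap,
            )
    return S[-1][-1]


def toy_sub(x, y):
    """Placeholder scoring function standing in for the MBR."""
    return 1.0 if x == y else -1.0
```

Because only the final score is needed as a distance estimate, the sketch omits the traceback that would recover the alignment itself.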
Because this O(n^2) algorithm is run NxM times, the running time of the overall procedure
of comparing two landscapes is O(NMn^2), which is O(n^4) when N and M are each on the order of n. It may be possible to optimize this in the future, but
for now we are interested in using the most thorough possible method to prove that our
concept is valid. After all the structural alignments have been performed, the scores and
alignments themselves are saved in memory. Then, using the stability ranking, energy
ranking, and alignment ranking, every possible alignment is ranked according to the
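following formula:
\[ M(i, j) = \alpha(\text{stability rank}) + \beta(\text{energy rank}) + \gamma(\text{alignment score}) \tag{14} \]
where
\[ \text{stability rank}(i, j) = \frac{size(L_1) - stability(i) + size(L_2) - stability(j)}{size(L_1) + size(L_2)} \tag{15} \]
\[ \text{energy rank}(i, j) = \frac{size(L_1) - energy(i) + size(L_2) - energy(j)}{size(L_1) + size(L_2)} \tag{16} \]
and
\[ \text{alignment score}(i, j) = \frac{\left( \sum_{k=0}^{L} \begin{cases} gap & \text{if } i(k) = \text{'-'} \text{ or } j(k) = \text{'-'} \\ MBR(i(k), j(k)) & \text{otherwise} \end{cases} \right) - \text{min score}}{\text{max score} - \text{min score}} \tag{17} \]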
where size(L1) is the number of SLOpt structures in landscape 1, stability(i) is the stability
ranking of structure i in landscape 1, energy(i) is the energy ranking of structure i in
landscape 1, L is the length of the alignment between the two structures, i(k) is the k-th
character in the first level of the alignment, and α, β, and γ are parameters dictating the
contribution of each addend of the weighted sum. Each series of structural alignments
resulted in a list of NxM alignments, where N and M are the sizes of landscapes 1 and 2,
respectively. If the landscapes created by RNASLOpt are too large, the user can
specify the percentage of the structures that should be used when comparing two
landscapes. These alignments were then ranked by their match score, based on the formula
given above.
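The weighted sum in equations (14) through (17) can be sketched as follows. The function names are hypothetical, the raw alignment score and the minimum and maximum scores are assumed to have been computed beforehand, and the default weights are the values used in Chapter 4.

```python
def rank_term(rank_i, rank_j, size1, size2):
    """Equations (15) and (16): combined rank term for structures i and j."""
    return (size1 - rank_i + size2 - rank_j) / (size1 + size2)


def normalize(raw, min_score, max_score):
    """Equation (17): rescale a raw alignment score to the range [0, 1]."""
    return (raw - min_score) / (max_score - min_score)


def match_score(stab_i, stab_j, en_i, en_j, raw, size1, size2,
                min_score, max_score, alpha=1.0, beta=1.0, gamma=2.0):
    """Equation (14): weighted sum of stability, energy, and alignment terms."""
    return (alpha * rank_term(stab_i, stab_j, size1, size2)
            + beta * rank_term(en_i, en_j, size1, size2)
            + gamma * normalize(raw, min_score, max_score))
```

For instance, with landscapes of sizes 4 and 3, stability ranks 1 and 1, energy ranks 2 and 1, and a raw score halfway between the minimum and maximum, the match score is 5/7 + 4/7 + 2(0.5).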
Next, MUSHI locates and reports the rankings of the match between the two target on
structures and the match between the two target off structures across both landscapes,
called the target matches. It also returns the top k alignments, where k is a number specified by the user.
Finally, MUSHI returns the consensus landscape, defined as shortened versions of the two
original landscapes such that every structure in the shortened list of the first landscape has
a strong match in the shortened list of the second landscape. Here, ‘strong’ is a relative
term which can be defined by the user in multiple ways: the unique structures involved in
the top x alignments, the unique structures involved in the top x% of the alignments, or the
unique structures involved in the alignments within x% of the maximum possible match
score. In this thesis, we use the second method.
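The second definition of ‘strong’ can be sketched as follows. The function name and the tuple layout are hypothetical, and rounding the cutoff up so that at least one alignment is always kept is an assumption of the sketch.

```python
import math


def consensus_landscape(matches, percent):
    """Return the unique structures involved in the top percent of alignments.

    matches: list of (match_score, index_in_landscape_1, index_in_landscape_2).
    Returns two sorted lists of structure indices, one per landscape.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100.0))
    top = ranked[:keep]
    first = sorted({i for _, i, _ in top})
    second = sorted({j for _, _, j in top})
    return first, second
```

By construction, every structure kept from the first landscape appears in at least one retained alignment with a structure kept from the second landscape.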
CHAPTER FOUR: RESULTS AND DISCUSSION
4.1 Ranks of Target Matches in Complete Pairwise Alignment
As explained in Chapter 3, one of the first steps in the methodology is to establish which
structures from both landscapes most closely resemble the canonical on and off riboswitch
structures by aligning their BEAR representations with both landscapes. The results of
these alignments are detailed in Table 2. For example, in the landscape Adenine 0,
RNASLOpt returned 4 structures. The structure most closely resembling the on structure
had stability rank 1 and energy rank 2 in its landscape (denoted (1, 2)), and the raw score
of its best alignment with the on structure, excluding the sequence bonus, was 62.39.
Table 2 - Results of aligning each landscape with native ‘On’ and ‘Off’ structures
Landscape    # of SLOpt    Rank of Best ‘On’    Raw Alignment Score    Rank of Best ‘Off’    Raw Alignment Score
             Structures    Structure (s, e)     (w/ Native ‘On’)       Structure (s, e)      (w/ Native ‘Off’)
Adenine 0    4             (1, 2)               62.39                  (0, 0)                65.30
Adenine 1    3             (1, 1)               38.74                  (0, 0)                51.15
Adenine 2    6             (4, 5)               38.04                  (0, 0)                38.01
Adenine 3    3             (1, 2)               13.91                  (0, 0)                37.32
Adenine 4    5             (0, 0)               45.85                  (2, 1)                9.5
Lysine 0     346           (21, 30)             148.84                 (149, 29)             160.35
Lysine 1     728           (96, 71)             99.46                  (0, 0)                115.22
Lysine 2     222           (159, 14)            69.05                  (158, 47)             67.07
Lysine 3     523           (96, 205)            75.22                  (31, 33)              101.59
Lysine 4     96            (89, 36)             70.63                  (24, 20)              77.61
TPP 0        91            (2, 4)               82.92                  (0, 0)                101.38
TPP 1        135           (99, 14)             48.08                  (0, 0)                86.32
TPP 2        18            (15, 5)              15.54                  (3, 1)                69.31
TPP 3        14            (5, 5)               53.13                  (0, 0)                66.70
TPP 4        36            (12, 30)             42.87                  (8, 1)                22.12
FMN 0        316           (3, 294)             43.95                  (29, 138)             86.49
FMN 1        523           (485, 516)           42.45                  (333, 108)            72.09
FMN 2        666           (285, 241)           28.10                  (386, 233)            19.09
FMN 3        216           (72, 118)            27.97                  (185, 37)             33.12
FMN 4        718           (208, 121)           17.52                  (362, 397)            15.89
The purpose of comparing each structure in the first landscape to each structure in the
second is to establish which pairs both share a strong structural resemblance and are also
highly stable and energetically favorable. This procedure results in NxM comparisons,
where N and M are the sizes of the first and second landscape respectively. Here, we make
three key assumptions: (1) RNASLOpt has correctly predicted the on and off structures
and they appear in both landscapes, (2) the stability of a molecule and the probability that
it is the correct native structure share a positive correlation, and (3) the energy level of a
molecule and the probability that it is the correct native structure share a negative
correlation. These are the assumptions underpinning the use of the weighted sum, in which
two structures will have a greater score if they strongly resemble each other, and their
stability and energy ranks are both high. Ideally, if we then let the target on structures in
landscape 1 and landscape 2 be labeled n and m respectively, we should expect that the
match score between n and m will appear near the top of the rankings, indicating that these
two structures are good candidates for the native structure. The following tables show the
ranks of the target matches for each riboswitch with α = 1, β = 1, and γ = 2 in the weighted
sum, gap = -0.75, and bonus = 0.2. The method for determining the optimal values of these
parameters is explained at the end of section 4.1.
Table 3 – Complete pairwise structural alignment for adenine ‘On’
Comparison    Number of     Rank of Target    Percentile    Alignment
(L1, L2)      Alignments    ‘On’ Match                      Score
(0, 1)        12            3                 0.7500        42.89
(0, 2)        24            7                 0.7083        55.78
(0, 3)        12            4                 0.6667        44.30
(0, 4)        20            1                 0.9500        54.85
(1, 2)        18            10                0.4444        45.59
(1, 3)        9             5                 0.4444        33.80
(1, 4)        15            1                 0.9333        48.45
(2, 3)        18            8                 0.5556        50.27
(2, 4)        30            3                 0.9000        47.59
(3, 4)        15            1                 0.9333        55.46
Table 4 – Complete pairwise structural alignment for adenine ‘Off’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘Off’ Match  Percentile  Alignment Score
(0, 1)               12                    1                           0.9167      52.06
(0, 2)               24                    1                           0.9583      51.56
(0, 3)               12                    1                           0.9167      83.96
(0, 4)               20                    3                           0.8500      23.68
(1, 2)               18                    1                           0.9444      77.00
(1, 3)               9                     1                           0.8889      60.57
(1, 4)               15                    5                           0.6667      17.25
(2, 3)               18                    1                           0.9444      74.84
(2, 4)               30                    18                          0.4000      3.48
(3, 4)               15                    3                           0.8000      15.59
In the Adenine riboswitch (Tables 3 and 4), most of the matches involving both targets
appear in the top 5 possible alignments. This is not surprising, especially since the off
targets already appeared at the top of the landscapes given by RNASLOpt alone. However,
the usefulness of this riboswitch lies in its relatively short sequence length, which allows
computations to be carried out thoroughly and rapidly. As shown in Table 2, the much
shorter sequences resulted in landscapes of dramatically reduced size
compared to the other riboswitches used. In order to evaluate the methodology, it was
necessary to also use riboswitches with larger sample sizes.
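The Percentile column in Tables 3 and 4 is consistent with computing 1 − rank / (number of alignments), where rank 1 is the best match. This relationship is inferred from the tabulated values rather than stated in the text, so the short Python sketch below should be read as an assumption that happens to reproduce the adenine ‘On’ figures; the function name is hypothetical.

```python
def percentile(rank, n_alignments):
    """Percentile of a 1-based rank among n_alignments ranked pairs.

    Assumed formula: 1 - rank / n_alignments (inferred from Tables 3 and 4).
    """
    if not 1 <= rank <= n_alignments:
        raise ValueError("rank must lie in 1..n_alignments")
    return 1.0 - rank / n_alignments


if __name__ == "__main__":
    # (comparison, number of alignments, rank of target 'On' match) from Table 3
    rows = [((0, 1), 12, 3), ((0, 4), 20, 1), ((2, 4), 30, 3)]
    for pair, n, rank in rows:
        print(pair, f"{percentile(rank, n):.4f}")  # 0.7500, 0.9500, 0.9000
```

With only 9 to 30 alignments per adenine comparison, moving a target match by a single rank changes its percentile by roughly 3 to 11 points, which is why the larger landscapes are more informative.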
Table 5 – Complete pairwise structural alignment for lysine ‘On’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘On’ Match  Percentile  Alignment Score
(0, 1)               251888                4                          0.99998     153.57
(0, 2)               76812                 77                         0.99900     105.12
(0, 3)               180958                415                        0.99771     98.71
(0, 4)               33216                 4                          0.99988     103.12
(1, 2)               161616                24                         0.99985     107.28
(1, 3)               380744                116                        0.99970     121.46
(1, 4)               69888                 706                        0.98990     79.81
(2, 3)               116106                3290                       0.97166     99.20
(2, 4)               21312                 665                        0.96880     80.04
(3, 4)               50208                 32                         0.99936     132.10
Table 6 – Complete pairwise structural alignment for lysine ‘Off’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘Off’ Match  Percentile  Alignment Score
(0, 1)               251888                12                          0.99995     142.63
(0, 2)               76812                 1244                        0.98380     96.81
(0, 3)               180958                3                           0.99998     146.43
(0, 4)               33216                 5                           0.99985     108.31
(1, 2)               161616                23                          0.99986     93.08
(1, 3)               380744                1                           0.99999     153.60
(1, 4)               69888                 9                           0.99987     97.41
(2, 3)               116106                767                         0.99339     95.00
(2, 4)               21312                 151                         0.99291     95.02
(3, 4)               50208                 1                           0.99998     127.37
The landscapes comprising the data set for the Lysine riboswitch (Tables 5 and 6) were
much larger, yielding more significant results. Most striking is the fact that, in the majority
of comparisons (13 of 20), the target matches were ranked at or above the 99.9th percentile, even when the
original target structures did not appear at the top of the RNASLOpt rankings.
Table 7 – Complete pairwise structural alignment for TPP ‘On’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘On’ Match  Percentile  Alignment Score
(0, 1)               12285                 7                          0.9994      132.95
(0, 2)               1638                  264                        0.8388      18.05
(0, 3)               1274                  9                          0.9929      70.67
(0, 4)               3276                  12                         0.9963      83.18
(1, 2)               2430                  1450                       0.4033      20.14
(1, 3)               1890                  153                        0.9190      74.04
(1, 4)               4860                  295                        0.9393      61.06
(2, 3)               252                   94                         0.6270      44.22
(2, 4)               648                   432                        0.3333      14.94
(3, 4)               504                   20                         0.9603      82.41
Table 8 – Complete pairwise structural alignment for TPP ‘Off’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘Off’ Match  Percentile  Alignment Score
(0, 1)               12285                 4                           0.9997      94.67
(0, 2)               1638                  3                           0.9982      75.58
(0, 3)               1274                  4                           0.9969      79.51
(0, 4)               3276                  2                           0.9994      77.97
(1, 2)               2430                  1                           0.9996      106.18
(1, 3)               1890                  2                           0.9989      90.64
(1, 4)               4860                  8                           0.9984      50.87
(2, 3)               252                   1                           0.9960      82.56
(2, 4)               648                   23                          0.9645      24.83
(3, 4)               504                   2                           0.9960      64.48
The results for the TPP riboswitch were similar (Tables 7 and 8), with a majority of the
target matches appearing in the 99th percentile of the rankings. However, most of the target
matches that did not rank as well were associated with the on native structure. One
explanation for this may be that the alignment scores of the best on structures in each
landscape with the native on structure were significantly lower than those of their off
counterparts. This reflects a fundamental limitation on MUSHI’s performance: a close
match with the native structure must appear somewhere in the landscape for MUSHI to
identify it.
Table 9 – Complete pairwise structural alignment for FMN ‘On’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘On’ Match  Percentile  Alignment Score
(0, 1)               165268                162611                     0.0161      52.99
(0, 2)               210456                52695                      0.7496      34.34
(0, 3)               68256                 7288                       0.8932      80.84
(0, 4)               226888                39867                      0.8243      26.76
(1, 2)               348318                243343                     0.3014      22.93
(1, 3)               112968                62496                      0.4468      94.19
(1, 4)               375514                184017                     0.5100      33.54
(2, 3)               143856                21563                      0.8501      63.23
(2, 4)               478188                421                        0.9991      121.53
(3, 4)               155088                2834                       0.9817      76.37
Table 10 – Complete pairwise structural alignment for FMN ‘Off’
Comparison (L1, L2)  Number of Alignments  Rank of Target ‘Off’ Match  Percentile  Alignment Score
(0, 1)               165268                10653                       0.9355      137.43
(0, 2)               210456                31626                       0.8497      45.63
(0, 3)               68256                 2884                        0.9577      82.26
(0, 4)               226888                71106                       0.6866      38.83
(1, 2)               348318                30954                       0.9111      59.17
(1, 3)               112968                5615                        0.9503      96.63
(1, 4)               375514                22014                       0.9414      83.16
(2, 3)               143856                42382                       0.7054      54.81
(2, 4)               478188                105621                      0.7791      68.62
(3, 4)               155088                16247                       0.8952      90.93
To our dismay, the results obtained from the FMN riboswitch were significantly worse
(Tables 9 and 10). While 14 of the 20 target matches scored above the 74th percentile, this
is still nowhere near the level necessary to effectively identify significant structures
in either landscape. The best explanation for this is that the alignment scores of the
best on and off structures with the native structures are very poor in comparison to all other
landscapes. The FMN dataset contained the longest RNA sequences, and so if the
RNASLOpt results for FMN were on par with the rest of the data, we would expect the
alignment scores to be the highest in the set. Instead, the alignment scores from FMN are
similar to or worse than those from TPP, indicating that RNASLOpt did not perform as well on the final
riboswitch.
Tables 11 and 12 show the rankings of all target matches for varying values of α, β, and γ.
These values were chosen to determine the significance of introducing a distance measure
into the ranking formula. They use the following factors: stability alone; energy alone;
stability and energy; stability, energy and similarity; and stability, energy, and similarity
with doubled weight. The value 0.2 was chosen for the bonus parameter per the
recommendations of Mattei et al. [25]. The value -0.75 was chosen for gap by experiment,
to strike a balance between setting the value too low, so that the alignment
avoids all gaps, and setting it too high, so that the alignment overuses gaps.
Based on these tables, the optimal parameters appear to be α = 1, β = 1, γ = 2, gap = -0.75,
and bonus = 0.2. In most cases, this resulted in the highest rankings; improvements
over measures such as “stability only” grew significantly with the introduction of the
similarity measure, and moderately with double-weighting it. In most of the cases when the
optimal parameters were not 1,1,2,-0.75,0.2, the rank assigned by these parameters did not
differ significantly from the highest rank for that target match, specifically in the off
structures of TPP. This suggests that MUSHI is able to preserve good output from
RNASLOpt, while improving moderate output. MUSHI’s poor results on FMN suggest
that improvement on poor RNASLOpt results is negligible.
Table 11 – Complete pairwise alignment rankings of ‘On’ target matches for different values of α, β, γ, gap, and bonus respectively
Comparison (L1, L2)  Number of Alignments  1,0,0,-0.75,0.2  0,1,0,-0.75,0.2  1,1,0,-0.75,0.2  1,1,1,-0.75,0.2  1,1,2,-0.75,0.2
Adenine (0, 1)       12                    5                7                6                4                3
Adenine (0, 2)       24                    16               22               19               14               7
Adenine (0, 3)       12                    5                10               7                7                4
Adenine (0, 4)       20                    3                5                3                1                1
Adenine (1, 2)       18                    14               16               15               12               10
Adenine (1, 3)       9                     5                7                6                6                5
Adenine (1, 4)       15                    3                3                2                1                1
Adenine (2, 3)       18                    14               18               15               13               8
Adenine (2, 4)       30                    15               19               16               3                3
Adenine (3, 4)       15                    3                5                3                1                1
Lysine (0, 1)        251888                6925             5160             1445             15               4
Lysine (0, 2)        76812                 16312            999              2793             198              77
Lysine (0, 3)        180958                6925             27746            6975             937              415
Lysine (0, 4)        33216                 6007             2220             1991             20               4
Lysine (1, 2)        161616                32142            3678             6467             86               24
Lysine (1, 3)        380744                18625            38282            10620            327              116
Lysine (1, 4)        69888                 13207            5735             4563             1128             706
Lysine (2, 3)        116106                32239            24249            19570            6960             3290
Lysine (2, 4)        21312                 18904            1320             9572             2857             665
Lysine (3, 4)        50208                 13207            18593            11904            422              32
TPP (0, 1)           12285                 5099             174              1003             46               7
TPP (0, 2)           1638                  156              47               54               135              264
TPP (0, 3)           1274                  31               47               20               9                9
TPP (0, 4)           3276                  108              598              190              21               12
TPP (1, 2)           2430                  1902             205              949              1220             1450
TPP (1, 3)           1890                  1374             187              692              300              153
TPP (1, 4)           4860                  3390             981              2042             753              295
TPP (2, 3)           252                   195              65               126              97               94
TPP (2, 4)           648                   349              493              465              450              432
TPP (3, 4)           504                   153              405              286              61               20
FMN (0, 1)           165268                104442           164863           159471           162340           162611
FMN (0, 2)           210456                41620            119294           67399            58112            52695
FMN (0, 3)           68256                 2854             61117            27465            14048            7288
FMN (0, 4)           226888                22370            81374            31084            34528            39867
FMN (1, 2)           348318                261128           255619           283422           265604           243343
FMN (1, 3)           112968                96641            107605           110331           93551            62496
FMN (1, 4)           375514                226422           197134           218154           200339           184017
FMN (2, 3)           143856                54036            54407            43916            29960            21563
FMN (2, 4)           478188                122057           65872            57172            3620             421
FMN (3, 4)           155088                37333            28477            19621            6077             2834
Table 12 – Complete pairwise alignment rankings of ‘Off’ target matches for different values of α, β, γ, gap, and bonus respectively
Comparison (L1, L2)  Number of Alignments  1,0,0,-0.75,0.2  0,1,0,-0.75,0.2  1,1,0,-0.75,0.2  1,1,1,-0.75,0.2  1,1,2,-0.75,0.2
Adenine (0, 1)       12                    1                1                1                1                1
Adenine (0, 2)       24                    1                1                1                1                1
Adenine (0, 3)       12                    1                1                1                1                1
Adenine (0, 4)       20                    4                2                2                3                3
Adenine (1, 2)       18                    1                1                1                1                1
Adenine (1, 3)       9                     1                1                1                1                1
Adenine (1, 4)       15                    4                2                3                5                5
Adenine (2, 3)       18                    1                1                1                1                1
Adenine (2, 4)       30                    4                2                2                10               18
Adenine (3, 4)       15                    4                2                2                4                3
Lysine (0, 1)        251888                11325            458              834              17               12
Lysine (0, 2)        76812                 43687            2976             13960            3529             1244
Lysine (0, 3)        180958                16440            1994             2239             6                3
Lysine (0, 4)        33216                 12120            1262             3715             46               5
Lysine (1, 2)        161616                12562            1129             1491             22               23
Lysine (1, 3)        380744                497              562              72               1                1
Lysine (1, 4)        69888                 301              211              49               5                9
Lysine (2, 3)        116106                18114            3308             3903             900              767
Lysine (2, 4)        21312                 12984            2333             6136             753              151
Lysine (3, 4)        50208                 1572             1443             432              2                1
TPP (0, 1)           12285                 1                1                1                1                4
TPP (0, 2)           1638                  7                2                2                1                3
TPP (0, 3)           1274                  1                1                1                2                4
TPP (0, 4)           3276                  37               2                4                1                2
TPP (1, 2)           2430                  7                2                3                1                1
TPP (1, 3)           1890                  1                1                1                1                2
TPP (1, 4)           4860                  37               2                3                2                8
TPP (2, 3)           252                   10               3                2                1                1
TPP (2, 4)           648                   70               5                8                10               23
TPP (3, 4)           504                   37               2                4                2                2
FMN (0, 1)           165268                64652            30408            33008            16757            10653
FMN (0, 2)           210456                81400            67496            59909            41872            31626
FMN (0, 3)           68256                 23035            15421            14913            5889             2884
FMN (0, 4)           226888                73816            119320           86066            77322            71106
FMN (1, 2)           348318                238383           58527            127831           62766            30954
FMN (1, 3)           112968                88699            10682            43086            16154            5615
FMN (1, 4)           375514                227316           128089           170369           66602            22014
FMN (2, 3)           143856                100147           35235            63892            51349            42382
FMN (2, 4)           478188                276614           199137           234749           156409           105621
FMN (3, 4)           155088                95118            70710            84027            40944            16247
4.2 Comparison of Ranks to RNASLOpt Prediction
Because MUSHI is a post-processing tool used in conjunction with RNASLOpt, it has a
fundamental limitation: RNASLOpt should correctly enumerate at least one structure that
is reasonably similar to the native structure. In other words, if the correct answer does not
exist in the input, it cannot be found. In this section, we quantify the phrase “reasonably
similar.” In addition, we seek to make some general statement about when MUSHI is
helpful, and establish a positive correlation between RNASLOpt’s performance and
MUSHI’s performance.
It was necessary to establish some metric by which we could estimate how close
RNASLOpt had come to finding the correct structure. We decided to compare the
alignment score of the structure most similar to the native structure (without any
sequencing bonus) to the maximum and minimum possible alignment scores. First, we
consider all possible structures in the landscape, as well as their alignment scores with the
native on structure. Without loss of generality, we can consider the on and off structures
separately. Clearly, we can then sort all structures by their alignment scores, and there will
exist one structure with a maximum score and one structure with a minimum score. If we
can determine these two scores, we can establish a range, and determine the placement of
the SLOpt structure most closely matching the native structure in that range to estimate
RNASLOpt’s ability to enumerate the correct structure.
The problem of finding the most dissimilar structure in the landscape can be formalized as
follows. Given an alphabet Σ, a target string x ∈ Σ^N, an RNA sequence m of length M, a
|Σ| × |Σ| scoring matrix indicating d(a, b) for each a, b ∈ Σ, and bonus and gap parameters,
determine the string y ∈ Σ^M meeting the following two conditions: (1) y ∈ L(m), and
(2) ∄ z ∈ L(m) such that MNW(x, z) < MNW(x, y), where d(a, b) denotes the log-odds score between
characters a and b, L(m) denotes the landscape induced by RNA sequence m, and MNW(x,
y) denotes the score of the best alignment given by the Mattei variant of the Needleman-Wunsch
algorithm. This problem statement can be easily altered to search for the most
similar string in the landscape. Luckily, we are not interested in the string itself, but the
value MNW(x, y). Therefore, in the absence of any algorithm that solves this problem
directly, estimating the upper and lower bounds of this range should still allow us to make
meaningful statements about RNASLOpt’s performance.
In the case where the length of the native structure used matches the length of the sequence,
it is obvious that the string of BEAR characters that most closely matches the native
structure is the BEAR representation of the native structure itself. However, not all of our
sequences are the same length. Luckily, because all of the sequences are relatively similar in
length, we can still estimate the best possible alignment score by aligning the native
structure with itself.
Estimating the lower bound of this range is more difficult. A little logic tells us that the
score of the structure whose best possible alignment with the native structure is minimal cannot be less
than the score obtained by deleting the entire first sequence and inserting the entire second sequence.
This is because insertions and deletions are always an option in the alignment, so any global
alignment algorithm will either choose this alignment or choose a better one. This gives us
an absolute minimum bound on the similarity score of any structure. Of course, depending
on the gap parameter and the MBR itself, it may be the case that no structures will have
this arrangement as their best alignment (if, for example, the gap penalty is extremely
large), and so we would be overestimating the magnitude of this boundary. However, in
the absence of any algorithm that directly solves this problem, this is currently the best
estimate that we can use.
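The lower-bound argument above can be checked on a toy case. The Python sketch below uses plain Needleman-Wunsch with a linear gap penalty, not the Mattei variant (it has no bonus term and does not use the MBR); the match/mismatch scoring function and the example strings are made up for illustration. It only demonstrates that a global alignment score can never fall below the score of the all-gap alignment, (N + M) × gap.

```python
def needleman_wunsch(x, y, score, gap):
    """Best global alignment score of x and y with a linear gap penalty."""
    n, m = len(x), len(y)
    prev = [j * gap for j in range(m + 1)]  # row 0: y prefix aligned to gaps only
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + score(x[i - 1], y[j - 1]),  # align two characters
                prev[j] + gap,                            # delete x[i-1]
                cur[j - 1] + gap,                         # insert y[j-1]
            )
        prev = cur
    return prev[m]


def all_gap_bound(n, m, gap):
    """Score of deleting all n characters of x and inserting all m of y."""
    return (n + m) * gap


def toy_score(a, b):
    """Hypothetical substitution score standing in for the MBR log-odds values."""
    return 1.0 if a == b else -2.0


if __name__ == "__main__":
    gap = -0.75
    for x, y in [("AAAA", "BBB"), ("ABCA", "ABDA"), ("AAAA", "AAAA")]:
        s = needleman_wunsch(x, y, toy_score, gap)
        print(x, y, s, ">=", all_gap_bound(len(x), len(y), gap))
```

In the first pair every substitution costs more than two gaps, so the all-gap alignment is optimal and the bound is attained; in the other pairs the optimal score lies strictly above it, which is the sense in which the bound may overestimate the magnitude of the true minimum.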
In order to find a correspondence between RNASLOpt’s performance and MUSHI’s
performance, we used the aforementioned method to estimate the range of the alignment
scores of the structures in each landscape to the native structure. We then divided the
difference between the alignment score of the most similar structure and the lower bound
by the entire range to normalize the score. Finally, for each possible pairing of two
landscapes, we averaged this percentile for both the on and off structures and compared it
to the percentile rank of the target match in the complete pairwise alignment. What we
expected to find was that, as long as the average RNASLOpt percentile was above a certain
threshold, the target match should appear in a high percentile in MUSHI. Around the
threshold, performance should start to decrease, and below the threshold, MUSHI should
hold no predictive power, so we should not see any clear trends. Figure 4 shows what was
actually measured.
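The normalization described above can be sketched as follows in Python. The lower bound is taken as the all-gap score (N + M) × gap and the upper bound as the native structure's self-alignment score, as discussed earlier in this section; the function names and the numbers in the example are hypothetical and are not taken from the data set.

```python
def normalized_similarity(best_score, self_score, n, m, gap=-0.75):
    """Place the best alignment score within the estimated [lower, upper] range.

    best_score: alignment score of the landscape structure most similar to the native
    self_score: score of the native structure aligned with itself (estimated upper bound)
    n, m:       lengths of the two aligned strings
    gap:        linear gap penalty (assumed), giving the lower bound (n + m) * gap
    """
    lower = (n + m) * gap
    upper = self_score
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return (best_score - lower) / (upper - lower)


def average(values):
    """Mean of the normalized values being combined for one landscape pairing."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up example: two strings of length 80, self-alignment score 100,
    # best 'on' structure scoring 60 and best 'off' structure scoring 82.
    on = normalized_similarity(60.0, 100.0, 80, 80)
    off = normalized_similarity(82.0, 100.0, 80, 80)
    print(round(on, 3), round(off, 3), round(average([on, off]), 3))
```

A value near 1 means that RNASLOpt enumerated a structure almost identical to the native one, while a value near 0 means that even the closest structure aligns about as poorly as the all-gap alignment.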
In the figure, the line of best fit for the Adenine data set is positive, but the standard
deviation from this line is high. This is likely due to the small size of this data set, in which
even lowering the rank of the target match by one can have a large effect on its percentile.
If we focus on the TPP dataset, we can see that it follows the general trend that we expected
to see, with poor performance below the threshold, and high performance above it.
Similarly, the structures from the Lysine data set were all roughly at or above the threshold,
and so in MUSHI the target matches all appeared at or above the 95th percentile. The FMN
dataset is a bit more perplexing. Most of its structures seem to be at or below the threshold,
and the trend seen in the previous datasets disappears as expected. However, the points in
each dataset overlap more than expected. While the FMN data points well above the
threshold score in the 90th percentile in MUSHI, the points only slightly above the threshold
did not score nearly as well as the points from other datasets in a similar position along the
x axis. This overlap may be due to the fact that the estimation of the lower bound of the
range of alignment scores may not be highly accurate, as discussed previously. The
expected trend still exists, but its boundaries are somewhat fuzzy.
Figure 4 - Performance of RNASLOpt compared to performance of MUSHI
Removing Adenine from the graph highlights the trend:
Figure 5 - Performance of RNASLOpt compared to performance of MUSHI (without adenine)
The graph in Figure 5 seems to suggest that the performance of MUSHI on any two
landscapes containing structures at least 75% similar to the native structure should be
adequate.
4.3 Analysis of Consensus Landscapes
MUSHI should improve the output of RNASLOpt in two ways: (1) it should reduce the
number of structures in each landscape, leaving only those that can reasonably be called
part of a “consensus”, and (2) it should reorder the remaining structures such that the
structures that most closely match the native structures appear closer to the top of the list.
These are the two primary metrics by which we will evaluate MUSHI’s performance. The
way in which the results from the complete pairwise alignment can be translated into these
two objectives is simple. We scan through the N × M set of ranked pairings until we
reach an arbitrary threshold set by the user. For each match we encounter, we add both
structures to the consensus if they are not a part of it already. This guarantees that each
structure included from landscape A has at least one strong match from landscape B.
In this case, we determine the threshold by only examining the top x% of matches. The
percentage for each landscape comparison is different. In the tables throughout this section,
the Consensus Threshold column represents the tightest possible lower bound necessary to
include all four target structures in the consensus. The bar graphs in this section each
correspond to the tables above them, and convey the same information. For each
comparison (L1, L2), the blue bar shows the percentile ranking of the target structure in L1
after only applying RNASLOpt, and the orange bar shows the new rank of the target
structure in the landscape of L1 after comparing L1 with L2.
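The scan described above can be sketched in a few lines of Python. The input is the list of structure-index pairs already sorted best first; the function name, the use of a ceiling when converting the top x% into a count, and the choice to rank each structure by its first appearance in the scan are assumptions for illustration, since the reordering itself is specified in Chapter 3.

```python
import math


def build_consensus(ranked_pairs, top_percent):
    """Collect the structures appearing in the top x% of ranked pairings.

    ranked_pairs: list of (i, j) pairs, best match first, where i indexes a
        structure in landscape A and j a structure in landscape B.
    top_percent:  consensus threshold, as a percentage of all pairings.
    Returns two lists of structure indices in order of first appearance.
    """
    keep = math.ceil(len(ranked_pairs) * top_percent / 100.0)
    consensus_a, consensus_b = [], []
    for i, j in ranked_pairs[:keep]:
        if i not in consensus_a:
            consensus_a.append(i)  # assumed: first appearance defines the new rank
        if j not in consensus_b:
            consensus_b.append(j)
    return consensus_a, consensus_b


if __name__ == "__main__":
    # Made-up ranking of the 2 x 3 pairings between two toy landscapes.
    ranked = [(1, 0), (0, 2), (1, 2), (0, 0), (0, 1), (1, 1)]
    print(build_consensus(ranked, 50.0))  # keeps the top 3 pairings
```

Because structures are only ever added in pairs, every structure kept from one landscape is matched with at least one structure kept from the other, which is the guarantee stated above.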
Table 13 - Effect of MUSHI on adenine landscape size
Comparison (L1, L2)  Consensus Threshold (%)  L1 Old Size  L1 New Size  L2 Old Size  L2 New Size
(0, 1)               25.000                   4            2            3            3
(0, 2)               29.167                   4            4            6            4
(0, 3)               33.333                   4            3            3            3
(0, 4)               15.000                   4            2            5            2
(1, 2)               55.556                   3            3            6            5
(1, 3)               55.556                   3            3            3            3
(1, 4)               33.333                   3            3            5            2
(2, 3)               44.444                   6            4            3            3
(2, 4)               60.000                   6            6            5            5
(3, 4)               20.000                   3            2            5            2
Table 14 – Changes in target ‘On’ structures in adenine (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘On’ Old Rank  L1 ‘On’ New Rank  L2 ‘On’ Old Rank  L2 ‘On’ New Rank
(0, 1)               25.000                   1                 1                 1                 2
(0, 2)               29.167                   1                 2                 4                 3
(0, 3)               33.333                   1                 2                 1                 2
(0, 4)               15.000                   1                 0                 0                 0
(1, 2)               55.556                   1                 1                 4                 3
(1, 3)               55.556                   1                 2                 1                 2
(1, 4)               33.333                   1                 0                 0                 0
(2, 3)               44.444                   4                 3                 1                 2
(2, 4)               60.000                   4                 1                 0                 0
(3, 4)               20.000                   1                 0                 0                 0
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 6 - Changes in target 'On' structures in adenine (graphed)
Table 15 – Changes in target ‘Off’ structures in adenine (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘Off’ Old Rank  L1 ‘Off’ New Rank  L2 ‘Off’ Old Rank  L2 ‘Off’ New Rank
(0, 1)               25.000                   0                  0                  0                  0
(0, 2)               29.167                   0                  0                  0                  0
(0, 3)               33.333                   0                  0                  0                  0
(0, 4)               15.000                   0                  1                  2                  1
(1, 2)               55.556                   0                  0                  0                  0
(1, 3)               55.556                   0                  0                  0                  0
(1, 4)               33.333                   0                  2                  2                  1
(2, 3)               44.444                   0                  0                  0                  0
(2, 4)               60.000                   0                  3                  2                  1
(3, 4)               20.000                   0                  1                  2                  1
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 7 - Changes in target 'Off' structures in adenine (graphed)
As expected, the results from the Adenine group (Tables 13, 14, and 15, Figures 6 and 7)
are unimpressive due to the small landscape sizes. Even a match with a high alignment
score may rank within the 50th percentile of all matches, so MUSHI was only able to
remove a couple of structures from the list, if any. Furthermore, because the target structures
are already so close to the top, their upward mobility is limited. In other words, MUSHI is
unable to narrow down the landscapes any further because it was designed for larger
landscapes.
Table 16 - Effect of MUSHI on lysine landscape size
Comparison (L1, L2)  Consensus Threshold (%)  L1 Old Size  L1 New Size  L2 Old Size  L2 New Size
(0, 1)               0.010                    346          16           728          12
(0, 2)               1.620                    346          108          222          146
(0, 3)               0.229                    346          92           523          81
(0, 4)               0.015                    346          5            96           3
(1, 2)               0.015                    728          9            222          15
(1, 3)               0.030                    728          36           523          35
(1, 4)               1.010                    728          77           96           82
(2, 3)               2.834                    222          204          523          198
(2, 4)               3.120                    222          80           96           66
(3, 4)               0.064                    523          11           96           11
Table 17 – Changes in target ‘On’ structures in lysine (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘On’ Old Rank  L1 ‘On’ New Rank  L2 ‘On’ Old Rank  L2 ‘On’ New Rank
(0, 1)               0.010                    21                1                 96                2
(0, 2)               1.620                    21                0                 159               27
(0, 3)               0.229                    21                9                 96                80
(0, 4)               0.015                    21                3                 89                1
(1, 2)               0.015                    96                5                 159               14
(1, 3)               0.030                    96                31                96                28
(1, 4)               1.010                    96                55                89                38
(2, 3)               2.834                    159               75                96                78
(2, 4)               3.120                    159               37                89                3
(3, 4)               0.064                    96                10                89                4
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 8 - Changes in target 'On' structures in lysine (graphed)
Table 18 – Changes in target ‘Off’ structures in lysine (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘Off’ Old Rank  L1 ‘Off’ New Rank  L2 ‘Off’ Old Rank  L2 ‘Off’ New Rank
(0, 1)               0.010                    149                6                  0                  9
(0, 2)               1.620                    149                2                  158                75
(0, 3)               0.229                    149                2                  31                 2
(0, 4)               0.015                    149                4                  24                 2
(1, 2)               0.015                    0                  1                  158                13
(1, 3)               0.030                    0                  0                  31                 0
(1, 4)               1.010                    0                  4                  24                 5
(2, 3)               2.834                    158                134                31                 3
(2, 4)               3.120                    158                30                 24                 0
(3, 4)               0.064                    31                 0                  24                 0
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 9 - Changes in target 'Off' structures in lysine (graphed)
The results from the Lysine riboswitch were much more promising (Tables 16, 17, and 18,
Figures 8 and 9). In all cases, the landscape size was reduced significantly because the
target matches were ranked so highly in the N × M alignments that it was possible to draw
a tighter consensus. However, because these reductions depend first on the ranks of each
target match, they should be considered secondary to how much the rank of each target
structure improved, which is the ultimate metric by which these methods should be
evaluated. In Table 17, the most apparent result is that, in landscape 0, the best on
structure was previously ranked 21, whereas after finding the consensus with each of the four other
landscapes, its new ranking was 9 or better. Further, its best off
structure was previously ranked 149, whereas in all cases, its new ranking was 6 or better (Table 18).
Critically, it should be noted that the best off structure in landscape 1 was already ranked
0, and while in some cases MUSHI decreased the ranking of this structure, the magnitudes
of the aforementioned increases greatly exceeded those of the decreases, suggesting that
MUSHI is able to preserve good results from RNASLOpt while simultaneously increasing
the ranking of previously obscure but biologically significant structures.
As another more visual example, consider the Lysine native off structure in Figure 10:
Figure 10 - Lysine native 'Off' structure
After RNASLOpt processes the sequence Lysine 3, the most stable (top-ranking) structure
in the landscape it returns is shown in Figure 11:
Figure 11 - Top structure returned by RNASLOpt for sequence lysine 3
However, after comparing the landscapes for Lysine 1 and Lysine 3, and reordering their
structures according to the methodology in Chapter 3, the top-ranking structure (previously
ranked 31) in the new consensus landscape is shown in Figure 12:
Figure 12 - Top structure returned by RNASLOpt+MUSHI for lysine 3 after comparison with lysine 1
A visual inspection of all three diagrams confirms how much more similar the new top
structure looks to the native structure in comparison with the old top structure. However,
it is important to note that it is unusual for the target structure to appear at the very top of
the rankings in the consensus landscape. Even when results are good, most target structures
merely appear near the top, so this example only serves as a visual aid, and is not
necessarily representative of the final results.
Table 19 - Effect of MUSHI on TPP landscape size
Comparison (L1, L2)  Consensus Threshold (%)  L1 Old Size  L1 New Size  L2 Old Size  L2 New Size
(0, 1)               0.057                    91           7            135          8
(0, 2)               16.117                   91           34           18           18
(0, 3)               0.706                    91           7            14           4
(0, 4)               0.366                    91           7            36           5
(1, 2)               59.671                   135          113          18           18
(1, 3)               8.095                    135          48           14           14
(1, 4)               6.070                    135          51           36           35
(2, 3)               37.302                   18           17           14           14
(2, 4)               66.667                   18           18           36           34
(3, 4)               3.968                    14           7            36           10
Table 20 – Changes in target ‘On’ structures in TPP (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘On’ Old Rank  L1 ‘On’ New Rank  L2 ‘On’ Old Rank  L2 ‘On’ New Rank
(0, 1)               0.057                    2                 2                 99                6
(0, 2)               16.117                   2                 9                 15                7
(0, 3)               0.706                    2                 4                 5                 3
(0, 4)               0.366                    2                 0                 12                4
(1, 2)               59.671                   99                61                15                11
(1, 3)               8.095                    99                47                5                 2
(1, 4)               6.070                    99                30                12                1
(2, 3)               37.302                   15                7                 5                 5
(2, 4)               66.667                   15                12                12                26
(3, 4)               3.968                    5                 2                 12                9
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 13 - Changes in target 'On' structures in TPP (graphed)
Table 21 – Changes in target ‘Off’ structures in TPP (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘Off’ Old Rank  L1 ‘Off’ New Rank  L2 ‘Off’ Old Rank  L2 ‘Off’ New Rank
(0, 1)               0.057                    0                  3                  0                  3
(0, 2)               16.117                   0                  1                  3                  2
(0, 3)               0.706                    0                  1                  0                  1
(0, 4)               0.366                    0                  1                  8                  1
(1, 2)               59.671                   0                  0                  3                  0
(1, 3)               8.095                    0                  0                  0                  0
(1, 4)               6.070                    0                  5                  8                  6
(2, 3)               37.302                   3                  0                  0                  0
(2, 4)               66.667                   3                  2                  8                  1
(3, 4)               3.968                    0                  0                  8                  1
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 14 - Changes in target 'Off' structures in TPP (graphed)
The results from the TPP riboswitch (Tables 19, 20, and 21, Figures 13 and 14) were similar
to those from the Lysine riboswitch. In general, target structures that were previously
highly ranked stayed highly ranked (with one notable outlier), while lower-ranking
structures tended to rise toward the top. As seen with the Adenine riboswitch, MUSHI’s ability
to reduce the landscape size diminishes as the landscapes become smaller. Specifically, the
landscapes of sizes 14 and 18 did not see much reduction.
Table 22 - Effect of MUSHI on FMN landscape size
Comparison (L1, L2)  Consensus Threshold (%)  L1 Old Size  L1 New Size  L2 Old Size  L2 New Size
(0, 1)               98.392                   316          316          523          523
(0, 2)               25.038                   316          316          666          596
(0, 3)               10.677                   316          248          216          205
(0, 4)               31.340                   316          316          718          633
(1, 2)               69.862                   523          523          666          666
(1, 3)               55.322                   523          518          216          216
(1, 4)               49.004                   523          523          718          715
(2, 3)               29.461                   666          579          216          216
(2, 4)               22.088                   666          653          718          702
(3, 4)               10.476                   216          216          718          431
Table 23 – Changes in target ‘On’ structures in FMN (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘On’ Old Rank  L1 ‘On’ New Rank  L2 ‘On’ Old Rank  L2 ‘On’ New Rank
(0, 1)               98.392                   3                 306               485               388
(0, 2)               25.038                   3                 253               285               437
(0, 3)               10.677                   3                 162               72                147
(0, 4)               31.340                   3                 166               208               104
(1, 2)               69.862                   485               449               285               373
(1, 3)               55.322                   485               497               72                119
(1, 4)               49.004                   485               515               208               106
(2, 3)               29.461                   285               343               72                51
(2, 4)               22.088                   285               112               208               34
(3, 4)               10.476                   72                113               208               132
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 15 - Changes in target 'On' structures in FMN (graphed)
Table 24 – Changes in target ‘Off’ structures in FMN (tabulated)
Comparison (L1, L2)  Consensus Threshold (%)  L1 ‘Off’ Old Rank  L1 ‘Off’ New Rank  L2 ‘Off’ Old Rank  L2 ‘Off’ New Rank
(0, 1)               98.392                   29                 88                 333                153
(0, 2)               25.038                   29                 121                386                334
(0, 3)               10.677                   29                 46                 185                91
(0, 4)               31.340                   29                 59                 362                372
(1, 2)               69.862                   333                238                386                259
(1, 3)               55.322                   333                200                185                30
(1, 4)               49.004                   333                88                 362                439
(2, 3)               29.461                   386                287                185                38
(2, 4)               22.088                   386                79                 362                151
(3, 4)               10.476                   185                61                 362                175
[Bar chart. x-axis: Comparison, all 20 ordered pairs (L1, L2) from (0, 1) to (4, 3); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI]
Figure 16 - Changes in target 'Off' structures in FMN (graphed)
Not surprisingly, the FMN riboswitch fared the worst (Tables 22, 23, and 24, Figures 15
and 16). The landscapes saw almost no reduction in size, and the ranking of target
structures often decreased dramatically in the consensus, especially the on structure in
landscape 0. As discussed in the previous section, this suggests that MUSHI’s ability to
extract significant results is limited by the ability of RNASLOpt to correctly predict native
structures.
Finally, Figure 17 shows the average effect of MUSHI on the sizes of all landscapes except
those from Adenine, which are omitted because of their small size. The
blue bars indicate the old size, the orange bars indicate the average new size, and the
markings on the orange bars indicate the range of values included in each average. Once
again, we see that, even using an arbitrary value for the consensus threshold, MUSHI is
able to reduce the size of the landscapes significantly.
[Bar chart. x-axis: Landscapes (Lysine 0 to 4, TPP 0 to 4, FMN 0 to 4); y-axis: Number of Structures, 0 to 800; series: Old Size, New Size]
Figure 17 - Average effect of MUSHI on landscape size
4.4 Benchmarking
As discussed in Chapter 2, RNAConSLOpt also seeks to find the consensus landscape
between multiple structures. In this section, we compare the performance of MUSHI to
that of RNAConSLOpt. We used the same datasets in both cases, preparing a multiple
sequence alignment between all the sequences belonging to each riboswitch, and using the
same values for Δp and ΔB given in Table 1. RNAConSLOpt takes the multiple sequence
alignment and user-defined parameters and outputs a list of structures sorted by stability.
Using the same methodology used in MUSHI, we aligned each structure with the native
structures to determine the closest matches. Table 25 details the results.
Table 25 - Performance of RNAConSLOpt on riboswitches from B. subtilis
Riboswitch  # of ConSLOpt Structures  Rank of Target ‘On’  Alignment Score of Target ‘On’ with Native  Rank of Target ‘Off’  Alignment Score of Target ‘Off’ with Native
Adenine     2                         1                    35.9                                        0                     38.94
Lysine      3                         1                    74.31                                       0                     110.72
TPP         5                         4                    45.02                                       0                     35.01
FMN         49                        8                    28.42                                       13                    68.4
As seen in the table, RNAConSLOpt has the ability to return a much smaller set of
structures than that returned by RNASLOpt, and the structures most closely resembling the
native structures appear near the top of each list. However, RNAConSLOpt differs from
MUSHI in the breadth of solutions provided. The structures most closely resembling the
native structures in the landscapes returned by RNAConSLOpt have an alignment score
which is near the average of the analogous structures returned by RNASLOpt. Essentially,
RNAConSLOpt will return an “average” of the structures allowed by the multiple sequence
alignment, while the results of MUSHI have much more variance. Depending on which
two landscapes are combined, MUSHI can produce very good or very bad results, whereas
RNAConSLOpt’s results are more moderate. The figures below show this comparison.
Each circular node in the graph indicates a target structure in one landscape returned by
RNASLOpt, while the triangular node indicates the score of the target structure returned
by RNAConSLOpt. As shown in Figures 18 and 19, each of the target structures from
RNAConSLOpt is roughly in the middle of the range of scores of its analogous
structures in RNASLOpt, thus indicating the difference in breadth of solutions given by
the two programs.
[Figures 18 and 19: panels for Adenine, Lysine, TPP, and FMN]
Figure 18 - Structural similarity of RNASLOpt and RNAConSLOpt 'On' target structures to native 'On' structure
Figure 19 - Structural similarity of RNASLOpt and RNAConSLOpt 'Off' target structures to native 'Off' structure
These graphs, combined with the tables in Section 4.3, seem to indicate that MUSHI can sometimes
obtain results better than RNAConSLOpt, but only if we know in advance which two
landscapes to combine. Combining the wrong landscapes may yield worse results, and on
average MUSHI yields comparable results. This may come as no surprise, as the
methodology of MUSHI is essentially the inverse of RNAConSLOpt: MUSHI folds the
sequences and then aligns them, whereas RNAConSLOpt aligns the sequences and then
folds them.
CHAPTER FIVE: CONCLUSIONS AND FUTURE WORK
5.1 Conclusions, Advantages, Disadvantages, and Limitations
The data collected during this study suggest that when the output of RNASLOpt is of
moderate to good quality, MUSHI can significantly increase the rankings of the correct
target structures.
One of the advantages of using MUSHI to find the consensus landscape is that it is
thorough. All possible pairs of structures are considered, so all strong matches will
naturally gravitate toward the top. The results are most dramatic when a low-ranking target
structure from one landscape matches a high-ranking structure from another landscape.
Although the match will incur a penalty due to the low ranking of the first structure, the
fact that structural similarity counts double in the weighted sum limits the effect of such a
penalty. An additional advantage is that MUSHI uses BEAR+MBR to estimate structural
distance, which Mattei et al. introduced in 2014 [25]. To the best of our knowledge, this is
one of the few major uses of this methodology, giving MUSHI an advantage over other
systems such as RNAdistance [29]. RNAdistance performs global alignments between
structures encoded in dot-bracket notation, which we have argued is less accurate than
BEAR. RNAdistance is capable of translating each dot-bracket string into a more
“coarse-grained” representation using an alphabet of six characters: H (hairpin loop), I (internal
loop), B (bulge), M (multiloop), S (stack), and E (external/unpaired). This is very similar
to the alphabet we originally envisioned before discovering BEAR, and we have already
discussed the advantages BEAR has over this primitive alphabet. Additionally, the use of
the MBR substitution matrix is an invaluable tool in these alignments, as it allows similar
but non-matching structures to be aligned based on empirical data describing their mutation
rates, whereas traditional alignment methods tend to use a binary score which only rewards
exact matches. Furthermore, RNAdistance uses a metric called “base pair distance”, which
is calculated by determining the number of base pairs which are in one structure but not in
the other. This metric only makes sense when both structures are the same length, as two
base pairs can then be considered identical if their starting and ending indices match.
Because we must work with sequences of different length, however, this metric is not
useful to us, and indeed Hofacker et al. recommend against using RNAdistance for this
purpose.
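The base pair distance described above can be illustrated with a short sketch. This is a minimal Python illustration, not RNAdistance's own implementation: the function names are hypothetical, and we assume the distance is the number of base pairs present in exactly one of the two structures (the symmetric difference of their base pair sets).

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    opened, pairs = [], set()
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            opened.append(j)
        elif symbol == ")":
            pairs.add((opened.pop(), j))
    return pairs


def base_pair_distance(a, b):
    """Count base pairs present in exactly one of two equal-length structures."""
    if len(a) != len(b):
        # Indices are only comparable when both structures have the same length.
        raise ValueError("base pair distance assumes equal-length structures")
    return len(base_pairs(a) ^ base_pairs(b))
```

Because pairs are compared by their indices, the sketch rejects structures of different length, which is the situation MUSHI faces when comparing landscapes from different sequences.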
Naturally, there are many disadvantages to using MUSHI, the most obvious of which is the
trade-off between thoroughness and speed. Because MUSHI must perform global
alignment on all possible pairs of structures, it is very slow, running in O(N*M*A*B) time,
where N is the size of landscape one, M is the size of landscape two, A is the length of
RNA sequence one, and B is the length of RNA sequence two. For example, the
comparison between the FMN 2 and FMN 4 landscapes takes over an hour on a computer
with an Intel Core i5-4200U CPU @ 1.60 GHz and 6 GB of RAM. In order to improve on
this run time, it would be necessary to find a faster way to compare two strings. While
some fast alignment-free heuristics exist for string comparison, it is not immediately
apparent how to modify them so that they make use of the substitution matrix, so we chose
the most straightforward method to prove the concept.
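The source of the O(N*M*A*B) running time can be seen in a minimal sketch of the complete pairwise alignment. The code below is an illustration under stated assumptions, not MUSHI's implementation: it uses a plain Needleman-Wunsch score with a linear gap penalty, and `toy_substitution` is a hypothetical stand-in for the MBR matrix over BEAR characters.

```python
def global_alignment_score(s, t, substitution, gap=-1.0):
    """Needleman-Wunsch score of two strings; O(len(s) * len(t)) time."""
    previous = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        current = [i * gap]
        for j in range(1, len(t) + 1):
            current.append(max(
                previous[j - 1] + substitution(s[i - 1], t[j - 1]),  # align
                previous[j] + gap,                                   # gap in t
                current[j - 1] + gap,                                # gap in s
            ))
        previous = current
    return previous[-1]


def all_pairs_scores(landscape_one, landscape_two, substitution, gap=-1.0):
    """Score every structure pair: N * M alignments, each costing O(A * B)."""
    return {
        (i, j): global_alignment_score(s, t, substitution, gap)
        for i, s in enumerate(landscape_one)
        for j, t in enumerate(landscape_two)
    }


def toy_substitution(x, y):
    # Hypothetical binary scores, used here only to keep the sketch runnable.
    return 1.0 if x == y else -1.0
```

The outer comprehension contributes the N*M factor and each alignment contributes A*B, so any speed-up must come from replacing the inner dynamic program or from pruning pairs before aligning them.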
A second and perhaps fundamental disadvantage is that, in its current form, what MUSHI
returns is arguably different from the actual consensus landscape. Because the problem we
set out to solve was to improve the prediction of the native structures of riboswitches, we
defined the consensus as those structures included in a match whose total match score was
above some arbitrary percentile. While this method produced some moderately good
results, we realized that, if two perfect target structures were at the bottom of their
respective landscapes, they would appear in the 50th percentile in the complete pairwise
structural alignment phase, and therefore would likely not appear in the consensus if we
set any meaningful discriminatory threshold. This could be solved by altering the
methodology to rank the matches using only structural similarity information in the
complete pairwise structural alignment, determine the consensus by setting a threshold,
and then use stability and energy information to rank the structures within that threshold. In
other words, stability and energy should not be factors for entry into the consensus. We are
currently working on adapting MUSHI to this new methodology, at which point it can be
said that MUSHI is able to find the actual consensus landscape. Luckily, because the results
we obtained from the original methodology were good enough to be significant, we stuck
with them for the purposes of this thesis. We expect slight improvements in the results after
the new methodology is implemented.
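The revised methodology can be sketched in two stages. The following is a minimal illustration rather than the planned implementation: the `Match` record, its field names, and the use of a raw similarity cutoff as the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    pair: tuple          # (index in landscape one, index in landscape two)
    similarity: float    # structural alignment score; higher is better
    rank_penalty: float  # combined stability/energy penalty; lower is better


def revised_consensus(matches, similarity_threshold):
    """Admit matches on similarity alone, then order survivors by penalty."""
    # Stage 1: stability and energy play no role in entering the consensus.
    admitted = [m for m in matches if m.similarity >= similarity_threshold]
    # Stage 2: stability and energy only rank the matches already admitted.
    return sorted(admitted, key=lambda m: m.rank_penalty)
```

Under this scheme, two highly similar target structures at the bottom of their landscapes are still admitted; their poor rankings only move them down the final list.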
Another disadvantage of MUSHI is its accuracy in terms of scoring structures based on
their stability and energy. Because RNASLOpt outputs a stability and energy ranking for
each structure, MUSHI uses the rankings to determine the percentile to which that structure
belongs in the weighted sum. Essentially, we assume that energy increases linearly as we
move farther down in the rankings. This is not the case, however. When using RNASLOpt,
as we increase Δp at a constant rate, we find that the number of new structures returned at
each time step continues to increase. This means that, in a landscape of size 100, the energy
difference between structure 0 and structure 49 will be much greater than the difference
between structure 49 and structure 99. The percentile-based ranking, however, assumes
these two intervals are equal. This creates a bias against low-ranking target structures
because it assumes middle-ranking structures are more viable candidates than they really
are, and scores them significantly higher. This can be fixed by using the absolute free
energy level and stability level of each structure. While the free energy for each structure
can be extracted from RNASLOpt’s output, RNASLOpt currently provides no information
on the exact stability of each structure aside from ranking them. If we could modify
RNASLOpt to provide us with this information, we could modify MUSHI to make use of
it, and improve the accuracy of our results on datasets such as FMN.
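The bias introduced by percentile-based ranking can be made concrete with a small worked example. The sketch below covers free energy only, since RNASLOpt does not report exact stability values; the function names and the energy values in the usage note are hypothetical, and the input is assumed to be sorted from best to worst with at least two distinct energies.

```python
def percentile_scores(energies):
    """Score by rank alone: equal steps between consecutive structures."""
    last = len(energies) - 1
    return [rank / last for rank in range(len(energies))]


def energy_scores(energies):
    """Score by absolute free energy, rescaled so best = 0 and worst = 1."""
    best, worst = energies[0], energies[-1]
    return [(e - best) / (worst - best) for e in energies]
```

For hypothetical energies of -30, -22, -18, -16, and -15 kcal/mol, the middle structure receives a percentile score of 0.5 but an energy-based score of 0.8, so the percentile scheme places it closer to the best structure than its free energy warrants.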
While fixing these disadvantages could improve the results of MUSHI, there are a few
fundamental limitations that may not be able to be fixed without trying an alternative
approach entirely. One of these limiting factors is that stability and energy may not be
categorically good. While simple intuition can tell us that the correct native structures
should be stable and energetically favorable, it may be the case that these factors are only
important up to a certain threshold, and beyond this threshold, they hold little predictive
power. In other words, as long as the biological structures are stable enough, there might
not be an additional need for stability, so further ranking SLOpt structures based on these
factors might not help us, at which point we would need to determine some additional
predictive metrics.
A second limiting factor is that gauging the significance of structural similarity is difficult.
While it is certainly true that, if the correct native structure appears in both landscapes, they
will be structurally similar, there may be other structures that resemble each other simply
as a consequence of sequence identity. A fundamental question still remains: at which level
of sequence identity do folding landscapes diverge? A quantitative answer to this question
would be extremely difficult to derive, due to the immense size of the landscape and the
inherent difficulty in visualizing it. However, if there is a threshold sequence identity above
which two landscapes still more or less resemble one another, the consensus landscape will
be large, and structural similarity is likely to bear less significance for predicting native
structures. Once again, the predictive power of our data would be limited.
Another limitation is that MUSHI is unable to predict the native structures exactly, instead
co-opting the problem of finding a consensus landscape to aid in the prediction. While
expanding MUSHI to find a consensus between more than two landscapes (discussed in
section 5.2) would be able to narrow the list of candidates even further, like RNASLOpt,
the methodology is simply not designed to choose a single pair of structures and propose
them as predictions.
A final fundamental limitation of MUSHI is that, in practice, we cannot reliably predict
where to place the consensus threshold. In Chapter 4, we chose the tightest possible
boundary that would include the target matches in the consensus, but we had prior
knowledge of where the target matches ranked. Without this knowledge, we cannot be
confident that the target matches will be in the consensus. Then again, they should only
appear in the consensus if they resemble the actual native structures. The nature of this
problem requires us to work in a gradient, which makes perfect discretization of the
problem almost impossible.
5.2 Future Work and Alternative Approaches
In order to improve MUSHI’s performance, we first plan on correcting the drawbacks and
disadvantages discussed in the previous section. Specifically, we aim to speed up MUSHI’s
performance by utilizing linear-time string comparison algorithms, alter the methodology to
remove stability and energy as factors for entering the consensus, and replace the
percentile-based ranking system with actual values for free energy and stability.
Another of MUSHI’s drawbacks is that the performance can vary depending on which
landscapes are compared. In order to get around this, we can adapt MUSHI to find a
consensus between more than two landscapes. This can be achieved for three landscapes
A, B, and C by performing all possible comparisons AB, AC, BC, and deriving two
abridged versions of each landscape. These can then be merged via some arbitrary process,
such as only returning the structures appearing in both landscapes, as these are likely to be
the most significant. In reality, it would be necessary to use a more sophisticated process
for determining which structures are in the final consensus, as this process would still be
subject to the limitations discussed earlier, such as working in a gradient.
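The simple merging process described above can be sketched as follows. This is an illustration only: `pairwise_consensus` is a hypothetical stand-in for one MUSHI run that returns the structures of each input landscape surviving the comparison, and the merge rule is the plain intersection mentioned in the text. At least two landscapes are assumed.

```python
from itertools import combinations


def multi_landscape_consensus(landscapes, pairwise_consensus):
    """Intersect the abridged versions each landscape gets from its pairings.

    `landscapes` maps a name to a list of structures. With three landscapes
    A, B, and C, the comparisons AB, AC, and BC give each landscape two
    abridged versions, which are merged by keeping their common structures.
    """
    abridged = {name: [] for name in landscapes}
    for a, b in combinations(landscapes, 2):
        kept_a, kept_b = pairwise_consensus(landscapes[a], landscapes[b])
        abridged[a].append(set(kept_a))
        abridged[b].append(set(kept_b))
    return {
        name: set.intersection(*versions)
        for name, versions in abridged.items()
    }
```

The intersection is the strictest possible merge; a softer rule, such as keeping structures that survive a majority of comparisons, would slot into the final dictionary comprehension.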
Alternatively, one might try approaching the problem from a different perspective, such as
decomposing the landscape into smaller elements, similar to GraphClust [22]. One such
decomposition could be a network of stack structures. Each unique stack appearing in a
sequence’s landscape is defined by its starting and ending points, and its length. Each stack
could be represented by a node in the graph, and directed edges indicate that one stack
encloses another stack in at least one structure. Edge labels could specify exactly which
structures contain such an enclosure. This network would represent how stacks relate to
one another, and could also be unambiguously converted back into the original landscape.
Gathering statistics about this network could lead to important insights, but even more
interesting is the prospect of performing balanced global network alignment on two of these
graphs. In Chapter 1, we discussed how this would be infeasible for entire energy folding
landscapes due to their immense size, but these substructural networks would only consist
of a few hundred or a few thousand nodes, and so the algorithms discussed by Zaslavskiy
et al. [11] would now be applicable. An alignment between two substructural graphs could
indicate a more significant correlation than what could be uncovered by MUSHI, because
comparisons in MUSHI necessarily involve entire structures. In an abstract sense, this new
methodology would allow one to increase the resolution at which we can make meaningful
comparisons, allowing us to pick out interesting details instead of having to always look at
the bigger picture.
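A minimal sketch of the proposed stack network is given below. It rests on several assumptions not fixed by the text: structures are pseudoknot-free dot-bracket strings, a stack is a maximal run of directly nested base pairs identified by the tuple (start, end, length), and an edge is recorded for any enclosure, not only for direct nesting. The function names are hypothetical.

```python
from collections import defaultdict


def stacks(dot_bracket):
    """Return the stacks of a structure as (start, end, length) tuples."""
    opened, pairs = [], []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            opened.append(j)
        elif symbol == ")":
            pairs.append((opened.pop(), j))
    pair_set, found = set(pairs), []
    for i, j in sorted(pairs):
        if (i - 1, j + 1) in pair_set:
            continue  # not the outermost pair of its stack
        length = 1
        while (i + length, j - length) in pair_set:
            length += 1
        found.append((i, j, length))
    return found


def stack_network(landscape):
    """Map each enclosure edge (outer, inner) to the structures containing it."""
    edges = defaultdict(set)
    for index, structure in enumerate(landscape):
        found = stacks(structure)
        for outer in found:
            inner_start = outer[0] + outer[2] - 1  # innermost pair of outer
            inner_end = outer[1] - outer[2] + 1
            for inner in found:
                if inner_start < inner[0] and inner[1] < inner_end:
                    edges[(outer, inner)].add(index)
    return dict(edges)
```

The edge labels are sets of structure indices, matching the idea that each edge should record exactly which structures contain the enclosure.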
Finding the consensus landscape is still a novel problem in structural bioinformatics, and
this thesis merely scratches its surface. It is our hope that more researchers will take interest
in the problem in the coming years, and use it to leverage critical information that will
change our understanding of RNA secondary structure.
APPENDIX: RNA SEQUENCES AND STRUCTURES USED
The following sequences and structures are taken from the benchmarking set used in Li et
al.’s paper on RNAConSLOpt [10]. They can be accessed at the following link:
http://genome.ucf.edu/RNAConSLOpt/Benchmarks.txt
Each RNA sequence is labeled with the reference used in this thesis (e.g. Adenine 0),
followed by its accession number.
Adenine 0:
>D88802.1
AUUAUCACU-UGUAUAACCUCAAUAAUAUGGUUUGAGGGUGUCUACCAGGAACCGUAAAAUCCUGAUUACA
AAAUUUGUUUAUG-ACAUUUUUUGUAAUCAGGAUUUU
Adenine 1:
>AAXV01000018.1
AUUUGAAC--UGUAUAACCUCAAUAAUAUGGAUUGAGGGUCUCUACCAGGAACCAUAAAAUCCUGACUACAA
AA----CUUUGU-UUCAUUUUUGUAGUCAGGAUUUU
Adenine 2:
>AAEK01000052.1
-UGAGAAUCAUGUAUAACUCCAAGAAUAUGGCUUGGGGGUCUCUACCAGGAACCAAUAACUCCUGACUACA
AAAU--GCGUAUU-AUAGCGUUUGUAGUCAGGAGUUU
Adenine 3:
>BA000016.3
AUUUUGCUU-CGUAUAACUCUAAUGAUAUGGAUUAGAGGUCUCUACCAAGAACCGAGAAUUCUUGAUUACG
AAGAAAGCUUAUUUGCUUUCUUCGUAAUCAAGAAUU-
Adenine 4:
>CP000851.1
-UUAACACUUCGUAUAAUCUCAAUGAUAUGGUUUGAGAGUUUCUACCAAGAGCCCUAAACUCUUGAUUAUG
AAGACUUUACUUU-AUGUAAUGCUAAUUUAACAAGUU
Native functional structures from the adenine riboswitch of the ydhL gene of
Bacillus subtilis:
AUUAUCACU-UGUAUAACCUCAAUAAUAUGGUUUGAGGGUGUCUACCAGGAACCGUAAAAUCCUGAUUACA
AAAUUUGUUUAUG-ACAUUUUUUGUAAUCAGGAUUUU
........(-((((...((((((.........))))))........(((((.........)))))..))))
)....((((...)-)))....................
.........-.......((((((.........))))))..................(((((((((((((((
(((..((((...)-)))..))))))))))))))))))
Lysine 0:
>J03294.1
GAAGAUAGAGGU-GCGAACUUCAAGAGUAUGCCUUUGGAGAAAGAUGGAUUCUG-UGAAAAAGGCUGAAAG
GGGAGCGUCGCCGAAGCAAAUAAAACCCCAUCGGUAUUAUUUGCUGGCCGUGCAUUGAAUAAAUGUAAGGC
UGUCAAGAAAUCAUUUUCUUGGAGGGCUAUCUCGUUGUUCAUAAUCAUUUAUGAUGAUUAAUUGAU--AAGC
AAUGAGAGUAUUCCUCUCAUUGC
Lysine 1:
>CP000002.3
GAAGAUAGAGGUGCGAACUUCAAGAGUAGGCUUGAUGAGGAAGAUGGAUUCCGAUGAAGAAAGCCGAAAGGGGAGCGUCGCCG
AAGCGGGGAAAAAUCCACUCGUUUUUCCUGCUGGCUUUACAUUGAAUAAAUGUGAGGCUGUCAAGAAAUCA
UUU-CUUGGAGAGCUAUCUCGUUGUUUAAGAUCAUCGGCAUU--UUUGUUGGUUAAAGCAAUGAGGGAAUUC
-UCUCGUUGC
Lysine 2:
>AAOX01000015.1
GAAGGUAGAGGU-GCAAACUUCAUCAGUAAAAGCUUGGAGAAAGAUGAGUUUCCGUGAAAAGCUUUGAAAG
GGAAUGUUUGCCGAAGAAAAGGAAGUCUCAUUU-CUUUCUUUUCUGGUCCUGUAUUGAAUAAAUACUGGAU
UGUCAAGACAGCGCCGUCUUGGAGAGCUAUCUCACUGUGUGGGCAUAUUU-UAUAUGUAUUUAAAACACAG
CAAUGGGAUGGUUAUUCUCAUUG-
Lysine 3:
>M93419.1
GAAGAUAGAGGU-GCGAACUUCAUCAGUAAAAGCUUGGAGAAGAAUGAGCUUCAAUGAAAAGCUUUGAAAG
GGAACGUUCGCCGAAGUGAAGAAAAACUCAUUU-UUUUCUUUGCUGGUCCUGCAUUUAAGAGAUGCCGGAU
UGUCAAGGCGGUGCCGCCUUGGAGAGCUAUCUCACUGUGUCUGCGUAUUU-UAC---UACGUUAUCCACAG
CAAUGAGGUAGCU-UUCUCAUUGC
Lysine 4:
>CP001186.1
AAAGGUAGAGGCCGCGAUAGGAAAGAGUAAGCUAUGGGAGAUUUAAUGGAAUCUGUGAUCAUAGGUUGAAA
GGGACUAUUGCCGAAAUAUAAGAAUAACCAUCUUAUUCAUAUAUUGGGACUACAUUGAAUAAAUGUAGUAC
UGUCAUAAGAUUUAUUUUAUGGAGAGCUAUUUGGAGAUGUUGAUGCGGUUUCUUA--UUUUGAGGAGAUAAC
AACUCGUUUAUU-UUUUCAAUAU
Native functional structures from the lysine riboswitch of the lysC gene of Bacillus
subtilis:
GAAGAUAGAGGU-GCGAACUUCAAGAGUAUGCCUUUGGAGAAAGAUGGAUUCUG-UGAAAAAGGCUGAAAG
GGGAGCGUCGCCGAAGCAAAUAAAACCCCAUCGGUAUUAUUUGCUGGCCGUGCAUUGAAUAAAUGUAAGGC
UGUCAAGAAAUCAUUUUCUUGGAGGGCUAUCUCGUUGUUCAUAAUCAUUUAUGAUGAUUAAUUGAU--AAGC
AAUGAGAGUAUUCCUCUCAUUGC
..((((((..(.-((((.((((........((((((...(((.......)))..-....))))))......
.))))..)))))..(((((((((.(((.....))).)))))))))((((.((((((.....)))))).)))
).((((((((....))))))))....))))))......((((((((((.....)))))))..))).--..(
(((((((((.....))))))))))
.....(((..(.-((((.((((........((((((...(((.......)))..-....))))))......
.))))..)))))..(((((((((.(((.....))).)))))))))((((.((((((.....)))))).)))
).((((((((....))))))))....)))(((((((((((((((((((.....)))))))..))..--.))
))))))))................
TPP 0:
>AL009126.3
GCAGAACAAUUCAAUA-UGUAUUCGUUUAACCACUAGGGGUGUCCUUCAUAAGGGCUGAGAUAAAAGUGUG
--ACUUUUAGACCCUCAUAACUUGAACAGGUUCAGACCUGCGUAGGGAAGUGGAG-CGGUAUU--UGUGUUA
UUUUACUAUG---CCAAUUCCAAACCACUUUUCCUUGCGGGAAAGUGGUUU
TPP 1:
>CP000002.3
AUAAACUGAAUGAACA-AGAAAUGUUUU--CCACUAGGGGAGUCCUUGAUAAGGGCUGAGAUAAAAGUUUG--
ACUUUUAGACCCUCAUAACCUGAACAGGUUCAAACCUGCGUAGGGAAGUGGCA-CGGUAUU--UGAGU-AU
GUAUAUAUG---CAAAUUCCAAACCACUUU-CCUUGCGGGAAAGUGGUUU
TPP 2:
>CP000764.1
-ACUAUCAAAACUAUAUAGUUCUCAUCUAUCCACUAGGGGUGCCGAU-AUU--GGCUGAGAUUAAAGUUUA-UCUUUGAGACCCUUAGUACCUGAUCUGGUUCGUACCAGCGUAGGGAAGUGGAAAUGACAA---AAUAU
GAUAUUUAUAUA---UCUAGGCCACUUUCUUUAC-CUACUAAGGAAGUGGCUU
TPP 3:
>AAEK01000033.1
GAAUA-CAAUACGAAA-AUUAAAUAUUUAUCCACUAGGGGGGCCUAUUAUA--GGCUGAGAUCAAA-UGGG
--AAUUUGAGACUCUUAGUACCUGAUCUGGUUAAUGCCAGCGUAGGGAAGUGGAAAAGACAUUGCUAUUUCA
UGUAUAAAUACUGUCAAUUUCACUUUCUUUACGCCUGUAAAGAAA------
TPP 4:
>ABCF01000004.1
-AAUGCAUAUAAAUAAAUAGCCGAGCAAAACCACUGGGGGAGCCUUUUAAA--GGCUGAGAUUAAAGUGUAC
UACUUUAAGACCCUUUGAACCUGAUCUAGUUCAUACUAGCGGAGGGAAGUGUAGUCUGAAUGAUUGAAU-A
UUUCAAAAAAUCACUCAAGCCGCCUUCCCGU-GCAAUCAGGGAGGCG----
Native functional structures from the TPP riboswitch of the thiamin gene of Bacillus
subtilis:
GCAGAACAAUUCAAUA-UGUAUUCGUUUAACCACUAGGGGUGUCCUUCAUAAGGGCUGAGAUAAAAGUGUG
--ACUUUUAGACCCUCAUAACUUGAACAGGUUCAGACCUGCGUAGGGAAGUGGAG-CGGUAUU--UGUGUUA
UUUUACUAUG---CCAAUUCCAAACCACUUUUCCUUGCGGGAAAGUGGUUU
...(((((((.(....-.).))).))))..(((((.(((((((((((...)))))).....((((((....
--.)))))).))))).....((((..(((((....)))))..))))..)))))..-.(((((.--.(((..
....))).)))---))......(((((((((((((....)))))))))))))
................-...........(((((((.(((((((((((...)))))).....((((((....
--.)))))).))))).......((((....))))..((((((..(..((((((..-.((.(((--.((((.
........)))---).))).))...))))))..)..))))))..))))))).
FMN 0:
>X51510.1
UAUCCUUCGGGGCAGGGUGGAAAUCCCGACCGGCGGUAGUAAAGCACAUUUGC--UUUAGAGCCCGUGACCC
GUGUGC----AUAAGCACGCGGUGGAUUCAGUUUAA-GCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGAA
GGAUG---AUGAGCCGCUAUGCAAAAUGUU-UAAAAAUGCAUAGUGUUAUUUCCUAUUGCGUAAAAUACCU
AAAGCCCCGAAUUUUUUAUAAAUUCGGGGCUUU
FMN 1:
>X95955.1
UAUCCUUCGGGGCUGGGUGAAAAUCCCGACCGGCGGUAAUAAGGCGCUCCUGCGCUUUACAGCCCGUGACC
CGUAUGC----AUCUGUAUACGGUGGAUUCAGUGAAAAGCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGA
AGGAUG---A-GAGAAGCUAUGCAAAAAAUAAUCAUACUGUAUAGUCUUAUUUCCUAUGGAUUAAAACUGG
UAAAGCCCCGAAUGUGUAA-ACAUUCGGGGCUUU
FMN 2:
>BA000004.3
UAUCCUUCGGGGCUGGGUGGAAAUCCCGACCGGCGGUGAUGAAGCGAA--UGC--UUCUUAGUCCGUGACCC
GGUUGCUGAUAUCAGUAAGCGGUGGACCUGGUGAAAAUCCGGGACCGACAGUGAAAGUCUGGAUGGGAGAA
GGAAACGUACGGUUCAAUUUGGAAAAAUGUGCAUGAUUGCACAUCUUCUUUCUCGUGGGCAAAAAACCUAC
GUAUACACAAGGGAGAAGUCUGUCCAAAU----
FMN 3:
>CP000903.1
CAUCCUUCGGGGUCGGGUGAAAUUCCCAACCGGCGGUGAUGAAGCGAU--AGC--UUCUAAGUCCGUGACCC
GUUUUC---AACGCGAAAACGGUGGAUCUAGUGAAACUCUAGGGCCGACAGU-AUAGUCUGGAUGGGAGAA
GGAUA---------UGUUUUCUAGUAAUUUUAUAUAGCGAAUACACUUUUAUUUCAGUAUGCAUA-UUUUU
AAAGUUUCAUUUUGAAUCUUUAUAUAUGUUUUA
FMN 4:
>L47648.1
CAAUCUUCGGGGCAGGGUGAAAUUCCCUACCGGCGGUGAUGAGCCAAU--GGC---UCUAAGCCCGCGAGC
UGUCUUU---------ACAGCA--GGAUUCGGUGAGAUUCCGGAGCCGACAGU-ACAGUCUGGAUGGGAGA
AG-AUGGAGGUUCAUAAGCGUUUUGAAAUUGAAUUUUUCAAACGUUUCUUUGCCU-----AGCCUAAUUUU
CGAAACCCCGCUUUUAUAUAUGAAGCGGUUUUUU
Native functional structures from the FMN riboswitch of the ribD gene of Bacillus
subtilis:
UAUCCUUCGGGGCAGGGUGGAAAUCCCGACCGGCGGUAGUAAAGCACAUUUGC--UUUAGAGCCCGUGACCC
GUGUGC----AUAAGCACGCGGUGGAUUCAGUUUAA-GCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGAA
GGAUG---AUGAGCCGCUAUGCAAAAUGUU-UAAAAAUGCAUAGUGUUAUUUCCUAUUGCGUAAAAUACC
UAAAGCCCCGAAUUUUUUAUAAAUUCGGGGCUUU
((((((((......(((.......))).....((((...((((((......))--))))....))))...(
(((((((----....))))))))....((((((....-)))))).....(((.......))).......))
))))))---......................-.......................................
.((((((((((((((.....))))))))))))))
.....((((((((.(((.......))).....((((...((((((......))--))))....))))...(
(((((((----....))))))))....((((((....-)))))).....(((.......))).........
......---......................-.......................................
....)))))))).(((....)))...........
LIST OF REFERENCES
[1]
S. B. Needleman, and C. D. Wunsch, “A general method applicable to the search
for similarities in the amino acid sequence of two proteins,” Journal of Molecular
Biology, vol. 48, no. 3, pp. 443-453, 1970.
[2]
M. Zuker, and P. Stiegler, “Optimal computer folding of large RNA sequences
using thermodynamics and auxiliary information,” Nucleic acids research, vol. 9,
no. 1, pp. 133-148, 1981.
[3]
M. Zuker, and D. Sankoff, “RNA secondary structures and their prediction,”
Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 591-621, 1984.
[4]
P. Stein, and M. Waterman, “On some new sequences generalizing the Catalan
and Motzkin numbers,” Discrete Mathematics, vol. 26, no. 3, pp. 261-272, 1979.
[5]
R. Nussinov, and A. B. Jacobson, “Fast algorithm for predicting the secondary
structure of single-stranded RNA,” Proceedings of the National Academy of
Sciences, vol. 77, no. 11, pp. 6309-6313, 1980.
[6]
D. Sankoff, “Simultaneous solution of the RNA folding, alignment and
protosequence problems,” SIAM Journal on Applied Mathematics, vol. 45, no. 5,
pp. 810-825, 1985.
[7]
Y. Li, and S. Zhang, “Finding stable local optimal RNA secondary structures,”
Bioinformatics, vol. 27, no. 21, pp. 1-1-1-8, 2011.
[8]
S. Wuchty, W. Fontana, I. L. Hofacker, and P. Schuster, “Complete suboptimal
folding of RNA and the stability of secondary structures,” Biopolymers, vol. 49,
no. 2, pp. 145-165, 1999.
[9]
C. Flamm, I. L. Hofacker, P. F. Stadler, and M. T. Wolfinger, “Barrier trees of
degenerate landscapes,” Zeitschrift für Physikalische Chemie International
journal of research in physical chemistry and chemical physics, vol. 216, no.
2/2002, pp. 155, 2002.
[10]
Y. Li, and S. Zhang, “Finding consensus stable local optimal structures for
aligned RNA sequences,” 2012 IEEE 2nd International Conference on
Computational Advances in Bio & Medical Sciences (ICCABS), pp. 1, 2012.
[11]
M. Zaslavskiy, F. Bach, and J.-P. Vert, “Global alignment of protein–protein
interaction networks by graph matching methods,” Bioinformatics, vol. 25, no. 12,
pp. i259-i267, 2009.
[12]
S. Umeyama, “An eigendecomposition approach to weighted graph matching
problems,” Pattern Analysis and Machine Intelligence, IEEE Transactions on,
vol. 10, no. 5, pp. 695-703, 1988.
[13]
R. Singh, J. Xu, and B. Berger, “Global alignment of multiple protein interaction
networks with application to functional orthology detection,” Proceedings of the
National Academy of Sciences, vol. 105, no. 35, pp. 12763-12768, 2008.
[14]
M. Zaslavskiy, F. Bach, and J.-P. Vert, "A path following algorithm for graph
matching," Image and Signal Processing, pp. 329-337: Springer, 2008.
[15]
M. Mandal, and R. R. Breaker, “Gene regulation by riboswitches,” Nature
Reviews Molecular Cell Biology, vol. 5, no. 6, pp. 451-463, 2004.
[16]
V. Bafna, H. Tang, and S. Zhang, “Consensus folding of unaligned RNA
sequences revisited,” Journal of computational biology, vol. 13, no. 2, pp. 283-295, 2006.
[17]
D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner, “Expanded sequence
dependence of thermodynamic parameters improves prediction of RNA secondary
structure,” Journal of molecular biology, vol. 288, no. 5, pp. 911-940, 1999.
[18]
I. L. Hofacker, M. Fekete, and P. F. Stadler, “Secondary structure prediction for
aligned RNA sequences,” Journal of molecular biology, vol. 319, no. 5, pp. 1059-1066, 2002.
[19]
C. Smith, S. Heyne, A. S. Richter, S. Will, and R. Backofen, “Freiburg RNA
Tools: a web server integrating INTARNA, EXPARNA and LOCARNA,”
Nucleic acids research, vol. 38, no. suppl 2, pp. W373-W377, 2010.
[20]
S. Will, T. Joshi, I. L. Hofacker, P. F. Stadler, and R. Backofen, “LocARNA-P:
Accurate boundary prediction and improved detection of structural RNAs,” RNA,
vol. 18, no. 5, pp. 900-914, 2012.
[21]
S. Will, K. Reiche, I. L. Hofacker, P. F. Stadler, and R. Backofen, “Inferring
noncoding RNA families and classes by means of genome-scale structure-based
clustering,” PLoS computational biology, vol. 3, no. 4, pp. e65, 2007.
[22]
S. Heyne, F. Costa, D. Rose, and R. Backofen, “GraphClust: alignment-free
structural clustering of local RNA secondary structures,” Bioinformatics (Oxford),
vol. 28, no. 12, pp. I224-I232, 2012.
[23]
A. Hüttenhofer, P. Schattner, and N. Polacek, “Non-coding RNAs: hope or
hype?,” TRENDS in Genetics, vol. 21, no. 5, pp. 289-297, 2005.
[24]
F. Costa, and K. De Grave, "Fast neighborhood subgraph pairwise distance
kernel." pp. 255-262.
[25]
E. Mattei, G. Ausiello, F. Ferre, and M. Helmer-Citterich, “A novel approach to
represent and compare RNA secondary structures,” Nucleic Acids Research, vol.
42, no. 10, pp. 6146-6157, 2014.
[26]
M. O. Dayhoff, and R. M. Schwartz, "A model of evolutionary change in
proteins."
[27]
S. Henikoff, and J. G. Henikoff, “Amino acid substitution matrices from protein
blocks,” Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp.
10915-10919, 1992.
[28]
F. Meyer, S. Kurtz, R. Backofen, S. Will, and M. Beckstette, “Structator: fast
index-based search for RNA sequence-structure patterns,” BMC bioinformatics,
vol. 12, no. 1, pp. 214, 2011.
[29]
R. Lorenz, S. H. Bernhart, C. H. Zu Siederdissen, H. Tafer, C. Flamm, P. F.
Stadler, and I. L. Hofacker, “ViennaRNA Package 2.0,” Algorithms for Molecular
Biology, vol. 6, no. 1, pp. 26, 2011.
[30]
P. P. Gardner, A. Wilm, and S. Washietl, “A benchmark of multiple sequence
alignment programs upon structural RNAs,” Nucleic acids research, vol. 33, no.
8, pp. 2433-2439, 2005.
[31]
A. D. Garst, A. L. Edwards, and R. T. Batey, “Riboswitches: structures and
mechanisms,” Cold Spring Harbor perspectives in biology, vol. 3, no. 6, pp.
a003533, 2011.
[32]
A. R. Gruber, S. H. Bernhart, I. L. Hofacker, and S. Washietl, “Strategies for
measuring evolutionary conservation of RNA secondary structures,” BMC
bioinformatics, vol. 9, no. 1, pp. 122, 2008.
[33]
I. Hofacker, and P. F. Stadler, "RNAz 2.0: improved noncoding RNA
detection." pp. 69-79.
[34]
I. L. Hofacker, P. Schuster, and P. F. Stadler, “Combinatorics of RNA secondary
structures,” Discrete Applied Mathematics, vol. 88, no. 1, pp. 207-237, 1998.
[35]
M. Kucharik, I. L. Hofacker, P. F. Stadler, and J. Qin, “Basin Hopping Graph: a
computational framework to characterize RNA folding landscapes,”
Bioinformatics (Oxford), vol. 30, no. 14, pp. 2009-2017, 2014.
[36]
Y. Li, Computational methods for analyzing RNA folding landscapes and its
applications. Orlando, Fla.: University of Central Florida, 2012.
[37]
J. S. McCaskill, “The equilibrium partition function and base pair binding
probabilities for RNA secondary structure,” Biopolymers, vol. 29, no. 6‐7, pp.
1105-1119, 1990.
[38]
J. Reeder, and R. Giegerich, “Consensus shapes: an alternative to the Sankoff
algorithm for RNA consensus structure prediction,” Bioinformatics, vol. 21, no.
17, pp. 3516-3523, 2005.
[39]
P. Schuster, and P. F. Stadler, “Landscapes: Complex optimization problems and
biopolymer structures,” Computers & chemistry, vol. 18, no. 3, pp. 295-324,
1994.
[40]
P. Steffen, B. Voß, M. Rehmsmeier, J. Reeder, and R. Giegerich, “RNAshapes: an
integrated RNA analysis package based on abstract shapes,” Bioinformatics, vol.
22, no. 4, pp. 500-503, 2006.
[41]
Y. Wan, K. Qu, Q. C. Zhang, R. A. Flynn, O. Manor, Z. Ouyang, J. Zhang, R. C.
Spitale, M. P. Snyder, and E. Segal, “Landscape and variation of RNA secondary
structure across the human transcriptome,” Nature, vol. 505, no. 7485, pp. 706-709, 2014.
[42]
S. Washietl, I. L. Hofacker, and P. F. Stadler, “Fast and reliable prediction of
noncoding RNAs,” Proceedings of the National Academy of Sciences of the
United States of America, vol. 102, no. 7, pp. 2454-2459, 2005.
[43]
L. Yuan, Z. Cuncong, and Z. Shaojie, “Finding consensus stable local optimal
structures for aligned RNA sequences and its application to discovering
riboswitch elements,” International Journal of Bioinformatics Research and
Applications, no. 4/5, 2014.