University of Central Florida
Electronic Theses and Dissertations
Masters Thesis (Open Access)

STARS Citation: Burbridge, Joshua, "Finding Consensus Energy Folding Landscapes Between RNA Sequences" (2015). Electronic Theses and Dissertations. Paper 5032.

FINDING CONSENSUS ENERGY FOLDING LANDSCAPES BETWEEN RNA SEQUENCES

by

JOSHUA BURBRIDGE
B.S. University of Central Florida, 2013

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Department of Electrical Engineering and Computer Science in the College of Engineering and Computer Science at the University of Central Florida, Orlando, Florida

Summer Term 2015

Major Professor: Shaojie Zhang

© 2015 Joshua Burbridge

ABSTRACT

In molecular biology, the secondary structure of a ribonucleic acid (RNA) molecule is closely related to its biological function. One problem in structural bioinformatics is to determine the two- and three-dimensional structure of RNA using only sequencing information, which can be obtained at low cost. This entails designing sophisticated algorithms to simulate the process of RNA folding using detailed sets of thermodynamic parameters. The set of all chemically feasible structures an RNA molecule can assume, as well as the energy associated with each structure, is called its energy folding landscape.
This research focuses on defining and solving the problem of finding the consensus landscape between multiple RNA molecules. Specifically, we discuss how this problem is equivalent to the problem of Balanced Global Network Alignment, and what effect a solution to this problem would have on our understanding of RNA. Because this problem is known to be NP-hard, we instead define an approximate consensus on a landscape of reduced size, which dramatically reduces the search space associated with the problem. We use the program RNASLOpt to enumerate all stable local optimal secondary structures in multiple landscapes within a certain energy and stability range of the minimum free energy (MFE) structure. We then encode these using an extended structural alphabet and perform sequence alignment using a structural substitution matrix to find and rank the best matches between the sets based on stability, energy, and structural distance. We apply this method to twenty landscapes from four sets of riboswitches from Bacillus subtilis in order to predict their native “on” and “off” structures. We find that this method significantly reduces the size of the list of candidate structures and raises the ranking of previously obscure secondary structures, resulting in more accurate predictions overall. Advances in the field of structural bioinformatics can help elucidate the underlying mechanisms of many genetic diseases.

ACKNOWLEDGMENTS

The author would like to thank his adviser and committee chair, Dr. Shaojie Zhang, for his invaluable intellectual and moral support during the research process.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
CHAPTER ONE: INTRODUCTION
  1.1 General Background
    1.1.1 Dynamic Programming
    1.1.2 RNA
    1.1.3 RNA Secondary Structure
    1.1.4 RNA Folding
    1.1.5 Folding Landscapes
    1.1.6 Folding Pathways
  1.2 Consensus Folding Landscapes
  1.3 Riboswitches
  1.4 Predicting Native Structures of Riboswitches
CHAPTER TWO: PREVIOUS WORKS
  2.1 RNASLOpt
  2.2 RNAConSLOpt
  2.3 GraphClust
CHAPTER THREE: METHODOLOGY
  3.1 RNASLOpt
  3.2 Brand nEw Alphabet for RNA (BEAR)
  3.3 Substitution Matrices and MBR
  3.4 Datasets
  3.5 Software Pipeline
CHAPTER FOUR: RESULTS AND DISCUSSION
  4.1 Ranks of Target Matches in Complete Pairwise Alignment
  4.2 Comparison of Ranks to RNASLOpt Prediction
  4.3 Analysis of Consensus Landscapes
  4.4 Benchmarking
CHAPTER FIVE: CONCLUSIONS AND FUTURE WORK
  5.1 Conclusions, Advantages, Disadvantages, and Limitations
  5.2 Future Work and Alternative Approaches
APPENDIX: RNA SEQUENCES AND STRUCTURES USED
LIST OF REFERENCES

LIST OF FIGURES

Figure 1 - Example of the Secondary Structure of an RNA Molecule
Figure 2 - Examples of the Six Different Sub-Structural Elements
Figure 3 - MUSHI Pipeline
Figure 4 - Performance of RNASLOpt compared to performance of MUSHI
Figure 5 - Performance of RNASLOpt compared to performance of MUSHI (without adenine)
Figure 6 - Changes in target 'On' structures in adenine (graphed)
Figure 7 - Changes in target 'Off' structures in adenine (graphed)
Figure 8 - Changes in target 'On' structures in lysine (graphed)
Figure 9 - Changes in target 'Off' structures in lysine (graphed)
Figure 10 - Lysine native 'Off' structure
Figure 11 - Top structure returned by RNASLOpt for sequence lysine 3
Figure 12 - Top structure returned by RNASLOpt+MUSHI for lysine 3 after comparison with lysine 1
Figure 13 - Changes in target 'On' structures in TPP (graphed)
Figure 14 - Changes in target 'Off' structures in TPP (graphed)
Figure 15 - Changes in target 'On' structures in FMN (graphed)
Figure 16 - Changes in target 'Off' structures in FMN (graphed)
Figure 17 - Average effect of MUSHI on landscape size
Figure 18 - Structural similarity of RNASLOpt and RNAConSLOpt 'On' target structures to native 'On' structure
Figure 19 - Structural similarity of RNASLOpt and RNAConSLOpt 'Off' target structures to native 'Off' structure

LIST OF TABLES

Table 1 - RNASLOpt Parameters and Landscape Sizes
Table 2 - Results of aligning each landscape with native 'On' and 'Off' structures
Table 3 - Complete pairwise structural alignment for adenine 'On'
Table 4 - Complete pairwise structural alignment for adenine 'Off'
Table 5 - Complete pairwise structural alignment for lysine 'On'
Table 6 - Complete pairwise structural alignment for lysine 'Off'
Table 7 - Complete pairwise structural alignment for TPP 'On'
Table 8 - Complete pairwise structural alignment for TPP 'Off'
Table 9 - Complete pairwise structural alignment for FMN 'On'
Table 10 - Complete pairwise structural alignment for FMN 'Off'
Table 11 - Complete pairwise alignment rankings of 'On' target matches for different values of α, β, γ, gap, and bonus respectively
Table 12 - Complete pairwise alignment rankings of 'Off' target matches for different values of α, β, γ, gap, and bonus respectively
Table 13 - Effect of MUSHI on adenine landscape size
Table 14 - Changes in target 'On' structures in adenine (tabulated)
Table 15 - Changes in target 'Off' structures in adenine (tabulated)
Table 16 - Effect of MUSHI on lysine landscape size
Table 17 - Changes in target 'On' structures in lysine (tabulated)
Table 18 - Changes in target 'Off' structures in lysine (tabulated)
Table 19 - Effect of MUSHI on TPP landscape size
Table 20 - Changes in target 'On' structures in TPP (tabulated)
Table 21 - Changes in target 'Off' structures in TPP (tabulated)
Table 22 - Effect of MUSHI on FMN landscape size
Table 23 - Changes in target 'On' structures in FMN (tabulated)
Table 24 - Changes in target 'Off' structures in FMN (tabulated)
Table 25 - Performance of RNAConSLOpt on riboswitches from B. subtilis

LIST OF EQUATIONS

Equation 1 - Needleman-Wunsch recursive function
Equation 2 - Recursive function approximating landscape size
Equation 3 - Asymptotic formula approximating landscape size
Equation 4 - Recursive function for Nussinov's algorithm
Equation 5 - Recursive function for Zuker-Sankoff's algorithm
Equation 6 - Conserved topological interactions in global network alignment
Equation 7 - Total similarity score in balanced global network alignment
Equation 8 - Weighted sum in balanced global network alignment
Equation 9 - Covariance and conservation score from RNAalifold
Equation 10 - Bonus given to columns with compensatory mutations in RNAalifold
Equation 11 - Penalty given to columns not following the consensus structure in RNAalifold
Equation 12 - Log-odds score in substitution matrices
Equation 13 - Mattei-variant of the Needleman-Wunsch Algorithm
Equation 14 - Weighted sum ranking pairs of structures in MUSHI
Equation 15 - Stability ranking in MUSHI
Equation 16 - Energy ranking in MUSHI
Equation 17 - Structural similarity ranking in MUSHI

CHAPTER ONE: INTRODUCTION

1.1 General Background

Bioinformatics is one of the
fastest-growing academic fields. Since the year 2000, rapid advances in processing power coupled with dramatic decreases in the cost of DNA sequencing technology have opened the floodgates for research institutions to carry out various studies of the human genome. Very often, a new research study will provide insight into one particular question while inevitably spawning multiple additional questions. Thus, we expect the exponential explosion of novel problems in bioinformatics to continue for some time.

In this thesis, we formulate and suggest solutions to such a novel problem, one which has thus far remained largely untouched by other computer scientists: finding the consensus energy folding landscape between multiple RNA sequences. Before presenting the problem statement, it will be necessary to briefly review the computational strategies used, as well as provide the appropriate background information necessary for understanding the problem. Due to the amount of background knowledge necessary to fully understand the problem, and in order to make this document accessible to readers from both Computer Science and Biology backgrounds, Chapter 1 is somewhat extensive. The reader is encouraged to skip over sections containing information with which they are already well-acquainted.

1.1.1 Dynamic Programming

Dynamic programming is a classic computational strategy central to many different problems in bioinformatics, including the problem of finding a consensus folding landscape. First formally proposed by Richard Bellman in 1953, “dynamic programming” is an ambiguous name for a very clever strategy. The characteristic approach for a dynamic programming solution is to identify the different subproblems within a larger problem, solve these smaller subproblems, and then combine the smaller solutions into a solution to the overall problem. A classic example of the power of dynamic programming is its use in generating Fibonacci numbers.
The recursive definition of the Fibonacci sequence is F(n) = F(n-1) + F(n-2). This definition splits each instance of the problem into two smaller subproblems, and so a naïve recursive function to compute Fibonacci numbers would result in O(2^N) time complexity. The power of dynamic programming lies in recognizing that there are only N distinct subproblems (F(1), F(2), …, F(N)), and so we can reduce the exponential solution to a linear solution simply by memoizing the result of each distinct subproblem.

One of the earliest and most well-known uses of dynamic programming in bioinformatics is the Needleman-Wunsch algorithm for global sequence alignment [1]. Biologists are often interested in computing the “sequence identity” of two polymers, usually DNA, RNA, or protein. Informally, the sequence identity specifies the similarity between two given strings. An alignment algorithm seeks the best possible mapping from characters in the first sequence to characters in the second, while also allowing insertion of gaps in either sequence. The algorithm relies on the observation that an optimal global alignment is built from optimal alignments of prefixes of the two sequences. We can proceed by first examining the last characters of both sequences. There are then three possibilities: they can be aligned, a gap can be inserted in the first sequence, or a gap can be inserted in the second sequence. The recursive function for the Needleman-Wunsch algorithm can be written as

$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ D(i-1, j) + g \\ D(i, j-1) + g \end{cases}$$

$$s(x_i, y_j) = \begin{cases} x_{\text{match}} & \text{if } x_i = y_j \\ x_{\text{mismatch}} & \text{if } x_i \neq y_j \end{cases} \tag{1}$$

where $g$ is the penalty for inserting a gap, $x_{\text{match}}$ is the bonus for aligning two identical characters, and $x_{\text{mismatch}}$ is the penalty for aligning two different characters. The value of these parameters can be adjusted to suit the needs of the specific application. However, implementing the recursive function alone will not solve the problem.
After transforming D into an iterative procedure, we will end up with an (M+1) by (N+1) matrix, in which D(m, n) represents the best possible score achievable by an alignment of the first m characters of the first sequence with the first n characters of the second. To recover the alignment that generated this score, we perform a traceback procedure. While creating matrix D, we simultaneously create a traceback matrix T, where each cell T(i, j) corresponds to cell D(i, j). Each cell in T contains three Booleans, which we call left, up, and diagonal. When the value for D(i, j) is calculated, it is defined as the maximum of three candidate scores, derived from D(i, j-1) (left), D(i-1, j) (up), and D(i-1, j-1) (diagonal). For every candidate that equals the final value in D(i, j), the corresponding Boolean is switched to “true”, indicating where the value D(i, j) was derived from, and ultimately marking the trail through the matrix that will show us the optimal solution.

When matrix D is filled, we begin the traceback procedure in the lower right cell T(M, N), and check each Boolean in the current cell T(i, j). If we find that T(i, j).left is true, this means the j-th character in the second sequence was matched with a gap in the first sequence, so we record this as a column of the alignment. Because the traceback runs backwards, the first column recorded is the last column of the alignment. If T(i, j).diagonal is true, this means we matched the corresponding characters in both sequences with each other, so we add the current character of both sequences to the alignment. Finally, if T(i, j).up is true, we match the i-th character in the first sequence with an inserted gap in the second sequence. Proceeding in this fashion from the lower right corner of T to the upper left corner of T will generate an optimal alignment between the two sequences. It is important to note, however, that there may be more than one optimal alignment. In order to recover all of them, we need to write a procedure that can enumerate all distinct paths from T(M, N) to T(0, 0).
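The fill and traceback steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: the function name `needleman_wunsch`, the default scores (match +1, mismatch -1, gap -2), and the tie-breaking order (diagonal, then up, then left) are assumptions chosen for the example, and only one optimal alignment is returned rather than all of them.

```python
from typing import List, Set, Tuple


def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Globally align x and y; return (score, aligned_x, aligned_y).

    Scores and tie-breaking are illustrative assumptions.
    """
    m, n = len(x), len(y)
    # D[i][j]: best score for aligning x[:i] with y[:j].
    D: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    # T[i][j]: set of moves ("diag", "up", "left") that achieve D[i][j].
    T: List[List[Set[str]]] = [[set() for _ in range(n + 1)] for _ in range(m + 1)]

    # Border cells: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        D[i][0] = i * gap
        T[i][0].add("up")
    for j in range(1, n + 1):
        D[0][j] = j * gap
        T[0][j].add("left")

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = {
                "diag": D[i - 1][j - 1] + s,  # align x_i with y_j
                "up": D[i - 1][j] + gap,      # x_i against a gap in y
                "left": D[i][j - 1] + gap,    # y_j against a gap in x
            }
            best = max(candidates.values())
            D[i][j] = best
            T[i][j] = {move for move, value in candidates.items() if value == best}

    # Traceback from the lower right cell (m, n) to (0, 0).
    # Columns are collected in reverse order and flipped at the end.
    ax: List[str] = []
    ay: List[str] = []
    i, j = m, n
    while i > 0 or j > 0:
        moves = T[i][j]
        if "diag" in moves:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif "up" in moves:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:  # "left"
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return D[m][n], "".join(reversed(ax)), "".join(reversed(ay))
```

Storing a set of moves per cell, rather than a single pointer, keeps the information needed to enumerate every optimal path; doing so would replace the `while` loop with a recursive or stack-based walk over all moves in each cell.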
We sometimes replace the constant penalty for adding a gap into the alignment with an affine gap function, in which opening a new gap costs more than extending an existing one. This affine gap penalty recognizes the fact that, from a biological standpoint, it may be more difficult for a sequence to mutate and “open” a gap (that is, to increase the length of the sequence) than to keep the characters aligned but mismatched, or to extend a gap that is already open. Clearly, the Needleman-Wunsch algorithm can be modified in many useful ways. This is important to note, as we will present a modified version of this algorithm in Chapter 3.

1.1.2 RNA

The primary focus of this thesis is to apply strategies like dynamic programming to a particular subfield within bioinformatics called structural bioinformatics. In structural bioinformatics, we are interested in characterizing the three-dimensional shape of various biological molecules. The biological molecule of interest in this study is called ribonucleic acid, also known as RNA. RNA plays a pivotal role in the process known to molecular biologists as the “central dogma,” which we will briefly explain.

It is now nearly universal knowledge that the physical and behavioral traits organisms pass down to their offspring are packaged in the form of genes. These genes are nothing more than complex sequences of the four nucleotides adenine, guanine, cytosine, and thymine, chemically bonded to create a long, double-stranded polymer known as deoxyribonucleic acid, or DNA. Sequences of exactly three nucleotides in DNA, called codons, can each be translated into one amino acid, of which eukaryotes use twenty-one. Long sequences of amino acids called polypeptides fold into complex shapes and become proteins, which perform innumerable biological functions within the cell, many of which are still unknown. However, DNA is not directly translated into protein. It relies on RNA as an intermediary.
In order for a gene to be expressed in the cell, the two strands of its corresponding sequence in DNA must be pulled apart by an enzyme called RNA Polymerase. This enzyme attaches to one strand of the DNA and moves down the sequence, pausing at each nucleotide in the DNA to add the corresponding nucleotide to a growing chain matching the DNA sequence to be copied. Once finished, RNA Polymerase disconnects itself from the DNA, and releases the short molecule containing the copy of the sequence. The process of copying the sequence is called transcription, and the molecule which is a direct copy of the sequence is called RNA.

RNA differs from DNA in three major ways. First, it uses the nucleotide uracil in place of thymine. Therefore, if a DNA sequence were “GCGCATA”, the corresponding RNA sequence would be “GCGCAUA”. Second, RNA has a hydroxyl group attached to the 2’ position of its pentose ring, whereas DNA does not (hence the prefix “deoxy-”). This makes RNA less stable than DNA. Third and most importantly for this thesis, RNA is a single-stranded molecule, whereas DNA is double-stranded. This structural difference makes RNA more flexible and able to fold into complex shapes, which is often the key determinant of its function.

After transcription, the RNA molecule must be translated into an amino acid sequence to complete the gene expression process. However, before this can happen, some important pre-processing steps must occur. The most important of these steps is called RNA splicing. Contrary to what was once popular belief, not all DNA directly codes for an amino acid sequence. In fact, in the human genome, only about 1% of our DNA will eventually be translated into protein. Approximately 25% of the remaining DNA has been associated with regulatory elements and other non-coding portions of genes, but the function of much of our genome is still unknown.
What is clear, however, is that after a gene is transcribed, molecules called small nuclear ribonucleoproteins (snRNPs) cut the RNA molecule into fragments that can be classified as either exons or introns. Exons are stretches of RNA that will be translated into amino acid sequences, whereas introns may perform a variety of other functions. Once the process of RNA splicing is complete, the exons are, in essence, “stitched” back together, so that only the coding portion remains in the RNA molecule. After undergoing some additional processing, the string of exons, now known as messenger RNA (mRNA), exits the nucleus and is picked up by a ribosome. The purpose of the ribosome is to scan through the RNA molecule and translate each codon into a single amino acid, thus building a chain of amino acids in the same way that RNA Polymerase builds a chain of nucleotides. This process is called translation.

The central dogma refers to the entire process in which DNA is transcribed into RNA, which is translated into proteins. While this is surely an oversimplification of the process (indeed, some viruses, known as retroviruses, use an enzyme called reverse transcriptase to attack the cell by reversing this process and inserting foreign sequences into the DNA), this basic explanation is sufficient for understanding the role that RNA plays in the cell, and why we are interested in predicting its shape.

In general, RNA can be broadly classified into two categories, coding RNA and non-coding RNA (ncRNA). ncRNA is defined as any RNA sequence that is not directly translated into a polypeptide. There are many different classes of RNA molecules. Messenger RNA (mRNA) is the concatenation of exons in a gene that undergoes translation. This is coding RNA. ncRNA contains many smaller groups. We have already discussed small nuclear ribonucleoproteins (snRNPs), which cleave RNA into exons and introns. As the name suggests, snRNPs are complexes that consist of RNA and proteins.
The RNA portion of this complex is referred to as small nuclear RNA (snRNA). These molecules are typically about 150 nucleotides (nt) in length. Similarly, ribosomes are composed of RNA and protein molecules. The RNA in ribosomes is called ribosomal RNA (rRNA). In order for the ribosome to match a codon with a particular amino acid, it bonds with transfer RNA (tRNA). tRNAs are a class of RNA molecules with a highly conserved cloverleaf shape. There exists a specific tRNA for each possible pairing of one codon with one amino acid. The tRNA’s structure has a region that allows it to bond to a codon in an mRNA sequence, as well as a region that bonds to a specific amino acid. Thus, each tRNA acts as a sort of grammatical rule by adding its particular amino acid to the growing polypeptide in the ribosome when it recognizes the correct codon in the mRNA.

Some RNA molecules can silence the expression of genes by destroying the corresponding mRNA molecule in a process called RNA interference (RNAi). The two most important types of these molecules are called micro RNA (miRNA) and small interfering RNA (siRNA), both of which share roughly the same function but usually have dramatically different structures, thus affecting the specific mechanism by which they act. Small nucleolar RNAs (snoRNAs) function primarily in processing other RNA molecules, such as rRNA.

More classes of RNA exist than have been described here, and still more have yet to be discovered. Ultimately, there are two important facts that are central to this thesis. First, it is clear that RNA plays many complex roles in the cell beyond the simple storage of protein coding information. Second, and most importantly, the various functions of RNA molecules are differentiated primarily by the structure of the molecule itself. Thus, the more we understand about how RNA molecules acquire their three-dimensional shape, the more we can infer about their roles in gene regulation and expression.
1.1.3 RNA Secondary Structure

The structures of RNA and other biological molecules are typically classified in terms of primary, secondary, tertiary, and sometimes quaternary structure. An RNA molecule’s primary structure refers to the specific sequence of the nucleotides it contains, and is simply written as a string: GCGCAUA. The chemical structure of RNA allows it to be very flexible such that, if it is energetically favorable, the molecule will fold back onto itself, allowing some nucleotides to form hydrogen bonds with other nucleotides farther downstream. We refer to these bonds as base pairs. In general, the following base pairs are energetically favorable, in order from most stable to least stable: G-C, A-U, G-U. The varying stability between these three possible base pairs is determined by the number of hydrogen bonds the nucleotides can form. When guanine bonds with cytosine, both components are held together by three hydrogen bonds, therefore conferring greater stability to the molecule as a whole than an A-U base pair, which is joined by two hydrogen bonds. Occasionally, it is possible for three nucleotides to form a more complex configuration in which all three are bonded to each other, resulting in a base triplet. However, this is an uncommon occurrence, and it is usually safe to assume that if any two nucleotides form a base pair together, those nucleotides can no longer form base pairs with any other nucleotides in the sequence.

When describing the secondary structure of an RNA molecule, we typically number each nucleotide from 1 to N, where N is the length of the sequence. The notation for referring to a base pair is (i, j), where i and j refer to the i-th and j-th nucleotides in the sequence, respectively. Graphically, we may represent an RNA sequence and its corresponding structure as follows:

    AAAGCGAAAGCGAAACGCAAAGCGAAACGCAAACGCAAA
    ...(((...(((...)))...(((...)))...)))...
This string representation corresponds to the following two-dimensional structure:

Figure 1 - Example of the Secondary Structure of an RNA Molecule

The string representation is called dot-bracket notation. A dot in the i-th location in the structure indicates that, in this structure, the i-th nucleotide in the sequence is not part of any base pair, while an open or close parenthesis indicates that the associated nucleotide is paired with the nucleotide associated with the corresponding complementing parenthesis. Thus, a secondary structure can be fully described by a sequence length N and a list of K base pairs {(i_1, j_1), …, (i_K, j_K)}.

In addition to ignoring base triplets, we will also ignore pseudoknots. A pseudoknot is defined as a set of base pairs (i, j) and (i’, j’) such that i < i’ < j < j’. While such formations do appear in some RNA molecules, pseudoknots are considered uncommon if not rare [2], and excluding them from our model simplifies the problem of predicting secondary structure dramatically. Thus, the set of all valid RNA structures can be described as a language L ⊂ Σ*, where Σ = { ., (, ) }, consisting of exactly those strings in which all parentheses are balanced. This is a slight variation of the well-known Dyck language, which is context-free.

Longer, more complicated secondary structures have recurring motifs, which can be classified into six categories: hairpin loops, stacks, internal loops, bulges, multiloops, and dangling or unpaired bases. The diagram below shows an example of a structure with each kind of motif. Because “motif” is a word with special connotations in sequence analysis, we henceforth refer to these as secondary sub-structural elements (SSEs).

Figure 2 - Examples of the Six Different Sub-Structural Elements

1.1.4 RNA Folding

It is important to note that one sequence may have many different possible structures.
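To make the notion of a single structure concrete before counting them, the dot-bracket notation of Section 1.1.3 can be converted into its list of base pairs with a simple stack. The sketch below is illustrative only; the function name `parse_dot_bracket` is hypothetical, and positions are 1-indexed to match the (i, j) convention used in the text.

```python
from typing import List, Tuple


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return the base pairs (i, j), 1-indexed with i < j, of a structure.

    Raises ValueError if the parentheses are unbalanced or an unknown
    symbol appears, i.e. if the string is not in the language L.
    """
    open_positions: List[int] = []  # stack of unmatched '(' positions
    pairs: List[Tuple[int, int]] = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            # A ')' always closes the most recent unmatched '(',
            # which is why pseudoknots cannot be expressed.
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# The example sequence and structure from Section 1.1.3.
sequence = "AAAGCGAAAGCGAAACGCAAAGCGAAACGCAAACGCAAA"
structure = "...(((...(((...)))...(((...)))...)))..."
pairs = parse_dot_bracket(structure)
```

For the example above, the outermost pair is (4, 36), a G-C pair, and every pair (i, j) satisfies j - i ≥ 4.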
Based on the definition of admissible structures above, the size of the subset of words of length N+1 in the language L satisfies the recurrence

\[
T(N+1) = T(N) + \sum_{k=0}^{N-2} T(k)\,T(N-k-1) \tag{2}
\]

where T(N) is the number of possible structures of a sequence of length N [3]. The first term on the right hand side of the equation accounts for all possible structures where a new nucleotide N+1 remains unpaired, and the summation accounts for all possible structures when nucleotide N+1 forms a base pair with nucleotide k+1. We can also specify the base cases as T(0) = T(1) = T(2) = 1. This recurrence can be approximated by the formula [4]

\[
T(N) \approx \sqrt{\frac{15 + 7\sqrt{5}}{8\pi}}\; N^{-3/2} \left(\frac{3 + \sqrt{5}}{2}\right)^{N} \tag{3}
\]

which, for a sequence length of 150, evaluates to approximately 3 × 10^59, a colossal number of structures for a relatively short sequence. However, because we normally place further restrictions on the class of chemically feasible structures, this formula vastly overestimates the number of structures that must be considered. Examples of these restrictions include setting a minimum for the number of nucleotides that must occur between the two members of any base pair. Clearly, two adjacent nucleotides are already bonded to each other along the backbone, and cannot form a base pair with each other. Additionally, hairpin loops typically require at least three nucleotides between the closing members of the base pair, because electromagnetic forces between nucleotides make it very difficult for a molecule to bend sharply enough to enclose only one or two nucleotides in the loop. Thus, three rules define the class of permissible RNA structures: (i) all parentheses must be balanced, (ii) we do not consider pseudoknots, and (iii) for any base pair (i, j), j – i ≥ 4.

Once we begin to consider the actual sequence information, further restrictions are placed on the set of permissible structures. Most importantly, we only consider the following pairs: G-C, A-U, and G-U.
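As a quick numerical check of recurrence (2) and approximation (3), the following Python sketch tabulates T(N) exactly and evaluates the closed-form estimate. The function names count_structures and asymptotic_count are illustrative choices made here, not part of any published tool, and the sketch ignores rule (iii) and all sequence constraints, exactly as the two formulas do.

```python
import math


def count_structures(n_max):
    """Return [T(0), ..., T(n_max)] computed with recurrence (2)."""
    T = [1, 1, 1]  # base cases T(0) = T(1) = T(2) = 1
    for n in range(2, n_max):
        # T(n + 1) = T(n) + sum_{k = 0}^{n - 2} T(k) * T(n - k - 1)
        paired = sum(T[k] * T[n - k - 1] for k in range(0, n - 1))
        T.append(T[n] + paired)
    return T[: n_max + 1]


def asymptotic_count(n):
    """Evaluate approximation (3) for a sequence of length n > 0."""
    prefactor = math.sqrt((15 + 7 * math.sqrt(5)) / (8 * math.pi))
    growth = (3 + math.sqrt(5)) / 2
    return prefactor * n ** -1.5 * growth ** n


if __name__ == "__main__":
    exact = count_structures(150)
    for n in (10, 50, 150):
        print(n, exact[n], asymptotic_count(n))
```

Comparing the two columns printed for increasing N gives a feel for how quickly the approximation converges to the exact count.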
This base-pairing restriction can alter the size of the set of permissible structures, taking it from only one structure (if, for example, the entire sequence is composed of a single nucleotide), up to a maximum that depends on the sequence itself. In general, the constraints imposed by the sequence data reduce the size of the landscape (a term formally defined in section 1.1.5) dramatically.

Simple thermodynamics requires that complex molecules conform into energetically stable structures. Thus, the first attempts at predicting RNA secondary structure focused on selecting the structure with minimum free energy (MFE) from the set of permissible structures. The earliest algorithm to achieve a reasonable solution to this problem was given by Nussinov et al. in 1980 [5]. Because base pairs contribute to the stability of the overall structure, this dynamic programming solution focuses on finding the structure in which the number of base pairs is maximized. The recursive function for the algorithm can be written

\[
M(i, j) = \max
\begin{cases}
M(i, j-1) \\
M(i, k-1) + M(k+1, j-1) + \delta(k, j) & \text{for } i \le k < j
\end{cases}
\tag{4}
\]

where δ(k, j) = 1 if the respective nucleotides can form a base pair. Note that this scoring function does not differentiate base pairs by stability, but it can easily be modified to do so. Filling this matrix requires O(N^3) time.

However, base pairs are not the only elements within a structure that affect its overall stability. SSEs in various forms such as hairpin loops or internal loops destabilize the overall structure, but some loops destabilize it more than others. Additionally, there may be multiple structures with the maximum number of base pairs, and this solution would be unable to differentiate between them. Therefore, in order to determine the true MFE structure, a good folding algorithm must account for a larger set of parameters than Nussinov’s algorithm.
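The recurrence in equation (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation of Nussinov et al.: indices are 0-based, the min_loop parameter (an assumption added here) enforces rule (iii) from section 1.1.4, and the function names are hypothetical.

```python
# Base pairs considered in this thesis: G-C, A-U, and the wobble pair G-U.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov_fill(seq, min_loop=3):
    """Fill M, where M[i][j] is the maximum number of base pairs on seq[i..j]."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = M[i][j - 1]  # case 1: j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = M[i][k - 1] if k > i else 0
                    inner = M[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + inner + 1)
            M[i][j] = best
    return M


def nussinov_traceback(seq, M, min_loop=3):
    """Recover one optimal structure in dot-bracket notation."""
    n = len(seq)
    structure = ["."] * n
    todo = [(0, n - 1)] if n else []
    while todo:
        i, j = todo.pop()
        if i >= j:
            continue
        if M[i][j] == M[i][j - 1]:
            todo.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) not in PAIRS:
                continue
            left = M[i][k - 1] if k > i else 0
            inner = M[k + 1][j - 1] if k + 1 <= j - 1 else 0
            if left + inner + 1 == M[i][j]:
                structure[k], structure[j] = "(", ")"
                if k > i:
                    todo.append((i, k - 1))
                todo.append((k + 1, j - 1))
                break
    return "".join(structure)


if __name__ == "__main__":
    s = "GGGAAAUCC"
    table = nussinov_fill(s)
    print(table[0][-1], nussinov_traceback(s, table))
```

The three nested loops make the O(N^3) running time visible, and the traceback returns only one of possibly many structures with the maximum pair count, which is the ambiguity noted above.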
A more sophisticated algorithm was developed by Zuker and Stiegler in 1981, refined by Zuker and Sankoff in 1984, and again by Sankoff in 1985 [2, 3, 6]. This algorithm takes into account the destabilizing energies of SSEs such as hairpin loops and internal loops. Its recursive function can be written

\[
\begin{aligned}
W(i, j) &= \min
\begin{cases}
V(i, j) \\
W(i+1, j) \\
W(i, j-1) \\
\min_{i \le k \le j-1} \{ W(i, k) + W(k+1, j) \}
\end{cases} \\[6pt]
V(i, j) &= \min
\begin{cases}
eh(i, j) \\
V(i+1, j-1) + es(r_i, r_j, r_{i+1}, r_{j-1}) \\
\min_{i < i' < j' < j,\; 2 < i' - i + j - j'} \{ V(i', j') + ebi(i, j, i', j') \} \\
\min_{i+1 \le k \le j-2} \{ W(i+1, k) + W(k+1, j-1) \} + a
\end{cases}
\end{aligned}
\tag{5}
\]

where W(i, j) is the score of the best possible structure on subsequence [i…j], and V(i, j) is the score of the best possible structure on subsequence [i…j] with the added assumption that i and j form a base pair in that particular structure. Additionally, eh(i, j) is the energy penalty of a hairpin loop closed by base pair (i, j), es(r_i, r_j, r_{i+1}, r_{j-1}) is the energy bonus given by “stacking” base pairs (i, j) and (i+1, j-1) together, a function that depends specifically on the nucleotides involved in the stacking, ebi(i, j, i’, j’) is the energy penalty given by the internal loop or bulge closed by base pairs (i, j) and (i’, j’), and a is the energy penalty given by opening a new multiloop structure.

In simple terms, the matrix W accounts for four possibilities: i and j form a base pair, i is unpaired, j is unpaired, or both i and j are in a base pair, but not with each other. Note that the possibility that both i and j are unpaired can be accounted for by choosing the second and third options consecutively. The matrix V assumes i and j form a base pair, and then determines the SSE of minimum energy that could be closed by (i, j). The advantage of this algorithm is that the functions eh, es, ebi, and a can be continuously updated to reflect parameters given by new experiments. Michael Zuker’s MFOLD software package uses a version of this algorithm which runs in O(N^4) time.
However, while this algorithm has been used to successfully compute the MFE structure of many different sequences, it has been shown that the MFE structure is often not the native structure assumed by an RNA molecule [7]. This is because the native structure depends on the dynamic folding process of the molecule, which can depend on how the molecule interacts with external forces, and is thus extremely difficult to predict. Very often, the RNA molecule will end up in a structure that is both energetically favorable and stable, but is not necessarily the structure with the minimum possible energy given its sequence. Thus, algorithms that find the MFE structure are limited, and we must continue to search for new and creative ways to visualize and solve the problem.

1.1.5 Folding Landscapes

We discussed earlier the notion of a folding landscape, which is composed of all possible structures an RNA molecule can take. The landscape can naturally be represented as a graph in which any two structures that differ only by the addition or subtraction of a base pair are connected by an edge. These edges can also be directed if it is desired that an edge represents a transition from a structure with higher free energy to a structure with lower free energy. The edges can also be weighted if it is desired that an edge should include the difference in free energy between two structures. However, these are just alternate representations of the same information. For the purposes of this research, we will represent the landscape as an unweighted, undirected graph. Each node is encoded with the free energy of its associated structure.

As previously discussed, the size of the folding landscape is enormous, but finite. RNAsubopt, introduced by Wuchty et al. in 1999 [8], is capable of enumerating all suboptimal structures in the energy range from the MFE to some arbitrary upper limit. Additionally, BARRIERS, introduced by Flamm et al.
in 2002 [9], is capable of constructing the exact energy landscape by establishing that the topology of the landscape always forms a hierarchy, and so the landscape can be represented in a form called a barrier tree. Therefore, it is possible to fully realize an actual folding landscape. However, the size of a landscape for any RNA sequence of sufficient length is overwhelmingly large, so we are still in essence restricted to working with the portion of the landscape within a certain percentage of the MFE, rather than the full landscape.

To solve the problem of finding a consensus folding landscape, we must first start with a rigorous mathematical definition of a landscape as well as its interesting features. Flamm et al. have already provided such definitions [9], and the relevant ones will be explained here. Let a configuration space be a graph G(V, E), where each v ∈ V represents a unique secondary structure, and each pair of vertices v1, v2 ∈ V whose associated structures in dot-bracket notation differ only by the addition or deletion of a single base pair is connected by an unweighted, undirected edge e ∈ E. Then a landscape L(G, f) is defined by a configuration space and a function f: V ⟶ ℝ. In the case of our RNA landscape, this function f is precisely the function that calculates the free energy of a structure.

There are several intuitive assumptions we can make about identifying the properties of native structures from a folding landscape. The first and most obvious is that a native structure will be locally optimal, or equivalently a local minimum in the landscape. This makes sense because any structure that is not a local minimum can spontaneously transform into a structure that is a local minimum by adding or removing base pairs, much like a ball rolling down a hill. Thus, a folding RNA molecule will not come to rest in a conformation that is not a local minimum. The second assumption is that the native structure should be stable.
Specifically, the stability of a structure x refers to the difference in energy between x and the height of the lowest saddle point between x and any other local minimum y ∈ V. This is an important feature of the landscape because for RNA molecules to function reliably, they should be strongly resistant to external forces that may act on the structure, breaking some base pairs and possibly pushing the molecule into a new energy basin and structure. Because local optimality and stability are two independent properties, it is important to note that for two local minima A and B, A may have greater total energy than B, but it may also be more stable. The third assumption deals with the notion of accessibility of a local minimum. That is, it may be possible to have a local minimum with low total energy which is also highly stable, but for which the probability that a random structure in V lies in its associated basin is far less than the probability of the same event for a different basin. If we visualize an energy basin as generally assuming the shape of a bowl, the accessibility property refers to the width of the bowl. Flamm et al. have given us explicit ways to measure these three properties, as we will now see.

The neighbors of a vertex v ∈ V can be defined by the set ∂v = ∂{v} = {y ∈ V | {v, y} ∈ E}. This definition extends to the neighboring set of a connected component: ∂A = {y ∈ V \ A | ∃x ∈ A: {x, y} ∈ E}. ∂A is called the boundary of A. The neighborhood of A is the union of set A and its boundary. That is, N(A) = A ∪ ∂A. A vertex x is a local minimum if f(x) ≤ f(y) for all y ∈ ∂x. If the inequality is changed to f(x) < f(y), x is called a strict local minimum. M is the set of all local minima in the landscape. As mentioned, it is assumed that the vertex representing the native structure of an RNA molecule belongs to M. A landscape is non-degenerate or invertible if f(x) = f(y) → x = y ∀x,y ∈ V. Some degenerate RNA folding landscapes exist.
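To make the neighbor set ∂v and the local-minimum test concrete for RNA structures, the following Python sketch generates every structure that differs from a given one by a single base pair and then checks f(x) ≤ f(y) over that set. Two assumptions are made purely for illustration: the energy function toy_energy simply counts base pairs (it is not a thermodynamic model), and the min_loop parameter enforces rule (iii) from section 1.1.4. All function names are hypothetical.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def parse_dot_bracket(db):
    """Convert a dot-bracket string into a frozenset of 0-based pairs (i, j)."""
    open_positions, pairs = [], set()
    for pos, ch in enumerate(db):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            pairs.add((open_positions.pop(), pos))
    return frozenset(pairs)


def neighbors(seq, pairs, min_loop=3):
    """All structures reachable by deleting or adding exactly one base pair."""
    result = [pairs - {p} for p in pairs]  # deletions
    paired = {x for p in pairs for x in p}
    n = len(seq)
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if i in paired or j in paired or (seq[i], seq[j]) not in PAIRS:
                continue
            # (i, j) crosses (a, b) if exactly one endpoint lies inside (a, b).
            crosses = any((a < i < b) != (a < j < b) for a, b in pairs)
            if not crosses:
                result.append(pairs | {(i, j)})  # pseudoknot-free addition
    return result


def toy_energy(pairs):
    """ASSUMPTION: -1 per base pair, a stand-in for a real free-energy function f."""
    return -len(pairs)


def is_local_minimum(seq, pairs, f=toy_energy):
    """True if f(x) <= f(y) for every neighbor y of x."""
    fx = f(pairs)
    return all(fx <= f(y) for y in neighbors(seq, pairs))
```

Under this toy energy a structure is a local minimum exactly when no further base pair can be added, whereas with a realistic energy function a deletion can also lower the energy.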
A walk of length k in the landscape is a sequence of vertices p = (x_1, x_2, …, x_k) such that each x_i ∈ V and {x_i, x_{i+1}} ∈ E. The set of all possible walks between vertices x and y is denoted by $\mathbf{P}_{xy}$. Vertices x and y are mutually accessible at height n if there exists a walk p ∈ $\mathbf{P}_{xy}$ such that f(z) ≤ n for all z ∈ p. Finally, the saddle height between two vertices x and y is the minimum height at which they are mutually accessible. That is,

\[
\hat{f}(x, y) = \min_{p \in \mathbf{P}_{xy}} \; \max_{z \in p} f(z)
\]

With these definitions, we can measure the characteristics discussed previously. A structure is locally optimal if its associated vertex is a local minimum in the landscape. The stability of a structure x can be measured by the minimal saddle height between x and any other local minimum, taken relative to f(x). When discussing the saddle height between two adjacent local minima, the number represents the energy required to transform one structure into the other (for the purposes of this thesis, “adjacent” means that the shortest walk between two local minima can be divided into one ascending phase followed by one descending phase).

1.1.6 Folding Pathways

Assuming that the dynamic folding process occurs by the discrete addition or subtraction of individual base pairs, it can be modeled as a path through the landscape. Here, discrete means that no two base pairs are altered at precisely the same time, so a path denotes the series of transformations an RNA molecule assumes during the folding process. Naturally, we are primarily interested in the end point of this path, which we assume lies in a stable energy basin. However, this can depend on the molecule’s starting position. If we assume in vitro folding, the molecule begins in a “flat” conformation, containing no base pairs. Then, small fluctuations in thermal noise allow the molecule to begin folding, travelling a path in the landscape that is almost entirely determined by sequence information. However, this is not necessarily the case in vivo.
In the cell, there may be many external forces that can interfere with the folding process. More importantly, when RNA molecules are assembled in the transcription process, the free-floating portion of the molecule may begin to fold before the rest of the molecule is assembled. Because each nucleotide added to the sequence adds a new set of structures to the folding landscape, the RNA molecule is constantly jumping between different landscapes until it arrives at the final one. The structure of the molecule at this point is then considered its starting point in the final landscape, and in the absence of any external stabilizing forces can drastically alter the end point of the path, or the final structure. This is akin to dropping a marble directly above different locations on a complex topological surface. Assuming the marble can only roll downhill (that is, there are no external forces acting upon the system), choosing an arbitrary starting location for the marble necessarily excludes a set of final locations. Thus, even if we could perfectly determine the step an RNA molecule may take at each point in the landscape, we still cannot predict the final vertex without knowing the starting vertex. The accessibility property discussed in the previous section has important implications for folding pathways. If we assume that the folding pathway for an RNA molecule ends in a local minimum that is highly inaccessible, there is a greater probability that some external force may act on the molecule before it reaches the basin, and because the basin is so inaccessible, the molecule may never recover from this error and will end up misfolded. Thus, it seems clear that pressure from natural selection would also drive the creation of nucleotide sequences for which the target structure in the folding landscape lies in a highly accessible basin, therefore minimizing the occurrences of a misfold. 
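The marble analogy and the landscape measures from section 1.1.5 can be tried out on a toy landscape. The sketch below is an illustration under stated assumptions, not the BARRIERS algorithm: the landscape is a small hand-made graph given as an adjacency dictionary adj with an energy dictionary f, the saddle height is found with a minimax variant of Dijkstra's algorithm, and accessibility is approximated by counting how many start vertices roll into each minimum. All names are hypothetical.

```python
import heapq


def local_minima(adj, f):
    """Vertices x with f(x) <= f(y) for every neighbor y."""
    return [v for v in adj if all(f[v] <= f[u] for u in adj[v])]


def gradient_walk(adj, f, start):
    """Roll downhill: repeatedly move to the lowest neighbor while it is strictly lower."""
    v = start
    while True:
        best = min(adj[v], key=lambda u: f[u], default=v)
        if f[best] >= f[v]:
            return v
        v = best


def saddle_height(adj, f, x, y):
    """Minimum over walks from x to y of the maximum energy visited."""
    best = {x: f[x]}
    heap = [(f[x], x)]
    while heap:
        height, v = heapq.heappop(heap)
        if v == y:
            return height
        if height > best[v]:
            continue  # stale heap entry
        for u in adj[v]:
            new_height = max(height, f[u])
            if new_height < best.get(u, float("inf")):
                best[u] = new_height
                heapq.heappush(heap, (new_height, u))
    return float("inf")  # x and y are not connected


def barrier(adj, f, x):
    """Stability of x: lowest saddle to any other local minimum, relative to f(x)."""
    others = [y for y in local_minima(adj, f) if y != x]
    lowest = min((saddle_height(adj, f, x, y) for y in others), default=float("inf"))
    return lowest - f[x]


def basin_sizes(adj, f):
    """Number of start vertices whose gradient walk ends in each local minimum."""
    sizes = {}
    for v in adj:
        m = gradient_walk(adj, f, v)
        sizes[m] = sizes.get(m, 0) + 1
    return sizes


if __name__ == "__main__":
    # A path-shaped toy landscape with two basins (energies are made up).
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
    f = {0: -5, 1: -2, 2: 0, 3: -1, 4: -4, 5: -3, 6: -1}
    print(local_minima(adj, f), barrier(adj, f, 4), basin_sizes(adj, f))
```

In this toy example the deeper minimum has the smaller basin, which mirrors the point that total energy, stability, and accessibility are separate properties.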
1.2 Consensus Folding Landscapes

At last, we are ready to move to the primary focus of this thesis: defining and solving the problem of finding the consensus folding landscape between two or more RNA sequences. To the best of our knowledge, Li et al. [10] are the only group who have researched this subject, but even in that paper, the problem is not formally defined. This thesis will give a formal definition of the problem, explain the efficacy of some solutions on constrained versions of the problem, and discuss what implications solving this problem could have for the field of structural bioinformatics.

Because we have already defined a folding landscape as a graph, the intuitive notion for a consensus folding landscape is that it should be an alignment of two graphs that conserves a maximal amount of correspondence between them. If two landscapes can be said to have a consensus in some region, then both of the associated RNA molecules should be governed by the same set of folding dynamics when they are in those regions. How can we formulate this problem? Clearly, one aspect should involve comparing the nodes of each graph, and finding the best correspondence between the two sets. The best correspondence could be defined in terms of structural similarity, total energy level, stability, or some combination of these factors. In fact, this is exactly the approach we take in Chapter 3. However, finding such a correspondence is not enough. Because a folding landscape defines how a molecule behaves during the folding process, a consensus landscape should also define how both molecules behave when they are in the consensus zone. Therefore, a consensus landscape should maximize the size of a correspondence between the vertices of two networks while also conserving topological information.
Although no one has yet defined the problem of finding a consensus folding landscape, it is apparent from the previous discussion that this problem is identical to the Global Network Alignment (GNA) problem. This problem is defined rigorously in relation to protein-protein interaction networks by Zaslavskiy et al. [11]. The definition we are about to give is more or less a transcription of their definition.

Assume we are trying to find some global alignment between graphs G and H. We begin by assuming G and H have the same number of vertices. Even though this is rarely the case, we can “simulate” this property by adding some number of “dead” vertices to each graph so that they have the same order. If any particular alignment maps a “live” vertex in G to a dead vertex in H, this means that the associated structure in G does not map to any real structure in H. Now that the vertex sets are the same size, we are looking for a bijection between the two sets that maximizes the number of conserved topological interactions. Such a mapping is given by a permutation π of {1, 2, …, N}, which are the vertices of G. In each permutation, the i-th vertex of G is mapped to the π(i)-th vertex of H. Each permutation π can also be represented by a permutation matrix P, where $P_{ij} = 1$ if and only if π(i) = j. Then, the set of all possible permutations is defined by $\mathbf{P} = \{P \in \{0,1\}^{N \times N} \mid P\mathbf{1}_N = \mathbf{1}_N,\; P^{T}\mathbf{1}_N = \mathbf{1}_N\}$, where $\mathbf{1}_N$ is the column vector with N entries all equal to 1. For clarity, the identity matrix $I_{N \times N}$ represents the permutation where the first vertex of G is mapped to the first vertex of H, the second vertex of G is mapped to the second vertex of H, and so on. The two conditions on the right side of the definition of $\mathbf{P}$ mean that each distinct row and each distinct column in a permutation matrix P must sum to exactly one. This is because each vertex in G must map to exactly one vertex in H, and the relationship is bidirectional.
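The padding with dead vertices and the row and column conditions on a permutation matrix can be illustrated with a short Python sketch. It uses plain nested lists and 0-based vertex labels instead of the 1-based labels in the text, and the helper names are hypothetical.

```python
def pad_with_dead_vertices(adjacency, order):
    """Extend a square 0/1 adjacency matrix with isolated 'dead' vertices up to the given order."""
    m = len(adjacency)
    padded = [list(row) + [0] * (order - m) for row in adjacency]
    padded += [[0] * order for _ in range(order - m)]
    return padded


def permutation_matrix(pi):
    """Build P with P[i][j] = 1 if and only if pi[i] == j."""
    n = len(pi)
    return [[1 if pi[i] == j else 0 for j in range(n)] for i in range(n)]


def is_permutation_matrix(P):
    """Check that P is binary and that every row and every column sums to one."""
    n = len(P)
    binary = all(x in (0, 1) for row in P for x in row)
    rows_ok = all(sum(row) == 1 for row in P)
    cols_ok = all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
    return binary and rows_ok and cols_ok


if __name__ == "__main__":
    small = [[0, 1], [1, 0]]  # a landscape with two structures
    print(pad_with_dead_vertices(small, 3))
    print(permutation_matrix([1, 2, 0]))
```

A dead vertex has no edges, so mapping a live vertex onto it can never contribute a conserved interaction.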
We can score each permutation by recording the number of interactions it conserves. That is, we are interested in the number of vertex pairs (i, j) that are connected in G where π(i) and π(j) are also connected in H. We denote this number by J(P), and it is clear that we seek to maximize this quantity. If we apply the permutation encoded by P to the graph H, we can obtain a new graph isomorphic to H, denoted by P(H). This is because the permutation simply shuffles the vertex labels. Note that this is not equivalent to multiplying $P A_H$. Instead, the adjacency matrix for the permuted graph, $A_{P(H)}$, can be obtained from multiplying $P A_H P^{T}$ [12]. Because adjacency matrices are symmetric, J(P) is then equivalent to half the number of entries in both $A_G$ and $A_{P(H)}$ that are simultaneously equal to 1. Formally,

\[
J(P) = \frac{1}{2} \sum_{i,j=1}^{N} [A_G]_{ij}\,[A_{P(H)}]_{ij} \tag{6}
\]

Zaslavskiy et al. [11] define two formulations of this problem. The first is referred to as the Constrained Global Network Alignment problem, and can be applied in the case where we have a list of candidate matchings between vertices, and we simply want to disambiguate the set of matchings by disallowing some correspondences, which allows us to rule out large numbers of permutations simultaneously. For the purposes of finding a consensus folding landscape, however, we cannot impose any such obvious constraints on which structures can match. Therefore, we will focus on the second formulation of the problem given by Zaslavskiy et al. [11], which is called the Balanced Global Network Alignment problem. This formulation takes into account the degrees of similarity between all vertices in the network, and allows for the trade-off or balancing between vertex similarity and topological information.
Assuming we have an N x N similarity matrix C in which $C_{ij}$ denotes the similarity score between vertex i ∈ G and vertex j ∈ H, the total similarity for any permutation P can be denoted

\[
S(P) = \sum_{i=1}^{N} C_{i,\pi(i)} \tag{7}
\]

Finally, we can state the BGNA problem as the optimization of the following weighted sum

\[
\max_{P \in \mathbf{P}} \; \lambda J(P) + (1 - \lambda) S(P) \tag{8}
\]

where λ is a weighting factor determining the relative importance of topological information and vertex similarity, respectively.

If the Balanced GNA problem could be solved exactly, it would illuminate many aspects of RNA folding landscapes. Currently, little is known about quantifying the change enacted on a landscape by a change in the RNA sequence. Two identical RNA sequences also have identical folding landscapes, but if we change a single nucleotide in the second sequence, how dramatically does its landscape change? We could quantify this using the change in the best alignment given by a solution to BGNA. We could then measure the point at which two landscapes “diverge” from each other due to differences in the primary sequence. This could establish a sequence identity threshold between two RNA molecules, below which any consensus folding landscape of sufficient size could be said to be statistically significant. In clearer terms, we mean that the consensus landscape between two sequences that differ only by a single nucleotide is more likely to be a result of high sequence identity between the two molecules, rather than having any biological significance. However, if two molecules with lower sequence identity share a relatively large consensus, this information is likely to point to significant biological structures, reducing the list of candidate structures, and improving native structure prediction significantly. Unfortunately, we can currently only speak of these advantages in general terms, because the intractability of BGNA prohibits us from quantifying exactly how much a solution would help us.
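Equations (6) through (8) can be evaluated directly on very small graphs, which also shows why exhaustive search is hopeless at landscape scale. The Python sketch below is a brute-force illustration only, not one of the published approximation methods: it uses 0-based vertex labels, represents π as a tuple, relies on the identity [P A_H P^T]_ij = [A_H]_π(i)π(j), and tries all N! permutations. The function names are hypothetical.

```python
from itertools import permutations


def conserved_edges(A_G, A_H, pi):
    """J(P) from equation (6): edges of G whose images under pi are edges of H."""
    n = len(pi)
    total = sum(A_G[i][j] * A_H[pi[i]][pi[j]] for i in range(n) for j in range(n))
    return total / 2  # each undirected edge is counted twice


def similarity(C, pi):
    """S(P) from equation (7)."""
    return sum(C[i][pi[i]] for i in range(len(pi)))


def bgna_brute_force(A_G, A_H, C, lam):
    """Maximize lam * J(P) + (1 - lam) * S(P) over all N! permutations."""
    n = len(A_G)

    def score(pi):
        return lam * conserved_edges(A_G, A_H, pi) + (1 - lam) * similarity(C, pi)

    best = max(permutations(range(n)), key=score)
    return best, score(best)


if __name__ == "__main__":
    # G is the path 0-1-2; H is the same path with relabeled vertices (0-2-1).
    A_G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    A_H = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
    C = [[1 if i == j else 0 for j in range(3)] for i in range(3)]
    for lam in (0.0, 0.5, 1.0):
        print(lam, bgna_brute_force(A_G, A_H, C, lam))
```

Varying lam in the usage example shows the trade-off: with lam = 1 the middle vertex of G is matched to the middle vertex of H, while with lam = 0 the similarity matrix alone decides the alignment.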
The Balanced GNA problem is known to be NP-hard. The number of possible permutations is N!, where N is even greater than the estimated size of the landscape discussed in section 1.1.4 due to the additional dead vertices required by the problem formulation. A number of approximate solutions to the problem have been proposed [11, 13, 14], but even these are incapable of processing RNA folding landscapes, which are enormous in size. Thus, we will also propose an approximate solution specific to RNA folding landscapes, in which we perform a similar matching on a landscape whose size has been dramatically reduced by predicting and only considering the stable local optimal structures. The method for predicting these “points of interest” on the landscape is discussed in section 2.1, and methods for redefining and finding a consensus are discussed in chapter 3.

1.3 Riboswitches

The transcription and translation of mRNA is largely regulated by ncRNA. However, mRNA molecules are also capable of self-regulation. Riboswitches are cis-regulatory elements usually found in the 5’ UTR of mRNA molecules. Most known riboswitches exist in the RNA of bacteria, though the existence of at least one riboswitch has been verified in eukaryotes [15]. Riboswitch sequences are relatively short (usually around 150 nt long) and consist of two primary domains. The aptamer domain is responsible for binding a very specific ligand, whose presence in the cellular solution triggers changes in mRNA processing. The second domain is the expression platform, which is the portion of the sequence that can act as an interface between mRNA and ribosomes or other cellular machinery. The aptamer domain and expression platform overlap in the sequence. The overlapping portion is called the switching sequence, and it undergoes structural changes in response to the binding of a ligand to the aptamer domain.
The result is that the riboswitch can assume two native structures, commonly referred to as the “on” and “off” positions. Thus, while most ncRNA molecules assume one stable local optimal structure in the folding landscape, riboswitches assume two native structures, and the binding of the ligand to the aptamer domain provides a mechanism for the riboswitch to follow a path from one energy basin to another and back. The exact mechanism by which this structural change occurs depends on the riboswitch, and requires detailed knowledge of the atomic structure of the molecule.

There are three primary mechanisms by which riboswitches can regulate RNA. The first is transcription termination. This occurs after the riboswitch has been transcribed from a DNA sequence but before the rest of the gene has been transcribed by RNA polymerase. In the presence of the appropriate ligand, the switching sequence may form what is known as a rho-independent hairpin loop, which causes the transcription process to stall. The hairpin loop is then followed by a polyuracil chain, which destabilizes the transcription complex, letting it detach from the DNA strand and ending the process prematurely. Because this step occurs before the alternative splicing of the gene, the riboswitch also disallows the introns from being transcribed, and therefore it also regulates ncRNA. The second mechanism is translation inhibition. In this case, the expression platform coincides with the ribosome-binding site on the mRNA, and the switching sequence partially alters its structure. This makes the mRNA unable to bind to a ribosome, prohibiting translation from occurring. The third mechanism is ribozyme activation. In this case, the riboswitch acts as a ribozyme, and when activated by a ligand, automatically cleaves itself, effectively destroying the mRNA in the process. Riboswitches perform regulatory duties in many other ways, but these are the most common.
1.4 Predicting Native Structures of Riboswitches

In this thesis, we are interested in using an approximate solution to the global network alignment problem to predict the native structures of different riboswitches. The rationale behind this is that different RNA sequences may regulate the same genes and bind the same ligands, and are therefore considered the same riboswitch, even though the primary structure of these sequences may differ somewhat. Because each riboswitch must bind a single particular ligand, the structure of the aptamer domain must be very strongly conserved among all sequences comprising the same riboswitch. The riboswitch would not function correctly if it were to accidentally bind the wrong ligand. Thus, though the different manifestations of the same riboswitch may differ in sequence and therefore folding landscape, we can make a strong assumption that there must be two structures common to both landscapes: the native “on” and “off” structures of the riboswitch. Current methods of secondary structure prediction tend to focus on the candidate structures derived from a single sequence, and the result is that viable theoretical candidates may be returned, but without having any biological significance. However, if we compare the points of interest across multiple landscapes, we may be able to narrow our pool of candidate structures.

One possible limitation to this approach is that the sequences for a riboswitch may have high (>90%) sequence identity, which may mean that the landscapes for each sequence are very similar, and therefore finding a consensus among them may not yield much useful information. Essentially, if we are interested in the intersection between two sets, and the sets happen to be very similar, the intersection will be large, which may not help us in our search.
However, if the folding landscapes of two sequences diverge rapidly with decreasing sequence identity, the intersection is likely to be very small, which would improve structure prediction. The validity of such speculation remains to be seen.

CHAPTER TWO: PREVIOUS WORKS

There are a number of previous related works, some of which form a strong basis for this research. However, because of the novelty of the problem, very little work has been performed to directly solve the problem of finding a consensus landscape, except for RNAConSLOpt, discussed in section 2.2.

2.1 RNASLOpt

RNASLOpt is a program published in 2011 by Dr. Yuan Li and Dr. Shaojie Zhang at the University of Central Florida [7]. The problem that their research seeks to address is how to deal with the prohibitively large size of the energy folding landscape. RNASLOpt filters out the vast majority of these extraneous structures by computing the Stable, Locally Optimal structures in the landscape, hence the acronym RNASLOpt. The assumptions underlying the methodology are simple: the native structure should not spontaneously change into another structure that is thermodynamically more favorable (it should be locally optimal), and the barrier energy of the native structure should be sufficiently high that the secondary structure cannot be altered by thermal noise or other unpredictable events (it should be stable).

RNASLOpt takes an RNA sequence and uses Bafna’s algorithm [16] to compute all possible putative stacks in O(n^2) time within user-defined parameters. In order to significantly reduce the time of the computation, Li et al. do not consider stacks with length less than 4, because Bafna showed that the fraction of stacks missed with this cutoff is less than 10%. It is important to note that the native structures of the riboswitches we use in this thesis contain a small number of stacks of length 2 and 3, as well as isolated base pairs, so we will have to accept this as merely an approximation.
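To give a sense of what a list of putative stacks looks like, the following Python sketch enumerates maximal runs of consecutive, nested, pairable positions of at least a minimum length. It is a naive enumeration written for illustration and is not Bafna's algorithm [16] or the RNASLOpt implementation; the tuple format (i, j, length), the 0-based indices, and the min_loop parameter are assumptions made here.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def putative_stacks(seq, min_len=4, min_loop=3):
    """Return maximal stacks as (i, j, length), where (i, j) is the outermost pair
    and the stack consists of pairs (i, j), (i + 1, j - 1), ..., inward."""
    n = len(seq)

    def pairable(i, j):
        # In range, far enough apart to close a hairpin, and complementary.
        return 0 <= i and j < n and j - i > min_loop and (seq[i], seq[j]) in PAIRS

    stacks = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if not pairable(i, j) or pairable(i - 1, j + 1):
                continue  # not a pair, or not the outermost pair of its run
            length = 0
            while pairable(i + length, j - length):
                length += 1
            if length >= min_len:
                stacks.append((i, j, length))
    return stacks


if __name__ == "__main__":
    print(putative_stacks("GGGGAAAACCCC"))
    print(putative_stacks("GGGGAAAACCCC", min_len=3))
```

Lowering min_len in the usage example shows how quickly the candidate list grows once shorter stacks are admitted, which is the cost that the length cutoff of 4 is meant to avoid.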
Using this list of putative stacks, RNASLOpt can use two separate algorithms to compute optimal configurations of these stacks, one based on the Nussinov model [5], and the other based on the Turner energy model [17], and also partially based on the Zuker-Sankoff algorithm [3]. A configuration is considered locally optimal when no new stacks can be added to the structure without conflicting with another stack or forming a pseudoknot. RNASLOpt returns the list of locally optimal structures. It then computes the stability of each of these structures, determined by the height of the lowest saddle point adjacent to each energy basin. If this energy barrier is less than ΔB, the structure is discarded. Finally, the remaining structures are ranked according to stability.

Li et al. benchmark RNASLOpt using seven riboswitches (Adenine-BS, Adenine-VV, Guanine, SAM, C-di-GMP, Lysine, and TPP) against other RNA secondary structure prediction software, including mfold, RNAShapes, and RNAlocopt. In all cases except Lysine, RNASLOpt ranked the correct native structure higher than its competitors, indicating that its unique way of estimating the “points of interest” in the landscape is sound. Therefore, the methodology of this thesis will begin with results returned from RNASLOpt and attempt to optimize them, although the data sets will come from the paper summarized in the next section.

2.2 RNAConSLOpt

Li et al. expanded on RNASLOpt by creating RNAConSLOpt [10], which, based on a thorough review of the literature, is currently the only software package that addresses the problem of finding a consensus folding landscape. The input to the program is a multiple sequence alignment. RNAConSLOpt then analyzes the alignment using the covariance and conservation score introduced in RNAalifold by Hofacker et al. [18].
Specifically, if we visualize the multiple sequence alignment as a matrix in which each row contains a sequence and each column contains a set of nucleotides that have been aligned, we can calculate the covariance and conservation score between columns i and j, denoted by γ_ij, using the following formula:

\[ \gamma_{ij} = \frac{1}{n}\left(C_{ij} - \phi_1 q_{ij}\right) \tag{9} \]

where ϕ1 is a weighting factor set to 1 by default, n is the number of sequences in the alignment, and C_ij is defined by

\[ C_{ij} = \frac{2}{n-1} \sum_{1 \le k < l \le n} \begin{cases} d(a_k^i, a_l^i) + d(a_k^j, a_l^j) & \text{if } (a_k^i \cdot a_k^j) \text{ and } (a_l^i \cdot a_l^j) \\ 0 & \text{otherwise} \end{cases} \tag{10} \]

where a_k^i refers to the ith nucleotide in sequence k, (a_k^i · a_k^j) means that the associated nucleotides can form a base pair, and d(x, y) is the Hamming distance between two nucleotides, equal to 0 if they are identical and 1 if they are not. Finally, q_ij is defined by

\[ q_{ij} = \sum_{1 \le k \le n} \begin{cases} 0 & \text{if } a_k^i \cdot a_k^j \\ 0.25 & \text{if both } a_k^i \text{ and } a_k^j \text{ are gaps} \\ 1 & \text{otherwise} \end{cases} \tag{11} \]

In other words, C_ij represents a bonus given to columns with compensatory mutations that keep the consensus structure, while q_ij is a penalty assessed to columns that do not follow the consensus structure. RNAConSLOpt computes γ_ij for all possible pairs of distinct columns in the alignment, and uses this value in a modified version of the recursive function from [7] to compute locally optimal configurations of consensus stacks. Using the same heuristic from the 2011 paper, RNAConSLOpt calculates the barrier energy of each ConLOpt structure to determine the ConSLOpt structures. Because this methodology takes into account stacks shared between multiple sequences, the number of structures is reduced in comparison to RNASLOpt, essentially cutting out most of the structures that are simply a consequence of the RNA sequence, and leaving the biologically relevant structures.
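As an illustration of equations (9) through (11), the following sketch computes the score for one pair of alignment columns. It is not taken from RNAConSLOpt or RNAalifold; it assumes that d is the Hamming distance (0 for identical nucleotides, 1 otherwise), that gaps are written as '-', and that G-U wobble pairs count as pairable. The function names are hypothetical.

```python
from itertools import combinations

# Pairable nucleotides; including G-U wobble pairs is an assumption.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    return (x, y) in PAIRS


def hamming(x, y):
    """Assumed definition of d: 0 if the nucleotides are equal, 1 otherwise."""
    return 0 if x == y else 1


def gamma(alignment, i, j, phi1=1.0):
    """Covariance and conservation score for columns i and j (0-based)."""
    n = len(alignment)
    col_i = [row[i] for row in alignment]
    col_j = [row[j] for row in alignment]

    # Equation (10): bonus for compensatory mutations.
    c = 0.0
    for k, l in combinations(range(n), 2):
        if can_pair(col_i[k], col_j[k]) and can_pair(col_i[l], col_j[l]):
            c += hamming(col_i[k], col_i[l]) + hamming(col_j[k], col_j[l])
    c *= 2.0 / (n - 1)

    # Equation (11): penalty for sequences that break the consensus pair.
    q = 0.0
    for k in range(n):
        if can_pair(col_i[k], col_j[k]):
            continue
        q += 0.25 if (col_i[k] == "-" and col_j[k] == "-") else 1.0

    # Equation (9).
    return (c - phi1 * q) / n
```

For the toy alignment GAAAC / CAAAG / AAAAU, columns 0 and 4 pair in every sequence with a different pair each time, so they receive a positive score, whereas columns 1 and 2 (A against A in every row) are penalized.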
It is important to note that while other tools such as LocARNA [19-21] are already capable of computing the consensus structure of a multiple sequence alignment, these tools focus exclusively on the best possible consensus structure, and are therefore inappropriate to apply to RNA molecules that have alternate functional structures, such as riboswitches.

In a sense, MUSHI, the software developed in this thesis, represents the converse of RNAConSLOpt. While RNAConSLOpt takes an alignment first and computes optimal configurations of stacks (the consensus landscape) second, MUSHI takes input from RNASLOpt, which computes the landscapes first, and performs structural alignment second. In section 4.4, we compare the results of RNASLOpt + MUSHI to RNAConSLOpt, explaining the advantages and disadvantages of each.

2.3 GraphClust

In 2012, Heyne et al. [22] published a paper describing GraphClust, a tool for ncRNA annotation. As discussed previously, recent studies have begun to indicate that ncRNA plays a crucial role in gene expression [23], and yet we know relatively little about how such molecules function. One reason for this is that the classes of ncRNA are quite diverse, and ncRNAs sharing a common structure do not necessarily have high sequence identity, so sequencing information alone is not as useful in ncRNA clustering as it is in mRNAs. Thus, ncRNA annotation requires clustering using sequence-structure information, which is usually computationally expensive. GraphClust solves this problem by decomposing the landscape and using fast heuristics, many of which operate in constant time. Because of this speedup, GraphClust can scale to hundreds of thousands of sequences.

GraphClust's relevance to this thesis is due to its methodology, which is similar but not equivalent to finding the consensus landscape between two sequences. First, the sequences are analyzed and many suboptimal structures are enumerated.
The structures are then represented by a graph, where the vertices are nucleotides, and an edge exists between vertices if and only if the corresponding nucleotides are adjacent in the sequence, or they form a base pair. Furthermore, an additional node is added in the middle of each pair of stacking base pairs, which introduces important features into the graph. Because in practice one often only has a partial transcript of the RNA sequence, the authors further consider subsequences of the original sequence. In this way, each sequence is represented by a set of disconnected graphs. The authors can then compare these representative graphs using a graph kernel, which is a simple way of computing the similarity between two graphs. Specifically, they use the neighborhood subgraph pairwise distance kernel, which is a decomposition kernel introduced by Costa and Grave in 2010 [24]. Furthermore, Heyne et al. propose a fast method for testing for graph isomorphism, which reduces two isomorphic graphs to an identical string, which can then be mapped to an integer using an iterative hashing procedure. At this point, determining isomorphism between two graphs simply amounts to checking whether or not these integers are equal. Using these special distance measures with the graph kernel, the authors are able to cluster ncRNAs based on the similarity of their representative structures, and thereby detect new classes of ncRNAs.

This methodology is important to us because it shows us one method of comparing elements of a sequence's landscape to find some sort of consensus which can be used to cluster the sequences. However, our desired outcomes are different. In our case, we already start with a group of ncRNAs that we know are part of the same class, and we want to analyze their landscapes in order to find common structures.
GraphClust, by contrast, only seeks to cluster the ncRNAs, showing that they are part of the same class, but not necessarily making any definitive statements on their secondary structure.

These three papers will prove to be the most relevant to our discussion of consensus landscapes. In order to analyze various riboswitches and the folding landscapes, we have written a custom piece of software in Java capable of performing the tasks explained in detail in the next two chapters. Originally, we envisioned a structural alphabet similar to the BEAR alphabet before actually learning of BEAR's existence. The original structural alphabet was rudimentary, classifying each SSE into one of only five categories: multiloop, unpaired, stack, hairpin, internal loop/bulge. Thus, MUSHI was chosen as an acronym.

CHAPTER THREE: METHODOLOGY

Clearly, the problem of global network alignment is exceedingly difficult, especially for energy folding networks, which are enormous in size. Current methods cannot solve the problem directly. Therefore, it is necessary to find the best approximate solution by working with an abridged data set. One observation of note is that the folding landscape includes all possible structures on any particular sequence, and necessarily includes structures that are clearly not of any biological significance. If we can devise some method of filtering out these uninteresting data points before any sort of alignment is attempted, it may be possible to find a solution.

The primary method of investigation in this thesis combines the approaches of RNASLOpt, a structural alphabet called BEAR, and a structural substitution matrix called MBR. In the following chapter, we explain how these approaches are combined to provide insight into our new problem, and in Chapter 4 present the results of applying them to predict the native structures of riboswitches.
3.1 RNASLOpt

As discussed in the literature review, RNASLOpt [7] takes an RNA sequence, finds the putative stacks predicted using the methods of Bafna et al. [16], and enumerates all possible locally optimal structures that can be constructed using the stacks. These structures are then filtered by stability, so the output of the program is a list of stable local optimal structures. Because these are the points in the landscape that are most likely to be the native structure, the input to MUSHI will be the output of RNASLOpt. The goal of this methodology is to improve native structure prediction by both reducing the size of the list of candidates output by RNASLOpt and reordering them such that the structures most closely resembling the native structures move toward the top. Our primary thesis is that this can be done by solving, exactly or approximately, the consensus folding landscape problem.

Therefore, given two lists of SLOpt structures for two RNA sequences, we should attempt to find an approximate consensus between them. The most thorough way to achieve this goal is to perform a pairwise comparison of every possible pair consisting of one structure from each landscape. Much thought was given to the problem of how to compare two structures. The most naïve method would be to perform an alignment using the Needleman-Wunsch algorithm. However, this method has many limitations. Consider the following two structures in dot-bracket notation:

...((((.....))))...(((....(((...))).....))).....
...((((.....(((....(((...)))........))).....))))

Particularly, if we examine the first 12 characters in each structure, we can see that an alignment based on the edit distance of these two sequences would treat both sets of characters as a perfect alignment. However, in the first structure, characters 8 through 12 are part of a hairpin loop, whereas the same characters in the second structure are actually part of an internal loop.
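The point can be checked directly with a small sketch that parses both dot-bracket strings into pairing tables. This is an illustration only, not part of MUSHI; the helper name pair_table and the variable names are hypothetical.

```python
def pair_table(dot_bracket):
    """Return a list where entry i is the 0-based index of the pairing
    partner of position i, or None if position i is unpaired."""
    partner = [None] * len(dot_bracket)
    open_positions = []
    for idx, ch in enumerate(dot_bracket):
        if ch == "(":
            open_positions.append(idx)
        elif ch == ")":
            left = open_positions.pop()
            partner[left] = idx
            partner[idx] = left
    return partner


struct_a = "...((((.....))))...(((....(((...))).....)))....."
struct_b = "...((((.....(((....(((...)))........))).....))))"

# The first 12 characters are identical as plain strings...
same_prefix = struct_a[:12] == struct_b[:12]

# ...but the stem opened at position 3 closes in very different places,
# so the unpaired run that follows it plays a different structural role.
partner_a = pair_table(struct_a)[3]
partner_b = pair_table(struct_b)[3]
```

In the first structure the stem closes immediately after the unpaired run (a hairpin loop), while in the second it closes only at the very end of the string, which is exactly the context that a character-by-character comparison cannot see.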
While, in DNA sequence alignment, each nucleotide is an atomic unit of the sequence whose identity does not depend on the surrounding nucleotides, each character in a dot-bracket string represents something different depending not only on the identity of the character, but the context of the surrounding characters. Thus, performing structural alignment based solely on the edit distance of these structural representations eliminates contextual information that is critical to understanding the information the sequence is meant to convey. Clearly, we must devise some way to overcome the limitations of such a primitive alphabet.

In order to best understand the current methodology, it may help to explain some of the ideas that were discarded along the way, and why they were abandoned. When discussing the best way to compare two sets of SLOpt structures, many options were debated. One option was to compute the pairing probability matrix for each landscape. For a sequence of length N, the pairing probability matrix is a matrix of length and width N. Each cell (i, j) stores the probability that nucleotides i and j form a base pair. This probability is calculated from the frequency of such occurrences in the actual landscape. We calculated pairing probability matrices for two landscapes, scaled them so that the values in cells ranged from 0 to 255, and used MATLAB to create a greyscale visualization of each matrix, where each cell was assigned a color in the spectrum from black to white, depending on its numerical value. While this method provided interesting insight into the most common stack structures shared between both landscapes, aligning these matrices in a way that yielded valuable information proved to be difficult. Additionally, because this methodology was a direct translation of dot-bracket notation, it suffered from the same drawbacks described earlier.
Particularly, critical information about the non-stack structures such as internal loops is not conveyed in a pairing probability matrix, which makes it a poor method of comparing landscapes.

Because the shortcomings of dot-bracket notation were now apparent, we next attempted to create a structural encoding for each structure that would convey a greater amount of information. Initially, we decided to use five possible characters: m (multiloop), u (unpaired), s (stack), h (hairpin), i (internal loop), which inspired the acronym MUSHI. The example below shows how a structure would be encoded using this scheme:

...(((...(((...(((...)))...)))...(((...)))...)))...
uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu

Using this encoding, we then attempted to extract additional information from the landscape. We defined a condensed structural encoding as a shortened version of a structural encoding that retained transitions between different characters, but removed repeat characters. The example below shows the transformation from a structural encoding to a condensed structural encoding:

uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu
umusishsisushsumu

The rationale behind using the condensed representation was to preclude length from being a factor in the alignments. It is possible for two structures to strongly resemble each other even though they may differ in length.

Additionally, we attempted to gather data on the frequency at which each nucleotide in the sequence was involved in a particular SSE. For each sequence of length N, this required creating a 5xN matrix, where each of the five rows recorded the frequency that its associated nucleotide was involved in that SSE. This process is similar to assigning a color to each nucleotide in a consensus structure, such as that computed by LocARNA, where the color denotes the sequence conservation of the nucleotide.
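The condensed encoding and the 5xN frequency matrix described above can be sketched in a few lines. This is a reconstruction for illustration rather than the original Java code; the function names are hypothetical, and the sketch assumes that all encodings passed to sse_frequencies have the same length.

```python
from itertools import groupby


def condense(encoding):
    """Collapse each run of repeated characters to a single character."""
    return "".join(ch for ch, _ in groupby(encoding))


def sse_frequencies(encodings, alphabet="mushi"):
    """Return a dict mapping each SSE character to a list of length N,
    where entry p is the fraction of encodings with that character at
    position p (one row of the 5xN matrix per character)."""
    n = len(encodings[0])
    freq = {ch: [0.0] * n for ch in alphabet}
    weight = 1.0 / len(encodings)
    for enc in encodings:
        for pos, ch in enumerate(enc):
            freq[ch][pos] += weight
    return freq


full = "uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu"
condensed = condense(full)
```

Applied to the structural encoding from the example, condense reproduces the condensed string shown in the text.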
We then attempted to visualize these in MATLAB by mapping three of the five rows to one of the three standard colors in the RGB additive color model, again scaling the range [0, 1] to [0, 255]. While this resulted in some interesting patterns and identified a few important analogous nucleotides from each landscape, no meaningful consensus could be extracted from this information, so we again abandoned it.

3.2 Brand nEw Alphabet for RNA (BEAR)

After some research, we came across an article in Nucleic Acids Research published in March 2014 by a group of researchers from the University of Rome [25]. In this article, Mattei et al. describe a structural alphabet similar to our earlier version, but far more sophisticated. First, they divide the class of possible structures into four groups: loops (L), internal loops (I), stems/stacks (S), and bulges (B). Each group contains a series of characters denoting an SSE of different length. For example, S = {S1, S2, S3, …}, where S1 denotes that the nucleotide is a member of a stem of length 1, and so on. Additionally, the alphabets S, I, and B are divided into characters denoting branching and non-branching structures (S = Sn ∪ Sb). A non-branching structure is defined by the maximal boundary [i, j] such that if a hairpin loop exists within the boundary, it is the only such loop. In other words, there does not exist a multiloop structure in [i, j]. This distinction was necessary because they observed different transition rates between SSEs when building the substitution matrix discussed in the next section. Furthermore, the alphabets I and B are divided again into left and right internal loops and bulges. Thus, I = ILn ∪ ILb ∪ IRn ∪ IRb. In order to determine the upper limit on the size of each alphabet, Mattei et al. use a set of carefully selected RNA structures and use the 95th percentile of the length distribution as the limit for each SSE. The final BEAR alphabet is then β = L ∪ I ∪ S ∪ B.
One additional character, ‘:’, is added to the alphabet to denote unpaired nucleotides not belonging to any other SSE. The full alphabet used in this methodology is detailed in the following breakdown, where the first character in each set represents an SSE of length 1, the second represents an SSE of length 2, and so on:

Hairpin Loop = {j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,^}
Stack = {a,b,c,d,e,f,g,h,i,=}
Stack (Branching) = {A,B,C,D,E,F,G,H,I,J}
Left Internal Loop = {?,!,",#,$,%,&,',(,),+}
Left Internal Loop (Branching) = {?,K,L,M,N,O,P,Q,R,S,T,U,V,W}
Right Internal Loop = {?,2,3,4,5,6,7,8,9,0,>}
Right Internal Loop (Branching) = {?,Y,Z,~,?,_,|,/,\,@}
Left Bulge = {[}
Left Bulge (Branching) = {{}
Right Bulge = {]}
Right Bulge (Branching) = {}}
Unpaired = {:}

As an example of this new encoding, consider the following sequence, followed by the traditional dot-bracket representation of its structure, followed by the BEAR representation:

AAAGCGCAAAGCGCAAACGCGAAAGCGCAAACGCGAAACGCGAAA
...((((...((((...))))...((((...))))...))))...
:::dddd:::ddddllldddd:::ddddllldddd:::dddd:::

3.3 Substitution Matrices and MBR

In addition to creating a new expressive alphabet, Mattei et al. also construct a structural substitution matrix which can be used for a variety of purposes, such as structural alignment. Substitution matrices were originally introduced by Margaret Dayhoff in 1978 [26]. Their purpose is to measure rates of evolutionary mutations between two polypeptides by expressing the relative probability that one amino acid in a multiple sequence alignment will mutate into another. Such matrices have dimensions 20x20, and the value S(i, j) corresponds to the likelihood that, over a certain period of time, residue i in the sequence will mutate into residue j. The two most common substitution matrices in use are PAM and BLOSUM.
PAM (Point Accepted Mutations or Percent Accepted Mutations, depending on the literature) is the matrix introduced by Dayhoff, and is based on global alignments of very closely related amino acid sequences (i.e. <1% divergence). The matrix PAM1 describes the effect on the sequence after 1% of the amino acids have changed. The commonly used PAM250 is equivalent to PAM1 raised to the 250th power. BLOSUM (Blocks Substitution Matrix) was introduced by Henikoff in 1992 [27]. This matrix is based on local alignments that are more distantly related.

A substitution matrix is calculated from a multiple sequence alignment by counting the number of occurrences of each residue, as well as the number of times a specific residue is aligned with another specific residue, including an identical one. The result is then expressed as what is known as a log-odds score:

\[ S(i, j) = \log\left(\frac{\text{observed frequency}}{\text{expected frequency}}\right) \tag{12} \]

The choice of base for this logarithm is immaterial, since it only rescales the scores. A score less than zero indicates that residues i and j were aligned less often than would be expected simply by chance, while a score greater than zero indicates the converse. In this way, we can use substitution matrices to predict how an ancestral sequence might mutate over millions of years, which can help us establish genetic phylogenies.

Mattei et al. use the concept of a substitution matrix in a completely new context. Beginning with a set of highly structured RNA families in the work by Meyer et al. [28], they searched through the database Rfam to find additional highly structured RNA families. Each structure in the RNA families was folded, its sequence was converted to the BEAR alphabet, and each character in its BEAR representation was mapped to its corresponding nucleotide in the multiple sequence alignment.
Then, using the method created by Dayhoff described above, they derived an 83x83 matrix where each value (i, j) denotes the relative probability that a nucleotide involved in a substructure i may, in another structure, be involved in a substructure j. In essence, this creates a completely new kind of substitution matrix that can be used to analyze different structures for many purposes. Mattei et al. explain such uses, as well as validating this matrix, in their paper [25]. This matrix is called the Matrix of BEAR-encoded RNA secondary structures, or MBR.

3.4 Datasets

Because we are interested in comparing this methodology to that of RNAConSLOpt, we selected the same four riboswitches used in the benchmarking process in Li's paper [10]. Specifically, we examined the following riboswitches: (1) the adenine riboswitch from the ydhL gene of Bacillus subtilis, (2) the lysine riboswitch from the lysC gene of Bacillus subtilis, (3) the thiamine pyrophosphate (TPP) riboswitch from the thiamin gene of Bacillus subtilis, and (4) the flavin mononucleotide (FMN) riboswitch from the ribD gene of Bacillus subtilis. For each riboswitch, we obtained five RNA sequences in the family, as well as the canonical ‘on’ and ‘off’ structures and sequences. These were used as input to the software pipeline described in the next section.

3.5 Software Pipeline

We began by creating four files, one for each of the riboswitches. Each file contained five RNA sequences belonging to that family. Each file was given to RNASLOpt as input to find the stable local optimal structures of each sequence. RNASLOpt also requires parameters specifying the acceptable range of structures it should return. Specifically, the user should select a Δp, which specifies the boundary for which no structure having a free energy value greater than p percent away from the MFE structure will be returned, and a ΔB, which specifies the boundary for which no structure having stability less than B kcal/mol will be returned.
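The effect of the two RNASLOpt parameters can be illustrated with the following sketch. It does not reproduce RNASLOpt's code or command-line interface; the function name, the tuple layout, and the reading of "p percent away from the MFE" as an energy cutoff of MFE + |MFE| · p / 100 are assumptions made for illustration.

```python
def filter_slopt(structures, delta_p, delta_b):
    """Keep structures within delta_p percent of the MFE and with a
    barrier (stability) of at least delta_b kcal/mol.

    structures: list of (name, free_energy_kcal, barrier_kcal) tuples.
    Returns the surviving structures sorted from most to least stable.
    """
    mfe = min(energy for _, energy, _ in structures)
    # Assumed interpretation: energies may rise at most delta_p percent
    # of |MFE| above the minimum free energy.
    cutoff = mfe + abs(mfe) * delta_p / 100.0
    kept = [s for s in structures if s[1] <= cutoff and s[2] >= delta_b]
    return sorted(kept, key=lambda s: s[2], reverse=True)


# Hypothetical landscape: (name, energy, barrier).
toy_landscape = [
    ("a", -20.0, 15.0),
    ("b", -18.0, 13.0),
    ("c", -10.0, 20.0),  # too far above the MFE
    ("d", -19.0, 5.0),   # barrier too low
]
survivors = filter_slopt(toy_landscape, delta_p=20, delta_b=12)
```

With Δp = 20 and ΔB = 12, only the two structures that are both energetically close to the MFE and sufficiently stable remain.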
The full software pipeline is detailed in Figure 3. The parameters used for each sequence, as well as the number of structures generated by RNASLOpt, are detailed in Table 1.

Figure 3 - MUSHI Pipeline

RNASLOpt can process multiple sequences at once, and will combine all the results into a single file. Our output was one large file for each riboswitch, with each file containing the local optimal structures for each sequence ranked by free energy, as well as the stable local optimal structures (a subset of the LOpt structures) ranked by stability. At the top of each of these files, we prepended the canonical sequence for each riboswitch, as well as the native on and off structures.

Table 1 - RNASLOpt Parameters and Landscape Sizes

Landscape    Δp (%)   ΔB (kcal/mol)   Number of SLOpt structures
Adenine 0    55       12              4
Adenine 1    55       12              3
Adenine 2    55       12              6
Adenine 3    55       12              3
Adenine 4    55       12              5
Lysine 0     20       12              346
Lysine 1     20       12              728
Lysine 2     20       12              222
Lysine 3     20       12              523
Lysine 4     20       12              96
TPP 0        20       12              91
TPP 1        20       12              135
TPP 2        20       12              18
TPP 3        20       12              14
TPP 4        20       12              36
FMN 0        20       12              316
FMN 1        20       12              523
FMN 2        20       12              666
FMN 3        20       12              216
FMN 4        20       12              718

MUSHI is designed to accept a series of landscapes from within a single riboswitch family. It parses the output of a single run of RNASLOpt on a set of sequences, and stores the native sequence and the on and off structures of the riboswitch, each sequence analyzed by RNASLOpt, and every SLOpt structure along with its energy and stability rankings. After loading all of these landscapes, MUSHI then translates the native structures into BEAR, and performs the structural alignment algorithm described below on each possible pair (a, b), where a is either the on or off native structure, and b is a structure in one of the landscapes given in the file. For each landscape, MUSHI returns the two (possibly identical) structures most closely resembling the on and off native structures, as well as both alignments and alignment scores.
The two structures returned for each landscape are referred to as the target structures.

Next, the user can specify the two landscapes on which to perform structural alignment. Once this information has been specified, MUSHI writes each structure in the landscape into a file in FASTA format. It then sends an instruction to the command prompt to run a program called the BEAR encoder, created by Mattei et al., to translate each structure into the BEAR alphabet. The BEAR encoder performs this translation, and outputs the result into another file in FASTA format. MUSHI sleeps for a sufficient amount of time to allow the BEAR encoder to finish the process, and then accesses the new file containing the BEAR sequences. Then, for each possible pair (a, b) where a ∈ Landscape 1 and b ∈ Landscape 2, it runs the structural alignment algorithm defined by the following recursive function:

\[ S(i, j) = \max \begin{cases} S(i-1, j-1) + MBR(i, j) + BONUS(N_i, N_j) \\ S(i-1, j) + gap \\ S(i, j-1) + gap \end{cases} \qquad BONUS(N_i, N_j) = \begin{cases} bonus & \text{if } N_i = N_j \\ 0 & \text{otherwise} \end{cases} \tag{13} \]

This very simple algorithm is just a slight modification of Needleman-Wunsch to include the MBR as a method for scoring matches and mismatches. Additionally, it incorporates information about the sequences themselves by adding a bonus score when the nucleotide associated with BEAR character i is identical to the nucleotide associated with BEAR character j. Mattei et al. use the exact same algorithm in their paper, but for the purpose of creating multiple sequence alignments. In contrast, we are not necessarily interested in the structural alignment itself, but rather the score of the optimal alignment as an estimation for structural distance. Because this O(n²) algorithm is run N×M times, the running time of the overall procedure of comparing two landscapes is O(NMn²), which is O(n⁴) when the landscape sizes N and M grow on the order of n. It may be possible to optimize this in the future, but for now we are interested in using the most thorough possible method to prove that our concept is valid.
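The recurrence in equation (13) can be sketched as a standard dynamic program. The code below is an illustration, not MUSHI's Java implementation: the real MBR is an 83x83 matrix from Mattei et al., whereas toy_mbr here is a hypothetical stand-in, and initializing the first row and column with cumulative gap penalties is an assumption (the text does not state the boundary conditions). The default gap and bonus values are the ones reported later in the thesis.

```python
def align_score(bear_a, bear_b, seq_a, seq_b, mbr, gap=-0.75, bonus=0.2):
    """Score of the optimal global alignment of two BEAR strings.

    bear_a, bear_b: BEAR-encoded structures.
    seq_a, seq_b:   nucleotide sequences of the same lengths.
    mbr:            function (char, char) -> substitution score.
    """
    n, m = len(bear_a), len(bear_b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: leading gaps are penalized.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            seq_bonus = bonus if seq_a[i - 1] == seq_b[j - 1] else 0.0
            diagonal = S[i - 1][j - 1] + mbr(bear_a[i - 1], bear_b[j - 1]) + seq_bonus
            S[i][j] = max(diagonal, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]


def toy_mbr(x, y):
    """Hypothetical stand-in for the MBR: +1 for identical BEAR
    characters, -1 otherwise."""
    return 1.0 if x == y else -1.0
```

Only the final score S(n, m) is returned because, as noted above, the score rather than the alignment itself serves as the estimate of structural distance; a traceback would be needed to recover the alignment.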
After all the structural alignments have been performed, the scores and alignments themselves are saved in memory. Then, using the stability ranking, energy ranking, and alignment ranking, every possible alignment is ranked according to the following formula:

\[ M(i, j) = \alpha(\text{stability rank}) + \beta(\text{energy rank}) + \gamma(\text{alignment score}) \tag{14} \]

where

\[ \text{stability rank}(i, j) = \frac{size(L_1) - stability(i) + size(L_2) - stability(j)}{size(L_1) + size(L_2)} \tag{15} \]

\[ \text{energy rank}(i, j) = \frac{size(L_1) - energy(i) + size(L_2) - energy(j)}{size(L_1) + size(L_2)} \tag{16} \]

and

\[ \text{alignment score}(i, j) = \frac{\left( \sum_{k=0}^{L} \begin{cases} gap & \text{if } i(k) = \text{'-'} \text{ or } j(k) = \text{'-'} \\ MBR(i(k), j(k)) & \text{otherwise} \end{cases} \right) - \text{min score}}{\text{max score} - \text{min score}} \tag{17} \]

where size(L1) is the number of SLOpt structures in landscape 1, stability(i) is the stability ranking of structure i in landscape 1, energy(i) is the energy ranking of structure i in landscape 1 (and likewise for structure j in landscape 2), L is the length of the alignment between the two structures, i(k) and j(k) are the k-th characters in the first and second rows of the alignment, respectively, and α, β, and γ are parameters dictating the contribution of each addend of the weighted sum.

Each series of structural alignments resulted in a list of NxM alignments, where N and M are the sizes of landscapes 1 and 2, respectively. If the landscapes created by RNASLOpt are too large, the user can specify the percentage of the structures that should be used when comparing two landscapes. These alignments were then ranked by their match score, based on the formula given above.

Next, MUSHI locates and reports the ranking of the match between the two target on structures and the two target off structures between both landscapes, called the target matches. It also returns the top k alignments, where k is a number specified by the user. Finally, MUSHI returns the consensus landscape, defined as shortened versions of the two original landscapes such that every structure in the shortened list of the first landscape has a strong match in the shortened list of the second landscape.
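The weighted sum in equations (14) through (17) can be sketched as follows. This is an illustrative reconstruction rather than MUSHI's code; the function names are hypothetical, the raw alignment score is assumed to have been computed already, and min_score and max_score are taken to be the extreme raw scores over all pairs being compared. The default weights α = 1, β = 1, and γ = 2 are the values used in Chapter 4.

```python
def rank_term(rank_i, rank_j, size1, size2):
    """Equations (15) and (16): combined rank of structure i in
    landscape 1 and structure j in landscape 2, scaled by landscape size."""
    return (size1 - rank_i + size2 - rank_j) / (size1 + size2)


def normalized_alignment(raw_score, min_score, max_score):
    """Equation (17): min-max normalization of a raw alignment score."""
    return (raw_score - min_score) / (max_score - min_score)


def match_score(stab_i, stab_j, energy_i, energy_j, raw_score,
                size1, size2, min_score, max_score,
                alpha=1.0, beta=1.0, gamma=2.0):
    """Equation (14): weighted sum of the three components."""
    return (alpha * rank_term(stab_i, stab_j, size1, size2)
            + beta * rank_term(energy_i, energy_j, size1, size2)
            + gamma * normalized_alignment(raw_score, min_score, max_score))


# Hypothetical pair: landscapes of 4 and 6 structures.
example = match_score(stab_i=1, stab_j=2, energy_i=2, energy_j=1,
                      raw_score=50.0, size1=4, size2=6,
                      min_score=0.0, max_score=100.0)
```

Because each rank term and the normalized alignment score are bounded by 1, the weights directly control how much structural similarity counts relative to stability and energy.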
In this definition of the consensus landscape, ‘strong’ is a relative term which can be defined by the user in multiple ways: the unique structures involved in the top x alignments, the unique structures involved in the top x% of the alignments, or the unique structures involved in the alignments within x% of the maximum possible match score. In this thesis, we use the second method.

CHAPTER FOUR: RESULTS AND DISCUSSION

4.1 Ranks of Target Matches in Complete Pairwise Alignment

As explained in Chapter 3, one of the first steps in the methodology is to establish which structures from both landscapes most closely resemble the canonical on and off riboswitch structures by aligning their BEAR representations with both landscapes. The results of these alignments are detailed in Table 2. For example, in the landscape Adenine 0, RNASLOpt returned 4 structures. The structure most closely resembling the on structure had stability rank 1 and energy rank 2 in its landscape (denoted (1, 2)), and the raw score of its best alignment with the on structure, excluding the sequence bonus, was 62.39.
Table 2 - Results of aligning each landscape with native ‘On’ and ‘Off’ structures

Landscape   # of SLOpt   Rank of Best ‘On’   Raw Alignment Score   Rank of Best ‘Off’   Raw Alignment Score
            Structures   Structure (s, e)    (w/ Native ‘On’)      Structure (s, e)     (w/ Native ‘Off’)
Adenine 0   4            (1, 2)              62.39                 (0, 0)               65.30
Adenine 1   3            (1, 1)              38.74                 (0, 0)               51.15
Adenine 2   6            (4, 5)              38.04                 (0, 0)               38.01
Adenine 3   3            (1, 2)              13.91                 (0, 0)               37.32
Adenine 4   5            (0, 0)              45.85                 (2, 1)               9.5
Lysine 0    346          (21, 30)            148.84                (149, 29)            160.35
Lysine 1    728          (96, 71)            99.46                 (0, 0)               115.22
Lysine 2    222          (159, 14)           69.05                 (158, 47)            67.07
Lysine 3    523          (96, 205)           75.22                 (31, 33)             101.59
Lysine 4    96           (89, 36)            70.63                 (24, 20)             77.61
TPP 0       91           (2, 4)              82.92                 (0, 0)               101.38
TPP 1       135          (99, 14)            48.08                 (0, 0)               86.32
TPP 2       18           (15, 5)             15.54                 (3, 1)               69.31
TPP 3       14           (5, 5)              53.13                 (0, 0)               66.70
TPP 4       36           (12, 30)            42.87                 (8, 1)               22.12
FMN 0       316          (3, 294)            43.95                 (29, 138)            86.49
FMN 1       523          (485, 516)          42.45                 (333, 108)           72.09
FMN 2       666          (285, 241)          28.10                 (386, 233)           19.09
FMN 3       216          (72, 118)           27.97                 (185, 37)            33.12
FMN 4       718          (208, 121)          17.52                 (362, 397)           15.89

The purpose of comparing each structure in the first landscape to each structure in the second is to establish which pairs both share a strong structural resemblance and are also highly stable and energetically favorable. This procedure results in NxM comparisons, where N and M are the sizes of the first and second landscape respectively. Here, we make three key assumptions: (1) RNASLOpt has correctly predicted the on and off structures and they appear in both landscapes, (2) the stability of a molecule and the probability that it is the correct native structure share a positive correlation, and (3) the energy level of a molecule and the probability that it is the correct native structure share a negative correlation.
These are the assumptions underpinning the use of the weighted sum, in which two structures will have a greater score if they strongly resemble each other, and their stability and energy ranks are both high. Ideally, if we then let the target on structures in landscape 1 and landscape 2 be labeled n and m respectively, we should expect that the match score between n and m will appear near the top of the rankings, indicating that these two structures are good candidates for the native structure. The following tables show the ranks of the target matches for each riboswitch with α = 1, β = 1, and γ = 2 in the weighted sum, gap = -0.75, and bonus = 0.2. The method for determining the optimal values of these parameters is explained at the end of section 4.1.

Table 3 - Complete pairwise structural alignment for adenine ‘On’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘On’ Match   Percentile   Alignment Score
(0, 1)                12                     3                           0.7500       42.89
(0, 2)                24                     7                           0.7083       55.78
(0, 3)                12                     4                           0.6667       44.30
(0, 4)                20                     1                           0.9500       54.85
(1, 2)                18                     10                          0.4444       45.59
(1, 3)                9                      5                           0.4444       33.80
(1, 4)                15                     1                           0.9333       48.45
(2, 3)                18                     8                           0.5556       50.27
(2, 4)                30                     3                           0.9000       47.59
(3, 4)                15                     1                           0.9333       55.46

Table 4 - Complete pairwise structural alignment for adenine ‘Off’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘Off’ Match   Percentile   Alignment Score
(0, 1)                12                     1                            0.9167       52.06
(0, 2)                24                     1                            0.9583       51.56
(0, 3)                12                     1                            0.9167       83.96
(0, 4)                20                     3                            0.8500       23.68
(1, 2)                18                     1                            0.9444       77.00
(1, 3)                9                      1                            0.8889       60.57
(1, 4)                15                     5                            0.6667       17.25
(2, 3)                18                     1                            0.9444       74.84
(2, 4)                30                     18                           0.4000       3.48
(3, 4)                15                     3                            0.8000       15.59

In the Adenine riboswitch (Tables 3 and 4), most of the matches involving both targets appear in the top 5 possible alignments. This is not surprising, especially since the off targets already appeared at the top of the landscapes given by RNASLOpt alone.
However, the usefulness of this riboswitch lies in its relatively short sequence length, which allows computations to be carried out thoroughly and rapidly. As shown in Table 2, the fact that the sequences are much shorter resulted in landscapes of dramatically reduced size compared to the other riboswitches used. In order to evaluate the methodology, it was necessary to use riboswitches with larger landscapes.

Table 5 - Complete pairwise structural alignment for lysine ‘On’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘On’ Match   Percentile   Alignment Score
(0, 1)                251888                 4                           0.99998      153.57
(0, 2)                76812                  77                          0.99900      105.12
(0, 3)                180958                 415                         0.99771      98.71
(0, 4)                33216                  4                           0.99988      103.12
(1, 2)                161616                 24                          0.99985      107.28
(1, 3)                380744                 116                         0.99970      121.46
(1, 4)                69888                  706                         0.98990      79.81
(2, 3)                116106                 3290                        0.97166      99.20
(2, 4)                21312                  665                         0.96880      80.04
(3, 4)                50208                  32                          0.99936      132.10

Table 6 - Complete pairwise structural alignment for lysine ‘Off’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘Off’ Match   Percentile   Alignment Score
(0, 1)                251888                 12                           0.99995      142.63
(0, 2)                76812                  1244                         0.98380      96.81
(0, 3)                180958                 3                            0.99998      146.43
(0, 4)                33216                  5                            0.99985      108.31
(1, 2)                161616                 23                           0.99986      93.08
(1, 3)                380744                 1                            0.99999      153.60
(1, 4)                69888                  9                            0.99987      97.41
(2, 3)                116106                 767                          0.99339      95.00
(2, 4)                21312                  151                          0.99291      95.02
(3, 4)                50208                  1                            0.99998      127.37

The landscapes comprising the data set for the Lysine riboswitch (Tables 5 and 6) were much larger, yielding more significant results. Most striking is the fact that, in the majority of comparisons, the target matches were ranked in the 99.9th percentile, even when the original target structures did not appear at the top of the RNASLOpt rankings.
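The Percentile column is not defined explicitly in the text, but the tabulated values appear consistent with one minus the rank divided by the number of alignments. The following sketch assumes that relationship (the function name is hypothetical) and checks it against a few entries from the adenine and lysine tables.

```python
def percentile(rank, n_alignments):
    """Assumed relationship behind the Percentile column:
    the fraction of alignments ranked below the target match."""
    return 1.0 - rank / n_alignments


# Adenine 'On', comparison (0, 1): rank 3 of 12 alignments.
adenine_on_01 = round(percentile(3, 12), 4)

# Lysine 'On', comparison (0, 1): rank 4 of 251888 alignments.
lysine_on_01 = round(percentile(4, 251888), 5)

# Lysine 'On', comparison (0, 2): rank 77 of 76812 alignments.
lysine_on_02 = round(percentile(77, 76812), 5)
```

Under this reading, a target match ranked 4th among roughly a quarter of a million alignments sits above all but a tiny fraction of the candidate pairs, which is why the lysine percentiles are so close to 1.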
Table 7 – Complete pairwise structural alignment for TPP ‘On’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘On’ Match   Percentile   Alignment Score
(0, 1)                12285                  7                           0.9994       132.95
(0, 2)                1638                   264                         0.8388       18.05
(0, 3)                1274                   9                           0.9929       70.67
(0, 4)                3276                   12                          0.9963       83.18
(1, 2)                2430                   1450                        0.4033       20.14
(1, 3)                1890                   153                         0.9190       74.04
(1, 4)                4860                   295                         0.9393       61.06
(2, 3)                252                    94                          0.6270       44.22
(2, 4)                648                    432                         0.3333       14.94
(3, 4)                504                    20                          0.9603       82.41

Table 8 – Complete pairwise structural alignment for TPP ‘Off’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘Off’ Match   Percentile   Alignment Score
(0, 1)                12285                  4                            0.9997       94.67
(0, 2)                1638                   3                            0.9982       75.58
(0, 3)                1274                   4                            0.9969       79.51
(0, 4)                3276                   2                            0.9994       77.97
(1, 2)                2430                   1                            0.9996       106.18
(1, 3)                1890                   2                            0.9989       90.64
(1, 4)                4860                   8                            0.9984       50.87
(2, 3)                252                    1                            0.9960       82.56
(2, 4)                648                    23                           0.9645       24.83
(3, 4)                504                    2                            0.9960       64.48

The results for the TPP riboswitch were similar (Tables 7 and 8), with a majority of the target matches appearing in the 99th percentile of the rankings. However, most of the target matches that did not rank as well were associated with the on native structure. One explanation for this may be that the alignment scores of the best on structures in each landscape with the native on structure were significantly lower than those of their off counterparts. This is due to a fundamental limitation on MUSHI’s performance: a close match with the native structure must appear somewhere in the landscape for MUSHI to identify it.
Table 9 – Complete pairwise structural alignment for FMN ‘On’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘On’ Match   Percentile   Alignment Score
(0, 1)                165268                 162611                      0.0161       52.99
(0, 2)                210456                 52695                       0.7496       34.34
(0, 3)                68256                  7288                        0.8932       80.84
(0, 4)                226888                 39867                       0.8243       26.76
(1, 2)                348318                 243343                      0.3014       22.93
(1, 3)                112968                 62496                       0.4468       94.19
(1, 4)                375514                 184017                      0.5100       33.54
(2, 3)                143856                 21563                       0.8501       63.23
(2, 4)                478188                 421                         0.9991       121.53
(3, 4)                155088                 2834                        0.9817       76.37

Table 10 – Complete pairwise structural alignment for FMN ‘Off’

Comparison (L1, L2)   Number of Alignments   Rank of Target ‘Off’ Match   Percentile   Alignment Score
(0, 1)                165268                 10653                        0.9355       137.43
(0, 2)                210456                 31626                        0.8497       45.63
(0, 3)                68256                  2884                         0.9577       82.26
(0, 4)                226888                 71106                        0.6866       38.83
(1, 2)                348318                 30954                        0.9111       59.17
(1, 3)                112968                 5615                         0.9503       96.63
(1, 4)                375514                 22014                        0.9414       83.16
(2, 3)                143856                 42382                        0.7054       54.81
(2, 4)                478188                 105621                       0.7791       68.62
(3, 4)                155088                 16247                        0.8952       90.93

To our dismay, the results obtained from the FMN riboswitch were significantly worse (Tables 9 and 10). While 14 of the 20 target matches scored above the 74th percentile, this is still nowhere near the level necessary to effectively identify significant structures in either landscape. The best explanation for this is that the alignment scores of the best on and off structures with the native structures are very poor in comparison to all other landscapes. The FMN dataset contained the longest RNA sequences, so if the RNASLOpt results for FMN were on par with the rest of the data, we would expect the alignment scores to be the highest in the set. Instead, the alignment scores from FMN are similar to or worse than those from TPP, indicating that RNASLOpt did not perform as well on the final riboswitch.

Tables 11 and 12 show the rankings of all target matches for varying values of α, β, and γ.
These values were chosen to determine the significance of introducing a distance measure into the ranking formula. They use the following factors: stability alone; energy alone; stability and energy; stability, energy, and similarity; and stability, energy, and similarity with the similarity weight doubled. The value 0.2 was chosen for the bonus parameter per the recommendations of Mattei et al. [25]. The value -0.75 was chosen for gap by experiment, to strike a balance between setting the value too low, so that the alignment avoids all gaps, and setting it too high, so that the alignment overuses gaps.

Based on these tables, the optimal parameters appear to be α = 1, β = 1, γ = 2, gap = -0.75, and bonus = 0.2. In most cases, this resulted in the highest rankings, and improvements over measures such as “stability only” grew significantly with the introduction of the similarity measure, and moderately by double-weighting it. In most of the cases when the optimal parameters were not 1,1,2,-0.75,0.2, the rank assigned by these parameters did not differ significantly from the highest rank for that target match, specifically in the off structures of TPP. This suggests that MUSHI is able to preserve good output from RNASLOpt, while improving moderate output. MUSHI’s poor results on FMN suggest that improvement on poor RNASLOpt results is negligible.
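The role of α, β, and γ can be illustrated with a small Python sketch. The exact form of the weighted sum is defined in Chapter 3 and is not reproduced in this section, so the version below is only an assumed stand-in: the stability, energy, and similarity terms are each taken to be pre-scaled to [0, 1] (higher is better) and combined linearly. All names (`Match`, `match_score`, `rank_matches`) and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    name: str
    stability: float   # assumed: stability rank term in [0, 1], higher is better
    energy: float      # assumed: energy rank term in [0, 1], higher is better
    similarity: float  # assumed: alignment score rescaled to [0, 1]


def match_score(m: Match, alpha: float, beta: float, gamma: float) -> float:
    """Assumed linear weighted sum; not necessarily the thesis formula."""
    return alpha * m.stability + beta * m.energy + gamma * m.similarity


def rank_matches(matches, alpha=1.0, beta=1.0, gamma=2.0):
    """Return matches sorted from best to worst under the given weights."""
    return sorted(matches,
                  key=lambda m: match_score(m, alpha, beta, gamma),
                  reverse=True)


# Two illustrative matches with made-up values.
matches = [
    Match("stable but dissimilar", stability=0.9, energy=0.9, similarity=0.2),
    Match("obscure but similar", stability=0.4, energy=0.5, similarity=0.95),
]

for weights in [(1, 0, 0), (1, 1, 0), (1, 1, 2)]:
    best = rank_matches(matches, *weights)[0]
    print(weights, "->", best.name)
```

Under these made-up values, the similar but low-ranked pair only reaches the top once the similarity term is included, which mirrors the qualitative behaviour described above.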
Table 11 – Complete pairwise alignment rankings of ‘On’ target matches for different values of α, β, γ, gap, and bonus respectively

Comparison (L1, L2)   Number of Alignments   1,0,0,-0.75,0.2   0,1,0,-0.75,0.2   1,1,0,-0.75,0.2   1,1,1,-0.75,0.2   1,1,2,-0.75,0.2
Adenine (0, 1)        12                     5                 7                 6                 4                 3
Adenine (0, 2)        24                     16                22                19                14                7
Adenine (0, 3)        12                     5                 10                7                 7                 4
Adenine (0, 4)        20                     3                 5                 3                 1                 1
Adenine (1, 2)        18                     14                16                15                12                10
Adenine (1, 3)        9                      5                 7                 6                 6                 5
Adenine (1, 4)        15                     3                 3                 2                 1                 1
Adenine (2, 3)        18                     14                18                15                13                8
Adenine (2, 4)        30                     15                19                16                3                 3
Adenine (3, 4)        15                     3                 5                 3                 1                 1
Lysine (0, 1)         251888                 6925              5160              1445              15                4
Lysine (0, 2)         76812                  16312             999               2793              198               77
Lysine (0, 3)         180958                 6925              27746             6975              937               415
Lysine (0, 4)         33216                  6007              2220              1991              20                4
Lysine (1, 2)         161616                 32142             3678              6467              86                24
Lysine (1, 3)         380744                 18625             38282             10620             327               116
Lysine (1, 4)         69888                  13207             5735              4563              1128              706
Lysine (2, 3)         116106                 32239             24249             19570             6960              3290
Lysine (2, 4)         21312                  18904             1320              9572              2857              665
Lysine (3, 4)         50208                  13207             18593             11904             422               32
TPP (0, 1)            12285                  5099              174               1003              46                7
TPP (0, 2)            1638                   156               47                54                135               264
TPP (0, 3)            1274                   31                47                20                9                 9
TPP (0, 4)            3276                   108               598               190               21                12
TPP (1, 2)            2430                   1902              205               949               1220              1450
TPP (1, 3)            1890                   1374              187               692               300               153
TPP (1, 4)            4860                   3390              981               2042              753               295
TPP (2, 3)            252                    195               65                126               97                94
TPP (2, 4)            648                    349               493               465               450               432
TPP (3, 4)            504                    153               405               286               61                20
FMN (0, 1)            165268                 104442            164863            159471            162340            162611
FMN (0, 2)            210456                 41620             119294            67399             58112             52695
FMN (0, 3)            68256                  2854              61117             27465             14048             7288
FMN (0, 4)            226888                 22370             81374             31084             34528             39867
FMN (1, 2)            348318                 261128            255619            283422            265604            243343
FMN (1, 3)            112968                 96641             107605            110331            93551             62496
FMN (1, 4)            375514                 226422            197134            218154            200339            184017
FMN (2, 3)            143856                 54036             54407             43916             29960             21563
FMN (2, 4)            478188                 122057            65872             57172             3620              421
FMN (3, 4)            155088                 37333             28477             19621             6077              2834

Table 12 – Complete pairwise alignment rankings of ‘Off’ target matches for different values of α, β, γ, gap, and bonus respectively

Comparison (L1, L2)   Number of Alignments   1,0,0,-0.75,0.2   0,1,0,-0.75,0.2   1,1,0,-0.75,0.2   1,1,1,-0.75,0.2   1,1,2,-0.75,0.2
Adenine (0, 1)        12                     1                 1                 1                 1                 1
Adenine (0, 2)        24                     1                 1                 1                 1                 1
Adenine (0, 3)        12                     1                 1                 1                 1                 1
Adenine (0, 4)        20                     4                 2                 2                 3                 3
Adenine (1, 2)        18                     1                 1                 1                 1                 1
Adenine (1, 3)        9                      1                 1                 1                 1                 1
Adenine (1, 4)        15                     4                 2                 3                 5                 5
Adenine (2, 3)        18                     1                 1                 1                 1                 1
Adenine (2, 4)        30                     4                 2                 2                 10                18
Adenine (3, 4)        15                     4                 2                 2                 4                 3
Lysine (0, 1)         251888                 11325             458               834               17                12
Lysine (0, 2)         76812                  43687             2976              13960             3529              1244
Lysine (0, 3)         180958                 16440             1994              2239              6                 3
Lysine (0, 4)         33216                  12120             1262              3715              46                5
Lysine (1, 2)         161616                 12562             1129              1491              22                23
Lysine (1, 3)         380744                 497               562               72                1                 1
Lysine (1, 4)         69888                  301               211               49                5                 9
Lysine (2, 3)         116106                 18114             3308              3903              900               767
Lysine (2, 4)         21312                  12984             2333              6136              753               151
Lysine (3, 4)         50208                  1572              1443              432               2                 1
TPP (0, 1)            12285                  1                 1                 1                 1                 4
TPP (0, 2)            1638                   7                 2                 2                 1                 3
TPP (0, 3)            1274                   1                 1                 1                 2                 4
TPP (0, 4)            3276                   37                2                 4                 1                 2
TPP (1, 2)            2430                   7                 2                 3                 1                 1
TPP (1, 3)            1890                   1                 1                 1                 1                 2
TPP (1, 4)            4860                   37                2                 3                 2                 8
TPP (2, 3)            252                    10                3                 2                 1                 1
TPP (2, 4)            648                    70                5                 8                 10                23
TPP (3, 4)            504                    37                2                 4                 2                 2
FMN (0, 1)            165268                 64652             30408             33008             16757             10653
FMN (0, 2)            210456                 81400             67496             59909             41872             31626
FMN (0, 3)            68256                  23035             15421             14913             5889              2884
FMN (0, 4)            226888                 73816             119320            86066             77322             71106
FMN (1, 2)            348318                 238383            58527             127831            62766             30954
FMN (1, 3)            112968                 88699             10682             43086             16154             5615
FMN (1, 4)            375514                 227316            128089            170369            66602             22014
FMN (2, 3)            143856                 100147            35235             63892             51349             42382
FMN (2, 4)            478188                 276614            199137            234749            156409            105621
FMN (3, 4)            155088                 95118             70710             84027             40944             16247

4.2 Comparison of Ranks to RNASLOpt Prediction

Because MUSHI is a post-processing tool used in conjunction with RNASLOpt, it has a fundamental limitation: RNASLOpt should correctly enumerate at least one structure that is reasonably similar to the native structure. In other words, if the correct answer does not exist in the input, it cannot be found.
In this section, we quantify the phrase “reasonably similar.” In addition, we seek to make some general statement about when MUSHI is helpful, and establish a positive correlation between RNASLOpt’s performance and MUSHI’s performance.

It was necessary to establish some metric by which we could estimate how close RNASLOpt had come to finding the correct structure. We decided to compare the alignment score of the structure most similar to the native structure (without any sequencing bonus) to the maximum and minimum possible alignment scores. First, we consider all possible structures in the landscape, as well as their alignment scores with the native on structure. Without loss of generality, we can consider the on and off structures separately. Clearly, we can then sort all structures by their alignment scores, and there will exist one structure with a maximum score and one structure with a minimum score. If we can determine these two scores, we can establish a range, and determine the placement of the SLOpt structure most closely matching the native structure in that range to estimate RNASLOpt’s ability to enumerate the correct structure.

The problem of finding the most dissimilar structure in the landscape can be formalized as follows. Given an alphabet $\Sigma$, a target string $x \in \Sigma^N$, an RNA sequence $m$ of length $M$, a $|\Sigma| \times |\Sigma|$ scoring matrix indicating $d(a, b)$ for each $a, b \in \Sigma$, and bonus and gap parameters, determine the string $y \in \Sigma^M$ meeting the following two conditions:

(1) $y \in L(m)$
(2) $\nexists\, z \in L(m)$ s.t. $\mathrm{MNW}(x, z) < \mathrm{MNW}(x, y)$

where $d(a, b)$ denotes the log-odds score between characters $a$ and $b$, $L(m)$ denotes the landscape induced by RNA sequence $m$, and $\mathrm{MNW}(x, y)$ denotes the score of the best alignment given by the Mattei variant of the Needleman-Wunsch algorithm. This problem statement can be easily altered to search for the most similar string in the landscape. Luckily, we are not interested in the string itself, but in the value $\mathrm{MNW}(x, y)$.
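For readers unfamiliar with global alignment, a quantity of the same kind as MNW(x, y) can be computed with a standard Needleman-Wunsch recurrence using a substitution function and a linear gap penalty. The Python sketch below is plain Needleman-Wunsch, not the Mattei variant: it omits the bonus parameter entirely, and the toy scoring function `toy_score` is a hypothetical stand-in for the MBR substitution matrix.

```python
def nw_score(x, y, score, gap=-0.75):
    """Best global alignment score of x and y (standard Needleman-Wunsch).

    score(a, b) returns the substitution score for characters a and b;
    gap is the (negative) score of a single insertion or deletion.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    n, m = len(x), len(y)
    # Row 0: aligning the empty prefix of x against prefixes of y.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + score(x[i - 1], y[j - 1]),  # substitute or match
                prev[j] + gap,                            # gap in y
                curr[j - 1] + gap,                        # gap in x
            )
        prev = curr
    return prev[m]


def toy_score(a, b):
    """Hypothetical binary scoring, standing in for the MBR matrix."""
    return 1.0 if a == b else -1.0


print(nw_score("abc", "abc", toy_score))  # three matches
print(nw_score("ab", "b", toy_score))     # one gap, one match
```

Each call fills a table of roughly len(x) by len(y) cells, which is the per-alignment cost that is later multiplied across all pairs of structures.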
In the absence of any algorithm that solves this problem directly, estimating the upper and lower bounds of this range should still allow us to make meaningful statements about RNASLOpt’s performance. In the case where the length of the native structure used matches the length of the sequence, it is obvious that the string of BEAR characters that most closely matches the native structure is the BEAR representation of the native structure itself. However, not all of our sequences are the same length. Luckily, because all of the sequences are relatively similar in length, we can still estimate the best possible alignment score by aligning the native structure with itself.

Estimating the lower bound of this range is more difficult. A little logic tells us that the lowest best-alignment score of any structure with the native structure cannot be less than the cost of deleting the entire first sequence and inserting the entire second sequence. This is because insertions and deletions are always an option in the alignment, so any global alignment algorithm will either choose this alignment or choose a better one. This gives us an absolute minimum bound on the similarity score of any structure. Of course, depending on the gap parameter and the MBR itself, it may be the case that no structures will have this arrangement as their best alignment (if, for example, the gap penalty is extremely large), and so we would be overestimating the magnitude of this boundary. However, in the absence of any algorithm that directly solves this problem, this is currently the best estimate that we can use.

In order to find a correspondence between RNASLOpt’s performance and MUSHI’s performance, we used the aforementioned method to estimate the range of the alignment scores of the structures in each landscape to the native structure.
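The two bound estimates described above, together with the placement of a score within the resulting range, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the upper bound ignores the bonus parameter (the text notes the comparison is made without any sequencing bonus), `score` is any substitution function, and all function names and numeric values are hypothetical.

```python
def upper_bound(native, score):
    """Estimate of the best possible score: the native string aligned with itself."""
    return sum(score(c, c) for c in native)


def lower_bound(len_native, len_sequence, gap=-0.75):
    """Worst case: delete all of one string and insert all of the other."""
    return gap * (len_native + len_sequence)


def placement_in_range(best_score, lo, hi):
    """Position of best_score within [lo, hi], as a fraction between 0 and 1."""
    if hi <= lo:
        raise ValueError("upper bound must exceed lower bound")
    return (best_score - lo) / (hi - lo)


# Toy example: a 4-character native string, a 4-nucleotide sequence,
# and a scoring function that awards 1.0 for every identical pair.
hi = upper_bound("abcd", lambda a, b: 1.0)
lo = lower_bound(4, 4)
print(lo, hi, placement_in_range(1.5, lo, hi))
```

Because the lower bound is only guaranteed to be no higher than the true minimum, the resulting fraction can overstate how close a structure is to the native one, as cautioned above.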
We then divided the difference between the alignment score of the most similar structure and the lower bound by the entire range to normalize the score. Finally, for each possible pairing of two landscapes, we averaged this percentile for both the on and off structures and compared it to the percentile rank of the target match in the complete pairwise alignment.

What we expected to find was that, as long as the average RNASLOpt percentile was above a certain threshold, the target match should appear in a high percentile in MUSHI. Around the threshold, performance should start to decrease, and below the threshold, MUSHI should hold no predictive power, so we should not see any clear trends. Figure 4 shows what was actually measured.

In the figure, the line of best fit for the Adenine data set is positive, but the standard deviation from this line is high. This is likely due to the small size of this data set, in which even lowering the rank of the target match by one can have a large effect on its percentile. If we focus on the TPP dataset, we can see that it follows the general trend that we expected to see, with poor performance below the threshold, and high performance above it. Similarly, the structures from the Lysine data set were all roughly at or above the threshold, and so in MUSHI the target matches all appeared at or above the 95th percentile. The FMN dataset is a bit more perplexing. Most of its structures seem to be at or below the threshold, and the trend seen in the previous datasets disappears as expected. However, the points in each dataset overlap more than expected. While the FMN data points well above the threshold score in the 90th percentile in MUSHI, the points only slightly above the threshold did not score nearly as well as the points from other datasets in a similar position along the x axis.
This overlap may be due to the fact that the estimation of the lower bound of the range of alignment scores may not be highly accurate, as discussed previously. The expected trend still exists, but its boundaries are somewhat fuzzy.

Figure 4 - Performance of RNASLOpt compared to performance of MUSHI

Removing Adenine from the graph highlights the trend:

Figure 5 - Performance of RNASLOpt compared to performance of MUSHI (without adenine)

The graph in Figure 5 seems to suggest that the performance of MUSHI on any two landscapes containing structures at least 75% similar to the native structure should be adequate.

4.3 Analysis of Consensus Landscapes

MUSHI should improve the output of RNASLOpt in two ways: (1) it should reduce the number of structures in each landscape, leaving only those that can reasonably be called part of a “consensus”, and (2) it should reorder the remaining structures such that the structures that most closely match the native structures appear closer to the top of the list. These are the two primary metrics by which we will evaluate MUSHI’s performance.

The way in which the results from the complete pairwise alignment can be translated into these two objectives is simple. We scan through the NxM set of ranked pairings until we reach an arbitrary threshold set by the user. For each match we encounter, we add both structures to the consensus if they are not a part of it already. This guarantees that each structure included from landscape A will have at least one strong match from landscape B. In this case, we determine the threshold by only examining the top x% of matches. The percentage for each landscape comparison is different. In the tables throughout this section, the Consensus Threshold column represents the tightest possible lower bound necessary to include all four target structures in the consensus. The bar graphs in this section each correspond to the tables above them, and convey the same information.
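The consensus-building scan described above can be sketched in a few lines of Python. This is an illustrative sketch rather than MUSHI's actual code: the input is assumed to be a list of (i, j) index pairs already sorted from best to worst match, the name `build_consensus` is hypothetical, and using order of first appearance as the new ordering of each landscape is our assumption.

```python
import math


def build_consensus(ranked_matches, top_fraction):
    """Collect the structures that occur in the top fraction of ranked matches.

    ranked_matches: list of (i, j) pairs, best match first, where i indexes
        a structure in landscape A and j a structure in landscape B.
    top_fraction: share of the ranked list to scan, e.g. 0.25 for the top 25%.
    Returns two lists of structure indices in order of first appearance.
    """
    cutoff = math.ceil(top_fraction * len(ranked_matches))
    consensus_a, consensus_b = [], []
    seen_a, seen_b = set(), set()
    for i, j in ranked_matches[:cutoff]:
        if i not in seen_a:          # add each structure only once
            seen_a.add(i)
            consensus_a.append(i)
        if j not in seen_b:
            seen_b.add(j)
            consensus_b.append(j)
    return consensus_a, consensus_b


# Toy example: 3 structures in landscape A, 2 in landscape B, 6 ranked pairs.
ranked = [(2, 0), (0, 1), (2, 1), (1, 0), (0, 0), (1, 1)]
print(build_consensus(ranked, 0.5))
```

In this toy run, structure 1 of landscape A never appears in the top half of the matches, so it is dropped from the consensus, while every retained structure has at least one strong partner in the other landscape.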
For each comparison (L1, L2), the blue bar shows the percentile ranking of the target structure in L1 after only applying RNASLOpt, and the orange bar shows the new rank of the target structure in the landscape of L1 after comparing L1 with L2.

Table 13 - Effect of MUSHI on adenine landscape size

Comparison (L1, L2)   Consensus Threshold (%)   L1 Old Size   L1 New Size   L2 Old Size   L2 New Size
(0, 1)                25.000                    4             2             3             3
(0, 2)                29.167                    4             4             6             4
(0, 3)                33.333                    4             3             3             3
(0, 4)                15.000                    4             2             5             2
(1, 2)                55.556                    3             3             6             5
(1, 3)                55.556                    3             3             3             3
(1, 4)                33.333                    3             3             5             2
(2, 3)                44.444                    6             4             3             3
(2, 4)                60.000                    6             6             5             5
(3, 4)                20.000                    3             2             5             2

Table 14 – Changes in target ‘On’ structures in adenine (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘On’ Old Rank   L1 ‘On’ New Rank   L2 ‘On’ Old Rank   L2 ‘On’ New Rank
(0, 1)                25.000                    1                  1                  1                  2
(0, 2)                29.167                    1                  2                  4                  3
(0, 3)                33.333                    1                  2                  1                  2
(0, 4)                15.000                    1                  0                  0                  0
(1, 2)                55.556                    1                  1                  4                  3
(1, 3)                55.556                    1                  2                  1                  2
(1, 4)                33.333                    1                  0                  0                  0
(2, 3)                44.444                    4                  3                  1                  2
(2, 4)                60.000                    4                  1                  0                  0
(3, 4)                20.000                    1                  0                  0                  0

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 6 - Changes in target 'On' structures in adenine (graphed)

Table 15 – Changes in target ‘Off’ structures in adenine (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘Off’ Old Rank   L1 ‘Off’ New Rank   L2 ‘Off’ Old Rank   L2 ‘Off’ New Rank
(0, 1)                25.000                    0                   0                   0                   0
(0, 2)                29.167                    0                   0                   0                   0
(0, 3)                33.333                    0                   0                   0                   0
(0, 4)                15.000                    0                   1                   2                   1
(1, 2)                55.556                    0                   0                   0                   0
(1, 3)                55.556                    0                   0                   0                   0
(1, 4)                33.333                    0                   2                   2                   1
(2, 3)                44.444                    0                   0                   0                   0
(2, 4)                60.000                    0                   3                   2                   1
(3, 4)                20.000                    0                   1                   2                   1

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 7 - Changes in target 'Off' structures in adenine (graphed)

As
expected, the results from the Adenine group (Tables 13, 14, and 15, Figures 6 and 7) are unimpressive due to the small landscape sizes. Even a match with a high alignment score may rank within the 50th percentile of all matches, so MUSHI was only able to remove a couple of structures from the list, if any. Furthermore, because the target structures are already so close to the top, their upward mobility is limited. In other words, MUSHI is unable to narrow down the landscapes any further because it was designed for larger landscapes.

Table 16 - Effect of MUSHI on lysine landscape size

Comparison (L1, L2)   Consensus Threshold (%)   L1 Old Size   L1 New Size   L2 Old Size   L2 New Size
(0, 1)                0.010                     346           16            728           12
(0, 2)                1.620                     346           108           222           146
(0, 3)                0.229                     346           92            523           81
(0, 4)                0.015                     346           5             96            3
(1, 2)                0.015                     728           9             222           15
(1, 3)                0.030                     728           36            523           35
(1, 4)                1.010                     728           77            96            82
(2, 3)                2.834                     222           204           523           198
(2, 4)                3.120                     222           80            96            66
(3, 4)                0.064                     523           11            96            11

Table 17 – Changes in target ‘On’ structures in lysine (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘On’ Old Rank   L1 ‘On’ New Rank   L2 ‘On’ Old Rank   L2 ‘On’ New Rank
(0, 1)                0.010                     21                 1                  96                 2
(0, 2)                1.620                     21                 0                  159                27
(0, 3)                0.229                     21                 9                  96                 80
(0, 4)                0.015                     21                 3                  89                 1
(1, 2)                0.015                     96                 5                  159                14
(1, 3)                0.030                     96                 31                 96                 28
(1, 4)                1.010                     96                 55                 89                 38
(2, 3)                2.834                     159                75                 96                 78
(2, 4)                3.120                     159                37                 89                 3
(3, 4)                0.064                     96                 10                 89                 4

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 8 - Changes in target 'On' structures in lysine (graphed)

Table 18 – Changes in target ‘Off’ structures in lysine (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘Off’ Old Rank   L1 ‘Off’ New Rank   L2 ‘Off’ Old Rank   L2 ‘Off’ New Rank
(0, 1)                0.010                     149                 6                   0                   9
(0, 2)                1.620                     149                 2                   158                 75
(0, 3)                0.229                     149                 2                   31                  2
(0, 4)                0.015                     149                 4                   24                  2
(1, 2)                0.015                     0                   1                   158                 13
(1, 3)                0.030                     0                   0                   31                  0
(1, 4)                1.010                     0                   4                   24                  5
(2, 3)                2.834                     158                 134                 31                  3
(2, 4)                3.120                     158                 30                  24                  0
(3, 4)                0.064                     31                  0                   24                  0

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 9 - Changes in target 'Off' structures in lysine (graphed)

The results from the Lysine riboswitch were much more promising (Tables 16, 17, and 18, Figures 8 and 9). In all cases, the landscape size was reduced significantly because the target matches were ranked so highly in the NxM alignments that it was possible to draw a tighter consensus. However, because these reductions depend first on the ranks of each target match, they should be considered secondary to how much the rank of each target structure improved, which is the ultimate metric by which these methods should be evaluated. In Table 17, the most apparent result is that, in landscape 0, the best on structure was previously ranked 21, whereas after finding the consensus with each of the four other landscapes, the rank of the structure improved significantly. Further, its best off structure was previously ranked 149, whereas in all cases, its new ranking was 6 or better. Critically, it should be noted that the best off structure in landscape 1 was already ranked 0, and while in some cases MUSHI decreased the ranking of this structure, the magnitudes of the aforementioned increases greatly exceeded those of the decreases, suggesting that MUSHI is able to preserve good results from RNASLOpt while simultaneously increasing the ranking of previously obscure but biologically significant structures.
As another more visual example, consider the Lysine native off structure in Figure 10:

Figure 10 - Lysine native 'Off' structure

After RNASLOpt processes the sequence Lysine 3, the most stable (top-ranking) structure in the landscape it returns is shown in Figure 11:

Figure 11 - Top structure returned by RNASLOpt for sequence lysine 3

However, after comparing the landscapes for Lysine 1 and Lysine 3, and reordering their structures according to the methodology in Chapter 3, the top-ranking structure (previously ranked 31) in the new consensus landscape is shown in Figure 12:

Figure 12 - Top structure returned by RNASLOpt+MUSHI for lysine 3 after comparison with lysine 1

A visual inspection of all three diagrams confirms how much more similar the new top structure looks to the native structure in comparison with the old top structure. However, it is important to note that it is unusual for the target structure to appear at the very top of the rankings in the consensus landscape. Even when results are good, most target structures merely appear near the top, so this example only serves as a visual aid, and is not necessarily representative of the final results.
Table 19 - Effect of MUSHI on TPP landscape size

Comparison (L1, L2)   Consensus Threshold (%)   L1 Old Size   L1 New Size   L2 Old Size   L2 New Size
(0, 1)                0.057                     91            7             135           8
(0, 2)                16.117                    91            34            18            18
(0, 3)                0.706                     91            7             14            4
(0, 4)                0.366                     91            7             36            5
(1, 2)                59.671                    135           113           18            18
(1, 3)                8.095                     135           48            14            14
(1, 4)                6.070                     135           51            36            35
(2, 3)                37.302                    18            17            14            14
(2, 4)                66.667                    18            18            36            34
(3, 4)                3.968                     14            7             36            10

Table 20 – Changes in target ‘On’ structures in TPP (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘On’ Old Rank   L1 ‘On’ New Rank   L2 ‘On’ Old Rank   L2 ‘On’ New Rank
(0, 1)                0.057                     2                  2                  99                 6
(0, 2)                16.117                    2                  9                  15                 7
(0, 3)                0.706                     2                  4                  5                  3
(0, 4)                0.366                     2                  0                  12                 4
(1, 2)                59.671                    99                 61                 15                 11
(1, 3)                8.095                     99                 47                 5                  2
(1, 4)                6.070                     99                 30                 12                 1
(2, 3)                37.302                    15                 7                  5                  5
(2, 4)                66.667                    15                 12                 12                 26
(3, 4)                3.968                     5                  2                  12                 9

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 13 - Changes in target 'On' structures in TPP (graphed)

Table 21 – Changes in target ‘Off’ structures in TPP (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘Off’ Old Rank   L1 ‘Off’ New Rank   L2 ‘Off’ Old Rank   L2 ‘Off’ New Rank
(0, 1)                0.057                     0                   3                   0                   3
(0, 2)                16.117                    0                   1                   3                   2
(0, 3)                0.706                     0                   1                   0                   1
(0, 4)                0.366                     0                   1                   8                   1
(1, 2)                59.671                    0                   0                   3                   0
(1, 3)                8.095                     0                   0                   0                   0
(1, 4)                6.070                     0                   5                   8                   6
(2, 3)                37.302                    3                   0                   0                   0
(2, 4)                66.667                    3                   2                   8                   1
(3, 4)                3.968                     0                   0                   8                   1

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 14 - Changes in target 'Off' structures in TPP (graphed)

The results from the TPP riboswitch (Tables 19, 20, and 21, Figures 13 and 14) were similar to those from the Lysine riboswitch.
In general, target structures that were previously highly ranked stayed highly ranked (with one notable outlier), while lower-ranking structures tended to rise to the top. As seen with the Adenine riboswitch, MUSHI’s ability to reduce the landscape size diminishes as the landscapes get smaller. Specifically, the landscapes of sizes 14 and 18 did not see much reduction.

Table 22 - Effect of MUSHI on FMN landscape size

Comparison (L1, L2)   Consensus Threshold (%)   L1 Old Size   L1 New Size   L2 Old Size   L2 New Size
(0, 1)                98.392                    316           316           523           523
(0, 2)                25.038                    316           316           666           596
(0, 3)                10.677                    316           248           216           205
(0, 4)                31.340                    316           316           718           633
(1, 2)                69.862                    523           523           666           666
(1, 3)                55.322                    523           518           216           216
(1, 4)                49.004                    523           523           718           715
(2, 3)                29.461                    666           579           216           216
(2, 4)                22.088                    666           653           718           702
(3, 4)                10.476                    216           216           718           431

Table 23 – Changes in target ‘On’ structures in FMN (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘On’ Old Rank   L1 ‘On’ New Rank   L2 ‘On’ Old Rank   L2 ‘On’ New Rank
(0, 1)                98.392                    3                  306                485                388
(0, 2)                25.038                    3                  253                285                437
(0, 3)                10.677                    3                  162                72                 147
(0, 4)                31.340                    3                  166                208                104
(1, 2)                69.862                    485                449                285                373
(1, 3)                55.322                    485                497                72                 119
(1, 4)                49.004                    485                515                208                106
(2, 3)                29.461                    285                343                72                 51
(2, 4)                22.088                    285                112                208                34
(3, 4)                10.476                    72                 113                208                132

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 15 - Changes in target 'On' structures in FMN (graphed)

Table 24 – Changes in target ‘Off’ structures in FMN (tabulated)

Comparison (L1, L2)   Consensus Threshold (%)   L1 ‘Off’ Old Rank   L1 ‘Off’ New Rank   L2 ‘Off’ Old Rank   L2 ‘Off’ New Rank
(0, 1)                98.392                    29                  88                  333                 153
(0, 2)                25.038                    29                  121                 386                 334
(0, 3)                10.677                    29                  46                  185                 91
(0, 4)                31.340                    29                  59                  362                 372
(1, 2)                69.862                    333                 238                 386                 259
(1, 3)                55.322                    333                 200                 185                 30
(1, 4)                49.004                    333                 88                  362                 439
(2, 3)                29.461                    386                 287                 185                 38
(2, 4)                22.088                    386                 79                  362                 151
(3, 4)                10.476                    185                 61                  362                 175

[Bar chart. x-axis: Comparison (all ordered pairs of landscapes); y-axis: Ranking (Percentile), 0 to 1; series: RNASLOpt, RNASLOpt + MUSHI.]
Figure 16 - Changes in target 'Off' structures in FMN (graphed)

Not surprisingly, the FMN riboswitch fared the worst (Tables 22, 23, and 24, Figures 15 and 16). The landscapes saw almost no reduction in size, and the ranking of target structures often decreased dramatically in the consensus, especially the on structure in landscape 0. As discussed in the previous section, this suggests that MUSHI’s ability to extract significant results is limited by the ability of RNASLOpt to correctly predict native structures.

Finally, Figure 17 shows the average effect of MUSHI on the sizes of all landscapes except those from Adenine, which are omitted because of their small size. The blue bars indicate the old size, the orange bars indicate the average new size, and the markings on the orange bars indicate the range of values included in each average. Once again, we see that, even using an arbitrary value for the consensus threshold, MUSHI is able to reduce the size of the landscapes significantly.

[Bar chart. x-axis: Landscapes (Lysine 0 to 4, TPP 0 to 4, FMN 0 to 4); y-axis: Number of Structures, 0 to 800; series: Old Size, New Size.]
Figure 17 - Average effect of MUSHI on landscape size

4.4 Benchmarking

As discussed in Chapter 2, RNAConSLOpt also seeks to find the consensus landscape between multiple structures. In this section, we compare the performance of MUSHI to that of RNAConSLOpt. We used the same datasets in both cases, preparing a multiple sequence alignment between all the sequences belonging to each riboswitch, and using the same values for Δp and ΔB given in Table 1.
RNAConSLOpt takes the multiple sequence alignment and user-defined parameters and outputs a list of structures sorted by stability. Using the same methodology used in MUSHI, we aligned each structure with the native structures to determine the closest matches. Table 25 details the results.

Table 25 - Performance of RNAConSLOpt on riboswitches from B. subtilis

Riboswitch   # of ConSLOpt Structures   Rank of Target ‘On’   Alignment Score of Target ‘On’ with Native   Rank of Target ‘Off’   Alignment Score of Target ‘Off’ with Native
Adenine      2                          1                     35.9                                         0                      38.94
Lysine       3                          1                     74.31                                        0                      110.72
TPP          5                          4                     45.02                                        0                      35.01
FMN          49                         8                     28.42                                        13                     68.4

As seen in the table, RNAConSLOpt has the ability to return a much smaller set of structures than that returned by RNASLOpt, and the structures most closely resembling the native structures appear near the top of each list. However, RNAConSLOpt differs from MUSHI in the breadth of solutions provided. The structures most closely resembling the native structures in the landscapes returned by RNAConSLOpt have an alignment score which is near the average of the analogous structures returned by RNASLOpt. Essentially, RNAConSLOpt will return an “average” of the structures allowed by the multiple sequence alignment, while the results of MUSHI have much more variance. Depending on which two landscapes are combined, MUSHI can produce very good or very bad results, whereas RNAConSLOpt’s results are more temperate.

The figures below show this comparison. Each circular node in the graph indicates a target structure in one landscape returned by RNASLOpt, while the triangular node indicates the score of the target structure returned by RNAConSLOpt. As shown in Figures 18 and 19, each of the target structures from RNAConSLOpt is roughly in the middle of the range of scores of its analogous structures in RNASLOpt, thus indicating the difference in breadth of solutions given by the two programs.
[Plot. Categories: Adenine, Lysine, TPP, FMN.]
Figure 18 - Structural similarity of RNASLOpt and RNAConSLOpt 'On' target structures to native 'On' structure

Figure 19 - Structural similarity of RNASLOpt and RNAConSLOpt 'Off' target structures to native 'Off' structure

These graphs combined with the tables in 4.3 seem to indicate that MUSHI can sometimes obtain results better than RNAConSLOpt, but only if we know in advance which two landscapes to combine. Combining the wrong landscapes may yield worse results, and on average MUSHI yields comparable results. This may come as no surprise, as the methodology of MUSHI is essentially the inverse of RNAConSLOpt: MUSHI folds the sequences and then aligns them, whereas RNAConSLOpt aligns the sequences and then folds them.

CHAPTER FIVE: CONCLUSIONS AND FUTURE WORK

5.1 Conclusions, Advantages, Disadvantages, and Limitations

The data collected during this study suggest that when the output of RNASLOpt is of moderate to good quality, MUSHI can significantly increase the rankings of the correct target structures. One of the advantages of using MUSHI to find the consensus landscape is that it is thorough. All possible pairs of structures are considered, so all strong matches will naturally gravitate toward the top. The results are most dramatic when a low-ranking target structure from one landscape matches a high-ranking structure from another landscape. Although the match will incur a penalty due to the low ranking of the first structure, the fact that structural similarity counts double in the weighted sum limits the effect of such a penalty.

An additional advantage is that MUSHI uses BEAR+MBR to estimate structural distance, which Mattei et al. just introduced in 2014. To the best of our knowledge, this is one of the only major uses of this methodology, giving MUSHI an advantage over other systems such as RNAdistance [29].
RNAdistance performs global alignments between sequences encoded with dot-bracket notation, which we have argued is less accurate than BEAR. RNAdistance is capable of translating each dot-bracket string into a more “coarse-grained” representation using an alphabet of six characters: H (hairpin loop), I (internal loop), B (bulge), M (multiloop), S (stack), and E (external/unpaired). This is very similar to the alphabet we originally envisioned before discovering BEAR, and we have already discussed the advantages BEAR has over this primitive alphabet. Additionally, the MBR substitution matrix is an invaluable tool in these alignments, as it allows similar but non-matching structures to be aligned based on empirical data describing their mutation rates, whereas traditional alignment methods tend to use a binary score which only rewards exact matches. Furthermore, RNAdistance uses a metric called “base pair distance”, which is calculated by counting the base pairs which are in one structure but not in the other. This metric only makes sense when both structures are the same length, as two base pairs can then be considered identical if their starting and ending indices match. Because we must work with sequences of different lengths, however, this metric is not useful to us, and indeed Hofacker et al. recommend against using RNAdistance for this purpose.

Naturally, there are several disadvantages to using MUSHI, the most obvious of which is the trade-off between thoroughness and speed. Because MUSHI must perform global alignment on all possible pairs of structures, it is very slow, running in O(N*M*A*B) time, where N is the size of landscape one, M is the size of landscape two, A is the length of RNA sequence one, and B is the length of RNA sequence two. For example, the comparison between the FMN 2 and FMN 4 landscapes takes over an hour on a computer with an Intel Core i5-4200U CPU @ 1.60 GHz and 6 GB of RAM.
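To make the source of this cost concrete, the following Python sketch performs a global (Needleman-Wunsch [1]) alignment for every pair of encoded structures drawn from two landscapes. It is a minimal illustration and not MUSHI's actual implementation: the function names, the linear gap penalty, and the toy match/mismatch scoring function are assumptions introduced here. In practice the scoring function would be a lookup in the MBR substitution matrix over BEAR-encoded strings. The two nested loops over the landscapes, each wrapping an alignment that fills an A-by-B table, give the O(N*M*A*B) behavior described above.

```python
from typing import Callable, List, Tuple


def global_alignment_score(a: str, b: str,
                           sub: Callable[[str, str], float],
                           gap: float = -2.0) -> float:
    """Needleman-Wunsch score with a linear gap penalty (assumed here).

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(prev[j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                          prev[j] + gap,                           # gap in b
                          curr[j - 1] + gap)                       # gap in a
        prev = curr
    return prev[len(b)]


def toy_sub(x: str, y: str) -> float:
    """Placeholder for a substitution-matrix lookup (hypothetical scores)."""
    return 1.0 if x == y else -1.0


def all_pairs(landscape_one: List[str], landscape_two: List[str],
              sub: Callable[[str, str], float],
              gap: float = -2.0) -> List[Tuple[int, int, float]]:
    """Score every pair of encoded structures and sort best-first."""
    scores = []
    for n, s in enumerate(landscape_one):        # N structures
        for m, t in enumerate(landscape_two):    # M structures
            scores.append((n, m, global_alignment_score(s, t, sub, gap)))
    scores.sort(key=lambda item: item[2], reverse=True)
    return scores
```

Because the substitution function is passed in as a parameter, the same loop structure would apply unchanged if an empirical matrix replaced the toy scores; only the per-cell lookup differs.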
In order to improve on this run time, it would be necessary to find a faster way to compare two strings. While some fast alignment-free heuristics exist for string comparison, it is not immediately apparent how to modify them so that they make use of the substitution matrix, so we chose the most straightforward method to prove the concept.

A second and perhaps fundamental disadvantage is that, in its current form, what MUSHI returns is arguably different from the actual consensus landscape. Because the problem we set out to solve was to improve the prediction of the native structures of riboswitches, we defined the consensus as those structures included in a match whose total match score was above some arbitrary percentile. While this method produced some moderately good results, we realized that, if two perfect target structures were at the bottom of their respective landscapes, they would appear in the 50th percentile in the complete pairwise structural alignment phase, and therefore would likely not appear in the consensus if we set any meaningful discriminatory threshold. This could be solved by altering the methodology to rank the matches using only structural similarity information in the complete pairwise structural alignment, determine the consensus by setting a threshold, and then use stability and energy information to rank the structures within that threshold. In other words, stability and energy should not be factors for entry into the consensus. We are currently working on adapting MUSHI to this new methodology, at which point it can be said that MUSHI is able to find the actual consensus landscape. Fortunately, because the results we obtained from the original methodology were good enough to be significant, we kept them for the purposes of this thesis. We expect slight improvements in the results after the new methodology is implemented.

Another disadvantage of MUSHI is its accuracy in terms of scoring structures based on their stability and energy.
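The revised methodology proposed above for entering the consensus (rank by structural similarity alone, apply the threshold, then order the survivors by stability and energy) can be sketched as follows. This is only an illustration under stated assumptions: the Match record, its field names, the keep_fraction threshold, and the single combined stability-and-energy score are hypothetical and are not taken from MUSHI's code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    a: int             # index of the structure in landscape one
    b: int             # index of the structure in landscape two
    similarity: float  # structural similarity, higher is more similar
    se_score: float    # hypothetical combined stability/energy score, higher is better


def two_stage_consensus(matches: List[Match], keep_fraction: float) -> List[Match]:
    """Stage 1: admit matches by similarity only. Stage 2: rank by se_score."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not matches:
        return []
    # Stage 1: stability and energy play no role in entering the consensus.
    by_similarity = sorted(matches, key=lambda m: m.similarity, reverse=True)
    keep = max(1, math.ceil(keep_fraction * len(by_similarity)))
    consensus = by_similarity[:keep]
    # Stage 2: only now are stability and energy used, to order the survivors.
    return sorted(consensus, key=lambda m: m.se_score, reverse=True)
```

With this ordering of steps, a structurally strong match with poor stability and energy still enters the consensus and is merely ranked lower within it, while a structurally weak match cannot enter on the strength of its stability and energy alone.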
Because RNASLOpt outputs a stability and energy ranking for each structure, MUSHI uses the rankings to determine the percentile to which that structure belongs in the weighted sum. Essentially, we assume that energy increases linearly as we move farther down in the rankings. This is not the case, however. When using RNASLOpt, as we increase Δp at a constant rate, we find that the number of new structures returned at each step continues to increase. This means that, in a landscape of size 100, the energy difference between structure 0 and structure 49 will be much greater than the difference between structure 49 and structure 99. The percentile-based ranking, however, assumes these two intervals are equal. This creates a bias against low-ranking target structures because it assumes middle-ranking structures are more viable candidates than they really are, and scores them significantly higher. This can be fixed by using the absolute free energy level and stability level of each structure. While the free energy for each structure can be extracted from RNASLOpt’s output, RNASLOpt currently provides no information on the exact stability of each structure aside from ranking them. If we could modify RNASLOpt to provide us with this information, we could modify MUSHI to make use of it, and improve the accuracy of our results on datasets such as FMN.

While fixing these disadvantages could improve the results of MUSHI, there are a few fundamental limitations that may not be fixable without adopting an entirely different approach. One of these limiting factors is that stability and energy may not be categorically good. While simple intuition can tell us that the correct native structures should be stable and energetically favorable, it may be the case that these factors are only important up to a certain threshold, and beyond this threshold, they hold little predictive power.
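The bias introduced by percentile-based ranking, as described above, can be illustrated with a small numerical sketch. The function names are hypothetical, and the free energies are made-up values chosen only so that the gaps between consecutive structures shrink farther down the ranking, as the text describes. Only free energy is shown, since RNASLOpt does not report exact stability values.

```python
from typing import List, Tuple


def rank_position(rank: int, size: int) -> float:
    """Position implied by rank alone: 0.0 for the top structure, 1.0 for the last."""
    if size < 2:
        return 0.0
    return rank / (size - 1)


def energy_position(energy: float, best: float, worst: float) -> float:
    """Position implied by absolute free energy: 0.0 at the best energy, 1.0 at the worst."""
    if worst == best:
        return 0.0
    return (energy - best) / (worst - best)


def compare(energies: List[float]) -> List[Tuple[int, float, float]]:
    """For energies sorted most favorable first, return (rank, rank-based, energy-based)."""
    best, worst = energies[0], energies[-1]
    return [(rank, rank_position(rank, len(energies)), energy_position(e, best, worst))
            for rank, e in enumerate(energies)]


# Illustrative free energies only (not taken from any RNASLOpt run).
example = compare([-30.0, -20.0, -15.0, -13.0, -12.0])
```

In this toy landscape the middle structure sits halfway down by rank (0.5) but about 83% of the way toward the worst energy, which is the sense in which a rank-based score treats middle-ranking structures as more viable than their energies suggest.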
As long as the biological structures are stable enough, there might be no additional need for stability, so further ranking SLOpt structures based on these factors might not help us, at which point we would need to determine some additional predictive metrics.

A second limiting factor is that gauging the significance of structural similarity is difficult. While it is certainly true that, if the correct native structure appears in both landscapes, the two occurrences will be structurally similar, there may be other structures that resemble each other simply as a consequence of sequence identity. A fundamental question still remains: at which level of sequence identity do folding landscapes diverge? A quantitative answer to this question would be extremely difficult to derive, due to the immense size of the landscape and the inherent difficulty in visualizing it. However, if there is a threshold sequence identity above which two landscapes still more or less resemble one another, the consensus landscape will be large, and structural similarity is likely to bear less significance for predicting native structures. Once again, the predictive power of our data would be limited.

Another limitation is that MUSHI is unable to predict the native structures exactly, instead co-opting the problem of finding a consensus landscape to aid in the prediction. While expanding MUSHI to find a consensus between more than two landscapes (discussed in Section 5.2) could narrow the list of candidates even further, the methodology, like that of RNASLOpt, is simply not designed to choose a single pair of structures and propose them as predictions.

A final fundamental limitation of MUSHI is that, in practice, we cannot reliably predict where to place the consensus threshold. In Chapter 4, we chose the tightest possible boundary that would include the target matches in the consensus, but we had prior knowledge of where the target matches ranked.
Without this knowledge, we cannot be confident that the target matches will be in the consensus. Then again, they should only appear in the consensus if they resemble the actual native structures. The nature of this problem requires us to work in a gradient, which makes perfect discretization of the problem almost impossible.

5.2 Future Work and Alternative Approaches

In order to improve MUSHI’s performance, we first plan on correcting the drawbacks and disadvantages discussed in the previous section. Specifically, we aim to speed up MUSHI by utilizing linear-time string comparison algorithms, alter the methodology to remove stability and energy as factors for entering the consensus, and replace the percentile-based ranking system with actual values for free energy and stability.

Another of MUSHI’s drawbacks is that the performance can vary depending on which landscapes are compared. In order to get around this, we can adapt MUSHI to find a consensus between more than two landscapes. This can be achieved for three landscapes A, B, and C by performing all possible comparisons AB, AC, and BC, and deriving two abridged versions of each landscape. These can then be merged via some arbitrary process, such as only returning the structures appearing in both abridged versions, as these are likely to be the most significant. In reality, it would be necessary to use a more sophisticated process to determine which structures are in the final consensus, as this process would still be subject to the limitations discussed earlier, such as working in a gradient.

Alternatively, one might try approaching the problem from a different perspective, such as decomposing the landscape into smaller elements, similar to GraphClust [22]. One such decomposition could be a network of stack structures. Each unique stack appearing in a sequence’s landscape is defined by its starting and ending points, and its length.
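As a minimal sketch of this decomposition, the following Python code extracts stacks from dot-bracket strings and collects the unique stacks of a landscape. The representation is an assumption made for illustration: a stack is recorded as (start, end, length), where (start, end) is its outermost base pair and length is the number of directly nested pairs (start + k, end - k). The function names are hypothetical, and an isolated base pair is treated as a stack of length one.

```python
from typing import List, Set, Tuple

Stack = Tuple[int, int, int]  # (start, end, length), an assumed encoding


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return the (i, j) base pairs of a dot-bracket string, sorted by i."""
    open_positions, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((open_positions.pop(), i))
        # any other character ('.', '-') is treated as unpaired
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def stacks(structure: str) -> List[Stack]:
    """Group directly nested base pairs into stacks."""
    pairs = set(base_pairs(structure))
    result = []
    for i, j in sorted(pairs):
        if (i - 1, j + 1) in pairs:
            continue  # this pair is interior to a stack that starts further out
        length = 1
        while (i + length, j - length) in pairs:
            length += 1
        result.append((i, j, length))
    return result


def landscape_stacks(structures: List[str]) -> Set[Stack]:
    """Collect the unique stacks appearing anywhere in a landscape."""
    unique: Set[Stack] = set()
    for s in structures:
        unique.update(stacks(s))
    return unique
```

For example, the structure `((..((...))..))` decomposes into the two stacks (0, 14, 2) and (4, 10, 2) under this encoding.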
Each stack could be represented by a node in the graph, and directed edges could indicate that one stack encloses another stack in at least one structure. Edge labels could specify exactly which structures contain such an enclosure. This network would represent how stacks relate to one another, and could also be unambiguously converted back into the original landscape. Gathering statistics about this network could lead to important insights, but even more interesting is the prospect of performing balanced global network alignment on two of these graphs. In Chapter 1, we discussed how this would be infeasible for entire energy folding landscapes due to their immense size, but these substructural networks would only consist of a few hundred or a few thousand nodes, and so the algorithms discussed by Zaslavskiy et al. [11] would now be applicable. An alignment between two substructural graphs could indicate a more significant correlation than what could be uncovered by MUSHI, because comparisons in MUSHI necessarily involve entire structures. In an abstract sense, this new methodology would increase the resolution at which we can make meaningful comparisons, allowing us to pick out interesting details instead of having to always look at the bigger picture.

Finding the consensus landscape is still a novel problem in structural bioinformatics, and this thesis merely scratches its surface. It is our hope that more researchers will take interest in the problem in the coming years, and use it to uncover critical information that will change our understanding of RNA secondary structure.

APPENDIX: RNA SEQUENCES AND STRUCTURES USED

The following sequences and structures are taken from the benchmarking set used in Li et al.’s paper on RNAConSLOpt [10]. They can be accessed at the following link: http://genome.ucf.edu/RNAConSLOpt/Benchmarks.txt

Each RNA sequence is labeled with the reference used in this thesis (e.g.
Adenine 0), followed by its accession number. Adenine 0: >D88802.1 AUUAUCACU-UGUAUAACCUCAAUAAUAUGGUUUGAGGGUGUCUACCAGGAACCGUAAAAUCCUGAUUACA AAAUUUGUUUAUG-ACAUUUUUUGUAAUCAGGAUUUU Adenine 1: >AAXV01000018.1 AUUUGAAC—UGUAUAACCUCAAUAAUAUGGAUUGAGGGUCUCUACCAGGAACCAUAAAAUCCUGACUACAA AA----CUUUGU-UUCAUUUUUGUAGUCAGGAUUUU Adenine 2: >AAEK01000052.1 -UGAGAAUCAUGUAUAACUCCAAGAAUAUGGCUUGGGGGUCUCUACCAGGAACCAAUAACUCCUGACUACA AAAU--GCGUAUU-AUAGCGUUUGUAGUCAGGAGUUU Adenine 3: >BA000016.3 AUUUUGCUU-CGUAUAACUCUAAUGAUAUGGAUUAGAGGUCUCUACCAAGAACCGAGAAUUCUUGAUUACG AAGAAAGCUUAUUUGCUUUCUUCGUAAUCAAGAAUU- Adenine 4: >CP000851.1 -UUAACACUUCGUAUAAUCUCAAUGAUAUGGUUUGAGAGUUUCUACCAAGAGCCCUAAACUCUUGAUUAUG AAGACUUUACUUU-AUGUAAUGCUAAUUUAACAAGUU Native functional structures from the adenine riboswitch of the ydhL gene of Bacillus subtilis: AUUAUCACU-UGUAUAACCUCAAUAAUAUGGUUUGAGGGUGUCUACCAGGAACCGUAAAAUCCUGAUUACA AAAUUUGUUUAUG-ACAUUUUUUGUAAUCAGGAUUUU ........(-((((...((((((.........))))))........(((((.........)))))..)))) )....((((...)-))).................... 
.........-.......((((((.........))))))..................((((((((((((((( (((..((((...)-)))..)))))))))))))))))) Lysine 0: >J03294.1 GAAGAUAGAGGU-GCGAACUUCAAGAGUAUGCCUUUGGAGAAAGAUGGAUUCUG-UGAAAAAGGCUGAAAG GGGAGCGUCGCCGAAGCAAAUAAAACCCCAUCGGUAUUAUUUGCUGGCCGUGCAUUGAAUAAAUGUAAGGC UGUCAAGAAAUCAUUUUCUUGGAGGGCUAUCUCGUUGUUCAUAAUCAUUUAUGAUGAUUAAUUGAU—AAGC AAUGAGAGUAUUCCUCUCAUUGC 89 Lysine 1: >CP000002.3 GAAGAUAGAGGUGCGAACUUCAAGAGUAGGCUUGAUGAGGAAGAUGGAUUCCGAUGAAGAAAGCCGAAAGGGGAGCGUCGCCG AAGCGGGGAAAAAUCCACUCGUUUUUCCUGCUGGCUUUACAUUGAAUAAAUGUGAGGCUGUCAAGAAAUCA UUU-CUUGGAGAGCUAUCUCGUUGUUUAAGAUCAUCGGCAUU—UUUGUUGGUUAAAGCAAUGAGGGAAUUC -UCUCGUUGC Lysine 2: >AAOX01000015.1 GAAGGUAGAGGU-GCAAACUUCAUCAGUAAAAGCUUGGAGAAAGAUGAGUUUCCGUGAAAAGCUUUGAAAG GGAAUGUUUGCCGAAGAAAAGGAAGUCUCAUUU-CUUUCUUUUCUGGUCCUGUAUUGAAUAAAUACUGGAU UGUCAAGACAGCGCCGUCUUGGAGAGCUAUCUCACUGUGUGGGCAUAUUU-UAUAUGUAUUUAAAACACAG CAAUGGGAUGGUUAUUCUCAUUG- Lysine 3: >M93419.1 GAAGAUAGAGGU-GCGAACUUCAUCAGUAAAAGCUUGGAGAAGAAUGAGCUUCAAUGAAAAGCUUUGAAAG GGAACGUUCGCCGAAGUGAAGAAAAACUCAUUU-UUUUCUUUGCUGGUCCUGCAUUUAAGAGAUGCCGGAU UGUCAAGGCGGUGCCGCCUUGGAGAGCUAUCUCACUGUGUCUGCGUAUUU-UAC---UACGUUAUCCACAG CAAUGAGGUAGCU-UUCUCAUUGC Lysine 4: >CP001186.1 AAAGGUAGAGGCCGCGAUAGGAAAGAGUAAGCUAUGGGAGAUUUAAUGGAAUCUGUGAUCAUAGGUUGAAA GGGACUAUUGCCGAAAUAUAAGAAUAACCAUCUUAUUCAUAUAUUGGGACUACAUUGAAUAAAUGUAGUAC UGUCAUAAGAUUUAUUUUAUGGAGAGCUAUUUGGAGAUGUUGAUGCGGUUUCUUA—UUUUGAGGAGAUAAC AACUCGUUUAUU-UUUUCAAUAU Native functional structures from the lysine riboswitch of the lysC gene of Bacillus subtilis: GAAGAUAGAGGU-GCGAACUUCAAGAGUAUGCCUUUGGAGAAAGAUGGAUUCUG-UGAAAAAGGCUGAAAG GGGAGCGUCGCCGAAGCAAAUAAAACCCCAUCGGUAUUAUUUGCUGGCCGUGCAUUGAAUAAAUGUAAGGC UGUCAAGAAAUCAUUUUCUUGGAGGGCUAUCUCGUUGUUCAUAAUCAUUUAUGAUGAUUAAUUGAU—AAGC AAUGAGAGUAUUCCUCUCAUUGC ..((((((..(.-((((.((((........((((((...(((.......)))..-....))))))...... 
.))))..)))))..(((((((((.(((.....))).)))))))))((((.((((((.....)))))).))) ).((((((((....))))))))....))))))......((((((((((.....)))))))..))).--..( (((((((((.....)))))))))) .....(((..(.-((((.((((........((((((...(((.......)))..-....))))))...... .))))..)))))..(((((((((.(((.....))).)))))))))((((.((((((.....)))))).))) ).((((((((....))))))))....)))(((((((((((((((((((.....)))))))..))..--.)) ))))))))................ TPP 0: >AL009126.3 GCAGAACAAUUCAAUA-UGUAUUCGUUUAACCACUAGGGGUGUCCUUCAUAAGGGCUGAGAUAAAAGUGUG --ACUUUUAGACCCUCAUAACUUGAACAGGUUCAGACCUGCGUAGGGAAGUGGAG-CGGUAUU—UGUGUUA UUUUACUAUG---CCAAUUCCAAACCACUUUUCCUUGCGGGAAAGUGGUUU TPP 1: 90 >CP000002.3 AUAAACUGAAUGAACA-AGAAAUGUUUU—CCACUAGGGGAGUCCUUGAUAAGGGCUGAGAUAAAAGUUUG— ACUUUUAGACCCUCAUAACCUGAACAGGUUCAAACCUGCGUAGGGAAGUGGCA-CGGUAUU--UGAGU-AU GUAUAUAUG---CAAAUUCCAAACCACUUU-CCUUGCGGGAAAGUGGUUU TPP 2: >CP000764.1 -ACUAUCAAAACUAUAUAGUUCUCAUCUAUCCACUAGGGGUGCCGAU-AUU—GGCUGAGAUUAAAGUUUA-UCUUUGAGACCCUUAGUACCUGAUCUGGUUCGUACCAGCGUAGGGAAGUGGAAAUGACAA---AAUAU GAUAUUUAUAUA---UCUAGGCCACUUUCUUUAC-CUACUAAGGAAGUGGCUU TPP 3: >AAEK01000033.1 GAAUA-CAAUACGAAA-AUUAAAUAUUUAUCCACUAGGGGGGCCUAUUAUA--GGCUGAGAUCAAA-UGGG —AAUUUGAGACUCUUAGUACCUGAUCUGGUUAAUGCCAGCGUAGGGAAGUGGAAAAGACAUUGCUAUUUCA UGUAUAAAUACUGUCAAUUUCACUUUCUUUACGCCUGUAAAGAAA------ TPP 4: >ABCF01000004.1 -AAUGCAUAUAAAUAAAUAGCCGAGCAAAACCACUGGGGGAGCCUUUUAAA—GGCUGAGAUUAAAGUGUAC UACUUUAAGACCCUUUGAACCUGAUCUAGUUCAUACUAGCGGAGGGAAGUGUAGUCUGAAUGAUUGAAU-A UUUCAAAAAAUCACUCAAGCCGCCUUCCCGU-GCAAUCAGGGAGGCG---- Native functional structures from the TPP riboswitch of the thiamin gene of Bacillus subtilis: GCAGAACAAUUCAAUA-UGUAUUCGUUUAACCACUAGGGGUGUCCUUCAUAAGGGCUGAGAUAAAAGUGUG --ACUUUUAGACCCUCAUAACUUGAACAGGUUCAGACCUGCGUAGGGAAGUGGAG-CGGUAUU—UGUGUUA UUUUACUAUG---CCAAUUCCAAACCACUUUUCCUUGCGGGAAAGUGGUUU ...(((((((.(....-.).))).))))..(((((.(((((((((((...)))))).....((((((.... --.)))))).))))).....((((..(((((....)))))..))))..)))))..-.(((((.--.(((.. 
....))).)))---))......(((((((((((((....))))))))))))) ................-...........(((((((.(((((((((((...)))))).....((((((.... --.)))))).))))).......((((....))))..((((((..(..((((((..-.((.(((--.((((. ........)))---).))).))...))))))..)..))))))..))))))). FMN 0: >X51510.1 UAUCCUUCGGGGCAGGGUGGAAAUCCCGACCGGCGGUAGUAAAGCACAUUUGC—UUUAGAGCCCGUGACCC GUGUGC----AUAAGCACGCGGUGGAUUCAGUUUAA-GCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGAA GGAUG---AUGAGCCGCUAUGCAAAAUGUU-UAAAAAUGCAUAGUGUUAUUUCCUAUUGCGUAAAAUACCU AAAGCCCCGAAUUUUUUAUAAAUUCGGGGCUUU FMN 1: >X95955.1 UAUCCUUCGGGGCUGGGUGAAAAUCCCGACCGGCGGUAAUAAGGCGCUCCUGCGCUUUACAGCCCGUGACC CGUAUGC----AUCUGUAUACGGUGGAUUCAGUGAAAAGCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGA AGGAUG---A-GAGAAGCUAUGCAAAAAAUAAUCAUACUGUAUAGUCUUAUUUCCUAUGGAUUAAAACUGG UAAAGCCCCGAAUGUGUAA-ACAUUCGGGGCUUU FMN 2: 91 >BA000004.3 UAUCCUUCGGGGCUGGGUGGAAAUCCCGACCGGCGGUGAUGAAGCGAA--UGC—UUCUUAGUCCGUGACCC GGUUGCUGAUAUCAGUAAGCGGUGGACCUGGUGAAAAUCCGGGACCGACAGUGAAAGUCUGGAUGGGAGAA GGAAACGUACGGUUCAAUUUGGAAAAAUGUGCAUGAUUGCACAUCUUCUUUCUCGUGGGCAAAAAACCUAC GUAUACACAAGGGAGAAGUCUGUCCAAAU---- FMN 3: >CP000903.1 CAUCCUUCGGGGUCGGGUGAAAUUCCCAACCGGCGGUGAUGAAGCGAU--AGC—UUCUAAGUCCGUGACCC GUUUUC---AACGCGAAAACGGUGGAUCUAGUGAAACUCUAGGGCCGACAGU-AUAGUCUGGAUGGGAGAA GGAUA---------UGUUUUCUAGUAAUUUUAUAUAGCGAAUACACUUUUAUUUCAGUAUGCAUA-UUUUU AAAGUUUCAUUUUGAAUCUUUAUAUAUGUUUUA FMN 4: >L47648.1 CAAUCUUCGGGGCAGGGUGAAAUUCCCUACCGGCGGUGAUGAGCCAAU--GGC---UCUAAGCCCGCGAGC UGUCUUU---------ACAGCA--GGAUUCGGUGAGAUUCCGGAGCCGACAGU-ACAGUCUGGAUGGGAGA AG-AUGGAGGUUCAUAAGCGUUUUGAAAUUGAAUUUUUCAAACGUUUCUUUGCCU-----AGCCUAAUUUU CGAAACCCCGCUUUUAUAUAUGAAGCGGUUUUUU Native functional structures from the FMN riboswitch of the ribD gene of Bacillus subtilis: UAUCCUUCGGGGCAGGGUGGAAAUCCCGACCGGCGGUAGUAAAGCACAUUUGC—UUUAGAGCCCGUGACCC GUGUGC----AUAAGCACGCGGUGGAUUCAGUUUAA-GCUGAAGCCGACAGUGAAAGUCUGGAUGGGAGAA GGAUG---AUGAGCCGCUAUGCAAAAUGUU-UAAAAAUGCAUAGUGUUAUUUCCUAUUGCGUAAAAUACC UAAAGCCCCGAAUUUUUUAUAAAUUCGGGGCUUU 
((((((((......(((.......))).....((((...((((((......))--))))....))))...( (((((((----....))))))))....((((((....-)))))).....(((.......))).......)) ))))))---......................-....................................... .((((((((((((((.....)))))))))))))) .....((((((((.(((.......))).....((((...((((((......))--))))....))))...( (((((((----....))))))))....((((((....-)))))).....(((.......)))......... ......---......................-....................................... ....)))))))).(((....)))........... 92 LIST OF REFERENCES [1] S. B. Needleman, and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970. [2] M. Zuker, and P. Stiegler, “Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information,” Nucleic acids research, vol. 9, no. 1, pp. 133-148, 1981. [3] M. Zuker, and D. Sankoff, “RNA secondary structures and their prediction,” Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 591-621, 1984. [4] P. Stein, and M. Waterman, “On some new sequences generalizing the Catalan and Motzkin numbers,” Discrete Mathematics, vol. 26, no. 3, pp. 261-272, 1979. [5] R. Nussinov, and A. B. Jacobson, “Fast algorithm for predicting the secondary structure of single-stranded RNA,” Proceedings of the National Academy of Sciences, vol. 77, no. 11, pp. 6309-6313, 1980. [6] D. Sankoff, “Simultaneous solution of the RNA folding, alignment and protosequence problems,” SIAM Journal on Applied Mathematics, vol. 45, no. 5, pp. 810-825, 1985. [7] Y. Li, and S. Zhang, “Finding stable local optimal RNA secondary structures,” Bioinformatics, vol. 27, no. 21, pp. 1-1-1-8, 11//, 2011. [8] S. Wuchty, W. Fontana, I. L. Hofacker, and P. Schuster, “Complete suboptimal folding of RNA and the stability of secondary structures,” Biopolymers, vol. 49, no. 2, pp. 145-165, 1999. 93 [9] C. Flamm, I. L. Hofacker, P. F. Stadler, and M. T. 
Wolfinger, “Barrier trees of degenerate landscapes,” Zeitschrift für Physikalische Chemie International journal of research in physical chemistry and chemical physics, vol. 216, no. 2, pp. 155, 2002.
[10] Y. Li, and S. Zhang, “Finding consensus stable local optimal structures for aligned RNA sequences,” 2012 IEEE 2nd International Conference on Computational Advances in Bio & Medical Sciences (ICCABS), pp. 1, 2012.
[11] M. Zaslavskiy, F. Bach, and J.-P. Vert, “Global alignment of protein–protein interaction networks by graph matching methods,” Bioinformatics, vol. 25, no. 12, pp. i259-i267, 2009.
[12] S. Umeyama, “An eigendecomposition approach to weighted graph matching problems,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 10, no. 5, pp. 695-703, 1988.
[13] R. Singh, J. Xu, and B. Berger, “Global alignment of multiple protein interaction networks with application to functional orthology detection,” Proceedings of the National Academy of Sciences, vol. 105, no. 35, pp. 12763-12768, 2008.
[14] M. Zaslavskiy, F. Bach, and J.-P. Vert, “A path following algorithm for graph matching,” Image and Signal Processing, pp. 329-337, Springer, 2008.
[15] M. Mandal, and R. R. Breaker, “Gene regulation by riboswitches,” Nature Reviews Molecular Cell Biology, vol. 5, no. 6, pp. 451-463, 2004.
[16] V. Bafna, H. Tang, and S. Zhang, “Consensus folding of unaligned RNA sequences revisited,” Journal of computational biology, vol. 13, no. 2, pp. 283-295, 2006.
[17] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner, “Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure,” Journal of molecular biology, vol. 288, no. 5, pp. 911-940, 1999.
[18] I. L. Hofacker, M. Fekete, and P. F. Stadler, “Secondary structure prediction for aligned RNA sequences,” Journal of molecular biology, vol. 319, no. 5, pp. 1059-1066, 2002.
[19] C. Smith, S. Heyne, A. S. Richter, S. Will, and R.
Backofen, “Freiburg RNA Tools: a web server integrating INTARNA, EXPARNA and LOCARNA,” Nucleic acids research, vol. 38, no. suppl 2, pp. W373-W377, 2010.
[20] S. Will, T. Joshi, I. L. Hofacker, P. F. Stadler, and R. Backofen, “LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs,” RNA, vol. 18, no. 5, pp. 900-914, 2012.
[21] S. Will, K. Reiche, I. L. Hofacker, P. F. Stadler, and R. Backofen, “Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering,” PLoS computational biology, vol. 3, no. 4, pp. e65, 2007.
[22] S. Heyne, F. Costa, D. Rose, and R. Backofen, “GraphClust: alignment-free structural clustering of local RNA secondary structures,” Bioinformatics (Oxford), vol. 28, no. 12, pp. i224-i232, 2012.
[23] A. Hüttenhofer, P. Schattner, and N. Polacek, “Non-coding RNAs: hope or hype?,” TRENDS in Genetics, vol. 21, no. 5, pp. 289-297, 2005.
[24] F. Costa, and K. De Grave, “Fast neighborhood subgraph pairwise distance kernel,” pp. 255-262.
[25] E. Mattei, G. Ausiello, F. Ferre, and M. Helmer-Citterich, “A novel approach to represent and compare RNA secondary structures,” Nucleic Acids Research, vol. 42, no. 10, pp. 6146-6157, 2014.
[26] M. O. Dayhoff, and R. M. Schwartz, “A model of evolutionary change in proteins.”
[27] S. Henikoff, and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915-10919, 1992.
[28] F. Meyer, S. Kurtz, R. Backofen, S. Will, and M. Beckstette, “Structator: fast index-based search for RNA sequence-structure patterns,” BMC bioinformatics, vol. 12, no. 1, pp. 214, 2011.
[29] R. Lorenz, S. H. Bernhart, C. H. Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L. Hofacker, “ViennaRNA Package 2.0,” Algorithms for Molecular Biology, vol. 6, no. 1, pp. 26, 2011.
[30] P. P. Gardner, A. Wilm, and S.
Washietl, “A benchmark of multiple sequence alignment programs upon structural RNAs,” Nucleic acids research, vol. 33, no. 8, pp. 2433-2439, 2005.
[31] A. D. Garst, A. L. Edwards, and R. T. Batey, “Riboswitches: structures and mechanisms,” Cold Spring Harbor perspectives in biology, vol. 3, no. 6, pp. a003533, 2011.
[32] A. R. Gruber, S. H. Bernhart, I. L. Hofacker, and S. Washietl, “Strategies for measuring evolutionary conservation of RNA secondary structures,” BMC bioinformatics, vol. 9, no. 1, pp. 122, 2008.
[33] I. Hofacker, and P. F. Stadler, “RNAz 2.0: improved noncoding RNA detection,” pp. 69-79.
[34] I. L. Hofacker, P. Schuster, and P. F. Stadler, “Combinatorics of RNA secondary structures,” Discrete Applied Mathematics, vol. 88, no. 1, pp. 207-237, 1998.
[35] M. Kucharik, I. L. Hofacker, P. F. Stadler, and J. Qin, “Basin Hopping Graph: a computational framework to characterize RNA folding landscapes,” Bioinformatics (Oxford), vol. 30, no. 14, pp. 2009-2017, 2014.
[36] Y. Li, Computational methods for analyzing RNA folding landscapes and its applications [electronic resource], Orlando, Fla.: University of Central Florida, 2012.
[37] J. S. McCaskill, “The equilibrium partition function and base pair binding probabilities for RNA secondary structure,” Biopolymers, vol. 29, no. 6-7, pp. 1105-1119, 1990.
[38] J. Reeder, and R. Giegerich, “Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction,” Bioinformatics, vol. 21, no. 17, pp. 3516-3523, 2005.
[39] P. Schuster, and P. F. Stadler, “Landscapes: Complex optimization problems and biopolymer structures,” Computers & chemistry, vol. 18, no. 3, pp. 295-324, 1994.
[40] P. Steffen, B. Voß, M. Rehmsmeier, J. Reeder, and R. Giegerich, “RNAshapes: an integrated RNA analysis package based on abstract shapes,” Bioinformatics, vol. 22, no. 4, pp. 500-503, 2006.
[41] Y. Wan, K. Qu, Q. C. Zhang, R. A. Flynn, O. Manor, Z. Ouyang, J. Zhang, R. C. Spitale, M. P.
Snyder, and E. Segal, “Landscape and variation of RNA secondary structure across the human transcriptome,” Nature, vol. 505, no. 7485, pp. 706-709, 2014.
[42] S. Washietl, I. L. Hofacker, and P. F. Stadler, “Fast and reliable prediction of noncoding RNAs,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 7, pp. 2454-2459, 2005.
[43] Y. Li, C. Zhong, and S. Zhang, “Finding consensus stable local optimal structures for aligned RNA sequences and its application to discovering riboswitch elements,” International Journal of Bioinformatics Research and Applications, no. 4/5, 2014.