BIOINFORMATICS ORIGINAL PAPER
Vol. 25 no. 24 2009, pages 3236–3243
doi:10.1093/bioinformatics/btp580

Sequence analysis

CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sum-of-pairs score

Michiaki Hamada1,2,∗, Kengo Sato2,3, Hisanori Kiryu4, Toutai Mituyama2 and Kiyoshi Asai2,4

1 Mizuho Information & Research Institute, Inc, 2–3 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101–8443, 2 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2–41–6, Aomi, Koto-ku, Tokyo 135–0064, 3 Japan Biological Informatics Consortium, 2–45 Aomi, Koto-ku, Tokyo 135–8073 and 4 Graduate School of Frontier Sciences, University of Tokyo, 5–1–5 Kashiwanoha, Kashiwa 277–8562, Japan

Received on July 26, 2009; revised on September 14, 2009; accepted on September 30, 2009
Advance Access publication October 6, 2009
Associate Editor: Limsoon Wong

ABSTRACT

Motivation: The importance of accurate and fast predictions of multiple alignments for RNA sequences has increased due to recent findings about functional non-coding RNAs. Recent studies suggest that maximizing the expected accuracy of predictions will be useful for many problems in bioinformatics.

Results: We designed a novel estimator for multiple alignments of structured RNAs, based on maximizing the expected accuracy of predictions. First, we define the maximum expected accuracy (MEA) estimator for pairwise alignment of RNA sequences. This maximizes the expected sum-of-pairs score (SPS) of a predicted alignment under a probability distribution of alignments given by marginalizing the Sankoff model. Then, by approximating the MEA estimator, we obtain an estimator whose time complexity is $O(L^3 + c^2 d L^2)$, where $L$ is the length of the input sequences and both $c$ and $d$ are constants independent of $L$.
The proposed estimator can handle the uncertainty of secondary structures and alignments, which is an obstacle in bioinformatics, because it considers all the secondary structures and all the pairwise alignments of the input sequences. Moreover, we integrate the probabilistic consistency transformation (PCT) on alignments into the proposed estimator. Computational experiments using six benchmark datasets indicate that the proposed method achieved a favorable SPS and was the fastest of many state-of-the-art tools for multiple alignments of structured RNAs.

Availability: The software called CentroidAlign, which is an implementation of the algorithm in this article, is freely available on our website: http://www.ncrna.org/software/centroidalign/.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

∗ To whom correspondence should be addressed.

1 INTRODUCTION

The importance of accurate and fast prediction of multiple alignment for RNA sequences has increased due to recent findings about functional non-coding RNAs. Not only nucleotide sequences but also secondary structures are closely related to the functions of most functional RNAs, so we should consider secondary structures when aligning RNA sequences. Sankoff (1985) proposed structural alignment, in which alignments of base pairs in secondary structures must be handled. The computational complexities of the Sankoff algorithm are $O(L^{3n})$ for time and $O(L^{2n})$ for space, where $L$ is the length of the RNA sequences and $n$ is the number of input sequences. These costs are too large for practical applications, even for pairwise alignment ($n = 2$).
Therefore, a number of approximations to the Sankoff algorithm have been proposed (Anwar et al., 2006; Bradley et al., 2008; Dalli et al., 2006; Do et al., 2008; Dowell and Eddy, 2006; Harmanci et al., 2008; Havgaard et al., 2005; Katoh and Toh, 2008; Kiryu et al., 2007a; Mathews, 2005; Mathews and Turner, 2002; Moretti et al., 2008; Tabei et al., 2008; Wilm et al., 2008). Bauer et al. (2007) proposed an integer linear programming (ILP) approach for structural alignments.

On the other hand, recent studies have indicated that maximizing the expected accuracy of predictions is a useful approach to designing powerful estimators [maximum expected accuracy (MEA) estimators] for a number of problems in bioinformatics, including predictions of secondary structures of RNA (Do et al., 2006a; Hamada et al., 2009a, b), predictions of common secondary structures of RNA sequences (Hamada et al., 2009a; Kiryu et al., 2007b; Seemann et al., 2008) and alignments (Bradley et al., 2008, 2009; Do et al., 2005). Moreover, MEA estimators can address another obstacle in many bioinformatics problems: the uncertainty of the solutions. For example, there are huge numbers of candidates for both secondary structures of an RNA sequence (Carvalho and Lawrence, 2008) and for alignments of biological sequences (Carvalho and Lawrence, 2008; Lunter et al., 2008; Webb-Robertson et al., 2008; Wong et al., 2008), and the probability of the optimal solution is very low (uncertainty). Therefore, it is important to design an estimator for multiple alignments of RNA sequences based on maximizing expected accuracy.

In this article, we propose a novel estimator for multiple alignment of structured RNA sequences, based on maximizing the expected sum-of-pairs score (SPS) (Thompson et al., 1999), which is a widely used accuracy measure for multiple alignments. First, we design the MEA estimator for pairwise alignment of RNA sequences, based on the Sankoff model (Sankoff, 1985). The MEA estimator maximizes the expected SPS under the probability distribution of alignments given by marginalizing the probability distribution of structural alignments in the Sankoff model; the estimator considers all the suboptimal structural alignments of input RNA sequences. Because the MEA estimator entails a huge computational cost, we introduce an approximation by factorizing a probability space, using an idea similar to that in our previous study (Hamada et al., 2009b). The resulting estimator considers all the suboptimal secondary structures and all the suboptimal pairwise alignments of the input sequences. By using the sparsity of the base-pairing and alignment match probability matrices, we reduced the computational cost of the estimator to $O(L^3 + c^2 d L^2)$, where $L$ is the length of the RNA sequence, and $c$ and $d$ are constants which are independent of $L$. Moreover, we integrated the probabilistic consistency transformation (PCT) of the alignment probability matrix (Do et al., 2005) into the proposed estimator. Finally, the extension to multiple alignment is conducted by a progressive alignment algorithm similar to CONTRAlign (Do et al., 2006b). Computational experiments using six datasets indicate that the proposed method achieved a favorable SPS and is the fastest of the state-of-the-art aligners.

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

2 METHODS

2.1 Preliminaries

First of all, we summarize the notation used in this article. In the following, let $x$ and $x'$ be RNA sequences. The length of RNA sequence $x$ is denoted by $|x|$, and $x_i$ for $1 \le i \le |x|$ indicates the $i$-th base in $x$.

2.1.1 Binary discrete spaces

(1) $S(x)\ (\subset \{\theta_{ij} \in \{0,1\} \mid 1 \le i < j \le |x|\})$ is the space of secondary structures of $x$, where $\theta_{ij} = 1$ (respectively $\theta_{ij} = 0$) for $\theta \in S(x)$ means that $x_i$ and $x_j$ form (respectively do not form) a base pair. In this study, no pseudo-knotted base pairs are allowed in a secondary structure.

(2) $A(x,x')\ (\subset \{\theta_{uv} \in \{0,1\} \mid 1 \le u \le |x|,\ 1 \le v \le |x'|\})$ is the space of pairwise alignments of $x$ and $x'$, where $\theta_{uv} = 1$ (respectively $\theta_{uv} = 0$) for $\theta \in A(x,x')$ means that $x_u$ aligns (respectively does not align) with $x'_v$.

(3) $SA(x,x')\ (\subset \{(\theta^p_{ijkl}, \theta^l_{uv}) \in \{0,1\} \times \{0,1\} \mid 1 \le i < j \le |x|,\ 1 \le k < l \le |x'|,\ 1 \le u \le |x|,\ 1 \le v \le |x'|\})$ is the space of structural alignments of $x$ and $x'$, where $\theta^p_{ijkl} = 1$ (respectively $\theta^p_{ijkl} = 0$) means that the base pair $(x_i, x_j)$ in $x$ aligns (respectively does not align) with the base pair $(x'_k, x'_l)$ in $x'$, and $\theta^l_{uv} = 1$ (respectively $\theta^l_{uv} = 0$) means that the base $x_u$ in a loop region of $x$ aligns (respectively does not align) with the base $x'_v$ in a loop region of $x'$. Note that $\theta \in SA(x,x')$ satisfies a number of constraints for a consistent structural alignment.

2.1.2 Probability distributions

(1) $p^{(s)}(\cdot|x)$ is a probability distribution on $S(x)$, which is given by the McCaskill model (McCaskill, 1990), the CONTRAfold model (Do et al., 2006a) or the SCFG model (Dowell and Eddy, 2004).

(2) $p^{(a)}(\cdot|x,x')$ is a probability distribution on $A(x,x')$, which is given by the ProbCons model (Do et al., 2005), the Miyazawa model (Miyazawa, 1995), the Probalign model (Roshan and Livesay, 2006) or the CONTRAlign model (Do et al., 2006b).

(3) $p^{(sa)}(\cdot|x,x')$ is a probability distribution on $SA(x,x')$, which is given by the pair SCFG model (Holmes, 2005) or the Sankoff model (Sankoff, 1985).

2.1.3 Posterior probabilities

(1) The base-pairing probability matrix of $x$, $\{p^{(s,x)}_{ij}\}_{i<j}$, has entries $p^{(s,x)}_{ij} = \sum_{\theta \in S(x)} \theta_{ij}\, p^{(s)}(\theta|x)$. The computational costs for computing a base-pairing probability matrix using the inside–outside algorithm are $O(|x|^3)$ and $O(|x|^2)$ for time and space, respectively (e.g. see McCaskill, 1990).

(2) $\{q^{(s,x)}_u\}$ are called the loop probabilities of $x$, where $q^{(s,x)}_u = 1 - \sum_{i:i<u} p^{(s,x)}_{iu} - \sum_{j:u<j} p^{(s,x)}_{uj}$.

(3) $\{p^{(a,x,x')}_{uv}\}_{u,v}$ is called an alignment match probability matrix of $x$ and $x'$, where $p^{(a,x,x')}_{uv} = \sum_{\theta \in A(x,x')} \theta_{uv}\, p^{(a)}(\theta|x,x')$. Both time and space complexities for computing the matrix using the forward–backward algorithm are $O(|x||x'|)$ (Durbin et al., 1998).

(4) $\{p^{(sa,x,x')}_{ijkl}\}_{i<j,k<l}$ and $\{p^{(sa,x,x')}_{uv}\}_{u,v}$ are called the alignment match probability matrices of a structural alignment of $x$ and $x'$, where $p^{(sa,x,x')}_{ijkl} = \sum_{\theta \in SA(x,x')} \theta^p_{ijkl}\, p^{(sa)}(\theta|x,x')$ (i.e. the probability that the base pair $(x_i, x_j)$ aligns with the base pair $(x'_k, x'_l)$) and $p^{(sa,x,x')}_{uv} = \sum_{\theta \in SA(x,x')} \theta^l_{uv}\, p^{(sa)}(\theta|x,x')$ (i.e. the probability that $x_u$ aligns with $x'_v$ in the loop region). The time and space complexities for computing the matrices using a variant of the inside–outside algorithm are $O(|x|^3 |x'|^3)$ and $O(|x|^2 |x'|^2)$, respectively.

2.2 Designing estimators for pairwise alignments of RNA sequences

A number of existing successful methods use the following estimator (e.g. Carvalho and Lawrence, 2008; Ding et al., 2005; Do et al., 2006a; Hamada et al., 2009a, b; Kiryu et al., 2007b):

$$\hat{y} = \arg\max_{y \in Y} E_{\theta|d}[G(\theta,y)] = \arg\max_{y \in Y} \sum_{\theta \in \Theta} G(\theta,y)\, p(\theta|d) \qquad (1)$$

where $Y$ is the space from which we would like to obtain a prediction, referred to as the predictive space; $\Theta$ is the parameter space, which is potentially different from the predictive space; $p(\theta|d)$ is a probability distribution on the parameter space given a dataset $d$; and $G(\theta,y)$ is a gain function relating $\Theta$ and $Y$ according to a measure of the accuracy of predictions. We can design various estimators by defining the gain function and the parameter space. For example, the γ-centroid estimator (for secondary structure prediction) proposed in Hamada et al.
(2009a) is represented by (1) as follows: $d = x$, where $x$ is an RNA sequence; $Y = \Theta = S(x)$; $G(\theta,y) = \alpha_1 TP + \alpha_2 TN - \alpha_3 FP - \alpha_4 FN$ ($\alpha_n > 0$, $n = 1,2,3,4$), where TP (respectively TN, FP, FN) is the number of true positive (respectively true negative, false positive, false negative) base pairs when we consider $y$ as the predicted secondary structure and $\theta$ as a reference secondary structure; and $p(\theta|d) = p^{(s)}(\theta|x)$. Hamada et al. (2009a) proved theoretically that the γ-centroid estimator includes the centroid estimator (Carvalho and Lawrence, 2008) used in Sfold (Ding et al., 2006) as a special case and is superior to the MEA estimator used in CONTRAfold (Do et al., 2006a). Note that Hamada et al. (2009b) recently extended this estimator to secondary structure prediction using homologous sequences.

In this study, our strategy for designing an estimator for pairwise alignment of RNA sequences is to use $A(x,x')$ for the predictive space $Y$; that is, a predicted alignment is not a structural alignment but a sequential alignment. However, we implicitly consider structural alignments by using structural alignments in the parameter space $\Theta$. In the following sections, we introduce two estimators: the MEA estimator that maximizes the expected SPS, and an approximation to the MEA estimator, based on an idea used in Hamada et al. (2009b).

2.2.1 MEA estimator (maximizing expected SPS)

In order to consider all the suboptimal structural alignments between $x$ and $x'$, the parameter space is defined by $\Theta^{(mea)} = SA(x,x')$ and the probability distribution on the parameter space is defined by $p^{(mea)}(\theta|x,x') = p^{(sa)}(\theta|x,x')$, where $p^{(sa)}$ is given by the Sankoff model (Sankoff, 1985). For $\theta = (\theta^p, \theta^l) \in SA(x,x')$ and positions $u$ in $x$ and $v$ in $x'$, the indicator function

$$R_{uv}(\theta) := \sum_{j:u<j,\ l:v<l} \theta^p_{ujvl}$$

is equal to 1 only when there exist bases $x_j$ and $x'_l$ such that the base pair $(x_u, x_j)$ aligns with the base pair $(x'_v, x'_l)$ (the left side of Fig. 1A), and

$$L_{uv}(\theta) := \sum_{i:i<u,\ k:k<v} \theta^p_{iukv}$$

is equal to 1 only when there exist bases $x_i$ and $x'_k$ such that the base pair $(x_i, x_u)$ aligns with the base pair $(x'_k, x'_v)$. On the other hand, $\theta^l_{uv}$ is equal to 1 only when $x_u$ aligns with $x'_v$ as a part of a loop region (the left side of Fig. 1B).

Fig. 1. Illustration of the approximations of the gain function in the proposed method. (A) $x_u$ and $x'_v$ are aligned in base pairs and (B) $x_u$ and $x'_v$ are aligned in a loop region.

Then, the gain function is defined by

$$G^{(mea)}(\theta,y) = \sum_{1 \le u \le |x|} \sum_{1 \le v \le |x'|} G^{(mea)}_{uv}(\theta,y)$$

$$G^{(mea)}_{uv}(\theta,y) = \gamma\, y_{uv} \left[ R_{uv}(\theta) + L_{uv}(\theta) + \theta^l_{uv} \right] + (1 - y_{uv}) \left[ 1 - R_{uv}(\theta) - L_{uv}(\theta) - \theta^l_{uv} \right]$$

for a prediction $y = \{y_{uv}\} \in A(x,x')$, where $\gamma > 0$ is a parameter which adjusts the balance of sensitivity (SEN) and positive predictive value (PPV) of a predicted pairwise alignment. Note that $R_{uv}(\theta) + L_{uv}(\theta) + \theta^l_{uv}$ takes a value in $\{0,1\}$, and $x_u$ and $x'_v$ are aligned with each other in base pairs or in a loop region if and only if the value is equal to 1.

Finally, we obtain an estimator which maximizes the expectation of the gain function $G^{(mea)}(\theta,y)$ under the probability distribution $p^{(mea)}(\theta|x,x')$:

$$\hat{y} = \arg\max_{y \in A(x,x')} \sum_{\theta \in \Theta^{(mea)}} G^{(mea)}(\theta,y)\, p^{(mea)}(\theta|x,x'). \qquad (2)$$

It is clear that this estimator is equivalent to the γ-centroid estimator (Hamada et al., 2009a) on $A(x,x')$ when the probability distribution is defined by the marginalized distribution

$$p(\theta|x,x') = \sum_{\theta' \in \Phi^{-1}(\theta)} p^{(sa)}(\theta'|x,x') \qquad (3)$$

where $\Phi$ is the projection map from a structural alignment $\theta' \in SA(x,x')$ to an alignment $\theta \in A(x,x')$ (see Section A.2 in the Supplementary Material). In other words, $p(\theta|x,x')$ is a probability distribution on $A(x,x')$ obtained by marginalizing $p^{(sa)}$ into the space $A(x,x')$.

Therefore, the MEA estimator considers all the suboptimal structural alignments of the RNA sequences, and has the following useful property.

Property 1. When $\gamma \to \infty$, the estimator (2) maximizes the expectation of the SPS for the pairwise alignment of $x$ and $x'$ under the probability distribution (3).

The predicted alignment of the MEA estimator can be computed by a Needleman–Wunsch (Needleman and Wunsch, 1970) or a Holmes (Holmes and Durbin, 1998) type dynamic programming (DP) algorithm whose recursive equation is

$$M_{u,v} = \max \begin{cases} M_{u-1,v-1} + (\gamma + 1)\, p^{(mea)}_{uv} - 1 \\ M_{u-1,v} \\ M_{u,v-1} \end{cases} \qquad (4)$$

where

$$p^{(mea)}_{uv} = \sum_{j:u<j,\ l:v<l} p^{(sa,x,x')}_{ujvl} + \sum_{i:i<u,\ k:k<v} p^{(sa,x,x')}_{iukv} + p^{(sa,x,x')}_{uv}$$

and $M_{u,v}$ is the optimal score of an alignment between the prefixes $x_1 \ldots x_u$ and $x'_1 \ldots x'_v$. If we change $p^{(mea)}_{uv}$ to $p^{(a,x,x')}_{uv}$ in Equation (4), the algorithm is equivalent to γ-centroid alignment (Hamada et al., 2009a); if we change $(\gamma + 1)\, p^{(mea)}_{uv} - 1$ to $p^{(a,x,x')}_{uv}$ in Equation (4), the result is equivalent to the estimator proposed by Holmes and Durbin (1998). Given $\{p^{(mea)}_{uv}\}_{u,v}$, this DP algorithm requires $O(L^2)$ time for calculating the optimal alignment, where $|x|, |x'| \approx L$.

The estimator (2) is similar to the alignment metric accuracy (AMA) estimator for the structural alignment of RNA sequences (Bradley et al., 2008), which maximizes the expected AMA score (Schwartz et al., 2005) under the probability distribution (3). The relation between the AMA estimator and the estimator in this section is shown in Section A.3 in the Supplementary Material.

Although the MEA estimator has the theoretically attractive properties described above, it comes with a huge computational cost of $O(L^6)$ for calculating alignments, because we must compute the alignment probability matrices $\{p^{(sa,x,x')}_{ijkl}\}$ and $\{p^{(sa,x,x')}_{uv}\}$. This cost is too large for practical use, so in the following section we obtain our proposed method by approximating the estimator.
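The recursion of Equation (4) can be sketched in a few lines of Python. This is a minimal illustration, not the CentroidAlign implementation: the function name `mea_alignment`, the list-of-lists input layout and the zero-valued first row and column (chosen because gaps contribute nothing in Equation (4)) are assumptions made for this example.

```python
def mea_alignment(p, gamma):
    """Fill the DP table of Equation (4) and trace back one optimal alignment.

    p[u][v] is the match probability of x_{u+1} and x'_{v+1} (0-based lists).
    Returns (optimal score, list of aligned 0-based index pairs).
    """
    n = len(p)
    m = len(p[0]) if n else 0
    # Assumption: first row and column are 0 because gaps score nothing.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for u in range(1, n + 1):
        for v in range(1, m + 1):
            match = M[u - 1][v - 1] + (gamma + 1.0) * p[u - 1][v - 1] - 1.0
            M[u][v] = max(match, M[u - 1][v], M[u][v - 1])

    # Traceback: ties are resolved in favour of gaps, so a pair is reported
    # only when the match step is strictly better than both gap steps.
    pairs = []
    u, v = n, m
    while u > 0 and v > 0:
        if M[u][v] == M[u - 1][v]:      # x_u is left unaligned
            u -= 1
        elif M[u][v] == M[u][v - 1]:    # x'_v is left unaligned
            v -= 1
        else:                           # x_u is aligned with x'_v
            pairs.append((u - 1, v - 1))
            u -= 1
            v -= 1
    pairs.reverse()
    return M[n][m], pairs
```

With this tie-breaking, a cell can only be aligned when $(\gamma + 1)\, p_{uv} - 1 > 0$, that is, when $p_{uv} > 1/(\gamma + 1)$, which shows how a larger γ admits lower-probability matches and thereby trades PPV for SEN.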
2.2.2 Proposed estimator

Taking into account the space $SA(x,x')$ and the probability distribution $p^{(sa)}$ on that space entails a huge computational cost, as noted in the previous section. Therefore, we factorize the parameter space $\Theta^{(mea)}$ and the probability distribution $p^{(mea)}$ into

$$\Theta^{(prop)} = A(x,x') \times S(x) \times S(x')$$

and

$$p^{(prop)}(\theta|x,x') = p^{(a)}(\theta^{(a,x,x')}|x,x')\; p^{(s)}(\theta^{(s,x)}|x)\; p^{(s)}(\theta^{(s,x')}|x'),$$

respectively, where $\theta = (\theta^{(a,x,x')}, \theta^{(s,x)}, \theta^{(s,x')}) \in \Theta^{(prop)}$ with $\theta^{(a,x,x')} \in A(x,x')$, $\theta^{(s,x)} \in S(x)$ and $\theta^{(s,x')} \in S(x')$. In the following, we use the notation

$$R'_{uv}(\theta) := \sum_{j:u<j,\ l:v<l} \theta^{(s,x)}_{uj}\, \theta^{(s,x')}_{vl}\, \theta^{(a,x,x')}_{jl}$$

and

$$L'_{uv}(\theta) := \sum_{i:i<u,\ k:k<v} \theta^{(s,x)}_{iu}\, \theta^{(s,x')}_{kv}\, \theta^{(a,x,x')}_{ik}.$$

$R'_{uv}(\theta)$ is equal to 1 when $x_u$ forms a base pair with $x_j$, $x'_v$ forms a base pair with $x'_l$ and $x_j$ aligns with $x'_l$ in $\theta$ (the right-side configuration of Fig. 1A); $L'_{uv}(\theta)$ is the analogous indicator for partners to the left of $u$ and $v$. We also define

$$\eta^{(x)}_u := \prod_{j:u<j} \left(1 - \theta^{(s,x)}_{uj}\right) \prod_{j:j<u} \left(1 - \theta^{(s,x)}_{ju}\right),$$

which is equal to 1 when $x_u$ is part of a loop region in $\theta^{(s,x)} \in S(x)$ (the right configuration of the sequence $x$ in Fig. 1B).

We consider the following approximations to the elements in the gain function $G^{(mea)}$:

[1] $R_{uv}(\theta) + L_{uv}(\theta) \approx \frac{w_1}{2}\, \theta^{(a,x,x')}_{uv} + w_2 \left[ R'_{uv}(\theta) + L'_{uv}(\theta) \right]$ and

[2] $\theta^l_{uv} \approx \frac{w_1}{2}\, \theta^{(a,x,x')}_{uv} + w_3\, \eta^{(x)}_u\, \eta^{(x')}_v$

where $w_1$, $w_2$ and $w_3$ are positive weights which satisfy $w_1 + w_2 + w_3 = 1$. See Figure 1 for an illustration of the approximations. Then, the new gain function $G^{(prop)}$ is given by

$$G^{(prop)}(\theta,y) = \sum_{1 \le u \le |x|} \sum_{1 \le v \le |x'|} G^{(prop)}_{uv}(\theta,y)$$

$$G^{(prop)}_{uv}(\theta,y) = \gamma\, y_{uv}\, \delta_{uv} + (1 - y_{uv})(1 - \delta_{uv})$$

where

$$\delta_{uv} = w_1\, \theta^{(a,x,x')}_{uv} + w_2 \left[ R'_{uv}(\theta) + L'_{uv}(\theta) \right] + w_3\, \eta^{(x)}_u\, \eta^{(x')}_v.$$

Finally, we introduce the estimator that maximizes the expectation of $G^{(prop)}(\theta,y)$ under the probability distribution $p^{(prop)}(\theta|x,x')$. By definition, the proposed estimator uses all the suboptimal secondary structures of $x$ and $x'$ and all the pairwise alignments between $x$ and $x'$.
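Because $\delta_{uv}$ is a weighted sum of products of indicators drawn from three independent factors of $p^{(prop)}$, its expectation needs only the posterior probabilities of Section 2.1.3: each indicator is replaced by its probability, and $\eta^{(x)}_u$ by the loop probability $q^{(s,x)}_u$. The sketch below evaluates that expectation for every cell $(u, v)$. The function names, the 0-based dictionary layout for the sparse probability matrices and the toy numbers in the usage note are assumptions made for illustration, not taken from the CentroidAlign source.

```python
from collections import defaultdict


def _partners(bp):
    """Index a sparse base-pairing matrix {(i, j): prob, i < j} by position."""
    right = defaultdict(list)   # right[i] -> [(j, prob)] with j > i
    left = defaultdict(list)    # left[j]  -> [(i, prob)] with i < j
    for (i, j), prob in bp.items():
        right[i].append((j, prob))
        left[j].append((i, prob))
    return right, left


def _loop_probabilities(bp, length):
    """q_u = 1 - sum of all pairing probabilities that involve position u."""
    q = [1.0] * length
    for (i, j), prob in bp.items():
        q[i] -= prob
        q[j] -= prob
    return q


def expected_delta(bp_x, bp_y, aln, len_x, len_y, w1, w2, w3):
    """Expectation of delta_uv under the factorized distribution.

    bp_x, bp_y: sparse base-pairing probabilities of x and x'.
    aln: sparse alignment match probabilities {(u, v): prob}.
    """
    right_x, left_x = _partners(bp_x)
    right_y, left_y = _partners(bp_y)
    q_x = _loop_probabilities(bp_x, len_x)
    q_y = _loop_probabilities(bp_y, len_y)
    out = [[0.0] * len_y for _ in range(len_x)]
    for u in range(len_x):
        for v in range(len_y):
            paired = 0.0
            # Expected R'_uv: partners to the right of u and v are aligned.
            for j, pu in right_x[u]:
                for l, pv in right_y[v]:
                    paired += pu * pv * aln.get((j, l), 0.0)
            # Expected L'_uv: partners to the left of u and v are aligned.
            for i, pu in left_x[u]:
                for k, pv in left_y[v]:
                    paired += pu * pv * aln.get((i, k), 0.0)
            out[u][v] = (w1 * aln.get((u, v), 0.0)
                         + w2 * paired
                         + w3 * q_x[u] * q_y[v])
    return out
```

For example, with `bp_x = bp_y = {(0, 1): 0.5}`, `aln = {(0, 0): 0.8, (1, 1): 0.6}` and weights `(0.45, 0.5, 0.05)`, the entry for cell (0, 0) combines the direct match term 0.45 × 0.8, the paired term 0.5 × (0.5 × 0.5 × 0.6) and the loop term 0.05 × (0.5 × 0.5). The inner loops visit only the stored base pairs, so the work per cell depends on how many partners each position retains rather than on the sequence length.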
We can obtain the predicted alignment of the proposed estimator by replacing $p^{(mea)}_{uv}$ in Equation (4) with

$$p^{(prop)}_{uv} = w_1\, p^{(a,x,x')}_{uv} + w_2 \left[ \sum_{j:u<j,\ l:v<l} p^{(s,x)}_{uj}\, p^{(s,x')}_{vl}\, p^{(a,x,x')}_{jl} + \sum_{i:i<u,\ k:k<v} p^{(s,x)}_{iu}\, p^{(s,x')}_{kv}\, p^{(a,x,x')}_{ik} \right] + w_3\, q^{(s,x)}_u\, q^{(s,x')}_v.$$

It is easily seen that $p^{(prop)}_{uv} \in [0,1]$. Note that MAFFT (Katoh and Toh, 2008) used a similar scoring scheme for the objective function of iterative alignments.

The total computational cost with respect to time for calculating the proposed estimator is $O(L^4)$, where $|x|, |x'| \approx L$, because it is necessary to compute the alternative alignment probability matrix $\{p^{(prop)}_{uv}\}_{u,v}$. However, by using the sparsity of the alignment match probability matrix and the base-pairing probability matrix, we can greatly reduce the computational cost. As described in Do et al. (2008), we can assume $O(c)$ and $O(d)$ bounds on the number of candidate base pairs and alignment pairs per sequence position, respectively, if we use a threshold $\delta$ for the probability. In other words, there are $O(c)$ [respectively $O(d)$] base pairs (respectively aligned pairs) whose probability is more than $\delta$ per sequence position. Under these assumptions, it is easily seen that the time complexity for calculating the matrix $\{p^{(prop)}_{uv}\}$ is $O(c^2 d L^2)$. Since we need $O(L^3)$ time for calculating a base-pairing probability matrix, the total computational cost of our algorithm is $O(L^3 + c^2 d L^2)$, whereas the computational cost of the RNA alignment and folding (RAF) algorithm is $O(L^3 + \min(c,d)\, c\, d^2 L^2)$, as shown in Do et al. (2008).

2.3 Integrating PCT into the proposed estimator

We can easily integrate the PCT (Do et al., 2005) for alignment problems into the proposed estimator when we have a set of sequences which are homologous to $x$ and $x'$.
For a set of (homologous) sequences $S$ and $x, x' \in S$, we redefine the parameter space $\Theta^{(prop)}$ as

$$\Theta^{(prop)} = A(x,x') \times S(x) \times S(x') \times \prod_{z \in S \setminus \{x,x'\}} \left[ A(x,z) \times A(z,x') \right]$$

and the probability distribution on the parameter space as

$$p^{(prop)}(\theta|x,x') = p^{(a)}(\theta^{(a,x,x')}|x,x')\; p^{(s)}(\theta^{(s,x)}|x)\; p^{(s)}(\theta^{(s,x')}|x') \times \prod_{z \in S \setminus \{x,x'\}} p^{(a)}(\theta^{(a,x,z)}|x,z)\; p^{(a)}(\theta^{(a,z,x')}|z,x')$$

for $\theta = (\theta^{(a,x,x')}, \theta^{(s,x)}, \theta^{(s,x')}, \{\theta^{(a,x,z)}, \theta^{(a,z,x')}\}_{z \in S \setminus \{x,x'\}})$, where $\theta^{(a,x,x')} \in A(x,x')$, $\theta^{(s,x)} \in S(x)$ and $\theta^{(s,x')} \in S(x')$. Furthermore, we redefine the gain function $G^{(prop)}$ by replacing $\theta^{(a,x,x')}_{ik}$ with the pseudo-count

$$\theta^{(a,x,x',S)}_{ik} = \frac{1}{|S| - 1} \left( \theta^{(a,x,x')}_{ik} + \sum_{z \in S \setminus \{x,x'\}} \sum_{1 \le u \le |z|} \theta^{(a,x,z)}_{iu}\, \theta^{(a,z,x')}_{uk} \right)$$

where $|S|$ indicates the number of sequences in $S$. Note that $\theta^{(a,x,x',S)}_{ik} \in [0,1]$. Then, we obtain a new estimator that maximizes the expected gain under the probability distribution. In practice, we only replace $p^{(a,x,x')}_{ik}$ with the pseudo-probability

$$p^{(a,x,x',S)}_{ik} = \frac{1}{|S| - 1} \left( p^{(a,x,x')}_{ik} + \sum_{z \in S \setminus \{x,x'\}} \sum_{1 \le u \le |z|} p^{(a,x,z)}_{iu}\, p^{(a,z,x')}_{uk} \right)$$

in the computation of $p^{(prop)}_{uv}$. Note that $p^{(a,x,x',S)}_{ik}$ is called the PCT of the probability $p^{(a,x,x')}_{ik}$ (Do et al., 2005). In a similar way, we are able to integrate the PCT of the base-pairing probability (Kiryu et al., 2007a) into the proposed estimator.

2.4 Extension to multiple alignments

Multiple alignment is conducted using the progressive alignment algorithm used in ProbconsRNA (Do et al., 2005) and CONTRAlign (Do et al., 2006b). We used the proposed estimator with sufficiently large γ and the PCT for pairwise alignment of two RNA sequences. Note that integrating PCT does not change the overall time complexity achieved by exploiting the sparsity of the base-pairing probability and alignment match probability matrices.

3 EXPERIMENTS

We conducted all experiments in this section on our Linux cluster machines, each of which has a 2 GHz AMD Opteron(tm) Processor 246 and 4 GB of memory.

3.1 Datasets

We used the following six datasets for benchmarking.

(i) Murlet dataset1 (Kiryu et al., 2007a), which contains 85 multiple alignments and reference common secondary structures extracted from the Rfam database. The number of families is 17 and there are 5 subalignments of 10 sequences for each family.
(ii) Murlet dataset2 (Kiryu et al., 2007a), which contains 188 multiple alignments and reference common secondary structures of four sequences taken from the Hammerhead_3 ribozyme family in the Rfam database.
(iii) MXSCARNA_dataset (Tabei et al., 2008), which contains 1693 multiple alignments and their common secondary structures.
(iv) MASTR_dataset (Lindgreen et al., 2007), which contains five families; the total number of alignments is 52.
(v) Bralibase2 (Gardner et al., 2005), which contains 599 multiple alignments.
(vi) Bralibase2.1 (Wilm et al., 2006): the total number of alignments is 18 990, each of which contains 2, 3, 5, 7, 10 or 15 RNA sequences. Note that the Bralibase2.1 dataset does not contain reference common secondary structures.

3.2 Compared methods and our model

We compared CentroidAlign with the following state-of-the-art aligners: (i) CONTRAlign v2.01 (Do et al., 2006b), (ii) ProbconsRNA (Do et al., 2005) (neither of these aligners considers secondary structures), (iii) RAF v1.00 (Do et al., 2008), (iv) MXSCARNA v2.1 (Tabei et al., 2008), (v) Murlet (Kiryu et al., 2007a), (vi) MAFFT-xinsi 6.626 with SCARNA pair (Katoh and Toh, 2008), (vii) R-coffee v7.81 (Moretti et al., 2008; Wilm et al., 2008), (viii) Stemloc-AMA (Bradley et al., 2008), (ix) STRAL v0.5.4 (Dalli et al., 2006) and (x) (t)LARA v1.3.2 (Bauer et al., 2007). Due to limitations of computational cost, Stemloc-AMA was only applied to the Murlet_dataset2, the MASTR_dataset and the Bralibase2 dataset.
In the following sections, the proposed method (‘centroid_align’) means the proposed estimator (Section 2.2.2) with PCT for the alignment (Section 2.3). We used the McCaskill model ($p^{(s)}$) with the same parameters as in ViennaRNA-1.6.3 and the CONTRAlign model ($p^{(a)}$) with the same parameters as in CONTRAlign v2.0.1. Moreover, we set $\delta = 0.01$, $w_1 = 0.45$, $w_2 = 0.5$ and $w_3 = 0.05$. Those parameters were determined by testing on a small dataset.

3.3 Evaluation measures

In order to evaluate a predicted alignment from each aligner, we used the following evaluation measures used in previous research. (i) SPS (Thompson et al., 1999); we used compalignp, which is available from the Bralibase2.1 web site (http://www.biophys.uni-duesseldorf.de/bralibase), for computing the SPS of each alignment. (ii) Structure conservation index (SCI) (Washietl et al., 2005), defined by $SCI = E_A / \bar{E}$, where $E_A$ is the consensus minimum free energy (MFE) computed by RNAalifold and $\bar{E}$ is the average MFE over all RNA sequences in a given alignment; we used scif, which is also available from the Bralibase2.1 database, for calculating SCI. (iii) SEN, PPV and Matthews correlation coefficient (MCC) for a predicted common secondary structure, which are defined as follows:

$$SEN = \frac{TP}{TP + FN}, \qquad PPV = \frac{TP}{TP + FP}, \qquad MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of correctly predicted base pairs (true positives), TN is the number of base pairs which were correctly predicted as non-matching (true negatives), FN is the number of base pairs in the correct structure which were not predicted (false negatives) and FP is the number of incorrectly predicted base pairs (false positives). The reference secondary structure of each sequence in an alignment is given by a common secondary structure mapped to the sequence. We used RNAalifold (Bernhart et al., 2008) and Centroid(Ali)Fold (Hamada et al., 2009a; Sato et al., 2009) for common secondary structure prediction from a multiple alignment.

Table 1. Murlet dataset1

Aligner          n    SPS   SCI   SEN   PPV   MCC   TIME
contralign       85   0.77  0.41  0.58  0.77  0.65  149
probcons         85   0.76  0.38  0.57  0.79  0.64  71
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   85   0.78  0.48  0.63  0.75  0.68  196
lara             85   0.75  0.51  0.62  0.73  0.66  12901
mafft-xinsi      85   0.79  0.53  0.67  0.75  0.70  510
mlocarna         85   0.72  0.60  0.66  0.73  0.68  38645
murlet           85   0.77  0.43  0.63  0.76  0.68  59506
mxscarna         85   0.75  0.44  0.64  0.75  0.67  201
raf              85   0.75  0.46  0.69  0.72  0.70  6167
rcoffee          85   0.76  0.42  0.59  0.75  0.65  487
stral            85   0.67  0.37  0.48  0.72  0.56  145

The aligners above the dashed line cannot consider secondary structures when aligning RNA sequences, whereas the ones below the dashed line can. n means the number of successfully predicted alignments and TIME means the total computational time in seconds. The bold values indicate the best score in each evaluation measure and the fastest times among aligners above and below the dashed line. We used Centroid(Ali)Fold (Hamada et al., 2009a; Sato et al., 2009) to compute the common secondary structure from an alignment for calculating SEN, PPV and MCC. See the Supplementary Material for the results using RNAalifold (Bernhart et al., 2008).

Table 2. Murlet dataset2

Aligner          n    SPS   SCI   SEN   PPV   MCC   TIME
contralign       188  0.84  0.63  0.76  0.84  0.78  6
probcons         188  0.81  0.60  0.79  0.84  0.80  6
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   188  0.88  0.82  0.89  0.89  0.88  11
lara             188  0.81  0.85  0.89  0.84  0.86  486
mafft-xinsi      188  0.89  0.90  0.95  0.88  0.91  80
mlocarna         188  0.84  0.94  0.93  0.85  0.89  100
murlet           188  0.84  0.76  0.88  0.86  0.86  838
mxscarna         188  0.86  0.83  0.93  0.88  0.90  10
raf              188  0.86  0.88  0.95  0.89  0.91  68
rcoffee          188  0.84  0.73  0.84  0.87  0.85  328
stemloc-ama      188  0.85  0.75  0.88  0.87  0.87  37557
stral            188  0.77  0.55  0.64  0.71  0.65  13

See the footnote in Table 1.

Table 3. MXSCARNA dataset

Aligner          n     SPS   SCI   SEN   PPV   MCC   TIME
contralign       1693  0.79  0.58  0.64  0.67  0.63  707
probcons         1693  0.78  0.54  0.63  0.66  0.63  487
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   1693  0.80  0.67  0.70  0.69  0.68  2000
lara             1693  0.77  0.71  0.70  0.68  0.68  61694
mafft-xinsi      1693  0.80  0.71  0.72  0.69  0.69  4316
mlocarna         1693  0.77  0.80  0.75  0.68  0.70  468792
murlet           1693  0.79  0.63  0.71  0.69  0.69  500469
mxscarna         1693  0.78  0.69  0.73  0.70  0.70  2540
raf              1693  0.79  0.72  0.75  0.70  0.71  41078
rcoffee          1693  0.78  0.61  0.67  0.68  0.66  4822
stral            1693  0.74  0.57  0.61  0.63  0.60  2021

See the footnote in Table 1.

Table 4. MASTR dataset

Aligner          n    SPS   SCI   SEN   PPV   MCC   TIME
contralign       52   0.87  0.72  0.64  0.77  0.69  31
probcons         52   0.87  0.72  0.64  0.78  0.69  18
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   52   0.88  0.75  0.65  0.77  0.70  47
lara             52   0.86  0.77  0.69  0.80  0.73  3620
mafft-xinsi      52   0.88  0.78  0.68  0.78  0.71  133
mlocarna         52   0.85  0.80  0.68  0.75  0.71  2117
murlet           52   0.87  0.74  0.67  0.78  0.71  4149
mxscarna         52   0.86  0.73  0.66  0.77  0.70  50
raf              52   0.86  0.74  0.70  0.77  0.73  272
rcoffee          52   0.87  0.74  0.66  0.78  0.70  223
stemloc-ama      51   0.86  0.72  0.63  0.77  0.68  353453
stral            52   0.81  0.70  0.61  0.75  0.65  32

See the footnote in Table 1.

3.4 Results of computational experiments

We show the results of benchmarking on Murlet dataset1, Murlet dataset2, MXSCARNA dataset, MASTR dataset, Bralibase2 dataset and Bralibase2.1 dataset in Tables 1–6, respectively. These results show that CentroidAlign achieved a favorable balance of speed and accuracy compared with existing approaches. More precisely, (i) CentroidAlign achieved the first or second best SPS out of all the aligners in all the benchmark datasets. These are desirable results because we designed our estimator in order to maximize the expected SPS. Moreover, CentroidAlign was one of the fastest aligners out of all the aligners that consider secondary structures (i.e.
the aligners below dashed line in each table), which confirm the effectiveness of our approximations in the MEA estimator. However, CentroidAlign sometimes gives worse SCI, SEN, PPV 3240 [14:24 9/11/2009 Bioinformatics-btp580.tex] Page: 3240 3236–3243 CentroidAlign and MCC (i.e. evaluation measures related to common secondary structures) than some tools such as mafft-xinsi, raf, lara and locarna, which shows CentroidAlign is not the optimal estimator for those evaluation measures. (ii) On most evaluation measures, CentroidAlign is better than CONTRAlign and ProbconsRNA, both of which cannot consider secondary structures of input sequences at all. However, CentroidAlign is 2–5 times slower than those tools. (iii) CentroidAlign has a better SPS and is much faster than Stemloc-AMA (Bradley et al., 2008) (which has several similarities with CentroidAlign; See Section A in the Supplementary Material). This is because Stemloc-AMA uses the marginalized probability distribution given by Sankoff model while we used approximations to the distribution. We also tried other reasonable approximations to the MEA estimator, as described in Section B in the Supplementary Material, and their performances were consistently worse than those of the estimator proposed in the main text. 4 expected accuracy. We obtained the proposed estimator by approximating the MEA estimator that maximizes the expected SPS under the marginalized distribution of the Sankoff model. Our estimator considers all the suboptimal secondary structures and all the suboptimal pairwise alignments of input sequences. Stemloc-AMA (Bradley et al., 2008) also adopts an MEA estimator that is different from the one in this article. The differences between Stemloc-AMA and CentroidAlign can be summarized as follows: (i) Stemloc-AMA uses the AMA estimator (Schwartz et al., 2005) as the gain function, whereas CentroidAlign uses the different gain function described in Section 2.2.2. 
The relation between the gain functions of Stemloc-AMA and CentroidAlign is described in the Supplementary Materials. (ii) The probability distribution in Stemloc-AMA for pairwise alignments of two RNA sequences is the marginalized probability distribution obtained by the Sankoff model, that is, Equation (3), whereas CentroidAlign uses an approximation to the distribution. Therefore, StemlocAMA has a huge computational cost, which was demonstrated in our experiments. The idea used in this article will be applied to finding approximations of the AMA estimator used in StemlocAMA, although the approximations will be more complicated than the ones in this article. LARA (Bauer et al., 2007) is able to take into account base-paring probabilities and subsequently uses an approach similar to Probcons and CentroidAlign to compute the multiple alignment. The main difference between LARA and CentroidAlign is that LARA uses an ILP approach rather than a DP used in CentroidAlign. STRAL (Dalli et al., 2006) also conducted sequential alignment whose score considers the potential secondary structures by using base-paring probabilities. However, the score of STRAL is less elaborate than the one of CentroidAlign (e.g. STRAL does not use alignment match probabilities), and our computational experiments indicate that STRAL is less accurate than CentroidAlign although it is as fast as CentroidAlign. Do et al. (2008) proposed the excellent approximations of the Sankoff algorithm, which were implemented in RAF. However, the predictive space used in RAF remains the space of structural alignments of RNA sequences. So, RAF is theoretically slower than the proposed method—a fact confirmed in our experiments. DISCUSSION AND FUTURE WORK In this article, we have proposed a theoretically sound estimator for multiple alignment of RNA sequences, based on maximizing Table 5. 
BRAlibase2 dataset
Aligner          n    SPS   SCI   SEN   PPV   MCC   TIME
contralign       599  0.83  0.79  0.74  0.74  0.74  147
probcons         599  0.83  0.77  0.73  0.74  0.73  108
centroid_align   599  0.84  0.83  0.77  0.75  0.76  416
lara             599  0.85  0.90  0.80  0.76  0.77  13710
mafft-xinsi      599  0.84  0.87  0.79  0.75  0.77  870
mlocarna         599  0.85  0.94  0.82  0.76  0.79  43697
murlet           599  0.83  0.82  0.78  0.75  0.76  33237
mxscarna         599  0.83  0.86  0.80  0.76  0.77  512
raf              599  0.84  0.88  0.83  0.76  0.79  2459
rcoffee          599  0.82  0.78  0.73  0.74  0.73  1451
stemloc-ama      597  0.77  0.72  0.67  0.75  0.69  1699068
stral            599  0.81  0.80  0.73  0.73  0.73  421
Stemloc-AMA failed due to the small size of the dataset. See the footnote in Table 1.
Table 6. BRAlibase2.1 dataset
                 k2          k3          k5          k7          k10         k15
Aligner          SPS   SCI   SPS   SCI   SPS   SCI   SPS   SCI   SPS   SCI   SPS   SCI   n      TIME
contralign       0.85  0.84  0.86  0.79  0.88  0.78  0.89  0.77  0.90  0.76  0.91  0.75  18990  4110
probcons         0.84  0.83  0.86  0.77  0.88  0.76  0.89  0.75  0.90  0.73  0.91  0.72  18990  2839
centroid_align   0.86  0.89  0.87  0.84  0.89  0.83  0.90  0.81  0.91  0.80  0.92  0.79  18990  8499
lara             0.85  0.95  0.87  0.89  0.90  0.88  0.91  0.87  0.92  0.85  0.93  0.86  18990  377864
mafft-xinsi      0.86  0.91  0.88  0.87  0.90  0.87  0.91  0.86  0.92  0.85  0.93  0.85  18990  20798
mlocarna         0.87  0.96  0.89  0.93  0.90  0.92  0.90  0.90  0.91  0.88  0.92  0.88  18990  1240871
murlet           0.85  0.88  0.87  0.83  0.89  0.81  0.90  0.80  0.91  0.78  0.92  0.76  18980  1371504
mxscarna         0.85  0.91  0.87  0.86  0.88  0.83  0.89  0.81  0.90  0.78  0.91  0.77  18990  9778
raf              0.86  0.95  0.88  0.90  0.89  0.87  0.90  0.83  0.91  0.79  0.91  0.76  18990  67363
rcoffee          0.83  0.82  0.85  0.79  0.88  0.78  0.89  0.77  0.90  0.76  0.91  0.75  18990  38847
stral            0.82  0.84  0.84  0.79  0.85  0.77  0.85  0.75  0.85  0.73  0.85  0.72  18990  7963
This dataset does not contain reference secondary structures, so only SPS and SCI are shown in the table. Murlet failed due to the small size of the dataset. The bold values indicate the best score in each evaluation measure and the fastest times in aligners above and below the dashed line.
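The running-time statement in item (ii) of the results (CentroidAlign being 2–5 times slower than CONTRAlign and ProbconsRNA) can be checked against the TIME columns of Tables 5 and 6. The following is a small illustrative calculation only: the dictionary layout and the helper name `slowdown` are assumptions made for this sketch, and the numbers are copied from the two tables, so the ratios are unit-free.

```python
# TIME values copied from Tables 5 and 6 (same units within each table).
times = {
    "BRAlibase2": {"contralign": 147, "probcons": 108, "centroid_align": 416},
    "BRAlibase2.1": {"contralign": 4110, "probcons": 2839, "centroid_align": 8499},
}


def slowdown(dataset_times, tool="centroid_align"):
    """Return time(tool) / time(other) for every other aligner (hypothetical helper)."""
    base = dataset_times[tool]
    return {name: base / t for name, t in dataset_times.items() if name != tool}


for dataset, dataset_times in times.items():
    for name, ratio in slowdown(dataset_times).items():
        print(f"{dataset}: centroid_align / {name} = {ratio:.2f}")
```

On these two datasets the ratios lie between roughly 2 and 4, which is consistent with the 2–5 range stated for the full set of benchmarks.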
Note that RAF can predict common secondary structures as well as alignments, whereas CentroidAlign cannot.
An advantage of the proposed method is that it can be easily extended to local alignment problems. In that case, we should use probabilities p(a) obtained using local alignment models such as ProDA (Phuong et al., 2006) and LAST (http://last.cbrc.jp/), and conduct Smith–Waterman-style DP (Smith and Waterman, 1981). Instead of Equation (4), Smith–Waterman DP uses
\[
M_{u,v} = \max
\begin{cases}
0 \\
M_{u-1,v-1} + (\gamma + 1)\, p_{uv}^{(\mathrm{prop})} - 1 \\
M_{u-1,v} \\
M_{u,v-1}
\end{cases}
\]
While we used a large γ to maximize the expected SPS in this article, in Smith–Waterman DP the γ parameter will be important for adjusting the sensitivity and specificity of aligned bases. We should employ Rfold (Kiryu et al., 2008) or RNAplfold (Bernhart et al., 2006) for calculating the banded base-pairing probability matrix.
We determined the parameters $w_1$, $w_2$ and $w_3$ of $p_{uv}^{(\mathrm{prop})}$ in the proposed estimator by an ad hoc procedure in this study. Therefore, the values used might not be the optimal ones. However, it is possible to learn these parameters using the max-margin model (Do et al., 2008).
ACKNOWLEDGMENTS
The authors thank L. E. Carvalho and C. E. Lawrence for valuable comments. The authors are also grateful to our colleagues at the Computational Biology Research Center (CBRC) for fruitful discussions. Also, the constructive comments of the three anonymous reviewers are greatly appreciated.
Funding: ‘Functional RNA Project’ funded by the New Energy and Industrial Technology Development Organization (NEDO) of Japan; Grant-in-Aid for Scientific Research on Priority Areas ‘Comparative Genomics’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
Conflict of Interest: none declared.
REFERENCES
Anwar,M. et al. (2006) Identification of consensus RNA secondary structures using suffix arrays.
BMC Bioinformatics, 7, 244.
Bauer,M. et al. (2007) Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics, 8, 271.
Bernhart,S. et al. (2008) RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9, 474.
Bernhart,S.H. et al. (2006) Local RNA base pairing probabilities in large sequences. Bioinformatics, 22, 614–615.
Bradley,R.K. et al. (2008) Specific alignment of structured RNA: stochastic grammars and sequence annealing. Bioinformatics, 24, 2677–2683.
Bradley,R.K. et al. (2009) Fast statistical alignment. PLoS Comput. Biol., 5, e1000392.
Carvalho,L. and Lawrence,C. (2008) Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl Acad. Sci. USA, 105, 3209–3214.
Dalli,D. et al. (2006) STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics, 22, 1593–1599.
Ding,Y. et al. (2005) RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA, 11, 1157–1166.
Ding,Y. et al. (2006) Clustering of RNA secondary structures with application to messenger RNAs. J. Mol. Biol., 359, 554–571.
Do,C. et al. (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.
Do,C. et al. (2006a) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90–e98.
Do,C.B. et al. (2006b) Contralign: discriminative training for protein sequence alignment. In Apostolico,A. et al., eds, RECOMB, vol. 3909 of Lecture Notes in Computer Science, Springer, Berlin/Heidelberg, pp. 160–174.
Do,C. et al. (2008) A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics, 24, i68–i76.
Dowell,R. and Eddy,S. (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71.
Dowell,R. and Eddy,S. (2006) Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7, 400.
Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge.
Gardner,P.P. et al. (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res., 33, 2433–2439.
Hamada,M. et al. (2009a) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics, 25, 465–473.
Hamada,M. et al. (2009b) Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics, 25, i330–i338.
Harmanci,A. et al. (2008) PARTS: probabilistic alignment for RNA joinT secondary structure prediction. Nucleic Acids Res., 36, 2406–2417.
Havgaard,J. et al. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21, 1815–1824.
Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6, 73.
Holmes,I. and Durbin,R. (1998) Dynamic programming alignment accuracy. J. Comput. Biol., 5, 493–504.
Katoh,K. and Toh,H. (2008) Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC Bioinformatics, 9, 212.
Kiryu,H. et al. (2007a) Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics, 23, 1588–1598.
Kiryu,H. et al. (2007b) Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics, 23, 434–441.
Kiryu,H. et al. (2008) Rfold: an exact algorithm for computing local base pairing probabilities. Bioinformatics, 24, 367–373.
Lindgreen,S. et al. (2007) MASTR: multiple alignment and structure prediction of noncoding RNAs using simulated annealing. Bioinformatics, 23, 3304–3311.
Lunter,G. et al. (2008) Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res., 18, 298–309.
Mathews,D. (2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, 21, 2246–2253.
Mathews,D. and Turner,D. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol., 317, 191–203.
McCaskill,J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.
Miyazawa,S. (1995) A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng., 8, 999–1009.
Moretti,S. et al. (2008) R-Coffee: a web server for accurately aligning noncoding RNA sequences. Nucleic Acids Res., 36, W10–W13.
Needleman,S. and Wunsch,C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
Phuong,T. et al. (2006) Multiple alignment of protein sequences with repeats and rearrangements. Nucleic Acids Res., 34, 5932–5942.
Roshan,U. and Livesay,D. (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics, 22, 2715–2721.
Sankoff,D. (1985) Simultaneous solution of the RNA folding alignment and protosequence problems. SIAM J. Appl. Math, pp. 810–825.
Sato,K. et al. (2009) CENTROIDFOLD: a web server for RNA secondary structure prediction. Nucleic Acids Res., 37, W277–W280.
Schwartz,A.S. et al. (2005) Alignment metric accuracy. Available at http://arxiv.org/abs/q-bio.QM/0510052.
Seemann,S. et al. (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res., 36, 6355–6362.
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Tabei,Y. et al. (2008) A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics, 9, 33.
Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.
Washietl,S. et al. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA, 102, 2454–2459.
Webb-Robertson,B.J. et al. (2008) Measuring global credibility with application to local sequence alignment. PLoS Comput. Biol., 4, e1000077.
Wilm,A. et al. (2006) An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol. Biol., 1, 19.
Wilm,A. et al. (2008) R-Coffee: a method for multiple alignment of non-coding RNA. Nucleic Acids Res., 36, e52.
Wong,K.M. et al. (2008) Alignment uncertainty and genomic analysis. Science, 319, 473–476.
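As a supplementary illustration of the local-alignment extension mentioned in the Discussion, the Smith–Waterman-style recursion (a cell takes the maximum of 0, the diagonal predecessor plus (γ+1)p_uv − 1, and the two gap predecessors) can be sketched as follows. This is a minimal sketch under stated assumptions, not the CentroidAlign implementation: the function name `local_mea_align`, the plain nested-list representation of the match-probability matrix `p`, the tie-breaking order, and the traceback convention (start at the first cell attaining the maximum score, stop at a cell whose value came from the 0 term) are all choices made here for illustration.

```python
from typing import List, Tuple

# Traceback codes (hypothetical names used only in this sketch).
STOP, DIAG, UP, LEFT = 0, 1, 2, 3


def local_mea_align(
    p: List[List[float]], gamma: float
) -> Tuple[float, List[Tuple[int, int]]]:
    """Fill M[u][v] = max(0, M[u-1][v-1] + (gamma+1)*p[u][v] - 1, M[u-1][v], M[u][v-1])
    and return the best score with the aligned (0-based) position pairs."""
    n = len(p)
    m = len(p[0]) if n else 0
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[STOP] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)

    for u in range(1, n + 1):
        for v in range(1, m + 1):
            # Gain of aligning positions u and v; positive iff p > 1 / (gamma + 1).
            gain = (gamma + 1.0) * p[u - 1][v - 1] - 1.0
            candidates = [
                (0.0, STOP),
                (M[u - 1][v - 1] + gain, DIAG),
                (M[u - 1][v], UP),
                (M[u][v - 1], LEFT),
            ]
            # max() keeps the first candidate on ties (assumed tie-breaking rule).
            M[u][v], ptr[u][v] = max(candidates, key=lambda c: c[0])
            if M[u][v] > best:
                best, best_cell = M[u][v], (u, v)

    # Trace back from the best cell until a STOP cell or the matrix border.
    pairs: List[Tuple[int, int]] = []
    u, v = best_cell
    while u > 0 and v > 0 and ptr[u][v] != STOP:
        if ptr[u][v] == DIAG:
            pairs.append((u - 1, v - 1))
            u, v = u - 1, v - 1
        elif ptr[u][v] == UP:
            u -= 1
        else:
            v -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Toy match probabilities (made-up values, not from any real model).
    toy_p = [[0.9, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.8]]
    score, pairs = local_mea_align(toy_p, gamma=1.0)
    print(pairs)  # [(0, 0), (2, 2)]
```

With γ = 1 only pairs whose match probability exceeds 1/(γ+1) = 0.5 contribute a positive gain, so the low-probability middle pair is left unaligned in the toy example; lowering γ raises this threshold and makes the predicted aligned bases more specific, while raising it makes them more sensitive.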