Bioinformatics - Oxford Academic

BIOINFORMATICS
ORIGINAL PAPER
Vol. 25 no. 24 2009, pages 3236–3243
doi:10.1093/bioinformatics/btp580
Sequence analysis
CentroidAlign: fast and accurate aligner for structured RNAs by
maximizing expected sum-of-pairs score
Michiaki Hamada1,2,∗ , Kengo Sato2,3 , Hisanori Kiryu4 , Toutai Mituyama2 and
Kiyoshi Asai2,4
1 Mizuho Information & Research Institute, Inc, 2–3 Kanda-Nishikicho, Chiyoda-ku, Tokyo 101–8443,
2 Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST),
2–41–6, Aomi, Koto-ku, Tokyo 135–0064, 3 Japan Biological Informatics Consortium, 2–45 Aomi, Koto-ku,
Tokyo 135–8073 and 4 Graduate School of Frontier Sciences, University of Tokyo, 5–1–5 Kashiwanoha,
Kashiwa 277–8562, Japan
Received on July 26, 2009; revised on September 14, 2009; accepted on September 30, 2009
Advance Access publication October 6, 2009
Associate Editor: Limsoon Wong
ABSTRACT
Motivation: The importance of accurate and fast predictions of
multiple alignments for RNA sequences has increased due to recent
ﬁndings about functional non-coding RNAs. Recent studies suggest
that maximizing the expected accuracy of predictions will be useful
for many problems in bioinformatics.
Results: We designed a novel estimator for multiple alignments of
structured RNAs, based on maximizing the expected accuracy of
predictions. First, we deﬁne the maximum expected accuracy (MEA)
estimator for pairwise alignment of RNA sequences. This maximizes
the expected sum-of-pairs score (SPS) of a predicted alignment
under a probability distribution of alignments given by marginalizing
the Sankoff model. Then, by approximating the MEA estimator, we
obtain an estimator whose time complexity is $O(L^3 + c^2 d L^2)$, where
L is the length of input sequences and both c and d are constants
independent of L. The proposed estimator can handle uncertainty
of secondary structures and alignments that are obstacles in
Bioinformatics because it considers all the secondary structures
and all the pairwise alignments as input sequences. Moreover,
we integrate the probabilistic consistency transformation (PCT) on
alignments into the proposed estimator. Computational experiments
using six benchmark datasets indicate that the proposed method
achieved a favorable SPS and was the fastest of many state-of-the-art tools for multiple alignments of structured RNAs.
Availability: The software called CentroidAlign, which is an
implementation of the algorithm in this article, is freely available on
our website: http://www.ncrna.org/software/centroidalign/.
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
∗ To whom correspondence should be addressed.
1 INTRODUCTION
The importance of accurate and fast prediction of multiple alignment
for RNA sequences has increased due to recent findings about
functional non-coding RNAs. Not only nucleotide sequences but
also secondary structures are closely related to the functions of
most functional RNAs, so we should consider secondary structures
when aligning RNA sequences. In 1985, Sankoff (1985) proposed
structural alignment in which we must handle alignments of base
pairs in secondary structures. The computational complexities of
the Sankoff algorithm are $O(L^{3n})$ for time and $O(L^{2n})$ for space
where L is the length of RNA sequences and n is the number of
input sequences, and those are too large for practical applications
even when we conduct pairwise alignment. Therefore, a number
of approximations to the Sankoff algorithm have been proposed
(Anwar et al., 2006; Bradley et al., 2008; Dalli et al., 2006; Do et al.,
2008; Dowell and Eddy, 2006; Harmanci et al., 2008; Havgaard
et al., 2005; Katoh and Toh, 2008; Kiryu et al., 2007a; Mathews,
2005; Mathews and Turner, 2002; Moretti et al., 2008; Tabei et al.,
2008; Wilm et al., 2008). Bauer et al. (2007) proposed an integer
linear programming (ILP) approach for structural alignments.
On the other hand, recent studies have indicated that maximizing
the expected accuracy of predictions is a useful approach to
design powerful estimators [maximum expected accuracy (MEA)
estimators] for a number of problems appearing in Bioinformatics,
including predictions of secondary structures of RNA (Do et al.,
2006a; Hamada et al., 2009a, b), predictions of common secondary
structures of RNA sequences (Hamada et al., 2009a; Kiryu et al.,
2007b; Seemann et al., 2008) and alignments (Bradley et al.,
2008, 2009; Do et al., 2005). Fortunately, MEA estimators have
the possibility to address another obstacle in many Bioinformatics
problems: the uncertainty of the solutions. For example, there are
huge numbers of candidates for both secondary structures of an
RNA sequence (Carvalho and Lawrence, 2008) and for alignments
of biological sequences (Carvalho and Lawrence, 2008; Lunter
et al., 2008; Webb-Robertson et al., 2008; Wong et al., 2008), and
the probability of the optimal solution is very low (uncertainty).
Therefore, it is important to design an estimator for multiple
alignments of RNA sequences based on maximizing expected
accuracy.
In this article, we propose a novel estimator for multiple
alignment of structured RNA sequences, based on maximizing
the expected sum-of-pairs score (SPS) (Thompson et al., 1999),
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
which is a widely used accuracy measure for multiple alignments.
First, we design the MEA estimator for pairwise alignment of
RNA sequences, based on the Sankoff model (Sankoff, 1985). The
MEA estimator maximizes the expected SPS under the probability
distribution of alignments given by marginalizing the probability
distribution of structural alignments in the Sankoff model; the
estimator considers all the suboptimal structural alignments of
input RNA sequences. Because the MEA estimator entails a huge
computational cost, we introduce an approximation by factorizing
a probability space, using an idea similar to that in our previous
study (Hamada et al., 2009b). The resulting estimator considers
all the suboptimal secondary structures and all the suboptimal
pairwise alignments of the input sequences. By using the sparsity
of the base-pairing and alignment match probability matrices, we
reduced the computational cost of the estimator to $O(L^3 + c^2 d L^2)$,
where L is the length of the RNA sequence, and c and d are
constants which are independent of L. Moreover, we integrated
the probabilistic consistency transformation (PCT) of the alignment
probability matrix (Do et al., 2005) into the proposed estimator.
Finally, the extension to multiple alignment is conducted by a
progressive alignment algorithm similar to CONTRAlign (Do et al.,
2006b). Computational experiments using six datasets indicate that
the proposed method achieved a favorable SPS and is the fastest of
the state-of-the-art aligners.
2 METHODS
2.1 Preliminaries
First of all, we summarize the notation used in this article. In the following,
let $x$ and $x'$ be RNA sequences. The length of RNA sequence $x$ is denoted
by $|x|$, and $x_i$ for $1 \le i \le |x|$ indicates the $i$-th base in $x$.
2.1.1 Binary discrete spaces (1) $S(x) \subset \{\theta_{ij} \in \{0,1\} \mid 1 \le i < j \le |x|\}$ is the
space of secondary structures of $x$, where $\theta_{ij} = 1$ (respectively $\theta_{ij} = 0$) for
$\theta \in S(x)$ means that $x_i$ and $x_j$ form (respectively do not form) a base pair. In
this study, no pseudo-knotted base pairs are allowed in a secondary structure.
(2) $A(x,x') \subset \{\theta_{uv} \in \{0,1\} \mid 1 \le u \le |x|, 1 \le v \le |x'|\}$ is the space of pairwise
alignments of $x$ and $x'$, where $\theta_{uv} = 1$ (respectively $\theta_{uv} = 0$) for $\theta \in A(x,x')$
means that $x_u$ aligns (respectively does not align) with $x'_v$.
(3) $SA(x,x') \subset \{(\theta^p_{ijkl}, \theta^l_{uv}) \in \{0,1\} \times \{0,1\} \mid 1 \le i < j \le |x|, 1 \le k < l \le |x'|, 1 \le u \le |x|, 1 \le v \le |x'|\}$
is a space of structural alignments of $x$ and $x'$, where
$\theta^p_{ijkl} = 1$ (respectively $\theta^p_{ijkl} = 0$) means that the base pair $(x_i,x_j)$ in $x$ aligns
(respectively does not align) with the base pair $(x'_k,x'_l)$ in $x'$, and $\theta^l_{uv} = 1$
(respectively $\theta^l_{uv} = 0$) means that the base $x_u$ in a loop region of $x$ aligns
(respectively does not align) with the base $x'_v$ in a loop region of $x'$. Note
that $\theta \in SA(x,x')$ satisfies a number of constraints for a consistent structural
alignment.
2.1.2 Probability distributions (1) $p^{(s)}(\cdot|x)$ is a probability distribution
on $S(x)$, which is given by the McCaskill model (McCaskill, 1990), the
CONTRAfold model (Do et al., 2006a) or the SCFG model (Dowell and
Eddy, 2004).
(2) $p^{(a)}(\cdot|x,x')$ is a probability distribution on $A(x,x')$, which is given
by the ProbCons model (Do et al., 2005), the Miyazawa model (Miyazawa,
1995), the Probalign model (Roshan and Livesay, 2006) or the CONTRAlign
model (Do et al., 2006b).
(3) $p^{(sa)}(\cdot|x,x')$ is a probability distribution on $SA(x,x')$, which is given
by the pair SCFG model (Holmes, 2005) or the Sankoff model (Sankoff,
1985).
2.1.3 Posterior probabilities (1) The base-pairing probability matrix of $x$,
$\{p^{(s,x)}_{ij}\}_{i<j}$, has entries $p^{(s,x)}_{ij} = \sum_{\theta \in S(x)} \theta_{ij}\, p^{(s)}(\theta|x)$. The computational costs
for computing a base-pairing probability matrix using the inside–outside
algorithm are $O(|x|^3)$ and $O(|x|^2)$ for time and space, respectively (e.g. see
McCaskill, 1990).
(2) $\{q^{(s,x)}_u\}$ are called the loop probabilities of $x$, where
$q^{(s,x)}_u = 1 - \sum_{i:i<u} p^{(s,x)}_{iu} - \sum_{j:u<j} p^{(s,x)}_{uj}$.
(3) $\{p^{(a,x,x')}_{uv}\}_{u,v}$ is called an alignment match probability matrix of $x$ and $x'$, where
$p^{(a,x,x')}_{uv} = \sum_{\theta \in A(x,x')} \theta_{uv}\, p^{(a)}(\theta|x,x')$. Both time and space complexities
for computing the matrix using the forward–backward algorithm are
$O(|x||x'|)$ (Durbin et al., 1998).
(4) $\{p^{(sa,x,x')}_{ijkl}\}_{i<j,k<l}$ and $\{p^{(sa,x,x')}_{uv}\}_{u,v}$ are called the alignment match
probability matrices of a structural alignment of $x$ and $x'$, where
$p^{(sa,x,x')}_{ijkl} = \sum_{\theta \in SA(x,x')} \theta^p_{ijkl}\, p^{(sa)}(\theta|x,x')$
(i.e. the probability that the base pair $(x_i,x_j)$ aligns with the base pair $(x'_k,x'_l)$) and
$p^{(sa,x,x')}_{uv} = \sum_{\theta \in SA(x,x')} \theta^l_{uv}\, p^{(sa)}(\theta|x,x')$
(i.e. the probability that $x_u$ aligns with $x'_v$ in the loop region). The time
and space complexities for computing the matrices using a variant of the
inside–outside algorithm are $O(|x|^3|x'|^3)$ and $O(|x|^2|x'|^2)$, respectively.
2.2 Designing estimators for pairwise alignments of RNA sequences
A number of existing successful methods use the following estimators (e.g.
Carvalho and Lawrence, 2008; Ding et al., 2005; Do et al., 2006a; Hamada
et al., 2009a, b; Kiryu et al., 2007b):
$$\hat{y} = \arg\max_{y \in Y} E_{\theta|d}[G(\theta,y)] = \arg\max_{y \in Y} \sum_{\theta \in \Theta} G(\theta,y)\, p(\theta|d) \qquad (1)$$
where $Y$ is a space from which we would like to obtain a prediction, referred
to as a predictive space; $\Theta$ is the parameter space, which is potentially different
from the predictive space; $p(\theta|d)$ is a probability distribution on the parameter
space given a dataset $d$; and $G(\theta,y)$ is a gain function relating $\Theta$ and
$Y$ according to a measure of the accuracy of predictions. We can design
various estimators by defining the gain function and the parameter space.
For example, the γ-centroid estimator (for secondary structure prediction)
proposed in Hamada et al. (2009a) is represented by (1) as follows: $d = x$,
where $x$ is an RNA sequence; $Y = \Theta = S(x)$; $G(\theta,y) = \alpha_1 TP + \alpha_2 TN - \alpha_3 FP - \alpha_4 FN$
($\alpha_n > 0$, $n = 1,2,3,4$), where TP (respectively TN, FP, FN) is
the number of true positive (respectively true negative, false positive, false
negative) base pairs when we consider $y$ as the predicted secondary structure
and $\theta$ as a reference secondary structure; and $p(\theta|d) = p^{(s)}(\theta|x)$. Hamada
et al. (2009a) proved theoretically that the γ-centroid estimator includes
the centroid estimator (Carvalho and Lawrence, 2008) used in Sfold (Ding
et al., 2006) as a special case and is superior to the MEA estimator used in
CONTRAfold (Do et al., 2006a). Note that Hamada et al. (2009b) recently
extended this estimator to secondary structure prediction using homologous
sequences.
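As a toy illustration of Equation (1), the following minimal sketch evaluates the expected gain of every candidate prediction by brute force and returns the maximizer. Everything in it is an assumption made for illustration: the function names `expected_gain_estimator` and `gamma_gain`, the three-bit parameter vectors, the posterior values, and the unconstrained predictive space (the spaces used in this article carry structural constraints that the sketch ignores). The gain is the special case γ·TP + TN of the weighted gain above.

```python
from itertools import product


def expected_gain_estimator(candidates, posterior, gain):
    """Return the candidate y that maximizes sum_theta gain(theta, y) * p(theta)."""
    def expected_gain(y):
        return sum(p * gain(theta, y) for theta, p in posterior.items())
    return max(candidates, key=expected_gain)


def gamma_gain(gamma):
    """Build the gain gamma * TP + TN for binary vectors (illustrative special case)."""
    def gain(theta, y):
        tp = sum(1 for t, v in zip(theta, y) if t == 1 and v == 1)
        tn = sum(1 for t, v in zip(theta, y) if t == 0 and v == 0)
        return gamma * tp + tn
    return gain


# Hypothetical posterior over three binary variables (values invented for the example).
posterior = {(1, 1, 0): 0.4, (1, 0, 0): 0.3, (0, 0, 1): 0.3}
# Unconstrained predictive space: every 3-bit vector is a candidate.
candidates = list(product((0, 1), repeat=3))

y_gamma1 = expected_gain_estimator(candidates, posterior, gamma_gain(1.0))
y_gamma2 = expected_gain_estimator(candidates, posterior, gamma_gain(2.0))
```

With this gain, each variable is predicted as 1 exactly when its marginal probability exceeds 1/(γ+1), so a larger γ yields more positive predictions; enumeration is only feasible for toy spaces, which is why the article relies on dynamic programming instead.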
In this study, our strategy for designing an estimator for pairwise alignment
of RNA sequences is to use $A(x,x')$ for the predictive space $Y$, that is, a
predicted alignment is not a structural alignment but a sequential alignment.
However, we implicitly consider structural alignments by using structural
alignment in the parameter space $\Theta$. In the following sections, we introduce
two estimators: the MEA estimator that maximizes the expected SPS, and
an approximation to the MEA estimator, based on an idea used in Hamada
et al. (2009b).
2.2.1 MEA estimator (maximizing expected SPS) In order to consider
all the suboptimal structural alignments between $x$ and $x'$, the parameter
space is defined by $\Theta^{(mea)} = SA(x,x')$ and the probability distribution on
the parameter space is defined by $p^{(mea)}(\theta|x,x') = p^{(sa)}(\theta|x,x')$, where $p^{(sa)}$ is
given by the Sankoff model (Sankoff, 1985).
For $\theta = (\theta^p, \theta^l) \in SA(x,x')$ and positions $u$ in $x$ and $v$ in $x'$, the indicator
function $R_{uv}(\theta) := \sum_{j:u<j}\sum_{l:v<l} \theta^p_{ujvl}$ is equal to 1 only when there exist bases
$x_j$ and $x'_l$ where the base pair $(x_u,x_j)$ aligns with the base pair $(x'_v,x'_l)$ (the
left side of Fig. 1A), and $L_{uv}(\theta) := \sum_{i:i<u}\sum_{k:k<v} \theta^p_{iukv}$ is equal to 1 only when
there exist bases $x_i$ and $x'_k$ where the base pair $(x_i,x_u)$ aligns with the base
pair $(x'_k,x'_v)$. On the other hand, $\theta^l_{uv}$ is equal to 1 only when $x_u$ aligns with $x'_v$
as a part of a loop region (the left side of Fig. 1B). Then, the gain function
is defined by
$$G^{(mea)}(\theta,y) = \sum_{1 \le u \le |x|} \sum_{1 \le v \le |x'|} G^{(mea)}_{uv}(\theta,y)$$
$$G^{(mea)}_{uv}(\theta,y) = \gamma\, y_{uv} \left[R_{uv}(\theta) + L_{uv}(\theta) + \theta^l_{uv}\right] + (1-y_{uv}) \left[1 - R_{uv}(\theta) - L_{uv}(\theta) - \theta^l_{uv}\right]$$
for a prediction $y = \{y_{uv}\} \in A(x,x')$, where $\gamma > 0$ is a parameter which adjusts
the balance of sensitivity (SEN) and positive predictive value (PPV) of a
predicted pairwise alignment. Note that $R_{uv}(\theta) + L_{uv}(\theta) + \theta^l_{uv}$ takes a value
in {0,1}, and $x_u$ and $x'_v$ are aligned with each other in base pairs or a loop
region if and only if the value is equal to 1.
Fig. 1. Illustration of the approximations of the gain function in the proposed
method. (A) $x_u$ and $x'_v$ are aligned in base pairs and (B) $x_u$ and $x'_v$ are aligned
in loop region.
Finally, we obtain an estimator which maximizes the expectation of the
gain function $G^{(mea)}(\theta,y)$ on the probability distribution $p^{(mea)}(\theta|x,x')$:
$$\hat{y} = \arg\max_{y \in A(x,x')} \sum_{\theta \in \Theta^{(mea)}} G^{(mea)}(\theta,y)\, p^{(mea)}(\theta|x,x'). \qquad (2)$$
It is clear that this estimator is equivalent to the γ-centroid estimator
(Hamada et al., 2009a) on $A(x,x')$ when the probability distribution is
defined by the marginalized distribution
$$p(\theta|x,x') = \sum_{\theta' \in \Phi^{-1}(\theta)} p^{(sa)}(\theta'|x,x') \qquad (3)$$
where $\Phi$ is the projection map from a structural alignment $\theta' \in SA(x,x')$ to
an alignment $\theta \in A(x,x')$ (see Section A.2 in the Supplementary Material).
In other words, $p(\theta|x,x')$ is a probability distribution on $A(x,x')$ obtained by
marginalizing $p^{(sa)}$ into the space $A(x,x')$. Therefore, the MEA estimator
considers all the suboptimal structural alignments of the RNA sequences,
and has the following useful property.
Property 1. When $\gamma \to \infty$, the estimator (2) maximizes the expectation
of the SPS for the pairwise alignment of $x$ and $x'$ under the probability
distribution (3).
The predicted alignment of the MEA estimator can be computed by a
Needleman–Wunsch (Needleman and Wunsch, 1970) or a Holmes (Holmes
and Durbin, 1998) type dynamic programming (DP) algorithm whose
recursive equation is
$$M_{u,v} = \max \begin{cases} M_{u-1,v-1} + (\gamma+1)\, p^{(mea)}_{uv} - 1 \\ M_{u-1,v} \\ M_{u,v-1} \end{cases} \qquad (4)$$
where
$$p^{(mea)}_{uv} = \sum_{j:u<j}\sum_{l:v<l} p^{(sa,x,x')}_{ujvl} + \sum_{i:i<u}\sum_{k:k<v} p^{(sa,x,x')}_{iukv} + p^{(sa,x,x')}_{uv}$$
and $M_{u,v}$ is the optimal score of an alignment between the prefixes $x_1 \ldots x_u$ and
$x'_1 \ldots x'_v$. If we change $p^{(mea)}_{uv}$ to $p^{(a,x,x')}_{uv}$ in Equation (4), the algorithm is
equivalent to γ-centroid alignment (Hamada et al., 2009a); if we change
$(\gamma+1)\, p^{(mea)}_{uv} - 1$ to $p^{(a,x,x')}_{uv}$ in Equation (4), the result is equivalent to the
estimator proposed by Holmes and Durbin (1998).
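The step from the gain function to the match score in recursion (4) can be made explicit. The short derivation below is a sketch that uses only the definitions above, writing $p_{uv}$ as shorthand for $p^{(mea)}_{uv}$, i.e. the expectation of $R_{uv}(\theta) + L_{uv}(\theta) + \theta^l_{uv}$ under $p^{(mea)}$.

```latex
\begin{align*}
E_{\theta}\!\left[G^{(mea)}(\theta,y)\right]
  &= \sum_{u,v} \Big[ \gamma\, y_{uv}\, p_{uv} + (1-y_{uv})(1-p_{uv}) \Big] \\
  &= \sum_{u,v} y_{uv} \Big[ (\gamma+1)\, p_{uv} - 1 \Big] + \sum_{u,v} (1-p_{uv}).
\end{align*}
```

The second sum does not depend on $y$, so maximizing the expected gain over alignments amounts to choosing the aligned pairs that maximize the total of $(\gamma+1)p_{uv} - 1$, with unaligned positions contributing zero. This is the score accumulated by the first case of recursion (4).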
Given $\{p^{(mea)}_{uv}\}_{u,v}$, this DP algorithm requires $O(L^2)$ time for calculating
the optimal alignment, where $|x|,|x'| \approx L$. The estimator (2) is similar to
the alignment metric accuracy (AMA) estimator for the structural alignment
of RNA sequences (Bradley et al., 2008), which maximizes the expected
AMA score (Schwartz et al., 2005) under the probability distribution (3).
The relation between the AMA estimator and the estimator in this section is
shown in Section A.3 in the Supplementary Material.
Although the MEA estimator has the theoretically appealing properties
described above, it comes with a huge computational cost of $O(L^6)$ for
calculating alignments because we must compute the alignment probability
matrices $\{p^{(sa,x,x')}_{ijkl}\}$ and $\{p^{(sa,x,x')}_{uv}\}$. This cost is too large for practical use, so
in the following section we obtain our proposed method by approximating
the estimator.
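The DP of Equation (4) can be sketched as follows. This is a minimal illustration and not the CentroidAlign implementation: the function name `mea_align`, the list-of-lists input and the tie-breaking order in the traceback are assumptions made here. The input `p` is any match probability matrix (for example $p^{(mea)}_{uv}$ or an approximation of it), indexed from 0, and the returned pairs are 1-based positions.

```python
def mea_align(p, gamma):
    """Fill M by recursion (4) and trace back the aligned pairs.

    p[u][v] is the match probability of position u+1 of x and v+1 of x'.
    Returns (optimal score, list of 1-based aligned pairs (u, v)).
    """
    n = len(p)
    m = len(p[0]) if n else 0
    # M[u][v]: best score for the prefixes of length u and v; borders stay 0.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for u in range(1, n + 1):
        for v in range(1, m + 1):
            match = M[u - 1][v - 1] + (gamma + 1.0) * p[u - 1][v - 1] - 1.0
            M[u][v] = max(match, M[u - 1][v], M[u][v - 1])

    # Traceback: recompute the match score the same way so equality is exact.
    pairs = []
    u, v = n, m
    while u > 0 and v > 0:
        match = M[u - 1][v - 1] + (gamma + 1.0) * p[u - 1][v - 1] - 1.0
        if M[u][v] == match:
            pairs.append((u, v))
            u, v = u - 1, v - 1
        elif M[u][v] == M[u - 1][v]:
            u -= 1
        else:
            v -= 1
    pairs.reverse()
    return M[n][m], pairs


score, pairs = mea_align([[0.9, 0.1], [0.1, 0.8]], gamma=1.0)
low_score, low_pairs = mea_align([[0.3, 0.3], [0.3, 0.3]], gamma=1.0)
```

A pair is only worth aligning when $(\gamma+1)p_{uv} - 1 > 0$, that is when $p_{uv} > 1/(\gamma+1)$, which is how γ trades sensitivity against PPV; in the second call every probability is below 1/2, so nothing is aligned.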
2.2.2 Proposed estimator Taking into account the space $SA(x,x')$ and the
probability distribution $p^{(sa)}$ on the space entails a huge computational cost,
as noted in the previous section. Therefore, we factorize the parameter space
$\Theta^{(mea)}$ and the probability distribution $p^{(mea)}$ into
$$\Theta^{(prop)} = A(x,x') \times S(x) \times S(x')$$
and
$$p^{(prop)}(\theta|x,x') = p^{(a)}(\theta^{(a,x,x')}|x,x')\, p^{(s)}(\theta^{(s,x)}|x)\, p^{(s)}(\theta^{(s,x')}|x'),$$
respectively, where $\theta = (\theta^{(a,x,x')}, \theta^{(s,x)}, \theta^{(s,x')}) \in \Theta^{(prop)}$ with $\theta^{(a,x,x')} \in A(x,x')$,
$\theta^{(s,x)} \in S(x)$ and $\theta^{(s,x')} \in S(x')$.
In the following, we use the notation
$$\tilde{R}_{uv}(\theta) := \sum_{j:u<j}\sum_{l:v<l} \theta^{(s,x)}_{uj}\, \theta^{(s,x')}_{vl}\, \theta^{(a,x,x')}_{jl}$$
and
$$\tilde{L}_{uv}(\theta) := \sum_{i:i<u}\sum_{k:k<v} \theta^{(s,x)}_{iu}\, \theta^{(s,x')}_{kv}\, \theta^{(a,x,x')}_{ik}.$$
$\tilde{R}_{uv}(\theta)$ is equal to 1 when $x_u$ forms a base pair with $x_j$, $x'_v$ forms a base
pair with $x'_l$ and $x_j$ aligns with $x'_l$ in $\theta$ (the right-side configuration of Fig. 1A);
$\tilde{L}_{uv}(\theta)$ is the analogous quantity for pairing partners $x_i$ and $x'_k$ to the left of $x_u$ and $x'_v$.
We also define $\eta^{(x)}_u := \prod_{j:u<j} (1-\theta^{(s,x)}_{uj}) \prod_{j:j<u} (1-\theta^{(s,x)}_{ju})$, which is equal to 1
when $x_u$ is part of a loop region in $\theta^{(s,x)} \in S(x)$ (the right configuration of
the sequence $x$ of Fig. 1B). We consider the following approximation to the
elements in the gain function $G^{(mea)}$:
[1] $R_{uv}(\theta) + L_{uv}(\theta) \approx \frac{w_1}{2}\, \theta^{(a,x,x')}_{uv} + w_2 \left[\tilde{R}_{uv}(\theta) + \tilde{L}_{uv}(\theta)\right]$ and
[2] $\theta^l_{uv} \approx \frac{w_1}{2}\, \theta^{(a,x,x')}_{uv} + w_3\, \eta^{(x)}_u \eta^{(x')}_v$
where $w_1$, $w_2$ and $w_3$ are positive weights which satisfy $w_1 + w_2 + w_3 = 1$.
See Figure 1 for an illustration of the approximations. Then, the new gain
function $G^{(prop)}$ is given by
$$G^{(prop)}(\theta,y) = \sum_{1 \le u \le |x|} \sum_{1 \le v \le |x'|} G^{(prop)}_{uv}(\theta,y)$$
$$G^{(prop)}_{uv}(\theta,y) = \gamma\, y_{uv}\, \delta_{uv} + (1-y_{uv})(1-\delta_{uv})$$
where
$$\delta_{uv} = w_1\, \theta^{(a,x,x')}_{uv} + w_2 \left[\tilde{R}_{uv}(\theta) + \tilde{L}_{uv}(\theta)\right] + w_3\, \eta^{(x)}_u \eta^{(x')}_v.$$
Finally, we introduce the estimator that maximizes the expectation of
$G^{(prop)}(\theta,y)$ under the probability distribution $p^{(prop)}(\theta|x,x')$. By definition,
the proposed estimator uses all the suboptimal secondary structures of $x$ and
$x'$ and all the pairwise alignments between $x$ and $x'$.
We can obtain the predicted alignment of the proposed estimator by
replacing $p^{(mea)}_{uv}$ with
$$p^{(prop)}_{uv} = w_1\, p^{(a,x,x')}_{uv} + w_2 \left[ \sum_{j:u<j}\sum_{l:v<l} p^{(s,x)}_{uj}\, p^{(s,x')}_{vl}\, p^{(a,x,x')}_{jl} + \sum_{i:i<u}\sum_{k:k<v} p^{(s,x)}_{iu}\, p^{(s,x')}_{kv}\, p^{(a,x,x')}_{ik} \right] + w_3\, q^{(s,x)}_u q^{(s,x')}_v$$
in Equation (4). It is easily seen that $p^{(prop)}_{uv} \in [0,1]$. Note that MAFFT (Katoh
and Toh, 2008) used a similar scoring scheme for the optimizing function of
iterative alignments.
The total computational cost with respect to time for calculating the
proposed estimator is $O(L^4)$, where $|x|,|x'| \approx L$, because it is necessary to
compute the alternative alignment probability matrix $\{p^{(prop)}_{uv}\}_{u,v}$. However,
by using the sparsity of the aligned-base probability matrix and the base-pairing
probability matrix, we can greatly reduce the computational cost. As
described in Do et al. (2008), we can assume $O(c)$ and $O(d)$ bounds
on the number of candidate base pairs and alignment pairs per sequence
position, respectively, if we use a threshold δ for the probability. In other
words, there are $O(c)$ [respectively $O(d)$] base pairs (respectively aligned
pairs) whose probability is more than δ per sequence position. Under these
assumptions, it is easily seen that the time complexity for calculating the
matrix $\{p^{(prop)}_{uv}\}$ is $O(c^2 d L^2)$. Since we need $O(L^3)$ time for calculating a
base-pairing probability matrix, the total computational cost of our algorithm
is $O(L^3 + c^2 d L^2)$, whereas the computational cost of the RNA alignment and
folding (RAF) algorithm is $O(L^3 + \min(c,d)\, c d^2 L^2)$, as shown in Do et al.
(2008).
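The computation of the approximate match probability matrix with thresholded base-pairing candidates can be sketched as below. This is a minimal illustration under stated assumptions, not the CentroidAlign code: the function names, the dense list-of-lists inputs (`bp[i][j]` holding the base-pairing probability for i < j, `p_a[u][v]` the alignment match probability) and the choice to apply the same threshold to both kinds of probability are hypothetical. The default weights and threshold are the values reported in Section 3.2.

```python
def partners(bp, delta):
    """Collect base-pair candidates with probability above delta.

    Returns (right, left): right[u] lists (j, p) with u < j, left[u] lists (i, p) with i < u.
    """
    n = len(bp)
    right = [[] for _ in range(n)]
    left = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if bp[i][j] > delta:
                right[i].append((j, bp[i][j]))
                left[j].append((i, bp[i][j]))
    return right, left


def loop_probs(bp):
    """q_u = 1 - sum_{i<u} p_iu - sum_{u<j} p_uj (probability that u is unpaired)."""
    n = len(bp)
    return [1.0 - sum(bp[i][u] for i in range(u)) - sum(bp[u][j] for j in range(u + 1, n))
            for u in range(n)]


def proposed_match_probs(p_a, bp_x, bp_y, w1=0.45, w2=0.5, w3=0.05, delta=0.01):
    """Combine alignment, stacked base-pair and loop terms into p_prop[u][v]."""
    rx, lx = partners(bp_x, delta)
    ry, ly = partners(bp_y, delta)
    qx, qy = loop_probs(bp_x), loop_probs(bp_y)
    n, m = len(bp_x), len(bp_y)
    out = [[0.0] * m for _ in range(n)]
    for u in range(n):
        for v in range(m):
            paired = 0.0
            # Partners to the right of u and v whose positions are themselves aligned.
            for j, pj in rx[u]:
                for l, pl in ry[v]:
                    if p_a[j][l] > delta:
                        paired += pj * pl * p_a[j][l]
            # Partners to the left of u and v.
            for i, pi in lx[u]:
                for k, pk in ly[v]:
                    if p_a[i][k] > delta:
                        paired += pi * pk * p_a[i][k]
            out[u][v] = w1 * p_a[u][v] + w2 * paired + w3 * qx[u] * qy[v]
    return out


# Hypothetical two-base sequences, each with one candidate base pair.
p_prop = proposed_match_probs(
    p_a=[[0.9, 0.0], [0.0, 0.7]],
    bp_x=[[0.0, 0.8], [0.0, 0.0]],
    bp_y=[[0.0, 0.5], [0.0, 0.0]],
)
```

Because only base pairs above the threshold are stored per position, the inner loops visit a bounded number of candidates per cell instead of all $O(L^2)$ partner combinations; the resulting matrix can then be passed to the DP of Equation (4).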
2.3 Integrating PCT into the proposed estimator
We can easily integrate the PCT (Do et al., 2005) for alignment problems
into the proposed estimator when we have a set of sequences which are
homologous to $x$ and $x'$. For a set of (homologous) sequences $S$ and $x,x' \in S$,
we redefine the parameter space $\Theta^{(prop)}$ as
$$\Theta^{(prop)} = A(x,x') \times S(x) \times S(x') \times \prod_{z \in S\setminus\{x,x'\}} \left[A(x,z) \times A(z,x')\right]$$
and the probability distribution on the parameter space as
$$p^{(prop)}(\theta|x,x') = p^{(a)}(\theta^{(a,x,x')}|x,x')\, p^{(s)}(\theta^{(s,x)}|x)\, p^{(s)}(\theta^{(s,x')}|x') \times \prod_{z \in S\setminus\{x,x'\}} p^{(a)}(\theta^{(a,x,z)}|x,z)\, p^{(a)}(\theta^{(a,z,x')}|z,x')$$
for $\theta = (\theta^{(a,x,x')}, \theta^{(s,x)}, \theta^{(s,x')}, \{\theta^{(a,x,z)}, \theta^{(a,z,x')}\}_{z \in S\setminus\{x,x'\}})$ where $\theta^{(a,x,x')} \in A(x,x')$,
$\theta^{(s,x)} \in S(x)$ and $\theta^{(s,x')} \in S(x')$. Furthermore, we redefine the gain
function $G^{(prop)}$ by replacing $\theta^{(a,x,x')}_{ik}$ with the pseudo-count
$$\theta^{(a,x,x',S)}_{ik} = \frac{1}{|S|-1} \left( \theta^{(a,x,x')}_{ik} + \sum_{z \in S\setminus\{x,x'\}} \sum_{1 \le u \le |z|} \theta^{(a,x,z)}_{iu}\, \theta^{(a,z,x')}_{uk} \right)$$
where $|S|$ indicates the number of sequences in $S$. Note that $\theta^{(a,x,x',S)}_{ik} \in [0,1]$.
Then, we obtain a new estimator that maximizes the expected gain under
the probability distribution. In practice, we only replace $p^{(a,x,x')}_{ik}$ with the
pseudo-probability
$$p^{(a,x,x',S)}_{ik} = \frac{1}{|S|-1} \left( p^{(a,x,x')}_{ik} + \sum_{z \in S\setminus\{x,x'\}} \sum_{1 \le u \le |z|} p^{(a,x,z)}_{iu}\, p^{(a,z,x')}_{uk} \right)$$
in computation of $p^{(prop)}_{uv}$. Note that $p^{(a,x,x',S)}_{ik}$ is called the PCT of the
probability $p^{(a,x,x')}_{ik}$ (Do et al., 2005).
In a similar way, we are able to integrate the PCT of the base-pairing
probability (Kiryu et al., 2007a) into the proposed estimator.
2.4 Extension to multiple alignments
Multiple alignment is conducted using the progressive alignment algorithm
used in ProbconsRNA (Do et al., 2005) and CONTRAlign (Do et al., 2006b).
We used the proposed estimator with sufficiently large γ and the PCT for
pairwise alignment of two RNA sequences. Note that integrating PCT does
not influence the total computational time achieved using the sparsity of the
base-pairing probability and alignment match probability matrices.
3 EXPERIMENTS
We conducted all experiments in this section on our Linux cluster
machines, each of which has a 2 GHz AMD Opteron(tm) Processor
246 and 4 GB of memory.
3.1 Datasets
We used the following six datasets for benchmarking. (i) Murlet
dataset1 (Kiryu et al., 2007a), which contains 85 multiple alignments
and reference common secondary structures extracted from the
Rfam database. The number of families is 17 and there are 5
subalignments of 10 sequences for each family. (ii) Murlet dataset2
(Kiryu et al., 2007a), which contains 188 multiple alignments and
reference common secondary structures of four sequences taken
from the Hammerhead_3 ribozyme family in the Rfam database.
(iii) MXSCARNA_dataset (Tabei et al., 2008), which contains
1693 multiple alignments and their common secondary structures.
(iv) MASTR_dataset (Lindgreen et al., 2007), which contains five
families and the total number of alignments is 52. (v) Bralibase2
(Gardner et al., 2005), which contains 599 multiple alignments. (vi)
Bralibase2.1 (Wilm et al., 2006): the total number of alignments
is 18 990, each of which contains 2, 3, 5, 7, 10 or 15 RNA
sequences. Note that the Bralibase2.1 dataset does not contain
reference common secondary structures.
3.2 Compared methods and our model
We compared CentroidAlign with the following ten state-of-the-art
aligners: (i) CONTRAlign v2.01 (Do et al., 2006b), (ii)
ProbconsRNA (Do et al., 2005) (neither of these aligners considers
secondary structures), (iii) RAF v1.00 (Do et al., 2008), (iv)
MXSCARNA v2.1 (Tabei et al., 2008), (v) Murlet (Kiryu et al.,
2007a), (vi) MAFFT-xinsi 6.626 with SCARNA pair (Katoh and
Toh, 2008), (vii) R-coffee v7.81 (Moretti et al., 2008; Wilm et al.,
2008), (viii) Stemloc-AMA (Bradley et al., 2008), (ix) STRAL
v0.5.4 (Dalli et al., 2006) and (x) (t)LARA v1.3.2 (Bauer et al.,
2007). Due to limitations of computational cost, Stemloc-AMA was
only applied to the Murlet_dataset2, the MASTR_dataset and the
Bralibase2 dataset.
In the following sections, the proposed method (‘centroid_align’)
means the proposed estimator (Section 2.2.2) with PCT for the
alignment (Section 2.3). We used the McCaskill model (p(s) ) with
the same parameter in ViennaRNA-1.6.3 and the CONTRAlign
model (p(a) ) with the same parameter in CONTRAlign v2.0.1.
Moreover, we set δ = 0.01, w1 = 0.45, w2 = 0.5 and w3 = 0.05. Those
parameters were determined by testing on a small dataset.
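The probabilistic consistency transformation of Section 2.3, which is part of the configuration described above, can be sketched as follows. This is an illustrative sketch under assumptions rather than the released implementation: the function name `pct` and the dense list-of-lists representation are hypothetical, and `via` is assumed to hold, for every other sequence z in the set, the pair of match probability matrices (x against z, z against x′).

```python
def pct(p_xy, via):
    """Pseudo-probability of Section 2.3 for one sequence pair.

    p_xy: match probability matrix of x and x' (|x| rows, |x'| columns).
    via:  list of (p_xz, p_zy) matrix pairs, one per intermediate sequence z.
    The set S has len(via) + 2 members, so the sum is scaled by 1 / (|S| - 1).
    """
    n = len(p_xy)
    m = len(p_xy[0]) if n else 0
    out = [row[:] for row in p_xy]  # start from the direct probabilities
    for p_xz, p_zy in via:
        for i in range(n):
            for u, p_iu in enumerate(p_xz[i]):
                if p_iu == 0.0:
                    continue  # skipping zeros mirrors the sparsity used in the article
                for k in range(m):
                    out[i][k] += p_iu * p_zy[u][k]
    scale = 1.0 / (len(via) + 1)
    return [[value * scale for value in row] for row in out]


# Hypothetical one-base sequences: direct evidence 0.5, evidence through z is 1.0 * 0.9.
transformed = pct([[0.5]], via=[([[1.0]], [[0.9]])])
unchanged = pct([[0.2, 0.8]], via=[])
```

In the example, support routed through the third sequence raises the match probability from 0.5 to 0.7; with no intermediate sequences the matrix is returned unchanged.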
Table 1. Murlet dataset1
Aligner          n     SPS   SCI   SEN   PPV   MCC   TIME
contralign       85    0.77  0.41  0.58  0.77  0.65  149
probcons         85    0.76  0.38  0.57  0.79  0.64  71
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   85    0.78  0.48  0.63  0.75  0.68  196
lara             85    0.75  0.51  0.62  0.73  0.66  12901
mafft-xinsi      85    0.79  0.53  0.67  0.75  0.70  510
mlocarna         85    0.72  0.60  0.66  0.73  0.68  38645
murlet           85    0.77  0.43  0.63  0.76  0.68  59506
mxscarna         85    0.75  0.44  0.64  0.75  0.67  201
raf              85    0.75  0.46  0.69  0.72  0.70  6167
rcoffee          85    0.76  0.42  0.59  0.75  0.65  487
stral            85    0.67  0.37  0.48  0.72  0.56  145
The aligners above the dashed line cannot consider secondary structures when aligning
RNA sequences, whereas the ones below the dashed line can consider secondary structures.
n means the number of successfully predicted alignments and TIME means the total
computational time in seconds. The bold values indicate the best score in each
evaluation measure and the fastest times in aligners above and below the dashed line. We
used Centroid(Ali)Fold (Hamada et al., 2009a; Sato et al., 2009) to compute the common
secondary structure from an alignment for calculating SEN, PPV and MCC. See the
Supplementary Material for the results using RNAalifold (Bernhart et al., 2008).

Table 2. Murlet dataset2
Aligner          n     SPS   SCI   SEN   PPV   MCC   TIME
contralign       188   0.84  0.63  0.76  0.84  0.78  6
probcons         188   0.81  0.60  0.79  0.84  0.80  6
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   188   0.88  0.82  0.89  0.89  0.88  11
lara             188   0.81  0.85  0.89  0.84  0.86  486
mafft-xinsi      188   0.89  0.90  0.95  0.88  0.91  80
mlocarna         188   0.84  0.94  0.93  0.85  0.89  100
murlet           188   0.84  0.76  0.88  0.86  0.86  838
mxscarna         188   0.86  0.83  0.93  0.88  0.90  10
raf              188   0.86  0.88  0.95  0.89  0.91  68
rcoffee          188   0.84  0.73  0.84  0.87  0.85  328
stemloc-ama      188   0.85  0.75  0.88  0.87  0.87  37557
stral            188   0.77  0.55  0.64  0.71  0.65  13
See the footnote in Table 1.

3.3 Evaluation measures
In order to evaluate a predicted alignment from each aligner, we
used the following evaluation measures used in previous research.
(i) SPS (Thompson et al., 1999); we used compalignp, which is
available from the Bralibase2.1 web site (http://www.biophys.uni-duesseldorf.de/bralibase), for computing SPS of each alignment.
(ii) Structure conservation index (SCI) (Washietl et al., 2005) defined
by $SCI = E_A / \bar{E}$, where $E_A$ is the consensus minimum free energy
(MFE) computed by RNAalifold and $\bar{E}$ is the average MFE over
all RNA sequences in a given alignment; we used scif, which is
also available from the Bralibase2.1 database, for calculating SCI.
(iii) SEN, PPV and Matthew’s correlation coefficient (MCC) for a
predicted common secondary structure which are defined as follows:
SEN = TP/(TP + FN), PPV = TP/(TP + FP) and
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of correctly predicted base pairs (true
positives), TN is the number of base pairs which were correctly
predicted as non-matching (true negatives), FN is the number of
base pairs in the correct structure which were not predicted (false
negatives) and FP is the number of incorrectly predicted base pairs
(false positives). The reference secondary structure of each sequence
in alignments is given by a common secondary structure mapped
to the sequence. We used RNAalifold (Bernhart et al., 2008) and
Centroid(Ali)Fold (Hamada et al., 2009a; Sato et al., 2009) for
common secondary structure prediction from a multiple alignment.

Table 3. MXSCARNA dataset
Aligner          n     SPS   SCI   SEN   PPV   MCC   TIME
contralign       1693  0.79  0.58  0.64  0.67  0.63  707
probcons         1693  0.78  0.54  0.63  0.66  0.63  487
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   1693  0.80  0.67  0.70  0.69  0.68  2000
lara             1693  0.77  0.71  0.70  0.68  0.68  61694
mafft-xinsi      1693  0.80  0.71  0.72  0.69  0.69  4316
mlocarna         1693  0.77  0.80  0.75  0.68  0.70  468792
murlet           1693  0.79  0.63  0.71  0.69  0.69  500469
mxscarna         1693  0.78  0.69  0.73  0.70  0.70  2540
raf              1693  0.79  0.72  0.75  0.70  0.71  41078
rcoffee          1693  0.78  0.61  0.67  0.68  0.66  4822
stral            1693  0.74  0.57  0.61  0.63  0.60  2021
See the footnote in Table 1.

Table 4. MASTR dataset
Aligner          n     SPS   SCI   SEN   PPV   MCC   TIME
contralign       52    0.87  0.72  0.64  0.77  0.69  31
probcons         52    0.87  0.72  0.64  0.78  0.69  18
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
centroid_align   52    0.88  0.75  0.65  0.77  0.70  47
lara             52    0.86  0.77  0.69  0.80  0.73  3620
mafft-xinsi      52    0.88  0.78  0.68  0.78  0.71  133
mlocarna         52    0.85  0.80  0.68  0.75  0.71  2117
murlet           52    0.87  0.74  0.67  0.78  0.71  4149
mxscarna         52    0.86  0.73  0.66  0.77  0.70  50
raf              52    0.86  0.74  0.70  0.77  0.73  272
rcoffee          52    0.87  0.74  0.66  0.78  0.70  223
stemloc-ama      51    0.86  0.72  0.63  0.77  0.68  353453
stral            52    0.81  0.70  0.61  0.75  0.65  32
See the footnote in Table 1.

3.4 Results of computational experiments
We show the results of benchmarking on Murlet dataset1, Murlet
dataset2, MXSCARNA dataset, MASTR dataset, Bralibase2 dataset
and Bralibase2.1 dataset in Tables 1–6, respectively. These results
show that CentroidAlign achieved a great balance of speed and
accuracy compared with existing approaches. More precisely,
(i) CentroidAlign achieved first or second best SPS out of all
the aligners in all the benchmark datasets. These are desirable
results because we designed our estimator in order to maximize
expected SPS. Moreover, CentroidAlign was one of the fastest
aligners out of all the aligners that consider secondary structures
(i.e. the aligners below the dashed line in each table), which confirms
the effectiveness of our approximations in the MEA estimator.
However, CentroidAlign sometimes gives worse SCI, SEN, PPV
and MCC (i.e. evaluation measures related to common secondary
structures) than some tools such as mafft-xinsi, raf, lara and
mlocarna, which shows CentroidAlign is not the optimal estimator
for those evaluation measures. (ii) On most evaluation measures,
CentroidAlign is better than CONTRAlign and ProbconsRNA, both
of which cannot consider secondary structures of input sequences
at all. However, CentroidAlign is 2–5 times slower than those
tools. (iii) CentroidAlign has a better SPS and is much faster than
Stemloc-AMA (Bradley et al., 2008) (which has several similarities
with CentroidAlign; See Section A in the Supplementary Material).
This is because Stemloc-AMA uses the marginalized probability
distribution given by Sankoff model while we used approximations
to the distribution.
We also tried other reasonable approximations to the MEA
estimator, as described in Section B in the Supplementary Material,
and their performances were consistently worse than those of the
estimator proposed in the main text.
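The structure-related measures reported in the tables (SEN, PPV and MCC, as defined in Section 3.3) can be computed from base-pair counts as in the sketch below. The function name `structure_measures` and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration; the article does not state how degenerate cases were handled, and the counts in the example are invented.

```python
import math


def structure_measures(tp, tn, fp, fn):
    """Return (SEN, PPV, MCC) for base-pair counts TP, TN, FP and FN.

    Undefined ratios (zero denominators) are reported as 0.0, which is an
    assumption of this sketch rather than a convention taken from the article.
    """
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, ppv, mcc


# Hypothetical counts for one predicted common secondary structure.
sen, ppv, mcc = structure_measures(tp=8, tn=80, fp=2, fn=4)
```

SEN and PPV ignore true negatives, whereas MCC combines all four counts, which is why the tables report it alongside the other two.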
4 DISCUSSION AND FUTURE WORK
In this article, we have proposed a theoretically sound estimator
for multiple alignment of RNA sequences, based on maximizing
expected accuracy. We obtained the proposed estimator by
approximating the MEA estimator that maximizes the expected
SPS under the marginalized distribution of the Sankoff model. Our
estimator considers all the suboptimal secondary structures and all
the suboptimal pairwise alignments of input sequences.
Stemloc-AMA (Bradley et al., 2008) also adopts an MEA
estimator that is different from the one in this article. The differences
between Stemloc-AMA and CentroidAlign can be summarized
as follows: (i) Stemloc-AMA uses the AMA estimator (Schwartz
et al., 2005) as the gain function, whereas CentroidAlign uses
a different gain function described in Section 2.2.2. The relation
between the gain functions of Stemloc-AMA and CentroidAlign
is described in the Supplementary Materials. (ii) The probability
distribution in Stemloc-AMA for pairwise alignments of two RNA
sequences is the marginalized probability distribution obtained by
the Sankoff model, that is, Equation (3), whereas CentroidAlign
uses an approximation to the distribution. Therefore, Stemloc-AMA
has a huge computational cost, which was demonstrated in
our experiments. The idea used in this article will be applied to
finding approximations of the AMA estimator used in Stemloc-AMA,
although the approximations will be more complicated than
the ones in this article.
LARA (Bauer et al., 2007) is able to take into account base-pairing
probabilities and subsequently uses an approach similar to Probcons
and CentroidAlign to compute the multiple alignment. The main
difference between LARA and CentroidAlign is that LARA uses
an ILP approach rather than the DP used in CentroidAlign. STRAL
(Dalli et al., 2006) also conducts sequential alignment whose score
considers the potential secondary structures by using base-pairing
probabilities. However, the score of STRAL is less elaborate than
that of CentroidAlign (e.g. STRAL does not use alignment
match probabilities), and our computational experiments indicate
that STRAL is less accurate than CentroidAlign although it is as
fast as CentroidAlign.
Do et al. (2008) proposed excellent approximations of the
Sankoff algorithm, which were implemented in RAF. However,
the predictive space used in RAF remains the space of structural
alignments of RNA sequences. So, RAF is theoretically slower
than the proposed method, a fact confirmed in our experiments.
Table 5. BRAlibase2 dataset
Aligner         n    SPS   SCI   SEN   PPV   MCC   TIME
contralign      599  0.83  0.79  0.74  0.74  0.74  147
probcons        599  0.83  0.77  0.73  0.74  0.73  108
centroid_align  599  0.84  0.83  0.77  0.75  0.76  416
lara            599  0.85  0.90  0.80  0.76  0.77  13710
mafft-xinsi     599  0.84  0.87  0.79  0.75  0.77  870
mlocarna        599  0.85  0.94  0.82  0.76  0.79  43697
murlet          599  0.83  0.82  0.78  0.75  0.76  33237
mxscarna        599  0.83  0.86  0.80  0.76  0.77  512
raf             599  0.84  0.88  0.83  0.76  0.79  2459
rcoffee         599  0.82  0.78  0.73  0.74  0.73  1451
stemloc-ama     597  0.77  0.72  0.67  0.75  0.69  1699068
stral           599  0.81  0.80  0.73  0.73  0.73  421
Stemloc-AMA failed due to the small size of the dataset. See the footnote in Table 1.
Table 6. BRAlibase2.1 dataset
Aligner         k2          k3          k5          k7          k10         k15         n      TIME
                SPS   SCI   SPS   SCI   SPS   SCI   SPS   SCI   SPS   SCI   SPS   SCI
contralign      0.85  0.84  0.86  0.79  0.88  0.78  0.89  0.77  0.90  0.76  0.91  0.75  18990  4110
probcons        0.84  0.83  0.86  0.77  0.88  0.76  0.89  0.75  0.90  0.73  0.91  0.72  18990  2839
centroid_align  0.86  0.89  0.87  0.84  0.89  0.83  0.90  0.81  0.91  0.80  0.92  0.79  18990  8499
lara            0.85  0.95  0.87  0.89  0.90  0.88  0.91  0.87  0.92  0.85  0.93  0.86  18990  377864
mafft-xinsi     0.86  0.91  0.88  0.87  0.90  0.87  0.91  0.86  0.92  0.85  0.93  0.85  18990  20798
mlocarna        0.87  0.96  0.89  0.93  0.90  0.92  0.90  0.90  0.91  0.88  0.92  0.88  18990  1240871
murlet          0.85  0.88  0.87  0.83  0.89  0.81  0.90  0.80  0.91  0.78  0.92  0.76  18980  1371504
mxscarna        0.85  0.91  0.87  0.86  0.88  0.83  0.89  0.81  0.90  0.78  0.91  0.77  18990  9778
raf             0.86  0.95  0.88  0.90  0.89  0.87  0.90  0.83  0.91  0.79  0.91  0.76  18990  67363
rcoffee         0.83  0.82  0.85  0.79  0.88  0.78  0.89  0.77  0.90  0.76  0.91  0.75  18990  38847
stral           0.82  0.84  0.84  0.79  0.85  0.77  0.85  0.75  0.85  0.73  0.85  0.72  18990  7963
This dataset does not contain reference secondary structures, so only SPS and SCI are shown in the table. Murlet failed due to the small size of the dataset. The bold values indicate
the best score in each evaluation measure and the fastest times in aligners above and below the dashed line.
Note that RAF can predict common secondary structures as well as
alignments, whereas CentroidAlign cannot.
An advantage of the proposed method is that it can be easily
extended to local alignment problems. In that case, we should use
probabilities p(a) obtained using local alignment models such as
ProDA (Phuong et al., 2006) and LAST (http://last.cbrc.jp/), and
conduct Smith–Waterman-style DP (Smith and Waterman, 1981).
Instead of Equation (4), Smith–Waterman DP uses
\[
M_{u,v} = \max
\begin{cases}
0 \\
M_{u-1,v-1} + (\gamma+1)\,p_{uv}^{(\mathrm{prop})} - 1 \\
M_{u-1,v} \\
M_{u,v-1}
\end{cases}.
\]
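The local recurrence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CentroidAlign implementation: the function name `local_mea_align`, the 0-based matrix `p` (standing in for the match probabilities of the proposed estimator) and the traceback convention are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def local_mea_align(
    p: List[List[float]], gamma: float
) -> Tuple[float, List[Tuple[int, int]]]:
    """Smith-Waterman-style DP over a match-probability matrix.

    p[u][v] is an assumed stand-in for the probability that position u
    of the first sequence is aligned with position v of the second
    (0-based). Returns the best local score and the aligned pairs.
    """
    n = len(p)
    m = len(p[0]) if n else 0
    # M carries an extra zero row and column for the empty prefixes.
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for u in range(1, n + 1):
        for v in range(1, m + 1):
            diag = M[u - 1][v - 1] + (gamma + 1.0) * p[u - 1][v - 1] - 1.0
            M[u][v] = max(0.0, diag, M[u - 1][v], M[u][v - 1])
            if M[u][v] > best:  # keep the first cell that reaches the maximum
                best, best_cell = M[u][v], (u, v)

    # Trace back from the best cell until the score drops to zero.
    pairs: List[Tuple[int, int]] = []
    u, v = best_cell
    while u > 0 and v > 0 and M[u][v] > 0.0:
        diag = M[u - 1][v - 1] + (gamma + 1.0) * p[u - 1][v - 1] - 1.0
        if M[u][v] == diag:
            pairs.append((u - 1, v - 1))
            u, v = u - 1, v - 1
        elif M[u][v] == M[u - 1][v]:
            u -= 1
        else:
            v -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Toy probabilities, invented for illustration only.
    p_demo = [
        [0.90, 0.05, 0.00],
        [0.05, 0.80, 0.10],
        [0.00, 0.10, 0.10],
    ]
    for gamma in (1.0, 20.0):
        score, pairs = local_mea_align(p_demo, gamma)
        print(f"gamma={gamma}: score={score:.2f}, pairs={pairs}")
```

With the toy matrix, γ = 1 returns only the two high-probability pairs, whereas γ = 20 also admits the third diagonal pair whose probability is 0.1. Dropping the 0 option from the `max` gives a global variant of the same recurrence.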
While we used a large γ to maximize the expected SPS in this
article, in Smith–Waterman DP the γ parameter will be important for
adjusting the sensitivity and specificity of aligned bases. We should
employ Rfold (Kiryu et al., 2008) or RNAplfold (Bernhart et al.,
2006) for calculating the banded base-pairing probability matrix.
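One way to see how γ controls this trade-off, assuming the local recurrence given above: aligning positions u and v raises the score only when the diagonal increment is positive, which yields a probability threshold that depends on γ alone.

```latex
\[
(\gamma+1)\,p_{uv}^{(\mathrm{prop})} - 1 > 0
\quad\Longleftrightarrow\quad
p_{uv}^{(\mathrm{prop})} > \frac{1}{\gamma+1}.
\]
```

For example, γ = 1 admits only pairs with probability above 0.5, whereas γ = 9 lowers the threshold to 0.1. A small γ therefore favours specificity (fewer, more reliable aligned bases), and a large γ favours sensitivity.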
We determined the parameters w1, w2 and w3 of $p_{uv}^{(\mathrm{prop})}$ in the
proposed estimator by an ad hoc procedure in this study. Therefore,
the values used might not be optimal. However, it is possible
to learn these parameters using the max-margin model (Do et al.,
2008).
ACKNOWLEDGMENTS
The authors thank L. E. Carvalho and C. E. Lawrence for
valuable comments. The authors are also grateful to our colleagues
at the Computational Biology Research Center (CBRC) for
fruitful discussions. Also, the constructive comments of the three
anonymous reviewers are greatly appreciated.
Funding: ‘Functional RNA Project’ funded by the New Energy and
Industrial Technology Development Organization (NEDO) of Japan;
Grant-in-Aid for Scientific Research on Priority Areas ‘Comparative
Genomics’ from the Ministry of Education, Culture, Sports, Science
and Technology of Japan.
Conflict of Interest: none declared.
REFERENCES
Anwar,M. et al. (2006) Identification of consensus RNA secondary structures using
suffix arrays. BMC Bioinformatics, 7, 244.
Bauer,M. et al. (2007) Accurate multiple sequence-structure alignment of RNA
sequences using combinatorial optimization. BMC Bioinformatics, 8, 271.
Bernhart,S. et al. (2008) RNAalifold: improved consensus structure prediction for RNA
alignments. BMC Bioinformatics, 9, 474.
Bernhart,S.H. et al. (2006) Local RNA base pairing probabilities in large sequences.
Bioinformatics, 22, 614–615.
Bradley,R.K. et al. (2008) Specific alignment of structured RNA: stochastic grammars
and sequence annealing. Bioinformatics, 24, 2677–2683.
Bradley,R.K. et al. (2009) Fast statistical alignment. PLoS Comput. Biol., 5,
e1000392.
Carvalho,L. and Lawrence,C. (2008) Centroid estimation in discrete high-dimensional
spaces with applications in biology. Proc. Natl Acad. Sci. USA, 105,
3209–3214.
Dalli,D. et al. (2006) STRAL: progressive alignment of non-coding RNA using base
pairing probability vectors in quadratic time. Bioinformatics, 22, 1593–1599.
Ding,Y. et al. (2005) RNA secondary structure prediction by centroids in a Boltzmann
weighted ensemble. RNA, 11, 1157–1166.
Ding,Y. et al. (2006) Clustering of RNA secondary structures with application to
messenger RNAs. J. Mol. Biol., 359, 554–571.
Do,C. et al. (2005) ProbCons: Probabilistic consistency-based multiple sequence
alignment. Genome Res., 15, 330–340.
Do,C. et al. (2006a) CONTRAfold: RNA secondary structure prediction without
physics-based models. Bioinformatics, 22, e90–e98.
Do,C.B. et al. (2006b) Contralign: discriminative training for protein sequence
alignment. In Apostolico,A. et al., eds, RECOMB, vol. 3909 of Lecture Notes in
Computer Science, Springer, Berlin/Heidelberg, pp. 160–174.
Do,C. et al. (2008) A max-margin model for efficient simultaneous alignment and
folding of RNA sequences. Bioinformatics, 24, i68–i76.
Dowell,R. and Eddy,S. (2004) Evaluation of several lightweight stochastic context-free
grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71.
Dowell,R. and Eddy,S. (2006) Efficient pairwise RNA structure prediction and
alignment using sequence alignment constraints. BMC Bioinformatics, 7, 400.
Durbin,R. et al. (1998) Biological Sequence Analysis. Cambridge University Press,
Cambridge.
Gardner,P.P. et al. (2005) A benchmark of multiple sequence alignment programs upon
structural RNAs. Nucleic Acids Res., 33, 2433–2439.
Hamada,M. et al. (2009a) Prediction of RNA secondary structure using generalized
centroid estimators. Bioinformatics, 25, 465–473.
Hamada,M. et al. (2009b) Predictions of RNA secondary structure by combining
homologous sequence information. Bioinformatics, 25, i330–i338.
Harmanci,A. et al. (2008) PARTS: probabilistic alignment for RNA joinT secondary
structure prediction. Nucleic Acids Res., 36, 2406–2417.
Havgaard,J. et al. (2005) Pairwise local structural alignment of RNA sequences with
sequence similarity less than 40%. Bioinformatics, 21, 1815–1824.
Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC
Bioinformatics, 6, 73.
Holmes,I. and Durbin,R. (1998) Dynamic programming alignment accuracy. J. Comput.
Biol., 5, 493–504.
Katoh,K. and Toh,H. (2008) Improved accuracy of multiple ncRNA alignment
by incorporating structural information into a MAFFT-based framework. BMC
Bioinformatics, 9, 212.
Kiryu,H. et al. (2007a) Murlet: a practical multiple alignment tool for structural RNA
sequences. Bioinformatics, 23, 1588–1598.
Kiryu,H. et al. (2007b) Robust prediction of consensus secondary structures using
averaged base pairing probability matrices. Bioinformatics, 23, 434–441.
Kiryu,H. et al. (2008) Rfold: an exact algorithm for computing local base pairing
probabilities. Bioinformatics, 24, 367–373.
Lindgreen,S. et al. (2007) MASTR: multiple alignment and structure prediction of
non-coding RNAs using simulated annealing. Bioinformatics, 23, 3304–3311.
Lunter,G. et al. (2008) Uncertainty in homology inferences: assessing and improving
genomic sequence alignment. Genome Res., 18, 298–309.
Mathews,D. (2005) Predicting a set of minimal free energy RNA secondary structures
common to two sequences. Bioinformatics, 21, 2246–2253.
Mathews,D. and Turner,D. (2002) Dynalign: an algorithm for finding the secondary
structure common to two RNA sequences. J. Mol. Biol., 317, 191–203.
McCaskill,J.S. (1990) The equilibrium partition function and base pair binding
probabilities for RNA secondary structure. Biopolymers, 29, 1105–1119.
Miyazawa,S. (1995) A reliable sequence alignment method based on probabilities of
residue correspondences. Protein Eng., 8, 999–1009.
Moretti,S. et al. (2008) R-Coffee: a web server for accurately aligning noncoding RNA
sequences. Nucleic Acids Res., 36, W10–W13.
Needleman,S. and Wunsch,C. (1970) A general method applicable to the search
for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48,
443–453.
Phuong,T. et al. (2006) Multiple alignment of protein sequences with repeats and
rearrangements. Nucleic Acids Res., 34, 5932–5942.
Roshan,U. and Livesay,D. (2006) Probalign: multiple sequence alignment using
partition function posterior probabilities. Bioinformatics, 22, 2715–2721.
Sankoff,D. (1985) Simultaneous solution of the RNA folding alignment and
protosequence problems. SIAM J. Appl. Math, pp. 810–825.
Sato,K. et al. (2009) CENTROIDFOLD: a web server for RNA secondary structure
prediction. Nucleic Acids Res., 37, W277–W280.
Schwartz,A.S. et al. (2005) Alignment metric accuracy. Available at
http://arxiv.org/abs/q-bio.QM/0510052.
Seemann,S. et al. (2008) Unifying evolutionary and thermodynamic information for
RNA folding of multiple alignments. Nucleic Acids Res., 36, 6355–6362.
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular
subsequences. J. Mol. Biol., 147, 195–197.
Tabei,Y. et al. (2008) A fast structural multiple alignment method for long RNA
sequences. BMC Bioinformatics, 9, 33.
Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence
alignment programs. Nucleic Acids Res., 27, 2682–2690.
Washietl,S. et al. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl
Acad. Sci. USA, 102, 2454–2459.
Webb-Robertson,B.J. et al. (2008) Measuring global credibility with application to local
sequence alignment. PLoS Comput. Biol., 4, e1000077.
Wilm,A. et al. (2006) An enhanced RNA alignment benchmark for sequence alignment
programs. Algorithms Mol. Biol., 1, 19.
Wilm,A. et al. (2008) R-Coffee: a method for multiple alignment of non-coding RNA.
Nucleic Acids Res., 36, e52.
Wong,K.M. et al. (2008) Alignment uncertainty and genomic analysis. Science, 319,
473–476.