BIOINFORMATICS
ORIGINAL PAPER
Vol. 24 no. 11 2008, pages 1339–1343
doi:10.1093/bioinformatics/btn130
Sequence analysis
Simple is beautiful: a straightforward approach to improve the
delineation of true and false positives in PSI-BLAST searches
Marianne M. Lee1, Michael K. Chan1,2,* and Ralf Bundschuh1,3,*
1 The Ohio State Biophysics Program, 2 Departments of Biochemistry and Chemistry, Ohio State University,
484 W 12th Av. and 3 Department of Physics, Ohio State University, 191 W Woodruff Av., Columbus OH
43210-1117, USA
Received on January 5, 2008; revised on February 28, 2008; accepted on April 7, 2008
Advance Access publication April 10, 2008
Associate Editor: Thomas Lengauer
ABSTRACT
Motivation: The deluge of biological information from different
genomic initiatives and the rapid advancement in biotechnologies
have made bioinformatics tools an integral part of modern biology.
Among the widely used sequence alignment tools, BLAST and PSI-BLAST are arguably the most popular. PSI-BLAST, which uses an
iterative profile position specific score matrix (PSSM)-based search
strategy, is more sensitive than BLAST in detecting weak homologies, thus making it suitable for remote homolog detection.
Many refinements have been made to improve PSI-BLAST, and its
computational efficiency and high specificity have been much
touted. Nevertheless, corruption of its profile via the incorporation
of false positive sequences remains a major challenge.
Results: We have developed a simple and elegant approach to
resolve the problem of model corruption in PSI-BLAST searches. We
hypothesized that combining results from the first (least-corrupted)
profile with results from later (most sensitive) iterations of PSI-BLAST
provides a better discriminator for true and false hits. Accordingly,
we have derived a formula that utilizes the E-values from these two
PSI-BLAST iterations to obtain a figure of merit for rank-ordering the
hits. Our verification results based on a ‘gold-standard’ test set
indicate that this figure of merit does indeed delineate true positives
from false positives better than PSI-BLAST E-values. Perhaps what
is most notable about this strategy is that it is simple and
straightforward to implement.
Contact: [email protected]
1 INTRODUCTION
Sequence alignment is one of the most widely used techniques
in computational biology, particularly for functional annotation of non-characterized sequences. The power of this
approach is directly dependent on the ability of sequence
alignment tools to detect weak homologies.
*To whom correspondence should be addressed.
BLAST (Altschul et al., 1990) is arguably the most frequently
used sequence alignment tool. PSI-BLAST (Altschul et al.,
1997), an extended version of BLAST, is similarly popular.
It is more effective and sensitive in detecting distant homologs
due to its iterative profile-based search strategy (Gotoh, 1996;
Thompson et al., 1999). In its first iteration, PSI-BLAST
identifies related sequences that meet a specific inclusion
threshold and utilizes these sequences to generate a profile or
position specific score matrix (PSSM) to be used for the next
iterative search. The PSSM is then used to align and score
sequences in the database to search for new statistically
significant hits. This process is performed iteratively until the
model converges or no new statistically significant sequences are
found. Previous studies have shown that iterative approaches,
such as that used in PSI-BLAST, are more effective in finding
additional homologs. However, a major potential problem of
this approach is model corruption. With each subsequent
iteration, there exists an increased probability of including a
non-homologous sequence that meets the threshold for PSSM
inclusion, and whose presence can, in turn, lead to the
incorporation of other false positives. Such model corruption
is particularly treacherous if the non-homologous sequence
belongs to a large family. A naive approach to overcome this
problem is to set a stringent inclusion threshold, but the trade-off
is a loss of sensitivity, especially for the more remote homologs,
thus diminishing the strength and utility of PSI-BLAST. Hence,
there is general interest in developing novel strategies to distinguish
false positive sequences from true hits and eliminate them early in
homology detection searches.
Many different approaches have been studied and proposed to
improve the discrimination of true and false positives. These
methods typically incorporate extrinsic information into the
search, such as (predicted) structural (Bowie et al., 1991, 1996;
Gough et al., 2001; Kann et al., 2005; Lee et al., 2007) or
literature data (Becker et al., 2003; Kaplan et al., 2003; Shatkay
et al., 2007), or use completely different techniques such as SAM
(Eddy, 1996; Gribskov et al., 1987; Grundy, 1998; Hughey and
Krogh, 1996; Karplus et al., 1998), or a combination of such
other techniques (Alam et al., 2004). Despite their better performance in remote homology detection, these approaches are
computationally more expensive than PSI-BLAST itself, thus
hindering their wide acceptance by the user community.
Herein, we describe a new methodology to improve the
statistical assessment of hits returned from a PSI-BLAST
search that yields better delineation of true and false positives.
The appeal of our approach is its simplicity: it can be integrated
easily into the PSI-BLAST algorithm without incurring significant
computational costs, while at the same time improving the
accuracy of its searches.
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
The remainder of this article is organized as follows. In
Section 2 we first outline the key idea of our approach. Then,
we review the basics of BLAST significance assessment, and use
these basics to heuristically derive a method to combine outputs
from different rounds of PSI-BLAST into one aggregate result.
In Section 3, we verify that this method indeed improves search
performance by applying it to the ‘gold standard’ database used
to test PSI-BLAST by its own authors (Schaffer et al., 2001).
We summarize our findings in Section 4.
2 METHODS
2.1 Key idea
The goal of our approach is to improve on the high sensitivity of
PSI-BLAST in terms of remote homolog detection without sacrificing
computational efficiency—one of the main reasons for PSI-BLAST’s
popularity. Our approach is based on a well-known property of
PSI-BLAST searches: in its early iterations PSI-BLAST is not very
sensitive, but very specific. Upon performing more and more iterations,
sensitivity increases while specificity suffers from what is called ‘model
corruption’. This behavior can be understood as follows: As an iterative
tool, PSI-BLAST attempts to build better and better models of the
homologs to the original query sequence. The first round of PSI-BLAST,
which is essentially BLAST, finds only relatively close homologs. These
close homologs are then used to generate a profile for the next iteration
that will be able to find more remote homologs. Iterating this process in
theory yields progressively better models and thus finds more and more
weakly related homologs to the original query. However, in practice,
non-homologous false positives are frequently incorporated into the
model at some point during this iterative process. Once this happens, the
model is ‘corrupted’ and completely loses its specificity for the true
homologs of the original query, resulting in the false identification of
large numbers of non-homologous sequences as putative true homologs.
For this reason, it is recommended to run at most five to six iterations
of PSI-BLAST instead of running PSI-BLAST until no more new
sequences are identified (Schaffer et al., 2001). However, even after five
or six iterations, it is often common to find that some non-homologous
sequences are already incorporated. Thus, there is a trade-off between
an increased sensitivity and an increased possibility for model corruption
as more iterations are performed.
The key idea of our approach is that by combining the results from
different PSI-BLAST iterations, the benefits of low corruption from the
early rounds, and the benefits of high sensitivity from the later rounds
can both be realized. More precisely, we expect that a true homolog
identified in a later iteration should have already revealed its identity as
a homolog in the second round (the one using the first profile generated
by PSI-BLAST and least likely corrupted). This distant homolog may
not have had the level of homology that would have been statistically
significant in the second round, but it should at least exhibit some
homology. On the other hand, a non-homolog reported in a later
round due to model corruption should show a much lower degree of
homology in the second round than a true homolog. We thus
hypothesize that ‘benchmarking’ hits from later iterations against hits
from iteration two where the model is least corrupted should improve
the accuracy of a PSI-BLAST search.
The beauty of our strategy is that it requires almost no additional
computational effort—if PSI-BLAST is iterated to, say, the fifth round,
the results for the second round will have been calculated anyway. All
that is required is to store these results and combine them with the
results of the last round. The obvious question that has to be resolved,
though, is precisely how to combine the results from different rounds of
PSI-BLAST. The remainder of this section will be devoted to providing
an explicit formula for combining the E-values reported by PSI-BLAST
in the two rounds into a combined figure of merit. The derivation of
this formula will be somewhat heuristic in nature. However, an
evaluation of this method on a ‘gold standard’ database presented in
the next section will show that our method is indeed very successful in
combining the benefits from the different rounds of PSI-BLAST in spite
of the somewhat heuristic nature of its derivation.
2.2 Review of BLAST statistics
In order to derive a way to combine the E-values reported by PSI-BLAST for the different rounds into a single figure of merit, it is useful
to review how these E-values are actually calculated. If a query
sequence with L amino acids is aligned against a random subject
sequence of M amino acids using the Smith–Waterman (Smith and
Waterman, 1981) local alignment algorithm,¹ the local alignment score
$\Sigma$ can be considered a random variable. In the absence of gaps, this
random variable has been proven (Karlin and Altschul, 1990, 1993;
Karlin and Dembo, 1992) to follow an extreme value distribution
$$\Pr\{\Sigma < S\} = \exp\left(-K L M e^{-\lambda S}\right) \tag{1}$$
if the query and subject sequence are sufficiently long. Here, $\lambda$ and $K$
are two parameters that depend in a known way on the scoring system,
the query sequence and the random ensemble used to choose the
random subject sequences.
Even in the presence of gaps, heuristic arguments and numerical
studies (Altschul and Gish, 1996; Collins et al., 1988; Mott, 1992;
Schaffer et al., 2001; Smith et al., 1985; Vingron and Waterman, 1994;
Waterman and Vingron, 1994) confirm that the local alignment score is
still distributed according to the extreme value distribution Equation
(1), albeit with modified values of the two parameters $\lambda$ and $K$. Thus,
the P-value of an alignment score, or the probability of obtaining an
alignment score of magnitude S or better in one alignment against a
random subject sequence, is given by
$$P = 1 - \exp\left(-K L M e^{-\lambda S}\right) \tag{2}$$
BLAST and PSI-BLAST report E-values, or the expected number of
random hits of score S or higher, instead of P-values. These two
quantities are different since the P-value refers to the probability of
having at least one hit at the prescribed significance level, while the
E-value quantifies the expected number of these hits. The general
relationship between these two quantities for random variables with
exponential tails is the Poisson formula
$$P = 1 - e^{-E} \tag{3}$$
By comparing with Equation (2) we find
$$E_{\mathrm{seq}} = -\log(1 - P) = K L M e^{-\lambda S} \tag{4}$$
It is important to note that this is the E-value for one single sequence
comparison, which we indicate by the subscript ‘seq’. The convenient
property of E-values is that the Bonferroni correction for repeating the
alignment for every sequence in a database of N amino acids, and thus
roughly N/M sequences, just manifests itself in a multiplication of the
expected number of hits for a single sequence by the number of
sequences, i.e.
$$E_{\mathrm{db}} = \frac{N}{M} E_{\mathrm{seq}} = K L N e^{-\lambda S} \tag{5}$$
It is this latter E-value that BLAST and PSI-BLAST report as the
statistical significance of their hits.
¹BLAST approximates the Smith–Waterman algorithm with some
heuristics, but the statistics of large scores are not affected by these
heuristics.
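As an illustration of Equations (2) to (5), the following minimal Python sketch evaluates the single-comparison E-value, the corresponding P-value and the database E-value for a given score. The function names are ours, and the numerical values of lam, K, L, M and N used in the usage note are arbitrary placeholders chosen for illustration; they are not the parameters used by BLAST or PSI-BLAST for any particular scoring system.

```python
import math


def e_value_single(score, lam, K, L, M):
    """Equation (4): expected number of chance hits with score >= `score`
    when a query of length L is aligned against one subject of length M."""
    return K * L * M * math.exp(-lam * score)


def p_value_single(score, lam, K, L, M):
    """Equations (2) and (3): P = 1 - exp(-E_seq).
    expm1 keeps precision when E_seq is very small."""
    return -math.expm1(-e_value_single(score, lam, K, L, M))


def e_value_database(score, lam, K, L, N):
    """Equation (5): E_db = (N / M) * E_seq = K * L * N * exp(-lam * score),
    where N is the database size in amino acids."""
    return K * L * N * math.exp(-lam * score)
```

For example, with the placeholder values lam = 0.267, K = 0.041, L = 250, M = 300 and N = 1 000 000, e_value_database(50.0, 0.267, 0.041, 250, 1_000_000) equals e_value_single(50.0, 0.267, 0.041, 250, 300) multiplied by N/M, as Equation (5) requires.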
2.3 Combining two E-values
In order to combine results from different rounds, we make the
assumption that the different rounds of PSI-BLAST are statistically
independent. At first glance, making this assumption seems heuristic
since the models used in all PSI-BLAST iterations describe the same
query sequence and thus are related. It is worth pointing out, however,
that precisely in the scenario we are most worried about, namely
model corruption, the assumption of statistical independence between
early and late rounds of PSI-BLAST is actually a valid one. As
mentioned above, the main justification for this assumption is that it
leads to good retrieval performance as shown in the next section using
real biological data.
As a means to eliminate false positives resulting from model
corruption, we have to accept a hit only if it is significant in the final
round of PSI-BLAST as well as in the second round of PSI-BLAST.
Then, a false hit occurs only if the same subject sequence is a false hit in
the second round and in the final round. Under the assumption of
statistical independence, this means that the total probability for a false
hit with score S2 or higher in the second round and Sf or higher in the
final round is
$$P_{\mathrm{tot}} = P_2 P_f \tag{6}$$
where P2 and Pf are the P-values calculated according to Equation (2)
for the two score thresholds S2 and Sf. Since PSI-BLAST does not
report these P-values we have to use Equation (3) and the first equality
in Equation (5) to determine them. This yields
$$P_{\mathrm{tot}} = \left[1 - \exp\left(-\frac{M}{N} E_2\right)\right] \left[1 - \exp\left(-\frac{M}{N} E_f\right)\right] \tag{7}$$
where E2 and Ef are the E-values reported by PSI-BLAST in the second
and final round, respectively. In order to transform this P-value into a
quantity that resembles the E-values reported by PSI-BLAST, we use as
the final figure of merit
$$E_{\mathrm{tot}} = -\frac{N}{M} \log(1 - P_{\mathrm{tot}}) \tag{8}$$
We refer to this quantity as a figure of merit instead of an E-value since
this quantity is strictly speaking only an E-value if the assumption of
statistical independence of the two rounds in question is fulfilled.
In general, we expect that this number is an underestimate of the true
E-value due to correlations between the rounds. However, as we will
show in the next section, using this number as a figure of merit yields a
superior discrimination between true and false positives.
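The combination of Equations (7) and (8) can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in this study: the function name and argument names are ours, and the caller is assumed to supply the database size N (in amino acids) and a subject length M, whose precise choice is not prescribed here.

```python
import math


def combined_figure_of_merit(e2, ef, n_db, m_subj):
    """Combine the E-values of the second round (e2) and the final round (ef)
    into the figure of merit E_tot of Equation (8).

    n_db   -- database size N in amino acids (assumed to be known to the caller)
    m_subj -- subject sequence length M (assumed to be known to the caller)
    """
    ratio = m_subj / n_db
    # Equation (7): convert each database E-value back to a single-sequence
    # P-value; expm1 avoids loss of precision for very small E-values.
    p2 = -math.expm1(-ratio * e2)
    pf = -math.expm1(-ratio * ef)
    p_tot = p2 * pf  # Equation (6), assuming statistical independence
    if p_tot >= 1.0:
        return math.inf  # both rounds carry no evidence at all
    # Equation (8): transform back to an E-value-like quantity.
    return -math.log1p(-p_tot) / ratio
```

Two properties follow directly from the formulas and can be checked with this sketch: the figure of merit never exceeds either of the two input E-values, and when both E-values are small it is approximately (M/N) E2 Ef.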
2.4 Verification protocol
One major consideration in performance evaluation is the accuracy of
the assignment of true and false positives to the hits returned from a
search. An ideal test set would be one with known (expert-curated) true
and false positive relationships of the query sequences to those in the
test database, making the assignment unambiguous and straightforward. In light of this consideration and given that our proposed method
is an elaboration on PSI-BLAST, it therefore makes sense to use the
same dataset used by the PSI-BLAST development team to evaluate
their search refinement strategies. Consequently, we have verified our
proposed methodology using the ‘Aravind’ dataset (Aravind and
Koonin, 1999) employed by Schaffer et al. (2001) in their study of
different methods to improve PSI-BLAST performance.
This Aravind dataset comprises 103 queries of single
protein domain sequences of the yeast Saccharomyces cerevisiae and a
sequence database containing 6406 yeast proteins. We reasoned that
using the same dataset should give the most direct and unbiased
performance comparison between the two approaches. More importantly, the Aravind dataset contains the true positives for each query
sequence, which have been expertly curated by Schaffer’s coworkers at the
NCBI by conducting PSI-BLAST searches against the yeast database
and analyzing the resultant alignments on a case-by-case basis.
Sequences in the yeast database that were not
annotated as true positives were considered false by default.
For our study, a five-round PSI-BLAST search was conducted for
each of the 103 queries against the complete non-redundant (NR)
protein database with its 1 986 685 sequences (frozen as of 8/23/04). The
choice of five rounds is based on previous empirical analysis by the
PSI-BLAST developers that suggested most hits that will be found with
an E-value ≤ 0.005 (the default threshold that PSI-BLAST uses to
include sequences for PSSM construction) are usually found by round
five or six (Altschul et al., 1997; Schaffer et al., 2001). The PSSMs
constructed for iteration two and iteration five were saved as
‘checkpoints’. The two checkpoint files per query were then used to
search the annotated yeast database to produce a list of yeast hits (up to
E-value of 1000) for iteration two and iteration five, respectively. Each
hit in iteration two and five was assigned a ‘yes’ (for true positive) or
‘no’ (for false positive) according to the true positives for the Aravind
test set described previously. To generate a list of common hits to be
used for our proposed approach, hits that were found in one iteration
but missing in the other were arbitrarily assigned an E-value of 10 000.
This list of hits and their corresponding E-values at iteration two and
iteration five were combined to obtain the figure of merit (Etot) using
the formula Equation (8) in Section 2.3. An awk script that we used to
perform the above procedure is available online at http://bioserv.mps.
ohio-state.edu/SimpleIsBeautiful.
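The merging step described above can be sketched as follows. This Python fragment is only an illustration under stated assumptions and is not the awk script referred to above: the hit identifiers, the dictionary-based input format, and the values of N and M are hypothetical, and hits missing from one iteration receive the placeholder E-value of 10 000 as in the protocol.

```python
import math

MISSING_E = 10_000.0  # E-value assigned to a hit absent from one iteration


def combine(e2, e5, n_db, m_subj):
    """Figure of merit of Equation (8) for one subject sequence."""
    ratio = m_subj / n_db
    # expm1(-x) equals -(1 - exp(-x)); the product of two such terms is P2 * P5.
    p_tot = math.expm1(-ratio * e2) * math.expm1(-ratio * e5)
    if p_tot >= 1.0:
        return math.inf
    return -math.log1p(-p_tot) / ratio


def merge_hits(hits2, hits5, n_db, m_subj):
    """Merge two {subject_id: E-value} dictionaries (iterations two and five)
    and return rows (subject_id, e2, e5, e_tot) sorted by e_tot."""
    rows = []
    for sid in sorted(set(hits2) | set(hits5)):
        e2 = hits2.get(sid, MISSING_E)
        e5 = hits5.get(sid, MISSING_E)
        rows.append((sid, e2, e5, combine(e2, e5, n_db, m_subj)))
    rows.sort(key=lambda row: row[3])
    return rows
```

With the hypothetical inputs hits2 = {"hitA": 1e-20, "hitB": 0.5} and hits5 = {"hitA": 1e-40, "hitB": 1e-3, "hitC": 1e-6}, and assumed values N = 3 000 000 and M = 500, the hit "hitC" is ranked below "hitB" even though its iteration-five E-value is smaller, because it was not found at all in iteration two.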
3 RESULTS
3.1 Evaluation and comparison to PSI-BLAST
To compare the performance of our proposed approach with
PSI-BLAST’s iterations two and five, all the hits generated from
the 103 queries were pooled together and rank ordered by their
E-values, or in our case, figures of merit (Etot). Coverage versus
error analysis was conducted to examine the number of true
positives identified by varying the error levels up to a threshold
of 100. The result of this analysis is shown in Figure 1.
As shown clearly by the respective position occupied by each
curve, our proposed methodology outperforms both iterations
two and five of PSI-BLAST. The behavior of the two curves for
PSI-BLAST’s iterations two and five shows that there is indeed
a tradeoff between sensitivity and specificity between different
levels of PSI-BLAST searches: higher specificity for iteration
two (later occurrence of the first false positive), but higher
sensitivity for iteration five (more true positives found). In
contrast, the curve corresponding to our method lies farthest
right in the plot, encompassing more true positives with
minimal increase in false positives.
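A coverage versus error curve of the kind described above can be computed from the pooled, annotated hits with a short routine such as the following. This is a minimal sketch with names of our choosing; it assumes each hit is given as a pair (score, is_true), where a lower score (E-value or figure of merit) means a better hit, and it resolves ties in input order.

```python
def coverage_vs_errors(scored_hits, max_errors=100):
    """Return the list of (true_positives, errors) counts obtained while
    walking down the hits pooled over all queries, best score first.
    The walk stops once `max_errors` false positives have been seen."""
    curve = []
    true_pos = 0
    errors = 0
    for _, is_true in sorted(scored_hits, key=lambda hit: hit[0]):
        if is_true:
            true_pos += 1
        else:
            errors += 1
        curve.append((true_pos, errors))
        if errors >= max_errors:
            break
    return curve
```

Plotting the second component of each pair against the first gives the number of errors as a function of the number of true positives, the representation used in Figure 1.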
3.2
Comparison to SAM-T2K
Although widely recognized as the most popular sequence
alignment tool, PSI-BLAST is found to be less effective in
remote homolog detection than the more computationally
costly HMM-based methods (Park et al., 1998). Given the
improved performance of our approach over the current PSI-BLAST method, it would be interesting to see how our
proposed improvement to the PSI-BLAST algorithm performs
when compared to state-of-the-art approaches. Similar to PSI-BLAST in its automated iterative search strategy, the HMM-based SAM-T2K (Hughey and Krogh, 1996; Karplus et al.,
1998) has been shown to be superior to PSI-BLAST in distant
homology detection in general (Park et al., 1998), thus we have
chosen SAM-T2K for additional performance comparison.
Fig. 1. Performance comparison between rounds two and five of PSI-BLAST, and our new method for combining different rounds. The
curves show the number of errors as a function of the number of true
positives for the three methods as evaluated on the Aravind ‘gold
standard’ dataset. The dotted and dashed lines are for iterations two
and five of PSI-BLAST, respectively, while the solid line shows the
performance of our ‘combined E-values’ method. It can be seen that our
method achieves the same sensitivity as round five of PSI-BLAST with
(nearly) the same specificity as round two of PSI-BLAST.
Fig. 2. Performance comparison between PSI-BLAST, SAM-T2K and
our proposed method. The curves show the number of errors as a
function of the number of true positives for the three approaches as
evaluated on the Aravind ‘gold standard’ dataset. The long dashed lines
are for iteration five of PSI-BLAST, the short dashed lines are for
iteration five of SAM-T2K and the solid line represents the
performance of our ‘combined E-values’ method. It can be seen that
both our method and SAM-T2K achieve higher specificity and
sensitivity than PSI-BLAST.
We thus applied SAM-T2K to the Aravind test set following
the standard procedure as prescribed in the SAM-T2K suite. In
brief, for each query sequence in the Aravind test set, we used the
script target2k to perform five iterations of the following cycle:
(1) search the NR database for homologs, (2) perform a multiple
alignment of the homologs to construct an HMM-model and
(3) score the sequences in the NR database using this HMM-model. The multiple alignment generated after the final round
was then used to create an HMM-model using the w0.5 script.
This HMM-model was used to score the 6406 yeast sequences in
the Aravind test set using the script hmmscore with the alignment
parameter set to local alignment. All other parameters used to
conduct the SAM-T2K search were left at their default values.
As shown in Figure 2, whereas our proposed method
performs similarly to SAM-T2K, PSI-BLAST falls short of
both methods in specificity and sensitivity, finding fewer
true homologs for the same number of false positives. The
results here reaffirm the general notion that the HMM-based
SAM-T2K is a better remote homolog detection tool than PSI-BLAST. Nevertheless, the more sophisticated and complex
SAM-T2K is computationally more expensive, making it less
popular than PSI-BLAST. The implication of the performance
of our ‘combined E-values’ approach is thus more significant
when put into the context of computational efficiency.
For minimal additional computation cost (since our method
uses outputs already produced by PSI-BLAST), our approach
is able to elevate PSI-BLAST’s performance to match that of
SAM-T2K.
3.3 Receiver operating characteristic analysis
In addition to coverage versus error analysis, another common
technique to quantify the performance of different approaches
is a receiver operating characteristic (ROC) analysis, which
measures the rate of true positives against the rate of false
positives. The quantity ROC is derived by calculating the area
under the curve and adopts a value between 0 (worst) and 1
(best). Although the recommended figure of merit for a
comprehensive sequence database search is ROC50 (Gribskov
and Robinson, 1996), we followed the lead of the PSI-BLAST
developers (Schaffer et al., 2001) in their choice of ROC100 to
compare the performance of PSI-BLAST iterations two and
five, our ‘combined E-values’ method using these two PSI-BLAST iterations, and SAM-T2K iteration five. We thus
calculated the characteristic ROC100 as
$$\mathrm{ROC}_{100} = \frac{1}{100\,T} \sum_{i=1}^{100} t_i \tag{9}$$
where T = 1005 is the total number of true positives in the yeast
database and ti is the number of true positives ranked ahead of
the ith false positive. As indicated in Table 1, the ROC100 for
our ‘combined E-values’ method (0.836) is comparable to that
of SAM-T2K (0.825). When compared to either iteration two or
iteration five of PSI-BLAST, however, our ROC100 is higher,
providing further support that combining results from early
and late PSI-BLAST iterations improves the discrimination of
true and false positives.
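Equation (9) translates into the following minimal Python sketch. The function name and input format are ours; each hit is assumed to be a pair (score, is_true) with lower scores ranking first. The handling of lists containing fewer than n false positives (each missing false positive is counted with all true positives found so far) is an assumption on our part and is not specified by Equation (9).

```python
def roc_n(scored_hits, total_true, n=100):
    """ROC_n of Equation (9): (1 / (n * T)) * sum of t_i for i = 1..n,
    where t_i is the number of true positives ranked ahead of the
    i-th false positive and T = total_true."""
    true_pos = 0
    false_pos = 0
    area = 0
    for _, is_true in sorted(scored_hits, key=lambda hit: hit[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # t_i for the i-th false positive
            if false_pos == n:
                break
    # Assumed convention: pad missing false positives with the final count.
    area += (n - false_pos) * true_pos
    return area / (n * total_true)
```

A perfect ranking, in which every true positive precedes the first false positive, gives a value of 1, while a ranking in which the first n hits are all false positives gives 0.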
Table 1. ROC100 values characterizing retrieval performance for a ‘gold
standard’ yeast database with 103 query sequences and 1005 true
positives
Method                   ROC100
PSI-BLAST iteration 2    0.764
PSI-BLAST iteration 5    0.807
Combined E-values        0.836
SAM-T2K                  0.825
Our ‘combined E-values’ method has a higher ROC100, and thus shows better
performance than PSI-BLAST’s iterations two or five alone.
4 DISCUSSION
We have established a novel strategy to improve the accuracy
of PSI-BLAST searches by validating the resultant hits from
the final iterations against those obtained in earlier rounds.
The verification results are significant given the improved
performance shown by our method over PSI-BLAST at minimal
computational cost. They demonstrate that our strategy
enhances the discriminative capability of PSI-BLAST by
maintaining the specificity conferred by the PSSM of iteration
two, while at the same time, capitalizing on the advantage gained
from iteratively improving the model to identify more true
positives in iteration five. One may argue that further enhancement could potentially be achieved by including the E-values
from the interim rounds of a PSI-BLAST search, but such an
approach would further raise concerns about statistical independence. While for corrupted models at least, the assumption of
statistical independence between iterations two and five is
reasonable, it becomes invalid if E-values from immediately
consecutive iterations are combined. Thus, incorporating
E-values from intermediate rounds would eliminate the beauty
of the mathematical framework used in this study, and therefore
was not pursued for the purpose of this manuscript. Nevertheless, it may be worthwhile to explore this strategy in the future
using machine learning techniques to replace the parameter-free
combination of E-values put forward in this article.
5 CONCLUSION
The problem of model corruption is a major concern to both
the developers and users of PSI-BLAST. The reliability of PSI-BLAST’s assessment of whether two sequences are related
depends heavily on its ability to discriminate true hits from false
positives produced from corrupted queries. Many different
strategies have been and are being explored to overcome this
problem. Our proposed strategy is noteworthy in its simplicity—it utilizes information already output by PSI-BLAST and
thus requires minimal changes to the existing algorithm.
ACKNOWLEDGEMENTS
Funding: This work was partially supported by the National
Science Foundation through grant DBI-0317335 to R.B.
Conflict of Interest: none declared.
REFERENCES
Alam,I. et al. (2004) Comparative homology agreement search: an effective
combination of homology-search methods. Proc. Natl Acad. Sci. USA, 101,
13814–13819.
Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Methods Enzymol.,
266, 460–480.
Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215,
403–410.
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Aravind,L. and Koonin,E.V. (1999) Gleaning non-trivial structural, functional
and evolutionary information about proteins by iterative database searches.
J. Mol. Biol., 287, 1023–1040.
Becker,K.G. et al. (2003) PubMatrix: a tool for multiplex literature mining. BMC
Bioinformatics, 4, 61.
Bowie,J.U. et al. (1991) A method to identify protein sequences that fold into a
known three-dimensional structure. Science, 253, 164–253.
Bowie,J.U. et al. (1996) Three-dimensional profiles for measuring compatibility
of amino acid sequence with three-dimensional structure. Methods Enzymol.,
266, 598–616.
Collins,J.F. et al. (1988) The significance of protein sequence similarities. Comput.
Appl. Biosci., 4, 67–71.
Eddy,S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.
Gotoh,O. (1996) Significant improvement in accuracy of multiple protein
sequence alignments by iterative refinement as assessed by reference to
structural alignments. J. Mol. Biol., 264, 823–838.
Gough,J. et al. (2001) Assignment of homology to genome sequences using a
library of hidden Markov models that represent all proteins of known
structure. J. Mol. Biol., 313, 903–919.
Gribskov,M. et al. (1987) Profile analysis: detection of distantly related proteins.
Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20,
25–33.
Grundy,W.N. (1998) Homology detection via family pairwise search. J. Comput.
Biol., 5, 479–491.
Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence
analysis: extension and analysis of the basic method. Comput. Appl. Biosci.,
12, 95–107.
Kann,M.G. et al. (2005) A structure-based method for protein sequence
alignment. Bioinformatics, 21, 1451–1456.
Kaplan,N. et al. (2003) PANDORA: keyword-based analysis of protein sets by
integration of annotation sources. Nucleic Acids Res., 31, 5617–5626.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical
significance of molecular sequence features by using general scoring schemes.
Proc. Natl Acad. Sci. USA, 87, 2264–2268.
Karlin,S. and Altschul,S.F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA, 90,
5873–5877.
Karlin,S. and Dembo,A. (1992) Limit distributions of maximal segmental score
among Markov-dependent partial sums. Adv. Appl. Prob., 24, 113–140.
Karplus,K. et al. (1998) Hidden Markov models for detecting remote protein
homologies. Bioinformatics, 14, 846–856.
Lee,M.M. et al. (2007) Distant homology detection using a LEngth
and STructure-based sequence Alignment Tool (LESTAT). Proteins, 71,
1409–1419.
Mott,R. (1992) Maximum likelihood estimation of the statistical distribution of
Smith-Waterman local sequence similarity scores. Bull. Math. Biol., 54, 59–75.
Park,J. et al. (1998) Sequence comparisons using multiple sequences detect three
times as many remote homologues as pairwise methods. J. Mol. Biol., 284,
1201–1210.
Schaffer,A.A. et al. (2001) Improving the accuracy of PSI-BLAST protein
database searches with composition-based statistics and other refinements.
Nucleic Acids Res., 29, 2994–3005.
Shatkay,H. et al. (2007) SherLoc: high-accuracy prediction of protein subcellular
localization by integrating text and protein sequence data. Bioinformatics, 23,
1410–1417.
Smith,T.F. and Waterman,M.S. (1981) Comparison of biosequences. Adv. Appl.
Math., 2, 482–489.
Smith,T.F. et al. (1985) The statistical distribution of nucleic acid similarities.
Nucleic Acids Res., 13, 645–656.
Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence
alignment programs. Nucleic Acids Res., 27, 2682–2690.
Vingron,M. and Waterman,M.S. (1994) Sequence alignment and penalty choice.
Review of concepts, case studies and implications. J. Mol. Biol., 235, 1–12.
Waterman,M.S. and Vingron,M. (1994) Rapid and accurate estimates of
statistical significance for sequence data base searches. Proc. Natl Acad. Sci.
USA, 91, 4625–4628.