BIOINFORMATICS ORIGINAL PAPER
Vol. 24 no. 11 2008, pages 1339–1343
doi:10.1093/bioinformatics/btn130

Sequence analysis

Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches

Marianne M. Lee1, Michael K. Chan1,2,* and Ralf Bundschuh1,3,*
1The Ohio State Biophysics Program, 2Departments of Biochemistry and Chemistry, Ohio State University, 484 W 12th Av. and 3Department of Physics, Ohio State University, 191 W Woodruff Av., Columbus OH 43210-1117, USA

Received on January 5, 2008; revised on February 28, 2008; accepted on April 7, 2008
Advance Access publication April 10, 2008
Associate Editor: Thomas Lengauer

ABSTRACT
Motivation: The deluge of biological information from different genomic initiatives and the rapid advancement in biotechnologies have made bioinformatics tools an integral part of modern biology. Among the widely used sequence alignment tools, BLAST and PSI-BLAST are arguably the most popular. PSI-BLAST, which uses an iterative profile position specific score matrix (PSSM)-based search strategy, is more sensitive than BLAST in detecting weak homologies, thus making it suitable for remote homolog detection. Many refinements have been made to improve PSI-BLAST, and its computational efficiency and high specificity have been much touted. Nevertheless, corruption of its profile via the incorporation of false positive sequences remains a major challenge.
Results: We have developed a simple and elegant approach to resolve the problem of model corruption in PSI-BLAST searches. We hypothesized that combining results from the first (least-corrupted) profile with results from later (most sensitive) iterations of PSI-BLAST provides a better discriminator for true and false hits. Accordingly, we have derived a formula that utilizes the E-values from these two PSI-BLAST iterations to obtain a figure of merit for rank-ordering the hits.
Our verification results based on a ‘gold-standard’ test set indicate that this figure of merit does indeed delineate true positives from false positives better than PSI-BLAST E-values. Perhaps what is most notable about this strategy is that it is simple and straightforward to implement.
Contact: [email protected]
*To whom correspondence should be addressed.

1 INTRODUCTION
Sequence alignment is one of the most widely used techniques in computational biology, particularly for functional annotation of non-characterized sequences. The power of this approach is directly dependent on the ability of sequence alignment tools to detect weak homologies. BLAST (Altschul et al., 1990) is arguably the most frequently used sequence alignment tool. PSI-BLAST (Altschul et al., 1997), an extended version of BLAST, is similarly popular. It is more effective and sensitive in detecting distant homologs due to its iterative profile-based search strategy (Gotoh, 1996; Thompson et al., 1999). In its first iteration, PSI-BLAST identifies related sequences that meet a specific inclusion threshold and utilizes these sequences to generate a profile or position specific score matrix (PSSM) to be used for the next iterative search. The PSSM is then used to align and score sequences in the database to search for new statistically significant hits. This process is performed iteratively until the model converges or no new statistically significant sequences are found.

Previous studies have shown that iterative approaches, such as that used in PSI-BLAST, are more effective in finding additional homologs. However, a major potential problem of this approach is model corruption. With each subsequent iteration, there exists an increased probability of including a non-homologous sequence that meets the threshold for PSSM inclusion, and whose presence can, in turn, lead to the incorporation of other false positives.
Such model corruption is particularly treacherous if the non-homologous sequence belongs to a large family. A naive approach to overcome this problem is to set a stringent inclusion threshold, but the trade-off is a loss of sensitivity, especially for the more remote homologs, thus diminishing the strength and utility of PSI-BLAST. Hence, there is general interest in developing novel strategies to distinguish and eliminate early false positive sequences from true hits in homology detection searches.

Many different approaches have been studied and proposed to improve the discrimination of true and false positives. These methods typically incorporate extrinsic information into the search, such as (predicted) structural data (Bowie et al., 1991, 1996; Gough et al., 2001; Kann et al., 2005; Lee et al., 2007) or literature data (Becker et al., 2003; Kaplan et al., 2003; Shatkay et al., 2007), or use completely different techniques such as SAM (Eddy, 1996; Gribskov et al., 1987; Grundy, 1998; Hughey and Krogh, 1996; Karplus et al., 1998), or a combination of such other techniques (Alam et al., 2004). Despite their better performance in remote homology detection, these approaches are computationally more expensive than PSI-BLAST itself, thus hindering their wide acceptance by the user community.

Herein, we describe a new methodology to improve the statistical assessment of hits returned from a PSI-BLAST search that yields better delineation of true and false positives. The appeal of our approach is its simplicity: it can be integrated easily into the PSI-BLAST algorithm without incurring significant computational costs, while at the same time improving the accuracy of its searches.

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

The remainder of this article is organized as follows. In Section 2 we first outline the key idea of our approach.
Then, we review the basics of BLAST significance assessment, and use these basics to heuristically derive a method to combine outputs from different rounds of PSI-BLAST into one aggregate result. In Section 3, we verify that this method indeed improves search performance by applying it to the ‘gold standard’ database used to test PSI-BLAST by its own authors (Schaffer et al., 2001). We summarize our findings in Section 4.

2 METHODS
2.1 Key idea
The goal of our approach is to improve on the high sensitivity of PSI-BLAST in terms of remote homolog detection without sacrificing computational efficiency, which is one of the main reasons for PSI-BLAST’s popularity. Our approach is based on a well-known property of PSI-BLAST searches: in its early iterations PSI-BLAST is not very sensitive, but very specific. Upon performing more and more iterations, sensitivity increases while specificity suffers from what is called ‘model corruption’.

This behavior can be understood as follows: As an iterative tool, PSI-BLAST attempts to build better and better models of the homologs to the original query sequence. The first round of PSI-BLAST, which is essentially BLAST, finds only relatively close homologs. These close homologs are then used to generate a profile for the next iteration that will be able to find more remote homologs. Iterating this process in theory yields progressively better models and thus finds more and more weakly related homologs to the original query. However, in practice, non-homologous false positives are frequently incorporated into the model at some point during this iterative process. Once this happens, the model is ‘corrupted’ and completely loses its specificity for the true homologs of the original query, resulting in the false identification of large numbers of non-homologous sequences as putative true homologs.
For this reason, it is recommended to run at most five to six iterations of PSI-BLAST instead of running PSI-BLAST until no more new sequences are identified (Schaffer et al., 2001). However, even after five or six iterations, it is common to find that some non-homologous sequences are already incorporated. Thus, there is a trade-off between an increased sensitivity and an increased possibility for model corruption as more iterations are performed.

The key idea of our approach is that by combining the results from different PSI-BLAST iterations, the benefits of low corruption from the early rounds and the benefits of high sensitivity from the later rounds can both be realized. More precisely, we expect that a true homolog identified in a later iteration should have already revealed its identity as a homolog in the second round (the one using the first profile generated by PSI-BLAST and therefore the least likely to be corrupted). This distant homolog may not have had the level of homology that would have been statistically significant in the second round, but it should at least exhibit some homology. On the other hand, a non-homolog reported in a later round due to model corruption should show a much lower degree of homology in the second round than a true homolog. We thus hypothesize that ‘benchmarking’ hits from later iterations against hits from iteration two, where the model is least corrupted, should improve the accuracy of a PSI-BLAST search.

The beauty of our strategy is that it requires almost no additional computational effort: if PSI-BLAST is iterated to, say, the fifth round, the results for the second round will have been calculated anyway. All that is required is to store these results and combine them with the results of the last round. The obvious question that has to be resolved, though, is precisely how to combine the results from different rounds of PSI-BLAST.
The remainder of this section will be devoted to providing an explicit formula for combining the E-values reported by PSI-BLAST in the two rounds into a combined figure of merit. The derivation of this formula will be somewhat heuristic in nature. However, an evaluation of this method on a ‘gold standard’ database presented in the next section will show that our method is indeed very successful in combining the benefits from the different rounds of PSI-BLAST in spite of the somewhat heuristic nature of its derivation.

2.2 Review of BLAST statistics
In order to derive a way to combine the E-values reported by PSI-BLAST for the different rounds into a single figure of merit, it is useful to review how these E-values are actually calculated. If a query sequence with L amino acids is aligned against a random subject sequence of M amino acids using the Smith–Waterman (Smith and Waterman, 1981) local alignment algorithm,¹ the local alignment score $\Sigma$ can be considered a random variable. In the absence of gaps, this random variable has been proven (Karlin and Altschul, 1990, 1993; Karlin and Dembo, 1992) to follow an extreme value distribution

$$\Pr\{\Sigma < S\} = \exp\left(-K L M e^{-\lambda S}\right) \qquad (1)$$

if the query and subject sequence are sufficiently long. Here, $\lambda$ and $K$ are two parameters that depend in a known way on the scoring system, the query sequence and the random ensemble used to choose the random subject sequences. Even in the presence of gaps, heuristic arguments and numerical studies (Altschul and Gish, 1996; Collins et al., 1988; Mott, 1992; Schaffer et al., 2001; Smith et al., 1985; Vingron and Waterman, 1994; Waterman and Vingron, 1994) confirm that the local alignment score is still distributed according to the extreme value distribution Equation (1), albeit with modified values of the two parameters $\lambda$ and $K$.
Thus, the P-value of an alignment score, or the probability of obtaining an alignment score of magnitude S or better in one alignment against a random subject sequence, is given by

$$P = 1 - \exp\left(-K L M e^{-\lambda S}\right) \qquad (2)$$

BLAST and PSI-BLAST report E-values, or the expected number of random hits of score S or higher, instead of P-values. These two quantities are different since the P-value refers to the probability of having at least one hit at the prescribed significance level, while the E-value quantifies the expected number of these hits. The general relationship between these two quantities for random variables with exponential tails is the Poisson formula

$$P = 1 - e^{-E} \qquad (3)$$

By comparing with Equation (2) we find

$$E_{\mathrm{seq}} = -\log(1 - P) = K L M e^{-\lambda S} \qquad (4)$$

It is important to note that this is the E-value for one single sequence comparison, which we indicate by the subscript ‘seq’. The convenient property of E-values is that the Bonferroni correction for repeating the alignment for every sequence in a database of N amino acids, and thus roughly N/M sequences, just manifests itself in a multiplication of the expected number of hits for a single sequence by the number of sequences, i.e.

$$E_{\mathrm{db}} = \frac{N}{M} E_{\mathrm{seq}} = K L N e^{-\lambda S} \qquad (5)$$

It is this latter E-value that BLAST and PSI-BLAST report as the statistical significance of their hits.

¹ BLAST approximates the Smith–Waterman algorithm with some heuristics, but the statistics of large scores are not affected by these heuristics.

2.3 Combining two E-values
In order to combine results from different rounds, we make the assumption that the different rounds of PSI-BLAST are statistically independent. At first glance, making this assumption seems heuristic since the models used in all PSI-BLAST iterations describe the same query sequence and thus are related.
It is worth pointing out, however, that precisely in the scenario we are most worried about, namely model corruption, the assumption of statistical independence between early and late rounds of PSI-BLAST is actually a valid one. As mentioned above, the main justification for this assumption is that it leads to good retrieval performance as shown in the next section using real biological data.

As a means to eliminate false positives resulting from model corruption, we have to accept a hit only if it is significant in the final round of PSI-BLAST as well as in the second round of PSI-BLAST. Then, a false hit occurs only if the same subject sequence is a false hit in the second round and in the final round. Under the assumption of statistical independence, this means that the total probability for a false hit with score $S_2$ or higher in the second round and $S_f$ or higher in the final round is

$$P_{\mathrm{tot}} = P_2 P_f \qquad (6)$$

where $P_2$ and $P_f$ are the P-values calculated according to Equation (2) for the two score thresholds $S_2$ and $S_f$. Since PSI-BLAST does not report these P-values we have to use Equation (3) and the first equality in Equation (5) to determine them. This yields

$$P_{\mathrm{tot}} = \left[1 - \exp\left(-\frac{M}{N} E_2\right)\right] \left[1 - \exp\left(-\frac{M}{N} E_f\right)\right] \qquad (7)$$

where $E_2$ and $E_f$ are the E-values reported by PSI-BLAST in the second and final round, respectively. In order to transform this P-value into a quantity that resembles the E-values reported by PSI-BLAST, we use as the final figure of merit

$$E_{\mathrm{tot}} = -\frac{N}{M} \log(1 - P_{\mathrm{tot}}) \qquad (8)$$

We refer to this quantity as a figure of merit instead of an E-value since this quantity is, strictly speaking, only an E-value if the assumption of statistical independence of the two rounds in question is fulfilled. In general, we expect that this number is an underestimate of the true E-value due to correlations between the rounds. However, as we will show in the next section, using this number as a figure of merit yields a superior discrimination between true and false positives.
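
As an illustration of Equations (6) to (8), the following Python sketch computes the figure of merit from two reported E-values. It is not the awk script used in this study: the function names and the parameter `n_seqs`, which stands in for the ratio N/M (roughly the number of sequences in the database), are assumptions made for this example only.

```python
import math


def p_from_e(e_db, n_seqs):
    """Per-sequence P-value from a database E-value.

    Uses P = 1 - exp(-E_seq) with E_seq = E_db * M/N, where n_seqs plays
    the role of N/M (an assumed, user-supplied value).
    """
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-e_db / n_seqs)


def combined_figure_of_merit(e2, ef, n_seqs):
    """Figure of merit E_tot from the round-two and final-round E-values."""
    p_tot = p_from_e(e2, n_seqs) * p_from_e(ef, n_seqs)
    if p_tot >= 1.0:
        # log(1 - P_tot) diverges; both rounds are completely insignificant.
        return math.inf
    # -(N/M) * log(1 - P_tot), with log1p for accuracy at small P_tot.
    return -n_seqs * math.log1p(-p_tot)


if __name__ == "__main__":
    # Hypothetical hit: E-value 1.0 in round two, 1e-3 in round five,
    # scored against a database of about 6406 sequences.
    print(combined_figure_of_merit(1.0, 1e-3, 6406.0))
```

For small E-values the result is close to E2 × Ef divided by N/M, so under this formula a hit with some round-two signal is ranked well ahead of a hit with the same final-round E-value that is essentially absent from round two.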
2.4 Verification protocol
One major consideration in performance evaluation is the accuracy of the assignment of true and false positives to the hits returned from a search. An ideal test set would be one with known (expert-curated) true and false positive relationships of the query sequences to those in the test database, making the assignment unambiguous and straightforward. In light of this consideration, and given that our proposed method is an elaboration on PSI-BLAST, it makes sense to use the same dataset used by the PSI-BLAST development team to evaluate their search refinement strategies. Consequently, we have verified our proposed methodology using the ‘Aravind’ dataset (Aravind and Koonin, 1999) employed by Schaffer et al. (2001) in their study of different methods to improve PSI-BLAST performance. This Aravind dataset comprises 103 queries of single protein domain sequences of the yeast Saccharomyces cerevisiae and a sequence database containing 6406 yeast proteins. We reasoned that using the same dataset should give the most direct and unbiased performance comparison between the two approaches. More importantly, the Aravind dataset contains the true positives for each query sequence, which have been expertly curated by Schaffer’s coworkers at the NCBI by conducting PSI-BLAST searches against the yeast database and analyzing the resultant alignments on a case-by-case basis via expert examination. Sequences in the yeast database that were not annotated as true positives were considered false by default.

For our study, a five-round PSI-BLAST search was conducted for each of the 103 queries against the complete non-redundant (NR) protein database with its 1 986 685 sequences (frozen as of 8/23/04).
The choice of five rounds is based on previous empirical analysis by the PSI-BLAST developers that suggested most hits that will be found with an E-value better than 0.005 (the default threshold that PSI-BLAST uses to include sequences for PSSM construction) are usually found by round five or six (Altschul et al., 1997; Schaffer et al., 2001). The PSSMs constructed for iteration two and iteration five were saved as ‘checkpoints’. The two checkpoint files per query were then used to search the annotated yeast database to produce a list of yeast hits (up to an E-value of 1000) for iteration two and iteration five, respectively. Each hit in iterations two and five was assigned a ‘yes’ (for true positive) or ‘no’ (for false positive) according to the true positives for the Aravind test set described previously. To generate a list of common hits to be used for our proposed approach, hits that were found in one iteration but missing in the other were arbitrarily assigned an E-value of 10 000 for the iteration in which they were missing. This list of hits and their corresponding E-values at iteration two and iteration five were combined to obtain the figure of merit (Etot) using Equation (8) in Section 2.3. An awk script that we used to perform the above procedure is available online at http://bioserv.mps.ohio-state.edu/SimpleIsBeautiful.

3 RESULTS
3.1 Evaluation and comparison to PSI-BLAST
To compare the performance of our proposed approach with PSI-BLAST’s iterations two and five, all the hits generated from the 103 queries were pooled together and rank ordered by their E-values, or in our case, figures of merit (Etot). Coverage versus error analysis was conducted to examine the number of true positives identified at varying error levels, up to a threshold of 100 errors. The result of this analysis is shown in Figure 1. As shown clearly by the respective position occupied by each curve, our proposed methodology outperforms both iterations two and five of PSI-BLAST.
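
The bookkeeping described in Section 2.4 and the pooled ranking used for the coverage versus error analysis might be sketched in Python as follows. This is a minimal illustration and not the awk script referred to above: the function names and the data layout (dictionaries mapping a subject identifier to an E-value, and pairs of score and true/false label) are assumptions made for this example.

```python
# E-value that the text assigns to a hit missing from one of the iterations.
MISSING_E = 10_000.0


def merge_rounds(hits2, hits5):
    """Pair up E-values from iteration two and iteration five.

    hits2 and hits5 map a subject identifier to its E-value in that round.
    A hit absent from one round receives MISSING_E for that round.
    """
    subjects = set(hits2) | set(hits5)
    return {
        s: (hits2.get(s, MISSING_E), hits5.get(s, MISSING_E))
        for s in subjects
    }


def coverage_vs_error(scored_hits, max_errors=100):
    """True positives ranked ahead of each false positive.

    scored_hits is an iterable of (score, is_true) pairs pooled over all
    queries; lower scores rank first. Element i of the returned list is the
    number of true positives seen before the (i + 1)-th false positive.
    """
    curve = []
    true_positives = 0
    for _score, is_true in sorted(scored_hits, key=lambda hit: hit[0]):
        if is_true:
            true_positives += 1
        else:
            curve.append(true_positives)
            if len(curve) == max_errors:
                break
    return curve


if __name__ == "__main__":
    # Hypothetical E-values for two subjects; "b" is missing from round two.
    print(merge_rounds({"a": 1e-3}, {"a": 1e-6, "b": 0.5}))
```

Each merged pair would then be turned into a single score (an E-value for one iteration, or the figure of merit for the combined method) before the pooled list is passed to `coverage_vs_error`.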
The behavior of the two curves for PSI-BLAST’s iterations two and five shows that there is indeed a tradeoff between sensitivity and specificity between different levels of PSI-BLAST searches: higher specificity for iteration two (later occurrence of the first false positive), but higher sensitivity for iteration five (more true positives found). In contrast, the curve corresponding to our method lies farthest right in the plot, encompassing more true positives with minimal increase in false positives.

Fig. 1. Performance comparison between rounds two and five of PSI-BLAST, and our new method for combining different rounds. The curves show the number of errors as a function of the number of true positives for the three methods as evaluated on the Aravind ‘gold standard’ dataset. The dotted and dashed lines are for iterations two and five of PSI-BLAST, respectively, while the solid line shows the performance of our ‘combined E-values’ method. It can be seen that our method achieves the same sensitivity as round five of PSI-BLAST with (nearly) the same specificity as round two of PSI-BLAST.

3.2 Comparison to SAM-T2K
Although widely recognized as the most popular sequence alignment tool, PSI-BLAST is found to be less effective in remote homolog detection than the more computationally costly HMM-based methods (Park et al., 1998). Given the improved performance of our approach over the current PSI-BLAST method, it would be interesting to see how our proposed improvement to the PSI-BLAST algorithm performs when compared to state-of-the-art approaches. Similar to PSI-BLAST in its automated iterative search strategy, the HMM-based SAM-T2K (Hughey and Krogh, 1996; Karplus et al., 1998) has been shown to be superior to PSI-BLAST in distant homology detection in general (Park et al., 1998); thus we have chosen SAM-T2K for additional performance comparison.

We applied SAM-T2K to the Aravind test set following the standard procedure as prescribed in the SAM-T2K suite. In brief, for each query sequence in the Aravind test set, we used the script target2k to perform five iterations of the following cycle: (1) search the NR database for homologs, (2) perform a multiple alignment of the homologs to construct an HMM-model and (3) score the sequences in the NR database using this HMM-model. The multiple alignment generated after the final round was then used to create an HMM-model using the w0.5 script. This HMM-model was used to score the 6406 yeast sequences in the Aravind test set using the script hmmscore with the alignment parameter set to local alignment. All other parameters used to conduct the SAM-T2K search were left at their default values.

Fig. 2. Performance comparison between PSI-BLAST, SAM-T2K and our proposed method. The curves show the number of errors as a function of the number of true positives for the three approaches as evaluated on the Aravind ‘gold standard’ dataset. The long dashed lines are for iteration five of PSI-BLAST, the short dashed lines are for iteration five of SAM-T2K and the solid line represents the performance of our ‘combined E-values’ method. It can be seen that both our method and SAM-T2K achieve higher specificity and sensitivity than PSI-BLAST.

As shown in Figure 2, whereas our proposed method performs similarly to SAM-T2K, PSI-BLAST falls short of either method on both specificity and sensitivity, finding fewer true homologs for the same number of false positives. The results here reaffirm the general notion that the HMM-based SAM-T2K is a better remote homolog detection tool than PSI-BLAST. Nevertheless, the more sophisticated and complex SAM-T2K is computationally more expensive, making it less popular than PSI-BLAST. The implication of the performance of our ‘combined E-values’ approach is thus more significant when put into the context of computational efficiency.
For minimal additional computation cost (since our method uses outputs already produced by PSI-BLAST), our approach is able to elevate PSI-BLAST’s performance to match that of SAM-T2K.

3.3 Receiver operating characteristic analysis
In addition to coverage versus error analysis, another common technique to quantify the performance of different approaches is a receiver operating characteristic (ROC) analysis, which measures the rate of true positives against the rate of false positives. The quantity ROC is derived by calculating the area under the curve and adopts a value between 0 (worst) and 1 (best). Although the recommended figure of merit for a comprehensive sequence database search is ROC50 (Gribskov and Robinson, 1996), we followed the lead of the PSI-BLAST developers (Schaffer et al., 2001) in their choice of ROC100 to compare the performance of PSI-BLAST iterations two and five, our ‘combined E-values’ method using these two PSI-BLAST iterations, and SAM-T2K iteration five. We thus calculated the characteristic ROC100 as

$$\mathrm{ROC}_{100} = \frac{1}{100\,T} \sum_{i=1}^{100} t_i \qquad (9)$$

where $T = 1005$ is the total number of true positives in the yeast database and $t_i$ is the number of true positives ranked ahead of the ith false positive. As indicated in Table 1, the ROC100 for our ‘combined E-values’ method (0.836) is comparable to that of SAM-T2K (0.825). When compared to either iteration two or iteration five of PSI-BLAST, however, our ROC100 is higher, providing further support that combining results from early and late PSI-BLAST iterations improves the discrimination of true and false positives.

Table 1. ROC100 values characterizing retrieval performance for a ‘gold standard’ yeast database with 103 query sequences and 1005 true positives

Method                   ROC100
PSI-BLAST iteration 2    0.764
PSI-BLAST iteration 5    0.807
Combined E-values        0.836
SAM-T2K                  0.825

Our ‘combined E-values’ method has a higher ROC100, and thus shows better performance than PSI-BLAST’s iterations two or five alone.

4 DISCUSSION
We have established a novel strategy to improve the accuracy of PSI-BLAST searches by validating the resultant hits from the final iterations against those obtained in earlier rounds. The verification results are significant given the improved performance shown by our method over PSI-BLAST at minimal computational cost. They demonstrate that our strategy enhances the discriminative capability of PSI-BLAST by maintaining the specificity conferred by the PSSM of iteration two, while at the same time capitalizing on the advantage gained from iteratively improving the model to identify more true positives in iteration five.

One may argue that further enhancement could potentially be achieved by including the E-values from the interim rounds of a PSI-BLAST search, but such an approach would further raise concerns about statistical independence. While the assumption of statistical independence between iterations two and five is reasonable, at least for corrupted models, it becomes invalid if E-values from immediately consecutive iterations are combined. Thus, incorporating E-values from intermediate rounds would eliminate the beauty of the mathematical framework used in this study, and therefore was not pursued for the purpose of this manuscript. Nevertheless, it may be worthwhile to explore this strategy in the future using machine learning techniques to replace the parameter-free combination of E-values put forward in this article.

5 CONCLUSION
The problem of model corruption is a major concern to both the developers and users of PSI-BLAST. The reliability of PSI-BLAST’s assessment of whether two sequences are related depends heavily on its ability to discriminate true hits from false positives produced from corrupted queries. Many different strategies have been and are being explored to overcome this problem.
Our proposed strategy is noteworthy in its simplicity: it utilizes information already output by PSI-BLAST and thus requires minimal changes to the existing algorithm.

ACKNOWLEDGEMENTS
Funding: This work was partially supported by the National Science Foundation through grant DBI-0317335 to R.B.
Conflict of Interest: none declared.

REFERENCES
Alam,I. et al. (2004) Comparative homology agreement search: an effective combination of homology-search methods. Proc. Natl Acad. Sci. USA, 101, 13814–13819.
Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Methods Enzymol., 266, 460–480.
Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Aravind,L. and Koonin,E.V. (1999) Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol., 287, 1023–1040.
Becker,K.G. et al. (2003) PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics, 4, 61.
Bowie,J.U. et al. (1991) A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–253.
Bowie,J.U. et al. (1996) Three-dimensional profiles for measuring compatibility of amino acid sequence with three-dimensional structure. Methods Enzymol., 266, 598–616.
Collins,J.F. et al. (1988) The significance of protein sequence similarities. Comput. Appl. Biosci., 4, 67–71.
Eddy,S.R. (1996) Hidden Markov models. Curr. Opin. Struct. Biol., 6, 361–365.
Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
Gough,J. et al. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol., 313, 903–919.
Gribskov,M. et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33.
Grundy,W.N. (1998) Homology detection via family pairwise search. J. Comput. Biol., 5, 479–491.
Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci., 12, 95–107.
Kann,M.G. et al. (2005) A structure-based method for protein sequence alignment. Bioinformatics, 21, 1451–1456.
Kaplan,N. et al. (2003) PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res., 31, 5617–5626.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 2264–2268.
Karlin,S. and Altschul,S.F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA, 90, 5873–5877.
Karlin,S. and Dembo,A. (1992) Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Prob., 24, 113–140.
Karplus,K. et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846–856.
Lee,M.M. et al. (2007) Distant homology detection using a LEngth and STructure-based sequence Alignment Tool (LESTAT). Proteins, 71, 1409–1419.
Mott,R. (1992) Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol., 54, 59–75.
Park,J. et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
Schaffer,A.A. et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.
Shatkay,H. et al. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics, 23, 1410–1417.
Smith,T.F. and Waterman,M.S. (1981) Comparison of biosequences. Adv. Appl. Math., 2, 482–489.
Smith,T.F. et al. (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Res., 13, 645–656.
Thompson,J.D. et al. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.
Vingron,M. and Waterman,M.S. (1994) Sequence alignment and penalty choice. Review of concepts, case studies and implications. J. Mol. Biol., 235, 1–12.
Waterman,M.S. and Vingron,M. (1994) Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl Acad. Sci. USA, 91, 4625–4628.