Scores for sequence searches and alignments
Steven Henikoff

Every sequence comparison method requires a set of scores. For aligning protein sequences, substitution scores are based on models of amino acid conservation and properties, and matrices of these scores have substantially improved in recent years. Position-specific scoring matrices provide representations of sequence families that are capable of detecting subtle similarities. Comprehensive evaluations can effectively guide the choice of scores for sequence alignment and searching applications, including those that aid in the prediction of protein structures.

Address
Howard Hughes Medical Institute, Basic Sciences Division, Fred Hutchinson Cancer Research Center, 1124 Columbia Street, Seattle, WA 98104, USA; e-mail: [email protected]

Current Opinion in Structural Biology 1996, 6:353-360
© Current Biology Ltd ISSN 0959-440X

Abbreviations
BLOSUM  blocks substitution matrix
HMM     hidden Markov model
1D      one-dimensional
3D      three-dimensional
PAM     per cent accepted mutation
PSSM    position-specific scoring matrix

Introduction
Automated sequence comparisons based on sequence alignment are among the most familiar procedures in biological research. For example, the BLAST Internet servers currently search more than 7000 sequence queries on an average day. In addition to database searches, pairwise and multiple alignment of sequences are often the starting points for in vitro mutagenesis, homology modeling and other procedures in molecular biology and biochemistry. In nearly all fields of modern biology, displays of sequence alignments, trees and dotplots have become standard features of journal articles.

All sequence-comparison methods require an alignment algorithm and a set of scores. Substitution scores quantify the cost of exchanging one residue for another, and gap penalties quantify the cost of exchanging a residue or a string of residues for no residue at all.
A pairwise alignment score is the sum of substitution scores and gap penalties over all aligned residues, so that the best alignment is considered to be the one that obtains the highest score.

In the case of nucleic acid sequence comparison, only two substitution scores are used, one for a match and another for a mismatch. Scores can be represented in the form of a 4×4 'unitary' matrix consisting of ones (or another positive score) on the diagonal for scoring A-A, C-C, G-G and T-T matches and zeros (or a negative score) for scoring mismatches. Because a C-G mismatch is scored the same as a G-C mismatch, the matrix is symmetrical, and only half of the off-diagonal scores are needed to provide a complete scoring scheme for residue substitutions.

Although a unitary matrix suffices for typical nucleic acid alignment tasks, alignments of protein sequences benefit from scores that take into account residue-specific information (Fig. 1a). A 20×20 unitary matrix is outperformed in alignment tasks by matrices that customize scores for each particular amino acid pair, 210 in all (20 possible match scores and 190 possible mismatch scores). A frequent substitution found in correctly aligned proteins, such as Asp→Glu, should receive a higher score than an infrequent substitution, such as Asp→Leu. As alignment uncertainty increases, the choice of substitution scores (and gap penalties) becomes increasingly important. The PAM (per cent accepted mutation) 250 mutation data matrix [1] was designed to be effective in the range of typical alignment uncertainty, and dominated protein sequence alignment applications for 15 years. Recently, however, the problem of substitution-matrix choice has received renewed attention, and, as a result, the PAM 250 matrix has been replaced by modern matrices for most sequence-alignment tasks.
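The scoring scheme described above, a unitary matrix plus a sum over aligned columns, can be illustrated with a minimal sketch. The function names, the example sequences and the flat per-position gap penalty are assumptions made for illustration only; they are not taken from any particular alignment program.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def unitary_matrix(alphabet, match=1, mismatch=0):
    """Build a symmetric matrix: `match` on the diagonal, `mismatch` elsewhere."""
    return {(a, b): match if a == b else mismatch
            for a, b in product(alphabet, repeat=2)}


def alignment_score(row_a, row_b, matrix, gap_penalty=-2):
    """Sum substitution scores and gap penalties over the aligned columns.

    '-' marks a gap. The flat per-position gap penalty is an assumption of
    this sketch, not a recommendation from the text.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += matrix[(a, b)]
    return total


def independent_scores(alphabet_size):
    """Count the scores a symmetric matrix needs: matches plus half the mismatches."""
    return alphabet_size + alphabet_size * (alphabet_size - 1) // 2


dna = unitary_matrix(NUCLEOTIDES, match=1, mismatch=-1)
print(alignment_score("ACGT-A", "ACCTGA", dna))  # 4 matches, 1 mismatch, 1 gap
print(independent_scores(20))                    # 20 match + 190 mismatch scores
```

With these hypothetical settings the example alignment scores 1, and the count for a 20-letter alphabet reproduces the 210 amino acid pair scores mentioned in the text.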
During the past year, important progress has been made in determining which of these matrices work well for alignment applications that require gap penalties.

Multiply aligned sequences provide implicit protein structural information. Such information is rapidly increasing with the expansion of sequence databanks and is now available for most newly determined sequences [2]. By identifying constraints on residues and regions from multiple alignments, applications such as database searching can be improved. The simplest approach is to construct a pattern consisting of invariant and conserved residues, allowing for ambiguities where necessary, and to search for matches to the pattern in sequence databases [3,4]. Unfortunately, these patterns discard potentially useful information and rarely perform as well as modern pairwise searching methods. To use multiple-alignment information effectively, more detailed scores are employed: each column of the alignment is used to calculate the values in a column of a position-specific scoring matrix (PSSM; see Fig. 1b) [5]. To search a database, the PSSM is slid along each sequence entry, and, at each alignment, the score for every amino acid is obtained from the aligned PSSM column. During the past two years, improvements in PSSM construction have led to more effective database-searching tools. PSSMs have also been applied to structural prediction tasks.

Figure 1
[Figure: (a) Pairwise alignment of METR and RBCR, with substitution matrix. (b) Multiple alignment of LysR family members, with PSSM column.]
The use of scoring matrices in pairwise and multiple sequence alignment applications. Examples are from the LysR family of bacterial transcriptional regulators [61,62]. (a) The alignment between two amino acid sequences is given, where identities are indicated by vertical lines. The substitution of a given amino acid (shown in single letter code) with any other amino acid is given a score. The matrix provides the scores for all possible substitution combinations. Frequent substitutions found in correctly aligned proteins receive a high positive score. Infrequent substitutions receive a negative score. METR, Salmonella typhimurium MetR; RBCR, Chromatium vinosum RbcR; D, aspartic acid; R, arginine. (b) In the multiple alignment, residues that are identical in >50% of the sequences are underlined, and residues with an average score in pairwise comparisons exceeding a threshold value in >50% of the sequences are indicated in bold [62]. A position-specific scoring matrix (PSSM) derives scores from the multiple alignment, column by column. The PSSM column shows percentage scores derived from the boxed column of the alignment.

Amino acid substitution matrices
Until recently, the mutation data matrix series dominated sequence-analysis applications. It was based upon two important concepts introduced by Dayhoff and co-workers [1,6]. The first was that alignments could be effectively scored using a 'log-odds' strategy.
This involves collecting alignments that are presumed to be correct, estimating the frequencies of substituting one residue for another, and making a ratio of observed to expected substitution probabilities, where expected probabilities are based on a model for chance alignments. To score an alignment, the odds ratios for each residue pair in the alignment should be multiplied, or equivalently, their logarithms should be added. A positive log-odds score indicates that the exchange is more likely to occur in a correct alignment than in a chance alignment, and vice versa for a negative score. Altschul [7] has pointed out that all substitution matrices are at least implicitly log-odds matrices, because it is possible to back-calculate a theoretical set of substitution probabilities from any set of substitution scores, using some reasonable expected probabilities. So, a set of scores that is based on some notion of shared amino acid properties [8-12] can be mathematically converted to 'target' substitution probabilities. It seems unlikely that such hypothetical probabilities will be as good a model for real alignments as those based on real alignments themselves, and explicit log-odds matrices have continued to dominate.

A second important idea was to base substitution frequencies on estimated mutation rates [6]. Each mutational event is assumed to be independent of previous events. For this model to be approximately valid, mutation rates should be estimated from alignments of closely related sequences; otherwise, intermediate events could be missed. A consequence of using closely related sequences to estimate mutation rates is that these rates must be extrapolated to model greater evolutionary distances.
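The log-odds strategy described above can be made concrete with a small sketch. The three-letter alphabet, the background frequencies and the pair frequencies below are invented toy numbers (chosen only so that the marginals are consistent), and treating the expected frequency of an ordered pair as the product of the two background frequencies is a modeling assumption of this example.

```python
import math

# Hypothetical toy data: background frequencies and ordered-pair frequencies
# for a three-letter alphabet. None of these numbers come from a real matrix.
BACKGROUND = {"D": 0.3, "E": 0.3, "L": 0.4}
PAIR_FREQ = {
    ("D", "D"): 0.16, ("E", "E"): 0.16, ("L", "L"): 0.36,
    ("D", "E"): 0.12, ("E", "D"): 0.12,
    ("D", "L"): 0.02, ("L", "D"): 0.02,
    ("E", "L"): 0.02, ("L", "E"): 0.02,
}


def odds_ratio(a, b):
    """Observed pair frequency divided by the frequency expected by chance."""
    return PAIR_FREQ[(a, b)] / (BACKGROUND[a] * BACKGROUND[b])


def log_odds_bits(a, b):
    """Log-odds score in bits (base-2 logarithm of the odds ratio)."""
    return math.log2(odds_ratio(a, b))


def alignment_log_odds(row_a, row_b):
    """Adding log-odds scores is equivalent to multiplying odds ratios."""
    return sum(log_odds_bits(a, b) for a, b in zip(row_a, row_b))


print(log_odds_bits("D", "E"))  # positive: more frequent than chance
print(log_odds_bits("D", "L"))  # negative: less frequent than chance
print(alignment_log_odds("DEL", "EDL"))
```

In this toy model the frequent D-E exchange receives a positive score and the rare D-L exchange a negative one, mirroring the Asp-Glu versus Asp-Leu contrast drawn earlier.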
For example, to calculate the PAM 250 substitution matrix, the mutation probability matrix that estimates an overall rate of 1 per cent accepted mutation (1 PAM) is squared for 2 PAM, cubed for 3 PAM, and so on until 250 PAM is reached, whereupon the probabilities are used to calculate a log-odds matrix. The extrapolation can potentially magnify inaccuracies in the estimations of rates, and comprehensive empirical tests [13] suggest that this is indeed the case, both for Dayhoff's original PAM series and for an updated series based on much more alignment data [14]. A likely source of inaccuracy in scoring distant relationships dominated by selection is that closely related sequences are dominated by amino acid exchanges that require only a single nucleotide change [15,16]. One solution to this problem is to base the extrapolation on mutation rates estimated from distant relationships, as in the matrix described by Gonnet et al. [17].

An alternative approach to constructing log-odds substitution matrices is to obtain substitution probabilities directly from alignments of distantly related sequences without extrapolation. This has been accomplished in two ways. One method [18] utilized ungapped multiple-sequence alignments representing groups of related proteins present in the Blocks Database [19]. Substitution probabilities were calculated from counts of amino acid pairs within each column of every block in the database. To reduce the contribution of closely related sequences to pair counts, sequences were clumped within blocks based on a percentage identity, and their contributions were averaged when calculating pair frequencies. For example, BLOSUM 62 (blocks substitution matrix at 62%) is the log-odds matrix derived from pair counts between sequence segments that are less than 62% identical. This procedure resulted in a series of log-odds matrices with average mutual information per residue pair (relative entropy) [7] ranging from 0.2 to 1.5 bits (1 bit of information is the answer to a yes/no question where the answer yes is as likely as the answer no). A second method [20,21] utilized alignments based on selected superpositions of homologous structures. Structure-based log-odds matrices can be derived directly from pair counts in the same ways as for the BLOSUM series, but without the clumping procedure [13,22].

Empirical evaluations of substitution matrix performance
The evaluation of substitution matrices can give different results depending upon whether the application that uses them seeks a local or a global alignment and whether or not gap penalties are used. Local pairwise alignment algorithms seek high-scoring sub-sequence alignments, whereas global algorithms seek alignments over the full lengths of both sequences. Unlike global alignments, useful local alignments can be obtained without gap penalties. The most effective local-alignment methods look for high-scoring alignments of any length, in which case positive values in the substitution matrix extend alignment length and negative values limit it. Theory predicts that for database searching, about 30 bits are necessary to elevate a typical ungapped alignment of a conserved region above background (Fig. 2), which corresponds to an alignment length of ~30 amino acids for a matrix with relative entropy of ~1 bit per residue pair [7]. In practice, BLOSUM 62 (with a relative entropy of 0.7 bits) performed best overall in comprehensive BLAST and FASTA tests involving challenging queries from 257 different protein families [13,18]. Excellent performance was also obtained using a structure-based matrix (with a relative entropy of 0.92 bits), whereas PAM matrices performed relatively poorly. Updated matrices based on the PAM model [14,17] outperformed corresponding matrices from Dayhoff across the board, confirming that Dayhoff's 1978 data set was too sparse to obtain reliable scores for all residue pairs. The best performance of PAM matrices with BLAST was for matrices in the 0.7-1.0 bit range, consistent with theory [7]. With only 0.38 bits per residue pair on average, the PAM 250 matrix was a poor performer in these tests.

Local alignment algorithms that do not employ gap penalties, such as BLAST, might be especially sensitive to substitution-matrix choice because the matrix determines alignment length (Fig. 2). For gapped local alignments, the gap penalties chosen for best performance may balance scores [23••], and so reduce performance differences between substitution matrices. Global alignments do not limit alignment length and typically do not use negative substitution scores, and may, therefore, be insensitive to relative entropy differences among matrices in a series. Indeed, evaluations of 13 non-negative matrices in global alignment applications on three protein groups showed good performance for several different types of modern log-odds matrices [22].

Figure 2
(a)
MetR: 134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLIYPVQRSRLDVWRHFLQPAG
RbcR: 137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVMREEGSGTRQAMERFFSERG
(b)
MetR: 134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
RbcR: 137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM
(c)
MetR: 169 PDHPLASKTQITPEDLASETLLI
RbcR: 172 PDHPLAGERAISLARLAEETFVM
[Identity markers between the paired sequences are not reproduced.]
© 1996 Current Opinion in Structural Biology
The relationship between the local alignment length and the relative entropy (average mutual information per residue pair) of a substitution matrix [7,46]. An aligned segment pair is detected in BLAST searches [63] of the SWISS-PROT database using members of the BLOSUM series. Salmonella typhimurium MetR (upper sequence in each pair), a member of the LysR family of bacterial transcriptional regulators [61], detects a single region of Chromatium vinosum RbcR (lower sequence in each pair) [62]. (a) Using BLOSUM 50 (0.5 bits per residue pair) the alignment extends for 77 amino acids, and an alignment value of 29 bits is achieved. (b) Using BLOSUM 62 (0.7 bits per residue pair) the alignment extends for 58 amino acids and an alignment value of 30.5 bits is achieved. (c) Using BLOSUM 100 (1.5 bits per residue pair) the alignment extends for only 23 amino acids and an alignment value of 27.5 bits is achieved. Note that BLOSUM 62 provides the most discrimination (30.5 bits), and that BLOSUM 100 fails to elevate the alignment value (27.5 bits) above the best chance alignments (30 bits for the first false positive). Multiple alignment using many other diverse members of this family [62] suggests that these alignments are correct, but 19 more aligned amino acids in the BLOSUM 50 alignment or 35 fewer in the BLOSUM 100 alignment relative to the BLOSUM 62 alignment reduces searching performance. The offset in alignment is shown at the left of each sequence, and identities are indicated by vertical lines.

The relationship between substitution scores and gap penalties is clarified in two recent studies. Vogt et al. [24••] evaluated non-negative matrices for alignment accuracy on structurally aligned pairs from 37 different protein families, primarily using a global alignment algorithm. Pearson [23••] evaluated log-odds matrices in database-searching tests for 67 families, primarily using the Smith-Waterman local alignment algorithm. In both studies, a gap penalty consisted of a negative score for opening a gap and a negative extension score proportional to gap length. The particular choice of gap penalties was found to be crucial for best matrix performance (Fig. 3), and so both studies tested a wide range for each matrix.
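The gap penalty form used in both studies, an opening score plus an extension score proportional to gap length, can be sketched as follows. The default values, the toy match/mismatch scoring function and the convention that a gap of n residues costs open + n × extend are assumptions of this illustration; programs differ on whether the first gapped residue is also charged the extension score.

```python
def affine_gap_penalty(length, gap_open=-12, gap_extend=-2):
    """Penalty for a single gap: opening score plus extension per gapped residue."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def toy_substitution(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def score_gapped_alignment(row_a, row_b, substitution=toy_substitution,
                           gap_open=-12, gap_extend=-2):
    """Score an alignment whose rows use '-' for gaps, with affine gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_side = None  # which row the current run of gap characters is in
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            if side != gap_side:
                total += gap_open  # a new gap starts here
            total += gap_extend
            gap_side = side
        else:
            total += substitution(a, b)
            gap_side = None
    return total


# One gap of length 2 versus two gaps of length 1, same number of matches.
print(score_gapped_alignment("ACDE--FG", "ACDEKLFG"))
print(score_gapped_alignment("AC-DE-FG", "ACKDELFG"))
```

Under these assumed settings the single two-residue gap scores 14 and the two one-residue gaps score 2, which illustrates why the balance between opening and extension scores matters so much when alignments are compared.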
In global alignment tests [24••], the Gonnet matrix [17] performed marginally the best out of the 80 tested with a gap penalty of 6 for opening and 0.8 for extension. However, simply dropping the extension penalty to 0.6 dropped the performance of the Gonnet matrix below that of the next five best performers. The best matrices performed optimally with gap penalties that ranged widely, from 6-17.5 for opening a gap and from 0.12-0.8 for extension. Except for the fact that all six of the best performers are modern log-odds matrices, there was no common theme: two were based on the PAM model using more distant alignments [16,17], two were from the BLOSUM series [18] and two were structure based [13,22]. Very dissimilar matrices performed comparably when coupled with optimized gap penalties (Fig. 3). Considering this variety, it is doubtful that current substitution matrices can be significantly improved to obtain more accurate pairwise global alignments. What is clearly needed is either a better gap penalty formula or a means of reducing the sensitivity of pairwise alignment to gap penalty choice [25•].

Pearson's evaluation of local alignment searching algorithms [23••] confirmed previous BLAST and FASTA tests [13], but revealed unexpected complexities when gap penalties were used. Several modern matrices performed well, even extrapolated matrices that did not perform well in BLAST tests [13,18]. This might be because the best matrices and gap penalties provide alignments that are somewhat global in character [26,27]. However, simply adding a constant positive value to matrices to force a more global alignment [27] decreased performance. Pearson also discovered that raw alignment scores are not ideal for ordering Smith-Waterman search results, and that substantially improved performance is obtained by the normalization of scores based on the log of the length of the query sequence [23••]. Optimal gap penalties changed when normalization was used.
Figure 3
[Figure: BLOSUM 50 matrix (lower triangle) and difference matrix (upper triangle); rows and columns are ordered C S T P A G N D E Q H R K M I L V F Y W.]
Dissimilar matrices can provide comparable performance with optimized gap penalties. The BLOSUM 50 [18] (lower triangle) and difference matrix (upper triangle) obtained by subtracting the matrix of Gonnet et al. [17] position by position are shown. Amino acids are given in single letter code. Although BLOSUM 50 and other BLOSUM matrices substantially outperformed the Gonnet matrix in BLAST (local ungapped) alignment tests [13,18], both matrices were top performers in Smith-Waterman (local) and global alignment tests using optimized gap penalties. For Smith-Waterman, the best BLOSUM 50 performance was obtained with a gap penalty of -12 for opening a gap and -2 for extension (-12 and -1 using log-length normalization), and the best Gonnet performance was obtained with gap penalties of -14 and -1 (-12 and -2 using log-length normalization) [23••]. For global alignments, the best BLOSUM 50 performance was obtained with an addition of 3.4 to each entry and gap penalties of -9.5 and -0.6, and the best Gonnet performance was obtained with the addition of 2.5 to each entry and gap penalties of -6 and -0.8 [24••]. Performance differences with optimized gap penalties were not statistically significant.

Taken together, evaluation studies indicate that strictly local alignment applications without gap penalties are very sensitive to matrix choice, whereas the addition of gap penalties reduces performance differences attributable to the matrix. Even so, modern log-odds matrices, including BLOSUM 45-55 [18], structure-based matrices [13,22] and modern PAM matrices [14,17] strongly and consistently outperformed Dayhoff's PAM 250. The optimization of gap penalties for each matrix and each application is needed for best performance. A good overall performer for gapped alignment and searching applications is BLOSUM 50 (Fig. 3).

Position-specific scoring matrices
The term 'position-specific scoring matrix' (PSSM, pronounced 'possum') was coined by Gribskov et al. [5] to describe their 'profile' method. A PSSM consists of columns of scores for each amino acid derived from corresponding columns of a multiple sequence alignment (Fig. 1b). A column may also include gap scores. The use of other terms in this particular context, such as 'profile' and 'hidden Markov model' (HMM), often leads to confusion. A profile is a PSSM constructed using the average score method [5] or is a string of structural environments in place of amino acids [28].
An HMM is a PSSM constructed using an iterative alignment procedure that has the attractive feature of determining position-specific gap penalties in the course of finding the alignment ([29••,30,31]; Eddy, this issue, pp 361-365). The term 'position-dependent weight matrix' [32•] is synonymous with PSSM.

Recent improvements in substitution matrices are also applicable to those PSSMs in which the score for a column is taken from the average of scores obtained from a substitution matrix [33,34•]. A drawback of this method is that when many sequences are represented in an alignment, scores from a substitution matrix reduce specificity. For example, if in a multiple alignment of 50 diverse sequences, alanine is in the same position in all 50 sequences, then substitutions of other amino acids are all but ruled out in correct alignments; however, the average score is the same as for a single sequence with alanine, and so that PSSM position will be very tolerant of non-alanines. Alternatively, observed counts of amino acids in a multiple alignment column can be normalized and used directly; however, zero entries prohibit taking logarithms, and there is no log-odds basis for normalized scores that are added across columns.

This situation has led to the employment of 'pseudo-count' methods, in which hypothetical sequences are included in the sample of sequences that comprise the multiple alignment [32•,35-37]. The hypothetical sequences add counts to each column of the PSSM, and their contribution diminishes in direct proportion to the number of real sequences. This procedure can eliminate zero entries, thus allowing log-odds scores to be added across columns. An important question is how to choose a model for pseudo-counts, and several solutions have been considered. In one study [32•], pseudo-counts based on background frequencies [36], substitution probabilities and Dirichlet mixtures [38] were pitted against the average score method in tests using several PSSMs.
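The pseudo-count idea can be sketched in a few lines. This example uses the simplest of the models mentioned above, pseudo-counts distributed according to background frequencies; the uniform background, the total of 5 pseudo-counts per column, the function names and the toy alignment are all assumptions made for illustration, and the input is assumed to be an ungapped block.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed uniform background; a real application would use observed frequencies.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def pssm_column(residues, background=UNIFORM_BACKGROUND, total_pseudocounts=5.0):
    """Log-odds scores (bits) for one alignment column with background pseudo-counts."""
    counts = Counter(residues)
    n = len(residues)
    scores = {}
    for aa in AMINO_ACIDS:
        pseudo = total_pseudocounts * background[aa]
        # Pseudo-counts matter less as the number of real sequences n grows.
        freq = (counts[aa] + pseudo) / (n + total_pseudocounts)
        scores[aa] = math.log2(freq / background[aa])
    return scores


def build_pssm(alignment, background=UNIFORM_BACKGROUND, total_pseudocounts=5.0):
    """One score column per alignment column of an ungapped block."""
    return [pssm_column(col, background, total_pseudocounts)
            for col in zip(*alignment)]


def scan(pssm, sequence):
    """Slide the PSSM along a sequence; return (best_score, offset) or None."""
    width = len(pssm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(col[aa] for col, aa in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


block = ["ALK", "ALR", "AIK", "AVK"]  # hypothetical three-column block
pssm = build_pssm(block)
print(scan(pssm, "GGGALKGG"))
# An invariant column becomes less tolerant as more sequences support it.
print(pssm_column("A")["G"], pssm_column("A" * 50)["G"])
```

Because the pseudo-counts are diluted by real observations, the score for a non-alanine at a position that is alanine in 50 sequences is much lower than at a position supported by a single sequence, which is the behavior the average score method lacks.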
More recently, we have found that another important question is how many pseudo-counts to add to a column (Table 1). When the number of pseudo-counts is proportional to the number of different residues in that column, the performance of PSSMs based on substitution probabilities improves dramatically, outperforming all other methods.

PSSM performance can also be improved by differentially weighting sequences to reduce redundancy [33,34•]. In a comprehensive evaluation of sequence weights, improvements were seen using all sequence weighting methods, with the best results for three very different methods [39•]. Good performance was also reported for a sequence-weighting method that maximizes the discrimination between true positives and background [31].

The application of scoring matrices for sequence alignment to structure prediction
In the absence of a reliable strategy for predicting a structure from sequence information alone, prediction methods have largely focused on deducing a structural model for a sequence from its direct comparison to known structures (Bryant, this issue, pp 377-385). For sequences that can be accurately aligned with the sequence of a known structure, the problem is essentially solved, because homologous sequences are known to be closely similar in structure [40]. Currently, 27% of the sequences in the SWISS-PROT database fall into this category [41]. This percentage should continue to increase, partly because sequence-alignment methods have been improving as described above. In addition, most newly determined protein sequences belong to known families, so that even with large-scale sequencing projects, the number of newly discovered families and domains is leveling off ([42,43]; Bork and Koonin, this issue, pp 366-376; Murzin, this issue, pp 386-394).

As families grow larger, distant relatives are better detected in database searches, and the added information can improve multiple sequence alignments.
For example, the Clustal W program [44••] first builds a tree, then aligns sequences progressively from the leaves to the root. In general, the denser the tree, the better the alignment. As the root is approached, increasingly dissimilar sequences are aligned. Gap penalties are modified depending upon the composition and degree of conservation at each position, in common with HMMs [29••,31] and other multiple alignment methods [45]. Rather than use a single substitution matrix, Clustal W incorporates a progressive series, such that the matrix chosen is appropriate for the degree of dissimilarity encountered. Multiple-matrix strategies are also available for pairwise alignment applications [13,46].

Table 1
Features and performance of methods for PSSM construction.

Method               Considers       Sensitive to number   Position-   Relative
                     substitutions   of sequences          specific    performance*
Normalized counts    No              No                    No
Average score        Yes             No                    No          +
Pseudo-counts
  Background         No              Yes                   No          +
  Substitution       Yes             Yes                   No          ++
  Dirichlet          Yes             Yes                   Yes         +++
  Position-based†    Yes             Yes                   Yes         ++++

*The best PSSM performance was obtained when the method for PSSM construction considers substitutions, is sensitive to the number of sequences and is position-specific. The evaluations were based on relative performance of PSSMs that were constructed using the indicated method applied to 1673 alignment blocks representing 459 protein families. Similar results were obtained in previous tests of average score and background, substitution probability and Dirichlet mixture pseudo-count methods [32•].
†The position-based method uses substitution probabilities, but pseudo-counts are added in proportion to the number of different residues in a column.
Adapted from [64].

When no alignment is available between the sequence of interest and a sequence with a known structure, an attempt can be made to align the sequence directly to known structures.
This basic approach has successfully identified misfolded candidate models in structural analysis [47]. However, finding the right structure in the context of a database search is far more challenging. Furthermore, accurate alignment of the candidate sequence to the sequence representing the structure can be difficult, because homologous structures with dissimilar sequences are not expected to superimpose well in space. Nevertheless, numerous sequence-structure alignment methods aimed at solving this problem have appeared during the past few years (reviewed in [48•]).

The three-dimensional-one-dimensional (3D-1D) profile method employs a conventional alignment algorithm to compare a query sequence to a sequence database. However, either the query sequence or each of the database sequences is converted to a string of environments representing successive residues in a known structure. Scores are obtained from a matrix in which the rows are made up of amino acids and the columns are made up of 3D environments [28]. That is, the only difference from standard sequence alignment is the choice of scoring matrix. Threading is more complex than the 3D-1D profile method because it explicitly takes into account residue contacts (although 'threading' generally refers to all such methods).

How well does threading work? One of the difficulties in comparing threading with sequence-only methods is that a realistic candidate for threading must lack similarity to sequences with known structures (otherwise a good structural model could be obtained using homologous sequence alignments). Because sequence similarity is operationally defined by conventional substitution matrices, these will fail to work on any sequence that is deemed suitable for threading. This does not mean that substitution matrices are inherently inadequate for alignment of such dissimilar sequences, just that those designed to detect homology are wholly inappropriate.
Though unlikely, it remains possible that much of what threading methods detect can also be found by an amino acid substitution matrix based on an appropriate data set, such as dissimilar sequences with known structures that are superimposable in space.

Scoring matrices are well suited for evaluations because they are simple to change in automated alignment applications. Unfortunately, more complex threading systems are not fully automated and are difficult to evaluate rigorously. This situation fueled a 'blind' structure-prediction competition, in which participants proposed structural models for sequences whose structures had been determined, but not yet released. An entire issue of Proteins: Structure, Function and Genetics is devoted to the results (for a description of the competition and the organization of the issue, see [49]). Although very interesting and informative, a competition such as this is no substitute for rigorous evaluation of the type highlighted in this review. For example, participants were allowed to choose the proteins whose structure they would like to predict. With respect to sequence-structure alignment methods [50], it appears that none of the participants used purely automated methodology. Hopefully, future competitions could be carried out by submitting sequences to participants' automated internet servers, allowing for full objectivity. Automation would also encourage the wider use of successful methods; biologists have become accustomed to the automation of sequence-analysis tasks, a trend that is accelerating with the expanding use of the internet, with the result that methods requiring manual intervention and special expertise are likely to fall by the wayside regardless of merit.

The competition demonstrated that threading a sequence through a structure is very challenging. Nine teams submitted a total of 44 predictions on 11 proteins that turned out to have known structural homologs.
With more than 200 unrelated structures to choose from, the fact that there were 14 correct predictions indicates some degree of success. However, only one of these 14 predictions also correctly aligned all of the secondary structural elements with the threaded sequence [50]. This is worrisome because reasonably correct alignments are needed for predictions to be useful, such as in site-directed mutagenesis studies. For the 'conventional' threading methods, the evaluators thought that further progress will require an increasingly detailed description of environments, including local and non-local interactions along the sequence [50].

Two teams that followed very different strategies achieved comparable success. Both aligned the query sequence with database homologs and used their resulting multiple sequence alignments to predict secondary structures. The predicted secondary structure strings were then used to search secondary structure strings obtained from the structural database. Hubbard and Park [51•] added an additional step: they constructed a PSSM from the multiple alignment and searched the structural database sequences. This combination of secondary structure prediction with PSSM scoring has potential for full automation. Excellent secondary structure predictions can be obtained automatically from multiple alignments using substitution matrices specific for helices, sheets and coils [52•].

Supplementing scoring matrices with additional information

Combining sequence information and secondary structure information is one way that uncertain pairwise alignments can be improved [53]. Structural information represented as 3D-1D environments can also be combined with PSSMs to improve database searches [54].
Although it is assumed that these methods contribute information beyond that which can be obtained from sequences alone, treating sequences as strings of amino acid properties has likewise been successful in detecting homologies when conventional substitution matrices fail [55]. Alternative substitution matrices might also play a role in improving alignments. Two new substitution matrices for the alignment of transmembrane regions should be useful when conventional matrices fail [56,57]. A 400 x 400 matrix of dipeptide substitutions is unlikely to be useful because it is far too 'contrasty': only 54% of possible dipeptide substitutions are represented among 80200 observed dipeptide pairs [58]. However, an approach that scores clusters of residues shows considerable promise [59•].

Conclusions

Recent improvements in substitution matrices and PSSMs have dramatically improved alignment-based procedures. We know this because of comprehensive evaluation studies, which also reveal areas in which improvement is possible. General pairwise substitution scores might be about as good as they need to be, given that gap penalties limit alignment quality. However, multiple alignments provide position-specific information that can overcome this limitation, potentially allowing for the extraction of protein structural features. With the expansion of protein families to include more diverse sequences, such as those from whole bacterial genomes [60], these multiple alignment-based methods will become applicable to more protein families.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol 5, suppl 3. Edited by Dayhoff M. Washington, DC: National Biomedical Research Foundation; 1978:345-358.
2. Bork P, Ouzounis C, Sander C: From genome sequences to protein function. Curr Opin Struct Biol 1994, 4:393-403.
3. Brenner S: Phosphotransferase sequence homology. Nature 1987, 329:21.
4. Bairoch A: PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res 1992, 20:2013-2018.
5. Gribskov M, McLachlan AD, Eisenberg D: Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA 1987, 84:4355-4358.
6. Dayhoff MO, Eck RV (Eds): Atlas of Protein Sequence and Structure, vol 3. Silver Spring: National Biomedical Research Foundation; 1988.
7. Altschul SF: Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 1991, 219:555-565.
8. Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185:862-864.
9. Miyata T, Miyazawa S, Yasunaga T: Two types of amino acid substitutions in protein evolution. J Mol Evol 1979, 12:219-236.
10. Feng DF, Johnson MS, Doolittle RF: Aligning amino acid sequences: comparison of commonly used methods. J Mol Evol 1985, 21:112-125.
11. Rao JKM: New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters. Int J Pept Protein Res 1987, 29:276-281.
12. Miyazawa S, Jernigan RL: A new substitution matrix for protein sequence searches based on contact frequencies in protein structures. Protein Eng 1993, 6:267-278.
13. Henikoff S, Henikoff JG: Performance evaluation of amino acid substitution matrices. Proteins 1993, 17:49-61.
14. Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. CABIOS 1992, 8:275-282.
15. Wilbur WJ: On the PAM matrix model of protein evolution. Mol Biol Evol 1985, 2:434-447.
16. Benner SA, Cohen MA, Gonnet GH: Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng 1994, 7:1323-1332.
17. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256:1443-1445.
18. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89:10915-10919.
19. Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nucleic Acids Res 1991, 19:6565-6572.
20. Risler JL, Delorme MO, Delacroix H, Henaut A: Amino acid substitutions in structurally related proteins. A pattern recognition approach. Determination of a new and efficient scoring matrix. J Mol Biol 1988, 204:1019-1029.
21. Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL: Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci 1992, 1:216-226.
22. Johnson MS, Overington JP: A structural basis for sequence comparisons. An evaluation of scoring methodologies. J Mol Biol 1993, 233:716-738.
23. •• Pearson WR: Comparison of methods for searching protein sequence databases. Protein Sci 1995, 4:1145-1160.
Local alignment searching algorithms were tested using several different substitution matrices and gap penalty combinations. Search sensitivity is significantly improved using modern matrices, such as BLOSUM 45-55, with optimized gap penalties. The best overall performance was obtained with the Smith-Waterman algorithm and log-length normalization of alignment scores.
24. • Vogt G, Etzold T, Argos P: An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol 1995, 249:816-831.
Eighty substitution matrices were evaluated using alignment accuracy as a criterion. Several modern log-odds substitution matrices performed well when modified for global alignment and paired with optimized gap penalties.
25. Taylor WR: Motif-biased protein sequence alignment. J Comput Biol 1994, 1:297-310.
The extreme sensitivity of global alignments to gap penalty choice can be reduced by scoring runs of matches higher than scattered matches.
26. Mott R: Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull Math Biol 1992, 54:59-75.
27. Vingron M, Waterman MS: Sequence alignment and penalty choice: review of concepts, case studies and implications. J Mol Biol 1994, 235:1-12.
28. Bowie JU, Luthy R, Eisenberg D: A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253:164-170.
29. • Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in computational biology. J Mol Biol 1994, 235:1501-1531.
An iterative method for multiple sequence alignment is described. PSSM parameters, such as gap penalties, are determined in the course of aligning sequences.
30. Baldi P, Chauvin Y, Hunkapiller T, McClure MA: Hidden Markov models of biological primary sequence information. Proc Natl Acad Sci USA 1994, 91:1059-1063.
31. Eddy SR, Mitchison G, Durbin R: Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 1995, 2:9-24.
32. • Tatusov RL, Altschul SF, Koonin EV: Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci USA 1994, 91:12091-12095.
PSSMs representing ungapped alignment blocks were iteratively expanded by searching sequence databases. In this context, different methods for calculating PSSM column weights were evaluated. The pseudo-count method based on mixtures of Dirichlet components performed best.
33. Luthy R, Xenarios I, Bucher P: Improving the sensitivity of the sequence profile method. Protein Sci 1994, 3:139-146.
34. • Thompson JD, Higgins DG, Gibson TJ: Improved sensitivity of profile searches through the use of sequence weights and gap excision. CABIOS 1994, 10:19-29.
Branch-proportional sequence weights were described and found to improve the performance of several PSSMs. Improvements were also obtained by using BLOSUM 62 rather than PAM 250 and by selectively excluding gaps.
35. Dodd IB, Egan JB: Systematic method for the detection of potential lambda Cro-like DNA-binding regions in proteins. J Mol Biol 1987, 194:557-564.
36. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262:208-214.
37. Claverie J-M: Some useful statistical properties of position-weight matrices. J Comput Chem 1994, 18:287-293.
38. Brown MP, Hughey R, Krogh A, Mian IS, Sjolander K, Haussler D: Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Proceedings of the First International Conference on Intelligent Biology. Edited by Hunter L, Searls D, Shavlik J. Washington DC: AAAI Press; 1993:47-55.
39. • Henikoff S, Henikoff JG: Position-based sequence weights. J Mol Biol 1994, 243:574-578.
A simple method is introduced for weighting sequences that is based on the diversity of each position. Performance was evaluated using ungapped alignment blocks to construct PSSMs representing 698 protein families. Position-based, Voronoi and branch-proportional weights outperformed other methods.
40. Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9:56-68.
41. Schneider R, Sander C: The HSSP database of protein structure-sequence alignments. Nucleic Acids Res 1996, 24:201-205.
42. Green P: Ancient conserved regions in gene sequences. Curr Opin Struct Biol 1994, 4:404-412.
43. Doolittle RF: The multiplicity of domains in proteins. Annu Rev Biochem 1995, 64:287-314.
44. • Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680.
A popular progressive method for multiple sequence alignment was substantially improved by the implementation of several new features, including sequence weights, multiple substitution matrices and residue-specific gap penalties.
45. Taylor WR: An investigation of conservation-biased gap-penalties for multiple protein sequence alignment. Gene 1995, 165:GC27-GC35.
46. Altschul SF: A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol 1993, 36:290-300.
47. Luthy R, Bowie JU, Eisenberg D: Assessment of protein models with three-dimensional profiles. Nature 1992, 356:83-85.
48. • Bryant SH, Altschul SF: Statistics of sequence-structure threading. Curr Opin Struct Biol 1995, 5:236-244.
A critical review that emphasizes the need for statistical interpretation of threading scores of the type now routinely used for assessing sequence alignments.
49. Moult JT, Pedersen JT, Judson R, Fidelis K: A large-scale experiment to assess protein structure prediction methods. Proteins 1995, 23:ii-iv.
50. Lemer CM-R, Rooman MJ, Wodak SJ: Protein structure prediction by threading methods: evaluation of current techniques. Proteins 1995, 23:337-355.
51. • Hubbard TJ, Park J: Fold recognition and ab initio structure predictions using hidden Markov models and beta-strand pair potentials. Proteins 1995, 23:398-402.
Structure predictions were based on multiply aligning each unknown sequence with available homologs and using the alignment both to predict secondary structural features and to construct PSSMs. A compatible structure was one with secondary structural features that aligned well with the predicted features and with a sequence that had a high PSSM score. Two out of four structures were correctly identified and the other two were near misses.
52. • Heringa J, Argos P: A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%. Protein Sci 1995, 4:2517-2525.
Amino acid substitution matrices specific for helix, sheet and coil regions underlie predictions. The performance on multiply aligned sequences is at least as good as that obtained by complex machine-learning approaches.
53. Gracy J, Chiche L, Sallantin J: Improved alignment of weakly homologous protein sequences using structural information. Protein Eng 1994, 6:821-829.
54. Yi TM, Lander ES: Recognition of related proteins by iterative template refinement (ITR). Protein Sci 1994, 3:1315-1328.
55. Hobohm U, Sander C: A sequence property approach to searching protein databases. J Mol Biol 1995, 251:390-399.
56. Cserzo M, Bernassau J-M, Simon I, Maigret B: New alignment strategy for transmembrane proteins. J Mol Biol 1994, 243:388-396.
57. Jones DT, Taylor WR, Thornton JM: A mutation data matrix for transmembrane proteins. FEBS Lett 1994, 339:269-275.
58. Gonnet GH, Cohen MA, Benner SA: Analysis of amino acid substitution during divergent evolution: the 400 by 400 dipeptide substitution matrix. Biochem Biophys Res Commun 1994, 199:489-496.
59. • Han KF, Baker D: Recurring local sequence motifs in proteins. J Mol Biol 1995, 251:176-186.
Cluster analysis identified local sequence motifs shared by multiple protein families. This higher-order sequence information can potentially be used to score alignments for the detection of distant relationships.
60. Nowak R: Bacterial genome sequence bagged. Science 1995, 269:468-470.
61. Henikoff S, Haughn GW, Calvo JM, Wallace JC: A large family of bacterial activator proteins. Proc Natl Acad Sci USA 1988, 85:6602-6606.
62. Viale AM, Kobayashi H, Akazawa T, Henikoff S: rcbR, a gene coding for a member of the LysR family of transcriptional regulators, is located upstream of the expressed set of ribulose 1,5-bisphosphate carboxylase/oxygenase genes in the photosynthetic bacterium Chromatium vinosum. J Bacteriol 1991, 173:5224-5229.
63. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
64. Henikoff JG, Henikoff S: Using substitution probabilities to improve position-specific scoring matrices. CABIOS 1996, in press.