Scores for sequence searches and alignments
Steven Henikoff
Every sequence comparison method requires a set of scores.
For aligning protein sequences, substitution scores are
based on models of amino acid conservation and properties,
and matrices of these scores have substantially improved
in recent years. Position-specific scoring matrices provide
representations of sequence families that are capable of
detecting subtle similarities. Comprehensive evaluations can
effectively guide the choice of scores for sequence alignment
and searching applications, including those that aid in the
prediction of protein structures.
Address
Howard Hughes Medical Institute, Basic Sciences Division, Fred
Hutchinson Cancer Research Center, 1124 Columbia Street, Seattle,
WA 98104, USA; e-mail: [email protected]
Current Opinion in Structural Biology 1996, 6:353-360
© Current Biology Ltd ISSN 0959-440X
Abbreviations
BLOSUM blocks substitution matrix
HMM hidden Markov model
1D one-dimensional
3D three-dimensional
PAM per cent accepted mutation
PSSM position-specific scoring matrix
Introduction
Automated sequence comparisons based on sequence
alignment are among the most familiar procedures in
biological research. For example, the BLAST Internet
servers currently search more than 7000 sequence queries
on an average day. In addition to database searches,
pairwise and multiple alignment of sequences are often
the starting points for in vitro mutagenesis, homology
modeling and other procedures in molecular biology and
biochemistry. In nearly all fields of modern biology,
displays of sequence alignments, trees and dotplots have
become standard features of journal articles.
All sequence-comparison methods require an alignment
algorithm and a set of scores. Substitution scores quantify
the cost of exchanging one residue for another, and gap
penalties quantify the cost of exchanging a residue or a
string of residues for no residue at all. A pairwise alignment
score is the sum of substitution scores and gap penalties
over all aligned residues, so that the best alignment is
considered to be the one that obtains the highest score.
In the case of nucleic acid sequence comparison, only two
substitution scores are used, one for a match and another
for a mismatch. Scores can be represented in the form of a
4×4 'unitary' matrix consisting of ones (or another positive
score) on the diagonal for scoring A-A, C-C, G-G and
T-T matches and zeros (or a negative score) for scoring
mismatches. Because a C-G mismatch is scored the same
as a G-C mismatch, the matrix is symmetrical, and only
half of the off-diagonal scores are needed to provide a
complete scoring scheme for residue substitutions.
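The scoring scheme just described can be illustrated with a minimal sketch. It scores an alignment that has already been made, using a unitary nucleotide matrix and a gap penalty. The function names, the default penalty values and the convention that a gap run of k positions costs gap_open + k * gap_extend are assumptions chosen for illustration, not values taken from any particular program.

```python
def unitary_score(a, b, match=1, mismatch=0):
    """Score one aligned nucleotide pair with a unitary matrix."""
    return match if a == b else mismatch


def alignment_score(top, bottom, gap_open=-2, gap_extend=-1):
    """Sum substitution scores and gap penalties over an existing alignment.

    '-' marks a gap. Assumed convention: a run of k gap characters
    costs gap_open + k * gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_row = None  # row holding the current gap run: 'top', 'bottom' or None
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if gap_row != row:      # a new gap run starts here
                total += gap_open
                gap_row = row
            total += gap_extend
        else:
            total += unitary_score(a, b)
            gap_row = None
    return total


print(alignment_score("ACGT-A", "ACCTGA"))  # 4 matches, 1 mismatch, 1 gap -> 1
```

Among candidate alignments of the same two sequences, the one with the highest total would be considered best under this scheme.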
Although a unitary matrix suffices for typical nucleic
acid alignment tasks, alignments of protein sequences
benefit from scores that take into account residue-specific
information (Fig. 1a). A 20×20 unitary matrix is outperformed in alignment tasks by matrices that customize
scores for each particular amino acid pair, 210 in all (20
possible match scores and 190 possible mismatch scores).
A frequent substitution found in correctly aligned proteins,
such as Asp→Glu, should receive a higher score than an
infrequent substitution, such as Asp→Leu. As alignment
uncertainty increases, the choice of substitution scores
(and gap penalties) becomes increasingly important. The
PAM (per cent accepted mutation) 250 mutation data
matrix [1] was designed to be effective in the range
of typical alignment uncertainty, and dominated protein
sequence alignment applications for 15 years. Recently,
however, the problem of substitution-matrix choice has
received renewed attention, and, as a result, the PAM 250
matrix has been replaced by modern matrices for most
sequence-alignment tasks. During the past year, important
progress has been made in determining which of these
matrices work well for alignment applications that require
gap penalties.
Multiply aligned sequences provide implicit protein structural information. Such information is rapidly increasing
with the expansion of sequence databanks and is now
available for most newly determined sequences [2]. By
identifying constraints on residues and regions from multiple alignments, applications such as database searching
can be improved. The simplest approach is to construct
a pattern consisting of invariant and conserved residues,
allowing for ambiguities where necessary, and to search
for matches to the pattern in sequence databases [3,4].
Unfortunately, these patterns discard potentially useful
information and rarely perform as well as modern pairwise
searching methods. To use multiple-alignment information
effectively, more detailed scores are employed: each
column of the alignment is used to calculate the values
in a column of a position-specific scoring matrix (PSSM;
see Fig. 1b) [5]. To search a database, the PSSM is
slid along each sequence entry, and, at each alignment,
the score for every amino acid is obtained from the
aligned PSSM column. During the past two years,
improvements in PSSM construction have led to more
effective database-searching tools. PSSMs have also been
applied to structural prediction tasks.
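The sliding search described above can be sketched as follows. This is an ungapped illustration only: the PSSM is represented as a list of per-column score dictionaries, the toy scores are hypothetical, and residues missing from a column are given a score of 0, which is a simplifying assumption (a real PSSM would normally assign them negative scores).

```python
def scan_pssm(pssm, sequence):
    """Slide a PSSM along a sequence and return (best_score, best_offset).

    pssm is a list of dicts, one per column, mapping residue -> score.
    Returns None if the sequence is shorter than the PSSM.
    """
    width = len(pssm)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        # Each residue is scored by the PSSM column it is aligned with.
        score = sum(col.get(res, 0) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Hypothetical three-column PSSM, for illustration only.
TOY_PSSM = [
    {"A": 5, "G": 1},
    {"L": 4, "I": 3, "V": 2},
    {"D": 6, "E": 3},
]

print(scan_pssm(TOY_PSSM, "MKAIDGALE"))  # (14, 2): 'AID' at offset 2
```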
Sequences and topology
Figure 1
[Figure 1 graphic: (a) pairwise alignment of METR (from residue 134) and RBCR (from residue 137), scored with a substitution matrix; (b) multiple alignment of LysR family members and the PSSM column derived from the boxed alignment column.]
The use of scoring matrices in pairwise and multiple sequence
alignment applications. Examples are from the LysR family of bacterial
transcriptional regulators [61,62]. (a) The alignment between two
amino acid sequences is given, where identities are indicated by
vertical lines. The substitution of a given amino acid (shown in
single letter code) with any other amino acid is given a score. The
matrix provides the scores for all possible substitution combinations.
Frequent substitutions found in correctly aligned proteins receive
a high positive score. Infrequent substitutions receive a negative
score. METR, Salmonella typhimurium MetR; RBCR, Chromatium
vinosum RbcR; D, aspartic acid; R, arginine. (b) In the multiple
alignment, residues that are identical in >50% of the sequences
are underlined, and residues with an average score in pairwise
comparisons exceeding a threshold value in >50% of the sequences
are indicated in bold [62]. A position-specific scoring matrix (PSSM)
derives scores from the multiple alignment, column by column. The
PSSM column shows percentage scores derived from the boxed
column of the alignment.
Amino acid substitution matrices
Until recently, the mutation data matrix series dominated
sequence-analysis applications. It was based upon two
important concepts introduced by Dayhoff and co-workers
[1,6]. The first was that alignments could be effectively
scored using a 'log-odds' strategy. This involves collecting
alignments that are presumed to be correct, estimating
the frequencies of substituting one residue for another,
and making a ratio of observed to expected substitution
probabilities, where expected probabilities are based on a
model for chance alignments. To score an alignment, the
odds ratios for each residue pair in the alignment should
be multiplied, or equivalently, their logarithms should
be added. A positive log-odds score indicates that the
exchange is more likely to occur in a correct alignment
than in a chance alignment, and vice versa for a negative
score. Altschul [7] has pointed out that all substitution
matrices are at least implicitly log-odds matrices, because
it is possible to back-calculate a theoretical set of
substitution probabilities from any set of substitution
scores, using some reasonable expected probabilities. So, a
set of scores that is based on some notion of shared amino
acid properties [8-12] can be mathematically converted to
'target' substitution probabilities. It seems unlikely that
such hypothetical probabilities will be as good a model
for real alignments as those based on real alignments
themselves, and explicit log-odds matrices have continued
to dominate.
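A minimal sketch of the log-odds calculation may help here. The pair frequencies below are hypothetical toy numbers over a three-letter alphabet, and the function names are illustrative. Expected frequencies are taken as products of background frequencies, which is one simple model for chance alignments and is assumed here rather than drawn from any specific matrix.

```python
import math


def log_odds_scores(pair_freq):
    """Log-odds scores (in bits) from observed frequencies of unordered pairs."""
    # Background frequency of each residue is its marginal in the pair table.
    background = {}
    for (a, b), q in pair_freq.items():
        background[a] = background.get(a, 0.0) + q / 2
        background[b] = background.get(b, 0.0) + q / 2
    scores = {}
    for (a, b), q in pair_freq.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # an unordered pair can arise as (a, b) or (b, a)
        scores[(a, b)] = math.log2(q / expected)
    return scores


def score_alignment(scores, pairs):
    """Adding log-odds scores is equivalent to multiplying the odds ratios."""
    return sum(scores[tuple(sorted(p))] for p in pairs)


# Hypothetical observed pair frequencies (they sum to 1).
TOY_PAIR_FREQ = {
    ("D", "D"): 0.15, ("E", "E"): 0.15, ("L", "L"): 0.30,
    ("D", "E"): 0.30, ("D", "L"): 0.05, ("E", "L"): 0.05,
}

toy_scores = log_odds_scores(TOY_PAIR_FREQ)
print(toy_scores[("D", "E")] > 0)  # frequent exchange: positive score
print(toy_scores[("D", "L")] < 0)  # infrequent exchange: negative score
```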
A second important idea was to base substitution frequencies on estimated mutation rates [6]. Each mutational
event is assumed to be independent of previous events.
For this model to be approximately valid, mutation
rates should be estimated from alignments of closely related sequences; otherwise, intermediate events could be
missed. A consequence of using closely related sequences
to estimate mutation rates is that these rates must be
extrapolated to model greater evolutionary distances. For
example, to calculate the PAM 250 substitution matrix,
the mutation probability matrix that estimates an overall
rate of 1 per cent accepted mutation (1 PAM) is squared
for 2 PAM, cubed for 3 PAM, and so on until 250
PAM is reached, whereupon the probabilities are used
to calculate a log-odds matrix. The extrapolation can
potentially magnify inaccuracies in the estimations of rates,
and comprehensive empirical tests [13] suggest that this
is indeed the case, both for Dayhoff's original PAM series
and for an updated series based on much more alignment
data [14]. A likely source of inaccuracy in scoring distant
relationships dominated by selection is that closely related
sequences are dominated by amino acid exchanges that
require only a single nucleotide change [15,16]. One
solution to this problem is to base the extrapolation on
mutation rates estimated from distant relationships, as in
the matrix described by Gonnet et al. [17].
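The extrapolation step can be sketched with a toy example. The two-state matrix below is hypothetical (a real mutation probability matrix is 20×20), and converting the extrapolated probabilities to scores as log2(M[i][j] / f[j]) is one common formulation assumed here for illustration.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a mutation probability matrix to the n-th power (n PAM)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(prob, freqs):
    """Convert extrapolated probabilities to log-odds scores in bits."""
    return [[math.log2(prob[i][j] / freqs[j]) for j in range(len(freqs))]
            for i in range(len(freqs))]


# Hypothetical 1 PAM matrix: each residue changes with probability 0.01.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]
FREQS = [0.5, 0.5]

pam2 = mat_power(PAM1_TOY, 2)      # squared for 2 PAM
pam250 = mat_power(PAM1_TOY, 250)  # repeated multiplication up to 250 PAM
scores250 = log_odds(pam250, FREQS)
print(pam2[0][0], pam250[0][0])
```

Any error in the 1 PAM estimates is carried through every multiplication, which is one way to see why extrapolation can magnify inaccuracies.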
An alternative approach to constructing log-odds substitution matrices is to obtain substitution probabilities directly
from alignments of distantly related sequences without
extrapolation. This has been accomplished in two ways.
One method [18] utilized ungapped multiple-sequence
alignments representing groups of related proteins present
in the Blocks Database [19]. Substitution probabilities
were calculated from counts of amino acid pairs within
each column of every block in the database. To reduce
the contribution of closely related sequences to pair
counts, sequences were clumped within blocks based
on a percentage identity, and their contributions were
averaged when calculating pair frequencies. For example,
BLOSUM 62 (blocks substitution matrix at 62%) is
the log-odds matrix derived from pair counts between
sequence segments that are less than 62% identical.
This procedure resulted in a series of log-odds matrices
with average mutual information per residue pair (relative
entropy) [7] ranging from 0.2 to 1.5 bits (1 bit of
information is the answer to a yes/no question where
the answer yes is as likely as the answer no). A second
method [20,21] utilized alignments based on selected
superpositions of homologous structures. Structure-based
log-odds matrices can be derived directly from pair counts
in the same ways as for the BLOSUM series, but without
the clumping procedure [13,22].
Empirical evaluations of substitution matrix performance
The evaluation of substitution matrices can give different
results depending upon whether the application that uses
them seeks a local or a global alignment and whether
or not gap penalties are used. Local pairwise alignment
algorithms seek high-scoring sub-sequence alignments,
whereas global algorithms seek alignments over the full
lengths of both sequences. Unlike global alignments,
useful local alignments can be obtained without gap
penalties. The most effective local-alignment methods
look for high-scoring alignments of any length, in which
case positive values in the substitution matrix extend
alignment length and negative values limit it. Theory
predicts that for database searching, about 30 bits are
necessary to elevate a typical ungapped alignment of
a conserved region above background (Fig. 2), which
corresponds to an alignment length of ~30 amino acids for
a matrix with relative entropy of ~1 bit per residue pair
[7]. In practice, BLOSUM 62 (with a relative entropy of
0.7 bits) performed best overall in comprehensive BLAST
and FASTA tests involving challenging queries from 257
different protein families [13,18]. Excellent performance
was also obtained using a structure-based matrix (with
a relative entropy of 0.92 bits), whereas PAM matrices
performed relatively poorly. Updated matrices based on
the PAM model [14,17] outperformed corresponding
matrices from Dayhoff across the board, confirming that
Dayhoff's 1978 data set was too sparse to obtain reliable
scores for all residue pairs. The best performance of PAM
matrices with BLAST was for matrices in the 0.7-1.0 bit
range, consistent with theory [7]. With only 0.38 bits per
residue pair on average, the PAM 250 matrix was a poor
performer in these tests.
Local alignment algorithms that do not employ gap
penalties, such as BLAST, might be especially sensitive to
substitution-matrix choice because the matrix determines
alignment length (Fig. 2). For gapped local alignments, the
gap penalties chosen for best performance may balance
scores [23••], and so reduce performance differences
between substitution matrices. Global alignments do not
limit alignment length and typically do not use negative
substitution scores, and may, therefore, be insensitive to
relative entropy differences among matrices in a series.
Indeed, evaluations of 13 non-negative matrices in global
alignment applications on three protein groups showed
good performance for several different types of modern
log-odds matrices [22].
Figure 2
(a)
MetR: 134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLIYPVQRSRLDVWRHFLQPAG
          |     |||         |       |        ||||||    |    || ||                |    |
RbcR: 137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVMREEGSGTRQAMERFFSERG
(b)
MetR: 134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
          |     |||         |       |        ||||||    |    || ||
RbcR: 137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM
(c)
MetR: 169 PDHPLASKTQITPEDLASETLLI
          ||||||    |    || ||
RbcR: 172 PDHPLAGERAISLARLAEETFVM
© 1996 Current Opinion in Structural Biology
The relationship between the local alignment length and the relative entropy (average mutual information per residue pair) of a substitution matrix
[7,46]. An aligned segment pair is detected in BLAST searches [63] of the SWISS-PROT database using members of the BLOSUM series.
Salmonella typhimurium MetR (upper sequence in each pair), a member of the LysR family of bacterial transcriptional regulators [61], detects a
single region of Chromatium vinosum RbcR (lower sequence in each pair) [62]. (a) Using BLOSUM 50 (0.5 bits per residue pair) the alignment
extends for 77 amino acids, and an alignment value of 29 bits is achieved. (b) Using BLOSUM 62 (0.7 bits per residue pair) the alignment
extends for 58 amino acids and an alignment value of 30.5 bits is achieved. (c) Using BLOSUM 100 (1.5 bits per residue pair) the alignment
extends for only 23 amino acids and an alignment value of 27.5 bits is achieved. Note that BLOSUM 62 provides the most discrimination (30.5
bits), and that BLOSUM 100 fails to elevate the alignment value (27.5 bits) above the best chance alignments (30 bits for the first false positive).
Multiple alignment using many other diverse members of this family [62] suggests that these alignments are correct, but 19 more aligned amino
acids in the BLOSUM 50 alignment or 35 fewer in the BLOSUM 100 alignment relative to the BLOSUM 62 alignment reduces searching
performance. The offset in alignment is shown at the left of each sequence, and identities are indicated by vertical lines.
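The arithmetic that links relative entropy to local alignment length can be sketched in a few lines. This is a rough illustration: it assumes that an alignment accumulates the matrix's average relative entropy at every residue pair, which real alignments such as those in Fig. 2 do only approximately. The function names are illustrative, and the 30-bit target and the entropies are the figures quoted in the text.

```python
def expected_alignment_length(target_bits, bits_per_pair):
    """Pairs needed to reach target_bits at the matrix's average rate."""
    if bits_per_pair <= 0:
        raise ValueError("relative entropy must be positive")
    return target_bits / bits_per_pair


def observed_bits_per_pair(alignment_bits, length):
    """Average information actually obtained per aligned residue pair."""
    return alignment_bits / length


for name, entropy in [("BLOSUM 50", 0.5), ("BLOSUM 62", 0.7), ("BLOSUM 100", 1.5)]:
    print(name, round(expected_alignment_length(30, entropy), 1))

# Observed rate for the BLOSUM 62 alignment reported in Fig. 2.
print(round(observed_bits_per_pair(30.5, 58), 2))
```

A matrix with low relative entropy therefore needs a long alignment to reach the threshold, whereas a matrix with high relative entropy reaches it over a short one but cannot exploit weaker, longer similarities.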
The relationship between substitution scores and gap
penalties is clarified in two recent studies. Vogt et al. [24••]
evaluated non-negative matrices for alignment accuracy on
structurally aligned pairs from 37 different protein families,
primarily using a global alignment algorithm. Pearson
[23••] evaluated log-odds matrices in database-searching
tests for 67 families, primarily using the Smith-Waterman
local alignment algorithm. In both studies, a gap penalty
consisted of a negative score for opening a gap and a
negative extension score proportional to gap length. The
particular choice of gap penalties was found to be crucial
for best matrix performance (Fig. 3), and so both studies
tested a wide range for each matrix. In global alignment
tests [24••], the Gonnet matrix [17] performed marginally
the best out of the 80 tested with a gap penalty of 6 for
opening and 0.8 for extension. However, simply dropping
the extension penalty to 0.6 dropped the performance
of the Gonnet matrix below that of the next five best
performers. The best matrices performed optimally with
gap penalties that ranged widely, from 6-17.5 for opening
a gap and from 0.12-0.8 for extension. Except for the fact
that all six of the best performers are modern log-odds
matrices, there was no common theme: two were based
on the PAM model using more distant alignments [16,17],
two were from the BLOSUM series [18] and two were
structure based [13,22]. Very similar matrices performed
comparably when coupled with optimized gap penalties
(Fig. 3). Considering this variety, it is doubtful that current
substitution matrices can be significantly improved to
obtain more accurate pairwise global alignments. What is
clearly needed is either a better gap penalty formula or a
means of reducing the sensitivity of pairwise alignment to
gap penalty choice [25•].
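To show how the opening and extension penalties enter a local alignment, the sketch below computes a Smith-Waterman score with affine gaps. It is a generic textbook-style formulation, not the implementation used in either study. The function name, the toy match/mismatch scores, and the convention that a gap of length k costs gap_open + k * gap_extend (both given as positive magnitudes) are assumptions.

```python
def smith_waterman_score(s, t, score, gap_open, gap_extend):
    """Best local alignment score of s and t with affine gap penalties."""
    neg = float("-inf")
    n, m = len(s), len(t)
    best_any = [[0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    gap_in_s = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with t[j-1] opposite a gap
    gap_in_t = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with s[i-1] opposite a gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            gap_in_s[i][j] = max(gap_in_s[i][j - 1] - gap_extend,
                                 best_any[i][j - 1] - gap_open - gap_extend)
            gap_in_t[i][j] = max(gap_in_t[i - 1][j] - gap_extend,
                                 best_any[i - 1][j] - gap_open - gap_extend)
            diagonal = best_any[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # The floor of 0 is what makes the alignment local.
            best_any[i][j] = max(0, diagonal, gap_in_s[i][j], gap_in_t[i][j])
            best = max(best, best_any[i][j])
    return best


def toy_score(a, b):
    """Hypothetical substitution scores: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(smith_waterman_score("AAAAGGGG", "AAAATTGGGG", toy_score, 2, 1))    # 12
print(smith_waterman_score("AAAAGGGG", "AAAATTGGGG", toy_score, 100, 1))  # 10
```

With the modest penalty the best alignment spans both conserved runs through a two-residue gap; with the prohibitive penalty the best alignment is ungapped and scores lower, which illustrates why results depend on the penalties chosen.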
Pearson's evaluation of local alignment searching algorithms [23••] confirmed previous BLAST and FASTA
tests [13], but revealed unexpected complexities when gap
penalties were used. Several modern matrices performed
well, even extrapolated matrices that did not perform
well in BLAST tests [13,18]. This might be because
the best matrices and gap penalties provide alignments
that are somewhat global in character [26,27]. However,
simply adding a constant positive value to matrices to
force a more global alignment [27] decreased performance.
Pearson also discovered that raw alignment scores are not
ideal for ordering Smith-Waterman search results, and that
substantially improved performance is obtained by the
normalization of scores based on the log of the length of
the query sequence [23••]. Optimal gap penalties changed
when normalization was used.
Figure 3
[Figure 3 graphic: 20×20 amino acid matrix showing BLOSUM 50 scores in the lower triangle and the difference matrix in the upper triangle; residue order C S T P A G N D E Q H R K M I L V F Y W.]
Dissimilar matrices can provide comparable performance with
optimized gap penalties. The BLOSUM 50 [18] (lower triangle) and
difference matrix (upper triangle) obtained by subtracting the matrix
of Gonnet et al. [17] position by position are shown. Amino acids
are given in single letter code. Although BLOSUM 50 and other
BLOSUM matrices substantially outperformed the Gonnet matrix
in BLAST (local ungapped) alignment tests [13,18], both matrices
were top performers in Smith-Waterman (local) and global alignment
tests using optimized gap penalties. For Smith-Waterman, the best
BLOSUM 50 performance was obtained with a gap penalty of -12
for opening a gap and -2 for extension (-12 and -1 using log-length
normalization), and the best Gonnet performance was obtained
with gap penalties of -14 and -1 (-12 and -2 using log-length
normalization) [23••]. For global alignments, the best BLOSUM 50
performance was obtained with an addition of 3.4 to each entry and
gap penalties of -9.5 and -0.6, and the best Gonnet performance
was obtained with the addition of 2.5 to each entry and gap penalties
of -6 and -0.8 [24••]. Performance differences with optimized gap
penalties were not statistically significant.
Taken together, evaluation studies indicate that strictly
local alignment applications without gap penalties are very
sensitive to matrix choice, whereas the addition of gap
penalties reduces performance differences attributable to
the matrix. Even so, modern log-odds matrices, including
BLOSUM 45-55 [18], structure-based matrices [13,22] and
modern PAM matrices [14,17] strongly and consistently
outperformed Dayhoff's PAM 250. The optimization of
gap penalties for each matrix and each application is
needed for best performance. A good overall performer
for gapped alignment and searching applications is BLOSUM 50 (Fig. 3).
Position-specific scoring matrices
The term 'position-specific scoring matrix' (PSSM, pronounced 'possum') was coined by Gribskov et al. [5]
to describe their 'profile' method. A PSSM consists of
columns of scores for each amino acid derived from
corresponding columns of a multiple sequence alignment
(Fig. 1b). A column may also include gap scores. The
use of other terms in this particular context, such as
'profile' and 'hidden Markov model' (HMM), often leads
to confusion. A profile is a PSSM constructed using the
average score method [5] or is a string of structural
environments in place of amino acids [28]. An HMM
is a PSSM constructed using an iterative alignment
procedure that has the attractive feature of determining
position-specific gap penalties in the course of finding
the alignment ([29••,30,31]; Eddy, this issue, pp 361-365).
The term 'position-dependent weight matrix' [32•] is
synonymous with PSSM.
Recent improvements in substitution matrices are also
applicable to those PSSMs in which the score for a
column is taken from the average of scores obtained
from a substitution matrix [33,34•]. A drawback of this
method is that when many sequences are represented in
an alignment, scores from a substitution matrix reduce
specificity. For example, if in a multiple alignment of
50 diverse sequences, alanine is in the same position
in all 50 sequences, then substitutions of other amino
acids are all but ruled out in correct alignments; however,
the average score is the same as for a single sequence
with alanine, and so that PSSM position will be very
tolerant of non-alanines. Alternatively, observed counts
of amino acids in a multiple alignment column can
be normalized and used directly; however, zero entries
prohibit taking logarithms, and there is no log-odds basis
for normalized scores that are added across columns. This
situation has led to the employment of 'pseudo-count'
methods, in which hypothetical sequences are included
in the sample of sequences that comprise the multiple
alignment [32•,35-37]. The hypothetical sequences add
counts to each column of the PSSM, and their contribution
diminishes in direct proportion to the number of real
sequences. This procedure can eliminate zero entries, thus
allowing log-odds scores to be added across columns.
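The effect of pseudo-counts on a single column can be sketched as follows. The simplest model is assumed here, in which pseudo-counts are distributed according to background frequencies. The four-letter background, the choice of five pseudo-counts and the function name are hypothetical values chosen for illustration.

```python
import math
from collections import Counter


def column_log_odds(column, background, n_pseudo):
    """Log-odds scores (bits) for one alignment column with pseudo-counts.

    column is a string holding the residue of each real sequence at this
    position; n_pseudo pseudo-counts are spread according to background.
    """
    counts = Counter(column)
    n_real = len(column)
    scores = {}
    for residue, p in background.items():
        # Pseudo-counts guarantee a non-zero frequency for every residue.
        freq = (counts[residue] + n_pseudo * p) / (n_real + n_pseudo)
        scores[residue] = math.log2(freq / p)
    return scores


# Hypothetical background frequencies over a reduced alphabet.
TOY_BACKGROUND = {"A": 0.4, "G": 0.3, "L": 0.2, "W": 0.1}

one = column_log_odds("A", TOY_BACKGROUND, 5)
fifty = column_log_odds("A" * 50, TOY_BACKGROUND, 5)
print(round(one["L"], 2), round(fifty["L"], 2))
```

With one alanine the column remains tolerant of leucine, whereas with 50 alanines the same pseudo-counts are outweighed by real data and leucine is strongly penalized. The average score method cannot make this distinction.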
An important question is how to choose a model for
pseudo-counts, and several solutions have been considered. In one study [32•], pseudo-counts based on
background frequencies [36], substitution probabilities
and Dirichlet mixtures [38] were pitted against the average
score method in tests using several PSSMs. More recently,
we have found that another important question is how
many pseudo-counts to add to a column (Table 1).
When the number of pseudo-counts is proportional to
the number of different residues in that column, the
performance of PSSMs based on substitution probabilities
improves dramatically, outperforming all other methods.
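A sketch of this idea follows, under stated assumptions. The total number of pseudo-counts is made proportional to the number of different residues in the column, and the pseudo-counts are then distributed using substitution probabilities conditioned on the residues actually observed. The multiplier per_residue, the toy conditional probabilities and the function name are hypothetical and are not the values used in the evaluation.

```python
from collections import Counter


def position_based_pseudocounts(column, cond_prob, per_residue=5.0):
    """Return (total, pseudo-counts per residue) for one alignment column.

    cond_prob[i][a] is the probability that residue i is aligned with
    residue a; each row is assumed to sum to 1.
    """
    counts = Counter(column)
    n_real = len(column)
    # More diverse columns receive more pseudo-counts.
    total = per_residue * len(counts)
    pseudo = {}
    for a in cond_prob:
        mix = sum(counts[i] / n_real * cond_prob[i][a] for i in counts)
        pseudo[a] = total * mix
    return total, pseudo


# Hypothetical conditional substitution probabilities.
TOY_COND_PROB = {
    "D": {"D": 0.6, "E": 0.3, "L": 0.1},
    "E": {"D": 0.3, "E": 0.6, "L": 0.1},
    "L": {"D": 0.1, "E": 0.1, "L": 0.8},
}

print(position_based_pseudocounts("DDDD", TOY_COND_PROB))
print(position_based_pseudocounts("DDEL", TOY_COND_PROB))
```

A fully conserved column thus receives few pseudo-counts and stays specific, while a variable column receives more and becomes correspondingly more tolerant.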
PSSM performance can also be improved by differentially
weighting sequences to reduce redundancy [33,34•].
In a comprehensive evaluation of sequence weights,
improvements were seen using all sequence weighting
methods, with the best results for three very different
methods [39•]. Good performance was also reported for a
sequence-weighting method that maximizes the discrimination between true positives and background [31].
The application of scoring matrices for sequence alignment to structure prediction
In the absence of a reliable strategy for predicting a
structure from sequence information alone, prediction
methods have largely focused on deducing a structural
model for a sequence from its direct comparison to
known structures (Bryant, this issue, pp 377-385). For
sequences that can be accurately aligned with the
sequence of a known structure, the problem is essentially
solved, because homologous sequences are known to
be closely similar in structure [40]. Currently, 27% of
the sequences in the SWISS-PROT database fall into
this category [41]. This percentage should continue to
increase, partly because sequence-alignment methods
have been improving as described above. In addition,
most newly determined protein sequences belong to
known families, so that even with large-scale sequencing
projects, the number of newly discovered families and
domains is leveling off ([42,43]; Bork and Koonin, this
issue, pp 366-376; Murzin, this issue, pp 386-394). As
families grow larger, distant relatives are better detected in
database searches, and the added information can improve
multiple sequence alignments. For example, the Clustal
W program [44••] first builds a tree, then aligns sequences
progressively from the leaves to the root. In general,
the denser the tree, the better the alignment. As the
root is approached, increasingly dissimilar sequences are
aligned. Gap penalties are modified depending upon the
composition and degree of conservation at each position,
in common with HMMs [29••,31] and other multiple alignment methods [45]. Rather than use a single substitution
matrix, Clustal W incorporates a progressive series, such
that the matrix chosen is appropriate for the degree of
dissimilarity encountered. Multiple-matrix strategies are
also available for pairwise alignment applications [13,46].
Table 1
Features and performance of methods for PSSM construction.
Method               Considers       Sensitive to number   Position-   Relative
                     substitutions   of sequences          specific    performance*
Normalized counts    No              No                    No
Average score        Yes             No                    No          +
Pseudo-counts
  Background         No              Yes                   No          +
  Substitution       Yes             Yes                   No          ++
  Dirichlet          Yes             Yes                   Yes         +++
  Position-based†    Yes             Yes                   Yes         ++++
*The best PSSM performance was obtained when the method for PSSM construction considers substitutions, is sensitive to the number of
sequences and is position-specific. The evaluations were based on relative performance of PSSMs that were constructed using the indicated
method applied to 1673 alignment blocks representing 459 protein families. Similar results were obtained in previous tests of average score and
background, substitution probability and Dirichlet mixture pseudo-count methods [32•]. †The position-based method uses substitution probabilities,
but pseudo-counts are added in proportion to the number of different residues in a column. Adapted from [64].
When no alignment is available between the sequence of
interest and a sequence with a known structure, an attempt
can be made to align the sequence directly to known
structures. This basic approach has successfully identified
misfolded candidate models in structural analysis [47].
However, finding the right structure in the context of
a database search is far more challenging. Furthermore,
accurate alignment of the candidate sequence to the sequence representing the structure can be difficult, because
homologous structures with dissimilar sequences are not
expected to superimpose well in space. Nevertheless,
numerous sequence-structure alignment methods aimed
at solving this problem have appeared during the past few
years (reviewed in [48•]).
The three dimensional-one dimensional (3D-1D) profile
method employs a conventional alignment algorithm to
compare a query sequence to a sequence database. However, either the query sequence or each of the database
sequences is converted to a string of environments
representing successive residues in a known structure.
Scores are obtained from a matrix in which the rows are
made up of amino acids and the columns are made up
of 3D environments [28]. That is, the only difference
from standard sequence alignment is the choice of scoring
matrix. Threading is more complex than the 3D-1D
profile method because it explicitly takes into account
residue contacts (although 'threading' generally refers to
all such methods).
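The role of the scoring matrix in the 3D-1D profile method can be sketched with a toy example. The two environment classes ('B' for buried, 'E' for exposed) and all score values are hypothetical simplifications. The sketch also scores a fixed, ungapped placement, whereas in practice the same table would be supplied to a conventional alignment algorithm in place of a substitution matrix.

```python
def score_3d1d(sequence, environments, table):
    """Sum the scores for placing each residue in its aligned 3D environment.

    table[aa][env] has amino acids as rows and environments as columns.
    """
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment string must match in length")
    return sum(table[aa][env] for aa, env in zip(sequence, environments))


# Hypothetical scores: hydrophobic residues favor buried positions.
TOY_TABLE = {
    "L": {"B": 2, "E": -1},
    "D": {"B": -2, "E": 1},
    "G": {"B": 0, "E": 0},
}

print(score_3d1d("LDL", "BEB", TOY_TABLE))  # 5: residues suit their environments
print(score_3d1d("DLD", "BEB", TOY_TABLE))  # -5: residues clash with them
```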
How well does threading work? One of the difficulties in
comparing threading with sequence-only methods is that
a realistic candidate for threading must lack similarity to
sequences with known structures (otherwise a good structural model could be obtained using homologous sequence
alignments). Because sequence similarity is operationally
defined by conventional substitution matrices, these will
fail to work on any sequence that is deemed suitable for
threading. This does not mean that substitution matrices
are inherently inadequate for alignment of such dissimilar
sequences, just that those designed to detect homology are
wholly inappropriate. Though unlikely, it remains possible
that much of what threading methods detect can also be
found by an amino acid substitution matrix based on an
appropriate data set, such as dissimilar sequences with
known structures that are superimposable in space.
Scoring matrices are well suited for evaluations because
they are simple to change in automated alignment applications. Unfortunately, more complex threading systems are
not fully automated and are difficult to evaluate rigorously.
This situation fueled a 'blind' structure-prediction competition, in which participants proposed structural models for
sequences whose structures had been determined, but not
yet released. An entire issue of Proteins: Structure, Function
and Genetics is devoted to the results (for a description
of the competition and the organization of the issue,
see [49]). Although very interesting and informative, a
competition such as this is no substitute for rigorous
evaluation of the type highlighted in this review. For
example, participants were allowed to choose the proteins
whose structure they would like to predict. With respect
to sequence-structure alignment methods [50], it appears
that none of the participants used purely automated
methodology. Hopefully, future competitions could be
carried out by submitting sequences to participants'
automated internet servers, allowing for full objectivity.
Automation would also encourage the wider use of
successful methods; biologists have become accustomed to
the automation of sequence-analysis tasks, a trend that is
accelerating with the expanding use of the internet, with
the result that methods requiring manual intervention and
special expertise are likely to fall by the wayside regardless
of merit.
The competition demonstrated that threading a sequence
through a structure is very challenging. Nine teams
submitted a total of 44 predictions on 11 proteins that
turned out to have known structural homologs. With
more than 200 unrelated structures to choose from, the
fact that there were 14 correct predictions indicates
some degree of success. However, only one of these 14
predictions also correctly aligned all of the secondary
structural elements with the threaded sequence [50].
This is worrisome because reasonably correct alignments
are needed for predictions to be useful, such as in
site-directed mutagenesis studies. For the 'conventional'
threading methods, the evaluators thought that further
progress will require an increasingly detailed description
of environments, including local and non-local interactions
along the sequence [50]. Two teams that followed very
different strategies achieved comparable success. Both
aligned the query sequence with database homologs and
used their resulting multiple sequence alignments to
predict secondary structures. The predicted secondary
structure strings were then used to search secondary
structure strings obtained from the structural database.
Hubbard and Park [51•] added an additional step: they
constructed a PSSM from the multiple alignment and
searched the structural database sequences. This combination of secondary structure prediction with PSSM scoring
has potential for full automation. Excellent secondary
structure predictions can be obtained automatically from
multiple alignments using substitution matrices specific
for helices, sheets and coils [52•].
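As a rough illustration of how PSSM scoring and secondary-structure string comparison might be combined in an automated pipeline, the Python sketch below scores a sequence against a toy PSSM without gaps and measures the agreement between two secondary-structure strings. It is not the procedure used by any of the competition teams; the function names, the toy scores and the default score for unlisted residues are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A toy PSSM: one dict of residue -> score per alignment column.
# Real PSSMs hold scores for all 20 residues in every column.
Pssm = List[Dict[str, float]]


def pssm_window_score(pssm: Pssm, segment: str, default: float = -1.0) -> float:
    """Sum the column scores for a segment as long as the PSSM.

    Residues missing from a column receive `default` (an assumed penalty).
    """
    return sum(col.get(res, default) for col, res in zip(pssm, segment))


def best_pssm_hit(pssm: Pssm, sequence: str, default: float = -1.0) -> Tuple[float, int]:
    """Slide the PSSM along the sequence without gaps; return (score, offset)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    hits = [
        (pssm_window_score(pssm, sequence[i:i + width], default), i)
        for i in range(len(sequence) - width + 1)
    ]
    # max() returns the first window with the highest score.
    return max(hits, key=lambda hit: hit[0])


def ss_agreement(predicted: str, observed: str) -> float:
    """Fraction of positions at which two equal-length H/E/C strings agree."""
    if len(predicted) != len(observed):
        raise ValueError("secondary-structure strings must have equal length")
    if not predicted:
        return 0.0
    return sum(p == o for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical three-column PSSM and query sequence.
toy_pssm = [{"A": 2.0, "G": 1.0}, {"C": 3.0}, {"D": 2.0, "E": 1.0}]
score, offset = best_pssm_hit(toy_pssm, "GGACDAA")
agreement = ss_agreement("HHHEEC", "HHHECC")
```

A candidate structure could then be ranked by both quantities, for example by requiring a minimum secondary-structure agreement before comparing PSSM scores; how the two should be weighted is left open here.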
Supplementing scoring matrices with
additional information
Combining sequence information and secondary structure
information is one way that uncertain pairwise alignments
can be improved [53]. Structural information represented
as 3D-1D environments can also be combined with
PSSMs to improve database searches [54]. Although it
is assumed that these methods contribute information
beyond what can be obtained from sequences alone,
treating sequences as strings of amino acid properties has
likewise been successful in detecting homologies when
conventional substitution matrices fail [55].
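The idea of treating a sequence as a string of amino acid properties can be sketched as follows. The four-class grouping used here is an arbitrary choice made for illustration and is not the property set of the cited method; the function names are likewise hypothetical.

```python
# Illustrative four-class grouping of the 20 amino acids. The classes are
# an assumption of this sketch, not those of any published method.
PROPERTY_CLASSES = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}
RESIDUE_TO_CLASS = {
    res: label for label, group in PROPERTY_CLASSES.items() for res in group
}


def encode_properties(sequence: str) -> str:
    """Replace each residue by its property class ('x' if unrecognized)."""
    return "".join(RESIDUE_TO_CLASS.get(res, "x") for res in sequence.upper())


def fraction_identical(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical segments with no identical residues.
residue_identity = fraction_identical("LKDA", "IRES")
property_identity = fraction_identical(
    encode_properties("LKDA"), encode_properties("IRES")
)
```

In this toy case the two segments share no residues but agree in three of four property classes, which is the kind of signal a property-based comparison is meant to expose.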
Alternative substitution matrices might also play a role in
improving alignments. Two new substitution matrices for
the alignment of transmembrane regions should be useful
when conventional matrices fail [56,57]. A 400 x 400 matrix
of dipeptide substitutions is unlikely to be useful because
it is far too 'contrasty': only 54% of possible dipeptide
substitutions are represented among 80200 observed
dipeptide pairs [58]. However, an approach that scores
clusters of residues shows considerable promise [59•].
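The size of a dipeptide substitution matrix is easy to check by hand. The short Python sketch below counts the entries; it assumes the matrix is symmetric, so that each unordered pair of dipeptides is counted once. Under that assumption the count of distinct entries comes to 80200, the same figure quoted above, although whether the quoted figure refers to this count is not certain from the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def symmetric_entries(n: int) -> int:
    """Distinct entries in a symmetric n x n matrix, diagonal included."""
    return n * (n + 1) // 2


n_dipeptides = len(AMINO_ACIDS) ** 2                # 20 * 20 = 400 dipeptides
n_ordered = n_dipeptides ** 2                       # 160000 ordered pairs
n_unordered = symmetric_entries(n_dipeptides)       # 400 * 401 / 2 = 80200

# If 54% of the unordered pairs are observed, the rest carry no data.
fraction_observed = 0.54
n_unobserved = round((1 - fraction_observed) * n_unordered)
```

With roughly 46% of entries unobserved, tens of thousands of scores would have to be assigned without any supporting data, which is consistent with the description of the matrix as too 'contrasty'.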
6.
Dayhoff MO, Eck RV (Eds): Atlas of Protein Sequence and
Structure, vol 3. Silver Spring: National Biomedical Research
Foundation; 1968.
7.
Altschul SF: Amino acid substitution matrices from
an information theoretic perspective. J Mol Biol 1991,
219:555-565.
8.
Grantham R: Amino acid difference formula to help explain
protein evolution. Science 1974, 185:862-864.
9.
Miyata T, Miyazawa S, Yasunaga T: Two types of amino
acid substitutions in protein evolution. J Mol Evol 1979,
12:219-236.
10.
Feng DF, Johnson MS, Doolittle RF: Aligning amino acid
sequences: comparison of commonly used methods. J Mol
Evol 1985, 21:112-125.
11.
Rao JKM: New scoring matrix for amino acid residue
exchanges based on residue characteristic physical
parameters. Int J Pept Protein Res 1987, 29:276-281.
12.
Miyazawa S, Jernigan RL: A new substitution matrix for protein
sequence searches based on contact frequencies in protein
structures. Protein Eng 1993, 6:267-278.
13.
Henikoff S, Henikoff JG: Performance evaluation of amino acid
substitution matrices. Proteins 1993, 17:49-61.
14.
Jones DT, Taylor WR, Thornton JM: The rapid generation of
mutation data matrices from protein sequences. CABIOS
1992, 8:275-282.
15.
Wilbur WJ: On the PAM matrix model of protein evolution. Mol
Biol Evol 1985, 2:434-447.
16.
Benner SA, Cohen MA, Gonnet GH: Amino acid substitution
during functionally constrained divergent evolution of protein
sequences. Protein Eng 1994, 7:1323-1332.
17.
Gonnet GH, Cohen MA, Benner SA: Exhaustive matching
of the entire protein sequence database. Science 1992,
256:1443-1445.
18.
Henikoff S, Henikoff JG: Amino acid substitution matrices
from protein blocks. Proc Natl Acad Sci USA 1992,
89:10915-10919.
19.
Henikoff S, Henikoff JG: Automated assembly of protein
blocks for database searching. Nucleic Acids Res 1991,
19:6565-6572.
20.
Risler JL, Delorme MO, Delacroix H, Henaut A: Amino acid
substitutions in structurally related proteins. A pattern
recognition approach. Determination of a new and efficient
scoring matrix. J Mol Biol 1988, 204:1019-1029.
21.
Overington J, Donnelly D, Johnson MS, Sali A, Blundell TL:
Environment-specific amino acid substitution tables: tertiary
templates and prediction of protein folds. Protein Sci 1992,
1:216-226.
22.
Johnson MS, Overington JP: A structural basis for sequence
comparisons. An evaluation of scoring methodologies. J Mol
Biol 1993, 233:716-738.
Conclusions
Recent improvements in substitution matrices and PSSMs
have dramatically improved alignment-based procedures.
We know this because of comprehensive evaluation
studies which also reveal areas in which improvement is
possible. General pairwise substitution scores might be
about as good as they need to be, given that gap penalties
limit alignment quality. However, multiple alignments
provide position-specific information that can overcome
this limitation, potentially allowing for the extraction of
protein structural features. With the expansion of protein
families to include more diverse sequences, such as
those from whole bacterial genomes [60], these multiple
alignment-based methods will become applicable to more
protein families.
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
23.
••
Pearson WR: Comparison of methods for searching protein
sequence databases. Protein Sci 1995, 4:1145-1160.
Local alignment searching algorithms were tested using several different
substitution matrices and gap penalty combinations. Search sensitivity is significantly improved using modern matrices, such as BLOSUM 45-55, with
optimized gap penalties. The best overall performance was obtained with
the Smith-Waterman algorithm and log-length normalization of alignment
scores.
Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary
change in proteins. In Atlas of Protein Sequence and Structure,
vol 5, suppl 3. Edited by Dayhoff M. Washington, DC: National
Biomedical Research Foundation; 1978:345-358.
24.
••
Vogt G, Etzold T, Argos P: An assessment of amino acid
exchange matrices in aligning protein sequences: the twilight
zone revisited. J Mol Biol 1995, 249:816-831.
Eighty substitution matrices were evaluated using alignment accuracy as
a criterion. Several modern log-odds substitution matrices performed well
when modified for global alignment and paired with optimized gap penalties.
2.
Bork P, Ouzounis C, Sander C: From genome sequences to
protein function. Curr Opin Struct Biol 1994, 4:393-403.
3.
Brenner S: Phosphotransferase sequence homology. Nature
1987, 329:21.
4.
Bairoch A: PROSITE: a dictionary of sites and patterns in
proteins. Nucleic Acids Res 1992, 20:2013-2018.
5.
Gribskov M, McLachlan AD, Eisenberg D: Profile analysis:
detection of distantly related proteins. Proc Natl Acad Sci USA
1987, 84:4355-4358.
25.
Taylor WR: Motif-biased protein sequence alignment. J Comput
Biol 1994, 1:297-310.
The extreme sensitivity of global alignments to gap penalty choice can be
reduced by scoring runs of matches higher than scattered matches.
26.
Mott R: Maximum-likelihood estimation of the statistical
distribution of Smith-Waterman local sequence similarity
scores. Bull Math Biol 1992, 54:59-75.
27.
Vingron M, Waterman MS: Sequence alignment and penalty
choice: review of concepts, case studies and implications.
J Mol Biol 1994, 235:1-12.
28.
Bowie JU, Luthy R, Eisenberg D: A method to identify protein
sequences that fold into a known three-dimensional structure.
Science 1991, 253:164-170.
29.
•
Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden
Markov models in computational biology. J Mol Biol 1994,
235:1501-1531.
An iterative method for multiple sequence alignment is described. PSSM
parameters, such as gap penalties, are determined in the course of aligning
sequences.
30.
Baldi P, Chauvin Y, Hunkapiller T, McClure MA: Hidden Markov
models of biological primary sequence information. Proc Natl
Acad Sci USA 1994, 91:1059-1063.
31.
Eddy SR, Mitchison G, Durbin R: Maximum discrimination
hidden Markov models of sequence consensus. J Comput Biol
1995, 2:9-24.
45.
Taylor WR: An investigation of conservation-biased gap-penalties
for multiple protein sequence alignment. Gene 1995,
165:GC27-GC35.
46.
Altschul SF: A protein alignment scoring system sensitive at all
evolutionary distances. J Mol Evol 1993, 36:290-300.
47.
Luthy R, Bowie JU, Eisenberg D: Assessment of protein models
with three-dimensional profiles. Nature 1992, 356:83-85.
48.
•
Bryant SH, Altschul SF: Statistics of sequence-structure
threading. Curr Opin Struct Biol 1995, 5:236-244.
A critical review that emphasizes the need for statistical interpretation of
threading scores of the type now routinely used for assessing sequence
alignments.
49.
Moult J, Pedersen JT, Judson R, Fidelis K: A large-scale
experiment to assess protein structure prediction methods.
Proteins 1995, 23:ii-iv.
50.
Lemer CM-R, Rooman MJ, Wodak SJ: Protein structure
prediction by threading methods: evaluation of current
techniques. Proteins 1995, 23:337-355.
32.
•
Tatusov RL, Altschul SF, Koonin EV: Detection of conserved
segments in proteins: iterative scanning of sequence
databases with alignment blocks. Proc Natl Acad Sci USA
1994, 91:12091-12095.
PSSMs representing ungapped alignment blocks were iteratively expanded
by searching sequence databases. In this context, different methods for calculating PSSM column weights were evaluated. The pseudo-count method
based on mixtures of Dirichlet components performed best.
33.
Luthy R, Xenarios I, Bucher P: Improving the sensitivity of the
sequence profile method. Protein Sci 1994, 3:139-146.
34.
•
Thompson JD, Higgins DG, Gibson TJ: Improved sensitivity of
profile searches through the use of sequence weights and gap
excision. CABIOS 1994, 10:19-29.
Branch-proportional sequence weights were described and found to improve
the performance of several PSSMs. Improvements were also obtained by
using BLOSUM 62 rather than PAM 250 and by selectively excluding gaps.
35.
Dodd IB, Egan JB: Systematic method for the detection of
potential lambda Cro-like DNA-binding regions in proteins. J
Mol Biol 1987, 194:557-564.
36.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF,
Wootton JC: Detecting subtle sequence signals: a Gibbs
sampling strategy for multiple alignment. Science 1993,
262:208-214.
37.
Claverie J-M: Some useful statistical properties of position-weight
matrices. J Comput Chem 1994, 18:287-293.
38.
Brown MP, Hughey R, Krogh A, Mian IS, Sjolander K, Haussler D:
Using Dirichlet mixture priors to derive hidden Markov models
for protein families. In Proceedings of the First International
Conference on Intelligent Biology. Edited by Hunter L, Searls D,
Shavlik J. Washington DC: AAAI Press; 1993:47-55.
39.
•
Henikoff S, Henikoff JG: Position-based sequence weights.
J Mol Biol 1994, 243:574-578.
A simple method is introduced for weighting sequences that is based on the
diversity of each position. Performance was evaluated using ungapped alignment blocks to construct PSSMs representing 698 protein families. Position-based, Voronoi and branch-proportional weights outperformed other methods.
40.
Sander C, Schneider R: Database of homology-derived protein
structures and the structural meaning of sequence alignment.
Proteins 1991, 9:56-68.
41.
Schneider R, Sander C: The HSSP database of protein
structure-sequence alignments. Nucleic Acids Res 1996,
24:201-205.
42.
Green P: Ancient conserved regions in gene sequences. Curr
Opin Struct Biol 1994, 4:404-412.
43.
Doolittle RF: The multiplicity of domains in proteins. Annu Rev
Biochem 1995, 64:287-314.
44.
•
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res 1994,
22:4673-4680.
A popular progressive method for multiple sequence alignment was substantially improved by the implementation of several new features, including
sequence weights, multiple substitution matrices and residue-specific gap
penalties.
51.
•
Hubbard TJ, Park J: Fold recognition and ab initio structure
predictions using hidden Markov models and beta-strand pair
potentials. Proteins 1995, 23:398-402.
Structure predictions were based on multiply aligning each unknown sequence with available homologs and using the alignment both to predict
secondary structural features and to construct PSSMs. A compatible structure was one with secondary structural features that aligned well with the
predicted features and with a sequence that had a high PSSM score. Two
out of four structures were correctly identified and the other two were near
misses.
52.
•
Heringa J, Argos P: A simple and fast approach to prediction of
protein secondary structure from multiply aligned sequences
with accuracy above 70%. Protein Sci 1995, 4:2517-2525.
Amino acid substitution matrices specific for helix, sheet and coil regions
underlie predictions. The performance on multiply aligned sequences is at
least as good as that obtained by complex machine-learning approaches.
53.
Gracy J, Chiche L, Sallantin J: Improved alignment of weakly
homologous protein sequences using structural information.
Protein Eng 1994, 6:821-829.
54.
Yi TM, Lander ES: Recognition of related proteins by iterative
template refinement (ITR). Protein Sci 1994, 3:1315-1328.
55.
Hobohm U, Sander C: A sequence property approach to
searching protein databases. J Mol Biol 1995, 251:390-399.
56.
Cserzo M, Bernassau J-M, Simon I, Maigret B: New alignment
strategy for transmembrane proteins. J Mol Biol 1994,
243:388-396.
57.
Jones DT, Taylor WR, Thornton JM: A mutation data matrix for
transmembrane proteins. FEBS Lett 1994, 339:269-275.
58.
Gonnet GH, Cohen MA, Benner SA: Analysis of amino acid
substitution during divergent evolution: the 400 by 400
dipeptide substitution matrix. Biochem Biophys Res Commun
1994, 199:489-496.
59.
•
Han KF, Baker D: Recurring local sequence motifs in proteins.
J Mol Biol 1995, 251:176-186.
Cluster analysis identified local sequence motifs shared by multiple protein
families. This higher-order sequence information can potentially be used to
score alignments for the detection of distant relationships.
60.
Nowak R: Bacterial genome sequence bagged. Science 1995,
269:468-470.
61.
Henikoff S, Haughn GW, Calvo JM, Wallace JC: A large family
of bacterial activator proteins. Proc Natl Acad Sci USA 1988,
85:6602-6606.
62.
Viale AM, Kobayashi H, Akazawa T, Henikoff S: rcbR, a gene
coding for a member of the LysR family of transcriptional
regulators, is located upstream of the expressed set of
ribulose 1,5-bisphosphate carboxylase/oxygenase genes in
the photosynthetic bacterium Chromatium vinosum. J Bacteriol
1991, 173:5224-5229.
63.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local
alignment search tool. J Mol Biol 1990, 215:403-410.
64.
Henikoff JG, Henikoff S: Using substitution probabilities to
improve position-specific scoring matrices. CABIOS 1996, in
press.