J. Mol. Biol. (1997) 269, 423–439
Recognition of Analogous and Homologous Protein
Folds: Analysis of Sequence and
Structure Conservation
Robert B. Russell1, Mansoor A. S. Saqi2, Roger A. Sayle2, Paul A. Bates1
and Michael J. E. Sternberg1*
1 Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, Lincoln's Inn Fields, P.O. Box 123, London WC2A 3PX, UK
2 Bioinformatics Group, Glaxo-Wellcome Medicines Research Centre, Gunnels Wood Road, Stevenage, Herts SG1 2NY, UK
An analysis was performed on 335 pairs of structurally aligned proteins
derived from the structural classification of proteins (SCOP http://scop.mrc-lmb.cam.ac.uk/scop/) database. These similarities were divided into analogues, defined as proteins with similar three-dimensional structures (same
SCOP fold classification) but generally with different functions and little
evidence of a common ancestor (different SCOP superfamily classification).
Homologues were defined as pairs of similar structures likely to be the
result of evolutionary divergence (same superfamily) and were divided
into remote, medium and close sub-divisions based on the percentage
sequence identity. Particular attention was paid to the differences between
analogues and remote homologues, since both types of similarities are generally undetectable by sequence comparison and their detection is the aim
of fold recognition methods. Distributions of sequence identities and substitution matrices suggest a higher degree of sequence similarity in remote
homologues than in analogues. Matrices for remote homologues show
similarity to existing mutation matrices, providing some validity for their
use in previously described fold recognition methods. In contrast, matrices
derived from analogous proteins show little conservation of amino acid
properties beyond broad conservation of hydrophobic or polar character.
Secondary structure and accessibility were more conserved on average in
remote homologues than in analogues, though there was no apparent
difference in the root-mean-square deviation between these two types of
similarities. Alignments of remote homologues and analogues show a similar number of gaps, openings (one or more sequential gaps) and inserted/
deleted secondary structure elements, and both generally contain more
gaps/openings/deleted secondary structure elements than medium and
close homologues. These results suggest that gap parameters for fold recognition should be more lenient than those used in sequence comparison. Parameters were derived from the analogue and remote homologue datasets
for potential use in fold recognition methods. Implications for protein fold
recognition and evolution are discussed.
© 1997 Academic Press Limited
*Corresponding author
Keywords: protein structure prediction; fold recognition; protein evolution;
sequence identity; substitution matrices
Introduction
Present address: R. B. Russell, SmithKline Beecham
Pharmaceuticals, Research & Development,
Bioinformatics, New Frontiers Science Park, Third
Avenue, Harlow, Essex, CM19 5AW, UK.
Abbreviations used: 3D, three-dimensional; URL,
universal resource locator; RMS, root-mean-square;
WWW, world wide web; PDB, Brookhaven Protein Data
Bank. The standard one-letter and three-letter
abbreviations for the amino acids are used throughout.
0022–2836/97/230423–17 $25.00/0/mb971019
With over 4000 determined protein three-dimensional (3D) structures, there are now many
examples of proteins sharing a common fold (Orengo et al., 1994; Holm & Sander, 1996a). These
similarities can be broadly classified into three
types (e.g. see Murzin et al., 1995; Russell &
Barton, 1994). Similarities of the first type (A) are
those where a common 3D structure is accompanied by a significant sequence similarity (i.e.
clear homologues). The remaining examples share
a common structure, but without significant sequence similarity, and can be divided further into
remote homologues (type B) and analogues (type
C). Remote homologues are those similarities
where divergence from a common ancestor (i.e.
homology) is inferred from the similarity of structure and other common features, such as functional
residues, or unusual structural features. Analogues
are different in kind as they do not have these
strong signals suggesting divergence. Such structural similarities may well be the result of convergence to a favourable fold, although a very distant
divergence cannot be excluded.
Several analyses of protein structure similarities
have concentrated on particular homologous
families (e.g. see Lesk & Chothia, 1980, 1982;
Chothia & Lesk, 1982; Martin, 1995; Ollis et al.,
1992) or analogous folds (e.g. see Murzin et al.,
1992; Hazes & Hol, 1992; Pickett et al., 1992;
Laurents et al., 1994). More general analysis identified principles of structural variation in relation to
sequence similarity (Chothia & Lesk, 1986;
Hubbard & Blundell, 1987). The increase in the
number of known structures has allowed more
recent studies to probe this relationship in greater
depth. Pascarella & Argos (1992) analysed insertions and deletions (indels) within homologous
protein structure families. They noted a general
(logarithmic) increase in the frequency of indels
with decreasing sequence identity and the rarity of
indels within secondary structure elements. Flores
et al. (1993) analysed 92 pairs of similar protein
structures with a range of percentage sequence
identity and quantified the conservation of residue
accessibility, secondary structure and side-chain
torsion angles. All properties showed a decrease in
conservation with decreasing sequence identity.
Russell & Barton (1994) analysed conservation of
accessibility, secondary structure and side-chain to
side-chain contacts within 604 pairs of structurally
similar proteins and found that conservation of
these properties was often comparable between
pairs of proteins with similar 3D structures and
low (<10%) sequence identities and pairs of dissimilar structures, casting some doubt on the
validity of many assumptions used during fold recognition. Both Flores et al. (1993) and Russell &
Barton (1994) found little difference in conservation
of these properties between homologous and analogous protein structure similarities. Recently,
Chung & Subbiah (1996) analysed side-chain burial
and χ2 angles and found substantial differences between remotely homologous proteins, suggesting
that pairs of proteins within the twilight zone of
sequence homology adopt different side-chain
interactions.
During the last two years, several classifications
of protein structures have become generally available. Most notable amongst these are the SCOP
(Murzin et al., 1995), CATH (Orengo et al., 1994)
and FSSP (Holm & Sander, 1996b) databases.
Recognition of Protein Folds
Within the latter two, proteins are grouped mostly
according to the results of structure comparison
algorithms, and structural families tend to reflect
structural similarity and not necessarily functional
and/or evolutionary relationships. In contrast, the
SCOP database, being constructed manually, and
by a careful consideration of the literature and protein function, provides the best division of protein
structural similarities into homologies and analogies to date.
Much previous research has concentrated on the
derivation of protein substitution matrices mostly
with the aim of increasing the sensitivity of protein
sequence database searches. The classic work of
Dayhoff et al. (1978) quantified substitutions within
a few families of closely homologous protein
sequences and combined these data with mutation
rates to derive point accepted mutation (PAM)
matrices, now widely used in sequence alignment
methods. With the increase in database size, numerous groups have extended the work of Dayhoff
by including many more sequence alignments and
produced improved matrices for sequence database searching (Henikoff & Henikoff, 1992; Gonnet
et al., 1992; Jones et al., 1992). Blundell and coworkers derived similar matrices by considering
alignments of proteins of known 3D structure. Several environment-specific matrices were derived to
allow for searches and alignments with 3D structure templates, and used in several applications
(Overington et al., 1990; Johnson et al., 1993).
Recently, several approaches for protein fold recognition have been developed that use a variety of
criteria to assess how well a sequence for a protein
of unknown structure fits on to a library of protein
structures (e.g. see Lemer et al., 1995). The criteria
have included secondary structure, accessibility
and residue pair preferences. The aim is to detect
analogues and remote homologues, although it has
not been evaluated whether there are different
levels of success for these two classes. Many recent
methods have also included substitution matrix
data and/or secondary structure predictions (Johnson et al., 1993; Russell et al., 1995; Rost, 1995;
Fischer & Eisenberg, 1996; Bates et al., 1996; Russell
et al., 1996). All of these methods assume, to a certain degree, conservation of these properties, across
proteins with little or no sequence similarity, and
(for those using substitution matrices) a degree of
residue property conservation. In order to assess
the validity of these assumptions, the conservation of these features must be studied within
known examples of analogues and remote homologues.
The current larger protein structure database
and the SCOP (Murzin et al., 1995) database now
provide, for the first time, sufficient data to try
and identify differences between analogous and
homologous proteins. In particular, it is only
recently that sufficient data have been available to
allow the derivation of mutation data matrices for
analogous proteins. The dual aims of this work are
thus to try and discern differences between remote
homologies and analogies and to provide insights
to improve fold recognition methods. These interrelated aims are pursued by analysing non-redundant datasets of analogous and homologous
proteins derived from the SCOP database.
Construction of mutation matrices, investigation of
insertions/deletions and analysis of residue, secondary structure, accessibility and prediction conservation will be performed with these two aims in
mind.
Results
The selection of pairs of structurally
similar proteins
The structural classification of proteins (SCOP)
database (Murzin et al., 1995) was used as a resource for structural similarities. Within SCOP
similarities between protein 3D structures are organised into a hierarchy that distinguishes between
homologous and analogous structures. Proteins
within the same SCOP superfamily are thought to
be homologous (share a common evolutionary origin), often in the absence of sequence similarity,
due to an obvious functional similarity, or the presence of common features unlikely to have arisen
by chance. Some superfamily classifications are
aided by the observation of a low but significant
sequence identity following optimal superimposition of the 3D structures (e.g. see Murzin, 1993a,b;
Martin et al., 1993). Proteins within different superfamilies, but within the same fold group, still
adopt a similar 3D structure but lack the strong
features that suggest divergence from a common
ancestor. In this study, these types of similarities
are considered as analogous proteins. Although
the distinction is difficult, the SCOP database,
constructed by experts after careful analysis and
consultation with the literature, provides the best
possible definition and classification of such similarities at present.
An example of the types of structural similarity
considered here is shown in Figure 1. The structure
on the left of the Figure, an OB fold found in
enterotoxin (PDB code 1lts-d), is similar to the
three structures shown to its right. The first structure, cholera toxin (1chp-d) has a high level of sequence similarity (80%) to enterotoxin and most of
the residues in the two proteins (98/103) can be
Figure 1. Example of the types of similarity considered in this study. The protein on the left (enterotoxin, PDB code
1lts-d) is similar in 3D structure to all three of the proteins on the right. The first similarity, with cholera toxin (1chp-d), is a close homology, since structure/function similarity is accompanied by a high level of sequence identity (80%).
The second similarity, with toxic shock syndrome toxin (TSS; 1tss) is accompanied by a functional similarity, but no
significant sequence similarity (8.8%) and is thus classified as a remote homologue. The third similarity, with a
domain from aminoacyl tRNA synthetase (1krs) is accompanied neither by any significant sequence similarity (4.4%)
nor any functional similarity, thus classifying it as an analogue. More details are given in the text. Darker coloured
regions in the three structures on the right show those regions that are structurally equivalent with enterotoxin (far
left).
superimposed with an RMS of 0.6 Å (Cα atoms).
The second structure, toxic shock syndrome toxin
(TSS; 1tss-a-1) shows no significant sequence
similarity with enterotoxin (8.8%) and fewer residues (35/93) can be superimposed with a higher
RMS (2.4 Å). However, these proteins are thought
to share a common ancestor because of their
functional similarity, both being bacterial pathogen toxins. The third structure is a domain from
aminoacyl tRNA synthetase (1krs), and also
shows no significant sequence similarity to enterotoxin (4.4%). Unlike TSS, however, this domain
does not show a functional/evolutionary similarity to enterotoxin, and is thus classified (in
SCOP and this study) as an analogue. The number of structurally equivalent residues between
this analogue and enterotoxin (41/103) and the
RMS deviation (2.2 Å) suggest a structural similarity slightly greater than that between enterotoxin and the remote homologue TSS.
Sequence similarity
Percentage sequence identity (%I) is a widely
used means of assessing the residue conservation
between two proteins. In this study, %I was calculated in two ways. For the first calculation, the
number of positions in each pair containing the
same amino acid was expressed as a percentage of
the length of the shortest sequence. This is probably the most widely used calculation (e.g. see
Flores et al., 1993; Russell & Barton, 1994), and is
the value often derived from sequence alignments.
The second calculation considers only equivalenced
Aligned versus equivalenced positions in
structural alignments
There is an inherent difficulty when comparing
and analysing alignments derived from protein
3D structure comparison. Although most comparison methods provide an alignment over the
whole length of both protein sequences, they
also often provide a means to assess the significance of aligned positions, since 3D structural
similarities with low levels of sequence identity
often contain regions that are not equivalent
(e.g. loops connecting core secondary structure
elements that serve different functions). Given
such alignments, one is faced with a choice
between considering all aligned positions (i.e. those
not involving gaps) or only those positions
deemed structurally equivalent by the structure
comparison algorithm.
For analysis it is best to consider only equivalent
positions, since alignments involving regions of
different 3D structure are likely to be meaningless.
However, if the aim is to improve methods of
sequence alignment or fold recognition, where 3D
structural information is missing for one or both
sequences in the alignment, then it is probably
best to consider all aligned positions, since in the
absence of 3D structures one cannot know a priori
where the equivalent regions are. Since the 3D
structural alignment is likely the best possible
answer, it is wise to consider all aligned positions.
Here, we considered both sets of positions.
Aligned positions are considered to provide insights into methods of sequence comparison and
fold recognition and structurally equivalent positions are considered to attempt to discern between homologous and analogous folds. These
two datasets will be referred to as aligned and
equivalenced in the sections that follow. Equivalences between pairs of similar 3D structures
were defined according to Russell & Barton
(1992).
Figure 2. Histograms showing the distribution of
sequence identities in analogues and homologues in the
datasets described in the text. The first histograms (a)
show sequence identities calculated by dividing the
number of identical amino acid residues in the structural
alignment by the minimum sequence length. The second
histograms (b) show those calculated by only considering structurally equivalent positions. Regions in black
for the homologues show those alignments comprising
undetectable sequence similarities as described in the
text.
and for the undetectable homologues the value is
x̄_UH = 14.5 (see Table 1). If one considers the subset
with Z ≤ 4.0, the results change only slightly
(Table 1).
Although the difference between the means for
analogues and remote homologues is significant
at 0.5% for both calculations (t = 4.7 or 3.9, one-tailed t test), it is more pronounced in the calculation based on equivalent positions (Figure 2)
with a slight increase in the separation (from 2.9
to 3.4). This is perhaps not surprising, since remote homologues are often observed to show
some sequence identity following structure superimposition.
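The comparison of two group means can be sketched as follows. This is a minimal illustration that assumes a pooled-variance (Student) two-sample test, which the text does not specify; the function name `pooled_t_statistic` and the example numbers are hypothetical and are not taken from Table 1.

```python
import math


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic from means, standard deviations and group sizes.

    Assumes equal variances in the two groups (an assumption of this sketch).
    A positive value means that group b has the larger mean.
    """
    degrees_of_freedom = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    ) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    return (mean_b - mean_a) / standard_error


# Hypothetical groups with the same spread and means two units apart.
t_value = pooled_t_statistic(10.0, 2.0, 50, 12.0, 2.0, 50)
```

For a one-tailed test, the statistic would then be compared with the critical value of the t distribution for n_a + n_b - 2 degrees of freedom.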
Although analogues have a range of equivalenced %I between 0 and 25%, 53% have
%I < 11%. In contrast, only 23% of the undetectable
homologues have %I ≤ 10%, suggesting that this
measure of %I may help in distinguishing between
homology and analogy. To our knowledge this is
the first time that a difference between analogues
and remote homologues has been found in terms
of trends in sequence identities.
For the subsequent analysis, the second method
of calculating %I was used to divide the 335 pairs
that were extracted from the SCOP database (see
positions. Although not immediately comparable
to sequence identities calculated in the absence of
structural data, these identities aim to quantify the
frequent observation of residue conservation for
some remote homologues within the core secondary structure elements (e.g. see Murzin, 1993a,b;
Martin et al., 1993).
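The two identity calculations can be sketched as below. This is an illustrative sketch only: the function name, the use of '-' for gaps and the boolean mask marking structurally equivalent columns are assumptions, as is the choice of the number of equivalenced columns as the denominator for the second measure.

```python
def percent_identity(aligned_a, aligned_b, equivalenced=None):
    """Percentage identity for two aligned sequences ('-' marks a gap).

    Without a mask: identical positions / length of the shorter sequence.
    With a mask: identical positions / number of equivalenced columns
    (the denominator is an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))

    if equivalenced is None:
        identical = sum(1 for a, b in columns if a == b and a != "-")
        shortest = min(
            len(aligned_a.replace("-", "")), len(aligned_b.replace("-", ""))
        )
        return 100.0 * identical / shortest

    # Keep only ungapped columns flagged as structurally equivalent.
    kept = [
        (a, b)
        for (a, b), is_equivalent in zip(columns, equivalenced)
        if is_equivalent and a != "-" and b != "-"
    ]
    if not kept:
        return 0.0
    return 100.0 * sum(1 for a, b in kept if a == b) / len(kept)


seq_a = "ACDE-FG"
seq_b = "ACNEQ-G"
mask = [True, True, True, True, False, False, False]
conventional = percent_identity(seq_a, seq_b)
equivalenced_only = percent_identity(seq_a, seq_b, mask)
```

In this toy alignment four of six residues are identical overall, while three of the four equivalenced columns are identical, so the second measure is higher.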
Histograms showing the distribution of these
two sequence identities for analogues and homologues are shown in Figure 2. For this analysis, a
sub-set of homologues was defined as those where
a randomised Z-score, calculated from sequence
alignments, was <6.0. Z-scores of ≥6.0 usually correspond to detectable similarities (Barton &
Sternberg, 1987). This subset is referred to as undetectable homologues and is shown in dark in
Figure 2. Analogues show Z-scores of ≤4.0, so an
additional set of homologues with Z ≤ 4.0 was
also considered in the sections that follow.
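A randomised Z-score of this general kind can be sketched as follows. The sketch is not the procedure of Barton & Sternberg (1987): the scoring function `identity_score` is a toy stand-in for a real alignment score, and the number of shuffles and the seed are arbitrary assumptions.

```python
import random
import statistics


def identity_score(seq_a, seq_b):
    """Toy stand-in for an alignment score: identical residues at the same index."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


def randomised_z_score(seq_a, seq_b, score_fn=identity_score,
                       n_shuffles=100, seed=0):
    """Z = (real score - mean of shuffled scores) / s.d. of shuffled scores."""
    rng = random.Random(seed)
    real_score = score_fn(seq_a, seq_b)
    residues = list(seq_b)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps the composition, destroys the order
        random_scores.append(score_fn(seq_a, "".join(residues)))
    spread = statistics.pstdev(random_scores)
    if spread == 0:
        # Z is undefined without spread; 0.0 is a neutral choice of this sketch.
        return 0.0
    return (real_score - statistics.mean(random_scores)) / spread


sequence = "ACDEFGHIKLMNPQRSTVWY" * 2
z = randomised_z_score(sequence, sequence)
```

A sequence compared with itself scores far above its shuffled versions, so the resulting Z-score is large; unrelated sequences would give values near zero.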
Both histograms in Figure 2 show a difference in
the distribution of sequence identities between analogues and undetectable homologues. Considering
conventional sequence identities, for analogues the
mean is x̄_A = 8.6; for undetectable homologues the
value is x̄_UH = 11.5. Considering this second
measure of %I, for analogues the mean is x̄_A = 11.1
Table 1. Summary of details discussed in the text
Analogues
x
s
Sequence identity
(aligned)
Sequence identity
(equivalenced)
RMS equivalenced
Secondary structure
(aligned)
Secondary structure
(equivalenced)
Accessibility
(aligned)
Accessibility
(equivalenced)
Predicted SS
(aligned)
Predicted SS
(equivalenced)
Predicted accessibility
(aligned)
Predicted accessibility
(equivalenced)
SS P1K2
(aligned)
SS P1K2
(equivalenced)
ACC P1K2
(aligned)
ACC P1K2
(equivalenced)
Openings/100 residues
Gaps/100 residues
Equivalenced SS
Aligned SS
x
Remote
a
Test
tA/RH
s
a
Medium
s
x
0.4
10.1
4.7
3.8
3.9
3.5
1.4
2.7
1.3
87.6
0.3
8.9
0.7
94.7
0.4
4.2
81.4
10.4
84.8
7.5
2.0
90.8
7.3
95.0
4.0
49.2
8.8
55.9
8.5
4.4
66.9
10.2
80.4
9.4
56.7
11.2
64.9
8.9
4.4
70.0
10.2
80.9
9.6
60.0
10.6
60.8
11.2
84.2
10.2
92.8
5.3
65.1
14.0
66.5
12.0
84.7
10.9
93.2
5.0
48.3
5.2
53.7
7.7
77.8
12.2
91.5
6.7
54.9
8.3
59.4
8.4
79.7
11.7
92.3
5.7
63.0
11.0
65.2
10.7
74.2
91
76.5
8.4
71.0
13.6
71.8
11.6
74.7
9.9
76.3
8.3
44.9
6.7
49.5
7.1
55.3
7.8
56.8
6.4
56.0
8.3
54.8
9.0
56.2
9.0
57.3
6.5
3.9
19.8
73.9
89.2
2.5
18.1
14.8
9.1
3.3
16.8
75.4
91.6
2.2
16.1
15.3
7.9
1.6
4.5
94.3
97.9
1.4
5.8
6.6
3.5
0.3
0.9
99.0
99.7
0.7
2.4
3.6
1.3
11.1
5.9
2.1
70.6
1.4
1.0
0.6
1.6
±
±
±
s
4.3
3.9b
5.2a
5.0b
0.4
8.8
4.2
±
Close
12.1
11.4b
15.0a
14.7b
2.0
75.3
8.6
SS, secondary structure; ACC, accessibility.
a Calculated for undetectable homologues where Z < 6.0.
b Calculated for undetectable homologues where Z < 4.0 (see the text).
Methods and Data). These were divided into four
categories for subsequent analysis:
Category             %I range           Pairs
Analogues                                  46
Remote homologues    0 < %I ≤ 25           94
Medium homologues    25 < %I ≤ 50          89
Close homologues     50 < %I ≤ 100        106
The definition of remote homologues as %I < 25%
was used since %I is a widely accepted measure
that is easy to evaluate and is a good approximation to undetectable homologues based on randomised Z-scores (see Figure 2). Only 11 of 94 of
the remote homologues under this definition had
Z-scores ≥6.0.
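The four categories can be expressed as a simple rule. The sketch below is illustrative: the function name and arguments are hypothetical, and it treats the SCOP superfamily assignment as a given boolean rather than deriving it.

```python
def classify_pair(same_superfamily, percent_identity):
    """Assign a structurally similar pair (same SCOP fold) to a category.

    percent_identity is the %I calculated over equivalenced positions.
    """
    if not same_superfamily:
        return "analogue"
    if percent_identity <= 25:
        return "remote homologue"
    if percent_identity <= 50:
        return "medium homologue"
    return "close homologue"


# The three similarities of Figure 1, using the identities quoted there.
categories = [
    classify_pair(True, 80.0),   # enterotoxin / cholera toxin
    classify_pair(True, 8.8),    # enterotoxin / toxic shock syndrome toxin
    classify_pair(False, 4.4),   # enterotoxin / aminoacyl tRNA synthetase domain
]
```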
Substitution matrices
Substitution matrices were derived from the sets
of alignments described above. For each alignment
position (ignoring positions aligned to gap) in each
alignment in the above classifications, the frequency of all 210 possible substitutions was calculated and a log odds for substitution derived as
follows:

$$S_{XY} = \log\left(\frac{f_{XY}}{f_X f_Y}\right) - L_{ave}$$

where f_XY is the observed frequency of the substitution X–Y (X and Y are amino acids) and f_X, f_Y
are the observed frequencies of occurrence of
amino acids X and Y. L_ave, a normalisation term, is
the average value of log(f_XY/(f_X f_Y)) for all 210 possible substitutions. Matrices were derived using
aligned positions, since this is likely to be of the
greatest use in fold recognition and sequence comparison. Differences between matrices calculated
from equivalenced positions are discussed below.
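The log odds calculation can be sketched as below. This is a minimal illustration under stated assumptions: the natural logarithm, the pseudocount used to avoid taking the logarithm of zero, and the names `log_odds_matrix` and `aligned_pairs` are all choices of the sketch, not details given in the text. Input residues are assumed to be among the 20 standard one-letter codes.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def log_odds_matrix(aligned_pairs, pseudocount=1.0):
    """Return {(X, Y): S_XY} for the 210 unordered amino acid pairs.

    aligned_pairs: iterable of (residue_a, residue_b) from ungapped
    alignment positions. The pseudocount is an assumption of this sketch.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for x, y in aligned_pairs:
        pair_counts[tuple(sorted((x, y)))] += 1  # X-Y and Y-X are one substitution
        residue_counts[x] += 1
        residue_counts[y] += 1

    keys = list(combinations_with_replacement(sorted(AMINO_ACIDS), 2))
    total_pairs = sum(pair_counts.values()) + pseudocount * len(keys)
    total_residues = sum(residue_counts.values()) + pseudocount * len(AMINO_ACIDS)

    raw = {}
    for x, y in keys:
        f_xy = (pair_counts[(x, y)] + pseudocount) / total_pairs
        f_x = (residue_counts[x] + pseudocount) / total_residues
        f_y = (residue_counts[y] + pseudocount) / total_residues
        raw[(x, y)] = math.log(f_xy / (f_x * f_y))

    # L_ave: the average raw log odds over all 210 substitutions.
    l_ave = sum(raw.values()) / len(raw)
    return {pair: value - l_ave for pair, value in raw.items()}


pairs = [("A", "A")] * 50 + [("L", "I")] * 30 + [("A", "L")] * 5
matrix = log_odds_matrix(pairs)
```

Subtracting L_ave centres the scores so that they average to zero over the 210 entries, which makes matrices derived from different datasets easier to compare.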
Matrices for the analogues, remote homologues
and a combined (analogue-remote homologue)
matrix are shown in Table 2. Complete linkage
cluster analysis was used to build tree diagrams
for each of these matrices, those for the medium
and close homologues and for the more traditional
PAM250 (Dayhoff et al., 1978), BLOSUM62
(Henikoff & Henikoff, 1992) and GONNET (Gonnet et al., 1992) substitution matrices. These diagrams are shown in Figure 3. The RMS differences
for all pairwise comparisons of the matrices are
shown in Table 3. This RMS provides a broad
measure of the similarity between the matrices.
The low values for the comparison of traditional
substitution matrices (PAM250, BLOSUM62 and
GONNET) provide a benchmark for comparing the
matrices derived here. RMS values for the comparisons between the new matrices are approximately
twice as large as those for the comparison of traditional matrices with each other, reflecting major
differences.
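An RMS difference between two matrices can be sketched as follows. The text does not give the exact formula or say whether matrices were rescaled first, so this sketch assumes the plain root-mean-square of entry-wise differences over the entries the two matrices share; the function name and the small example matrices are hypothetical.

```python
import math


def matrix_rms_difference(matrix_a, matrix_b):
    """RMS of entry-wise differences over keys present in both matrices."""
    shared = sorted(set(matrix_a) & set(matrix_b))
    if not shared:
        raise ValueError("matrices share no entries")
    squared = [(matrix_a[key] - matrix_b[key]) ** 2 for key in shared]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical three-entry matrices keyed by residue pair.
first = {("A", "A"): 2.0, ("A", "R"): -2.0, ("R", "R"): 6.0}
second = {("A", "A"): 5.0, ("A", "R"): -2.0, ("R", "R"): 2.0}
rms = matrix_rms_difference(first, second)
```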
Tree diagrams for traditional mutation data
matrices (Figure 3(a) to (c)) show predictable
groups at high values (e.g. D-E and Y-F) and
generally group into five distinct groups at lower
values: small, polar, aliphatic, aromatic and cysteine.
Numerous other sub-groupings occur at higher
values, corresponding to the conservation of features
such as positive and negative charge, six-membered ring, tiny, etc. (Taylor, 1986).
Tree diagrams for remote, medium and close
homologues (Figure 3(d) to (f)) are quite similar to
the traditional matrices. Remote homologues are
remarkable in that they show the highest degree of
similarity with the PAM250, BLOSUM62 and/or
GONNET matrices with respect to RMS and maintain very similar groupings of the amino acids,
despite low levels of sequence identity. Several
sub-groupings often observed in the PAM250,
BLOSUM62 and GONNET matrices also appear in
the remote homologues, including aromatic (YW),
aliphatic (LIM), positive (RK) and small (PGAST).
Despite low levels of sequence identity, the diagonal
scores (i.e. for a residue matched with itself) for
remote homologues (Table 2B) are still positive,
although this may be due to the alignments with
the highest %I. The remote homologue matrix
suggests that some character of residue property
conservation is maintained in these proteins
despite little overall sequence identity. Tree diagrams and matrices for remote, medium and close
homologues derived by considering only equivalenced positions show only minor differences from
those calculated with aligned positions (results not
shown).
Traditional substitution matrices are derived by
an analysis of obviously homologous sequences
(i.e. sequence identities ≥25%). It therefore may
seem surprising that the remote homologue matrix
is the most similar (considering both RMS and
clustering) to the PAM250, BLOSUM62 and GONNET matrices. The application of the rates of mutation to the substitution data, as occurs in the
derivation of the PAM250 and GONNET matrices,
appears to mimic the substitutions observed within
the remote homologues. The similarity with BLOSUM62, however, is less easily explained, since the
matrix was not derived by applying mutation rates
to substitution data. The similarity between the remote homologue matrix and the PAM250/BLOSUM62/GONNET matrices suggests that a
sequence substitution legacy remains in the remote
homologues even when most of the similarities are
undetectable by traditional sequence comparison
methods. The scale of the substitution log odds for
remote homologues, ranging from −8 to 8
agrees most closely with those for the PAM250
matrix, and is slightly broader than the ranges for
the BLOSUM62 and GONNET matrices. Those for
the medium and close homologues have a much
lower negative boundary, reflecting the rarity
of unusual substitutions (e.g. Ala to Trp) at high
sequence identity.
Despite comparable RMS values (Table 3) to the
remote homologues, the tree diagram for the analogues matrix (Figure 3(g)) shows little coherence
in the groupings of the amino acids. Instead, the 16
Table 2. Substitution matrices derived from (A) analogues, (B) remote homologues and (C) analogues + remote homologues
A. Analogues
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A  -2
R   1   0
N   0   2   0
D   0   3   2  -2
C   2  -3  -4   0  12
Q   1  -2   2   3   2   0
E   2   2   2   6  -8   6  -2
G   1   2   3   3  -1  -1   3   0
H  -1   2   7   2   5   0 -11  -5  -3
I   1   1  -4  -4  -1   1  -4   0   1   2
L   3  -2   0  -1   4  -2  -1  -2   3   3  -3
K   1   4   1   0  -1   3   2   2   2  -4   0   0
M   4  -5  -1  -1   3  -1  -3   1   3   1   8   1  -9
F  -1   5   1  -3  -1  -6  -4  -1   0   5   3   0   0  -3
P   2   1   0   3  -1   1   1   0   2   2   0   4 -17   4  -1
S   0   2   5   3  -5   0   2   4   1  -1  -1   0  -3  -2   5 -10
T   1   0   1   3   2   3   2   0   5   0  -3  -2   3   3  -1   4   0
W  -4  -5  -3  -4   1  -1  -1   0   2   2   4   3   0  11   2   0   3   5
Y   1   3  -3   0   0   2  -2   1   3   3   3   0   4   0  -3   2   1 -12  -4
V  -1  -4  -1  -2   7   2  -3  -3   0   4   6   0   3   5  -1   1   0   2   3  -1
B. Remote homologues
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   0
R   0   4
N   0   0   3
D  -1   0   5   5
C   2  -1   0  -1  13
Q  -1   3   2   3  -6   4
E   0   2   0   5  -4   2   3
G   1   0   1   1  -1   0  -1   6
H  -2   2   4   2   1   4   3  -1  11
I   0   0  -8  -7   1  -3  -3  -4  -4   1
L   0  -2  -2  -7   2  -5  -2  -4   0   7   2
K   1   6   1   1   0   4   4   0   1  -6  -4   2
M   0   0  -3  -6   1   0  -3  -3   1   5   6  -1   3
F   0  -3  -1  -6   3  -2  -5  -6   0   4   4  -4   4   5
P   0  -1   2   1  -1   0   2   0   1  -2  -5   0  -3  -1   7
S   2   0   3   2  -1   4   1   0  -4  -4  -2   0  -2   0   2   0
T   0   0   4   1  -2  -2   0   0   1   0   0   2   2  -2   1   4   0
W  -1   2   0  -5   2  -3  -2  -2   0   3   0  -2   3   4  -2   1  -4  16
Y   0   2   0  -2   4   1  -1  -1   2   0   0   0   1   7   0   0  -1   8   3
V   1  -1  -3  -3   2  -1  -2  -2  -3   7   3  -2   3   3   0  -1   0  -2   1   0
C. Combined
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A  -2
R   1   0
N   0   0   0
D  -1   0   1   0
C   2  -4  -1   0  12
Q   0   1   3   3  -4   0
E   1   2   0   4  -8   4   0
G   0   1   2   2   0   0   1   1
H   0   3   3   3   2   0  -1  -3   1
I   1   1  -7  -3  -4   0  -3  -1  -1   0
L   2  -2   0  -3   2  -2   0  -2   2   5   0
K   1   3   0   2   2   4   3   1   3  -5  -2  -1
M   2   0  -3  -2   1   0  -1   0   3   4   6   0  -5
F   0  -1   1  -4   3  -3  -5  -3   0   4   4  -1   0   0
P   0   0   1   3   0   0   1   1   2   0  -2   1   0   1   0
S   0   0   4   2  -3   1   2   0  -1  -3   0   0  -4   0   5  -3
T   0   0   3   2   0   0   1   0   3   0  -1   1   2   0   1   3  -2
W   0   0  -1  -5   2  -1  -3  -1   0   4   1   0   3   5   2   2   0   7
Y   0   3   0  -1   1   0  -1   0   1   2   0   0   1   5  -1   0   0   4  -1
V   0   0   0  -1   5   0  -2  -1  -1   5   4   0   1   3  -1   0   0   0   2  -2
amino acids of sufficient abundance in the dataset
to be clustered with confidence group into only
two major groups: hydrophobic and polar. The
higher value groupings do not suggest any common side-chain properties, with the exception of
clusters for R/K, Q/E, V/L and P/A. The diagonal
from the analogue matrix is also very different
from the others, containing very few positive
entries, reflecting the low levels of sequence identity
in the alignments used to derive the matrix. The
two major analogue groupings suggest that hydrophobic character must be maintained within the
protein core, even when an evolutionary relationship is unlikely. Analysis of equivalenced positions
Figure 3. (legend opposite)
Figure 3. Tree diagrams calculated for substitution/mutation matrices as discussed in the text. (a) PAM250 (Dayhoff
et al., 1978), (b) BLOSUM62 (Henikoff & Henikoff, 1992), (c) GONNET (Gonnet et al., 1992), (d) homologues 0 to 25%,
(e) homologues 25 to 50%, (f) homologues 50 to 100%, (g) analogues. Groupings of the amino acids according to sidechain properties are shown to the right. Amino acids with an expected number of observations with any other amino
acid of <5 were excluded and appear below the Figure with a highly negative clustering score. The numbers appearing below the trees show the minimum log odds residue exchange value for each cluster; everything within a
cluster must have a pairwise log odds of ≥ the value at the cluster node.
Table 3. Root-mean-square difference for substitution matrices derived as discussed in the text
          A     RH     MH     CH    PAM   BLOS
RH      3.7
MH      7.8    6.0
CH     13.8   12.1    7.5
PAM     4.5    3.0    5.5   11.2
BLOS    4.4    3.4    6.8   11.9    1.7
GON     3.1    3.1    6.6   11.9    1.4    1.0
A, analogues; RH, remote homologues; MH, medium homologues; CH, close homologues; PAM, point
accepted mutation matrix (250); BLOS, BLOSUM62 matrix; GON, GONNET matrix.
within the analogues unfortunately does not provide sufficient data for the derivation of substitution matrices, since 11 of the amino acids are
involved in substitutions where expected values
are <5. Pairs of analogous proteins give Z-scores
up to 3.6. Matrices calculated with homologues
having Z-scores ≤4.0 differ slightly from those
calculated with %I ≤ 25. The small amino acids
group less well, and tryptophan/cysteine no longer occur often enough to be clustered with confidence. It is clearly important to repeat this analysis
once sufficient data are available.
Our results suggest that there is a weak, but detectable difference between analogous and homologous similarities with respect to residue
substitution frequencies. Although this finding is
not particularly surprising, it does suggest that
alternative measures of quantifying sequence similarity may be required when sequence identity is
below 25%.
For possible application to fold recognition
methods that use substitution matrices (Rost, 1995;
Fischer & Eisenberg, 1996; Bates et al., 1996;
Alexandrov et al., 1996), a combined analogue/remote-homologue matrix (Table 2C) was derived by
considering all 46 analogues and the 46 homologues having the lowest percentage sequence
identity.
Figure 4. Plot of RMS deviation on Cα atoms versus
sequence identity. More details are given in the text and
in Table 1.
RMS deviation
For comparison with previous studies, a plot of
RMS deviation of the distances between equivalenced Cα coordinates versus %I is shown in
Figure 4. The plot shows the same exponential increase in RMS deviation with decreasing %I that
has been seen previously (Chothia & Lesk, 1986;
Hubbard & Blundell, 1987; Flores et al., 1993;
Russell & Barton, 1994), and shows no apparent difference between analogues and remote homologues (Flores et al., 1993; Russell & Barton, 1994).
Note that the upper limit of 3 Å has to do largely
with the STAMP selection criteria. A one-tailed t
test for significance (Table 1) shows no significant
difference between the means for analogues and
remote homologues.
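As a minimal sketch of the quantity plotted in Figure 4, assuming the two structures have already been superimposed and that the equivalenced Cα coordinates are supplied as two matched lists of (x, y, z) tuples, the RMS deviation could be computed as below. The function name and input layout are illustrative assumptions and are not part of the STAMP package.

```python
import math


def rms_deviation(coords_a, coords_b):
    """RMS deviation between matched, already-superimposed C-alpha coordinates.

    coords_a and coords_b are equal-length lists of (x, y, z) tuples, one
    entry per equivalenced residue pair (hypothetical input layout).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two equivalenced atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

The value is in the same units as the input coordinates (Å for PDB entries).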
Secondary structure and accessibility
The alignments from the four datasets were analysed to quantify the nature of secondary structure and accessibility conservation with experimental and predicted information. For these analyses, the plots show only the results considering aligned positions. Those for equivalenced positions showed little difference in overall trends, although the means were generally higher, as shown in Table 1. Means given below (x̄) carry subscripts A and RH to denote analogues and remote homologues; a and e denote calculations on aligned and equivalenced positions, respectively. All means, standard deviations and t values for one-tailed tests for the differences between the means of analogues and remote homologues are given in Table 1.
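This section does not state which form of the t statistic was used for Table 1. Purely as an illustration, the sketch below computes the unequal-variance (Welch) t statistic for the difference between two sample means; the choice of Welch's form, the function name and the sign convention (second sample minus first) are all assumptions.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch t statistic for mean(sample_b) - mean(sample_a).

    Assumed usage: sample_a holds per-alignment agreements for analogues,
    sample_b those for remote homologues.
    """
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # Sample variances (n - 1 denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_b - mean_a) / standard_error
```

A one-tailed probability would then be read from the t distribution with the appropriate degrees of freedom, a step omitted here.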
Figure 5. Plots showing the conservation of (a) secondary structure and (b) accessibility as a function of sequence identity. See the text and Table 1 for more details.
Figure 5 shows the agreement between three-state secondary structure (a) and accessibility (b) as a function of percentage sequence identity. The plots agree with previous studies (Russell & Barton, 1993, 1994; Rost & Sander, 1994; Flores et al., 1993; Rost, 1995) and show a general drop in the conservation of these two features with a decrease in sequence identity. Analogues show mean three-state secondary structure agreements (Figure 5(a)) of x̄_Aa = 70.6, x̄_Ae = 81.4 and remote homologues x̄_RHa = 75.3, x̄_RHe = 84.8. As has been seen previously, accessibility (Figure 5(b)) is less well preserved in structural alignments below 25% sequence identity (Russell & Barton, 1994; Flores et al., 1993; Rost & Sander, 1994), giving average values of x̄_Aa = 49.2, x̄_Ae = 56.7 and x̄_RHa = 55.9, x̄_RHe = 64.9. This lower degree of conservation is not surprising, since accessibility varies on a per-residue basis, whereas secondary structure varies on a segment basis. For both secondary structure and accessibility, the analogues show lower average agreement than the remote homologues, and the differences are significant as defined using a one-tailed t test for significance (Table 1).
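The agreement values quoted above can be illustrated with a short sketch. It assumes that agreement is the percentage of compared alignment columns in which both proteins carry the same three-state label, that gap columns are skipped, and that an optional boolean mask restricts the comparison to equivalenced positions. The function name, the gap character and the mask argument are assumptions made for illustration.

```python
def percent_agreement(states_a, states_b, mask=None, gap="-"):
    """Percentage of compared columns where two aligned state strings agree.

    states_a / states_b: aligned three-state strings (e.g. H, E, C).
    mask: optional list of booleans; only columns flagged True are compared
          (assumed way of selecting structurally equivalenced positions).
    """
    if len(states_a) != len(states_b):
        raise ValueError("aligned strings must have equal length")
    if mask is None:
        mask = [True] * len(states_a)
    if len(mask) != len(states_a):
        raise ValueError("mask must match the alignment length")
    compared = 0
    matched = 0
    for a, b, use in zip(states_a, states_b, mask):
        if not use or a == gap or b == gap:
            continue  # skip unselected columns and columns containing a gap
        compared += 1
        if a == b:
            matched += 1
    if compared == 0:
        raise ValueError("no columns to compare")
    return 100.0 * matched / compared
```

Calling the function once without a mask and once with a mask of equivalenced columns gives the two kinds of value denoted by the subscripts a and e.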
How well do predictions agree between structurally similar proteins? Figure 6 shows plots of the secondary structure and accessibility prediction agreements. For closely homologous proteins, the plots have limited value, since one sequence is likely to have been used in the prediction of the other; below 25% identity, this is unlikely, and the results are valid. The agreement between predictions mirrors that for the experimental data, with the difference that the agreement drops lower than the experimental agreement. This is expected from the combination of errors in prediction that occurs in the alignment of two structures. Secondary structure predictions show a three-state agreement of x̄_Aa = 60.0, x̄_Ae = 65.1 and x̄_RHa = 60.8, x̄_RHe = 66.5. Thus predictions of secondary structure are similar between proteins having similar folds in the absence of sequence similarity. Accessibility predictions show less agreement across proteins with similar folds, giving values of x̄_Aa = 48.3, x̄_Ae = 54.9 and x̄_RHa = 53.7, x̄_RHe = 59.4. The lower conservation of accessibility probably reflects the poorer prediction accuracies.
Figure 6. Plots showing the agreement between predictions of (a) secondary structure and (b) accessibility as a function of percentage sequence identity. See the text and Table 1 for more details.
If secondary structure and accessibility predictions are to augment fold recognition, then it is important to quantify how well the prediction for one protein agrees with the experimental result for the other within the alignments. Figure 7(a) shows that for remote homologues and analogues, the secondary structure agreement between one prediction and the other known shows a general decrease from 70% for close homologues (i.e. the approximate limit of secondary structure prediction) down to as low as 20 to 30% for some remote homologues and analogues. The means for analogues and remote homologues are x̄_Aa = 63.0, x̄_Ae = 71.0 and x̄_RHa = 65.2, x̄_RHe = 71.8, again showing little difference between these two types of similarities. Nevertheless, the data do suggest that there is a strong agreement between the prediction for one protein and the experimental structure for the other. Figure 7(b) shows a similar trend for accessibility: agreements at high sequence identities reflect the accuracy of accessibility predictions, and agreements generally drop with decreasing sequence identity. The means for analogues and remote homologues are x̄_Aa = 44.9, x̄_Ae = 56.0 and x̄_RHa = 49.5, x̄_RHe = 54.8. The apparent maximum agreement of approximately 65% in Figure 7(b) is largely due to the rarity with which PHD predicts the intermediate burial state (i).
Figure 7. Plots showing the agreement between the prediction in one protein and the experimental value in the other for (a) secondary structure and (b) accessibility as a function of percentage sequence identity. See the text and Table 1 for more details.
Log odds matrices (Table 4) were constructed in an attempt to quantify, for secondary structure and accessibility, the matching of known/predicted values in one protein in the alignment with predicted values in the other. For each position within an alignment, the known and predicted secondary structure for one protein and the predicted for the other were used to define a confusion matrix. To mimic a fold recognition situation, one protein was defined as the template (t), where both experimental and predicted information are available, whilst the other was defined as the probe (p), where only predicted information is available. This was performed twice for each alignment, treating each sequence as the template in turn. A total of 27 possible alignment situations were defined by allowing three states in each secondary structure or accessibility assignment (one known, two predicted):
$$\text{template known } \mathrm{State}^{t}_{K_i} \text{ and predicted } \mathrm{State}^{t}_{P_j} \text{ aligned with predicted } \mathrm{State}^{p}_{P_k}$$
where (i, j, k) ∈ {a, b, c} (a, α-helix; b, β-strand; c, coil) for secondary structure and (i, j, k) ∈ {b, i, e} (b, buried; i, intermediate; e, exposed) for accessibility. The frequency of each of the 27 possibilities was calculated for each set of alignments, and a log odds derived by:
$$\log\left(\frac{f_{K_i P_j P_k}}{f_{K_i}\, f_{P_j}\, f_{P_k}}\right) - L_{ave}$$
where $f_{K_i P_j P_k}$ is the observed frequency of the change $K_i P_j$ to $P_k$, and $f_{K_i}$, $f_{P_j}$, $f_{P_k}$ are the frequencies of each of the three secondary structure or accessibility types (known or predicted). Matrices for both secondary structure and accessibility were calculated for analogues, remote homologues and the combined set (analogues + the 46 remote homologues with the lowest sequence identity) as described above.
These matrices provide a means to aid protein fold recognition. As expected, the highest scores from such matrices occur when all three states are the same (Ki = Pj = Pk). However, positive scores are also found when the predictions disagree, providing a means to tolerate errors in secondary structure/accessibility predictions during fold recognition. An assessment of these matrices for use in fold recognition will be performed elsewhere.
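The log odds calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to produce Table 4: each observation is a (template known, template predicted, probe predicted) triplet, the marginal frequencies are taken from the same set of observations, natural logarithms are used with no scaling, unobserved triplets are simply left out, and the offset L_ave is passed in by the caller because its definition is not given in this section. All names are hypothetical.

```python
import math
from collections import Counter


def triplet_log_odds(triplets, l_ave=0.0):
    """Log odds for (known_t, predicted_t, predicted_p) state triplets.

    triplets: iterable of 3-tuples, one per alignment position.
    l_ave:    offset subtracted from every log odds value (assumed input).
    Returns a dict mapping each observed triplet to its score.
    """
    triplets = list(triplets)
    n = len(triplets)
    if n == 0:
        raise ValueError("no observations")
    joint = Counter(triplets)
    known = Counter(t[0] for t in triplets)        # template known state
    pred_template = Counter(t[1] for t in triplets)  # template predicted state
    pred_probe = Counter(t[2] for t in triplets)     # probe predicted state
    scores = {}
    for (i, j, k), count in joint.items():
        observed = count / n
        expected = (known[i] / n) * (pred_template[j] / n) * (pred_probe[k] / n)
        scores[(i, j, k)] = math.log(observed / expected) - l_ave
    return scores
```

In a fold recognition setting such a table would be looked up once per aligned position, with the template contributing its known and predicted state and the probe its predicted state.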
Insertions and deletions
Many methods for detecting sequence homologies and many methods of fold recognition and
Table 4. Secondary structure confusion matrices and accessibility confusion matrices as discussed in the text

A. Secondary structure confusion matrices
           Analogues           Remote homologues   Combined
          ak    bk    ck      ak    bk    ck      ak    bk    ck
aiaj      28    −3    11      24     0     6      25     0     8
aibj      −6   −25   −11      −3   −17   −12      −4   −22   −12
aicj       3   −12     0       0   −11     0       1   −13    −1
biaj     −19    −2    −6     −17    −2   −12     −19    −2    −9
bibj       0    22    10       1    24    10       0    23    11
bicj      −9     4     3     −12     6     2     −11     6     3
ciaj       0   −12    −3      −1   −10    −1      −1   −11    −2
cibj     −11     2     4     −10     1     3     −10     1     3
cicj       7    10    18       4     7    18       6     9    18

B. Accessibility confusion matrices
           Analogues           Remote homologues   Combined
          bk    ik    ek      bk    ik    ek      bk    ik    ek
bibj       8     4     0       7     1    −2       8     3     0
biij       0   −10    −2      −1     5    −4       0   −12    −1
biej      −9    −8    −9     −10    −8    −9      −8    −7    −9
iibj       0     1     0       0     1    −1       0     2     0
iiij       4     8     6       4    13     3       4     7     5
iiej       0     4     3      −1     2     4       0     2     4
eibj      −7    −7    −4      −9    −6    −5      −8    −5    −4
eiij      −2    −2     1      −4     1     2      −1    −3     3
eiej       2     6     9       0     5    10       3     7    10

i denotes template known value, j template predicted value and k probe predicted value.
protein structure comparison depend on affine gap penalties to allow for insertions and deletions (indels) in a sequence/structure alignment. To aid in the derivation of gap penalties, the alignments derived above were probed to see how the number and nature of indels varied as a function of sequence identity and evolutionary relatedness (i.e. analogues versus homologues).
Figure 8 shows how the numbers of both gaps (i.e. any position where a gap is aligned to a residue in the other protein) and openings (i.e. one or more sequential gaps) increase as sequence identity drops. For analogues and remote homologues the number of gaps/openings is often much greater than for medium/close homologues. Whereas medium/close homologues rarely have more than 20 gaps and two openings per 100 residues, analogues and remote homologues can have as many as 80 gaps and nine openings. These findings suggest, as expected, that much more lenient gap penalties should be used during fold recognition methods than those used for conventional sequence alignment.
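The two counts defined above can be made concrete with a small sketch operating on a pairwise alignment written as two equal-length gapped strings. The text does not say which length the per-100-residue figures are normalised by; the sketch assumes the number of residues in the shorter of the two sequences. The function name, the gap character and the returned dictionary are likewise illustrative assumptions.

```python
def gap_statistics(seq_a, seq_b, gap="-"):
    """Count gaps and openings in a pairwise alignment, raw and per 100 residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    gaps = 0
    openings = 0
    previous = None  # which sequence carried the gap in the previous column
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # column empty in both sequences: ignored
        current = "a" if a == gap else "b" if b == gap else None
        if current is not None:
            gaps += 1  # a gap aligned to a residue in the other protein
            if current != previous:
                openings += 1  # start of a new run of sequential gaps
        previous = current
    # Assumed normalisation: residues in the shorter sequence.
    residues = min(len(seq_a.replace(gap, "")), len(seq_b.replace(gap, "")))
    if residues == 0:
        raise ValueError("alignment contains no residues")
    scale = 100.0 / residues
    return {
        "gaps": gaps,
        "openings": openings,
        "gaps_per_100": gaps * scale,
        "openings_per_100": openings * scale,
    }
```

For example, the alignment `ACD--FGH` over `AC-EEFG-` contains four gap positions grouped into three openings.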
Composition of insertions and deletions is also important. For fold recognition, it is most important to determine how often whole secondary structures are inserted/deleted in one protein relative to another. Similar 3D structures, particularly in the absence of clear sequence identity, often differ in structure outside the core. Secondary structures may be present in one member of the structural family and absent in others. We quantified this observation by calculating the proportion of secondary structures common to both proteins in each alignment. Figure 9 shows a plot of how the fraction of equivalenced (a) and aligned (b) secondary structures behaves as a function of sequence identity. Equivalenced structures are those that are deemed common in 3D by the structure comparison algorithm. Aligned structures include all of the equivalenced structures, and those of the same type that happened to be aligned. The latter are not equivalent according to STAMP, but lie in approximately the same location in the sequence alignment.
For close and medium homologue proteins, insertions/deletions of whole secondary structure elements are rare, as expected, with usually ≥85% of secondary structures equivalent. However, for remote homologues and analogues, insertions/deletions of whole secondary structure elements are very common. Interestingly, when one includes aligned, but not structurally equivalent, secondary structures in the calculation, the number of indels decreases, possibly reflecting the divergence of structure, and the subjectivity of the structural equivalence criteria. Nevertheless, the results suggest that methods attempting to detect analogues and remote homologues should allow for such whole secondary structure element indels. Russell et al. (1995, 1996) used a depth-first search algorithm to allow for such whole secondary structure element indels.
Figure 8. Plots showing the number of (a) gaps and (b) openings per 100 residues as a function of percentage sequence identity. See the text and Table 1 for more details.
Figure 9. Plots showing the percentage of secondary structures (a) equivalenced and (b) aligned as a function of percentage sequence identity. See the text and Table 1 for more details.
Discussion and Conclusions
The results of this study suggest that there is a significant difference in the percentage sequence identity distributions between analogues and undetectable homologues. They suggest that similarities between 3D structures may be more easily classified into analogues and homologues when %I is calculated by a consideration of only structurally equivalent positions, and may provide insights as to protein structure relationships in the absence of other data. They suggest that even when similarities are undetectable via sequence comparison, there is a greater degree of residue property conservation in remote homologues. Figure 2 suggests a strategy for deciding between remote homology and analogy as the more likely explanation for a structural similarity in or beyond the twilight zone of sequence identity. The percentage sequence identity calculated by considering only structurally equivalent positions could be used to distinguish between these two cases. If %I is >11, the similarity is more likely to indicate a remote homology and a common ancestor; conversely, if %I is <9, the similarity is more likely to indicate an analogous relationship. Although this is not a general rule and analogues can still have higher %I than remote homologues, this strategy may be of use in the classification of protein structures.
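The strategy just described can be written down as a short sketch. It assumes an alignment given as two equal-length strings plus a list of booleans marking the structurally equivalent columns; the function names and the label returned for the 9 to 11% band are illustrative assumptions, and the thresholds carry the same caveat as in the text (they indicate the more likely explanation, not a rule).

```python
def percent_identity_equivalent(seq_a, seq_b, equivalenced, gap="-"):
    """%I computed over structurally equivalent alignment columns only."""
    pairs = [
        (a, b)
        for a, b, is_equivalent in zip(seq_a, seq_b, equivalenced)
        if is_equivalent and a != gap and b != gap
    ]
    if not pairs:
        raise ValueError("no structurally equivalent positions")
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def more_likely_relationship(percent_identity):
    """Apply the >11 / <9 thresholds quoted in the text."""
    if percent_identity > 11:
        return "remote homologue"
    if percent_identity < 9:
        return "analogue"
    return "undetermined"  # between the two thresholds
```

A pair scoring between the two thresholds is deliberately left unclassified, since the text gives no guidance for that range.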
The greater significance of percentage sequence identity calculated by a consideration of structurally equivalent positions has been noted previously. Numerous analyses on particular protein families have found short stretches of conserved amino acid residues following structural alignment, and suggested that this strategy can provide more insight into evolutionary relationships (Martin et al., 1993; Murzin, 1993a,b).
The substitution matrices provide an important alternative means to compare protein sequences. The data derived from real examples of undetectable similarities may provide a means to improve existing means of protein sequence comparison (Altschuh et al., 1987; Smith & Waterman, 1981; Barton, 1993) and methods of fold recognition that use mutation matrices (Rost, 1995; Fischer & Eisenberg, 1996; Bates et al., 1996; Alexandrov et al., 1996). For the latter methods, the similarity between the traditional mutation matrices and the substitution matrix derived from remote homologies provides validation of this approach for the detection of remote homologues. The analogue and combined matrices provide a possible means to extend and improve such methods for the detection of analogous similarities. How these matrices perform in a fold recognition algorithm will be assessed elsewhere.
The conservation of experimental and predicted secondary structure and accessibility across analogues and remote homologues re-confirmed previous analyses as to their conservation (Russell & Barton, 1994; Rost et al., 1994). The agreement of predicted secondary structure/accessibility in one protein and experimental values in another provides an assessment of how useful these commodities could be in fold recognition methods. The confusion matrices also provide a means to correlate predictions in both known and unknown structures with known experimental values.
The analysis of insertions and deletions provides important insights for methods of fold recognition. Most importantly, it suggests that much more lenient gap parameters should be used to allow for longer insertions and deletions, and the insertion/deletion of whole secondary structure elements during alignments.
Our results suggest that there is a perceptible difference in structural properties and sequence similarity between analogues and remote homologues. This finding disagrees with previous studies that have suggested little difference (Russell & Barton, 1994; Flores et al., 1993). One possible explanation for this difference is that the datasets used previously were of varied content and the assignment of homologues and analogues was not entirely correct (e.g. thioredoxin folds as analogues according to Russell & Barton, 1994). The dataset of Flores et al. (1993) also contained very few true analogous similarities (4). For this study a much larger database of structures was available, and the crucial assignment of protein pairs into analogues and remote homologues is according to the SCOP database and is based on structure, function and careful expert insight into possible evolutionary relationships. For fold recognition, our results suggest that it is important to consider analogues and remote homologues as different in kind. The general lower degree of sequence and property conservation suggests that analogues will be more difficult to detect.
The finding of a significant difference in the distribution of sequence identities between analogues and remote homologues is important, since this is not generally thought to be true. However, this finding is not entirely surprising, since there are biological explanations for the presence of a sequence legacy in remote homologues. The apparently higher levels of sequence identity for remote homologues compared to analogues suggest a limited period over which the family has evolved, leading to greater conservation of sequence in the homologues. Alternatively or additionally, the sequence similarities could be the consequence of functional constraints on the sequence of homologous proteins. Clearly, as more structures are solved and as understanding and assessment of protein sequence/structural similarities improves, these results must be re-examined. Our findings demonstrate the continuing importance of comparative taxonomy of protein structures in providing insights into evolution.
Methods and Data
Superimposition of the PDB according to SCOP
The SCOP database (Murzin et al., 1995) was used as a resource to define homologous and analogous protein structure pairs. The entire database was "expanded" and downloaded (URL http://scop.cam.cpe.ac.uk/scop/) and all entries within the PDB grouped according to the hierarchy, which consists of divisions for class, fold, superfamily, family, protein and species. The objective was to superimpose all entries within each fold grouping using the STAMP package (Russell & Barton, 1992), in the least possible time.
PDB entries grouped at the species level (i.e. highly similar or identical) were superimposed simply by overlaying sequences and using the Cα from the resulting equivalences in a least-squares fit. To get superimpositions at the next level up (protein division), sequences for one representative from each species level were then aligned using the AMPS package (Barton, 1990) and the resulting equivalences used for the Cα-based superimposition. Representatives from each protein level were then superimposed using 3D structure comparison (Russell & Barton, 1992). This was performed by using one representative to scan the others to provide an initial superimposition (Matthews et al., 1994), which was then refined as described by Russell & Barton (1992). The result was a superimposition containing one member from each protein level in the fold group. This superimposition was then merged with the protein and species level superimpositions to form a final global superimposition for all PDB entries in the fold group. These groups were inspected manually to remove erroneous SCOP entries or correct initially poor superimposition.
Derivation of sets of analogous and homologous pairs
The database described above was used to derive non-redundant sets of homologous and analogous alignments/transformations using STAMP (Russell & Barton, 1992). Representatives were chosen from fold-family (analogues) and superfamily (homologues) sub-groupings within the SCOP database. All structures in the sets were required to be solved by X-ray crystallography, refined at a resolution of 2.8 Å or better, and to be full coordinate entries (i.e. no Cα-only structure was allowed).
Structure pairs were required to have a STAMP similarity score (Sc) of ≥3.0. These scores can be between 0 (no similarity) and 9.8 (identical structure), with values ≥2.5 indicating a significant similarity, and values ≤5.5 usually indicating structural similarity in the absence of sequence similarity. The first pair from each family was chosen as the pair with Sc nearest to (but not less than) 3.0 (the most structurally distant pair, though with a reliable alignment of the two structures). The next pair contained structures having the lowest scores with each of the structures in the previous pair (still Sc ≥ 3.0) and not comprising any structures already considered. The procedure was run four times, each time allowing up to four pairs from each fold group, to select proteins from the following groupings: analogues, homologues 0 to 20%, 20 to 40%, 40 to 80%, 80 to 100%. The alignments were then pooled and divided into the final sets as discussed in Results.
The STAMP similarity index was used to show the regions that are structurally equivalent, defined as positions having a P′ij (Russell & Barton, 1992) score of ≥4.0 for stretches of two or more residues. This is a lenient choice of parameters and leads sometimes to the alignment of only weakly structurally equivalent residues. This leniency was chosen to give the largest fraction possible of structurally matched residues. Gaps and residues outside these structurally equivalent positions were minimised to the shortest possible length. This was thought to give the best comparison between alignments derived from structure comparison and those that might be derived using sequence alignment or fold recognition methods where structural data are absent for at least one protein in the alignment.
All datasets, alignments and superimpositions used in this study are available via the WWW on URL http://www.icnet.uk/bmm/AH/.
Sequence alignment significance
For remote homologues (%I ≤ 25%) a randomised Z-score was calculated using the AMPS package (Barton & Sternberg, 1987; Barton, 1990). A Z-score for the optimal alignment of the two sequences was calculated using the mean and standard deviation from 100 randomisations of the two sequences. Z-scores of <6.0 were considered to indicate an undetectable similarity.
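The randomised Z-score can be sketched as follows. This is not the AMPS implementation: the alignment score is supplied by the caller as `score_fn`, both sequences are shuffled in every randomisation, and the sample standard deviation is used, all of which are assumptions. `identity_score` is a toy stand-in for an optimal alignment score, included only so that the example runs.

```python
import random
import statistics


def z_from_scores(observed, random_scores):
    """Z-score of an observed score against scores from randomised sequences."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation (assumed)
    if sd == 0:
        raise ValueError("randomised scores have zero spread")
    return (observed - mean) / sd


def randomised_z_score(seq_a, seq_b, score_fn, n_random=100, seed=0):
    """Shuffle both sequences n_random times and score each shuffled pair."""
    rng = random.Random(seed)
    random_scores = []
    for _ in range(n_random):
        shuffled_a = list(seq_a)
        shuffled_b = list(seq_b)
        rng.shuffle(shuffled_a)
        rng.shuffle(shuffled_b)
        random_scores.append(score_fn("".join(shuffled_a), "".join(shuffled_b)))
    return z_from_scores(score_fn(seq_a, seq_b), random_scores)


def identity_score(seq_a, seq_b):
    """Toy score: identical residues at the same index (no gaps, no matrix)."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)
```

Shuffling preserves the composition and length of each sequence, so the Z-score measures how far the real score stands above what composition alone would give.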
Secondary structure and accessibility
Secondary structures and accessibilities were calculated using the program DSSP (Kabsch & Sander, 1983). Secondary structure assignments were converted to three-state: α and 3₁₀ helix, H; β sheet or β bridge, E; other, C. DSSP accessibilities were converted to three-state as described by Rost & Sander (1994) using the log(relacc) value. Three-state predictions of secondary structure and accessibility were obtained from the PHD server, which performs the methods of Rost & Sander (1993, 1994).
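The three-state reduction of DSSP secondary structure can be illustrated with a few lines of code. The mapping follows the rule stated above, using the standard DSSP one-letter codes (H, α-helix; G, 3₁₀ helix; E, extended strand; B, isolated β bridge); the function and constant names are assumptions, and reading the codes out of a DSSP file is not shown.

```python
HELIX_CODES = {"H", "G"}   # alpha helix and 3-10 helix
STRAND_CODES = {"E", "B"}  # extended beta strand and isolated beta bridge


def to_three_state(dssp_codes):
    """Reduce a string of DSSP codes to three states: H, E or C."""
    reduced = []
    for code in dssp_codes:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in STRAND_CODES:
            reduced.append("E")
        else:
            reduced.append("C")  # everything else, including blanks
    return "".join(reduced)
```

Under this rule the π-helix (I), turn (T) and bend (S) codes all fall into the C class.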
Both three-state PHD (Rost & Sander, 1993) secondary structure and accessibility predictions show the expected distributions, with means of 75% and 57%, respectively. These values are slightly higher than the means reported by Rost & Sander (1993, 1994), as is to be expected, since the predictions are not cross-validated.
Acknowledgement
The authors thank Andrew Lyall (Glaxo-Wellcome
Bioinformatics) for encouragement and support.
References
Alexandrov, N. N., Nussinov, R. & Zimmer, R. M. (1996). Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. In Pacific Symposium on Biocomputing, pp. 53-72, World Scientific Publishing Co, Singapore.
Altschuh, D., Lesk, A. M., Bloomer, A. C. & Klug, A. (1987). Correlation of co-ordinate amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol. 193, 693-707.
Barton, G. J. (1990). Protein multiple sequence alignment and flexible pattern matching. Methods Enzymol. 183, 403-428.
Barton, G. J. (1993). Alscript: a tool to format multiple sequence alignments. Protein Eng. 6, 37-40.
Barton, G. J. & Sternberg, M. J. E. (1987). A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327-337.
Bates, P. A., Jackson, R. M. & Sternberg, M. J. E. (1996). Prediction of protein structures and their docking. In Genomes Molecular Biology and Drug Discovery, pp. 73-86, Academic Press, London.
Chothia, C. & Lesk, A. M. (1982). Evolution of proteins formed by β-sheets. I. Plastocyanin and azurin. J. Mol. Biol. 160, 309-323.
Chothia, C. & Lesk, A. M. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823-826.
Chung, S. Y. & Subbiah, S. (1996). A structural explanation of the twilight zone of protein sequence homology. Curr. Biol. 4, 1123-1127.
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). A model of evolutionary change in proteins, matrices for detecting distant relationships. In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol. 5, pp. 345-358, National Biomedical Research Foundation, Washington, DC.
Fischer, D. & Eisenberg, D. (1996). Protein fold recognition using sequence-derived predictions. Protein Sci. 5, 947-955.
Flores, T. P., Orengo, C. A., Moss, D. S. & Thornton, J. M. (1993). Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2, 1811-1826.
Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256, 1443-1444.
Hazes, B. & Hol, W. G. J. (1992). Comparison of the hemocyanin β-barrel with other greek key β-barrels: possible importance of the "β-zipper" in protein structure and folding. Proteins: Struct. Funct. Genet. 12, 278-298.
Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915-10919.
Holm, L. & Sander, C. (1996a). Mapping the protein universe. Science, 273, 595-602.
Holm, L. & Sander, C. (1996b). The fssp database - fold classification based on structure alignment of proteins. Nucl. Acids Res. 24, 206-209.
Hubbard, T. J. P. & Blundell, T. L. (1987). Comparison of solvent-inaccessible cores of homologous proteins - definitions useful for protein modelling. Protein Eng. 1, 159-171.
Johnson, M. S., Overington, J. P. & Blundell, T. L. (1993). Alignment and searching for common protein folds using a data bank of structural templates. J. Mol. Biol. 231, 735-752.
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275-282.
Kabsch, W. & Sander, C. (1983). A dictionary of protein secondary structure. Biopolymers, 22, 2577-2637.
Laurents, D. V., Subbiah, S. & Levitt, M. (1994). Different protein sequences can give rise to highly similar folds through different stabilizing interactions. Protein Sci. 3, 1938-1944.
Lemer, C. M. R., Rooman, M. J. & Wodak, S. J. (1995). Protein-structure prediction by threading methods - evaluation of current techniques. Proteins: Struct. Funct. Genet. 23, 337-355.
Lesk, A. M. & Chothia, C. (1980). How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136, 225-270.
Lesk, A. M. & Chothia, C. (1982). Evolution of proteins formed by β-sheets II. The core of the immunoglobulin domains. J. Mol. Biol. 160, 325-342.
Martin, J. L. (1995). Thioredoxin - a fold for all reasons. Structure, 3, 245-250.
Martin, J. L., Bardwell, J. C. A. & Kuriyan, J. (1993). Crystal structure of the DsbA protein required for disulphide bond formation in vivo. Nature, 365, 464-468.
Matthews, S., Barlow, P., Boyd, J., Barton, G., Russell, R., Mills, H., Cunningham, M., Meyers, N., Burns, N., Clark, N., Kingsman, S., Kingsman, A. & Campbell, I. (1994). Structural similarity between the p17 matrix protein of HIV-1 and interferon-γ. Nature, 370, 666-668.
Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.
Murzin, A. G. (1993a). Sweet tasting protein monellin is related to the cystatin family of thiol proteinase inhibitors. J. Mol. Biol. 230, 689-694.
Murzin, A. G. (1993b). Can homologous proteins evolve different enzymatic activities? J. Mol. Biol. 18, 403-405.
Murzin, A. G., Lesk, A. M. & Chothia, C. (1992). β-trefoil fold patterns of structure and sequence in the Kunitz inhibitors interleukins-1β and 1α and fibroblast growth factors. J. Mol. Biol. 223, 531-543.
Murzin, A. Z. (1993). OB(oligonucleotide/oligosaccharide binding)-fold: common structural and functional solution for non-homologous sequences. EMBO J. 12, 861-867.
Ollis, D. L., Cheah, E., Cygler, M., Dijkstra, B., Frolow, F., Franken, S. M., Harel, M., Remington, S. J., Silman, I. & Schrag, J. (1992). The alpha/beta hydrolase fold. Protein Eng. 5, 197-211.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Protein superfamilies and domain superfolds. Nature, 372, 631-634.
Overington, J., Johnson, M. S., Sali, A. & Blundell, T. L. (1990). Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. Roy. Soc. ser. B, 241, 132-145.
Pascarella, S. & Argos, P. (1992). A data bank merging related protein structures and sequences. Protein Eng. 5, 121-137.
Pickett, S. D., Saqi, M. A. S. & Sternberg, M. J. E. (1992). Evaluation of the sequence template method for protein structure prediction. Discrimination of the β/α-barrel fold. J. Mol. Biol. 228, 170-187.
Rost, B. (1995). TOPITS: threading one-dimensional predictions into three-dimensional structures. In Proc. 3rd. Int. Conf. Intel. Sys. Mol. Biol. (Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. & Wodak, S., eds), pp. 314-321, AAAI Press, Menlo Park, CA.
Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584-599.
Rost, B. & Sander, C. (1994). Conservation and prediction of solvent accessibility in protein families. Proteins: Struct. Funct. Genet. 20, 216-226.
Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235, 13-26.
Russell, R. B. & Barton, G. J. (1992). Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins: Struct. Funct. Genet. 14, 309-323.
Russell, R. B. & Barton, G. J. (1993). The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J. Mol. Biol. 234, 951-957.
Russell, R. B. & Barton, G. J. (1994). Structural features can be unconserved in proteins with similar folds: An analysis of side-chain to side-chain contacts, secondary structure and accessibility. J. Mol. Biol. 244, 332-350.
Russell, R. B., Copley, R. R. & Barton, G. J. (1995). Protein fold recognition from secondary structure assignments. In Proc. 28th Hawaii. Int. Conf. Sys. Sci. (Hunter, L. & Shriver, B. D., eds), vol. 5, pp. 302-311, IEEE Computer Society Press, Los Alamitos, CA.
Russell, R. B., Copley, R. R. & Barton, G. J. (1996). Protein fold recognition by mapping predicted secondary structures. J. Mol. Biol. 259, 349-365.
Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
Taylor, W. R. (1986). Classification of amino acid conservation. J. Theoret. Biol. 119, 205-218.
Edited by F. E. Cohen
(Received 9 December 1996; received in revised form 21 February 1997; accepted 21 February 1997)