Efficient SCOP-fold classification and retrieval

BIOINFORMATICS
ORIGINAL PAPER
Vol. 25 no. 19 2009, pages 2559–2565
doi:10.1093/bioinformatics/btp474
Structural bioinformatics
Efficient SCOP-fold classification and retrieval using index-based
protein substructure alignments
Pin-Hao Chi1,2, Bin Pang1,2, Dmitry Korkin1,2 and Chi-Ren Shyu1,2,∗
1 Medical and Biological Digital Library Research Lab, Informatics Institute and 2 Department of Computer Science,
University of Missouri, Columbia, MO 65211, USA
Received on March 31, 2009; revised on July 7, 2009; accepted on July 24, 2009
Advance Access publication August 10, 2009
Associate Editor: Anna Tramontano
ABSTRACT
Motivation: To investigate structure–function relationships, life
sciences researchers usually retrieve and classify proteins with
similar substructures into the same fold. A manually constructed
database, SCOP, is believed to be highly accurate; however, it
is labor intensive. Another known method, DALI, is also precise
but computationally expensive. We have developed an efficient
algorithm, namely, index-based protein substructure alignment
(IPSA), for protein-fold classification. IPSA constructs a two-layer
indexing tree to quickly retrieve similar substructures in proteins and
suggests possible folds by aligning these substructures.
Results: Compared with known algorithms, such as DALI, CE,
MultiProt and MAMMOTH, on a sample dataset of non-redundant
proteins from SCOP v1.73, IPSA exhibits speedups of 53.10, 16.87,
3.60 and 1.64 times, respectively. Evaluated
on three different datasets of non-redundant proteins from SCOP,
the average accuracy of IPSA is approximately equal to that of DALI and better
than those of CE, MAMMOTH, MultiProt and SSM. With reliable accuracy and
efficiency, this work will benefit the study of high-throughput protein
structure–function relationships.
Availability: IPSA is publicly accessible at http://ProteinDBS.rnet.missouri.edu/IPSA.php
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Protein tertiary structure classification has become an important
research topic in computational and molecular biology. Since
similar protein tertiary structures might have correlations with
specific biological functions (Zarembinski et al., 1998), several
structural genomics projects (Chen et al., 2004; von Grotthuss
et al., 2006) are focused on understanding the links between protein
sequences, structures and functional properties. To understand
protein structure–function relationships, life sciences researchers
usually group proteins of similar substructures into a fold and study
functional properties of these similar proteins. The computational
methods of protein structure comparisons are usually focused on
finding a detailed structural alignment between two proteins. The
root mean square deviation (RMSD) is utilized to gauge the
quality of alignment results from their optimized superimposition.
Finding a global optimum of structural alignment is known to be
a nondeterministic polynomial time (NP)-hard problem (Godzik,
1996). To reduce the computational time, traditional structural
alignment algorithms such as SSAP (Taylor and Orengo, 1989),
DALI (Holm and Sander, 1993), CE (Shindyalov and Bourne, 1998),
MAMMOTH (Ortiz et al., 2002) and MultiProt (Shatsky et al., 2004)
apply heuristics and search for locally optimal solutions.
∗ To whom correspondence should be addressed.
Computational methods usually conduct one-against-all pairwise
alignments between a newly discovered protein and all known
proteins in the database to suggest protein folds. A known protein
classification database, CATH (Pearl et al., 2003), is constructed
using the SSAP algorithm. Another fold classification database,
FSSP (Holm and Sander, 1994), is built based on the DALI
algorithm. Since these classification works rely on heuristic-based structural alignment algorithms to measure the similarity
of proteins, different heuristics may return divergent classification
results for the same query structure. At present, a human curated
database, SCOP (Murzin et al., 1995), is believed to maintain highly
accurate classification results. Proteins with structural relationships
are hierarchically grouped at the fold level. Even though manual
inspection is more accurate, it is also labor intensive.
A consensus approach intersects classification results from
multiple structural alignment algorithms to automate SCOP-fold
classification (Can et al., 2004). With manually assigned weights
for each individual method, this consensus approach yields an
improved classification accuracy. However, a combination of
structural alignment algorithms is known to be computationally
expensive. With the advent of high-throughput techniques such
as X-ray crystallography and nuclear magnetic resonance (NMR),
the number of known protein structures has rapidly increased in
recent years. To accelerate SCOP-fold classification, there is an
urgent need for an efficient protein structure classification algorithm
with satisfactory accuracy. Our previous work, ProteinDBS (Shyu
et al., 2004), extracts 33 features from 2D distance matrices
generated from protein 3D structures. ProteinDBS is able to quickly
measure the global similarity of two proteins based on the Euclidean
distance of the corresponding feature vectors. Extending the global
similarities used in ProteinDBS, we developed an efficient protein
classification tool, E-Predict (Chi et al., 2006), that assigns newly
discovered proteins to possible SCOP folds in real time. Other
global similarity measurements, such as SGM (Rogen and Fain,
2003), PCC (Zhou et al., 2006) and SSM (Krissinel and Henrick,
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2004) extract global features from protein 3D structures using
the scaled Gauss metric, principal component correlation analysis
and matching graph, respectively. However, the global similarity
cannot be used to classify locally similar proteins with common
substructures.
Several works study retrieving similar protein substructures
without structural alignments. The MUSTA algorithm (Leibowitz
et al., 2001) uses a clustering technique to identify similar
substructures. Huan et al. (2004) model protein tertiary structures
using adjacency matrices from graph theory. These methods are
usually computationally expensive.
Hashing and indexing techniques have been used for improving
the efficiency of protein substructure retrieval. Young et al. (1999)
utilize hashing techniques to identify similar substructures that are
composed of triple secondary structure elements (SSEs) in two
proteins. ProtDex2 (Aung and Tan, 2004) extracts spatial features
such as distances and angles from each pair of SSEs and utilizes an
inverted-file index to improve search efficiency.
Recent research has studied mapping protein tertiary structures
into 1D sequences for fast substructure retrieval (Yang and Tung,
2006). This approach exhibits good efficiency, except that the
1D representation of protein substructures potentially loses the
structural topology. Identical sequences from two proteins may
correspond to dissimilar structures in 3D space. Therefore, the
accuracies are lower than those of detailed structure alignment algorithms
such as DALI and CE. These works show a trade-off: efficient approaches tend to be
less accurate and accurate methods tend to be less efficient. To consider
both accuracy and efficiency of SCOP-fold classification and
retrieval, we have developed an index-based protein substructure
alignment (IPSA) algorithm.
2 METHODS
Protein structure retrievals can be conceptually considered as an application
in the field of information retrieval (IR). For an IR system, types, orders and
locations of terms make up the basic semantics of a document. Analogous to
these concepts of IR, types, orders and topological relationships of protein
substructure units can be identified to assist human inspections of protein
folds.
2.1 Protein substructure unit extraction
Our algorithm compares protein tertiary structures in terms of matching
protein substructure units. In order to extract the protein substructure units for
comparisons, SSEs, Helix (H) and Sheet (E), are first identified from tertiary
structures. In our implementation, the identification of SSEs is conducted by
sequentially matching protein amino acid residues with the H and E templates
of Spatial ARangement of backbone Fragments (SARF; Alexandrov, 1996).
As shown in Figure 1, our substructure unit, $u_i$, consisting of (2×l) amino
acids, is extracted by sliding a window of l amino acids within one SSE
($\mathrm{SSE}_o$) and another window of l amino acids in another SSE ($\mathrm{SSE}_c$). The
substructure unit, $u_i = L_i^o \oplus L_j^c$, is defined as a concatenation of an opening
segment with l amino acids ($L_i^o$) and a closing segment with l amino acids
($L_j^c$), where i and j are the starting residues of these two segments and $j - i \ge l$.
For small proteins with fewer than two SSE assignments, our method slides
windows within the entire protein. For those identified SSEs with more than
l amino acid residues, sliding a window of l-mers by one residue at a time
produces a large number of substructure units. To reduce the search space,
the sliding window of l-mers is shifted every three amino acids. Since SSEs
usually contain more than five amino acids (Carl and John, 1999), l is set
to 5 in order to cover most of the protein secondary structures. With the
sliding window, our algorithm can efficiently find the structural features
and identify two proteins with the same sequence of SSEs but different
overall folds. A large number of protein substructure units can be extracted
from database proteins. In order to efficiently search proteins with similar
substructures, our algorithm first maps protein tertiary structure into a series
of 1D terms by comparing substructure units with our predefined substructure
representatives. The algorithm then indexes these terms of the entire database
proteins for fast searching.

Fig. 1. A substructure unit, $u_i$, is a concatenation of two non-overlapping
amino acid segments, $L_i^o$ and $L_j^c$, where $L_i^o \subseteq \mathrm{SSE}_o$ and $L_j^c \subseteq \mathrm{SSE}_c$.

Algorithm 1 Substructure representative generation
Require: protein database, η, Mtree = NIL
1: for each database protein P in the protein database do
2:   for each substructure unit ui ∈ P do
3:     if Range_Search(ui, Mtree, η) = ∅ then
4:       Insert(ui, Mtree)
5:     end if
6:   end for
7: end for
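The window-sliding extraction of Section 2.1 can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the IPSA implementation: the function names `window_starts` and `extract_units` are hypothetical, SSEs are supplied as half-open residue ranges, and every ordered pair of distinct SSEs is assumed to contribute units. The fallback for proteins with fewer than two SSEs is omitted.

```python
from typing import List, Tuple

def window_starts(sse_start: int, sse_end: int, l: int = 5, step: int = 3) -> List[int]:
    """Start residues of l-residue windows inside the SSE [sse_start, sse_end).

    The window is shifted by `step` residues (three in the text).
    """
    return list(range(sse_start, sse_end - l + 1, step))

def extract_units(sses: List[Tuple[int, int]], l: int = 5, step: int = 3) -> List[Tuple[int, int]]:
    """Return (i, j) start residues of opening/closing segment pairs.

    Assumption: the opening segment comes from an earlier SSE and the closing
    segment from any later SSE; only pairs with j - i >= l are kept.
    """
    units = []
    for a in range(len(sses)):
        for b in range(a + 1, len(sses)):
            for i in window_starts(sses[a][0], sses[a][1], l, step):
                for j in window_starts(sses[b][0], sses[b][1], l, step):
                    if j - i >= l:
                        units.append((i, j))
    return units

# Two hypothetical SSEs covering residues 0-7 and 12-19.
example_units = extract_units([(0, 8), (12, 20)])
```

Each returned pair identifies one substructure unit of 2×l residues; the Cα coordinates of those residues would then form the 6l-D vector used in the following sections.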
2.2 Protein substructure representative generation
A protein substructure representative is conceptually similar to a stemmed
word, which addresses all variations of terms sharing the same root words
in an IR system. Each substructure representative is selected to represent a
group of structurally similar substructure units. To compute RMSD of two
substructure units, our method uses (X, Y, Z) coordinates of the Cα atom
to model each amino acid residue. The substructure unit with (2×l) amino
acids is therefore converted into a (2l ×3)-D vector. RMSD is then measured
by the Kabsch procedure (Kabsch, 1976), which superimposes 6l-D vectors
of two substructure units and optimizes the rotation and translation matrices.
Two substructure units within a preset RMSD threshold of η are considered
to be similar. In our implementation, η is set to 3.0 Å. Our method
utilizes a database indexing tree to quickly identify all unique substructure
representatives in the entire protein database. This global collection of
substructure representatives is named global substructure representatives
(GSRs). The concept of identifying GSRs is to iteratively grow an indexing
tree in a top-down manner. The leaf nodes of the indexing tree maintain
the current GSRs. Algorithm 1 shows the pseudocode for generating GSRs.
Range_Search is a function to query a substructure unit ui , which is a
6l-D vector, against the indexing tree and retrieve current GSRs that are
structurally similar to ui . If ui is structurally similar to any existing GSR with
RMSD ≤ η, it is then considered as a duplicate substructure representative
and ignored. Otherwise, ui will be inserted into the tree as a GSR with a
unique term identifier. The algorithm starts with an empty indexing tree.
The inner for-loop between lines 2 and 6 iteratively checks whether each
substructure unit is a unique GSR or not. We chose an M-tree (Ciaccia
et al., 1997) to index GSRs due to its high scalability for a large number
of vector data. Also, M-trees can efficiently conduct Range_Searches in a
multi-dimensional vector space. Algorithm 1 is applied to parse all protein
tertiary structures in the database and grow a global indexing tree, which
serves as a dictionary of all substructure representatives.
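The logic of Algorithm 1 can be illustrated with a small Python sketch. It is a simplified stand-in, not the authors' code: a linear scan over a list replaces the M-tree, the function names are hypothetical, and the distance function is passed in as a parameter (in IPSA it would be the RMSD after Kabsch superposition). The toy example uses 1D numbers with absolute difference as the distance.

```python
from typing import Callable, List, TypeVar

U = TypeVar("U")

def range_search(query: U, representatives: List[U], eta: float,
                 dist: Callable[[U, U], float]) -> List[int]:
    """Indices of representatives within distance eta of the query.

    Assumption: a linear scan stands in for the M-tree range search.
    """
    return [k for k, rep in enumerate(representatives) if dist(query, rep) <= eta]

def build_representatives(units: List[U], eta: float,
                          dist: Callable[[U, U], float]) -> List[U]:
    """Greedy selection following Algorithm 1: a unit becomes a new
    representative only if no existing representative is within eta."""
    representatives: List[U] = []          # starts empty, like the empty tree
    for u in units:
        if not range_search(u, representatives, eta, dist):
            representatives.append(u)      # list index serves as the term identifier
    return representatives

# Toy 1D example with absolute difference as a hypothetical distance.
toy_gsrs = build_representatives([0.0, 1.0, 5.0, 5.5, 9.0], eta=3.0,
                                 dist=lambda a, b: abs(a - b))
```

Because the selection is greedy, the resulting set depends on the order in which units are visited; an index such as the M-tree changes the cost of each range search but not this logic.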
2.3 Mapping protein substructure unit into terms
The major purpose of creating the global indexing tree is to translate
protein tertiary structure units into a series of terms, which makes fast
substructure retrieval possible. As discussed previously, each protein,
P, can be decomposed into a sequential set of $n_a$ substructure units,
$S_P = \{u_1, u_2, \ldots, u_{n_a}\}$. One intuitive way of translating a protein tertiary
structure into terms is to conduct a Range_Search against the global indexing
tree for each unit, $u_i$. From the search result, the unique term identifiers
of returned GSRs are then assigned to $u_i$. By iteratively searching in the
global indexing tree, the $n_a$ substructure units of P are mapped into a list of
$n_b$ terms, $f : S_P \to T = \{t_1, t_2, \ldots, t_{n_b}\}$, where $n_b \ge n_a$. Since a Range_Search
usually needs to execute computationally expensive RMSD comparisons for
thousands of GSRs, the efficiency of the term mapping process needs to be
addressed.

Fig. 2. Two-layer term mapping: local mapping is to map an individual protein into a series of LSRs that are indexed in a local indexing tree; global mapping
is to map each LSR into term identifiers of similar GSRs.
To tackle the efficiency issue, we introduce a two-layer term mapping
mechanism. The first layer constructs a local indexing tree by
applying lines 2–6 of Algorithm 1 to a single protein, P. The purpose of
constructing a local indexing tree is to generate a set of local substructure
representatives (LSRs) for P. Instead of directly querying thousands of
substructure units against the global indexing tree in the second layer, the
algorithm queries only hundreds of LSRs. We refer to Figure 2 to discuss
examples of the term mapping process in this section. The algorithm of
the first layer maps a group of similar substructure units from P to an LSR,
which is then queried against the global indexing tree in the second layer
by conducting a Range_Search. The term identifiers of GSRs in the search
result are indirectly assigned to the group of similar protein substructure units
via the LSR. For example, the table in the upper-right corner of Figure 2
shows that u1 , u3 and u5 of protein P are similar to a local substructure
representative, LSR1 . The algorithm conducts a Range_Search to query
LSR1 in the global indexing tree; the returned term identifiers of similar
GSRs, 1, 2 and 5, are associated with the local representative, LSR1, and
are assigned to each of the substructure units u1, u3 and u5. Using the same
procedure, all substructure units in the protein P are converted into a list
of terms. Normally, a local indexing tree is much smaller than
the global tree. Therefore, the efficiency of term mapping is greatly
increased.
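The two-layer mapping described above can be sketched as follows. This is an illustrative sketch only: `map_terms` is a hypothetical name, linear scans stand in for the local and global indexing trees, list indices serve as term identifiers, and the distance function is supplied by the caller.

```python
from typing import Callable, List, TypeVar

U = TypeVar("U")

def map_terms(units: List[U], global_reps: List[U], eta: float,
              dist: Callable[[U, U], float]) -> List[List[int]]:
    """Return, for each unit of one protein, the GSR term identifiers
    inherited from the unit's local substructure representative (LSR)."""
    local_reps: List[U] = []      # first layer: LSRs of this protein
    unit_to_lsr: List[int] = []   # index of the LSR that represents each unit
    for u in units:
        hit = next((k for k, rep in enumerate(local_reps)
                    if dist(u, rep) <= eta), None)
        if hit is None:           # no similar LSR yet: u becomes a new LSR
            local_reps.append(u)
            hit = len(local_reps) - 1
        unit_to_lsr.append(hit)
    # Second layer: a single global range search per LSR, not per unit.
    lsr_terms = [[t for t, g in enumerate(global_reps) if dist(rep, g) <= eta]
                 for rep in local_reps]
    return [lsr_terms[k] for k in unit_to_lsr]

# Toy 1D example: three units collapse to two LSRs before the global search.
toy_terms = map_terms([0.5, 1.0, 7.0], [0.0, 4.0, 8.0], eta=3.0,
                      dist=lambda a, b: abs(a - b))
```

In the toy example the unit 1.0 inherits its terms from the LSR 0.5, so it is never compared with the global representatives directly. This is where the saving comes from, at the price of an approximation, since a direct search for 1.0 would also have matched the representative 4.0.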
2.4 Query processing
Given a newly discovered protein chain, the query process applies similar
two-layer term mapping procedures as discussed in Section 2.3. However,
instead of conducting a Range_Search in the global indexing tree for each
LSR, this query process searches only the nearest neighbor from the global
tree to reduce the number of query terms, which will be submitted to a
computationally expensive, on-line ranking procedure.
2.5 On-line retrieval and ranking
Once protein structures are converted into terms, it is intuitive to apply
existing IR algorithms of inverted-file index that use ‘bag of words’
or ‘syntactic characterization’ approaches. We construct a data structure,
namely, the inverted-protein index, for fast on-line retrieval. This index
supports a two-layer linkage: (i) associating a term identifier $t_i$ with the subset of
database proteins that have $t_i$, $\{P_1, P_2, \ldots, P_n\}$; and (ii) referencing each protein
$P_j$ in this subset to a set of protein substructure units in $P_j$,
$\{u^{P_j}_{t_i,1}, u^{P_j}_{t_i,2}, \ldots, u^{P_j}_{t_i,n}\}$,
that maintains sequential occurrences of $t_i$. The element, $u^{P_j}_{t_i,m}$, denotes 3D
coordinates of the m-th occurrence of ti in Pj . With the inverted-protein
index, instead of one-against-all comparisons, our algorithm only needs
to compute structural similarities between a query protein and a subset of
database proteins that share common terms. In our implementation, the global
indexing tree and the inverted-protein index reside in memory for fast on-line
retrieval.
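One possible in-memory layout for such an inverted-protein index is sketched below. The nested-dictionary design and the names `build_inverted_index` and `candidate_proteins` are assumptions made for illustration; the source does not specify the concrete data structure.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set, Tuple

def build_inverted_index(protein_terms: Dict[str, List[Tuple[Hashable, object]]]):
    """Build term -> protein -> ordered list of occurrence coordinates.

    protein_terms maps a protein identifier to its (term_id, coords) pairs
    in sequence order, so each inner list keeps the sequential occurrences.
    """
    index = defaultdict(lambda: defaultdict(list))
    for protein_id, occurrences in protein_terms.items():
        for term_id, coords in occurrences:
            index[term_id][protein_id].append(coords)
    return index

def candidate_proteins(index, query_terms: Iterable[Hashable]) -> Set[str]:
    """Database proteins sharing at least one term with the query."""
    found: Set[str] = set()
    for term_id in query_terms:
        found.update(index.get(term_id, {}))   # .get avoids creating new entries
    return found

# Hypothetical toy data; strings stand in for 3D coordinate arrays.
toy_index = build_inverted_index({
    "P1": [(1, "a"), (2, "b"), (1, "c")],
    "P2": [(3, "d")],
})
```

Only the proteins returned by `candidate_proteins` need to be superimposed on the query, which is the source of the saving over one-against-all comparison.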
Starting from a simple case, we discuss how to compute the structural
similarity between a query protein and a database protein. When a term
has only one occurrence in these two proteins, our algorithm superimposes
these two substructure units with the same term and computes the optimized
rotation matrix and translation vector using the Kabsch procedure (Kabsch,
1976). By applying the rigid transformation based on the previously
computed rotation matrix and translation vector, the entire polypeptide chains
of these two proteins can be overlapped in 3D space. The algorithm then
checks the one-to-one correspondence of amino acid residues on the two
overlapped proteins and measures the structural similarity using chain-to-chain alignment.
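The superposition step can be sketched with NumPy as follows. This is a textbook formulation of the Kabsch procedure using a singular value decomposition, offered as an illustration rather than the authors' code; the function names are hypothetical and NumPy is an assumed dependency.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t that best map points P onto Q
    (rows are 3D points in one-to-one correspondence)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)            # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

def superimpose(chain, R, t):
    """Apply the rigid transformation to every residue of a whole chain."""
    return np.asarray(chain, dtype=float) @ R.T + t

def rmsd(A, B):
    """Root mean square deviation between corresponding rows of A and B."""
    diff = np.asarray(A, dtype=float) - np.asarray(B, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

In the setting of this section, `kabsch` would be called on the two matched substructure units, and `superimpose` would then move the entire query chain with the resulting `R` and `t` before chain-to-chain alignment.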
2.5.1 Chain-to-Chain alignment Once a query protein has been rotated,
translated and superimposed on a database protein, our algorithm applies a
dynamic programming technique to find the longest chain alignment at the
amino acid level. Let X = {x1 ,x2 ,...,xnX } be a query protein with nX amino
acid residues and Y = {y1 ,y2 ,...,ynY } be a database protein with nY residues.
In our chain-to-chain alignment, xi is aligned with yj under the condition that
the Euclidean distance between their Cα atoms, dist(xi ,yj ), is less than γ Å.
By empirical observations, γ is set to 4.5 Å. Let θ[i,j] denote the alignment
length for two subsets of amino acid residues {x1 ,x2 ,...,xi } and {y1 ,y2 ,...,yj }.
The procedure that iteratively finds the longest chain alignment is described
as follows:
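\[
\theta[i,j] =
\begin{cases}
0, & \text{if } i = 0 \text{ or } j = 0 \\
\theta[i-1,j-1] + 1, & \text{if } \mathrm{dist}(x_i, y_j) \le \gamma \\
\max(\theta[i-1,j],\ \theta[i,j-1]), & \text{if } \mathrm{dist}(x_i, y_j) > \gamma
\end{cases}
\]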
From the alignment result, the longest alignment length NA and the RMSD
value are used to compute the structure similarity between a query protein X
and a database protein Y that contains nY amino acid residues. A similarity
score function is defined in the following equation.
\[
\text{similarity score}(X,Y) = \frac{N_A^2}{n_Y} \times \frac{N_A}{\mathrm{RMSD} + 1.0} \qquad (1)
\]
Longer alignment lengths mean that larger common substructures potentially
exist in X and Y. An additional 1.0 is added to the RMSD value to avoid the
singular condition. In our current implementation, NA is highly weighted by
taking a square in the first ratio of Equation 1. Since a longer protein may
potentially result in a longer alignment than a small protein, our scoring
function is normalized by nY . Also, the structural variation of aligned amino
acid residues is penalized based on the RMSD value. The algorithm ranks
a subset of database proteins with matched terms based on the similarity
scores.
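The recurrence for θ[i,j] and the score of Equation (1) can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; it assumes that the two chains are already superimposed, that residues are given as Cα coordinate tuples, and that the RMSD of the aligned residues is computed elsewhere and passed in.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]

def chain_alignment_length(X: Sequence[Point], Y: Sequence[Point],
                           gamma: float = 4.5) -> int:
    """Alignment length theta[nX][nY] from the recurrence in Section 2.5.1."""
    n_x, n_y = len(X), len(Y)
    theta = [[0] * (n_y + 1) for _ in range(n_x + 1)]   # row/column 0 stay 0
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            if math.dist(X[i - 1], Y[j - 1]) <= gamma:
                theta[i][j] = theta[i - 1][j - 1] + 1
            else:
                theta[i][j] = max(theta[i - 1][j], theta[i][j - 1])
    return theta[n_x][n_y]

def similarity_score(n_aligned: int, n_y: int, rmsd: float) -> float:
    """Equation (1): (N_A^2 / n_Y) * (N_A / (RMSD + 1.0))."""
    return (n_aligned ** 2 / n_y) * (n_aligned / (rmsd + 1.0))

# Toy chains: the first two residues pair up, the third pair is too far apart.
toy_x = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_y = [(1.0, 0.0, 0.0), (11.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
toy_length = chain_alignment_length(toy_x, toy_y)
```

The table has (nX + 1) × (nY + 1) cells, so each comparison costs time proportional to the product of the two chain lengths, which is why restricting the comparison to candidates from the inverted-protein index matters.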
Algorithm 2 LSA (AlignedSet[i,j] denotes the aligned coordinate set of cell (i,j))
1: for i = 1 to 2m do
2:   for j = 1 to 2n do
3:     if i = 1 or j = 1 then
4:       AlignedSet[i,j] = ∅
5:       z[i,j] = 0
6:     else
7:       if i%2 = 0 and j%2 = 0 then
8:         if RMSD(AlignedSet[i−1,j−1] ∪ {t^X_{i/2}, t^Y_{j/2}}) ≤ γ then
9:           AlignedSet[i,j] = AlignedSet[i−1,j−1] ∪ {t^X_{i/2}, t^Y_{j/2}}
10:          z[i,j] = z[i−1,j−1]+1
11:        else
12:          AlignedSet[i,j] = {t^X_{i/2}, t^Y_{j/2}}
13:          z[i,j] = 1
14:        end if
15:      else
16:        if z[i−1,j] < z[i,j−1] then
17:          AlignedSet[i,j] = AlignedSet[i,j−1]
18:          z[i,j] = z[i,j−1]
19:        else if z[i−1,j] > z[i,j−1] then
20:          AlignedSet[i,j] = AlignedSet[i−1,j]
21:          z[i,j] = z[i−1,j]
22:        else
23:          if RMSD(AlignedSet[i−1,j]) < RMSD(AlignedSet[i,j−1]) then
24:            AlignedSet[i,j] = AlignedSet[i−1,j]
25:            z[i,j] = z[i−1,j]
26:          else
27:            AlignedSet[i,j] = AlignedSet[i,j−1]
28:            z[i,j] = z[i,j−1]
29:          end if
30:        end if
31:      end if
32:    end if
33:  end for
34: end for

2.5.2 Term-to-term alignment In reality, each term usually has multiple
occurrences in both a query protein and a database protein. In order to match
substructure units of one protein with units of the other protein, our algorithm
performs substructure alignments with topological constraints using a variant
of dynamic programming. Given a term t, the algorithm first sequentially
identifies all occurrences of t in X, Xt = {t1X ,t2X ,...,tmX }, where tmX denotes
the 3D coordinates of the m-th occurrence of t in X. The algorithm then
accesses the inverted-protein index to find a subset of database proteins that
are associated with t. Such a database protein Y can be represented by an
ordered sequence of all occurrences of t, Yt = {t1Y ,t2Y ,...,tnY }, where tnY is the
3D coordinates of the n-th t in Y . Our method employs a customized dynamic
programming technique for finding the longest substructure alignment (LSA)
between Xt and Yt. The algorithm first creates two data structures: (i) an
aligned coordinate set that keeps one-to-one corresponding coordinates
of aligned substructures between X and Y and (ii) an alignment length
matrix $z_{2n \times 2m}$ with dummy columns and rows between two consecutive
occurrences of t in both X and Y, as shown in Supplementary Figure S1.
The first row and column are initialized with zeros. There are two
types of cells in the matrix, namely term–term and dummy cells. A term–term
cell corresponds to a co-occurrence of terms from both X and Y, while a dummy
cell lies on either a dummy row (−) or a dummy column (−).
The rationale of inserting dummy elements is to pass the best alignment
results of previous elements for use by later elements. The ‘if statement’
at line 7 in Algorithm 2 distinguishes these two types of cells. Lines 8–14
deal with term–term cells for growing the alignment and lines 16–30 handle
dummy cells for updating the currently best alignment result. Let z[i,j] be
the alignment length for aligning $\{t^X_1, t^X_2, \ldots, t^X_{i/2}\}$ and $\{t^Y_1, t^Y_2, \ldots, t^Y_{j/2}\}$,
where i ≤ 2m and j ≤ 2n. The 3D coordinates of aligned substructures are
kept in the aligned coordinate set of cell (i,j) with a corresponding RMSD value, which is measured from
the aligned substructures from both proteins. From lines 8–14, z[i,j] is
increased by one when the 3D coordinates in the aligned coordinate set of cell (i−1,j−1), joined with $\{t^X_{i/2}, t^Y_{j/2}\}$, can
be superimposed within an RMSD threshold, γ. Otherwise, z[i,j] is assigned
to 1. Supplementary Figure S1 shows an example to explain the alignment
process. Since the first pair of substructures $\{t^X_1, t^Y_1\}$ in Xt and Yt are able
to be aligned within an RMSD threshold, the alignment length z[2,2] = 1.
Another alignment result, z[4,4] = 2, is due to the fact that two protein
substructures, {t1X ,t1Y } and {t2X ,t2Y }, can be superimposed within an RMSD
threshold. When adding a new pair of substructures in the current alignment
results in an RMSD that exceeds γ, the algorithm keeps the new substructures
and assigns 1 to the cell, such as z[6,6] in Supplementary Figure S1. From
lines 16–30 of Algorithm 2, a dummy element z[i,j] propagates alignment
results from the previous stage by taking the maximal values from two
neighbors, z[i−1,j] (↓) and z[i,j −1] (→), as depicted in cell z[4,5] of
Supplementary Figure S1. If these two neighbors have the same alignment
length, the one with a smaller RMSD will be chosen.
\[
\text{LSA score}(i,j) = \frac{z[i,j]}{\mathrm{RMSD} + 1.0} \qquad (2)
\]
In the process of aligning Xt and Yt , we define a scoring function in
Equation (2) to evaluate the quality of the alignment at cell (i,j) based on the alignment
length z[i,j] and an RMSD value. To provide better accuracy in protein-fold classification, the top 1 result of term-to-term alignment is utilized to
determine a candidate set of rotation matrices and translation vectors, which
are used for superimposing two proteins and computing the similarity score
in the chain-to-chain alignment.
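The control flow of Algorithm 2 together with the score of Equation (2) can be sketched in Python as follows. Several points are assumptions made for illustration and are labeled in the code: the function names are hypothetical, `toy_rmsd` is a 1D stand-in for the Kabsch RMSD (it treats each occurrence as a single number), and returning the cell with the highest LSA score is one reading of how the top 1 result is selected.

```python
import math

def toy_rmsd(pairs):
    """Hypothetical 1D stand-in for the Kabsch RMSD: residual of (y - x)
    after removing the best translation. Returns 0.0 for an empty set."""
    if not pairs:
        return 0.0
    diffs = [y - x for x, y in pairs]
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))

def lsa(xt, yt, gamma, rmsd_fn=toy_rmsd):
    """Follow Algorithm 2 for one term. Returns (score, length, aligned pairs)
    of the cell with the highest LSA score (assumed selection rule)."""
    m, n = len(xt), len(yt)
    rows, cols = 2 * m + 1, 2 * n + 1            # index 0 unused: cells are 1-based
    z = [[0] * cols for _ in range(rows)]
    aligned = [[[] for _ in range(cols)] for _ in range(rows)]
    best = (0.0, 0, [])
    for i in range(1, rows):
        for j in range(1, cols):
            if i == 1 or j == 1:
                continue                          # lines 3-5: empty set, length 0
            if i % 2 == 0 and j % 2 == 0:         # line 7: term-term cell
                pair = (xt[i // 2 - 1], yt[j // 2 - 1])
                grown = aligned[i - 1][j - 1] + [pair]
                if rmsd_fn(grown) <= gamma:       # lines 8-10: grow the alignment
                    aligned[i][j], z[i][j] = grown, z[i - 1][j - 1] + 1
                else:                             # lines 12-13: restart from this pair
                    aligned[i][j], z[i][j] = [pair], 1
            else:                                 # lines 16-30: dummy cell
                up, left = (i - 1, j), (i, j - 1)
                if z[i - 1][j] < z[i][j - 1]:
                    src = left
                elif z[i - 1][j] > z[i][j - 1]:
                    src = up
                elif rmsd_fn(aligned[i - 1][j]) < rmsd_fn(aligned[i][j - 1]):
                    src = up
                else:
                    src = left
                aligned[i][j], z[i][j] = aligned[src[0]][src[1]], z[src[0]][src[1]]
            score = z[i][j] / (rmsd_fn(aligned[i][j]) + 1.0)   # Equation (2)
            if score > best[0]:
                best = (score, z[i][j], aligned[i][j])
    return best

# Toy input: the first two occurrences share a common shift, the third does not.
toy_result = lsa([0.0, 10.0, 20.0], [5.0, 15.0, 60.0], gamma=1.0)
```

In a real setting `rmsd_fn` would superimpose the 3D coordinates of all aligned substructures, and the pairs of the winning cell would supply the rotation matrix and translation vector used for chain-to-chain alignment.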
2.6 Further refinement
During the term mapping procedure (see Section 2.3), Range_Search is
utilized to find a structurally similar unit with RMSD ≤ η. However, during
experiments, we found that some proteins with dissimilar structure are also
mapped to the same term, which would increase response time of IPSA.
To further improve the efficiency of IPSA, we introduce type, angle and
distance of substructure unit as new criteria into Range_Search. The refined
IPSA algorithm is called IPSA*.
Given a substructure unit $u_i$ consisting of $\mathrm{SSE}_o$ and $\mathrm{SSE}_c$, its type, denoted
as $T_i$, is a concatenation of the types of $\mathrm{SSE}_o$ and $\mathrm{SSE}_c$ (Helix or Sheet). To
compute the angle and distance of a substructure unit, we adopt the methods
from Singh and Brutlag (1997). First, each SSE in the substructure unit is
represented by its start point $A_{start}$ and end point $A_{end}$. Then, the unit direction
vector of each SSE is calculated with the start and end points as
\[
V = \frac{A_{start} - A_{end}}{\|A_{start} - A_{end}\|}
\]
For substructure unit $u_i$, its angle is calculated by taking the inverse cosine
of the dot product of the two unit direction vectors. The distance $D_i$ is set to
the minimum value of the distances between corresponding start and end points
of the two SSEs.
With these new features, two substructure units $u_i$ and $u_j$ are considered
similar if the following four conditions are met: (i) $T_j = T_i$, (ii) their angles differ by less than an angle threshold,
(iii) $|D_j - D_i|$ is less than a distance threshold and (iv) RMSD ≤ η. Note that IPSA* is intended only for
queries that require a quick response and can tolerate lower accuracy.
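The additional IPSA* features can be illustrated with the following Python sketch. It is an interpretation under stated assumptions rather than the authors' code: the function names are hypothetical, each SSE is given as a (start point, end point) pair, the distance is read as the smaller of the start-to-start and end-to-end distances, and the angle and distance thresholds are caller-supplied parameters because their values are not given in the text.

```python
import math

def unit_direction(start, end):
    """V = (A_start - A_end) / ||A_start - A_end||."""
    diff = [s - e for s, e in zip(start, end)]
    norm = math.sqrt(sum(c * c for c in diff))
    return [c / norm for c in diff]

def sse_angle(sse_o, sse_c):
    """Angle in degrees between the direction vectors of the two SSEs."""
    v_o = unit_direction(sse_o[0], sse_o[1])
    v_c = unit_direction(sse_c[0], sse_c[1])
    dot = sum(a * b for a, b in zip(v_o, v_c))
    dot = max(-1.0, min(1.0, dot))       # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

def sse_distance(sse_o, sse_c):
    """Assumed reading: minimum of start-to-start and end-to-end distances."""
    return min(math.dist(sse_o[0], sse_c[0]), math.dist(sse_o[1], sse_c[1]))

def is_similar(feat_i, feat_j, rmsd, eta, angle_tol, dist_tol):
    """Four-condition test; features are (type, angle, distance) tuples.

    angle_tol and dist_tol are hypothetical threshold parameters.
    """
    type_i, angle_i, dist_i = feat_i
    type_j, angle_j, dist_j = feat_j
    return (type_j == type_i
            and abs(angle_j - angle_i) < angle_tol
            and abs(dist_j - dist_i) < dist_tol
            and rmsd <= eta)

# Two perpendicular toy SSEs, five units apart along the z axis.
toy_o = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
toy_c = ((0.0, 0.0, 5.0), (0.0, 1.0, 5.0))
```

Because the type, angle and distance comparisons are cheap, they can be evaluated before the RMSD computation so that clearly dissimilar units are rejected early; whether IPSA* orders the checks this way is not stated in the text.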
3 COMPUTATIONAL RESULTS
In this section, we investigate both the efficiency and accuracy
for SCOP-fold classification and protein structure retrieval. From
an evaluation work (Novotny et al., 2004), both DALI and CE
are considered as accurate protein-fold comparison methods. Our
proposed IPSA algorithms are compared with these two well-recognized algorithms and other representative algorithms, which
include MAMMOTH (Ortiz et al., 2002), MultiProt (Shatsky
et al., 2004) and SSM (Krissinel and Henrick, 2004), using non-redundant protein data. The consensus approach (Can et al., 2004)
is not compared since the weight for each individual alignment
algorithm is data dependent.
3.1 Non-redundant protein data
With proteins from the Protein Data Bank (PDB) (Berman et al.,
2000), SCOP manually classifies structurally similar proteins into
folds. Since structural similarity could be captured using sequence
alignment tools, classifying redundant proteins which have high
sequence similarities is basically considered to be a trivial case of
fold classification. We therefore use non-redundant protein data for
performance evaluation. The non-redundant dataset is selected using
the latest version of PDBselect (Hobohm and Sander, 1994) with
<25% sequence similarity. Small proteins with less than 20 amino
acids are also excluded.

Table 1. The number of non-redundant proteins in the test sets (v2 − v1) and in the database sets (v1)

Sets   Test proteins                         Database proteins
I      640 proteins from v1.69 − v1.67       2767 proteins from SCOP v1.67
II     607 proteins from v1.71 − v1.69       3015 proteins from SCOP v1.69
III    610 proteins from v1.73 − v1.71       3276 proteins from SCOP v1.71

Table 2. The average response time of IPSA*, MAMMOTH, IPSA,
MultiProt, CE and DALI for SCOP-fold classification per query

Algorithm    Average response time (s)   Speedup of IPSA*
IPSA*        182                         –
MAMMOTH      280                         1.64
IPSA         402                         2.21
MultiProt    655                         3.60
CE           3071                        16.87
DALI         9665                        53.10
Our dataset, shown in Table 1, consists of two parts:
(i) test proteins for querying SCOP fold and (ii) database proteins.
Given two consecutive SCOP releases v1 and v2 (v1 ⊂ v2), the difference
v2 − v1 denotes the set of newly discovered proteins in v2 that have not
been identified in v1. To mimic the SCOP-fold classification, we use
the newly discovered proteins to construct the test proteins, which
include at least one non-redundant and representative protein from
each superfamily existing in v2 − v1. Similarly, the database proteins are
constructed to ensure at least one non-redundant and representative
protein from each superfamily in the entire SCOP space.
3.2 Efficiency
Due to the time complexity of the DALI algorithm, we evaluate the
efficiency of SCOP-fold classification and retrieval on a single server
using sampled non-redundant test proteins, which are selected
from the protein dataset (III) listed in Table 1 based on the distribution
of the number of amino acids in each protein. The list of the sampled
proteins is shown in Supplementary Table S1.
We perform 3-fold experiments to compare the efficiency of IPSA
with other algorithms, each using 50 proteins randomly selected
from set (III). We measure the average response time over the total of 150
proteins to evaluate the efficiency of SCOP-fold classification and
structure retrieval. The experiments are conducted on a Linux Fedora
server with AMD Opteron dual-core 1000 series processors and
2 GB RAM. Table 2 shows that the fastest algorithm is IPSA*, which
is 1.64, 16.87 and 53.10 times faster than MAMMOTH, CE and
DALI, respectively.
Fig. 3. The plots of (a) CCR and (b) $F^{\mathrm{Normalized}}_{\mathrm{Score}(q)}$ for IPSA, IPSA*, DALI, MAMMOTH, CE, MultiProt and SSM using the protein sets (I), (II) and (III) in
Table 1.
It is noted that SSM is not included in this experiment as it has
some failed cases which make the measurement of response time
inaccurate.
3.3 Accuracy—SCOP-fold classification
SCOP-fold classification categorizes a test protein into a specific
fold. In our experiment, the test protein is classified into the same
fold as the top-ranked database protein. To evaluate the accuracy
of SCOP-fold classification, we use a general metric, correct
classification rate (CCR), which is defined as follows:
\[
\mathrm{CCR} = \frac{\text{The number of correctly classified proteins}}{\text{The total number of test proteins}} \qquad (3)
\]
Figure 3a presents the CCR performance comparison among IPSA,
IPSA*, DALI, CE, MAMMOTH, MultiProt and SSM using the
protein sets (I), (II) and (III) listed in Table 1. Intuitively, the
optimal accuracy of SCOP-fold classification is 100% CCR. Our
classification results for the three sets show that IPSA exhibits
85.31%, 87.31% and 89.65% CCR. These accuracies are comparable
with those of DALI: 85.58%, 88.47% and 88.24% CCR. Also, IPSA
outperforms MAMMOTH and CE in both protein datasets for the
SCOP-fold classification by at least 3.28% and 7.74%, respectively.
As IPSA utilizes substructures extracted from PDB entries to classify
SCOP folds, we also investigate the relationship between the accuracy
and the quality of the compared structures by randomly selecting 50 folds
from set (III) and calculating the average CCR and PDB resolution in
each fold. The results are shown in Supplementary Figure S2. The
correlation coefficient is r = 0.06, which indicates no significant linear
association between the accuracy and resolution.
It is noted that we use the scoring function of IPSA chain-to-chain
alignment to rank the alignment results of MultiProt as it does not
have such function. In addition, when running SSM, we have 82,
106 and 133 failed test cases for the protein sets (I), (II) and (III),
respectively. All the failed cases are excluded from the ‘total number
of test proteins’ in Equation (3).
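The top-ranked classification rule and Equation (3) amount to the following short Python sketch. The function names and the fold labels in the example are hypothetical, and ranked hits are assumed to be (protein identifier, fold label, similarity score) tuples sorted from best to worst.

```python
from typing import List, Sequence, Tuple

def classify_fold(ranked_hits: Sequence[Tuple[str, str, float]]) -> str:
    """Assign the fold label of the top-ranked database protein."""
    return ranked_hits[0][1]

def correct_classification_rate(predicted: List[str], actual: List[str]) -> float:
    """Equation (3): correctly classified proteins / total test proteins."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical example with four test proteins, three classified correctly.
toy_ccr = correct_classification_rate(["a.1", "b.2", "c.3", "a.1"],
                                      ["a.1", "b.2", "c.4", "a.1"])
```

Test cases on which a method fails to return a result would be dropped from both lists before the rate is computed, in line with the treatment of SSM described above.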
3.4 Accuracy—SCOP-fold retrieval
Our experiment utilizes precision–recall (PR) curves and the
F-measure (van Rijsbergen, 1979) to gauge the accuracy of SCOP-fold
retrieval. A retrieved database protein is relevant when its
SCOP-fold label matches the fold label of the query protein;
otherwise, it is irrelevant. If there are k database proteins with
the same fold as a query protein, an ideal retrieval should rank
these proteins in the top k results. Precision and Recall, which have
been discussed previously, are two standard performance measurements
for evaluating an IR system. The average PR curves of protein sets
(I), (II) and (III) are shown in Supplementary Figure S3. From the
figure, we can see that IPSA and IPSA* have higher precision at all
recall levels.
Given a query protein, q, the F-measure shown in Equation (4)
considers both Precision and Recall for the i-th relevant protein.
$$F(i,q) = \frac{2 \times \mathrm{Precision}(i) \times \mathrm{Recall}(i)}{\mathrm{Precision}(i) + \mathrm{Recall}(i)} \qquad (4)$$
Since there may exist more than one relevant protein for a query
protein, q, we define a normalized, single measurement,
$FScore(q)^{\mathrm{Normalized}}$, in Equation (5), which takes the
average of the sum of each individual F-measure and normalizes this
average sum into a value between 0 and 1. When relevant proteins are
highly ranked, a high $FScore(q)^{\mathrm{Normalized}}$ is expected.
$$FScore(q)^{\mathrm{Normalized}} = \frac{\sum_{i=1}^{n_q} F(i,q)}{2 \times \sum_{i=1}^{n_q} \frac{i}{i+n_q}} \qquad (5)$$
Figure 3b shows the plots of $FScore(q)^{\mathrm{Normalized}}$ for
IPSA, IPSA*, DALI, CE, MAMMOTH and MultiProt using the protein sets
(I), (II) and (III) listed in Table 1. SSM is excluded as it only
outputs the top-ranked result. From the results, DALI presents the
best retrieval accuracies, 72.59%, 70.97% and 72.84%, while IPSA
exhibits competitive retrieval accuracies, 70.14%, 68.40% and 71.57%.
In addition, IPSA outperforms MAMMOTH and CE on the protein
datasets with retrieval accuracies that are better by at least 11.33%
and 13.10%, respectively.
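The retrieval measures of Equations (4) and (5) can be sketched as follows. The sketch rests on several assumptions that the text does not spell out. Precision(i) and Recall(i) are evaluated at the rank where the i-th relevant protein appears. The symbol n_q is taken to be the number of database proteins sharing the fold of the query q. The denominator of Equation (5) is read as the sum of F-measures obtained when all n_q relevant proteins occupy the top n_q ranks. The function names are hypothetical.

```python
from typing import List, Sequence


def f_measures(ranked_relevance: Sequence[bool], n_q: int) -> List[float]:
    """F(i, q) of Equation (4) at the rank of each retrieved relevant protein."""
    values = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if not is_relevant:
            continue
        hits += 1
        precision = hits / rank   # relevant proteins seen so far / rank
        recall = hits / n_q       # relevant proteins seen so far / all relevant
        values.append(2 * precision * recall / (precision + recall))
    return values


def normalized_fscore(ranked_relevance: Sequence[bool], n_q: int) -> float:
    """Normalized FScore(q) of Equation (5), a value between 0 and 1."""
    if n_q <= 0:
        raise ValueError("n_q must be positive")
    # Sum of F-measures for an ideal ranking: F(i, q) = 2i / (i + n_q).
    ideal = 2 * sum(i / (i + n_q) for i in range(1, n_q + 1))
    return sum(f_measures(ranked_relevance, n_q)) / ideal


# Toy ranked lists (True = same fold as the query), with n_q = 2.
perfect = normalized_fscore([True, True, False, False], n_q=2)
mixed = normalized_fscore([False, True, False, True], n_q=2)
print(perfect, mixed)
```

Under this reading, an ideal ranking scores 1, and the score drops as relevant proteins move down the list; relevant proteins that are never retrieved contribute nothing to the numerator.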
3.5 Structural alignment
To provide a fair comparison of structural alignment, we use a
set of protein pairs that have been used to evaluate the algorithms
MAMMOTH (Ortiz et al., 2002), SHEBA (Jung and Lee, 2000),
DALI (Holm and Sander, 1993), MLC (Boutonnet et al., 1995),
VAST (Gibrat et al., 1996) and ProSup (Lackner et al., 2000).
The number of aligned residues after optimal structural alignment
is shown in Supplementary Table 2, which is taken from Table 1
of the MAMMOTH paper (Ortiz et al., 2002). The table shows that IPSA
finds the longest alignment in 13 of the 19 cases among all methods
reported in Ortiz et al. (2002). One interesting case is the
1acx-1tnfA pair. Both MAMMOTH and SHEBA fail to find the correct structural
alignment as reported by others, but IPSA is able to find a solution
close to DALI.
4 DISCUSSIONS
In this article, we introduce the IPSA algorithm to efficiently
compare protein structures and retrieve proteins with similar
substructures. The algorithm first builds a two-layer indexing tree
to convert protein substructures into terms, and then aligns these
terms and chains to ensure one-to-one correspondence of amino
acids between the two overlapped proteins. The experimental results
show that IPSA achieves speedups of over an order of magnitude and
approximately equal accuracy compared with DALI.
One of the key applications of a large-scale method for
protein structure comparison is the study of the molecular evolution
of the protein repertoire. To understand the full picture of the
structural relationships between proteins, it would be desirable to
perform an all-against-all structural comparison of the currently
solved protein structures in an organism. Specifically, one would be
interested in determining locally conserved substructures for each
pair of proteins. According to the UniProtKB/Swiss-Prot Human Proteome
Initiative, there are 25 686 unique protein structures currently
available. This corresponds to about 338 million protein–protein
structural comparisons. Using a current method for local structural
alignment such as DALI, and estimating the running time for a single
protein–protein structure comparison to be 3 s, the whole task would
require running for 112 days on a 100-node CPU cluster. The efficiency
of the proposed algorithm would reduce the running time to 2 days,
making this task feasible in the near future.
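The order of magnitude of this estimate can be reproduced with a short back-of-the-envelope calculation. The sketch below is an approximation under stated assumptions, not the authors' own computation. It counts unordered pairs without self-comparisons, assumes perfect parallel scaling over the cluster nodes, and applies the 53.10-fold speedup over DALI reported in the abstract uniformly to every comparison. The function name is hypothetical.

```python
from typing import Tuple

SECONDS_PER_DAY = 86_400


def all_against_all_days(
    n_structures: int,
    seconds_per_pair: float,
    n_nodes: int,
    speedup: float = 1.0,
) -> Tuple[int, float]:
    """Return (number of unordered pairs, wall-clock days on the cluster)."""
    pairs = n_structures * (n_structures - 1) // 2
    total_seconds = pairs * seconds_per_pair / (n_nodes * speedup)
    return pairs, total_seconds / SECONDS_PER_DAY


# Inputs taken from the text: 25 686 structures, 3 s per pair, 100 nodes.
pairs, dali_days = all_against_all_days(25_686, 3.0, 100)
# Assumption: the 53.10x speedup from the abstract holds for every pair.
_, ipsa_days = all_against_all_days(25_686, 3.0, 100, speedup=53.10)

print(f"{pairs:,} pairs, {dali_days:.1f} days (DALI), {ipsa_days:.1f} days (IPSA)")
```

Under these assumptions the calculation gives roughly 330 million pairs, about 115 days at 3 s per comparison and about 2 days with the speedup applied, which is in the same range as the figures quoted above.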
ACKNOWLEDGEMENTS
The authors would like to acknowledge the constructive comments
and suggestions from anonymous reviewers. The authors would also
like to thank Dr Dong Xu of University of Missouri-Columbia for
helpful discussions. We would also like to thank the program authors
of DALI, CE, MAMMOTH, MultiProt and SSM who provided
openly accessible tools for research use.
Funding: University of Missouri-Columbia Research Council (in
part); the Shumaker Endowment in Bioinformatics.
Conflict of Interest: none declared.
REFERENCES
Alexandrov,N.N. (1996) SARFing the PDB. Protein Eng., 9, 727–732.
Aung,Z. and Tan,K.L. (2004) Rapid 3D protein structure database searching using
information retrieval techniques. Bioinformatics, 20, 1045–1052.
Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Boutonnet,N.S. et al. (1995) Optimal protein structure alignments by multiple linkage
clustering: application to distantly related proteins. Protein Eng., 8, 647–662.
Can,T. et al. (2004) Automated protein classification using consensus decision. In
Proceedings of the Third International IEEE Computer Society Computational
Systems Bioinformatics Conference, IEEE Computer Society, Stanford, USA,
pp. 224–235.
Branden,C. and Tooze,J. (1999) Introduction to Protein Structure, 2nd edn. Garland
Publishing Inc., New York.
Chen,L. et al. (2004) TargetDB: a target registration database for structural genomics
projects. Bioinformatics, 20, 2860–2862.
Chi,P.H. et al. (2006) A fast SCOP fold classification system using content-based
E-Predict algorithm. BMC Bioinformatics, 7, 362.
Ciaccia,P. et al. (1997) M-tree: an efficient access method for similarity search in metric
spaces. In Proceedings of the International Conference on Very Large Databases,
Morgan Kaufmann, Athens, Greece, pp. 426–435.
Gibrat,J.F. et al. (1996) Surprising similarities in structure comparison. Curr. Opin.
Struct. Biol., 6, 377–385.
Godzik,A. (1996) The structural alignment between two proteins: is there a unique
answer? Protein Sci., 5, 1325–1338.
Hobohm,U. and Sander,C. (1994) Enlarged representative set of protein structures.
Protein Sci., 3, 522.
Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance
matrices. J. Mol. Biol., 233, 123–138.
Holm,L. and Sander,C. (1994) The FSSP database of structurally aligned protein fold
families. Nucleic Acids Res., 22, 3600–3609.
Huan,J. et al. (2004) Accurate classification of protein structural families using coherent
subgraph analysis. In Proceedings of the Pacific Symposium on Biocomputing,
World Scientific Press, Hawaii, USA, pp. 411–422.
Jung,J. and Lee,B. (2000) Protein structure alignment using environmental profiles.
Protein Eng., 13, 535–543.
Kabsch,W. (1976) A solution for the best rotation to relate two sets of vectors. Acta
Crystallographica Section A, 32A, 922–923.
Krissinel,E. and Henrick,K. (2004) Secondary-structure matching (SSM), a new tool for
fast protein structure alignment in three dimensions. Acta Cryst., D60, 2256–2268.
Lackner,P. et al. (2000) ProSup: a refined tool for protein structure alignment. Protein
Eng., 13, 745–752.
Leibowitz,N. et al. (2001) Automated multiple structure alignment and detection of a
common substructure motif. Proteins, 43, 235–245.
Murzin,A.G. et al. (1995) SCOP: a structural classification of proteins database for the
investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Novotny,M. et al. (2004) Evaluation of protein fold comparison servers. Proteins, 54,
260–270.
Ortiz,A.R. et al. (2002) MAMMOTH (Matching molecular models obtained from
theory): an automated method for model comparison. Protein Sci., 11, 2606–2621.
Pearl,F.M. et al. (2003) The CATH database: an extended protein family resource for
structural and functional genomics. Nucleic Acids Res., 31, 452–455.
Rogen,P. and Fain,B. (2003) Automatic classification of protein structure by using Gauss
integrals. Proc. Natl Acad. Sci. USA, 100, 119–124.
Shatsky,M. et al. (2004) A method for simultaneous alignment of multiple protein
structures. Proteins, 56, 143–156.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.
Shyu,C.R. et al. (2004) ProteinDBS—a content-based retrieval system for protein
structure databases. Nucleic Acids Res., 32, 572–575.
Singh,A.P. and Brutlag,D.L. (1997) Hierarchical protein structure superposition using
both secondary structure and atomic representations. In Proceedings of 5th
International Conference on Intelligent Systems for Molecular Biology (ISMB’97),
AAAI Press, Halkidiki, Greece, pp. 284–293.
Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208,
1–22.
van Rijsbergen,C.J. (1979) Information Retrieval, 2nd edn. Butterworths, London.
von Grotthuss,M. et al. (2006) PDB-UF: database of predicted enzymatic functions for
unannotated protein structures from structural genomics. BMC Bioinformatics, 7,
53–63.
Yang,J.M. and Tung,C.H. (2006) Protein structure database search and evolutionary
classification. Nucleic Acids Res., 34, 3646–3659.
Young,M.M. et al. (1999) A rapid method for exploring the protein structure universe.
Proteins, 34, 317–332.
Zarembinski,T.I. et al. (1998) Structure-based assignment of the biochemical function
of a hypothetical protein: a test case of structural genomics. Proc. Natl Acad. Sci. USA,
95, 189–193.
Zhou,X. et al. (2006) Protein structure similarity from principle component correlation
analysis. BMC Bioinformatics, 7, 40–50.