BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 19 2009, pages 2559–2565 doi:10.1093/bioinformatics/btp474 Structural bioinformatics Efficient SCOP-fold classification and retrieval using index-based protein substructure alignments Pin-Hao Chi1,2 , Bin Pang1,2 , Dmitry Korkin1,2 and Chi-Ren Shyu1,2,∗ 1 Medical and Biological Digital Library Research Lab, Informatics Institute and 2 Department of Computer Science, University of Missouri, Columbia, MO 65211, USA Received on March 31, 2009; revised on July 7, 2009; accepted on July 24, 2009 Advance Access publication August 10, 2009 Associate Editor: Anna Tramontano ABSTRACT Motivation: To investigate structure–function relationships, life sciences researchers usually retrieve and classify proteins with similar substructures into the same fold. A manually constructed database, SCOP, is believed to be highly accurate; however, it is labor intensive. Another known method, DALI, is also precise but computationally expensive. We have developed an efficient algorithm, namely, index-based protein substructure alignment (IPSA), for protein-fold classification. IPSA constructs a two-layer indexing tree to quickly retrieve similar substructures in proteins and suggests possible folds by aligning these substructures. Results: Compared with known algorithms, such as DALI, CE, MultiProt and MAMMOTH, on a sample dataset of non-redundant proteins from SCOP v1.73, IPSA exhibits an efficiency improvement of 53.10, 16.87, 3.60 and 1.64 times speedup, respectively. Evaluated on three different datasets of non-redundant proteins from SCOP, average accuracy of IPSA is approximately equal to DALI and better than CE, MAMMOTH, MultiProt and SSM. With reliable accuracy and efficiency, this work will benefit the study of high-throughput protein structure–function relationships. Availability: IPSA is publicly accessible at http://ProteinDBS.rnet .missouri.edu/IPSA.php Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 INTRODUCTION Protein tertiary structure classification has become an important research topic in computational and molecular biology. Since similar protein tertiary structures might have correlations with specific biological functions (Zarembinski et al., 1998), several structural genomics projects (Chen et al., 2004; von Grotthuss et al., 2006) are focused on understanding the links between protein sequences, structures and functional properties. To understand protein structure–function relationships, life sciences researchers usually group proteins of similar substructures into a fold and study functional properties of these similar proteins. The computational methods of protein structure comparisons are usually focused on finding a detailed structural alignment between two proteins. The root mean square deviation (RMSD) is utilized to gauge the ∗ To whom correspondence should be addressed. quality of alignment results from their optimized superimposition. Finding a global optimum of structural alignment is known to be an nondeterministic polynomial time (NP)-hard problem (Godzik, 1996). To reduce the computational time, traditional structural alignment algorithms such as SSAP (Taylor and Orengo, 1989), DALI (Holm and Sander, 1993), CE (Shindyalov and Bourne, 1998), MAMMOTH (Ortiz et al., 2002) and MultiProt (Shatsky et al., 2004) apply heuristics and search locally optimal solutions. Computational methods usually conduct one-against-all pairwise alignments between a newly discovered protein and all known proteins in the database to suggest protein folds. A known protein classification database, CATH (Pearl et al., 2003), is constructed using the SSAP algorithm. Another fold classification database, FSSP (Holm and Sander, 1994), is built based on the DALI algorithm. Since these classification works rely on heuristicbased structural alignments algorithms to measure the similarity of proteins, different heuristics may return divergent classification results for the same query structure. At present, a human curated database, SCOP (Murzin et al., 1995), is believed to maintain highly accurate classification results. Proteins with structural relationships are hierarchically grouped at the fold level. Even though manual inspection is more accurate, it is also labor intensive. A consensus approach intersects classification results from multiple structural alignment algorithms to automate SCOP-fold classification (Can et al., 2004). With manually assigned weights for each individual method, this consensus approach yields an improved classification accuracy. However, a combination of structural alignment algorithms is known to be computationally expensive. With the advent of high-throughput techniques such as X-ray crystallography and nuclear magnetic resonance (NMR), the number of known protein structures has rapidly increased in recent years. To accelerate SCOP-fold classification, there is an urgent need for an efficient protein structure classification algorithm with satisfactory accuracy. Our previous work, ProteinDBS (Shyu et al., 2004), extracts 33 features from 2D distance matrices generated from protein 3D structures. ProteinDBS is able to quickly measure the global similarity of two proteins based on the Euclidean distance of the corresponding feature vectors. Extending the global similarities used in ProteinDBS, we developed an efficient protein classification tool, E-Predict (Chi et al., 2006), that assigns newly discovered proteins to possible SCOP folds in real time. Other global similarity measurements, such as SGM (Rogen and Fain, 2003), PCC (Zhou et al., 2006) and SSM (Krissinel and Henrick, © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] [14:56 31/8/2009 Bioinformatics-btp474.tex] 2559 Page: 2559 2559–2565 P.-H.Chi et al. 2004) extract global features from protein 3D structures using the scaled Gauss metric, principle component correlation analysis and matching graph, respectively. However, the global similarity cannot be used to classify locally similar proteins with common substructures. Several works study retrieving similar protein substructures without structural alignments. The MUSTA algorithm (Leibowitz et al., 2001) uses a clustering technique to identify similar substructures. Huan et al. (2004) model protein tertiary structures using adjacency matrices from graph theory. These methods are usually computationally expensive. Hashing and indexing techniques have been used for improving the efficiency of protein substructure retrieval. Young et al. (1999) utilize hashing techniques to identify similar substructures that are composed of triple secondary structure elements (SSEs) in two proteins. ProtDex2 (Aung and Tan, 2004) extracts spatial features such as distances and angles from each pair of SSEs and utilizes an inverted-file index to improve search efficiency. Recent research has studied mapping protein tertiary structures into 1D sequences for fast substructure retrieval (Yang and Tung, 2006). This approach exhibits good efficiency, except that the 1D representation of protein substructures potentially looses the structural topology. Identical sequences from two proteins may correspond to dissimilar structures in 3D space. Therefore, the accuracies are lower than detailed structure alignment algorithms such as DALI and CE. From these works, efficient approaches are less accurate and accurate methods are less efficient. To consider both accuracy and efficiency of SCOP-fold classification and retrieval, we have developed an index-based protein substructure alignment (IPSA) algorithm. 2 METHODS Protein structure retrievals can be conceptually considered as an application in the field of information retrieval (IR). For an IR system, types, orders and locations of terms make up the basic semantics of a document. Analogous to these concepts of IR, types, orders and topological relationships of protein substructure units can be identified to assist human inspections of protein folds. 2.1 Protein substructure unit extraction Our algorithm compares protein tertiary structures in terms of matching protein substructure units. In order to extract the protein substructure units for comparisons, SSEs, Helix (H) and Sheet (E), are first identified from tertiary structures. In our implementation, the identification of SSEs is conducted by sequentially matching protein amino acid residues with the H and E templates of Spatial ARangement of backbone Fragments (SARF; Alexandrov, 1996). As shown in Figure 1, our substructure unit, ui , consisting of (2×l) amino acids, is extracted by sliding a window of l amino acids within one SSE (SSEo ) and another window of l amino acids in another SSE (SSEc ). The substructure unit, ui = Lio ⊕Ljc , is defined as a concatenation of an opening segment with l amino acids (Lio ) and a closing segment with l amino acids (Ljc ), where i and j are the starting residues of these two segments and j −i ≥ l. For small proteins with less than two SSE assignments, our method slides windows within the entire protein. For those identified SSEs with more than l amino acid residues, sliding a window of l-mers by one residue at a time produces a large amount of substructure units. To reduce the search space, the sliding window of l-mers is shifted every three amino acids. Since SSEs usually contain more than five amino acids (Carl and John, 1999), l is set to 5 in order to cover most of the protein secondary structures. With the sliding window, our algorithm can efficiently find the structural features Fig. 1. A substructure unit, ui , is a concatenation of two non-overlapping amino acid segments, Lio and Ljc , where Lio ⊆ SSEo and Ljc ⊆ SSEc . Algorithm 1 Substructure representative generation Require: , η, Mtree = NIL 1: for each database protein P ∈ do 2: for each substructure unit ui ∈ P do 3: if Range_Search(ui ,Mtree , η) = then 4: Insert (ui ,Mtree ) end if 5: end for 6: 7: end for and identify two proteins with same sequence of SSE but different in the overall fold. A large number of protein substructure units can be extracted from database proteins. In order to efficiently search proteins with similar substructures, our algorithm first maps protein tertiary structure into a series of 1D terms by comparing substructure units with our predefined substructure representatives. The algorithm then indexes these terms of the entire database proteins for fast searching. 2.2 Protein substructure representative generation A protein substructure representative is conceptually similar to a stemmed word, which addresses all variations of terms sharing the same root words in an IR system. Each substructure representative is selected to represent a group of structurally similar substructure units. To compute RMSD of two substructure units, our method uses (X, Y, Z) coordinates of the Cα atom to model each amino acid residue. The substructure unit with (2×l) amino acids is therefore converted into a (2l ×3)-D vector. RMSD is then measured by the Kabsch procedure (Kabsch, 1976), which superimposes 6l-D vectors of two substructure units and optimizes the rotation and translation matrices. Two substructure units within a preset RMSD threshold of η are considered to be similar. In our implementation, η is set to 3.0 Å. Our method utilizes database indexing tree to quickly identify all unique substructure representatives in the entire protein database. This global collection of substructure representatives is named global substructure representatives (GSRs). The concept of identifying GSRs is to iteratively grow an indexing tree in a top-down manner. The leaf nodes of the indexing tree maintains the current GSRs. Algorithm 1 shows the pseudocode of generating GSRs. Range_Search is a function to query a substructure unit ui , which is a 6l-D vector, against the indexing tree and retrieve current GSRs that are structurally similar to ui . If ui is structurally similar to any existing GSR with RMSD ≤ η, it is then considered as a duplicate substructure representative and ignored. Otherwise, ui will be inserted into the tree as a GSR with an unique term identifier. The algorithm starts with an empty indexing tree. The inner for-loop between lines 2 and 6 iteratively checks whether each substructure unit is a unique GSR or not. We chose an M-tree (Ciaccia et al., 1997) to index GSRs due to its high scalability for a large number of vector data. Also, M-trees can efficiently conduct Range_Searches in a multi-dimensional vector space. Algorithm 1 is applied to parse all protein tertiary structures in the database, , and grow a global indexing tree, which serves as a dictionary of all substructure representatives. 2.3 Mapping protein substructure unit into terms The major purpose of creating the global indexing tree is to translate protein tertiary structure units into a series of terms, which make fast substructure retrieval possible. As discussed previously, each protein, P, can be decomposed into a sequential set of na substructure units, 2560 [14:56 31/8/2009 Bioinformatics-btp474.tex] Page: 2560 2559–2565 Index-based protein substructure alignment Fig. 2. Two-layer term mapping: local mapping is to map an individual protein into a series of LSRs that are indexed in a local indexing tree; global mapping is to map each LSR into term identifiers of similar GSRs. SP = {u1 ,u2 ,...,una }. One intuitive way of translating a protein tertiary structure into terms is to conduct a Range_Search against the global indexing tree for each unit, ui . From the search result, the unique term identifiers of returned GSRs are then assigned to ui . Iteratively, searching in the global indexing tree, na substructure units of P are mapped into a list of nb terms, f : SP → T = {t1 ,t2 ,...,tnb }, where nb ≥ na . Since a Range_Search usually needs to execute computationally expensive RMSD comparisons for thousands of GSRs, the efficiency of the term mapping process needs to be addressed. To tackle the efficiency issue, we introduce a two-layer term mapping mechanism. The first layer constructs a local indexing tree in terms of applying lines 2–6 of Algorithm 1 for a single protein, P. The purpose of constructing a local indexing tree is to generate a set of local substructure representatives (LSRs) for P. Instead of directly querying thousands of substructure units against the global indexing tree in the second layer, the algorithm queries only hundreds of LSRs. We will refer Figure 2 to discuss examples of the term mapping processing in this section. The algorithm of the first layer maps a group of similar substructure units from P to an LSR, which is then queried against the global indexing tree in the second layer by conducting a Range_Search. The term identifiers of GSRs in the search result are indirectly assigned to the group of similar protein substructure units via the LSR. For example, the table in the upper-right corner of Figure 2 shows that u1 , u3 and u5 of protein P are similar to a local substructure representative, LSR1 . The algorithm conducts a Range_Search to query LSR1 in the global indexing tree; the returned term identifiers of similar GSRs, 1, 2 and 5, are associated with the local representative, LSR1 and are assigned to each of substructure units u1 , u3 and u5 . Using the same procedure, all substructure units in the protein P are converted into a list of terms. Normally, the size of a local indexing tree is much less than the size of the global tree. Therefore, the efficiency of term mapping is greatly increased. 2.4 Query processing Given a newly discovered protein chain, the query process applies similar two-layer term mapping procedures as discussed in Section 2.3. However, instead of conducting a Range_Search in the global indexing tree for each LSR, this query process searches only the nearest neighbor from the global 2561 [14:56 31/8/2009 Bioinformatics-btp474.tex] Page: 2561 2559–2565 P.-H.Chi et al. tree to reduce the number of query terms, which will be submitted to a computationally expensive, on-line ranking procedure. 2.5 On-line retrieval and ranking Once protein structures are converted into terms, it is intuitive to apply existing IR algorithms of inverted-file index that use ‘bag of words’ or ‘syntactic characterization’ approaches. We construct a data structure, namely, the inverted-protein index, for fast on-line retrieval. This index supports a two-layer linkage: associating a term identifier ti with a subset of database proteins that have ti , ti = {P1 ,P2 ,...,Pn }; referencing each protein P P P P j j j ,uti ,2 ,...,uti ,n }, Pj ∈ ti to a set of protein substructure units in Pj , ti j = {uti ,1 P j , denotes 3D that maintains sequential occurrences of ti . The element, uti ,m coordinates of the m-th occurrence of ti in Pj . With the inverted-protein index, instead of one-against-all comparisons, our algorithm only needs to compute structural similarities between a query protein and a subset of database proteins that share common terms. In our implementation, the global indexing tree and the inverted-protein index reside in memory for fast on-line retrieval. Starting from a simple case, we discuss how to compute the structural similarity between a query protein and a database protein. When a term has only one occurrence in these two proteins, our algorithm superimposes these two substructure units with the same term and computes the optimized rotation matrix and translation vector using the Kabsch procedure (Kabsch, 1976). In terms of applying the rigid transformation based on the previously computed rotation matrix and translation vector, the entire polypeptide chain of these two proteins can be overlapped in 3D space. The algorithm then checks the one-to-one correspondence of amino acid residues on the two overlapped proteins and measures the structural similarity using chain-tochain alignment. 2.5.1 Chain-to-Chain alignment Once a query protein has been rotated, translated and superimposed on a database protein, our algorithm applies a dynamic programming technique to find the longest chain alignment at the amino acid level. Let X = {x1 ,x2 ,...,xnX } be a query protein with nX amino acid residues and Y = {y1 ,y2 ,...,ynY } be a database protein with nY residues. In our chain-to-chain alignment, xi is aligned with yj under the condition that the Euclidean distance between their Cα atoms, dist(xi ,yj ), is less than γ Å. By empirical observations, γ is set to 4.5 Å. Let θ[i,j] denote the alignment length for two subsets of amino acid residues {x1 ,x2 ,...,xi } and {y1 ,y2 ,...,yj }. The procedure that iteratively finds the longest chain alignment is described as follows: ⎧ , if i = 0 or j = 0 ⎨0 θ[i,j] = θ[i−1,j −1]+1 , if dist(xi ,yj ) ≤ γ ⎩ max(θ[i−1,j],θ[i,j −1]) , if dist(xi ,yj ) > γ From the alignment result, the longest alignment length NA and the RMSD value are used to compute the structure similarity between a query protein X and a database protein Y that contains nY amino acid residues. A similarity score function is defined in the following equation. similarity score(X,Y ) = NA2 NA nY RMSD+1.0 (1) Longer alignment lengths mean that larger common substructures potentially exist in X and Y . Adding the RMSD value with additional 1.0 to avoid the singular condition. In our current implementation, NA is highly weighted by taking a square in the first ratio of Equation 1. Since a longer protein may potentially result in a longer alignment than a small protein, our scoring function is normalized by nY . Also, the structural variation of aligned amino acid residues is penalized based on the RMSD value. The algorithm ranks a subset of database proteins with matched terms based on the similarity scores. 2.5.2 Term-to-term alignment In reality, each term usually has multiple occurrences in both a query protein and a database protein. In order to match Algorithm 2 LSA 1: for i = 1 to 2m do 2: for j = 1 to 2n do 3: if i = 1 or j = 1 then 4: [i,j] = ∅ 5: z[i,j] = 0 6: else 7: if i%2 = 0 and j%2 = 0 then X ,t Y }) ≤ γ then 8: if RMSD([i−1,j −1] ∪{ti/2 j/2 X ,t Y } 9: [i,j] = [i−1,j −1] ∪{ti/2 j/2 10: z[i,j] = z[i−1,j −1]+1 11: else X ,t Y } 12: [i,j] = {ti/2 j/2 13: z[i,j] = 1 14: end if 15: else 16: if z[i−1,j] < z[i,j −1] then 17: [i,j] = [i,j −1] 18: z[i,j] = z[i,j −1] 19: else if z[i−1,j] > z[i,j −1] then 20: [i,j] = [i−1,j] 21: z[i,j] = z[i−1,j] 22: else 23: if RMSD([i−1,j]) < RMSD([i,j −1]) then [i,j] = [i−1,j] 24: 25: z[i,j] = z[i−1,j] 26: else 27: [i,j] = [i,j −1] 28: z[i,j] = z[i,j −1] 29: end if 30: end if end if 31: 32: end if 33: end for 34: end for substructure units of one protein with units of the other protein, our algorithm performs substructure alignments with topological constraints using a variant of dynamic programming. Given a term t, the algorithm first sequentially identifies all occurrences of t in X, Xt = {t1X ,t2X ,...,tmX }, where tmX denotes the 3D coordinates of the m-th occurrence of t in X. The algorithm then accesses the inverted-protein index to find a subset of database proteins that are associated with t. Such a database protein Y can be represented by an ordered sequence of all occurrences of t, Yt = {t1Y ,t2Y ,...,tnY }, where tnY is the 3D coordinates of the n-th t in Y . Our method employs a customized dynamic programming technique for finding the longest substructure alignment (LSA) between Xt and Yt . The algorithm first creates two data structures: (i) an aligned coordinate set that keeps one-to-one corresponding coordinates of aligned substructures between X and Y and (ii) an alignment length matrix z2n×2m with dummy columns and rows between two consecutive occurrence of t in both X and Y as shown in Supplementary Figure S1. The first row and column are initialized by filling zeros. There are two types of cells in the matrix, namely term–term and dummy cells. The term– term cell has co-occurrence of terms from both X and Y , while the dummy cell has existence of either a dummy row (−) or a dummy column (−). The rationale of inserting dummy elements is to pass the best alignment results of previous elements for use by later elements. The ‘if statement’ at line 7 in Algorithm 2 distinguishes these two type of cells. Lines 8–14 deal with term–term cells for growing the alignment and lines 16–30 handle dummy cells for updating the currently best alignment result. Let z[i,j] be X Y the alignment length for aligning {t1X ,t2X ,...,ti/2 } and {t1Y ,t2Y ,...,tj/2 }, where i ≤ 2m and j ≤ 2n. The 3D coordinates of aligned substructures are kept in (i,j) with a corresponding RMSD value, which is measured from the aligned substructures from both proteins. From lines 8–14, z[i,j] is X ,t Y } can increased by one when the 3D coordinates of (i−1,j −1) ∪{ti/2 j/2 be superimposed within an RMSD threshold, γ. Otherwise, z[i,j] is assigned to 1. Supplementary Figure S1 shows an example to explain the alignment process. Since the first pair of substructures {t1X ,t1Y } in Xt and Yt are able to be aligned within an RMSD threshold, the alignment length z[2,2] = 1. Another alignment result, z[4,4] = 2, is due to the fact that two protein 2562 [14:56 31/8/2009 Bioinformatics-btp474.tex] Page: 2562 2559–2565 Index-based protein substructure alignment substructures, {t1X ,t1Y } and {t2X ,t2Y }, can be superimposed within an RMSD threshold. When adding a new pair of substructures in the current alignment results in an RMSD that exceeds γ, the algorithm keeps the new substructures and assigns 1 to the cell, such as z[6,6] in Supplementary Figure S1. From lines 16–30 of Algorithm 2, a dummy element z[i,j] propagates alignment results from the previous stage by taking the maximal values from two neighbors, z[i−1,j] (↓) and z[i,j −1] (→), as depicted in cell z[4,5] of Supplementary Figure S1. If these two neighbors have the same alignment length, the one with a smaller RMSD will be chosen. z[i,j] LSA score(i,j) = (2) RMSD+1.0 In the process of aligning Xt and Yt , we define a scoring function in Equation (2) to evaluate the quality of alignment z[i,j] based on the alignment length z[i,j] and an RMSD value. To provide better accuracy in proteinfold classification, the top 1 result of term-to-term alignment is utilized to determine a candidate set of rotation matrices and translation vectors, which are used for superimposing two proteins and computing the similarity score in the chain-to-chain alignment. 2.6 Further refinement During the term mapping procedure (see Section 2.3), Range_Search is utilized to find a structurally similar unit with RMSD ≤ η. However, during experiments, we found that some proteins with dissimilar structure are also mapped to the same term, which would increase response time of IPSA. To further improve the efficiency of IPSA, we introduce type, angle and distance of substructure unit as new criteria into Range_Search. The refined IPSA algorithm is called IPSA*. Given a substructure unit ui consisted of SSEo and SSEc , its type, denoted as Ti , is a concatenation of type of SSEo and SSEc (Helix or Sheet). To compute the angle and distance of substructure unit, we adopt the methods from Singh and Brutlag (1997). First, each SSE in the substructure unit is represented by its start point Astart and end point Aend . Then, unit direction vector of each SSE is calculated with the start and end points as Astart −Aend V= ||Astart −Aend || For substructure unit ui , its angle i is calculated by taking the inverse cosine of the dot product of the two unit direction vectors. The distance Di is set to minimum value of the distances between corresponding start and end points of two SSEs. With these new features, two substructure units ui and uj are considered as similar if the following four conditions are met: (i) Tj = Ti , (ii) |j −i | < , (iii) |Dj −Di | < d and (iv) RMSD ≤ η. Here and d are thresholds for the angle and distance, respectively. It is noted that IPSA* is only used to process query which requires quick response but low accuracy. 3 COMPUTATIONAL RESULTS In this section, we investigate both the efficiency and accuracy for SCOP-fold classification and protein structure retrieval. From an evaluation work (Novotny et al., 2004), both DALI and CE are considered as accurate protein-fold comparison methods. Our proposed IPSA algorithms are compared with these two wellrecognized algorithms and other representative algorithms, which include MAMMOTH (Ortiz et al., 2002), MultiProt (Shatsky et al., 2004) and SSM (Krissinel and Henrick, 2004), using NonRedundant Protein Data. The consensus approach (Can et al., 2004) is not compared since the weight for each individual alignment algorithm is data dependent. 3.1 Non-redundant protein data With proteins from the Protein Data Bank (PDB) (Berman et al., 2000), SCOP manually classifies structurally similar proteins into Table 1. The number of non-redundant proteins in a test set of general and v ,known non-redundant test sets in v21 Sets Test proteins Database proteins I 640 proteins from v1.69 v1.67 2767 proteins from SCOP v1.67 II 607 proteins from v1.71 v1.69 3015 proteins from SCOP v1.69 III 610 proteins from v1.73 v1.71 3276 proteins from SCOP v1.71 Table 2. The average response time of IPSA*, MAMMOTH, IPSA, MultiProt, CE and DALI for SCOP-fold classification per query Algorithm Average response time (s) Speedup of IPSA* IPSA* MAMMOTH IPSA MultiProt CE DALI 182 280 402 655 3071 9665 – 1.64 2.21 3.60 16.87 53.10 folds. Since structural similarity could be captured using sequence alignment tools, classifying redundant proteins which have high sequence similarities is basically considered to be a trivial case of fold classification. We therefore use non-redundant protein data for performance evaluation. The non-redundant dataset is selected using the latest version of PDBselect (Hobohm and Sander, 1994) with <25% sequence similarity. Small proteins with less than 20 amino acids are also excluded. Our dataset is shown in Table 1, which is consisted of two parts: (i) test proteins for querying SCOP fold and (ii) database proteins. Given two consecutive SCOP releases v1 and v2 (v1 ⊂ v2 ), vv21 = v2 −v1 denotes a set of newly discovered proteins in v2 that have not been identified in v1 . To mimic the SCOP-fold classification, we use the newly discovered proteins to construct the test proteins, which include at least one non-redundant and representative protein from each superfamily existing in vv21 . Similarly, the database proteins are constructed to ensure at least one non-redundant and representative protein from each superfamily in the entire SCOP space. 3.2 Efficiency Due to the time complexity of the DALI algorithm, we evaluate the efficiency of SCOP-fold classification and retrieval using sampled non-redundant test proteins on a single server, which are selected from the protein dataset (III) listed in Table 1 based on distribution of number of amino acids in each protein. The list of the sampled proteins are shown in Supplementary Table S1. We perform 3-fold experiments to compare the efficiency of IPSA with other algorithms, each using 50 proteins randomly selected from set (III). We measure the average response time of total 150 proteins to evaluate the efficiency of SCOP-fold classification and structure retrieval. The experiments are conducted on a Linux Fedora server with AMD Opteron dual-core 1000 series processors and 2 GB RAM. Table 2 shows that the fastest algorithm is IPSA*, which has 1.64, 16.87 and 53.10 times faster than MAMMOTH, CE and DALI, respectively. 2563 [14:56 31/8/2009 Bioinformatics-btp474.tex] Page: 2563 2559–2565 P.-H.Chi et al. (a) (b) Normalized for IPSA, IPSA*, DALI, MAMMOTH, CE, MultiProt and SSM using the protein sets (I) , (II) and (III) in Fig. 3. The plots of (a) CCR and (b) FScore(q) Table 1. It is noted that SSM is not included in this experiment as it has some failed cases which make the measurement of response time inaccurate. 3.3 Accuracy—SCOP-fold classification SCOP-fold classification categorizes a test protein into a specific fold. In our experiment, the test protein is classified into the same fold as the top-ranked database protein. To evaluate the accuracy of SCOP-fold classification, we use a general metric, correct classification rate (CCR), which is defined as follows: CCR = The number of correctly classified proteins The total number of test proteins (3) Figures 3a presents the CCR performance comparison among IPSA, IPSA*, DALI, CE, MAMMOTH, MultiProt and SSM using the protein sets (I), (II) and (III) listed in Table 1. Intuitively, the optimal accuracy of SCOP-fold classification is 100% CCR. Our classification results for the three sets show that IPSA exhibits 85.31%, 87.31% and 89.65% CCR. These accuracies are comparable with those of DALI: 85.58%, 88.47% and 88.24% CCR. Also, IPSA outperforms MAMMOTH and CE in both protein datasets for the SCOP-fold classification by at least 3.28% and 7.74%, respectively. As IPSA utilizes the substructure extracted from PDB to classify SCOP fold, we also investigate relationship between the accuracy and quality of compared structures by randomly selecting 50-folds from set (III) and calculating average CCR and resolution of PDB in each fold. The results are shown in Supplementary Figure S2. The correlation coefficient r = 0.06, which means no significant linear association between the accuracy and resolution. It is noted that we use the scoring function of IPSA chain-to-chain alignment to rank the alignment results of MultiProt as it does not have such function. In addition, when running SSM, we have 82, 106 and 133 failed test cases for the protein sets (I), (II) and (III), respectively. All the failed cases are excluded from the ‘total number of test proteins’ in Equation (3). 3.4 Accuracy—SCOP-fold retrieval Our experiment utilizes the precision–recall (PR) curves and F-measure (van Rijsbergen, 1979) to gauge the accuracy of SCOPfold retrieval. With the retrieval results, the retrieved database proteins are relevant when the SCOP-fold labels of these proteins match the fold label of a query protein. Otherwise, these proteins are irrelevant. If there are k database proteins with the same fold as a query protein, an ideal retrieval should rank these proteins in the top k results. Precision and Recall, which have been discussed previously, are two standard performance measurements for evaluating an IR system. The average PR curves of protein sets (I), (II) and (III) are shown in Supplementary Figure S3. From the figure, we can see that IPSA and IPSA* have higher precision at all recall levels. Given a query protein, q, the F-measure shown in Equation (4) considers both Precision and Recall for the i-th relevant protein. F(i,q) = 2×Precision(i)×Recall(i) Precision(i)+Recall(i) (4) Since there may exist more than one relevant protein for a query Normalized protein, q, we define a normalized, single measurement, FScore(q) in Equation (5), which takes the average of the sum of each individual F-measure and normalizes this average sum into a value between 0 and 1. When relevant proteins are highly ranked, a high Normalized is expected. FScore(q) nq Normalized = FScore(q) 2× i=1 F(i,q) nq i (5) i=1 i+nq Normalized for IPSA, IPSA*, Figures 3b shows the plots of FScore(q) DALI, CE, MAMMOTH and MultiProt using the protein sets (I), (II) and (III) listed in Table 1. SSM is excluded as it only outputs top 1 result. From the results, DALI presents the best retrieval accuracies, 72.59%, 70.97% and 72.84%, while IPSA exhibits competitive retrieval accuracies, 70.14%, 68.40% and 71.57%. In addition, IPSA outperforms MAMMOTH and CE in both protein datasets with retrieval accuracies that are better by at least 11.33% and 13.10%, respectively. 3.5 Structural alignment To provide a fair comparison of structural alignment, we use a set of protein pairs that have been used in the algorithms of MAMMOTH (Ortiz et al., 2002), SHEBA (Jung and Lee, 2000), DALI (Holm and Sander, 1993), MLC (Boutonnet et al., 1995), VAST (Gibrat et al., 1996) and PropSup (Lackner et al., 2000). The number of aligned residues after optimal structural alignment is shown in Supplementary Table 2, which is taken from Table 1 of MAMMOTH. From the table, we can see IPSA can find the longest alignment in 13 (out of 19) cases out of all methods reported in Ortiz et al. (2002). One interesting case is 1acx-1tnfA pair. Both MAMMOTH and SHEBA failes to find the correct structural 2564 [14:56 31/8/2009 Bioinformatics-btp474.tex] Page: 2564 2559–2565 Index-based protein substructure alignment alignment as reported by others, but IPSA is able to find a solution close to DALI. 4 DISCUSSIONS In this article, we introduce the IPSA algorithm to efficiently compare protein structure and retrieve proteins with similar substructures. The algorithm first builds a two-layer indexing tree to convert protein substructures into terms, and then aligns these terms and chains to ensure one-to-one correspondence of amino acids between the two overlapped proteins. The experiment results show IPSA achieves speedups of over an order of magnitude and approximately equal accuracy compared with DALI. One of the key applications for a large-scale method of protein structure comparison is the molecular evolution of protein repertoire. To understand the full picture of structural relationship between proteins, it would be desirable to perform all-against-all structural comparison of currently solved protein structures in an organism. Specifically, one would be interested to determine locally conserved substructures for each pair of proteins. According to the UniProtKB/Swiss-Prot Human Proteome Initiative, there are 25 686 unique protein structures currently available. This corresponds to about 338 million of protein–protein structural comparisons. Using a current method for local structural alignment such as DALI, and estimating the running time for a single protein–protein structure comparison to be 3 s, the whole task would require running for 112 days on a 100-node CPU cluster. The efficiency of the proposed algorithm allows to reduce the running time to 2 day making this task feasible in the nearest future. ACKNOWLEDGEMENTS The authors would like to acknowledge the constructive comments and suggestions from anonymous reviewers. The authors would also like to thank Dr Dong Xu of University of Missouri-Columbia for helpful discussions. We would also like to thank the program authors of DALI, CE, MAMMOTH, MultiProt and SSM who provided openly accessible tools for research use. Funding: University of Missouri-Columbia Research Council (in part); the Shumaker Endowment in Bioinformatics. Conflict of Interest: none declared. REFERENCES Alexandrov,N.N. (1996) Sarfing the pdb. Protein Eng., 9, 727–732. Aung,Z. and Tan,K.L. (2004) Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics, 20, 1045–1052. Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242. Boutonnet,N.S. et al. (1995) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., 8, 647–662. Can,T. et al. (2004) Automated protein classification using consensus decision, In Proceedings of the Third International IEEE Computer Society Computational Systems Bioinformatics Conference, IEEE Computer Society, Stanford, USA, pp. 224–235. Carl,B. and John,T. (1999) Introduction to Protein Structures, 2nd edn. Garland Publishing Inc., New York. Chen,L. et al. (2004) TargetDB: a target registration database for structural genomics projects. Bioinformatics, 20, 2860–2862. Chi,P.H. et al. (2006) A fast SCOP fold classification system using content-based E-Predict algorithm. BMC Bioinformatics, 7, 362. Ciaccia,P. et al. (1997) M-tree: an efficient access method for similarity search in metric spaces. In Proceedings of the International Conference on Very Large Databases, Morgan Kaufmann, Athens, GreeceHuan, pp. 426–435. Gibrat,J.F. et al. (1996) Surprising similarities in structure comparison. Curr. Opin. Struct. Biol., 6, 377–385. Godzik,A. (1996) The structural alignment between two proteins: is there a unique answer? Protein Sci., 5, 1325–1338. Hobohm,U. and Sander,C. (1994) Enlarged representative set of protein structures. Protein Sci., 3, 522. Holm,L. and Sander,C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138. Holm,L. and Sander,C. (1994) The FSSP database of structurally aligned protein fold families. Nucleic Acids Res., 22, 3600–3609. Huan,J. et al. (2004) Accurate classification of protein structural families using coherent subgraph analysis. In Proceedings of the Pacific Symposium on Biocomputing, World Scientific Press, Hawaii, USA, pp. 411–422. Jung,J. and Lee,B. (2000) Protein structure alignment using environmental profiles. Protein Eng., 13, 535–543. Kabsch,W. (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32A, 922–923. Krissinel,E. and Henrick,K. (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Aryst., D60, 2256–2268. Lackner,P. et al. (2000) ProSup: a refined tool for protein structure alignment. Protein Eng., 13, 745–752. Leibowitz,N. et al. (2001) Automated multiple structure alignment and detection of a common substructure motif. Proteins, 43, 235–245. Murzin,A.G. et al. (1995) Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540. Novotny,M. et al. (2004) Evaluation of protein fold comparison servers. Proteins, 54, 260–270. Ortiz,A.R. et al. (2002) MAMMOTH (Matching molecular models obtained from theory): an automated method for model comparison. Protein Sci., 11, 2606–2621. Pearl,F.M. et al. (2003) The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res., 31, 452–455. Rogen,P. and Fain,B. (2003) Automatic classification of protein structure by using Gauss integrals. Proc. Natl Sci. USA, 100, 119–124. Shatsky,M. et al. (2004) A method for simultaneous alignment of multiple protein structures. Proteins, 56, 143–156. Shindyalov,H.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 9, 739–747. Shyu,C.R. et al. (2004) ProteinDBS—a content-based retrieval system for protein structure databases. Nucleic Acids Res., 32, 572–575. Singh,A.P. and Brutlag,D.L. (1997) Hierarchical protein structure superposition using both secondary structure and atomic representations. In Proceedings of 5th International Conference on Intelligent Systems for Molecular Biology (ISMB’97), AAAI Press, Halkidiki, Greece, pp. 284–293 Taylor,W.R. and Orengo,C.A. (1989) Protein structure alignment. J. Mol. Biol., 208, 1–22. van Rijsbergen,C.J (1979) Information Retrieval, 2nd edn. Butterworths, London. von Grotthuss,M. et al. (2006) PDB-UF: database of predicted enzymatic functions for unannotated protein structures from structural genomics. BMC Bioinformatics, 7, 53–63. Yang,J.M. and Tung,C.H. (2006) Protein structure database search and evolutionary classification. Nucleic Acids Res., 34, 3646–3659. Young,M.M. et al. (1999) A rapid method for exploring the protein structure universe. Proteins., 34, 317–332. Zarembinski,T.I. et al. (1998) Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc. Natl Sci. USA, 95, 189–193. Zhou,X. et al. (2006) Protein structure similarity from principle component correlation analysis. BMC Bioinformatics, 7, 40–50. 2565 [14:56 31/8/2009 Bioinformatics-btp474.tex] Page: 2565 2559–2565
© Copyright 2026 Paperzz