Journal of Theoretical Biology 284 (2011) 42–51. Journal homepage: www.elsevier.com/locate/yjtbi

iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites

Xuan Xiao a,b,*, Zhi-Cheng Wu a, Kuo-Chen Chou b

a Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
b Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA

Article history: Received 11 February 2011; received in revised form 31 May 2011; accepted 4 June 2011; available online 14 June 2011.

Abstract

In the last two decades or so, although many computational methods have been developed for predicting the subcellular locations of proteins from their sequence information, the problem remains challenging, particularly when the system concerned contains both single- and multiple-location proteins. Also, among the existing methods, very few were developed specifically for dealing with viral proteins, those generated by viruses.
Actually, knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very important because it is closely related to their destructive tendencies and consequences. In this paper, by introducing the “multi-label scale” and by hybridizing the gene ontology information with the sequential evolution information, a predictor called iLoc-Virus is developed. It can be utilized to identify viral proteins among the following six locations: (1) viral capsid, (2) host cell membrane, (3) host endoplasmic reticulum, (4) host cytoplasm, (5) host nucleus, and (6) secreted. The iLoc-Virus predictor not only predicts the location sites of viral proteins in a host cell more accurately, but also has the capacity to deal with virus proteins having more than one location. As a user-friendly web-server, iLoc-Virus is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Virus. Meanwhile, a step-by-step guide is provided on how to use the web-server to get the desired results. Furthermore, for the user’s convenience, the iLoc-Virus web-server also accepts batch job submission. It is anticipated that iLoc-Virus may become a useful high throughput tool for both basic research and drug development. © 2011 Elsevier Ltd. All rights reserved.

Keywords: Multiplex proteins; Locative proteins; Accumulation-layer scale; ML-KNN; Absolute accuracy

1. Introduction

With the avalanche of protein sequences generated in the post-genomic age, many computational methods have been established for timely identifying their subcellular localization according to the sequence information alone. These methods were generally developed along three directions.
One is to enhance the power of practical application by enlarging the coverage scope, such as from covering only 2 subcellular location sites (Nakashima and Nishikawa, 1994), to 5 location sites (Cedano et al., 1997), to 12 location sites (Chou and Elrod, 1999; Park and Kanehisa, 2003), and to 22 location sites (Chou and Shen, 2010c). The second direction is to extract more useful information from protein sequences via different approaches or models, such as from the model of targeting or leader sequences (Emanuelsson et al., 2000; Nakai and Kanehisa, 1991), to the amino acid composition (Cedano et al., 1997; Reinhardt and Hubbard, 1998), to the various modes (Chen and Li, 2007; Ding and Zhang, 2008; Li and Li, 2008; Liu et al., 2010; Pan et al., 2003; Xiao et al., 2006b) of pseudo-amino acid composition (Chou, 2001), and to the higher-level forms of pseudo-amino acid composition by incorporating the functional domain information, gene ontology information, and sequential evolution information. The third direction is to develop predictors focused on different organisms (Chou and Shen, 2008a) such as human, plant (Small et al., 2004), and bacteria (Gardy et al., 2005). For the details about the developing process, see two comprehensive review articles (Chou and Shen, 2007; Nakai, 2000) as well as a long list of references cited therein.

* Corresponding author at: Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China (X. Xiao). doi:10.1016/j.jtbi.2011.06.005

However, very few methods were developed with a focus on viral proteins, those generated by viruses. Actually, the knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very important because it is closely related to their destructive tendencies and consequences.
Therefore, it would be especially meaningful to develop methods for predicting the locations of viral proteins in a virus-infected cell since they are intimately associated with human health, medical science, and the design of antiviral drugs. In 2007 a predictor called “Virus-PLoc” (Shen and Chou, 2007) was proposed for identifying the subcellular locations of viral proteins. In that predictor, a protein sample was first formulated by incorporating the gene ontology information, which is a set of defined terms to describe cellular component, molecular function and biological process (Ashburner et al., 2000). Gene ontology can be represented by a directed acyclic graph or digraph (Chou, 1989), where the terms are nodes and the relationships among terms are edges. These edges describe the parent–child relationships between nodes. A node in that graph may have one or more parents. Such graphic structural characteristics enable powerful grouping, searching, and analysis of genes and gene products. Because gene ontology can provide a uniform form for the annotations of proteins and genes, it is quite useful for bench biologists and computational biologists to share their research data. In recent years, the applications of gene ontology have been increasingly reported in various areas of bioinformatics. As indicated by Virus-PLoc (Shen and Chou, 2007), when the protein samples were formulated via gene ontology, the prediction quality was improved remarkably. However, since the gene ontology database was still far from complete in 2007, for those protein samples that could not be formulated via gene ontology, the pseudo-amino acid composition (Chou, 2001) was adopted in Virus-PLoc as a backup.
Besides, Virus-PLoc (Shen and Chou, 2007) does not have the capacity to deal with multiplex proteins that may simultaneously exist at, or move between, two or more different subcellular location sites. Proteins with multiple locations or dynamic features of this kind are particularly worthy of notice because they may have some special functions (Smith, 2008) useful for both drug development (Wong and Ng, 2009) and basic research (Glory and Murphy, 2007). To enable Virus-PLoc (Shen and Chou, 2007) to deal with multiplex viral proteins as well, a predictor called Virus-mPLoc (Shen and Chou, 2010b) was developed recently, where the character “m” in front of “PLoc” stands for “multiple”, meaning that it can also be used to deal with viral proteins with multiple locations. However, Virus-mPLoc has the following shortcomings. (1) In formulating the protein samples, only the integers 0 and 1 were used to reflect the GO (gene ontology) information (Ashburner et al., 2000; Camon et al., 2004). Such an over-simplified formulation might cause some useful information to be lost and thus limit the prediction quality. (2) In predicting the number of subcellular location sites for a query protein, an optimal threshold factor $\theta^{*}$ (see Eq. (48) of Chou and Shen (2007)) was adopted without providing its statistical implication and detailed learning process. It would be more instructive to find a more intuitive approach that treats this problem in a more natural manner. (3) Although a web-server for Virus-mPLoc has been established at http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/, only one query protein sequence at a time is allowed when using the web-server to conduct prediction. For the convenience of users in handling many query viral protein sequences, such a rigid limit should be relaxed. (4) Particularly, Virus-mPLoc lacks the function for batch job submission.
For those users who need to identify the subcellular locations of a large number of query virus proteins, it is important for the web-server to have such a function.

The present study was devoted to developing a new and more powerful predictor for virus protein subcellular localization by addressing the above four problems. To establish a really useful statistical predictor for protein systems, the following things often need to be considered (Chou, 2011): (1) benchmark dataset construction or selection, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. Below, let us describe how to cope with these procedures.

2. Materials

We chose to use the same dataset S constructed by Shen and Chou (2010b) in establishing Virus-mPLoc as the benchmark dataset for the current study. The reasons for doing so are as follows. (1) The dataset was constructed specifically for viral proteins. (2) None of the proteins included in S has ≥25% pairwise sequence identity to any other in the same subcellular location; compared with most of the other benchmark datasets for studying protein subcellular location prediction, the dataset S is much more rigorous in excluding homology bias and redundancy. (3) It also contains proteins with more than one location and hence can be used to train and test a predictor aimed at dealing with proteins having both single and multiple location sites. (4) Using the dataset S will also make it easier to compare the new predictor with the existing one because the results by Virus-mPLoc on S have been well documented and reported (Shen and Chou, 2010b). The dataset S contains 207 viral protein sequences, of which 165 belong to one subcellular location, 39 to two locations, 3 to three locations, and none to four or more locations. The dataset covers 6 subcellular locations (Fig. 1), and hence can be formulated as

$$ S = S_1 \cup S_2 \cup S_3 \cup S_4 \cup S_5 \cup S_6 \tag{1} $$

where $S_1$ represents the subset for the subcellular location of “viral capsid”, $S_2$ for “host cell membrane”, $S_3$ for “host endoplasmic reticulum”, and so forth (Table 1); while $\cup$ represents the symbol for “union” in set theory. To avoid homology bias and redundancy, none of the proteins in S has ≥25% pairwise sequence identity to any other in the same subset. For convenience, hereafter let us just use the subscripts of Eq. (1) as the codes of the 6 location sites; i.e., “1” for “viral capsid”, “2” for “host cell membrane”, “3” for “host endoplasmic reticulum”, and so forth (Table 2). For readers’ convenience, the corresponding accession numbers and protein sequences in S are given in Online Supporting Information S1.

Note that because some virus proteins may simultaneously exist in two or more locations, it is instructive to introduce the concept of “locative protein” as briefed below. If a protein coexists at two different subcellular location sites, it will be counted as two locative proteins; if it coexists at three location sites, it will be counted as three locative proteins; and so forth. Thus, the number of total locative proteins can be expressed as

$$ N(\mathrm{loc}) = N(\mathrm{seq}) + \sum_{m=1}^{M} (m-1)\,N(m) \tag{2} $$

where N(loc) is the number of total locative proteins, N(seq) the number of total different protein sequences, N(1) the number of proteins with one location, N(2) the number of proteins with two locations, and so forth; while M = 6 is the number of total subcellular location sites concerned (cf. Eq. (1)). As we can see from Eq. (2), the number of total locative proteins is generally greater than that of total different protein sequences. The two are the same if and only if all the proteins have a single location site. The benchmark dataset used in this study covers 6 subcellular locations (Fig. 1) with a total of 207 different virus protein sequences, of which 165 belong to one subcellular location, 39 to two locations, 3 to three locations, and none to four or more locations (Shen and Chou, 2010b). Substituting these data into Eq. (2), we obtain

$$ N(\mathrm{loc}) = N(\mathrm{seq}) + (1-1)\times 165 + (2-1)\times 39 + (3-1)\times 3 + \sum_{m=4}^{6} (m-1)\times 0 = 207 + 39 + 6 = 252 \tag{3} $$

which is fully consistent with the figures in Table 1 of Shen and Chou (2010b).

Fig. 1. Illustration to show the 6 subcellular locations of viral proteins. The 6 locations are: (1) viral capsid, (2) host cell membrane, (3) host endoplasmic reticulum, (4) host cytoplasm, (5) host nucleus, and (6) secreted.

Table 1. Breakdown of the viral protein benchmark dataset S taken from Shen and Chou (2010b). None of the proteins included here has ≥25% sequence identity to any other in the same subcellular location.

| Subset | Subcellular location | Number of proteins |
|---|---|---|
| S1 | Viral capsid | 8 |
| S2 | Host cell membrane | 33 |
| S3 | Host endoplasmic reticulum | 20 |
| S4 | Host cytoplasm | 87 |
| S5 | Host nucleus | 84 |
| S6 | Secreted | 20 |
| Total number of locative proteins N(loc) | | 252 (a) |
| Total number of different proteins N(seq) | | 207 (b) |

(a) See Eqs. (36)–(38) of Chou and Shen (2007) for the definition of the number of locative proteins, and its relation with the number of different proteins.
(b) Of the 207 different viral proteins, 165 belong to one subcellular location, 39 to two locations, and 3 to three locations. See Online Supporting Information S1 for the protein sequences.

Table 2. A comparison of the jackknife success rates by Virus-mPLoc (Shen and Chou, 2010b) and the current iLoc-Virus on the benchmark dataset S (cf. Online Supporting Information S1) that covers 6 location sites of viral proteins, in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same location.

| Code | Subcellular location | Jackknife success rate, Virus-mPLoc (a) | Jackknife success rate, iLoc-Virus (b) |
|---|---|---|---|
| 1 | Viral capsid | 8/8 = 100.0% | 8/8 = 100.0% |
| 2 | Host cell membrane | 19/33 = 57.6% | 25/33 = 75.8% |
| 3 | Host endoplasmic reticulum | 13/20 = 65.0% | 15/20 = 75.0% |
| 4 | Host cytoplasm | 52/87 = 59.8% | 64/87 = 73.6% |
| 5 | Host nucleus | 51/84 = 60.7% | 70/84 = 83.3% |
| 6 | Secreted | 9/20 = 45.0% | 15/20 = 75.0% |
| | Overall | 152/252 = 60.3% (c) | 197/252 = 78.2% (c) |

(a) The predictor from Shen and Chou (2010a).
(b) The predictor proposed in this paper.
(c) Note that instead of 207 (the number of total different viral proteins), here we use 252 (the number of total different locative proteins) for the denominator. This is because some of the viral proteins in S may have more than one location site. See footnotes a and b of Table 1 for further explanation.

3. Methods

To develop a powerful method for statistically predicting protein subcellular localization according to the sequence information, one of the most important things is to formulate the protein sequences with an effective mathematical expression that can truly reflect the intrinsic correlation with their subcellular localization (Chou, 2011). However, this is by no means a trivial job because it is usually not easy to find this kind of correlation. The most straightforward method to formulate the sample of a query protein P is to use its entire amino acid sequence, which can generally be written as

$$ \mathbf{P} = R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L \tag{4} $$

where $R_1$ represents the 1st residue of the protein P, $R_2$ the 2nd residue, …, $R_L$ the L-th residue, and they each belong to one of the 20 native amino acids.
To identify its subcellular location(s), one straightforward and preliminary method was to use sequence-similarity-search-based tools, such as BLAST (Altschul, 1997; Wootton and Federhen, 1993), to search protein databases for those proteins that have high sequence similarity to the query protein P. Subsequently, the subcellular location annotations of the proteins thus found were used to deduce the subcellular location(s) of P. Unfortunately, although it was quite intuitive and able to contain the entire information of a protein sequence, this kind of straightforward sequential model failed to work when the query protein P did not have significant sequence similarity to any location-known proteins.

Thus, various non-sequential or discrete models to formulate protein samples were proposed in the hope of establishing some sort of correlation or cluster manner by which the prediction quality could be improved. Among the discrete models for a protein sample, the simplest one is its amino acid (AA) composition or AAC (Chou and Zhang, 1994). According to the AAC-discrete model, the protein P of Eq. (4) can be formulated by (Chou, 1995b; Nakashima and Nishikawa, 1994)

$$ \mathbf{P} = \begin{bmatrix} f_1 & f_2 & \cdots & f_{20} \end{bmatrix}^{\mathrm{T}} \tag{5} $$

where $f_i \ (i = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in protein P, and T the transposing operator. Many methods for predicting protein subcellular localization were based on the AAC-discrete model (see, e.g., Cedano et al., 1997; Chou and Elrod, 1999; Nakashima and Nishikawa, 1994; Reinhardt and Hubbard, 1998; Zhou and Doctor, 2003). However, as we can see from Eq. (5), if using the AAC model to represent the protein P, all its sequence-order effects would be lost, and hence the prediction quality might be limited.
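The loss of sequence-order information under the AAC model can be illustrated with a minimal sketch. The function name `aac` and the two toy sequences are hypothetical and are not taken from the paper; the only assumption is that the 20 frequencies are ordered alphabetically by single-letter code.

```python
from collections import Counter

# The 20 native amino acids in alphabetical order of their single-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Return the 20 normalized occurrence frequencies f_1..f_20 of Eq. (5)."""
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no native amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

seq = "MKVLAAGIVK"        # hypothetical toy sequence
shuffled = "KVIGAALVKM"   # same residues in a different order

# Both sequences map to exactly the same AAC vector:
# the order of residues leaves no trace in Eq. (5).
print(aac(seq) == aac(shuffled))
```

Any two sequences that are permutations of each other collapse onto the same point in the 20-dimensional AAC space, which is the limitation that the pseudo-amino acid composition is meant to ease.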
To avoid completely losing the sequence-order information, the pseudo-amino acid composition (PseAAC) was proposed to represent the sample of a protein, as formulated by (Chou, 2001)

$$ \mathbf{P} = \begin{bmatrix} p_1 & p_2 & \cdots & p_{20} & p_{20+1} & \cdots & p_{20+\lambda} \end{bmatrix}^{\mathrm{T}} \tag{6} $$

where the first 20 elements are associated with the 20 elements in Eq. (5) or the 20 amino acid components of the protein P, while the additional $\lambda$ factors are used to incorporate some sequence-order information via a series of rank-different correlation factors along a protein chain. The concept of PseAAC has been widely used to study various problems in proteins and protein-related systems, such as predicting protein folding rates (Guo et al., 2011), protein structural class prediction (Sahu and Panda, 2010; Xiao et al., 2006a), supersecondary structure prediction (Zou et al., 2011), protein secondary structure content prediction (Chen et al., 2009), protein quaternary structural attribute prediction (Shen and Chou, 2009b), fold pattern prediction (Shen and Chou, 2009a), ion channel prediction (Lin and Ding, 2011), predicting enzymes and their family/sub-family classification (Qiu et al., 2010; Zhou et al., 2007), protein subcellular location prediction (Li and Li, 2008), apoptosis protein subcellular location prediction (Kandaswamy et al., 2010), predicting protein subchloroplast locations (Du et al., 2009), predicting protein submitochondria locations (Nanni and Lumini, 2008; Zeng et al., 2009), predicting membrane proteins and their types (Hayat and Khan, 2011), discrimination of outer membrane proteins (Lin, 2008), identifying proteases and their types (Chou and Shen, 2008b), identifying GPCRs and their classes (Xiao et al., 2011b), prediction of nuclear receptors (Gao et al., 2009), prediction of cyclin proteins (Mohabatkar, 2010), identifying bacterial secreted proteins (Yu et al., 2010), identifying risk type of human papillomaviruses (Esmaeili et al., 2010), prediction of cell wall lytic enzymes (Ding et al., 2009), prediction of lipases types (Zhang et al., 2008), predicting conotoxin superfamily and family (Mondal et al., 2006), predicting the cofactors of oxidoreductases (Zhang and Fang, 2008), and others (e.g., Georgiou et al., 2009). According to Chou (2011), the PseAAC for a protein P can be generally formulated as

$$ \mathbf{P} = \begin{bmatrix} \psi_1 & \psi_2 & \cdots & \psi_u & \cdots & \psi_\Omega \end{bmatrix}^{\mathrm{T}} \tag{7} $$

where the subscript $\Omega$ is an integer, and its value as well as the components $\psi_1, \psi_2, \ldots$ will depend on how to extract the desired information from the amino acid sequence of P (cf. Eq. (4)). As a general form, Eq. (7) can cover various different modes of PseAAC.
For example, when its elements are given by

$$ \psi_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (20+1 \le u \le 20+\lambda = \Omega;\ \lambda < L) \end{cases} \tag{8} $$

we immediately obtain the formulation of PseAAC as originally introduced in Chou (2001), where $w$ is the weight factor for the sequence-order effect, $\theta_j$ the j-th tier correlation factor reflecting the sequence-order correlation between all the j-th most contiguous residues along a protein chain, and $\lambda$ is an integer parameter for the maximum number of correlation tiers to be considered. Readers can also find a concise description of Eq. (8), as well as the definition of each of the symbols therein, in the Wikipedia article about the pseudo-amino acid composition at http://en.wikipedia.org/wiki/Pseudo_amino_acid_composition. Below, let us use the general form of PseAAC (Eq. (7)) to find the formulations that reflect the core and essential features of protein samples that are closely correlated with their subcellular localization.

3.1. GO (gene ontology) formulation

The GO database (Ashburner et al., 2000) was established according to the molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way that better reflects their subcellular locations (Chou and Shen, 2007; Chou and Shen, 2008a). However, in order to incorporate more information, instead of only using 0 and 1 elements as done in Shen and Chou (2010a), here let us use a different approach as described below.

Step 1. Compression and reorganization of the existing GO numbers. The GO database (version 74.0 released 30 July 2009) contains many GO numbers. However, these numbers do not increase successively and orderly. For easier handling, a reorganization and compression procedure was applied to renumber them.
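The renumbering in Step 1 can be sketched as follows. This is a minimal illustration only: it assumes that the compression simply ranks the distinct GO identifiers in ascending numeric order, which the text implies but does not spell out, and the function name `compress_go_ids` is hypothetical.

```python
def compress_go_ids(go_ids):
    """Map distinct GO identifiers to consecutive integers starting at 1.

    Assumption: identifiers are ranked by the numeric part after "GO:".
    """
    ordered = sorted(set(go_ids), key=lambda g: int(g.split(":")[1]))
    return {go: rank for rank, go in enumerate(ordered, start=1)}

# A handful of GO numbers with gaps between them (duplicates are ignored).
sample = [
    "GO:0000001", "GO:0000002", "GO:0000003", "GO:0000009",
    "GO:0000011", "GO:0000012", "GO:0000015", "GO:0000009",
]
table = compress_go_ids(sample)

for go, rank in table.items():
    # Render the compressed index in the five-digit style used in the text.
    print(f"{go} -> GO_compress: {rank:05d}")
```

With this kind of lookup table, every GO number hit by a protein can be converted directly into a component index of the GO-space vector.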
For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000009, GO:0000011, GO:0000012, GO:0000015, …, GO:0090204 would become GO_compress: 00001, GO_compress: 00002, GO_compress: 00003, GO_compress: 00004, GO_compress: 00005, GO_compress: 00006, GO_compress: 00007, …, GO_compress: 11118, respectively. The GO database obtained through such a treatment is called the GO_compress database, which contains 11,118 numbers increasing successively from 1 to the last one.

Step 2. Using Eq. (7) with $\Omega$ = 11,118, the protein P can be formulated as

$$ \mathbf{P}_{\mathrm{GO}} = \begin{bmatrix} \psi^{G}_{1} & \psi^{G}_{2} & \cdots & \psi^{G}_{u} & \cdots & \psi^{G}_{11118} \end{bmatrix}^{\mathrm{T}} \tag{9} $$

where $\psi^{G}_{u} \ (u = 1, 2, \ldots, 11{,}118)$ are defined via the following steps.

Step 3. Use BLAST (Schaffer et al., 2001) to search for the homologous proteins of the protein P in the Swiss-Prot database (version 55.3), with the expect value E ≤ 0.001 for the BLAST parameter.

Step 4. Those proteins which have ≥60% pairwise sequence identity with the protein P are collected into a set, $S^{\mathrm{homo}}_{P}$, called the “homology set” of P. All the elements in $S^{\mathrm{homo}}_{P}$ can be deemed as the “representative proteins” of P, sharing some similar attributes such as structural conformations and biological functions (Chou, 2004; Gerstein and Thornton, 2003; Loewenstein et al., 2009). Because they were retrieved from the Swiss-Prot database, these representative proteins must each have their own accession numbers.

Step 5. Search each of the accession numbers collected in Step 4 against the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO numbers (Camon et al., 2003).

Step 6. Based on the results obtained in Step 5, the elements in Eq. (9) can be written as

$$ \psi^{G}_{u} = \frac{\sum_{k=1}^{N^{\mathrm{homo}}_{P}} \delta(u,k)}{N^{\mathrm{homo}}_{P}} \quad (u = 1, 2, \ldots, 11{,}118) \tag{10} $$

where $N^{\mathrm{homo}}_{P}$ is the number of representative proteins in $S^{\mathrm{homo}}_{P}$, and

$$ \delta(u,k) = \begin{cases} 1, & \text{if the } k\text{-th representative protein hits the } u\text{-th GO\_compress number} \\ 0, & \text{otherwise} \end{cases} \tag{11} $$

As we can see from Eq. (9), the GO formulation derived from the above steps consists of 11,118 real numbers rather than only the elements 0 and 1 as in the GO formulation adopted in Shen and Chou (2010b).

Note that the GO formulation of Eq. (9) may become a naught vector or meaningless under either of the following situations: (1) the protein P does not have significant homology to any protein in the Swiss-Prot database, i.e., $S^{\mathrm{homo}}_{P} = \emptyset$, meaning the homology set $S^{\mathrm{homo}}_{P}$ is an empty one; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset. Under such a circumstance, let us consider using the sequential evolution formulation to represent the protein P, as described below.

3.2. SeqEvo (sequential evolution) formulation

Biology is a natural science with a historic dimension. All biological species have developed starting out from a very limited number of ancestral species. The same is true for protein sequences (Chou, 2004). Their evolution involves changes of single residues, insertions and deletions of several residues (Chou, 1995a), gene doubling, and gene fusion. With these changes accumulated over a long period of time, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing in the same subcellular location.

To incorporate the sequential evolution information into the general form of PseAAC (Eq. (7)), here let us use the information of the Position-Specific Scoring Matrix (PSSM) (Schaffer et al., 2001), as described below.

Step 1. According to Schaffer et al.
(2001), the sequential evolution information of protein P can be expressed by a 20 × L matrix as given by

$$ \mathrm{PSSM} = \begin{bmatrix} E^{0}_{1\to 1} & E^{0}_{2\to 1} & \cdots & E^{0}_{L\to 1} \\ E^{0}_{1\to 2} & E^{0}_{2\to 2} & \cdots & E^{0}_{L\to 2} \\ \vdots & \vdots & \ddots & \vdots \\ E^{0}_{1\to 20} & E^{0}_{2\to 20} & \cdots & E^{0}_{L\to 20} \end{bmatrix} \tag{12} $$

where L is the length of P (counted in the total number of its constituent amino acids as shown in Eq. (4)), and $E^{0}_{i\to j}$ represents the score of the amino acid residue in the i-th position of the protein sequence being changed to amino acid type j during the evolutionary process. Here, the numerical codes 1, 2, …, 20 are used to denote the 20 native amino acid types according to the alphabetical order of their single character codes. The 20 × L scores in Eq. (12) were generated by using PSI-BLAST (Schaffer et al., 2001) to search the UniProtKB/Swiss-Prot database (Release 2010_04 of 23-Mar-2010) through three iterations with 0.001 as the E-value cutoff for multiple sequence alignment against the sequence of the protein P. However, according to the formulation of Eq. (12), proteins with different lengths will correspond to matrices with different numbers of columns, causing difficulty for developing a predictor able to uniformly cover proteins of any length. To make the descriptor become a size-uniform matrix, let us consider the following steps.

Step 2. Use the elements in the PSSM of Eq. (12) to define a new matrix M as formulated by

$$ \mathbf{M} = \begin{bmatrix} E_{1\to 1} & E_{2\to 1} & \cdots & E_{L\to 1} \\ E_{1\to 2} & E_{2\to 2} & \cdots & E_{L\to 2} \\ \vdots & \vdots & \ddots & \vdots \\ E_{1\to 20} & E_{2\to 20} & \cdots & E_{L\to 20} \end{bmatrix} \tag{13} $$

with

$$ E_{i\to j} = \frac{E^{0}_{i\to j} - \bar{E}^{0}_{j}}{\mathrm{SD}(\bar{E}^{0}_{j})} \quad (i = 1, 2, \ldots, L;\ j = 1, 2, \ldots, 20) \tag{14} $$

where

$$ \bar{E}^{0}_{j} = \frac{1}{L} \sum_{i=1}^{L} E^{0}_{i\to j} \quad (j = 1, 2, \ldots, 20) \tag{15} $$

is the mean for $E^{0}_{i\to j} \ (i = 1, 2, \ldots, L)$ and

$$ \mathrm{SD}(\bar{E}^{0}_{j}) = \sqrt{\frac{\sum_{i=1}^{L} \left[ E^{0}_{i\to j} - \bar{E}^{0}_{j} \right]^{2}}{L}} \tag{16} $$

is the corresponding standard deviation.

Step 3. Introduce a new matrix generated by multiplying M with its own transpose matrix $\mathbf{M}^{\mathrm{T}}$; i.e.,

$$ \mathbf{M}\mathbf{M}^{\mathrm{T}} = \begin{bmatrix} \sum_{i=1}^{L} E_{i\to 1} E_{i\to 1} & \sum_{i=1}^{L} E_{i\to 1} E_{i\to 2} & \cdots & \sum_{i=1}^{L} E_{i\to 1} E_{i\to 20} \\ \sum_{i=1}^{L} E_{i\to 2} E_{i\to 1} & \sum_{i=1}^{L} E_{i\to 2} E_{i\to 2} & \cdots & \sum_{i=1}^{L} E_{i\to 2} E_{i\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{L} E_{i\to 20} E_{i\to 1} & \sum_{i=1}^{L} E_{i\to 20} E_{i\to 2} & \cdots & \sum_{i=1}^{L} E_{i\to 20} E_{i\to 20} \end{bmatrix} \tag{17} $$

which contains 20 × 20 = 400 elements. Since $\mathbf{M}\mathbf{M}^{\mathrm{T}}$ is a symmetric matrix, we only need the information of its 210 elements, of which 20 are the diagonal elements and (400 − 20)/2 = 190 are the lower triangular elements, to formulate the protein P; i.e., the general PseAAC form of Eq. (7) can now be formulated as

$$ \mathbf{P}_{\mathrm{Evo}} = \begin{bmatrix} \psi^{E}_{1} & \psi^{E}_{2} & \cdots & \psi^{E}_{u} & \cdots & \psi^{E}_{210} \end{bmatrix}^{\mathrm{T}} \tag{18} $$

where the components $\psi^{E}_{u} \ (u = 1, 2, \ldots, 210)$ are, respectively, taken from the 210 diagonal and lower triangular elements of Eq. (17) by following a given order, say from left to right and from the 1st row to the last, as illustrated by the following equation:

$$ \begin{bmatrix} (1) & & & & \\ (2) & (3) & & & \\ (4) & (5) & (6) & & \\ \vdots & \vdots & \vdots & \ddots & \\ (191) & (192) & (193) & \cdots & (210) \end{bmatrix} \tag{19} $$

where the numbers in parentheses indicate the order of elements taken from Eq. (17) for Eq. (18).

3.3. The self-consistency formulation principle

Regardless of which formulation is used to represent protein samples, the following self-consistency principle must be observed during the course of prediction: if the query protein P was defined in the form of $\mathbf{P}_{\mathrm{GO}}$ (see Eq. (9)), then all the protein samples used to train the prediction engine should also be expressed in the GO formulation; if the query protein was defined in the form of $\mathbf{P}_{\mathrm{Evo}}$ (see Eq. (18)), then all the training data should be expressed in the SeqEvo formulation as well. Below, let us consider the algorithm or operation engine for conducting the prediction.

3.4. Multi-label K-nearest neighbor (KNN) classifier

In this study, let us introduce a novel classifier, called the multi-label KNN or ML-KNN classifier, to predict the subcellular localization for systems that contain both single-location and multiple-location proteins.

Suppose the m-th subset $S_m$ of S (Eq. (1)) contains $N_m$ proteins, and P(m,j) is the j-th one in that subset. Thus, we have

$$ \mathbf{P}(m,j) = \begin{cases} \mathbf{P}_{\mathrm{GO}}(m,j), & \text{in GO space} \\ \mathbf{P}_{\mathrm{Evo}}(m,j), & \text{in SeqEvo space} \end{cases} \quad (m = 1, 2, \ldots, 6;\ j = 1, 2, \ldots, N_m) \tag{20} $$

where $\mathbf{P}_{\mathrm{GO}}(m,j)$ and $\mathbf{P}_{\mathrm{Evo}}(m,j)$ have the same forms as $\mathbf{P}_{\mathrm{GO}}$ (Eq. (9)) and $\mathbf{P}_{\mathrm{Evo}}$ (Eq. (18)), respectively; the only difference is that the corresponding constituent elements are derived from the amino acid sequence of P(m,j) instead of P.

In sequence analysis, there are many different scales to define the distance between two proteins, such as Euclidean distance, Hamming distance (Mardia et al., 1979), and Mahalanobis distance (Chou and Zhang, 1994; Mahalanobis, 1936; Pillai, 1985). In Chou and Shen (2010b), the distance between P(m,j) and P was defined by $1 - \cos^{-1}[\mathbf{P}, \mathbf{P}(m,j)]$. However, we have observed that when the GO descriptor was formulated with real numbers, better outcomes were obtained by using the Euclidean metric; i.e., the distance between P and P(m,j) should be defined here by

$$ D\{\mathbf{P}, \mathbf{P}(m,j)\} = \left\| \mathbf{P} - \mathbf{P}(m,j) \right\| \tag{21} $$

where $\| \mathbf{P} - \mathbf{P}(m,j) \|$ represents the module of the vector difference between P and P(m,j) in the Euclidean space. According to Eq. (21), when $\mathbf{P} \equiv \mathbf{P}(m,j)$ we have $D\{\mathbf{P}, \mathbf{P}(m,j)\} = 0$, indicating that the distance between these two protein sequences is zero and hence they have perfect or 100% similarity.

Suppose $\mathbf{P}^{*}_{1}, \mathbf{P}^{*}_{2}, \ldots, \mathbf{P}^{*}_{K}$ are the K nearest neighbor proteins to the protein P, forming a set denoted by $S^{P}_{K}$, which is a subset of S; i.e., $S^{P}_{K} \subseteq S$. Based on the K nearest neighbor proteins in $S^{P}_{K}$, let us define an accumulation-layer (AL) scale, given by

$$ Q(\mathbf{P}, K) = \left\{ \rho^{K}_{1} \ \ \rho^{K}_{2} \ \ \rho^{K}_{3} \ \ \rho^{K}_{4} \ \ \rho^{K}_{5} \ \ \rho^{K}_{6} \right\} \tag{22} $$

where

$$ \rho^{K}_{m} = \frac{\sum_{i=1}^{K} \delta(\mathbf{P}^{*}_{i}, m)}{N^{*}_{K}} \quad (m = 1, 2, \ldots, 6) \tag{23} $$

where

$$ \delta(\mathbf{P}^{*}_{i}, m) = \begin{cases} 1, & \text{if } \mathbf{P}^{*}_{i} \text{ belongs to the } m\text{-th location} \\ 0, & \text{otherwise} \end{cases} \tag{24} $$

and

$$ N^{*}_{K} = \sum_{m=1}^{6} \sum_{i=1}^{K} \delta(\mathbf{P}^{*}_{i}, m) \tag{25} $$

Note that $N^{*}_{K} \ge K$ because a protein may belong to one or more subcellular location sites in the current system.

Now, for a query protein P, its subcellular location(s) will be predicted according to the following steps.

Step 1. The number of different subcellular locations to which it belongs will be determined by its nearest neighbor protein in S. For example, suppose $\mathbf{P}^{*}$ is the nearest protein to P in S. If $\mathbf{P}^{*}$ has only one subcellular location, then P will also have only one location; if $\mathbf{P}^{*}$ has two subcellular locations, then P will also have two locations; and so forth. In general, if $\mathbf{P}^{*}$ belongs to M different location sites, then P will be predicted to have the same number, M, of subcellular locations as well, as can be formulated by

$$ M = \mathrm{Num}\{\mathbf{P}^{*} \Rightarrow \mathbb{L}\} = \mathrm{Num}\{\mathbf{P} \Rightarrow \mathbb{L}\} \tag{26} $$

where M is an integer (≤ 6), $\mathrm{Num}\{\mathbf{P}^{*} \Rightarrow \mathbb{L}\}$ represents the number of different subcellular locations to which $\mathbf{P}^{*}$ belongs, and $\mathrm{Num}\{\mathbf{P} \Rightarrow \mathbb{L}\}$ the number of different subcellular locations to which P belongs.

Step 2. However, the concrete location site(s) to which P belongs will not be determined by the location site(s) of $\mathbf{P}^{*}$, but by the element(s) in Eq. (22) that has (have) the highest score(s), as can be expressed by $\{\ell\}$, the subscript(s) of Eq. (1). For example, if P is found belonging to only one location (M = 1) in Step 1, and the highest score in Eq. (22) is $\rho^{K}_{3}$, then P will be predicted as $\{\ell\} = 3$, meaning that it belongs to $S_3$ or resides at “host endoplasmic reticulum” (cf. Table 1). If P is found belonging to three locations (M = 3), and the first three highest scores in Eq.
(22) are $\rho^{K}_{1}$, $\rho^{K}_{2}$, and $\rho^{K}_{6}$, then $\mathbf{P}$ will be predicted as $\{\ell\} = (1, 2, 6)$, meaning that it belongs to $S_1$, $S_2$, and $S_6$ or resides simultaneously at "viral capsid", "host cell membrane", and "secreted". And so forth. In other words, the concrete predicted subcellular location(s) can be formulated as

\[
\{\ell\} = \mathrm{Max} \triangleright M \triangleright \mathrm{Sub} \left\{ \rho^{K}_{1} \;\; \rho^{K}_{2} \;\; \rho^{K}_{3} \;\; \rho^{K}_{4} \;\; \rho^{K}_{5} \;\; \rho^{K}_{6} \right\} \quad (M \leq 6)
\tag{27}
\]

where the operator "$\mathrm{Max} \triangleright M \triangleright \mathrm{Sub}$" means identifying the $M$ highest scores among the elements in the brackets right after it, followed by taking their $M$ subscripts.

The entire classifier thus established is called iLoc-Virus, which can be used to predict the subcellular localization of both singleplex and multiplex viral proteins. To provide an intuitive picture, a flowchart is provided in Fig. 2 to illustrate the prediction process of iLoc-Virus.

Fig. 2. A flowchart to show the prediction process of iLoc-Virus.

3.5. Protocol guide

For the user's convenience, a web-server for iLoc-Virus was established. Below, let us give a step-by-step guide on how to use it to get the desired results.

Step 1. Open the web server at http://icpr.jci.edu.cn/bioinfo/iLoc-Virus and you will see the top page of the predictor on your computer screen, as shown in Fig. 3. Click on the Read Me button to see a brief introduction to the iLoc-Virus predictor and the caveat when using it.

Fig. 3. A semi-screenshot to show the top page of the iLoc-Virus web-server. Its website address is http://icpr.jci.edu.cn/bioinfo/iLoc-Virus.

Step 2. Either type or copy and paste the query protein sequence into the input box at the center of Fig. 3. The input sequence should be in FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (">") in the first column, followed by lines of sequence data. The words right after the ">" symbol in the initial line are optional and are used only for identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box. For more information about FASTA format, visit http://en.wikipedia.org/wiki/Fasta_format. Unlike Virus-mPLoc (Shen and Chou, 2010b), where only one query protein sequence is allowed for each submission, the maximum number of query proteins for each submission is now 10.

Step 3. Click on the Submit button to see the predicted result. For example, if you use the three query protein sequences in the Example window as the input, after clicking the Submit button you will see Fig. 4 on your screen, indicating that the predicted result for the 1st query protein is "Host cytoplasm", that for the 2nd one is "Host cytoplasm; Host nucleus", and that for the 3rd one is "Host cell membrane; Host cytoplasm; Host nucleus". In other words, the 1st query protein (P04487) is a single-location one residing at "host cytoplasm" only, the 2nd one (Q65202) can simultaneously occur in two different sites ("host cytoplasm" and "host nucleus"), while the 3rd one (P21935) can simultaneously occur in three different sites ("host cell membrane", "host cytoplasm", and "host nucleus").

Fig. 4. A semi-screenshot to show the output of iLoc-Virus. The input was taken from the three protein sequences listed in the Example window of the iLoc-Virus web-server (cf. Fig. 3).

All these results are fully consistent with the experimental observation as indicated in the Online Supporting Information S1.
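The FASTA conventions and the per-submission limit described in Step 2 can be checked locally before pasting sequences into the input box. The following is only a minimal sketch under stated assumptions, not part of the iLoc-Virus server: the function names `parse_fasta` and `check_submission` are hypothetical, the two limits (10 sequences per submission, 120 characters per line) are taken from the text of Step 2, and the example sequences are made up.

```python
from typing import List, Tuple

MAX_SEQUENCES = 10     # per-submission limit stated in Step 2
MAX_LINE_LENGTH = 120  # maximum line length stated in Step 2


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Split FASTA text into (description, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text: str) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems: List[str] = []
    if any(len(line) > MAX_LINE_LENGTH for line in text.splitlines()):
        problems.append("a line is longer than 120 characters")
    try:
        records = parse_fasta(text)
    except ValueError as err:
        return problems + [str(err)]
    if not records:
        problems.append("no FASTA record found")
    if len(records) > MAX_SEQUENCES:
        problems.append("more than 10 sequences in one submission")
    for description, sequence in records:
        if not sequence:
            problems.append(f"record '{description}' has no sequence data")
    return problems


# Toy input with made-up sequences (not real viral proteins).
example = ">query1 toy sequence\nMKTAYIAKQR\nQISFVKSHFS\n>query2\nMSDNGPQNQR\n"
print(parse_fasta(example))
print(check_submission(example))
```

A length check against the 50-residue fragment threshold mentioned in the Caveat of this guide could be added to `check_submission` in the same way.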
It takes about 10 s for the above computation before the predicted result appears on your computer screen; the more query proteins and the longer each sequence, the more time is usually needed.

Step 4. As mentioned in the Introduction, Virus-mPLoc (Shen and Chou, 2010b) does not have the function to handle batch jobs but the current iLoc-Virus does. As shown on the lower panel of Fig. 3, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format) via the "Browse" button. To see a sample batch input file, click on the Batch-example button. The maximum number of query proteins for each batch input file is 50. After clicking the Batch-submit button, you will see "Your batch job is under computation; once the results are available, you will be notified by e-mail". Note that if you submit a batch input file from an Apple computer, although it looks like FASTA format, your input might change to a non-FASTA format at the server end and cause errors. Under such a circumstance, the safest way is to submit your input file in pdf format.

Step 5. Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iLoc-Virus.

Step 6. Click on the Data button to download the benchmark datasets used to train and test the iLoc-Virus predictor.

Caveat. To obtain the predicted result with the expected success rate, the entire sequence of the query protein rather than a fragment should be used as an input. A sequence with fewer than 50 amino acid residues is generally deemed a fragment. Also, if the query protein is known not to reside at any of the 6 locations shown in Fig. 1, stop the prediction because the result thus obtained will not make any sense.

4. Results and discussion

In statistical prediction, the following three methods are often used to examine the quality of a predictor: independent dataset test, subsampling test, and jackknife test (Chou and Zhang, 1995). Because the subsampling test and jackknife test can be carried out using one benchmark dataset, and because the independent dataset test can be treated as a special case of the subsampling test, one benchmark dataset suffices to serve all three kinds of cross-validation. However, as demonstrated by Eq. (1) of Chou and Shen (2010a) and elucidated in Chou and Shen (2007), among the three cross-validation methods the jackknife test is deemed the least arbitrary because it always yields a unique result for a given benchmark dataset, and hence it has been widely recognized and increasingly used to examine the power of various predictors (see, e.g., Chou and Shen, 2010b; Ding et al., 2009; Du et al., 2009; Jahandideh et al., 2009; Kannan et al., 2008; Li and Li, 2008; Lin, 2008; Mohabatkar, 2010; Xiao et al., 2011a; Zou et al., 2011). Accordingly, in this study, the jackknife test will also be used to evaluate the anticipated accuracy of iLoc-Virus.

However, even when the jackknife test is used to examine the accuracy, the same predictor may still yield obviously different success rates when tested on different benchmark datasets. This is because the more stringent a benchmark dataset is in excluding homologous sequences, the more difficult it is for a predictor to achieve a high success rate. Also, the more subsets (subcellular locations) a benchmark dataset covers, the more difficult it is to achieve a high overall success rate, as elaborated in a recent review (Chou, 2011).

As mentioned in the Materials section, the benchmark dataset used in this study is $S$ (cf. Online Supporting Information S1), which is the same benchmark dataset constructed by Shen and Chou (2010b) for Virus-mPLoc. Actually, for such a dataset containing both single-location and multiple-location viral proteins distributed among 6 subcellular location sites, so far only one existing predictor, i.e., Virus-mPLoc (Shen and Chou, 2010b), has had the capacity to deal with it. Therefore, to demonstrate the power of the current predictor, it suffices to compare iLoc-Virus with Virus-mPLoc (Shen and Chou, 2010b).

Listed in Table 2 are the results obtained with Virus-mPLoc (Shen and Chou, 2010b) and iLoc-Virus on the aforementioned benchmark dataset $S$ by the jackknife test. As we can see from Table 2, for such a stringent dataset, the overall success rate achieved by iLoc-Virus is over 78.1%, which is about 18% higher than that by Virus-mPLoc (Shen and Chou, 2010b). Note that during the course of the jackknife test by Virus-mPLoc and iLoc-Virus, the false positives (over-predictions) and false negatives (under-predictions) were also taken into account to reduce the scores in calculating the overall success rate. As for the detailed process of how to count the over-predictions and under-predictions for a system containing both single-location and multiple-location proteins, see Eqs. (43)–(48) and Fig. 4 in a comprehensive review (Chou and Shen, 2007).

To provide a more intuitive and easier-to-understand measurement, let us introduce a new scale, the so-called "absolute true" success rate, to reflect the accuracy of a predictor, as defined by

\[
\Lambda = \frac{\sum_{i=1}^{N} \Delta(i)}{N}
\tag{28}
\]

where $\Lambda$ represents the absolute true rate, $N$ the total number of proteins investigated, and

\[
\Delta(i) =
\begin{cases}
1, & \text{if all the subcellular locations of the } i\text{th protein are correctly predicted without any overprediction} \\
0, & \text{otherwise}
\end{cases}
\tag{29}
\]

According to the above definition, for a protein belonging to, say, three subcellular locations, if only two of the three are correctly predicted, or the predicted result contains a location not belonging to the three, the prediction score will be counted as 0. In other words, when and only when all the subcellular locations of a query protein are exactly predicted without any underprediction or overprediction can the prediction be scored with 1. Therefore, the absolute true scale is much stricter and harsher than the scale used previously (Chou and Shen, 2007; Shen and Chou, 2010b) in measuring the success rate. However, even with such a stringent criterion on the same benchmark dataset by the jackknife test, the overall absolute true success rate achieved by iLoc-Virus was 155/207 = 74.8%.

Why can iLoc-Virus enhance the success rate so remarkably? One of the key reasons is that the GO formulation for protein samples in iLoc-Virus contains more information than that in Virus-mPLoc (Shen and Chou, 2010b), as can be illustrated as follows. Suppose that, given a protein $\mathbf{P}(q)$, according to Steps 3 and 4 in the Section of "GO (Gene Ontology) Formulation", we found 20 proteins that were homologous to it; i.e., $N^{\mathrm{homo}}_{\mathbf{P}(q)} = 20$. Of the 20 homologous proteins, 4 hit GO_compress:00008, 16 hit GO_compress:00023, 12 hit GO_compress:00826, and all hit GO_compress:01938. Substituting these data into Eqs. (8)–(9), we have

\[
\psi^{G}_{u}(\mathbf{P}(q)) =
\begin{cases}
4/20 = 0.2, & \text{if } u = 8 \\
16/20 = 0.8, & \text{if } u = 23 \\
12/20 = 0.6, & \text{if } u = 826 \\
20/20 = 1.0, & \text{if } u = 1938 \\
0, & \text{otherwise}
\end{cases}
\quad (u = 1, 2, \ldots, 11{,}118)
\tag{30}
\]

In contrast, if the same protein was represented according to the formulation in Virus-mPLoc (Shen and Chou, 2010b), it would be

\[
\psi^{G}_{u}(\mathbf{P}(q)) =
\begin{cases}
1, & \text{if } u = 8 \\
1, & \text{if } u = 23 \\
1, & \text{if } u = 826 \\
1, & \text{if } u = 1938 \\
0, & \text{otherwise}
\end{cases}
\quad (u = 1, 2, \ldots, 11{,}118)
\tag{31}
\]

As can be clearly seen by comparing Eq. (30) with Eq. (31), although the 8th, 23rd, 826th, and 1938th components are non-zero in both formulations, in the one defined in Virus-mPLoc (Shen and Chou, 2010b) all the elements are either 1 or zero, completely ignoring their weights.
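The contrast between Eqs. (30) and (31) can be made concrete with a short sketch. It rebuilds both descriptors for the worked example (20 homologs with hit counts 4, 16, 12, and 20) and then compares them with those of a second, made-up protein that hits the same GO components with different counts, using the Euclidean distance of Eq. (21). The function names, the dictionary-based input, and the second protein are illustrative assumptions; only the numbers for the first protein come from the text.

```python
import math
from typing import Dict, List

GO_DIM = 11118  # dimension of the GO descriptor, u = 1, ..., 11,118


def go_frequency_vector(hits: Dict[int, int], n_homologs: int) -> List[float]:
    """Eq. (30)-style descriptor: fraction of homologs hitting each component."""
    vec = [0.0] * GO_DIM
    for u, count in hits.items():
        vec[u - 1] = count / n_homologs  # components are 1-indexed in the text
    return vec


def go_binary_vector(hits: Dict[int, int]) -> List[float]:
    """Eq. (31)-style descriptor: 1 if the component was hit at all, else 0."""
    vec = [0.0] * GO_DIM
    for u, count in hits.items():
        if count > 0:
            vec[u - 1] = 1.0
    return vec


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two descriptors, as in Eq. (21)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Worked example from the text: 20 homologous proteins.
hits_q = {8: 4, 23: 16, 826: 12, 1938: 20}
# Hypothetical second protein: same components hit, different counts.
hits_r = {8: 18, 23: 2, 826: 12, 1938: 20}

freq_q = go_frequency_vector(hits_q, 20)
freq_r = go_frequency_vector(hits_r, 20)

print([freq_q[u - 1] for u in (8, 23, 826, 1938)])  # [0.2, 0.8, 0.6, 1.0]
# The binary descriptors cannot tell the two proteins apart ...
print(euclidean(go_binary_vector(hits_q), go_binary_vector(hits_r)))  # 0.0
# ... whereas the frequency descriptors can.
print(euclidean(freq_q, freq_r) > 0)  # True
```

In this toy case the two proteins coincide under the binary formulation but are separated under the frequency formulation, which is the extra information the weighted GO descriptor provides to a distance-based classifier.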
In other words, the GO formulation in the current iLoc-Virus contains more information than that in Virus-mPLoc (Shen and Chou, 2010b) and hence leads to better prediction results.

The other reason is that in Virus-mPLoc (Shen and Chou, 2010b) the number of subcellular location sites for a query protein was determined by a threshold factor $\theta^{*}$ (cf. Eq. (48) in Chou and Shen (2007)) that actually functioned as a "black box" without providing any physicochemical rationale. In contrast, in the current iLoc-Virus the number of subcellular location sites for a query protein is determined according to the nearest neighbor (NN) principle (cf. Eq. (26)), and its concrete location sites are determined according to the accumulation-layer scale (cf. Eqs. (22) and (27)).

5. Conclusions

Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. The reasons why iLoc-Virus can achieve higher success rates than Virus-mPLoc are as follows. (1) The GO formulation used to represent protein samples in iLoc-Virus is formed by the probabilities of hits (cf. Eqs. (10)–(11)) and hence contains more information than that in Virus-mPLoc (Shen and Chou, 2010b), where only the number "0" or "1" was used regardless of how many hits were found for the corresponding component in the GO formulation. (2) The accumulation-layer scale introduced in iLoc-Virus is more natural and effective for dealing with proteins having both single and multiple subcellular locations.

Acknowledgments

The authors wish to thank the two anonymous reviewers, whose constructive comments were very helpful for strengthening the presentation of this study. This work was supported by grants from the National Natural Science Foundation of China (No. 60961003), the Key Project of Chinese Ministry of Education (No.
210116), and the Province National Natural Science Foundation of JiangXi (2009GZS0064 and 2010GZS0122).

Appendix A. Supplementary materials

Supplementary materials associated with this article can be found in the online version at doi:10.1016/j.jtbi.2011.06.005.

References

Altschul, S.F., 1997. Evaluating the statistical significance of multiple distinct local alignments. In: Suhai, S. (Ed.), Theoretical and Computational Methods in Genome Research. Plenum, New York, pp. 1–14. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G., 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., Apweiler, R., 2004. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32, D262–D266. Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann, W., Kersey, P., Mulder, N., Oinn, T., Maslen, J., Cox, A., Apweiler, R., 2003. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 13, 662–672. Cedano, J., Aloy, P., Pérez-Pons, J.A., Querol, E., 1997. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266, 594–600. Chen, C., Chen, L., Zou, X., Cai, P., 2009. Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine. Protein Pept. Lett. 16, 27–31. Chen, Y.L., Li, Q.Z., 2007. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo amino acid composition. J. Theor. Biol. 248, 377–381. Chou, K.C., 1989. Graphic rules in steady and non-steady enzyme kinetics. J. Biol. Chem.
264, 12074–12079. Chou, K.C., 1995a. The convergence-divergence duality in lectin domains of the selectin family and its implications. FEBS Lett. 363, 123–126. Chou, K.C., 1995b. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 21, 319–344. Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 43, 246–255 (Erratum: ibid., 2001, vol. 44, 60). Chou, K.C., 2004. Review: structural bioinformatics and its impact to biomedical science. Curr. Med. Chem. 11, 2105–2134. Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol. 273, 236–247. Chou, K.C., Zhang, C.T., 1994. Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem. 269, 22014–22020. Chou, K.C., Zhang, C.T., 1995. Review: prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30, 275–349. Chou, K.C., Elrod, D.W., 1999. Protein subcellular location prediction. Protein Eng. 12, 107–118. Chou, K.C., Shen, H.B., 2007. Review: recent progresses in protein subcellular location prediction. Anal. Biochem. 370, 1–16. Chou, K.C., Shen, H.B., 2008a. Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protocols 3, 153–162. Chou, K.C., Shen, H.B., 2008b. ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information. Biochem. Biophys. Res. Commun. 376, 321–325. Chou, K.C., Shen, H.B., 2010a. Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Nat. Sci. 2, 1090–1103 (openly accessible at http://www.scirp.org/journal/NS/). Chou, K.C., Shen, H.B., 2010b.
Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE 5, e11335. Chou, K.C., Shen, H.B., 2010c. A new method for predicting the subcellular localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0. PLoS ONE 5, e9931. Ding, H., Luo, L., Lin, H., 2009. Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition. Protein Pept. Lett. 16, 351–355. Ding, Y.S., Zhang, T.L., 2008. Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier. Pattern Recognition Lett. 29, 1887–1892. Du, P., Cao, S., Li, Y., 2009. SubChlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. J. Theor. Biol. 261, 330–335. Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G., 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005–1016. Esmaeili, M., Mohabatkar, H., Mohsenzadeh, S., 2010. Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses. J. Theor. Biol. 263, 203–209. Gao, Q.B., Jin, Z.C., Ye, X.F., Wu, C., He, J., 2009. Prediction of nuclear receptors with optimal pseudo amino acid composition. Anal. Biochem. 387, 54–59. Gardy, J.L., Laird, M.R., Chen, F., Rey, S., Walsh, C.J., Ester, M., Brinkman, F.S., 2005. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21, 617–623. Georgiou, D.N., Karakasidis, T.E., Nieto, J.J., Torres, A., 2009. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition. J. Theor. Biol. 257, 17–26. Gerstein, M., Thornton, J.M., 2003.
Sequences and topology. Curr. Opin. Struct. Biol. 13, 341–343. Glory, E., Murphy, R.F., 2007. Automated subcellular location determination and high-throughput microscopy. Dev. Cell 12, 7–16. Guo, J., Rao, N., Liu, G., Yang, Y., Wang, G., 2011. Predicting protein folding rates using the concept of Chou’s pseudo amino acid composition. J. Comput. Chem. 32, 1612–1617. Hayat, M., Khan, A., 2011. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition. J. Theor. Biol. 271, 10–17. Jahandideh, S., Hoseini, S., Jahandideh, M., Hoseini, A., Disfani, F.M., 2009. Gamma-turn types prediction in proteins using the two-stage hybrid neural discriminant model. J. Theor. Biol. 259, 517–522. Kandaswamy, K.K., Pugalenthi, G., Moller, S., Hartmann, E., Kalies, K.U., Suganthan, P.N., Martinetz, T., 2010. Prediction of apoptosis protein locations with genetic algorithms and support vector machines through a new mode of pseudo amino acid composition. Protein Pept. Lett. 17, 1473–1479. Kannan, S., Hauth, A.M., Burger, G., 2008. Function prediction of hypothetical proteins without sequence similarity to proteins of known function. Protein Pept. Lett. 15, 1107–1116. Li, F.M., Li, Q.Z., 2008. Predicting protein subcellular location using Chou’s pseudo amino acid composition and improved hybrid approach. Protein Pept. Lett. 15, 612–616. Lin, H., 2008. The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition. J. Theor. Biol. 252, 350–356. Lin, H., Ding, H., 2011. Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. J. Theor. Biol. 269, 64–69. Liu, T., Zheng, X., Wang, C., Wang, J., 2010.
Prediction of subcellular location of apoptosis proteins using pseudo amino acid composition: an approach from auto covariance transformation. Protein Pept. Lett. 17, 1263–1269. Loewenstein, Y., Raimondo, D., Redfern, O.C., Watson, J., Frishman, D., Linial, M., Orengo, C., Thornton, J., Tramontano, A., 2009. Protein function annotation by homology-based inference. Genome Biol. 10, 207. Mahalanobis, P.C., 1936. On the generalized distance in statistics. Proc. Natl. Inst. Sci. India 2, 49–55. Mardia, K.V., Kent, J.T., Bibby, J.M., 1979. Multivariate Analysis: Chapter 11 Discriminant Analysis; Chapter 12 Multivariate Analysis of Variance; Chapter 13 Cluster Analysis. Academic Press, London (pp. 322–381). Mohabatkar, H., 2010. Prediction of cyclin proteins using Chou’s pseudo amino acid composition. Protein Pept. Lett. 17, 1207–1214. Mondal, S., Bhavna, R., Mohan Babu, R., Ramakumar, S., 2006. Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification. J. Theor. Biol. 243, 252–260. Nakai, K., 2000. Protein sorting signals and prediction of subcellular localization. Adv. Protein Chem. 54, 277–344. Nakai, K., Kanehisa, M., 1991. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Struct. Funct. Genet. 11, 95–110. Nakashima, H., Nishikawa, K., 1994. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 238, 54–61. Nanni, L., Lumini, A., 2008. Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization. Amino Acids 34, 653–660. Pan, Y.X., Zhang, Z.Z., Guo, Z.M., Feng, G.Y., Huang, Z.D., He, L., 2003. Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. J. Protein Chem. 22, 395–402. Park, K.J., Kanehisa, M., 2003. 
Prediction of protein subcellular locations by support vector machines using compositions of amino acid and amino acid pairs. Bioinformatics 19, 1656–1663. Pillai, K.C.S., 1985. Mahalanobis D2. In: Kotz, S., Johnson, N.L. (Eds.), Encyclopedia of Statistical Sciences, vol. 5. John Wiley & Sons, New York, pp. 176–181 (This reference also presents a brief biography of Mahalanobis who was a man of great originality and who made considerable contributions to statistics). Qiu, J.D., Huang, J.H., Shi, S.P., Liang, R.P., 2010. Using the concept of Chou’s pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform. Protein Pept. Lett. 17, 715–722. Reinhardt, A., Hubbard, T., 1998. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26, 2230–2236. Sahu, S.S., Panda, G., 2010. A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction. Comput. Biol. Chem. 34, 320–327. Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., Altschul, S.F., 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29, 2994–3005. Shen, H.B., Chou, K.C., 2007. Virus-PLoc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells. Biopolymers 85, 233–240. Shen, H.B., Chou, K.C., 2009a. Predicting protein fold pattern with functional domain and sequential evolution information. J. Theor. Biol. 256, 441–446. Shen, H.B., Chou, K.C., 2009b. QuatIdent: a web server for identifying protein quaternary structural attribute by fusing functional domain and sequential evolution information. J. Proteome Res. 8, 1577–1584. Shen, H.B., Chou, K.C., 2010a.
Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. J. Theor. Biol. 264, 326–333. Shen, H.B., Chou, K.C., 2010b. Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites. J. Biomol. Struct. Dyn. 28, 175–186. Small, I., Peeters, N., Legeai, F., Lurin, C., 2004. Predotar: a tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 4, 1581–1590. Smith, C., 2008. Subcellular targeting of proteins and drugs. http://www.biocompare.com/Articles/TechnologySpotlight/976/Subcellular-Targeting-OfProteins-And-Drugs.html. Wong, J.H., Ng, T.B., 2009. Studies on an antifungal protein and a chromatographically and structurally related protein isolated from the culture broth of Bacillus amyloliquefaciens. Protein Pept. Lett. 16, 1399–1406. Wootton, J.C., Federhen, S., 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149–163. Xiao, X., Wang, P., Chou, K.C., 2011a. Quat-2L: a web-server for predicting protein quaternary structural attributes. Mol. Diversity 15, 149–155. Xiao, X., Wang, P., Chou, K.C., 2011b. GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. Mol. Biosyst. 7, 911–919. Xiao, X., Shao, S.H., Huang, Z.D., Chou, K.C., 2006a. Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. J. Comput. Chem. 27, 478–482. Xiao, X., Shao, S.H., Ding, Y.S., Huang, Z.D., Chou, K.C., 2006b. Using cellular automata images and pseudo amino acid composition to predict protein subcellular location. Amino Acids 30, 49–54. Yu, L., Guo, Y., Li, Y., Li, G., Li, M., Luo, J., Xiong, W., Qin, W., 2010. SecretP: identifying bacterial secreted proteins by fusing new features into Chou’s pseudo-amino acid composition. J. Theor. Biol. 267, 1–6.
Zeng, Y.H., Guo, Y.Z., Xiao, R.Q., Yang, L., Yu, L.Z., Li, M.L., 2009. Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J. Theor. Biol. 259, 366–372. Zhang, G.Y., Fang, B.S., 2008. Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou’s amphiphilic pseudo amino acid composition. J. Theor. Biol. 253, 310–315. Zhang, G.Y., Li, H.C., Gao, J.Q., Fang, B.S., 2008. Predicting lipase types by improved Chou’s pseudo-amino acid composition. Protein Pept. Lett. 15, 1132–1137. Zhou, G.P., Doctor, K., 2003. Subcellular location prediction of apoptosis proteins. Proteins: Struct. Funct. Genet. 50, 44–48. Zhou, X.B., Chen, C., Li, Z.C., Zou, X.Y., 2007. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J. Theor. Biol. 248, 546–551. Zou, D., He, Z., He, J., Xia, Y., 2011. Supersecondary structure prediction using Chou’s pseudo amino acid composition. J. Comput. Chem. 32, 271–278.