Author's personal copy

This article appeared in a journal published by Elsevier. The attached
copy is furnished to the author for internal non-commercial research
and education use, including for instruction at the author's institution
and sharing with colleagues.
Other uses, including reproduction and distribution, or selling or
licensing copies, or posting to personal, institutional or third party
websites are prohibited.
In most cases authors are permitted to post their version of the
article (e.g. in Word or Tex form) to their personal website or
institutional repository. Authors requiring further information
regarding Elsevier’s archiving and manuscript policies are
encouraged to visit:
http://www.elsevier.com/copyright
Journal of Theoretical Biology 284 (2011) 42–51
Contents lists available at ScienceDirect
Journal of Theoretical Biology
journal homepage: www.elsevier.com/locate/yjtbi
iLoc-Virus: A multi-label learning classifier for identifying the subcellular
localization of virus proteins with both single and multiple sites
Xuan Xiao a,b,*, Zhi-Cheng Wu a, Kuo-Chen Chou b
a Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
b Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA
Article info
Article history:
Received 11 February 2011
Received in revised form 31 May 2011
Accepted 4 June 2011
Available online 14 June 2011

Abstract
In the last two decades or so, although many computational methods were developed for predicting the subcellular locations of proteins according to their sequence information, it still remains a challenging problem, particularly when the system concerned contains both single- and multiple-location proteins. Also, among the existing methods, very few were developed specifically for dealing with viral proteins, those generated by viruses. Actually, knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very important because it is closely related to their destructive tendencies and consequences. In this paper, by introducing the "multi-label scale" and by hybridizing the gene ontology information with the sequential evolution information, a predictor called iLoc-Virus is developed. It can be utilized to identify viral proteins among the following six locations: (1) viral capsid, (2) host cell membrane, (3) host endoplasmic reticulum, (4) host cytoplasm, (5) host nucleus, and (6) secreted. The iLoc-Virus predictor not only can more accurately predict the location sites of viral proteins in a host cell, but also has the capacity to deal with virus proteins having more than one location. As a user-friendly web-server, iLoc-Virus is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Virus. Meanwhile, a step-by-step guide is provided on how to use the web-server to get the desired results. Furthermore, for the user's convenience, the iLoc-Virus web-server also accepts batch job submission. It is anticipated that iLoc-Virus may become a useful high throughput tool for both basic research and drug development.
© 2011 Elsevier Ltd. All rights reserved.
Keywords:
Multiplex proteins
Locative proteins
Accumulation-layer scale
ML-KNN
Absolute accuracy
1. Introduction
With the avalanche of protein sequences generated in the
post-genomic age, many computational methods have been established for the timely identification of their subcellular localization from the sequence information alone. These methods were
developed generally following three directions. One is to enhance
the power of practical application by enlarging the coverage
scope, such as from covering only 2 subcellular location sites
(Nakashima and Nishikawa, 1994), to 5 location sites (Cedano
et al., 1997), to 12 location sites (Chou and Elrod, 1999; Park and
Kanehisa, 2003), and to 22 location sites (Chou and Shen, 2010c).
The second direction is to extract more useful information from
protein sequences via different approaches or models, such as from
the model of targeting or leader sequences (Emanuelsson et al.,
2000; Nakai and Kanehisa, 1991), to the amino acid composition
(Cedano et al., 1997; Reinhardt and Hubbard, 1998), to the various
modes (Chen and Li, 2007; Ding and Zhang, 2008; Li and Li, 2008; Liu et al., 2010; Pan et al., 2003; Xiao et al., 2006b) of pseudo-amino acid composition (Chou, 2001), and to the higher-level forms of pseudo-amino acid composition by incorporating the functional domain information, gene ontology information, and sequential evolution information. The third direction is to develop predictors focused on different organisms (Chou and Shen, 2008a) such as human, plant (Small et al., 2004), and bacteria (Gardy et al., 2005). For the details about the developing process, see two comprehensive review articles (Chou and Shen, 2007; Nakai, 2000) as well as a long list of references cited therein.
* Corresponding author at: Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China.
E-mail addresses: [email protected], [email protected] (X. Xiao).
0022-5193/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.jtbi.2011.06.005
However, very few methods were developed with a focus on viral proteins, those generated by viruses. Actually, knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very important because it is closely related to their destructive tendencies and consequences. Therefore, it would be especially meaningful to develop methods for predicting the locations of viral proteins in a virus-infected cell, since they are intimately associated with human health, medical science, and the design of antiviral drugs.
In 2007 a predictor called "Virus-PLoc" (Shen and Chou, 2007)
was proposed for identifying the subcellular locations of viral
proteins. In that predictor, a protein sample was first formulated
by incorporating the gene ontology information, which is a set of
defined terms to describe cellular component, molecular function
and biological process (Ashburner et al., 2000). Gene ontology can be
represented by a directed acyclic graph or digraph (Chou, 1989),
where the terms are nodes and the relationships among terms are
edges. These edges describe the parent–child relationships between
nodes. A node in that graph may have one or more than one parent.
Such graphic structural characteristics enable powerful grouping,
searching, and analysis of genes and gene products. Because gene
ontology can provide a uniform form for the annotations of proteins
and genes, it becomes quite useful for bench-biologists and computational biologists to share their research data. In recent years, the
applications of gene ontology have been increasingly reported in
various areas of bioinformatics. As indicated by Virus-PLoc (Shen
and Chou, 2007), when the protein samples were formulated via
gene ontology, the prediction quality was improved remarkably.
However, since the database of gene ontology was far from
complete yet in 2007, for those protein samples that could not be
formulated via gene ontology, the pseudo-amino acid composition
(Chou, 2001) was adopted in Virus-PLoc as a backup. Besides,
Virus-PLoc (Shen and Chou, 2007) does not have the capacity to
deal with multiplex proteins that may simultaneously exist at, or
move between, two or more different subcellular location sites.
Proteins with multiple locations or dynamic feature of this kind are
particularly worthy of our notice because they may have some
special functions (Smith, 2008) useful for both drug development
(Wong and Ng, 2009) and basic research (Glory and Murphy, 2007).
To enable Virus-PLoc (Shen and Chou, 2007) to deal with multiplex viral proteins as well, a predictor called Virus-mPLoc (Shen and Chou, 2010b) was developed recently, where the character "m" in front of "PLoc" stands for "multiple", meaning that it can also be used to deal with viral proteins with multiple locations.
However, Virus-mPLoc has the following shortcomings. (1) In formulating the protein samples, only the integer numbers 0 and 1 were used to reflect the GO (gene ontology) information (Ashburner et al., 2000; Camon et al., 2004). Such an over-simplified formulation might lose some useful information and thereby limit the prediction quality. (2) In predicting the number of subcellular location sites for a query protein, an optimal threshold factor θ* (see Eq. (48) of Chou and Shen (2007)) was adopted without providing its statistical implication and detailed learning process. It would be more instructive to find a more intuitive approach that treats this problem in a more natural manner. (3) Although a web-server for Virus-mPLoc has been established at http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/, only one query protein sequence at a time is allowed when using the web-server to conduct prediction. For the convenience of users in handling many query viral protein sequences, such a rigid limit should be relaxed. (4) Particularly, Virus-mPLoc lacks the function for batch job submission. For those users who need to identify the subcellular locations for a large number of query virus proteins, it is important for the web-server to have such a function.
The present study was devoted to developing a new and more powerful predictor for predicting virus protein subcellular localization by addressing the above four problems.
To establish a really useful statistical predictor for protein systems, the following procedures often need to be considered (Chou, 2011): (1) benchmark dataset construction or selection, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. Below, let us describe how to cope with these procedures.
2. Materials
We chose to use the same dataset S constructed by Shen and Chou (2010b) in establishing Virus-mPLoc as the benchmark dataset for the current study. The reasons for doing so are as follows. (1) The dataset was constructed specifically for viral proteins. (2) None of the proteins included in S has ≥25% pairwise sequence identity to any other in the same subcellular location; compared with most of the other benchmark datasets for studying protein subcellular location prediction, the dataset S is much more rigorous in excluding homology bias and redundancy. (3) It also contains proteins with more than one location and hence can be used to train and test a predictor aimed at dealing with proteins having both single and multiple location sites. (4) Using the dataset S also makes it easier to compare the new predictor with the existing one because the results by Virus-mPLoc on S have been well documented and reported (Shen and Chou, 2010b).
The dataset S contains 207 viral protein sequences, of which 165 belong to one subcellular location, 39 to two locations, 3 to three locations, and none to four or more locations. The dataset covers 6 subcellular locations (Fig. 1), and hence can be formulated as
$$ S = S_1 \cup S_2 \cup S_3 \cup S_4 \cup S_5 \cup S_6 \tag{1} $$
where S1 represents the subset for the subcellular location of "viral capsid", S2 for "host cell membrane", S3 for "host endoplasmic reticulum", and so forth (Table 1); while ∪ represents the symbol for "union" in set theory. To avoid homology bias and redundancy, none of the proteins in S has ≥25% pairwise sequence identity to any other in the same subset. For convenience, hereafter let us just use the subscripts of Eq. (1) as the codes of the 6 location sites; i.e., "1" for "viral capsid", "2" for "host cell membrane", "3" for "host endoplasmic reticulum", and so forth (Table 2).
For readers' convenience, the corresponding accession numbers and protein sequences in S are given in Online Supporting Information S1.

Fig. 1. Illustration to show the 6 subcellular locations of viral proteins. The 6 locations are: (1) viral capsid, (2) host cell membrane, (3) host endoplasmic reticulum, (4) host cytoplasm, (5) host nucleus, and (6) secreted.

Table 1
Breakdown of the viral protein benchmark dataset S taken from Shen and Chou (2010b). None of the proteins included here has ≥25% sequence identity to any other in the same subcellular location.

Subset   Subcellular location                         Number of proteins
S1       Viral capsid                                 8
S2       Host cell membrane                           33
S3       Host endoplasmic reticulum                   20
S4       Host cytoplasm                               87
S5       Host nucleus                                 84
S6       Secreted                                     20
Total number of locative proteins N(loc)              252 (a)
Total number of different proteins N(seq)             207 (b)

(a) See Eqs. (36)–(38) of Chou and Shen (2007) for the definition about the number of locative proteins, and its relation with the number of different proteins.
(b) Of the 207 different viral proteins, 165 belong to one subcellular location, 39 to two locations, and 3 to three locations. See Online Supporting Information S1 for the protein sequences.

Table 2
A comparison of the jackknife success rates by Virus-mPLoc (Shen and Chou, 2010b) and the current iLoc-Virus on the benchmark dataset S (cf. Online Supporting Information S1) that covers 6 location sites of viral proteins in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same location.

                                           Success rate by jackknife test
Code   Subcellular location                Virus-mPLoc (a)        iLoc-Virus (b)
1      Viral capsid                        8/8 = 100.0%           8/8 = 100.0%
2      Host cell membrane                  19/33 = 57.6%          25/33 = 75.8%
3      Host endoplasmic reticulum          13/20 = 65.0%          15/20 = 75.0%
4      Host cytoplasm                      52/87 = 59.8%          64/87 = 73.6%
5      Host nucleus                        51/84 = 60.7%          70/84 = 83.3%
6      Secreted                            9/20 = 45.0%           15/20 = 75.0%
Overall                                    152/252 = 60.3% (c)    197/252 = 78.2% (c)

(a) The predictor from Shen and Chou (2010a).
(b) The predictor proposed in this paper.
(c) Note that instead of 207 (the number of total different viral proteins), here we use 252 (the number of total different locative proteins) for the denominator. This is because some of the viral proteins in S may have more than one location site. See footnotes a and b of Table 1 for further explanation.

Note that because some virus proteins may simultaneously exist in two or more locations, it is instructive to introduce the concept of "locative protein" as briefed below. If a protein coexists at two different subcellular location sites, it will be counted as two locative proteins; if it coexists at three location sites, it will be counted as three locative proteins; and so forth. Thus, the number of
total locative proteins can be expressed as
$$ N(\mathrm{loc}) = N(\mathrm{seq}) + \sum_{m=1}^{M} (m-1)\,N(m) \tag{2} $$
where N(loc) is the number of total locative proteins, N(seq) the number of total different protein sequences, N(1) the number of proteins with one location, N(2) the number of proteins with two locations, and so forth; while M = 6 is the number of total subcellular location sites concerned (cf. Eq. (1)). As we can see from Eq. (2), the number of total locative proteins is generally greater than that of total different protein sequences. The two are equal if and only if all the proteins have a single location site.
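As a quick numerical check of Eq. (2), the following minimal sketch counts locative proteins from the breakdown of S quoted earlier (165, 39, and 3 proteins with one, two, and three locations). The function name and the dictionary layout are illustrative choices, not part of the original method.

```python
def count_locative_proteins(n_by_multiplicity):
    """Eq. (2): N(loc) = N(seq) + sum over m of (m - 1) * N(m).

    n_by_multiplicity maps m (number of locations) to N(m),
    the number of proteins having exactly m locations.
    """
    n_seq = sum(n_by_multiplicity.values())
    extra = sum((m - 1) * n for m, n in n_by_multiplicity.items())
    return n_seq + extra


# Breakdown of the benchmark dataset S as stated in the text.
breakdown = {1: 165, 2: 39, 3: 3}
print(count_locative_proteins(breakdown))  # 207 + 39 + 6 = 252
```

When every protein has a single location, the correction term vanishes and the function returns N(seq) itself.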
The benchmark dataset used in this study covers 6 subcellular locations (Fig. 1) with a total of 207 different virus protein sequences, of which 165 belong to one subcellular location, 39 to two locations, 3 to three locations, and none to four or more locations (Shen and Chou, 2010b). Substituting these data into Eq. (2), we obtain
$$ N(\mathrm{loc}) = N(\mathrm{seq}) + (1-1) \times 165 + (2-1) \times 39 + (3-1) \times 3 + \sum_{m=4}^{6} (m-1) \times 0 = 207 + 39 + 6 = 252 \tag{3} $$
which is fully consistent with the figures in Table 1 of Shen and Chou (2010b).

3. Methods
To develop a powerful method for statistically predicting protein subcellular localization according to the sequence information, one of the most important things is to formulate the protein sequences with an effective mathematical expression that can truly reflect the intrinsic correlation with their subcellular localization (Chou, 2011). However, it is by no means a trivial job because it is usually not easy to find out this kind of correlation. The most straightforward method to formulate the sample of a query protein P is simply to use its entire amino acid sequence, as can be generally written by
$$ P = R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L \tag{4} $$
where R1 represents the 1st residue of the protein P, R2 the 2nd
residue, …, RL the L-th residue, and they each belong to one of the 20
native amino acids. To identify its subcellular location(s), one of the
straightforward and preliminary methods was to utilize the
sequence-similarity-search-based tools, such as BLAST (Altschul,
1997; Wootton and Federhen, 1993), to search protein database for
those proteins that have high sequence similarity to the query protein
P. Subsequently, the subcellular location annotations of the proteins
thus found were used to deduce the subcellular location(s) for P.
Unfortunately, although it was quite intuitive and able to contain the
entire information of a protein sequence, this kind of straightforward
sequential model failed to work when the query protein P did not
have significant sequence similarity to any location-known proteins.
Thus, various non-sequential or discrete models to formulate
protein samples were proposed in hopes to establish some sort of
correlation or cluster manner by which the prediction quality could
be improved.
Among the discrete models for a protein sample, the simplest one is its amino acid (AA) composition or AAC (Chou and Zhang, 1994). According to the AAC-discrete model, the protein P of Eq. (4) can be formulated by (Chou, 1995b; Nakashima and Nishikawa, 1994)
$$ P = [\, f_1 \;\; f_2 \;\; \cdots \;\; f_{20} \,]^T \tag{5} $$
where $f_i$ (i = 1, 2, …, 20) are the normalized occurrence frequencies of the 20 native amino acids in protein P, and T the transposing operator. Many methods for predicting protein subcellular localization were based on the AAC-discrete model (see, e.g., Cedano et al., 1997; Chou and Elrod, 1999; Nakashima and Nishikawa, 1994; Reinhardt and Hubbard, 1998; Zhou and Doctor, 2003). However, as we can see from Eq. (5), if using the AAC model to represent the protein P, all its sequence-order effects would be lost, and hence the prediction quality might be limited.
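The AAC-discrete model of Eq. (5) can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the ordering of the 20 components (alphabetical by single-letter code) is an assumption made here for definiteness. The last two lines show the loss of sequence-order information: two different sequences with the same composition map to the same vector.

```python
from collections import Counter

# The 20 native amino acids, ordered alphabetically by one-letter code
# (an ordering assumed here for illustration).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20 normalized occurrence frequencies f_1..f_20 of Eq. (5)."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no native amino acid codes")
    return [counts[a] / total for a in AMINO_ACIDS]


# Same composition, different order: the AAC vectors are identical.
print(aac("MKVLA") == aac("ALVKM"))
```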
To avoid completely losing the sequence-order information, the pseudo-amino acid composition (PseAAC) was proposed to represent the sample of a protein, as formulated by (Chou, 2001)
$$ P = [\, p_1 \;\; p_2 \;\; \cdots \;\; p_{20} \;\; p_{20+1} \;\; \cdots \;\; p_{20+\lambda} \,]^T \tag{6} $$
where the first 20 elements are associated with the 20 elements in Eq. (5) or the 20 amino acid components of the protein P, while the additional λ factors are used to incorporate some sequence-order information via a series of rank-different correlation factors along a protein chain.
The concept of PseAAC has been widely used to study various
problems in proteins and protein-related systems, such as predicting protein folding rates (Guo et al., 2011), protein structural
class prediction (Sahu and Panda, 2010; Xiao et al., 2006a),
supersecondary structure prediction (Zou et al., 2011), protein
secondary structure content prediction (Chen et al., 2009), protein
quaternary structural attribute prediction (Shen and Chou,
2009b), fold pattern prediction (Shen and Chou, 2009a), ion
channel prediction (Lin and Ding, 2011), predicting enzymes and
their family/sub-family classification (Qiu et al., 2010; Zhou et al.,
2007), protein subcellular location prediction (Li and Li, 2008),
apoptosis protein subcellular location prediction (Kandaswamy
et al., 2010), predicting protein subchloroplast locations (Du et al.,
2009), predicting protein submitochondria locations (Nanni and
Lumini, 2008; Zeng et al., 2009), predicting membrane proteins
and their types (Hayat and Khan, 2011), discrimination of outer
membrane proteins (Lin, 2008), identifying proteases and their types
(Chou and Shen, 2008b), identifying GPCRs and their classes (Xiao
et al., 2011b), prediction of nuclear receptors (Gao et al., 2009),
prediction of cyclin proteins (Mohabatkar, 2010), identifying bacterial secreted proteins (Yu et al., 2010), identifying risk type of human
papillomaviruses (Esmaeili et al., 2010), prediction of cell wall lytic
enzymes (Ding et al., 2009), prediction of lipases types (Zhang et al.,
2008), predicting conotoxin superfamily and family (Mondal et al.,
2006), predicting the cofactors of oxidoreductases (Zhang and Fang,
2008), and others (e.g., Georgiou et al., 2009).
According to Chou (2011), the PseAAC for a protein P can be generally formulated as
$$ P = [\, \psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega \,]^T \tag{7} $$
where the subscript Ω is an integer, and its value as well as the components ψ1, ψ2, … will depend on how to extract the desired information from the amino acid sequence of P (cf. Eq. (4)). As a general form, Eq. (7) can cover various different modes of PseAAC.
For example, when its elements are given by
$$
\psi_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (1 \le u \le 20) \\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & (20+1 \le u \le 20+\lambda = \Omega;\ \lambda < L)
\end{cases}
\tag{8}
$$
we immediately obtain the formulation of PseAAC as originally introduced in (Chou, 2001), where w is the weight factor for the sequence order effect, $\theta_j$ the j-th tier correlation factor reflecting the sequence order correlation between all the j-th most contiguous residues along a protein chain, and λ is an integer parameter for the maximum number of correlation tiers to be considered. Readers can also find a concise description of Eq. (8) as well as the definition for each of the symbols therein by clicking the link at http://en.wikipedia.org/wiki/Pseudo_amino_acid_composition to see a Wikipedia article about the pseudo-amino acid composition.
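To make Eq. (8) concrete, here is a minimal sketch under several simplifying assumptions that are not taken from this paper: a single physicochemical property (the Kyte-Doolittle hydropathy index, standardized over the 20 amino acids) stands in for the property set, the j-th tier correlation factor is taken as the mean squared property difference between residues j positions apart, and the defaults `lam=3` and `w=0.05` are arbitrary illustrative values. All names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used only as an illustrative property.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}
_MU, _SD = mean(KD.values()), pstdev(KD.values())
H = {a: (v - _MU) / _SD for a, v in KD.items()}  # standardized property


def pse_aac(seq, lam=3, w=0.05):
    """Return the 20 + lam components psi_u of Eq. (8) for one sequence.

    The sequence must consist of the 20 native one-letter codes only.
    """
    seq = seq.upper()
    length = len(seq)
    if not 0 <= lam < length:
        raise ValueError("lam must satisfy 0 <= lam < L")
    counts = Counter(seq)
    f = [counts[a] / length for a in AMINO_ACIDS]
    # theta_j: assumed here to be the mean squared property difference
    # between residues that are j positions apart.
    theta = [
        sum((H[seq[i]] - H[seq[i + j]]) ** 2 for i in range(length - j))
        / (length - j)
        for j in range(1, lam + 1)
    ]
    denom = sum(f) + w * sum(theta)
    return [x / denom for x in f] + [w * t / denom for t in theta]


vec = pse_aac("MKVLAAGIVG", lam=3)
print(len(vec))  # 20 composition terms + 3 correlation terms = 23
```

By construction the components sum to one, and with `lam=0` the vector reduces to the plain AAC of Eq. (5).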
Below, let us use the general form of PseAAC (Eq. (7)) to find
the formulations to reflect the core and essential features of
protein samples that are closely correlated with their subcellular
localization.
3.1. GO (gene ontology) formulation
The GO database (Ashburner et al., 2000) was established according
to the molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space
would be clustered in a way better reflecting their subcellular
locations (Chou and Shen, 2007; Chou and Shen, 2008a). However,
in order to incorporate more information, instead of only using
0 and 1 elements as done in (Shen and Chou, 2010a), here let us use
a different approach as described below.
Step 1. Compression and reorganization of the existing GO numbers. The GO database (version 74.0 released 30 July 2009) contains many GO numbers. However, these numbers do not increase successively and orderly. For easier handling, a reorganization and compression procedure was applied to renumber them. For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000009, GO:0000011, GO:0000012, GO:0000015, …, GO:0090204 would become GO_compress: 00001, GO_compress: 00002, GO_compress: 00003, GO_compress: 00004, GO_compress: 00005, GO_compress: 00006, GO_compress: 00007, …, GO_compress: 11118, respectively. The GO database obtained through such a treatment is called the GO_compress database, which contains 11,118 numbers increasing successively from 1 to the last one.
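The renumbering in Step 1 amounts to ranking the GO identifiers by their numeric part. A minimal sketch follows, assuming the identifiers are available as strings of the form `GO:nnnnnnn`; the function name is hypothetical and only the seven example identifiers from the text are used.

```python
def build_go_compress(go_ids):
    """Map each GO identifier to its consecutive GO_compress number (1-based)."""
    ordered = sorted(set(go_ids), key=lambda g: int(g.split(":")[1]))
    return {go: rank for rank, go in enumerate(ordered, start=1)}


original = ["GO:0000001", "GO:0000002", "GO:0000003", "GO:0000009",
            "GO:0000011", "GO:0000012", "GO:0000015"]
mapping = build_go_compress(original)
print(mapping["GO:0000009"])  # 4, i.e. GO_compress: 00004
```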
Step 2. Using Eq. (7) with Ω = 11,118, the protein P can be formulated as
$$ P_{GO} = [\, \psi^G_1 \;\; \psi^G_2 \;\; \cdots \;\; \psi^G_u \;\; \cdots \;\; \psi^G_{11118} \,]^T \tag{9} $$
where $\psi^G_u$ (u = 1, 2, …, 11,118) are defined via the following steps.
Step 3. Use BLAST (Schaffer et al., 2001) to search the homologous
proteins of the protein P from the Swiss-Prot database (version 55.3),
with the expect value E ≤ 0.001 for the BLAST parameter.
Step 4. Those proteins which have ≥60% pairwise sequence identity with the protein P are collected into a set, $S_P^{homo}$, called the "homology set" of P. All the elements in $S_P^{homo}$ can be deemed as the "representative proteins" of P, sharing some similar attributes such as structural conformations and biological functions (Chou, 2004; Gerstein and Thornton, 2003; Loewenstein et al., 2009). Because they were retrieved from the Swiss-Prot database, these representative proteins must each have their own accession numbers.
Step 5. Search each of these accession numbers collected in
Step 4 against the GO database at http://www.ebi.ac.uk/GOA/ to
find the corresponding GO numbers (Camon et al., 2003).
Step 6. Based on the results obtained in Step 5, the elements in Eq. (9) can be written as
$$ \psi^G_u = \frac{\sum_{k=1}^{N_P^{homo}} \delta(u,k)}{N_P^{homo}} \qquad (u = 1, 2, \ldots, 11{,}118) \tag{10} $$
where $N_P^{homo}$ is the number of representative proteins in $S_P^{homo}$ and
$$
\delta(u,k) =
\begin{cases}
1, & \text{if the } k\text{-th representative protein hits the } u\text{-th GO\_compress number} \\
0, & \text{otherwise}
\end{cases}
\tag{11}
$$
As we can see from Eq. (9), the GO formulation derived from
the above steps consists of 11,118 real numbers rather than only
the elements 0 and 1 as in the GO formulation adopted in Shen
and Chou (2010b).
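Eqs. (10) and (11) say that each GO component is the fraction of representative proteins annotated with the corresponding GO_compress number. A minimal sketch, assuming the BLAST search and GO lookup of Steps 3 to 5 have already produced one set of GO_compress numbers per representative protein (the input layout and function name are illustrative, and a small dimension is used in place of 11,118):

```python
def go_vector(homolog_go_hits, dimension):
    """Eq. (10): fraction of representative proteins hitting each GO_compress number.

    homolog_go_hits: list with one collection of 1-based GO_compress
    numbers per representative protein in the homology set.
    """
    n_homo = len(homolog_go_hits)
    if n_homo == 0:
        # Empty homology set: no GO information, all components are zero.
        return [0.0] * dimension
    vec = [0.0] * dimension
    for hits in homolog_go_hits:
        for u in set(hits):  # delta(u, k) is 0 or 1, so count each hit once
            vec[u - 1] += 1.0
    return [x / n_homo for x in vec]


# Four representative proteins, toy GO_compress space of size 5.
print(go_vector([{1, 3}, {3}, {2, 3}, {3}], dimension=5))
```

The components are real numbers between 0 and 1, in contrast to a purely binary 0/1 encoding.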
Note that the GO formulation of Eq. (9) may become a null (all-zero) vector or meaningless under either of the following situations: (1) the protein P does not have significant homology to any protein in the Swiss-Prot database, i.e., $S_P^{homo} = \emptyset$, meaning the homology set $S_P^{homo}$ is an empty one; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.
Under such a circumstance, let us consider using the sequential evolution formulation to represent the protein P, as described below.
3.2. SeqEvo (sequential evolution) formulation
Biology is a natural science with historic dimension. All biological
species have developed starting out from a very limited number of
ancestral species. This is true for protein sequences as well (Chou, 2004).
Their evolution involves changes of single residues, insertions and
deletions of several residues (Chou, 1995a), gene doubling, and
gene fusion. With these changes accumulated for a long period of
time, many similarities between initial and resultant amino acid
sequences are gradually eliminated, but the corresponding proteins
may still share many common attributes, such as having basically
the same biological function and residing in a same subcellular
location.
To incorporate the sequential evolution information into the
general form of PseAAC (Eq. (7)), here let us use the information of
the Position-Specific Scoring Matrix (PSSM) (Schaffer et al., 2001),
as described below.
Step 1. According to Schaffer et al. (2001), the sequential evolution information of protein P can be expressed by a 20 × L matrix as given by
$$
\mathrm{PSSM} =
\begin{bmatrix}
E^0_{1\to 1} & E^0_{2\to 1} & \cdots & E^0_{L\to 1} \\
E^0_{1\to 2} & E^0_{2\to 2} & \cdots & E^0_{L\to 2} \\
\vdots & \vdots & \ddots & \vdots \\
E^0_{1\to 20} & E^0_{2\to 20} & \cdots & E^0_{L\to 20}
\end{bmatrix}
\tag{12}
$$
where L is the length of P (counted in the total number of its constituent amino acids as shown in Eq. (4)), and $E^0_{i\to j}$ represents the score of the amino acid residue in the i-th position of the protein sequence being changed to amino acid type j during the evolutionary process. Here, the numerical codes 1, 2, …, 20 are used to denote the 20 native amino acid types according to the alphabetical order of their single character codes. The 20 × L scores in Eq. (12) were generated by using PSI-BLAST (Schaffer et al., 2001) to search the UniProtKB/Swiss-Prot database (Release 2010_04 of 23-Mar-2010) through three iterations with 0.001 as the E-value cutoff for multiple sequence alignment against the sequence of the protein P. However, according to the formulation of Eq. (12), proteins with different lengths will correspond to matrices with different numbers of columns, causing difficulty for developing a predictor able to uniformly cover proteins of any length. To make the descriptor become a size-uniform matrix, let us consider the following steps.
Step 2. Use the elements in the PSSM of Eq. (12) to define a new matrix M as formulated by
$$
M =
\begin{bmatrix}
E_{1\to 1} & E_{2\to 1} & \cdots & E_{L\to 1} \\
E_{1\to 2} & E_{2\to 2} & \cdots & E_{L\to 2} \\
\vdots & \vdots & \ddots & \vdots \\
E_{1\to 20} & E_{2\to 20} & \cdots & E_{L\to 20}
\end{bmatrix}
\tag{13}
$$
with
$$ E_{i\to j} = \frac{E^0_{i\to j} - \bar{E}^0_j}{\mathrm{SD}(\bar{E}^0_j)} \qquad (i = 1, 2, \ldots, L;\ j = 1, 2, \ldots, 20) \tag{14} $$
where
$$ \bar{E}^0_j = \frac{1}{L} \sum_{i=1}^{L} E^0_{i\to j} \qquad (j = 1, 2, \ldots, 20) \tag{15} $$
is the mean for $E^0_{i\to j}$ (i = 1, 2, …, L) and
$$ \mathrm{SD}(\bar{E}^0_j) = \sqrt{\sum_{i=1}^{L} \left[ E^0_{i\to j} - \bar{E}^0_j \right]^2 \Big/ L} \tag{16} $$
is the corresponding standard deviation.
Step 3. Introduce a new matrix generated by multiplying M with its own transpose matrix $M^T$; i.e.,
$$
M M^T =
\begin{bmatrix}
\sum_{i=1}^{L} E_{i\to 1} E_{i\to 1} & \sum_{i=1}^{L} E_{i\to 1} E_{i\to 2} & \cdots & \sum_{i=1}^{L} E_{i\to 1} E_{i\to 20} \\
\sum_{i=1}^{L} E_{i\to 2} E_{i\to 1} & \sum_{i=1}^{L} E_{i\to 2} E_{i\to 2} & \cdots & \sum_{i=1}^{L} E_{i\to 2} E_{i\to 20} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{i=1}^{L} E_{i\to 20} E_{i\to 1} & \sum_{i=1}^{L} E_{i\to 20} E_{i\to 2} & \cdots & \sum_{i=1}^{L} E_{i\to 20} E_{i\to 20}
\end{bmatrix}
\tag{17}
$$
which contains 20 × 20 = 400 elements. Since $MM^T$ is a symmetric matrix, we only need the information of its 210 elements, of which 20 are the diagonal elements and (400 − 20)/2 = 190 are the lower triangular elements, to formulate the protein P; i.e., the general PseAAC form of Eq. (7) can now be formulated as
$$ P_{Evo} = [\, \psi^E_1 \;\; \psi^E_2 \;\; \cdots \;\; \psi^E_u \;\; \cdots \;\; \psi^E_{210} \,]^T \tag{18} $$
where the components $\psi^E_u$ (u = 1, 2, …, 210) are, respectively, taken from the 210 diagonal and lower triangular elements of Eq. (17) by following a given order, say from left to right and from the 1st row to the last as illustrated by the following equation:
$$
\begin{bmatrix}
(1) & & & & \\
(2) & (3) & & & \\
(4) & (5) & (6) & & \\
\vdots & \vdots & \vdots & \ddots & \\
(191) & (192) & (193) & \cdots & (210)
\end{bmatrix}
\tag{19}
$$
where the numbers in parentheses indicate the order of elements taken from Eq. (17) for Eq. (18).

3.3. The self-consistency formulation principle
Regardless of which formulation is used to represent protein samples, the following self-consistency principle must be observed during the course of prediction: if the query protein P was defined in the form of $P_{GO}$ (see Eq. (9)), then all the protein samples used to train the prediction engine should also be expressed in the GO formulation; if the query protein was defined in the form of $P_{Evo}$ (see Eq. (18)), then all the training data should be expressed in the SeqEvo formulation as well.
Below, let us consider the algorithm or operation engine for conducting the prediction.

3.4. Multi-label K-nearest neighbor (KNN) classifier
In this study, let us introduce a novel classifier, called the multi-label KNN or abbreviated as ML-KNN classifier, to predict the subcellular localization for the systems that contain both single-location and multiple-location proteins.
Suppose the m-th subset $S_m$ of S (Eq. (1)) contains $N_m$ proteins, and P(m,j) is the j-th one in that subset. Thus, we have
$$
P(m,j) =
\begin{cases}
P_{GO}(m,j), & \text{in GO space} \\
P_{Evo}(m,j), & \text{in SeqEvo space}
\end{cases}
\qquad (m = 1, 2, \ldots, 6;\ j = 1, 2, \ldots, N_m)
\tag{20}
$$
where PGO(m,j) and PEvo(m,j) have the same forms as PGO (Eq. (9)),
and PEvo (Eq. (18)), respectively; the only difference is that the
corresponding constituent elements are derived from the amino
acid sequence of P(m,j) instead of P.
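Returning briefly to the SeqEvo formulation, the computations of Eqs. (13)–(19) can be sketched in pure Python as follows. The sketch assumes the raw PSSM scores are already available as a list of rows (one row per amino acid type, one column per sequence position); running PSI-BLAST is outside its scope. The function name and the handling of a zero standard deviation (the standardized row is set to zero) are assumptions made here, since the text does not discuss that case.

```python
from math import sqrt


def seqevo_vector(pssm):
    """Build the SeqEvo descriptor from a raw PSSM.

    pssm: list of rows, one per amino acid type j, each holding the L
    scores E0(i -> j). For 20 rows the result has 210 components.
    """
    length = len(pssm[0])
    # Eqs. (14)-(16): standardize each row by its own mean and
    # population standard deviation.
    m = []
    for row in pssm:
        mu = sum(row) / length
        sd = sqrt(sum((x - mu) ** 2 for x in row) / length)
        # Assumption: a constant row (sd == 0) is mapped to all zeros.
        m.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    # Eq. (17): elements of M M^T; Eq. (19): keep the diagonal and lower
    # triangle, row by row, left to right.
    features = []
    for a in range(len(m)):
        for b in range(a + 1):
            features.append(sum(m[a][i] * m[b][i] for i in range(length)))
    return features


# Toy 20 x 30 score matrix with arbitrary integer entries.
toy = [[(3 * j + 7 * i * i + i * j) % 11 - 5 for i in range(30)]
       for j in range(20)]
print(len(seqevo_vector(toy)))  # 210, independent of the sequence length
```

The output length depends only on the number of amino acid types, which is what makes the descriptor size-uniform across proteins of different lengths.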
In sequence analysis, there are many different scales to define the distance between two proteins, such as Euclidean distance, Hamming distance (Mardia et al., 1979), and Mahalanobis distance (Chou and Zhang, 1994; Mahalanobis, 1936; Pillai, 1985). In Chou and Shen (2010b), the distance between P(m,j) and P was defined by $1 - \cos^{-1}[P, P(m,j)]$. However, we have observed that when the GO descriptor was formulated with real numbers, better outcomes resulted from using the Euclidean metric; i.e., the distance between P and P(m,j) should be defined here by
$$ D\{P, P(m,j)\} = \| P - P(m,j) \| \tag{21} $$
where $\| P - P(m,j) \|$ represents the modulus of the vector difference between P and P(m,j) in the Euclidean space. According to Eq. (21), when P ≡ P(m,j) we have D{P, P(m,j)} = 0, indicating the distance between these two protein sequences is zero and hence they have perfect or 100% similarity.
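The Euclidean metric of Eq. (21) is straightforward to compute for two descriptors of equal dimension. A minimal sketch, with a hypothetical function name and toy vectors:

```python
from math import sqrt


def euclidean(p, q):
    """Eq. (21): modulus of the vector difference between two descriptors."""
    if len(p) != len(q):
        raise ValueError("descriptors must have the same dimension")
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(euclidean([0.2, 0.8], [0.2, 0.8]))  # 0.0 for identical descriptors
```

The dimension check reflects the self-consistency principle: a query and a training sample can only be compared when both are expressed in the same formulation.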
Suppose $P^{*}_{1}, P^{*}_{2}, \ldots, P^{*}_{K}$ are the K nearest neighbor proteins to
the protein P that form a set denoted by $S^{P}_{K}$, which is a subset of
S; i.e., $S^{P}_{K} \subseteq S$. Based on the K nearest neighbor proteins in $S^{P}_{K}$, let
us define an accumulation-layer (AL) scale, given by
$$Q(P, K) = \left\{ r^{K}_{1} \;\; r^{K}_{2} \;\; r^{K}_{3} \;\; r^{K}_{4} \;\; r^{K}_{5} \;\; r^{K}_{6} \right\} \qquad (22)$$
where
$$r^{K}_{m} = \frac{\sum_{i=1}^{K} \delta(P^{*}_{i}, m)}{N^{*}_{K}} \qquad (m = 1, 2, \ldots, 6) \qquad (23)$$
where
$$\delta(P^{*}_{i}, m) = \begin{cases} 1, & \text{if } P^{*}_{i} \text{ belongs to the } m\text{th location} \\ 0, & \text{otherwise} \end{cases} \qquad (24)$$
and
$$N^{*}_{K} = \sum_{m=1}^{6} \sum_{i=1}^{K} \delta(P^{*}_{i}, m) \qquad (25)$$
Note that $N^{*}_{K} \geq K$ because a protein may belong to one or more
subcellular location sites in the current system.
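A small sketch of the accumulation-layer scale of Eqs. (22)–(25) is given here, assuming the location labels of the K nearest neighbors are already known. The function name `accumulation_layer_scale` and the representation of each neighbor as a set of 1-based location indices are illustrative assumptions.

```python
def accumulation_layer_scale(neighbor_labels, n_locations=6):
    """Return the list [r_1, ..., r_6] of Eq. (22).

    `neighbor_labels` is a list with one entry per nearest neighbor; each
    entry is a set of 1-based location indices that neighbor belongs to.
    """
    counts = [0] * n_locations
    for labels in neighbor_labels:
        for m in labels:
            counts[m - 1] += 1          # accumulates delta(P*_i, m), Eq. (24)
    total = sum(counts)                 # N*_K of Eq. (25)
    if total == 0:
        raise ValueError("neighbors carry no location labels")
    return [c / total for c in counts]  # r_m of Eq. (23)
```

With three neighbors labeled {1}, {1, 2}, and {6}, the normalizer is 4 rather than 3, which illustrates why the normalizer can exceed K when multi-location proteins are present.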
Now, for a query protein P, its subcellular location(s) will be
predicted according to the following steps.
Step 1. The number of different subcellular locations to which it belongs will be determined by its nearest neighbor
protein in S. For example, suppose $P^{*}$ is the nearest protein to P
in S. If $P^{*}$ has only one subcellular location, then P will also have
only one location; if $P^{*}$ has two subcellular locations, then P will
also have two locations; and so forth. In general, if $P^{*}$ belongs
to M different location sites, then P will be predicted to have
the same number, M, of subcellular locations as well, as can be
formulated by
$$M = \mathrm{Num}(P^{*} \Rightarrow L) = \mathrm{Num}(P \Rightarrow L) \qquad (26)$$
where M is an integer (≤ 6), $\mathrm{Num}(P^{*} \Rightarrow L)$ represents the
number of different subcellular locations to which $P^{*}$ belongs,
and $\mathrm{Num}(P \Rightarrow L)$ the number of different subcellular locations to
which P belongs.
Step 2. However, the concrete location site(s) to which P belongs
will not be determined by the location site(s) of $P^{*}$, but by the
element(s) in Eq. (22) that has (have) the highest score(s), as can be
expressed by $\{\ell\}$, the subscript(s) of Eq. (1). For example, if P is
found belonging to only one location (M = 1) in Step 1, and the
highest score in Eq. (22) is $r^{K}_{3}$, then P will be predicted as $\{\ell\} = 3$,
meaning that it belongs to $S_{3}$ or resides at “host endoplasmic
reticulum” (cf. Table 1). If P is found belonging to three locations
(M = 3), and the three highest scores in Eq. (22) are $r^{K}_{1}$, $r^{K}_{2}$, and
$r^{K}_{6}$, then P will be predicted as $\{\ell\} = (1, 2, 6)$, meaning that it belongs
to $S_{1}$, $S_{2}$, and $S_{6}$ or resides simultaneously at “viral capsid”, “host
cell membrane”, and “secreted”. And so forth. In other words, the
concrete predicted subcellular location(s) can be formulated as
$$\{\ell\} = \underset{\mathrm{Sub}}{\mathrm{Max}\; x\; M} \left\{ r^{K}_{1} \;\; r^{K}_{2} \;\; r^{K}_{3} \;\; r^{K}_{4} \;\; r^{K}_{5} \;\; r^{K}_{6} \right\} \qquad (M \leq 6) \qquad (27)$$
where the operator “Max x M” with the subscript “Sub” means identifying the M highest
scores for the elements in the brackets right after it, followed by
taking their M subscripts.
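The two prediction steps can be put together in a short sketch, assuming a training set of feature vectors with known location labels. The names `predict_locations` and `training`, the tuple layout, and the tie-breaking rule (lower location index wins when two scores are equal) are all assumptions for illustration; the text above does not specify how ties are resolved.

```python
import math


def predict_locations(query, training, k, n_locations=6):
    """Predict location indices for `query` following Steps 1 and 2.

    `training` is a hypothetical list of (feature_vector, label_set) pairs,
    where label_set holds 1-based location indices.
    """
    if not training:
        raise ValueError("training set is empty")
    # Rank training proteins by Euclidean distance to the query (Eq. 21).
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    neighbors = ranked[:k]
    # Step 1 (Eq. 26): M is the number of locations of the nearest neighbor.
    m = len(neighbors[0][1])
    # Accumulation-layer scores (Eqs. 22-25).
    counts = [0] * n_locations
    for _, labels in neighbors:
        for loc in labels:
            counts[loc - 1] += 1
    total = sum(counts)
    scores = [c / total for c in counts]
    # Step 2 (Eq. 27): take the subscripts of the M highest scores.
    order = sorted(range(n_locations), key=lambda idx: (-scores[idx], idx))
    return sorted(idx + 1 for idx in order[:m])
```

Note that the nearest neighbor only fixes how many locations are returned; which locations are returned depends on all K neighbors.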
The entire classifier thus established is called iLoc-Virus,
which can be used to predict the subcellular localization of both
singleplex and multiplex viral proteins. To provide an intuitive
picture, a flowchart is provided in Fig. 2 to illustrate the prediction process of iLoc-Virus.
3.5. Protocol guide
For the convenience of users, a web server for iLoc-Virus has been established. Below is a step-by-step guide on how to use it to get
the desired results.
Step 1. Open the web server at http://icpr.jci.edu.cn/bioinfo/iLoc-Virus
and you will see the top page of the predictor
on your computer screen, as shown in Fig. 3. Click on the Read Me
button to see a brief introduction to the iLoc-Virus predictor and
the caveat when using it.
Fig. 2. A flowchart to show the prediction process of iLoc-Virus.
Fig. 3. A semi-screenshot to show the top page of the iLoc-Virus web-server. Its website address is at http://icpr.jci.edu.cn/bioinfo/iLoc-Virus.
Fig. 4. A semi-screenshot to show the output of iLoc-Virus. The input was taken from the three protein sequences listed in the Example window of the iLoc-Virus web-server (cf. Fig. 3).
Step 2. Either type or copy and paste the query protein
sequence into the input box at the center of Fig. 3. The input
sequence should be in the FASTA format. A sequence in FASTA
format consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of
sequence data. The words right after the “>” symbol in the single
initial line are optional and only used for the purpose of
identification and description. All lines should be no longer than
120 characters and usually do not exceed 80 characters. The
sequence ends if another line starting with a “>” appears; this
indicates the start of another sequence. Example sequences in FASTA
format can be seen by clicking on the Example button right above
the input box. For more information about FASTA format, visit
http://en.wikipedia.org/wiki/Fasta_format. Unlike Virus-mPLoc (Shen and Chou, 2010b), where only one query protein
sequence at a time is allowed for each submission, the
maximum number of query proteins for each submission is now 10.
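The FASTA conventions described in Step 2 can be illustrated with a minimal reader, sketched here for preparing input on the user's side. The function name `parse_fasta` is hypothetical, and this is not the parser used by the web server, whose implementation is not described.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs.

    A line starting with '>' opens a new record; all following lines up to
    the next '>' line are concatenated into that record's sequence.
    """
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The text after “>” is kept only as an identifier, in line with its optional, descriptive role.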
Step 3. Click on the Submit button to see the predicted result.
For example, if you use the three query protein sequences in the
Example window as the input, after clicking the Submit button,
you will see Fig. 4 shown on your screen, indicating that the
predicted result for the 1st query protein is “Host cytoplasm”,
that for the 2nd one is “Host cytoplasm; Host nucleus”, and that
for the 3rd one is “Host cell membrane; Host cytoplasm; Host
nucleus”. In other words, the 1st query protein (P04487) is a
single-location one residing at “host cytoplasm” only, the 2nd one
(Q65202) can simultaneously occur in two different sites (“host
cytoplasm” and “host nucleus”), while the 3rd one (P21935) can
simultaneously occur in three different sites (“host cell membrane”,
“host cytoplasm”, and “host nucleus”). All these results
are fully consistent with the experimental observations as indicated
in the Online Supporting Information S1. It takes about 10 s for the
above computation before the predicted result appears on your
computer screen; the more query proteins there are and the longer
each sequence is, the more time is usually needed.
Step 4. As mentioned in the Introduction, Virus-mPLoc (Shen and
Chou, 2010b) does not have the function to handle batch jobs but
the current iLoc-Virus does. As shown on the lower panel of
Fig. 3, you may also choose the batch prediction by entering your
e-mail address and your desired batch input file (in FASTA format)
via the “Browse” button. To see a sample batch input file,
click on the Batch-example button. The maximum number of
query proteins for each batch input file is 50. After clicking the
Batch-submit button, you will see “Your batch job is under
computation; once the results are available, you will be notified
by e-mail”. Note that if you submit a batch input file from an
Apple computer, although it looks like it is in the FASTA format, your
input might change to a non-FASTA format at the server end and
cause errors. Under such a circumstance, the safest way is to
submit your input file in pdf format.
Step 5. Click on the Citation button to find the relevant papers that
document the detailed development and algorithm of iLoc-Virus.
Step 6. Click on the Data button to download the benchmark
datasets used to train and test the iLoc-Virus predictor.
Caveat. To obtain the predicted result with the expected
success rate, the entire sequence of the query protein rather than
its fragment should be used as an input. A sequence with fewer
than 50 amino acid residues is generally deemed a fragment.
Also, if the query protein is known not to reside in any of the 6 locations
shown in Fig. 1, stop the prediction because the result thus obtained
will not make any sense.
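The limits stated in the protocol (at most 10 query proteins per interactive submission, at most 50 per batch file, and sequences shorter than 50 residues treated as fragments) lend themselves to a simple check before submission. The sketch below assumes the records are (name, sequence) pairs; the function name `check_submission` and the constant names are hypothetical and are not part of the web server.

```python
MAX_INTERACTIVE = 10   # limit per interactive submission (Step 2)
MAX_BATCH = 50         # limit per batch input file (Step 4)
MIN_FULL_LENGTH = 50   # shorter sequences are deemed fragments (Caveat)


def check_submission(records, batch=False):
    """Return a list of human-readable problems; empty if none are found.

    `records` is a list of (name, sequence) pairs.
    """
    limit = MAX_BATCH if batch else MAX_INTERACTIVE
    problems = []
    if len(records) > limit:
        problems.append(f"{len(records)} sequences exceed the limit of {limit}")
    for name, seq in records:
        if len(seq) < MIN_FULL_LENGTH:
            problems.append(f"{name}: {len(seq)} residues, likely a fragment")
    return problems
```

Such a check cannot tell whether a protein actually resides in one of the six covered locations; that part of the caveat still requires prior knowledge.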
benchmark dataset S by the jackknife test. As we can see from
Table 2, for such a stringent dataset, the overall success rate
achieved by iLoc-Virus is over 78.1%, which is about 18% higher
than that by Virus-mPLoc (Shen and Chou, 2010b).
Note that during the course of the jackknife test by Virus-mPLoc
and iLoc-Virus, the false positives (over-predictions) and
false negatives (under-predictions) were also taken into account
to reduce the scores in calculating the overall success rate. For
the detailed process of counting the over-predictions and
under-predictions for a system containing both single-location
and multiple-location proteins, see Eqs. (43)–(48) and Fig. 4 in a
comprehensive review (Chou and Shen, 2007).
To provide a more intuitive and easier-to-understand measurement, let us introduce a new scale, the so-called “absolute true”
success rate, to reflect the accuracy of a predictor, as defined by
$$\Lambda = \frac{\sum_{i=1}^{N} \Delta(i)}{N} \qquad (28)$$
4. Results and discussion
According to the above definition, for a protein belonging to,
say, three subcellular locations, if only two of the three are
correctly predicted, or the predicted result contains a location
not belonging to the three, the prediction score will be counted as
0. In other words, if and only if all the subcellular
locations of a query protein are exactly predicted without any
underprediction or overprediction can the prediction be scored
with 1. Therefore, the absolute true scale is much stricter and
harsher than the scale used previously (Chou and Shen, 2007; Shen
and Chou, 2010b) in measuring the success rate. However, even
under such a stringent criterion, on the same benchmark dataset
and by the jackknife test, the overall absolute true success rate
achieved by iLoc-Virus was 155/207 = 74.8%.
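The absolute true success rate amounts to an exact-match comparison between the predicted and the observed sets of locations. A minimal sketch follows, assuming both are given as one collection of location indices per protein; the function name `absolute_true_rate` is an illustrative choice.

```python
def absolute_true_rate(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set matches exactly.

    A protein scores 1 only when there is neither under-prediction nor
    over-prediction; any other outcome scores 0.
    """
    if len(true_sets) != len(predicted_sets):
        raise ValueError("both inputs must cover the same proteins")
    if not true_sets:
        raise ValueError("at least one protein is required")
    exact = sum(
        1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p)
    )
    return exact / len(true_sets)
```

In the usage below, the second protein is under-predicted and the third is over-predicted, so only the first counts as a success.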
Why can iLoc-Virus enhance the success rate so remarkably?
One of the key reasons is that the GO formulation for protein
samples in iLoc-Virus contains more information than that in
Virus-mPLoc (Shen and Chou, 2010b), as can be illustrated as
follows. Suppose that, for a given protein P(q), following Steps 3 and
4 in the section “GO (Gene Ontology) Formulation”, we found
20 proteins that were homologous to it; i.e., $N^{\mathrm{homo}}_{P}(q) = 20$. Of the 20
homologous proteins, 4 hit GO_compress:00008, 16 hit GO_compress:00023, 12 hit GO_compress:00826, and all hit GO_compress:01938. Substituting these data into Eqs. (8)–(9), we have
$$\psi^{G}_{u}(P(q)) = \begin{cases}
4/20 = 0.2, & \text{if } u = 8 \\
16/20 = 0.8, & \text{if } u = 23 \\
12/20 = 0.6, & \text{if } u = 826 \\
20/20 = 1.0, & \text{if } u = 1938 \\
0, & \text{otherwise}
\end{cases} \qquad (u = 1, 2, \ldots, 11{,}118) \qquad (30)$$
where Λ represents the absolute true rate, N the total number of
proteins investigated, and
$$\Delta(i) = \begin{cases}
1, & \text{if all the subcellular locations of the } i\text{th protein are correctly predicted without any overprediction} \\
0, & \text{otherwise}
\end{cases} \qquad (29)$$
In statistical prediction, the following three methods are often
used to examine the quality of a predictor: independent dataset
test, subsampling test, and jackknife test (Chou and Zhang, 1995).
Because the subsampling test and jackknife test can be carried out
using one benchmark dataset and also because the independent
dataset test can be treated as a special case of subsampling test,
one benchmark dataset would suffice to serve all the three kinds
of cross-validation. However, as demonstrated by Eq. (1) of Chou
and Shen (2010a) and elucidated in Chou and Shen (2007), among
the three cross-validation methods, the jackknife test is deemed
the least arbitrary because it can always yield a unique result for a
given benchmark dataset and hence has been widely recognized
and increasingly used to examine the power of various predictors
(see, e.g., Chou and Shen, 2010b; Ding et al., 2009; Du et al., 2009;
Jahandideh et al., 2009; Kannan et al., 2008; Li and Li, 2008; Lin,
2008; Mohabatkar, 2010; Xiao et al., 2011a; Zou et al., 2011).
Accordingly, in this study, the jackknife test will also be used
to evaluate the anticipated accuracy of iLoc-Virus.
However, even when the jackknife test is used to examine the accuracy,
the same predictor may still yield obviously different success rates
when tested on different benchmark datasets. This is because the
more stringent a benchmark dataset is in excluding homologous
sequences, the more difficult it is for a predictor to achieve a high success
rate. Also, the more subsets (subcellular locations) a
benchmark dataset covers, the more difficult it is to achieve a high
overall success rate, as elaborated in a recent review (Chou, 2011).
As mentioned in the Materials section, the benchmark dataset
used in this study is S (cf. Online Supporting Information S1),
which is the same benchmark dataset constructed in Shen and
Chou (2010b) for Virus-mPLoc.
Actually, for such a dataset containing both single-location and
multiple-location viral proteins distributed among 6 subcellular
location sites, so far only one existing predictor, i.e., Virus-mPLoc
(Shen and Chou, 2010b), had the capacity to deal with it. Therefore, to demonstrate the power of the current predictor, it would
suffice to just compare iLoc-Virus with Virus-mPLoc (Shen and
Chou, 2010b).
Listed in Table 2 are the results obtained with Virus-mPLoc
(Shen and Chou, 2010b) and iLoc-Virus on the aforementioned
In contrast, if the same protein was represented according to the
formulation in Virus-mPLoc (Shen and Chou, 2010b), it would be
$$\psi^{G}_{u}(P(q)) = \begin{cases}
1, & \text{if } u = 8 \\
1, & \text{if } u = 23 \\
1, & \text{if } u = 826 \\
1, & \text{if } u = 1938 \\
0, & \text{otherwise}
\end{cases} \qquad (u = 1, 2, \ldots, 11{,}118) \qquad (31)$$
As can be clearly seen by comparing Eq. (30) with Eq. (31),
although the elements in the 8th, 23rd, 826th, and 1938th
components are all non-zero in both formulations, in the one
defined in Virus-mPLoc (Shen and Chou, 2010b) all the elements
are either 1 or zero, completely ignoring their weights. In other
words, the GO formulation in the current iLoc-Virus contains
more information than that in Virus-mPLoc (Shen and Chou,
2010b) and hence leads to better prediction results.
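The difference between the two GO formulations can be made concrete with the numbers of the worked example (20 homologs with 4, 16, 12, and 20 hits). The sketch below stores only the non-zero components in a dictionary keyed by the GO_compress index; the function name `go_components` and this sparse layout are assumptions for illustration.

```python
def go_components(hit_counts, n_homologs, binary=False):
    """Return the non-zero GO components as {index: value}.

    `hit_counts` maps a GO_compress index to the number of homologous
    proteins that hit it. With binary=False the value is the fraction of
    hits (as in Eq. 30); with binary=True it is 1 for any hit (as in Eq. 31).
    """
    if n_homologs <= 0:
        raise ValueError("n_homologs must be positive")
    components = {}
    for index, hits in hit_counts.items():
        if hits > 0:
            components[index] = 1.0 if binary else hits / n_homologs
    return components


hits = {8: 4, 23: 16, 826: 12, 1938: 20}
weighted = go_components(hits, 20)            # fractions of hits
flat = go_components(hits, 20, binary=True)   # presence/absence only
```

The binary form maps all four components to the same value, so the information on how often each GO term was hit is lost.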
The other reason is that in Virus-mPLoc (Shen and Chou,
2010b) the number of the subcellular location sites for a query
protein was determined by a threshold factor $\theta^{*}$ (cf. Eq. (48) in
Chou and Shen (2007)) that actually functioned as a “black box”
without providing any physicochemical rationale. In contrast,
in the current iLoc-Virus
the number of the subcellular location sites for a query
protein is determined according to the nearest neighbor (NN)
principle (cf. Eq. (26)), and its concrete location sites are
determined according to the accumulation-layer scale (cf. Eqs.
(22) and (27)).
5. Conclusions
Prediction of protein subcellular localization is a challenging
problem, particularly when the system concerned contains both
singleplex and multiplex proteins. The reasons why iLoc-Virus
can achieve higher success rates than Virus-mPLoc are as follows.
(1) The GO formulation used to represent protein samples in iLoc-Virus is formed by the probabilities of hits (cf. Eqs. (10)–(11)) and
hence contains more information than that in Virus-mPLoc (Shen
and Chou, 2010b), where only the number “0” or “1” was used
regardless of how many hits were found for the corresponding
component in the GO formulation. (2) The accumulation-layer
scale introduced in iLoc-Virus is more natural and
effective for dealing with proteins having both single and multiple subcellular locations.
Acknowledgments
The authors wish to thank the two anonymous Reviewers,
whose constructive comments are very helpful for strengthening
the presentation of this study. This work was supported by the
grants from the National Natural Science Foundation of China (No.
60961003), the Key Project of Chinese Ministry of Education (No.
210116), and the Province National Natural Science Foundation of
JiangXi (2009GZS0064 and 2010GZS0122).
Appendix A. Supplementary materials
Supplementary materials associated with this article can be
found in the online version at doi:10.1016/j.jtbi.2011.06.005.
References
Altschul, S.F., 1997. Evaluating the statistical significance of multiple distinct local
alignments. In: Suhai, S. (Ed.), Theoretical and Computational Methods in
Genome Research. Plenum, New York, pp. 1–14.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P.,
Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L.,
Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M.,
Sherlock, G., 2000. Gene ontology: tool for the unification of biology. Nat.
Genet. 25, 25–29.
Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte,
N., Lopez, R., Apweiler, R., 2004. The Gene Ontology Annotation (GOA)
Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids
Res. 32, D262-6.
Camon, E., Magrane, M., Barrell, D., Binns, D., Fleischmann, W., Kersey, P., Mulder,
N., Oinn, T., Maslen, J., Cox, A., Apweiler, R., 2003. The Gene Ontology
Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL,
and InterPro. Genome Res. 13, 662–672.
Cedano, J., Aloy, P., Pérez-Pons, J.A., Querol, E., 1997. Relation between amino acid
composition and cellular location of proteins. J. Mol. Biol. 266, 594–600.
Chen, C., Chen, L., Zou, X., Cai, P., 2009. Prediction of protein secondary structure
content by using the concept of Chou’s pseudo amino acid composition and
support vector machine. Protein Pept. Lett. 16, 27–31.
Chen, Y.L., Li, Q.Z., 2007. Prediction of apoptosis protein subcellular location using
improved hybrid approach and pseudo amino acid composition. J. Theor. Biol.
248, 377–381.
Chou, K.C., 1989. Graphic rules in steady and non-steady enzyme kinetics. J. Biol.
Chem. 264, 12074–12079.
Chou, K.C., 1995a. The convergence-divergence duality in lectin domains of the
selectin family and its implications. FEBS Lett. 363, 123–126.
Chou, K.C., 1995b. A novel approach to predicting protein structural classes in a
(20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 21,
319–344.
Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid
composition. Proteins: Struct. Funct. Genet. 43, 246–255 (Erratum: ibid., 2001,
vol. 44, 60).
Chou, K.C., 2004. Review: structural bioinformatics and its impact to biomedical
science. Curr. Med. Chem. 11, 2105–2134.
Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo
amino acid composition (50th Anniversary Year Review). J. Theor. Biol. 273,
236–247.
Chou, K.C., Zhang, C.T., 1994. Predicting protein folding types by distance functions
that make allowances for amino acid interactions. J. Biol. Chem. 269,
22014–22020.
Chou, K.C., Zhang, C.T., 1995. Review: prediction of protein structural classes. Crit.
Rev. Biochem. Mol. Biol. 30, 275–349.
Chou, K.C., Elrod, D.W., 1999. Protein subcellular location prediction. Protein Eng.
12, 107–118.
Chou, K.C., Shen, H.B., 2007. Review: recent progresses in protein subcellular
location prediction. Anal. Biochem. 370, 1–16.
Chou, K.C., Shen, H.B., 2008a. Cell-PLoc: a package of Web servers for predicting
subcellular localization of proteins in various organisms. Nat. Protocols 3,
153–162.
Chou, K.C., Shen, H.B., 2008b. ProtIdent: a web server for identifying proteases and
their types by fusing functional domain and sequential evolution information.
Biochem. Biophys. Res. Commun. 376, 321–325.
Chou, K.C., Shen, H.B., 2010a. Cell-PLoc 2.0: an improved package of web-servers
for predicting subcellular localization of proteins in various organisms. Nat.
Sci. 2, 1090–1103 (openly accessible at <http://www.scirp.org/journal/NS/>).
Chou, K.C., Shen, H.B., 2010b. Plant-mPLoc: a top-down strategy to augment the
power for predicting plant protein subcellular localization. PLoS ONE 5,
e11335.
Chou, K.C., Shen, H.B., 2010c. A new method for predicting the subcellular
localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0. PLoS ONE 5, e9931.
Ding, H., Luo, L., Lin, H., 2009. Prediction of cell wall lytic enzymes using
Chou’s amphiphilic pseudo amino acid composition. Protein Pept. Lett. 16,
351–355.
Ding, Y.S., Zhang, T.L., 2008. Using Chou’s pseudo amino acid composition to
predict subcellular localization of apoptosis proteins: an approach with
immune genetic algorithm-based ensemble classifier. Pattern Recognition
Lett. 29, 1887–1892.
Du, P., Cao, S., Li, Y., 2009. SubChlo: predicting protein subchloroplast locations
with pseudo-amino acid composition and the evidence-theoretic K-nearest
neighbor (ET-KNN) algorithm. J. Theor. Biol. 261, 330–335.
Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G., 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid
sequence. J. Mol. Biol. 300, 1005–1016.
Esmaeili, M., Mohabatkar, H., Mohsenzadeh, S., 2010. Using the concept of Chou’s
pseudo amino acid composition for risk type prediction of human papillomaviruses. J. Theor. Biol. 263, 203–209.
Gao, Q.B., Jin, Z.C., Ye, X.F., Wu, C., He, J., 2009. Prediction of nuclear receptors with
optimal pseudo amino acid composition. Anal. Biochem. 387, 54–59.
Gardy, J.L., Laird, M.R., Chen, F., Rey, S., Walsh, C.J., Ester, M., Brinkman, F.S., 2005.
PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization
and insights gained from comparative proteome analysis. Bioinformatics 21,
617–623.
Georgiou, D.N., Karakasidis, T.E., Nieto, J.J., Torres, A., 2009. Use of fuzzy clustering
technique and matrices to classify amino acids and its impact to Chou’s
pseudo amino acid composition. J. Theor. Biol. 257, 17–26.
Gerstein, M., Thornton, J.M., 2003. Sequences and topology. Curr. Opin. Struct. Biol.
13, 341–343.
Glory, E., Murphy, R.F., 2007. Automated subcellular location determination and
high-throughput microscopy. Dev. Cell 12, 7–16.
Guo, J., Rao, N., Liu, G., Yang, Y., Wang, G., 2011. Predicting protein folding rates
using the concept of Chou’s pseudo amino acid composition. J. Comput.
Chem. 32, 1612–1617.
Hayat, M., Khan, A., 2011. Predicting membrane protein types by fusing composite
protein sequence features into pseudo amino acid composition. J. Theor. Biol.
271, 10–17.
Jahandideh, S., Hoseini, S., Jahandideh, M., Hoseini, A., Disfani, F.M., 2009. Gamma-turn types prediction in proteins using the two-stage hybrid neural discriminant model. J. Theor. Biol. 259, 517–522.
Kandaswamy, K.K., Pugalenthi, G., Moller, S., Hartmann, E., Kalies, K.U., Suganthan,
P.N., Martinetz, T., 2010. Prediction of apoptosis protein locations with genetic
algorithms and support vector machines through a new mode of pseudo
amino acid composition. Protein Pept. Lett. 17, 1473–1479.
Kannan, S., Hauth, A.M., Burger, G., 2008. Function prediction of hypothetical
proteins without sequence similarity to proteins of known function. Protein
Pept. Lett. 15, 1107–1116.
Li, F.M., Li, Q.Z., 2008. Predicting protein subcellular location using Chou’s pseudo
amino acid composition and improved hybrid approach. Protein Pept. Lett. 15,
612–616.
Lin, H., 2008. The modified Mahalanobis discriminant for predicting outer
membrane proteins by using Chou’s pseudo amino acid composition. J. Theor.
Biol. 252, 350–356.
Lin, H., Ding, H., 2011. Predicting ion channels and their types by the dipeptide
mode of pseudo amino acid composition. J. Theor. Biol. 269, 64–69.
Liu, T., Zheng, X., Wang, C., Wang, J., 2010. Prediction of subcellular location of
apoptosis proteins using pseudo amino acid composition: an approach from
auto covariance transformation. Protein Pept. Lett. 17, 1263–1269.
Loewenstein, Y., Raimondo, D., Redfern, O.C., Watson, J., Frishman, D., Linial, M.,
Orengo, C., Thornton, J., Tramontano, A., 2009. Protein function annotation by
homology-based inference. Genome Biol. 10, 207.
Mahalanobis, P.C., 1936. On the generalized distance in statistics. Proc. Natl. Inst.
Sci. India 2, 49–55.
Mardia, K.V., Kent, J.T., Bibby, J.M., 1979. Multivariate Analysis: Chapter 11
Discriminant Analysis; Chapter 12 Multivariate Analysis of Variance; Chapter 13 Cluster Analysis. Academic Press, London (pp. 322–381).
Mohabatkar, H., 2010. Prediction of cyclin proteins using Chou’s pseudo amino
acid composition. Protein Pept. Lett. 17, 1207–1214.
Mondal, S., Bhavna, R., Mohan Babu, R., Ramakumar, S., 2006. Pseudo amino acid
composition and multi-class support vector machines approach for conotoxin
superfamily classification. J. Theor. Biol. 243, 252–260.
Nakai, K., 2000. Protein sorting signals and prediction of subcellular localization.
Adv. Protein Chem. 54, 277–344.
Nakai, K., Kanehisa, M., 1991. Expert system for predicting protein localization
sites in Gram-negative bacteria. Proteins: Struct. Funct. Genet. 11, 95–110.
Nakashima, H., Nishikawa, K., 1994. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies.
J. Mol. Biol. 238, 54–61.
Nanni, L., Lumini, A., 2008. Genetic programming for creating Chou’s pseudo
amino acid based features for submitochondria localization. Amino Acids 34,
653–660.
Pan, Y.X., Zhang, Z.Z., Guo, Z.M., Feng, G.Y., Huang, Z.D., He, L., 2003. Application of
pseudo amino acid composition for predicting protein subcellular location:
stochastic signal processing approach. J. Protein Chem. 22, 395–402.
Park, K.J., Kanehisa, M., 2003. Prediction of protein subcellular locations by support
vector machines using compositions of amino acid and amino acid pairs.
Bioinformatics 19, 1656–1663.
Pillai, K.C.S., 1985. Mahalanobis D2. In: Kotz, S., Johnson, N.L. (Eds.), Encyclopedia
of Statistical Sciences, vol. 5. John Wiley & Sons, New York, pp.
176–181 (This reference also presents a brief biography of Mahalanobis who
was a man of great originality and who made considerable contributions to
statistics).
Qiu, J.D., Huang, J.H., Shi, S.P., Liang, R.P., 2010. Using the concept of Chou’s pseudo
amino acid composition to predict enzyme family classes: an approach with
support vector machine based on discrete wavelet transform. Protein Pept.
Lett. 17, 715–722.
Reinhardt, A., Hubbard, T., 1998. Using neural networks for prediction of the
subcellular location of proteins. Nucleic Acids Res. 26, 2230–2236.
Sahu, S.S., Panda, G., 2010. A novel feature representation method based on Chou’s
pseudo amino acid composition for protein structural class prediction.
Comput. Biol. Chem. 34, 320–327.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin,
E.V., Altschul, S.F., 2001. Improving the accuracy of PSI-BLAST protein database
searches with composition-based statistics and other refinements. Nucleic
Acids Res. 29, 2994–3005.
Shen, H.B., Chou, K.C., 2007. Virus-PLoc: a fusion classifier for predicting the
subcellular localization of viral proteins within host and virus-infected cells.
Biopolymers 85, 233–240.
Shen, H.B., Chou, K.C., 2009a. Predicting protein fold pattern with functional
domain and sequential evolution information. J. Theor. Biol. 256, 441–446.
Shen, H.B., Chou, K.C., 2009b. QuatIdent: a web server for identifying protein
quaternary structural attribute by fusing functional domain and sequential
evolution information. J. Proteome Res. 8, 1577–1584.
Shen, H.B., Chou, K.C., 2010a. Gneg-mPLoc: a top-down strategy to enhance the
quality of predicting subcellular localization of Gram-negative bacterial
proteins. J. Theor. Biol. 264, 326–333.
Shen, H.B., Chou, K.C., 2010b. Virus-mPLoc: a fusion classifier for viral protein
subcellular location prediction by incorporating multiple sites. J. Biomol.
Struct. Dyn. 28, 175–186.
Small, I., Peeters, N., Legeai, F., Lurin, C., 2004. Predotar: a tool for rapidly screening
proteomes for N-terminal targeting sequences. Proteomics 4, 1581–1590.
Smith, C., 2008. Subcellular targeting of proteins and drugs. <http://www.biocompare.com/Articles/TechnologySpotlight/976/Subcellular-Targeting-OfProteins-And-Drugs.html>.
Wong, J.H., Ng, T.B., 2009. Studies on an antifungal protein and a chromatographically and structurally related protein isolated from the culture broth of
Bacillus amyloliquefaciens. Protein Pept. Lett. 16, 1399–1406.
Wootton, J.C., Federhen, S., 1993. Statistics of local complexity in amino acid
sequences and sequence databases. Comput. Chem. 17, 149–163.
Xiao, X., Wang, P., Chou, K.C., 2011a. Quat-2L: a web-server for predicting protein
quaternary structural attributes. Mol. Diversity 15, 149–155.
Xiao, X., Wang, P., Chou, K.C., 2011b. GPCR-2L: predicting G protein-coupled
receptors and their types by hybridizing two different modes of pseudo amino
acid compositions. Mol. Biosyst. 7, 911–919.
Xiao, X., Shao, S.H., Huang, Z.D., Chou, K.C., 2006a. Using pseudo amino acid
composition to predict protein structural classes: approached with complexity
measure factor. J. Comput. Chem. 27, 478–482.
Xiao, X., Shao, S.H., Ding, Y.S., Huang, Z.D., Chou, K.C., 2006b. Using cellular
automata images and pseudo amino acid composition to predict protein
subcellular location. Amino Acids 30, 49–54.
Yu, L., Guo, Y., Li, Y., Li, G., Li, M., Luo, J., Xiong, W., Qin, W., 2010. SecretP:
identifying bacterial secreted proteins by fusing new features into Chou’s
pseudo-amino acid composition. J. Theor. Biol. 267, 1–6.
Zeng, Y.H., Guo, Y.Z., Xiao, R.Q., Yang, L., Yu, L.Z., Li, M.L., 2009. Using the
augmented Chou’s pseudo amino acid composition for predicting protein
submitochondria locations based on auto covariance approach. J. Theor. Biol.
259, 366–372.
Zhang, G.Y., Fang, B.S., 2008. Predicting the cofactors of oxidoreductases based on
amino acid composition distribution and Chou’s amphiphilic pseudo amino
acid composition. J. Theor. Biol. 253, 310–315.
Zhang, G.Y., Li, H.C., Gao, J.Q., Fang, B.S., 2008. Predicting lipase types by improved
Chou’s pseudo-amino acid composition. Protein Pept. Lett. 15, 1132–1137.
Zhou, G.P., Doctor, K., 2003. Subcellular location prediction of apoptosis proteins.
Proteins: Struct. Funct. Genet. 50, 44–48.
Zhou, X.B., Chen, C., Li, Z.C., Zou, X.Y., 2007. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme
subfamily classes. J. Theor. Biol. 248, 546–551.
Zou, D., He, Z., He, J., Xia, Y., 2011. Supersecondary structure prediction using
Chou’s pseudo amino acid composition. J. Comput. Chem. 32, 271–278.