OMICS A Journal of Integrative Biology
Volume 10, Number 1, 2006
© Mary Ann Liebert, Inc.
Diffusion Kernel-Based Logistic Regression Models for
Protein Function Prediction
HYUNJU LEE,1 ZHIDONG TU,2 MINGHUA DENG,3 FENGZHU SUN,2
and TING CHEN2
ABSTRACT
Assigning functions to unknown proteins is one of the most important problems in proteomics.
Several approaches have used protein–protein interaction data to predict protein functions. We
previously developed a Markov random fields (MRF) based method to infer a protein’s functions using protein-protein interaction data and the functional annotations of its protein interaction partners. In the original model, only direct interactions were considered and each function was considered separately. In this study, we develop a new model which extends direct
interactions to all neighboring proteins, and one function to multiple functions. The goal is to
understand a protein’s function based on information on all the neighboring proteins in the interaction network. We first developed a novel kernel logistic regression (KLR) method based
on diffusion kernels for protein interaction networks. The diffusion kernels provide means to
incorporate all neighbors of proteins in the network. Second, we identified a set of functions
that are highly correlated with the function of interest, referred to as the correlated functions,
using the chi-square test. Third, the correlated functions were incorporated into our new KLR
model. Fourth, we extended our model by incorporating multiple biological data sources such
as protein domains, protein complexes, and gene expressions by converting them into networks.
We showed that the KLR approach of incorporating all protein neighbors significantly improved the accuracy of protein function predictions over the MRF model. The incorporation
of multiple data sets also improved prediction accuracy. The prediction accuracy is comparable to another protein function classifier based on the support vector machine (SVM), using a
diffusion kernel. The advantages of the KLR model include its simplicity as well as its ability
to explore the contribution of neighbors to the functions of proteins of interest.
INTRODUCTION
THE COMPLETE GENOME SEQUENCES of many organisms are now available and the proteome of these organisms can be roughly estimated. Despite extensive efforts to annotate the functions of proteins, a
large number of proteins remain unknown. For example, for the best studied organism, yeast, about a third
of all the proteins are still unknown. Several methods for protein function prediction have been developed.
1Department of Computer Science, University of Southern California, Los Angeles, California.
2Molecular and Computational Biology Program, University of Southern California, Los Angeles, California.
3LMAM, School of Mathematical Sciences and Center for Theoretical Biology, Peking University, Beijing, China.
DIFFUSION KERNEL-BASED LOGISTIC REGRESSION AND PROTEINS
The most widely used method for protein function prediction first finds homologies between the protein of
interest and other proteins in protein databases using programs such as FASTA (Pearson et al., 1988) and
BLAST (Altschul et al., 1997), and then predicts functions of the protein based on the functions of the homologous proteins. Recent developments in high-throughput biotechniques have generated a variety of different data sources that are useful for protein function prediction. Microarray-based gene expression profiles have been extensively used to cluster genes into groups having similar functions (Brown et al., 2002;
Eisen et al., 1998; Pavlidis et al., 2001). Several investigators have developed protein function prediction
methods based on protein physical or genetic interactions (Fellenberg et al., 2002; Schwikowski et al., 2002;
Hishigaki et al., 2001; Vazquez et al., 2003; Deng et al., 2002, 2004), as well as features of individual proteins (Gupta et al., 2002; Hegyi et al., 1999; Jensen et al., 2002; Stawiski et al., 2002; Clare et al., 2002;
Kell et al., 2000; King et al., 2001; Drawid et al., 2000). For the variety of different protein interaction data
sets and how they relate to protein functions, see Mering et al. (2002).
Deng et al. (2002) developed a Markov random field (MRF) model for protein function prediction using
protein-protein interaction data. The method was later extended to incorporate multiple interaction data
sources such as genetic interactions, protein complexes, and co-expressed gene expressions, as well as features of individual proteins (Deng et al., 2003). It has been shown that the integrated approach significantly
improved the prediction accuracy. The available MRF methods used information on immediate interactions
to infer functions for an unknown protein (Deng et al., 2002, 2003, 2004), and considered each function
category separately. A natural question is whether the prediction accuracy can be significantly improved if
all neighbors are incorporated into the MRF method.
Lanckriet et al. (2004) developed a support vector machine (SVM) approach for predicting protein functions using a diffusion kernel on a protein interaction network. They showed that the prediction accuracy of
the SVM approach is higher than that of the original MRF approach. As the MRF approach is model-based,
it can be used to explore the contributions of neighbors to the functions of proteins of interest while the SVM
cannot. We also noticed that the number of support vectors was generally very large, close to the number of
proteins in the training set, which may result in overfitting of the parameters to the data in training sets.
In this study, we develop a new diffusion kernel-based logistic regression (KLR) model to predict unknown function in a model organism, yeast. Even though yeast is one of the most studied species, the functions of about a third of the proteins are still unknown. The KLR model combines the advantages of both
the MRF approach and the diffusion kernel. The KLR approach has the advantage of high prediction accuracy similar to the SVM approach as well as the ability to explore the contributing factors for the functions of proteins. We applied the KLR model to a yeast protein interaction network with 1,881 proteins and
2,568 interactions using 34 functions defined in Gene Ontology (GO). The results showed that the KLR
approach improved the prediction accuracy over MRF, suggesting that the KLR model captured more information for protein function prediction compared to the original MRF model. We also compared the KLR
method with the SVM method using the same diffusion kernel. The prediction accuracy of the KLR approach is similar to or slightly better than that of the SVM approach. However, the KLR model is much simpler. The most important feature of the KLR method is that it can be used to study the contributions of
neighboring proteins to the functions of a protein of interest. In terms of computational time, we ran 170
(34 functions * 5-fold) training sets on a PC with 1.4-GHz CPU and 256-MB memory. The KLR method
took two minutes, while the SVM method took from around ten minutes to more than two hours depending on the diffusion kernel constants.
We also extended our KLR method to combine multiple data sources such as protein domains, genetic
interactions, protein complexes, gene expressions, and protein localizations to predict protein function. We
will call this model the generalized KLR method.
METHODS
We extended the MRF model for protein function prediction to incorporate all proteins directly and indirectly connected to the protein of interest in a network. For a protein of interest, its k-th level neighbors
are defined as the proteins having the shortest distance of k to the protein in the network. Figure 1a shows
LEE ET AL.
FIG. 1. (a) Protein STE12 has a function of conjugation with cellular fusion, while none of its immediate neighbors
have this function, and seven out of 13 second-level neighbors have this function. (Proteins in white circles have the
function conjugation with cellular fusion and proteins in black circles do not have the function.) (b) While protein
TUB4 has function mitosis, its neighbor proteins do not have the function except for its second-level neighbor SPC72.
However, all neighbors of the protein TUB4 have function cytoskeleton organization and biogenesis. (Proteins in white
circles have the function mitosis, and proteins in black circles have the function cytoskeleton organization and biogenesis, and proteins in gray circles have both functions.)
the contributions of second-level neighbors of protein STE12 to its function. Note that protein STE12 has
function “conjugation with cellular fusion.” None of its immediate neighbors have this function, and seven
out of 13 second-level neighbors have this function. For a function of interest, we define its correlated functions as those functions for which, when a large fraction of a protein’s neighbors have them, the protein is more likely to have the function of interest. Figure 1b shows the contributions of correlated functions of the neighbors of protein TUB4 to its function “mitosis.” Only one of TUB4’s second-level neighbors (SPC72) has
the function “mitosis.” However, all neighbors of the protein TUB4 have function “cytoskeleton organization and biogenesis.” We show later in the paper that “cytoskeleton organization and biogenesis” is highly
correlated with “mitosis.” Since all the neighbor proteins of TUB4 have the function “cytoskeleton organization and biogenesis,” we can infer that TUB4 has a high chance of having the function “mitosis.”
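The definition of k-th level neighbors given above can be illustrated with a breadth-first search. The following is only a minimal sketch: the function name `neighbor_levels`, the dictionary-based network representation, and the toy network are assumptions made for illustration and are not part of the original analysis.

```python
from collections import deque

def neighbor_levels(adjacency, source):
    """Group proteins by their shortest-path distance k from `source`."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    levels = {}
    for protein, k in distance.items():
        if k > 0:  # skip the protein of interest itself
            levels.setdefault(k, set()).add(protein)
    return levels

# Hypothetical toy network (not the yeast interaction data).
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
levels = neighbor_levels(toy_network, "A")
print(levels)
```

Here `levels[1]` holds the immediate neighbors of protein "A" and `levels[2]` its second-level neighbors, which mirrors the way Figure 1 distinguishes the two groups.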
In order to make the paper self-contained, we describe the formulation of the problem first, and then
briefly describe the MRF approach, the SVM approach, and the KLR approach. Suppose a proteome has
N proteins $X_1, \ldots, X_N$. Some proteins have known functions, while others are unknown. We refer to proteins having known functions as known proteins and to proteins with unknown functions as unknown proteins. Let $X_1, \ldots, X_n$ be the unknown proteins, $X_{n+1}, \ldots, X_{n+m}$ be the known proteins, and $N = n + m$.
We are also given a protein interaction network. The objective is to assign functions to the unknown proteins based on the functions of known proteins and the interaction network. We also want to know the contributing factors for a protein’s function.
Previous research
Several investigators have developed methods for protein function prediction based on interaction networks. Here, we concentrate on two of these approaches that are most relevant to the current work: the
MRF-based approach of Deng et al. (2002) and the SVM approach of Lanckriet et al. (2004). In both approaches, the investigators considered one function at a time. Let $X_i = 1$ if the i-th protein has the function
and $X_i = 0$ otherwise.
MRF approach of Deng et al. (2002)
Let $X = (X_1, \ldots, X_{n+m})$ be the configuration of the functional labeling of the proteins. The assumption
for inferring a protein function from a protein interaction network is that if a large fraction of a protein’s
neighbors have a certain function, the protein most likely has the function. The simplest approach is to
count the numbers of neighbor proteins having some functions of interest and then select the most common functions (Schwikowski et al., 2002). However, this ignores the frequency of functions among all the
proteins and does not consider the structure of protein interaction networks. Deng et al. (2002) considered
a MRF approach. They modeled $(X_1, X_2, \ldots, X_{n+m})$ as a MRF with its probability proportional to
$$\exp(\beta_{10} N_{10} + \beta_{11} N_{11} + \beta_{00} N_{00}) \tag{1}$$
Therefore, the total probability of the functional labeling is modeled as proportional to $\exp(-U(x))$,
$$\exp(-U(x)) = \exp(\alpha N_1 + \beta_{10} N_{10} + \beta_{11} N_{11} + \beta_{00} N_{00}) \tag{2}$$
$N_1$ is the number of proteins having the function of interest; $N_{ll'}$ is the number of protein interactions with one protein labeled $l$ and the other protein labeled $l'$, $l, l' = 0, 1$.
It can then be shown that
$$\log \frac{\Pr(X_i = 1 \mid X_{[-i]}, \theta)}{1 - \Pr(X_i = 1 \mid X_{[-i]}, \theta)} = \alpha + (\beta_{10} - \beta_{00}) N_0(i) + (\beta_{11} - \beta_{10}) N_1(i) \tag{3}$$
where $N_1(i)$ and $N_0(i)$ are the number of immediate interaction neighbors of protein i having the function and not having the function, respectively, and $X_{[-i]} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n+m})$. Deng et al. (2002) used a pseudo-likelihood approach to estimate the parameters $\theta = (\alpha, \beta_{10}, \beta_{11}, \beta_{00})$, based on the functions of the known proteins. They also developed a Markov Chain Monte Carlo (MCMC) approach to estimate the posterior probabilities of $X_i = 1$ conditional on both the network and $(X_{n+1}, X_{n+2}, \ldots, X_{n+m})$.
SVM approach of Lanckriet et al. (2004)
The performance of a SVM method largely depends on the kernel used to represent the data set. Lanckriet et al. (2004) developed a SVM approach for protein function prediction based on a diffusion kernel K
for a network. The diffusion kernel K measures the similarity between any two nodes in the network and it is defined as follows.
$$K = e^{\tau H} \tag{4}$$
where
$$H(i, j) = \begin{cases} 1 & \text{if protein } i \text{ interacts with protein } j \\ -d_i & \text{if protein } i \text{ is the same as protein } j \\ 0 & \text{otherwise} \end{cases}$$
Here $d_i$ is the number of interaction partners for protein i, $\tau$ is the diffusion constant, and $e^{\tau H}$ represents the matrix exponential of the adjacency matrix H. They showed that the prediction accuracy of the SVM approach is higher than that of the MRF approach for all the function categories considered.
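To make the construction in equation 4 concrete, the sketch below builds H for a small network and approximates the matrix exponential with a scaled Taylor series followed by repeated squaring. This is an illustrative stand-in written with the Python standard library only; it is not the implementation used in the experiments, and the function names, the toy network, and the numerical settings (`squarings`, `terms`) are assumptions.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def build_h(n, edges):
    """H(i, j) = 1 for interacting pairs, -d_i on the diagonal, 0 otherwise."""
    h = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        if i != j and h[i][j] == 0.0:  # ignore self-loops and duplicate edges
            h[i][j] = h[j][i] = 1.0
            h[i][i] -= 1.0
            h[j][j] -= 1.0
    return h

def diffusion_kernel(n, edges, tau, squarings=10, terms=12):
    """Approximate K = exp(tau * H): Taylor series of the scaled matrix, then squaring."""
    h = build_h(n, edges)
    scale = tau / 2 ** squarings
    a = [[scale * v for v in row] for row in h]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical path network 0-1-2-3.
K = diffusion_kernel(4, [(0, 1), (1, 2), (2, 3)], tau=0.5)

# Two-protein check: for a single interaction, K(0, 1) = (1 - exp(-2 * tau)) / 2.
K2 = diffusion_kernel(2, [(0, 1)], tau=1.0)
closed_form = (1.0 - math.exp(-2.0)) / 2.0
print(K[0], K2[0][1], closed_form)
```

On the path network the entries K(0, 1), K(0, 2), and K(0, 3) decrease with distance from protein 0, which is the property that lets the kernel serve as a similarity measure between proteins that do not interact directly.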
New kernel-based logistic regression model
Both the MRF and the SVM approaches have some advantages. Since the MRF approach is model-based,
it has the advantages of simplicity as well as its ability to explore the contributions of neighbors to the function of interest. The SVM approach, on the other hand, gives higher prediction accuracy for protein function.
It is not clear, though, how to use the SVM approach to study the contributing factors for a protein’s function.
In this paper, we combine the advantages of both approaches to develop a new kernel-based logistic regression
model for protein function prediction. Note that the diffusion kernel K(i,j) defined in equation 4 decreases as
the shortest distance in the protein interaction network between protein i and protein j increases. Therefore,
K(i,j) can be regarded as a similarity measure between pairs of proteins in the interaction network. K(i,j) is not
only defined for pairs of interacting proteins, but also for protein pairs separated through several interactions.
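Similar to the idea of the MRF approach of Deng et al. (2002) using the diffusion kernel, we model the
probability for $X = (X_1, \ldots, X_{n+m})$ as proportional to
$$\exp(\gamma N_1 + \delta_{10} D_{10} + \delta_{11} D_{11} + \delta_{00} D_{00}) \tag{5}$$
where $\gamma$, $\delta_{10}$, $\delta_{11}$, and $\delta_{00}$ are constants, and
$$N_1 = \sum_{i} I\{x_i = 1\}$$
$$D_{11} = \sum_{i<j} K(i, j) I\{x_i = 1, x_j = 1\}$$
$$D_{10} = \sum_{i<j} K(i, j) I\{(x_i = 1, x_j = 0) \text{ or } (x_i = 0, x_j = 1)\}$$
$$D_{00} = \sum_{i<j} K(i, j) I\{x_i = 0, x_j = 0\}$$
The summations are over all the protein pairs. From equation 5, it can be shown that
$$\log \frac{\Pr(X_i = 1 \mid X_{[-i]}, \theta)}{1 - \Pr(X_i = 1 \mid X_{[-i]}, \theta)} = \gamma + (\delta_{10} - \delta_{00}) K_0(i) + (\delta_{11} - \delta_{10}) K_1(i) \tag{6}$$
where
$$K_0(i) = \sum_{j \ne i} K(i, j) I\{x_j = 0\}$$
$$K_1(i) = \sum_{j \ne i} K(i, j) I\{x_j = 1\}$$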
Note that if we let $K(i, j) = 1$ if protein i interacts with protein j and $K(i, j) = 0$ otherwise, this new model
is the same as the MRF model of Deng et al. (2002). We can similarly develop a MCMC approach to approximate the probability that an unknown protein has the function of interest conditional on the network
and the functions of known proteins.
KLR model for one function
Instead of using the MCMC approach, we also consider a much simpler kernel based logistic regression
(KLR) model based on equation 6. Let
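$$M_0(i) = \sum_{j \ne i,\ x_j \text{ known}} K(i, j) I\{x_j = 0\}, \qquad M_1(i) = \sum_{j \ne i,\ x_j \text{ known}} K(i, j) I\{x_j = 1\}$$
The KLR model is given by
$$\log \frac{\Pr(X_i = 1 \mid X_{[-i]}, \theta)}{1 - \Pr(X_i = 1 \mid X_{[-i]}, \theta)} = \gamma + \delta M_0(i) + \eta M_1(i) \tag{7}$$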
Note that the KLR approach uses only known proteins in the prediction and unknown proteins are left out.
Preliminary results showed that the MCMC model in equation 6 and the KLR model in equation 7 gave
similar results. Thus, we will only present the results based on the KLR model.
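The one-function KLR model of equation 7 can be sketched as an ordinary logistic regression on two kernel-weighted features. The code below is a minimal illustration under stated assumptions: the coefficient names `gamma`, `delta`, and `eta`, the plain gradient-ascent fit (used here in place of statistical software), and the toy kernel and labels are all hypothetical and are not taken from the original experiments.

```python
import math

def klr_features(kernel, labels, i):
    """M0(i) and M1(i): kernel-weighted counts over known proteins j != i."""
    m0 = m1 = 0.0
    for j, label in enumerate(labels):
        if j == i or label is None:  # unknown proteins are left out
            continue
        if label == 1:
            m1 += kernel[i][j]
        else:
            m0 += kernel[i][j]
    return m0, m1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_klr(kernel, labels, steps=2000, rate=0.5):
    """Fit the three coefficients by gradient ascent on the log-likelihood."""
    known = [i for i, label in enumerate(labels) if label is not None]
    feats = [klr_features(kernel, labels, i) for i in known]
    gamma = delta = eta = 0.0
    for _ in range(steps):
        g_gamma = g_delta = g_eta = 0.0
        for i, (m0, m1) in zip(known, feats):
            err = labels[i] - sigmoid(gamma + delta * m0 + eta * m1)
            g_gamma += err
            g_delta += err * m0
            g_eta += err * m1
        gamma += rate * g_gamma / len(known)
        delta += rate * g_delta / len(known)
        eta += rate * g_eta / len(known)
    return gamma, delta, eta

def predict(kernel, labels, params, i):
    """Probability that protein i has the function of interest."""
    gamma, delta, eta = params
    m0, m1 = klr_features(kernel, labels, i)
    return sigmoid(gamma + delta * m0 + eta * m1)

# Hypothetical kernel: proteins 0-2 and 3-5 form two tightly connected groups.
group = [0, 0, 0, 1, 1, 1]
toy_kernel = [[1.0 if i == j else (0.3 if group[i] == group[j] else 0.01)
               for j in range(6)] for i in range(6)]
toy_labels = [1, 1, None, 0, 0, None]  # None marks an unknown protein

params = fit_klr(toy_kernel, toy_labels)
p2 = predict(toy_kernel, toy_labels, params, 2)
p5 = predict(toy_kernel, toy_labels, params, 5)
print(params, p2, p5)
```

In this toy setting the unknown protein 2 sits close to proteins that have the function and receives a probability above 0.5, while protein 5 receives a probability below 0.5. The signs of the fitted coefficients indicate how neighbors with and without the function contribute, which is the interpretability the model-based approach is meant to provide.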
KLR model for correlated functions
Here, we generalize the KLR model for one function in equation 7 by incorporating correlated functions
into the model. Assume that we are given K functional categories: $C_1, C_2, \ldots, C_K$ and
$\sum_{k=1}^{K} \Pr(X_i = C_k) = 1$,
and let $\{X_i = C_k\}$ be the event that the i-th protein has function $C_k$. We can generalize the KLR model
(eq. 7) as follows.
$$\log \frac{\Pr(X_i = C_k)}{\Pr(X_i = C_K)} = \gamma_k + \sum_{l=1}^{K} \delta_{kl} M_l(i), \quad k = 1, 2, \ldots, K - 1 \tag{8}$$
where
$$M_l(i) = \sum_{j \ne i} K(i, j) I\{x_j = C_l\}, \quad l = 1, 2, \ldots, K$$
Ml (i) is the weighted number of neighbors of protein i having function l with weight K(i, j) for protein j.
This is a multi-variate logistic regression problem and can be solved using statistical software such as
S-plus.
The model in equation 8 may contain many parameters when the number of functions being considered
is large. For example, if we use 34 functions in this study, the number of parameters will be 33 × 35 = 1,155. In order to reduce the number of parameters, for each function of interest we identify a
subset of functions highly associated with it to be included in equation 8.
We use the chi-square test to identify correlated functions for a function of interest. For a protein $P_i$ having a function $C_j$, the chi-square association value between the function $C_j$ and a function $C_l$, based on $P_i$’s
immediate neighbors, is defined as
$$\frac{\left(N_i^{(1)}(l) - N_i^{(1)} Q_l\right)^2}{N_i^{(1)} Q_l} \tag{9}$$
where $N_i^{(1)}$ is the number of immediate neighbors of $P_i$, $N_i^{(1)}(l)$ is the number of immediate neighbors of
$P_i$ having function $C_l$, and $Q_l$ is the fraction of known proteins having function $C_l$. We sum the corresponding
quantities over all proteins having function $C_j$ in the network to obtain an overall statistic,
$$\frac{\left(\sum_i N_i^{(1)}(l) - \sum_i N_i^{(1)} Q_l\right)^2}{\sum_i N_i^{(1)} Q_l} \tag{10}$$
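The overall statistic in equation 10 can be computed directly from an annotated network. The sketch below is only an illustration: the function name, the dictionary-based inputs, and the toy annotations are assumptions, and it counts all immediate neighbors of each annotated protein and sums over the proteins annotated with the function of interest, which is one reading of the definitions above.

```python
def chi_square_association(adjacency, functions, j, l):
    """Equation 10 style statistic between function j (of interest) and function l.

    adjacency: protein -> list of immediate neighbors
    functions: known protein -> set of its functions
    """
    known = list(functions)
    q_l = sum(1 for p in known if l in functions[p]) / len(known)
    observed = 0  # summed number of neighbors having function l
    total = 0     # summed number of immediate neighbors
    for p in known:
        if j not in functions[p]:
            continue
        neighbors = adjacency.get(p, ())
        total += len(neighbors)
        observed += sum(1 for nb in neighbors if l in functions.get(nb, ()))
    expected = total * q_l
    if expected == 0:
        return 0.0
    return (observed - expected) ** 2 / expected

# Hypothetical annotated network with four known proteins.
toy_adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
toy_functions = {
    "a": {"mitosis"},
    "b": {"cytoskeleton"},
    "c": {"cytoskeleton"},
    "d": {"mitosis", "cytoskeleton"},
}
value = chi_square_association(toy_adjacency, toy_functions, "mitosis", "cytoskeleton")
print(value)
```

In the toy network the two proteins annotated with "mitosis" have three neighbors in total, all annotated with "cytoskeleton", against an expected count of 3 × 0.75 = 2.25, so the statistic is 0.75² / 2.25 = 0.25.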
KLR model for multiple data sources
Suppose we have several data sources that may be useful for protein function prediction. We convert
each data source into a matrix and treat them the same as physical interaction data. Suppose we have
D data sources that have all been transformed into kernel matrices. Let K(d)(i,j) be the kernel matrix
for the d-th data source. The KLR model with correlated functions (eq. 8) can then be further
extended to
$$\log \frac{\Pr(X_i = C_k)}{\Pr(X_i = C_K)} = \gamma_k + \sum_{d=1}^{D} \sum_{l=1}^{K} \delta_{kl}^{(d)} M_l^{(d)}(i), \quad k = 1, 2, \ldots, K - 1 \tag{11}$$
where
$$M_l^{(d)}(i) = \sum_{j \ne i} K^{(d)}(i, j) I\{X_j = C_l\}, \quad l = 1, 2, \ldots, K$$
$M_l^{(d)}(i)$ is the weighted number of neighbors of protein i having function l with weight $K^{(d)}(i, j)$ for protein
j in the d-th network.
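As a small illustration of how the regressors in the multiple-data-source model could be assembled, the sketch below computes the weighted neighbor counts for each kernel and concatenates them into one feature vector per protein. The function names, the representation of annotations as sets, and the two toy kernels are assumptions made for this example only.

```python
def weighted_function_counts(kernel, labels, i, categories):
    """Weighted number of neighbors of protein i having each function, for one kernel."""
    counts = []
    for category in categories:
        total = 0.0
        for j, label in enumerate(labels):
            if j != i and label is not None and category in label:
                total += kernel[i][j]
        counts.append(total)
    return counts

def multi_source_features(kernels, labels, i, categories):
    """Concatenate the weighted counts over all D data sources."""
    features = []
    for kernel in kernels:
        features.extend(weighted_function_counts(kernel, labels, i, categories))
    return features

# Two hypothetical 3 x 3 kernels, e.g. one per data source.
kernel_a = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.1], [0.2, 0.1, 1.0]]
kernel_b = [[1.0, 0.0, 0.4], [0.0, 1.0, 0.3], [0.4, 0.3, 1.0]]
toy_labels = [None, {"A"}, {"A", "B"}]  # None marks the unknown protein

features = multi_source_features([kernel_a, kernel_b], toy_labels, 0, ["A", "B"])
print(features)
```

Each data source contributes one block of K features, so a model with D sources and K functions has D × K regressors per protein before any selection of correlated functions.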
Table 2 shows the top three chi-square values for each function. Significant function correlations are observed. For example, for “secretory pathway (function 15)”, the chi-square value with “vesicle-mediated
transport (function 16)” is comparable to that of the function itself. For “protein modification (function 31),”
the chi-square value for “cellular morphogenesis (function 2)” is even larger than that for the function
itself.
Fitting the data to the full model in equation 8 will be unreliable due to the large number of parameters
involved. In order to reduce the number of parameters in the model, for each function we consider only
functions with the top five chi-square values.
TABLE 1. LIST OF GENE ONTOLOGY INFORMATIVE TERMINAL NODES IN THE FUNCTIONAL CATEGORY OF “BIOLOGICAL PROCESS”

Index  GO ID  No. of proteins  Function description
1      7165   143              Signal transduction
2      902    102              Cellular morphogenesis
3      7010   254              Cytoskeleton organization and biogenesis
4      7005   100              Mitochondrion organization and biogenesis
5      7033   125              Vacuole organization and biogenesis
6      6364   116              rRNA processing
7      7047   127              Cell wall organization and biogenesis
8      16568  126              Chromatin modification
9      6260   100              DNA replication
10     7067   120              Mitosis
11     7126   111              Meiosis
12     74     116              Regulation of cell cycle
13     6605   209              Protein targeting
14     6811   102              Ion transport
15     45045  203              Secretory pathway
16     16192  239              Vesicle-mediated transport
17     747    102              Conjugation with cellular fusion
18     6066   133              Alcohol metabolism
19     9309   103              Amine biosynthesis
20     6520   144              Amino acid metabolism
21     8610   109              Lipid biosynthesis
22     6412   438              Protein biosynthesis
23     5975   173              Carbohydrate metabolism
24     6511   129              Ubiquitin-dependent protein catabolism
25     15980  169              Energy derivation by oxidation of organic compounds
26     6281   120              DNA repair
27     6397   124              mRNA processing
28     8380   112              RNA splicing
29     6357   163              Regulation of transcription from Pol II promoter
30     16310  109              Phosphorylation
31     6464   356              Protein modification
32     9628   183              Response to abiotic stimulus
33     9607   100              Response to biotic stimulus
34     6950   224              Response to stress
Prediction accuracy measures
We use fivefold cross validations to compare the prediction accuracy for different methods. The true positive, true negative, false positive, and false negative are given in the following table.
                      Real positive         Real negative
Predicted positive    True positive, TP     False positive, FP
Predicted negative    False negative, FN    True negative, TN
The standard performance measures for the classification problem based on these four values are sensitivity (SN) and false-positive rate (FPR), defined as follows:
$$SN = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} \tag{12}$$
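The two measures in equation 12, and the area under the curve of SN against FPR, can be sketched as follows. This is a minimal illustration using the trapezoidal rule over all score thresholds; the function names and the toy scores are assumptions, and the code assumes that both classes are present in the labels.

```python
def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN) when proteins with score >= threshold are predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return tp, fp, fn, tn

def sensitivity_and_fpr(scores, labels, threshold):
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), fp / (fp + tn)

def roc_score(scores, labels):
    """Area under the (FPR, SN) curve by the trapezoidal rule."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        sn, fpr = sensitivity_and_fpr(scores, labels, threshold)
        points.append((fpr, sn))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical predicted probabilities and true labels for six proteins.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_truth = [1, 1, 0, 1, 0, 0]

sn, fpr = sensitivity_and_fpr(toy_scores, toy_truth, 0.65)
auc = roc_score(toy_scores, toy_truth)
print(sn, fpr, auc)
```

With a threshold of 0.65 the toy example gives SN = 2/3 and FPR = 1/3, and the ROC score is 8/9, which equals the fraction of positive-negative pairs that the scores rank correctly.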
TABLE 2. CHI-SQUARE VALUES OF FUNCTION CORRELATIONS

Function  1st  χ² value  2nd  χ² value  3rd  χ² value
1         1    192.9     2    50.1      32   27.6
2         2    89.0      1    37.2      17   23.4
3         3    447.2     10   92.4      12   25.3
4         4    488.8     13   86.6      31   4.2
5         5    204.9     15   20.1      16   15.3
6         6    1646.6    28   58.8      27   51.3
7         7    24.7      3    23.0      2    15.3
8         8    1070.8    31   174.6     29   56.6
9         9    737.6     26   91.4      11   9.5
10        10   306.6     3    144.2     12   67.3
11        11   38.6      12   23.2      9    13.6
12        12   297.1     24   73.7      10   61.2
13        13   933.8     4    102.1     31   16.4
14        14   366.9     22   6.4       13   1.6
15        15   710.2     16   644.9     31   19.5
16        16   811.9     15   572.0     31   22.0
17        17   87.2      2    56.9      32   45.1
18        18   33.4      23   12.0      21   4.8
19        20   33.7      19   17.9      24   4.9
20        20   29.9      19   21.3      24   8.6
21        21   54.0      18   5.9       11   2.1
22        22   710.4     27   17.5      15   7.2
23        23   81.9      25   30.8      18   7.7
24        24   656.2     12   82.5      10   62.8
25        25   62.6      23   43.6      15   4.8
26        26   434.1     9    118.3     11   25.5
27        27   1154.0    28   198.7     6    49.0
28        28   410.9     27   210.8     6    66.0
29        29   131.0     8    127.2     31   36.9
30        2    17.7      30   16.6      16   10.3
31        8    308.7     31   200.3     29   51.9
32        32   69.6      17   54.5      1    39.9
33        33   21.7      9    16.8      26   9.0
34        34   16.0      15   6.1       33   5.5
RESULTS
Datasets
We applied our method to infer the functions of unknown proteins in yeast using the functional annotations from the GO Consortium (Ashburner et al., 2002). GO is a set of structured vocabularies organized
in a rooted directed acyclic graph (DAG), describing attributes of gene products in three categories of “cellular component,” “molecular function,” and “biological process.” We downloaded the three ontology files
from the GO database and the SGD gene list with GO annotations from the SGD database. As in Deng et al. (2004), we use the 34 function categories that contain at least 50 genes and have no offspring node satisfying this condition. Table 1 lists all these 34 functions.
We obtained 2,566 physical interactions from the Munich Information Center for Protein Sequences
(MIPS, http://mips.gsf.de), and the domain information for proteins was obtained from Pfam (Bateman
et al., 2002; www.sanger.ac.uk/Software/Pfam/iPfam). The protein complex data sets included the data
obtained by high-throughput mass spectrometric protein complex identification (HMS-PCI) by Ho et al.
(2002). The protein localization data were obtained from Huh et al. (2003), and the gene expression data
were obtained from Spellman et al. (1998).
Function prediction
We compared results of the protein function prediction based on MIPS physical interaction data using
the MRF approach of Deng et al. (2002), the SVM approach of Lanckriet et al. (2004), the KLR approach
for one function, and the KLR approach for correlated functions. The relative performances of the four approaches using five-fold cross validation are given below.
Relative performance of the SVM and the KLR approaches
FIG. 2. The relationships between the false-positive rate (x-axis) and the sensitivity (y-axis) of the SVM and the KLR
approaches. For each function and each approach, the diffusion constant yielding the highest ROC score is used.
Lanckriet et al. (2004) convincingly showed that the prediction accuracy of the SVM approach is higher
than that of the MRF approach. The primary reason might be the use of a kernel to incorporate all
interaction neighbors in the network. A natural question is the relative performance of the SVM approach
versus the KLR approach. We used the Gist SVM software (http://svm.sdsc.edu) for training and testing in
the SVM approach. Because each function category has different characteristics and the prediction accuracy depends on the diffusion constant $\tau$, it is desirable to use different values of $\tau$ for different functions to conduct a fair comparison. As in Lanckriet et al. (2004), we chose the value of $\tau$ that gives the
highest receiver operating characteristic (ROC) score (the area under the curve of SN against FPR) (Hanley
et al., 1982) for each individual function. For example, for the function “cellular morphogenesis,” the
optimal $\tau$ for the KLR approach and optimal $\tau = 0.1$ for the KLR approach. We predict protein functions using the optimal values of $\tau$ for each function in each approach. However, we only tested four values of $\tau$: 0.1, 0.5, 1, and 3. Figure 2 shows the average results for the 34 functions. The ROC score of the
KLR approach is 0.830 and the ROC score of the SVM approach is 0.826. The p-value for the paired t-test of no difference is 0.53, suggesting that there is no significant difference in the performance of the
FIG. 3. The ROC score of the MRF, KLR, and SVM approaches. The ROC score of the KLR approach is higher
than that of the MRF in all but one of the 34 functions. The ROC score of the KLR approach is higher than that of the
SVM approach in 21 out of the 34 functions considered.
two approaches. Figure 3 shows the ROC scores of the two approaches as well as the MRF approach for
each function. Consistent with Lanckriet et al. (2004), the prediction accuracy of the SVM approach is
higher than that of the MRF approach in all 34 functions being considered. The KLR approach outperformed the MRF approach in all but one function category. The ROC score of the KLR approach is
higher than that of the SVM approach in 21 out of 34 function categories being considered (for detailed
results, see Supplemental Material).
Relative performance of the KLR and the MRF approaches
Figure 4 shows the relationship between SN and FPR using the two approaches combining all the functions and multiple neighbors. The ROC score of the KLR approach is 0.828 and the ROC score of the MRF
approach is 0.79. This shows that the improved prediction accuracy in the KLR approach is due to the contributions of the diffusion kernel. A paired t-test of the hypothesis of no mean difference between the ROC scores of the two approaches over the 34 functions gives a p-value of almost zero.
Given the simplicity of the KLR model and its good performance compared to the MRF model and the
SVM model, we extend the KLR model to include multiple functions. We select the five most correlated functions for each function using the chi-square test. In cross-validation, the five correlated functions are based
on the functions of known proteins. These five correlated functions capture most of the function correlations without sacrificing the simplicity of the model. The ROC score of the KLR with correlated functions
is 0.830, only slightly higher than the ROC score for one function, 0.828. Even though the KLR approach
with correlated functions did not significantly increase the overall prediction accuracy, it helps for specific
proteins. Figure 1 shows a specific example. Using the KLR model with correlated functions, the probability of protein STE12 having the function “conjugation with cellular fusion” is 0.17. On the other hand, the
corresponding probability is 0.02 when the KLR approach with one function is used. Similarly, using the
KLR model with correlated functions, the probability of protein TUB4 having the function “mitosis” is 0.39.
The corresponding probability is 0.04 when the KLR approach with one function is used.
FIG. 4. The relationships between the false-positive rate (x-axis) and the sensitivity (y-axis) for the
MRF, the KLR, and the KLR with correlated functions approaches. For each function and each approach, the diffusion constant yielding the highest ROC score is used. The KLR with correlated functions gives the best prediction result.
Incorporating multiple data sources
In order to incorporate the domain information of each protein, we conducted experiments with the kernel approach described in the Methods section. To construct the kernel for the domain data, we used the
inner product of domain vectors of each pair of proteins. Figure 5 shows the ROC curve for the KLR approach with physical interaction network only and the ROC curve for the KLR approach with domain information added. The ROC score by integrating the physical interaction network and the domain information is 0.846, slightly higher than that with physical interactions only.
We conducted experiments of adding other available information, specifically, the genetic interactions,
the protein complexes, the gene expressions, and the protein localizations by the kernel approach. For the
genetic interactions, the protein complexes, and the gene expressions, we converted each data source into
an interaction network and then computed the diffusion kernel matrices. For the protein complexes, we assume that there is an interaction between two proteins if they are found in the same complex. For the gene
expressions, we assume that there is an interaction between two proteins if the gene expression correlation
coefficient between them is equal to or larger than 0.8. For each information source, we chose the parameter value in the diffusion kernel yielding the highest ROC score based on the constructed network. For
the localization data, we constructed a kernel matrix in the same manner as the domain data, i.e., the inner
product of 22 cellular location vectors of each pair of proteins. We integrated each data source with the
physical interaction network. However, the ROC score by integrating these data sources with the physical
interaction network did not increase (for details, see Supplementary Material).
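The conversions described above can be sketched in a few helper functions: an inner-product kernel for domain or localization vectors, an edge list from co-membership in protein complexes, and an edge list from gene expression profiles whose correlation coefficient is at least 0.8. The function names and all toy inputs are hypothetical, and the Pearson correlation is assumed as the correlation measure; the resulting edge lists would still have to be turned into diffusion kernels as in equation 4.

```python
import math

def inner_product_kernel(vectors):
    """Kernel matrix of inner products between feature vectors (e.g. domain indicators)."""
    n = len(vectors)
    return [[sum(a * b for a, b in zip(vectors[i], vectors[j])) for j in range(n)]
            for i in range(n)]

def complex_edges(complexes):
    """Link every pair of proteins found in the same complex."""
    edges = set()
    for members in complexes:
        ordered = sorted(set(members))
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                edges.add((ordered[a], ordered[b]))
    return sorted(edges)

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # constant profile: treat as uncorrelated
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(profiles, cutoff=0.8):
    """Link two proteins when their expression correlation is >= cutoff."""
    n = len(profiles)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if pearson(profiles[i], profiles[j]) >= cutoff]

# Hypothetical inputs for three or four proteins.
domain_kernel = inner_product_kernel([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
complex_links = complex_edges([[0, 1, 2], [2, 3]])
expression_links = coexpression_edges([[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1]])
print(domain_kernel, complex_links, expression_links)
```

For binary domain vectors the inner product simply counts the domains shared by two proteins, so proteins with no domain in common receive a kernel value of zero.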
FIG. 5. The relationships between the false-positive rate (x-axis) and the sensitivity (y-axis) of the approaches by the
KLR with correlated functions (physical interaction) and the KLR with correlated functions incorporating the domain
kernel. For each function, the diffusion constant yielding the highest ROC score is used.
One of the advantages of the KLR approach is its ability to explore the contribution of neighbors to the
functions of proteins of interest. We conducted the experiment with all five data sources: the physical interactions, the genetic interactions, the protein complexes, the gene expressions, and the protein localizations. We measure the contribution of neighbors in each network with the t-value based on the t-test statistic for the hypothesis that the parameter of interest is zero. Figure 6 shows the t-values for each data
source and each function category. The average t-value over the 34 functions is highest for the physical interactions with respect to the first major correlated function, followed by the protein domains, the genetic
interactions, the protein complexes, the protein localizations and the gene expressions. However, the relative strength of contributions from different data sources measured by the t-values is not the same for the
functions of interest. Out of 34 functions, the physical interactions are the most important for 20 functions,
the protein domains for 10 functions, the genetic interactions for three functions, and the protein locations
for only one function.
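As an illustration of how such t-values can be obtained, the following minimal sketch fits an ordinary logistic regression by Newton's method and reports, for each parameter, the estimate divided by its standard error (a Wald-type statistic for the hypothesis that the parameter is zero). It assumes that this is the kind of statistic meant above; the function names (`invert`, `information`, `logistic_t_values`) and the toy data are hypothetical and are not taken from the study.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def information(rows, beta):
    """Return fitted probabilities and the Fisher information X^T W X at beta."""
    p = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in rows]
    k = len(beta)
    info = [[sum(r[a] * r[b] * pi * (1.0 - pi) for r, pi in zip(rows, p))
             for b in range(k)] for a in range(k)]
    return p, info

def logistic_t_values(features, labels, iterations=25):
    """Fit by Newton's method; return (coefficients, estimate / standard error)."""
    rows = [[1.0] + list(x) for x in features]      # prepend an intercept column
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        p, info = information(rows, beta)
        grad = [sum(r[j] * (y - pi) for r, y, pi in zip(rows, labels, p))
                for j in range(k)]
        cov = invert(info)
        beta = [beta[j] + sum(cov[j][m] * grad[m] for m in range(k)) for j in range(k)]
    _, info = information(rows, beta)               # covariance at the final estimate
    cov = invert(info)
    return beta, [beta[j] / math.sqrt(cov[j][j]) for j in range(k)]

# Toy data (hypothetical): x1 carries signal, x2 is balanced noise.
x1 = list(range(12))
x2 = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]
y = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
beta, t = logistic_t_values(list(zip(x1, x2)), y)
```

In this toy example the statistic for `x1` is clearly larger in magnitude than the one for `x2`, which mirrors how the t-values are used above to rank the contributions of different data sources.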
In supervised learning, when the number of features is large, feature selection is important and can improve prediction accuracy by removing irrelevant features. In our study, we first used the chi-square test to select the most correlated functions, which reduces the number of parameters in the model. We then combined six biological data sets for protein function prediction. Considering the five most correlated functions, we have 6 × (5 + 1) = 36 parameters. For each function, we obtained the features giving the best prediction results by a forward stepwise selection strategy. Depending on the function, about 3–19 features were selected. With these features, we predicted functions of unknown proteins.
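The forward stepwise strategy can be sketched as follows. This is only an illustrative outline: the scoring function `toy_score` is a hypothetical stand-in for whatever criterion is used to compare feature sets (for instance, a ROC score on held-out data), the source and term labels are placeholders, and the stopping rule (stop when no single addition improves the score) is an assumption.

```python
def forward_stepwise(candidates, score):
    """Greedy forward selection: add the single best feature until nothing improves."""
    selected, remaining = [], list(candidates)
    best = score(selected)
    while remaining:
        top_score, top_feature = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0])
        if top_score <= best:          # no candidate improves the score: stop
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Six data sources, each with five correlated-function terms plus one
# "none of them" term, giving 6 * (5 + 1) = 36 candidate parameters.
SOURCES = ["physical", "domain", "genetic", "complex", "expression", "localization"]
TERMS = ["f1", "f2", "f3", "f4", "f5", "none"]
CANDIDATES = [(s, t) for s in SOURCES for t in TERMS]

# Hypothetical gains, used only to make the example deterministic.
TOY_GAIN = {("physical", "f1"): 0.20, ("domain", "f1"): 0.10, ("genetic", "f1"): 0.005}

def toy_score(subset):
    """Toy criterion: summed gains minus a small penalty per selected feature."""
    return 0.5 + sum(TOY_GAIN.get(f, 0.0) for f in subset) - 0.01 * len(subset)

chosen, value = forward_stepwise(CANDIDATES, toy_score)
```

With these toy gains the procedure keeps the two informative features and stops before adding a third, since the penalty then outweighs the remaining gain.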
FIG. 6. The t-values for each data source and each function category. “Phy1” represents the t-value of the parameter for the weighted distance to neighbors having the function of interest in the physical interaction network, and “Phy-” represents that for the weighted distance to neighbors not having the five correlated functions of interest in the physical interaction network. “Phy2,” “Phy3,” “Phy4,” and “Phy5” represent the cases with the second, third, fourth, and fifth correlated functions of interest. “Domain,” “Genetic,” “Complex,” “Exp,” and “Loc” correspond to protein domains, genetic interactions, protein complexes, gene expressions, and protein localizations. The “-” in each data set means neighbors not having the function of interest.
We applied the generalized KLR method to predict functions of unknown proteins with the selected features from all six data sources. The current version of the GO annotations (December 2005) contains updated annotations that were unknown at the time of our computational experiment (May 2005). To compare the newly added annotations with our computational predictions, we define the most relevant informative node for a new annotation as follows. If the newly added annotation is an offspring node of some informative nodes (parents), the most relevant informative node is defined as the parent informative node with the largest predicted probability. Similarly, if the newly added annotation is a parent node of some informative nodes (offspring), the most relevant informative node is defined as the offspring informative node with the largest predicted probability. Table 3 gives the 11 newly annotated proteins (that are also in the protein interaction network) with a predicted probability of at least 0.2, their functions, the most relevant informative node, and the predicted probability that the protein belongs to the most relevant informative node.
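The definition of the most relevant informative node can be sketched in a few lines. The sketch makes several assumptions that the text does not settle: "parent" and "offspring" are read transitively (any ancestor or descendant in the GO hierarchy), a new annotation that is itself an informative node is returned unchanged, and the term names and probabilities in the example are hypothetical.

```python
def ancestors(term, parents):
    """Return all ancestors of a term, given a map from term to its direct parents."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def most_relevant_informative_node(new_term, informative, probability, parents):
    """Pick the informative node related to new_term with the largest probability."""
    informative = set(informative)
    if new_term in informative:        # assumption: an informative node maps to itself
        return new_term
    above = ancestors(new_term, parents) & informative
    if above:                          # new annotation is an offspring of informative nodes
        return max(above, key=lambda t: probability[t])
    below = {t for t in informative if new_term in ancestors(t, parents)}
    if below:                          # new annotation is a parent of informative nodes
        return max(below, key=lambda t: probability[t])
    return None                        # unrelated to every informative node

# Hypothetical toy hierarchy: d -> c1 -> {b1, b2} -> a
toy_parents = {"d": ["c1"], "c1": ["b1", "b2"], "b1": ["a"], "b2": ["a"]}
toy_informative = {"b1", "b2"}
toy_probability = {"b1": 0.3, "b2": 0.7}
```

For example, `most_relevant_informative_node("d", toy_informative, toy_probability, toy_parents)` selects `"b2"`, the informative ancestor with the larger predicted probability.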
Robustness of the KLR method
The KLR method assumes that all the data used for inferring protein function are correct. However, experimentally derived data contain a certain fraction of false positives and false negatives. We tested the robustness of the KLR approach by adding noise to the protein annotations and the protein–protein interactions. For ease of presentation, we used the KLR model for one function (eq. 7) with t = 1. The ROC score of this model is 0.82 using the protein interaction data in MIPS and the GO annotations. To add noise to the protein annotations, we swapped the annotations of 10% and 20% of the proteins having the function of interest with those of other proteins not having the function of interest. The ROC scores with noisy GO annotations decreased to 0.80 and 0.78 for 10% and 20% added noise, respectively. To add noise to the protein–protein interactions, we deleted 10% and 20% of the protein interactions and added the same number of interactions that were not present in MIPS. The ROC scores with noisy MIPS protein–protein interactions also decreased to 0.80 and 0.78, respectively.
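The two perturbations can be sketched as follows. This is a minimal illustration under stated assumptions: edges and proteins are drawn uniformly at random with a fixed seed, the function names `perturb_edges` and `swap_labels` are hypothetical, and non-edges are enumerated exhaustively, which is practical only for small networks.

```python
import random
from itertools import combinations

def perturb_edges(nodes, edges, fraction, rng):
    """Delete a fraction of the edges and add the same number of absent edges."""
    edges = {frozenset(e) for e in edges}
    k = round(fraction * len(edges))
    removed = set(rng.sample(sorted(edges, key=sorted), k))
    non_edges = [frozenset(pair) for pair in combinations(sorted(nodes), 2)
                 if frozenset(pair) not in edges]
    added = set(rng.sample(non_edges, k))
    return (edges - removed) | added

def swap_labels(labels, fraction, rng):
    """Swap the labels of a fraction of the positives with equally many negatives."""
    positives = sorted(p for p, v in labels.items() if v == 1)
    negatives = sorted(p for p, v in labels.items() if v == 0)
    k = round(fraction * len(positives))
    noisy = dict(labels)
    for p in rng.sample(positives, k):
        noisy[p] = 0
    for p in rng.sample(negatives, k):
        noisy[p] = 1
    return noisy

# Toy network (hypothetical): 10 nodes, 20 edges, 10% edge noise.
toy_nodes = list(range(10))
toy_edges = ([(i, (i + 1) % 10) for i in range(10)]
             + [(i, (i + 2) % 10) for i in range(10)])
noisy_edges = perturb_edges(toy_nodes, toy_edges, 0.1, random.Random(0))

# Toy annotation (hypothetical): 10 of 20 proteins have the function, 20% noise.
toy_labels = {i: int(i < 10) for i in range(20)}
noisy_labels = swap_labels(toy_labels, 0.2, random.Random(0))
```

Both perturbations preserve the totals (the number of interactions and the number of annotated proteins), so any change in the ROC score reflects misplaced information and not a change in the amount of data.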
TABLE 3. ELEVEN NEWLY ADDED ANNOTATED PROTEINS
Protein    New annotation                             MRIN*   Probability
FRQ1       Regulation of signal transduction          1       0.404
YLR254C    Nuclear migration, microtubule-mediated    3       0.906
YMR009W    Methionine salvage                         20      1.000
YPR118W    Methionine salvage                         20      1.000
PIG2       Regulation of glycogen biosynthesis        23      0.268
HRT3       Ubiquitin-dependent protein catabolism     24      0.477
PSY3       Error-free DNA repair                      26      0.334
YJU2       Nuclear mRNA splicing, via spliceosome     27      0.916
YLR424W    Nuclear mRNA splicing, via spliceosome     27      0.299
YLR424W    Nuclear mRNA splicing, via spliceosome     27      0.299
YDR140W    Peptidyl-glutamine methylation             31      0.320
Eleven newly added annotated proteins (probability of at least 0.2) between May 2003 and December 2005, their function annotation, the most relevant informative node (MRIN*), and the predicted probability for the protein having the function represented by the MRIN. The MRIN is given as a number corresponding to the functional categories in Table 1.
DISCUSSION
Several methods have been developed to predict protein functions based on the idea of guilt by association. Prominent methods include the MRF approach of Deng et al. (2002) and the SVM approach of Lanckriet et al. (2004). The MRF approach is simple and can explore the contributions of neighboring proteins to the functions of proteins. However, its prediction accuracy is not as high as that of the SVM approach. The objective of this study was to extend the MRF model to achieve higher prediction accuracy while keeping the simplicity and easy interpretability of the model. We developed a novel KLR model incorporating multi-level interaction neighbors as well as correlated functions by adapting a kernel approach to the original MRF model. This model can be used to explore the contributions of neighboring proteins to the functions of proteins.
The KLR approach captures the global information of the network. Adapting it to the MRF model yields a model that is simpler than the SVM method while having prediction accuracy similar to that of the SVM approach. In addition, the KLR model can be used to study the contribution of neighboring proteins to the functions of proteins of interest. This model can be easily extended to incorporate multiple data sources. One data source usually captures only limited information; therefore, combining multiple data sources helps to improve protein function prediction.
We showed that our novel method significantly increased the prediction accuracy over the original MRF
approach, and incorporating physical interaction data with domain information improves the prediction accuracy as well. However, other data sources such as the genetic interaction data, the protein complex data,
the gene expression data, and the protein localization data do not help to improve the prediction accuracy.
This is partly due to the small number of common proteins contained in both the physical interaction data
set and the other data sets. The other reason is that the information from the other data sets is already covered by the physical interactions and the protein domains.
In the KLR approach, we use only known proteins to predict unknown ones, as in supervised learning. An approach that uses unknown data in addition to the known data is referred to as semi-supervised learning. Given that there is still a large fraction of unknown proteins even in the most studied species, yeast, it may seem desirable to use the unknown proteins to increase the prediction accuracy. We tried a Gibbs sampling strategy using equation 6 to incorporate unknown proteins in the prediction. However, the prediction accuracy was similar to that using KLR with only known proteins (data not shown). One possible explanation is that the fraction of known proteins (1,495 out of 1,881 proteins having at least one interaction partner) is already very high, so imputing the unknown proteins by Gibbs sampling does not improve accuracy. For organisms with a large fraction of unknown proteins, imputation based on Gibbs sampling or a Markov chain Monte Carlo strategy may help to improve prediction accuracy. This is a topic for further research.
There are several limitations in this study. First, we assumed that the protein interaction data and the annotations of the known proteins are correct. As is well known, both the false-positive rate and the false-negative rate for protein interactions are very high. Several methods have been developed to estimate the fraction of true interactions in putative protein interaction data sets, yielding very different estimates for different technologies, of 68–27% (Lee et al., 2005). Although the MIPS protein interactions used in this study are supposed to be highly accurate, they certainly contain some false positives. More importantly, many of the true protein interactions are not in the MIPS protein interaction database. Second, many protein interactions are condition-dependent, and it is hard to detect transient interactions. The interactions deposited in the protein interaction databases are generally static and thus cannot represent the dynamic features of protein interactions. Third, the annotations of the known proteins are incomplete and also contain many errors. The effect of all these factors needs to be further studied.
In this study, we assumed that protein pairs are either interacting or not. Because our study is limited to the highly accurate interactions, proteins with interactions represent only a relatively small part of all the proteins. For example, only 1,881 out of over 6,000 proteins in yeast were studied in our approach. Several methods are available to predict the probability of interaction for protein pairs based on multiple data sources, such as yeast two-hybrid systems, gene expressions, and gene localizations (Jansen et al., 2003; Ben-Hur and Noble, 2005). The KLR model studied in this paper can be extended to such situations by replacing H(i,j) in equation 4 with pij, the probability that protein i interacts with protein j. H(i,i) can be replaced by the expected degree for protein i. The protein function prediction accuracy of this extended method will certainly depend on the methods used to estimate pij. It has the advantage of being able to cover a significantly larger fraction of all the proteins. However, the prediction accuracy using this approach needs to be further studied.
In both the MRF model and the KLR model, it is assumed that the contributions of neighboring proteins to the function of an unknown protein do not depend on the protein of interest. However, proteins having similar functions may form clusters, and the sub-clusters of proteins having a given function of interest may have different characteristics. In that case, protein function prediction based on clustering may be more appropriate.
In this study, we applied the KLR approach to protein function prediction for 34 functional categories based on GO, achieving results comparable to those of the SVM. The KLR approach is applicable to any two-way classification problem as long as a kernel matrix can be defined. Under what conditions the KLR approach has prediction accuracy similar to that of the SVM is a topic for further research.
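The extension to interaction probabilities mentioned above can be illustrated with a small sketch. It assumes that equation 4 follows the usual graph-diffusion convention, in which the off-diagonal entries of H are the edge weights, the diagonal holds the negative degree, and the kernel is K = exp(τH); under that assumption, pij replaces H(i,j) and the negative expected degree replaces H(i,i). The function names, the series-based matrix exponential, and the probability values are hypothetical and chosen only for illustration.

```python
def weighted_generator(p):
    """Build H with H[i][j] = p_ij (i != j) and H[i][i] = -(expected degree of i)."""
    n = len(p)
    h = [[p[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        h[i][i] = -sum(h[i])           # diagonal is still 0 here, so this sums p_ij
    return h

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, terms=20):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(v) for v in row) for row in a)
    squarings = 0
    while norm > 0.5:                  # scale until the series converges quickly
        norm /= 2.0
        squarings += 1
    scaled = [[v / 2 ** squarings for v in row] for row in a]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def diffusion_kernel(p, tau):
    """Diffusion kernel K = exp(tau * H) for a matrix of interaction probabilities."""
    h = weighted_generator(p)
    return mat_exp([[tau * v for v in row] for row in h])

# Hypothetical interaction probabilities for three proteins (a path 0 - 1 - 2).
toy_p = [[0.0, 0.9, 0.0],
         [0.9, 0.0, 0.5],
         [0.0, 0.5, 0.0]]
K = diffusion_kernel(toy_p, tau=1.0)
```

In this toy example proteins 0 and 2 have no direct interaction probability, yet K[0][2] is positive because the kernel propagates similarity through protein 1; this is the sense in which a diffusion kernel incorporates all neighbors in the network.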
ACKNOWLEDGMENTS
This research is supported by the NIH/NSF joint mathematical biology initiative DMS-0241102. M.D. is supported by grants from the National Key Basic Research Project of China (no. 2003CB715903) and the National
Natural Science Foundation of China (no. 90208022, no. 30570425).
REFERENCES
ALTSCHUL, S.F., MADDEN, T.L., SCHAFFER, A.A., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
ASHBURNER, M., BALL, C.A., BLAKE, J.A., et al. (2000). Gene Ontology: tool for the unification of biology. Nat
Genet 25, 25–29.
BATEMAN, A., BIRNEY, E., CERRUTI, L., et al. (2002). The pfam protein families database. Nucleic Acids Res 30,
276–280.
BEN-HUR, A., and NOBLE, W.S. (2005). Kernel methods for predicting protein–protein interactions. Bioinformatics
21, i38–i46.
BROWN, M., GRUNDY, W.N., LIN, D., et al. (2000). Knowledge-based analysis of microarray gene expression data
by using support vector machines. Proc Natl Acad Sci USA 97, 262–267.
CLARE, A., and KING, R.D. (2002). Machine learning of functional class from phenotype data. Bioinformatics 18, 160–166.
DENG, M., ZHANG, K., MEHTA, S., et al. (2002). Prediction of protein function using protein–protein interaction
data. Presented at IEEE Comput Soc Bioinform.
DENG, M., CHEN, T., and SUN, F. (2003). An integrative analysis of protein function prediction. Presented at Recomb.
DENG, M., SUN, F., and CHEN, T. (2004). Mapping gene ontology to proteins based on protein–protein interaction
data. Bioinformatics 20, 895–902.
DRAWID, A., and GERSTEIN, M. (2000). A bayesian system integrating expression data with sequence patterns for
localizing proteins: comprehensive application to the yeast genome. J Mol Biol 301, 1059–1075.
EISEN, M.B., SPELLMAN, P.T., BROWN, P.O., et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–14868.
FELLENBERG, M., ALBERMANN, K., ZOLLNER, A., et al. (2000). Integrative analysis of protein interaction data.
Presented at the International Conference on Intelligent Systems for Molecular Biology.
GUPTA, R., and BRUNAK, S. (2002). Prediction of glycosylation across the human proteome and the correlation to
protein function. Presented at the Pacific Symposium on Biocomputing.
HANLEY, J.A., and MCNEIL, B.J. (1982). The meaning and use of the area under a receiver operating characteristic
(ROC) curve. Radiology 143, 29–36.
HEGYI, H., and GERSTEIN, M. (1999). The relationship between protein structure and function: a comprehensive survey with application to yeast genome. J Mol Biol 288, 147–164.
HISHIGAKI, H., NAKAI, K., ONO, T., et al. (2001). Assessment of prediction accuracy of protein function from protein–protein interaction data. Yeast 18, 523–531.
HO, Y., GRUHLER, A., HEILBUT, A., et al. (2002). Systematic identification of protein complexes in Saccharomyces
cerevisiae by mass spectrometry. Nature 415, 180–183.
HUH, W.K., FALVO, J.V., GERKE, L.C., et al. (2003). Global analysis of protein localization in budding yeast. Nature 425, 686–691.
JANSEN, R., YU, H., GREENBAUM, D., et al. (2003). A Bayesian networks approach for predicting protein-protein
interactions from genomic data. Science 302, 449–453.
JENSEN, L.J., GUPTA, R., BLOM, N., et al. (2002). Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 319, 1257–1265.
KELL, D.B., and KING, R.D. (2000). On the optimization of classes for the assignment of unidentified reading frames
in functional genomics programmes: the need for machine learning. Trends Biotechnol 18, 93–98.
KING, R.D., KARWATH, A., CLARE, A., et al. (2001). The utility of different representations of protein sequence
for predicting functional class. Bioinformatics 17, 445–454.
LANCKRIET, G., DENG, M., CRISTIANINI, N., et al. (2004). Kernel-based data fusion and its application to protein
function prediction in yeast. Presented at the Pacific Symposium on Biocomputing.
LEE, H.J., DENG, M., SUN, F.S., et al. (2005). Assessment of the reliability of protein–protein interactions using protein localization and gene expression data. Presented at Bioinfo 2005, Busan.
MERING, C.V., KRAUSE, R., SNEL, M., et al. (2002). Comparative assessment of large scale data sets of protein–protein interactions. Nature 417, 399–403.
PAVLIDIS, P., and WESTON, J. (2001). Gene functional classification from heterogeneous data. Presented at Recomb.
PEARSON, W.R., and LIPMAN, D.J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci
USA 85, 2444–2448.
SCHWIKOWSKI, B., UETZ, P., and FIELDS, S. (2002). A network of protein–protein interactions in yeast. Nat Biotech
18, 1257–1261.
SPELLMAN, P.T., SHERLOCK, G., ZHANG, M.Q., et al. (1998). Comprehensive identification of cell cycle-regulated
genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273–3297.
STAWIKI, E.W., MANDEL-GUTFREUND, Y., LOWENTHAL, A.C., et al. (2002). Progress in predicting protein
function from structure: unique features of O-glycosidases. Presented at the Pacific Symposium on Biocomputing.
VAZQUEZ, A., FLAMMINI, A., MARITAN, A., et al. (2003). Global protein function prediction from protein–protein interaction networks. Nat Biotech 21, 697–700.
Address reprint requests to:
Dr. Fengzhu Sun
Molecular and Computational Biology Program
University of Southern California
1050 Childs Way
Los Angeles, CA 90089-2910
E-mail: [email protected]