PSEA: Kinase-specific prediction and analysis of human phosphorylation substrates
Sheng-Bao Suo1, Jian-Ding Qiu1,2,*, Shao-Ping Shi1,3, Xiang Chen1, Ru-Ping Liang1
1 Department of Chemistry, Nanchang University, Nanchang, 330031, China.
2 Department of Chemical Engineering, Pingxiang College, Pingxiang, 337055, China.
3 Department of Mathematics, Nanchang University, Nanchang, 330031, China.
* To whom correspondence should be addressed. Tel: + 86 791 83969518; Email: [email protected]
Data preparation
Positive set
Experimentally validated phosphorylation sites were extracted from Phospho.ELM (release 9.0) [1],
PhosphoSitePlus [2] and UniProtKB/Swiss-Prot. Because this study focuses on disease-related
phosphorylation, only human phosphorylation sites were retained. After removing proteins that were
redundant among the three databases, we collected 128122 phosphorylation sites within 32148
proteins, comprising 69315 serine (S), 30398 threonine (T) and 28409 tyrosine (Y) sites (all the
data can be obtained from the online web site). To construct kinase-specific phosphorylation site
predictors, we retained only those entries in which the exact position of the residue has been
experimentally verified as phosphorylated by a given kinase. This yielded 8033 kinase-specific
phosphorylation entries.

Although there is disagreement over the optimal peptide length for representing phosphorylation
sites in a machine-learning model [3], several studies have proposed that lengths between 9 and 15
residues are suitable [4-6]. In addition, the sequence logos of all single kinases selected for
building predictors show that a window of 15 residues contains almost all of the conserved residues
around the central site (Figures S5 and S6). Hence, a window size of 15 residues was chosen in this
study: for each phosphorylation site, we extracted the 15-mer sequence [5, 7] comprising the central
residue and the amino acids at positions -7 to +7 around it. Where there were not enough residues
on either side, the missing positions were represented by “_” characters.

Prediction was performed in a kinase-specific manner. Predictors cannot be built for every single
kinase, because some kinases have too few known sites for statistical analysis. To achieve
comprehensive coverage, we therefore pooled the data of single kinases into kinase families and
kinase groups according to the human protein kinase classification [8], and built kinase family and
kinase group predictors as well. The known phosphorylation sites of each single kinase, kinase
family and kinase group were extracted separately, and only categories at these three hierarchical
levels with at least 50 experimentally verified phosphorylation sites were used in this study.
Tables S1-S3 summarize the statistics of all qualifying kinase-specific phosphorylation data.
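The 15-mer window extraction with “_” padding described above can be illustrated with a minimal Python sketch. The function name `extract_window`, its parameters and the toy sequence are assumptions made for illustration only; they are not taken from the PSEA implementation.

```python
def extract_window(sequence: str, site: int, flank: int = 7, pad: str = "_") -> str:
    """Return the (2 * flank + 1)-mer centred on the 1-based position `site`.

    Positions that fall outside the protein are filled with `pad`,
    mirroring the "_" convention described in the text.
    """
    if not 1 <= site <= len(sequence):
        raise ValueError("site must be a 1-based position inside the sequence")
    centre = site - 1  # convert to a 0-based index
    window = []
    for i in range(centre - flank, centre + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)


# Toy example (hypothetical sequence): the serine at position 2 lies near the
# N-terminus, so six of the seven upstream positions are padded.
toy_protein = "MSTAYKLGRRASLDE"
print(extract_window(toy_protein, 2))
```

With the default `flank=7` the result always has 15 characters and the candidate residue sits at the central position, whatever its distance from either terminus.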
Background set and negative set
The background set contains all serine, threonine and tyrosine (S/T/Y) residues in the same
proteins that are not annotated with any phosphorylation information. Protein phosphorylation is
well known to be a dynamic event that depends heavily on conditions [9]. Many sites that are not
reported as phosphorylated in one experiment may be phosphorylated in other tissues or under other
conditions. Other proteins may currently lack reported phosphorylation simply because they are not
expressed at the same time or in the same tissue as the protein kinase. It is therefore hard to
collect a set of protein sequences that can safely be regarded as non-phosphorylated. A given S/T/Y
residue had to meet three criteria to be selected as a non-phosphorylation site for the background
set [10]. First, it could not have been reported as a positive site. Second, as suggested by
Neuberger et al. [11], it had to lie within a protein that contains known positive sites. The
rationale is that proteins with several known phosphorylation sites have been well studied with
respect to phosphorylation, so sites in these proteins that are not known to be phosphorylated are
more likely to be true negatives. Third, as suggested by Blom et al. [12], it had to be predicted
as solvent-inaccessible, because residues buried in the core of a protein would not be accessible
to any kinase. Solvent accessibility was predicted with the NetSurfP 1.0 program [13]; an S/T/Y
residue predicted as buried was deemed a potential non-phosphorylation site. To estimate the
performance of the different kinase-specific predictors, we randomly selected from the background
set a negative set of the same size as the positive set.
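The three selection criteria and the balanced random sampling can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Residue` record, the function names and the fixed random seed are hypothetical, and the `buried` flag is assumed to have been derived beforehand from a solvent-accessibility predictor (the sketch does not call NetSurfP).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    protein: str   # protein accession
    position: int  # 1-based position in the protein
    aa: str        # one-letter amino acid code
    buried: bool   # precomputed solvent-accessibility call (assumed input)


def background_set(residues, positive_sites):
    """Apply the three criteria from the text to candidate residues.

    `positive_sites` is a set of (protein, position) pairs with known
    phosphorylation.
    """
    proteins_with_positives = {protein for protein, _ in positive_sites}
    return [
        r for r in residues
        if r.aa in "STY"
        and (r.protein, r.position) not in positive_sites  # criterion 1
        and r.protein in proteins_with_positives           # criterion 2
        and r.buried                                       # criterion 3
    ]


def sample_negatives(background, n_positive, seed=0):
    """Randomly draw as many negatives as there are positives (if available)."""
    rng = random.Random(seed)
    return rng.sample(background, min(n_positive, len(background)))
```

Fixing the seed makes the draw reproducible; the text does not say whether the original sampling was seeded or repeated.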
Independent test set
An independent test set is needed to evaluate the performance of the method and to compare it with
existing methods. All known human phosphorylation sites with annotated kinases in the Phospho.ELM,
PhosphoSitePlus and UniProtKB/Swiss-Prot databases were already used in the positive set. Because
these three databases cover almost all experimentally verified phosphorylation sites, it is hard to
find another set of human data to serve as an independent test set. Phosphorylation mechanisms are
known to be conserved across eukaryotic species [14-16]. We therefore collected the non-human
phosphorylation sites of the CDK, CK1, CK2, MAPK, PKA, PKC and Src kinase families from these
databases as the positive independent test set. Peptides in the independent test set that are
highly homologous to samples in the positive set could bias the performance evaluation. To avoid
this, all positive independent-set samples sharing more than 40% sequence identity with any sample
in the positive set were discarded using CD-HIT [17]. The final numbers for these 7 kinase families
are listed in Table S4, and the data can be obtained from our online web site.
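The idea behind the redundancy filter can be illustrated with a simplified Python sketch. It is not a reimplementation of CD-HIT, which uses its own word-based filtering and clustering; here a plain position-by-position identity between equal-length windows stands in for the similarity measure, and the function names and toy peptides are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length peptides agree."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_independent(test_peptides, training_peptides, max_identity=0.40):
    """Keep test peptides whose identity to every training peptide is <= max_identity."""
    return [
        t for t in test_peptides
        if all(identity(t, p) <= max_identity for p in training_peptides)
    ]


# Toy 15-mers (hypothetical): the first test peptide differs from the training
# peptide at a single position and is discarded; the second shares only the
# central serine and is kept.
training = ["AAAAAAASAAAAAAA"]
candidates = ["AAAAAAASAAAAAAG", "KRKRKRKSDEDEDED"]
print(filter_independent(candidates, training))
```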
Table S1. The number of phosphorylation sites for the considered single kinases. The numbers of
phosphorylation substrates larger than 50 are marked in red.
               Phosphorylated residues
Single kinase  Serine    Threonine   Tyrosine
CDK2           318       182         0
PKACa          384       60          1
PKCA           329       74          0
CK2-A1         328       60          1
CDK1           223       139         0
Src            0         0           328
ERK2           213       89          0
ERK1           174       58          2
ATM            159       18          0
Akt1           122       39          0
GSK3B          106       38          2
PLK1           81        36          0
p38-alpha      75        41          0
CK1-A          96        11          1
Abl            0         0           106
CaMK2-alpha    76        20          0
Fyn            1         0           95
JNK1           60        36          0
CDK5           64        25          0
PKCD           66        23          0
DNA-PK         66        22          0
ATR            60        14          0
Lck            0         0           74
mTOR           54        17          0
AurB           53        11          0
Lyn            0         0           52
Table S2. The number of phosphorylation sites for the considered kinase families. The numbers of
phosphorylation substrates larger than 50 are marked in red.
               Phosphorylated residues
Kinase family  Serine    Threonine   Tyrosine
CDK            712       386         1
MAPK           606       278         3
PKC            603       151         0
Src            1         0           602
PKA            387       60          1
CK2            338       69          1
PIKK           290       54          0
CAMKL          156       59          2
GSK            141       48          4
CK1            159       23          1
AKT            134       42          0
PLK            105       43          1
STE20          85        64          0
Abl            0         0           122
AUR            96        20          1
CAMK2          84        23          0
IKK            96        8           0
RSK            94        9           0
MAPKAPK        61        1           0
Syk            0         0           61
PKD            55        5           0
JakA           1         0           52
Table S3. The number of phosphorylation sites for the considered kinase groups. The numbers of
phosphorylation substrates larger than 50 are marked in red.
               Phosphorylated residues
Kinase group   Serine    Threonine   Tyrosine
CMGC           1525      746         9
AGC            1325      329         1
TK             9         0           1173
Other          737       194         12
CAMK           465       125         2
Atypical       354       77          7
TKL            119       79          7
CK1            170       32          2
STE            105       71          0
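The pooling of single-kinase annotations into families and groups, together with the 50-site cut-off that underlies Tables S1-S3, could be sketched in Python as below. The sketch rests on several assumptions: the `HIERARCHY` excerpt covers only a few kinases, the function names are hypothetical, the cut-off is applied separately to each residue type, and a site annotated to several kinases of one family is counted once for that family.

```python
from collections import defaultdict

# Small excerpt of a kinase -> (family, group) mapping; a real run would need
# the full human kinase classification (assumed to be available).
HIERARCHY = {
    "CDK1": ("CDK", "CMGC"),
    "CDK2": ("CDK", "CMGC"),
    "ERK2": ("MAPK", "CMGC"),
    "Src": ("Src", "TK"),
}


def pool_sites(annotations):
    """Collect unique sites per (level, name, residue).

    `annotations` is an iterable of (kinase, protein, position, residue).
    """
    pooled = defaultdict(set)
    for kinase, protein, position, residue in annotations:
        family, group = HIERARCHY[kinase]
        site = (protein, position)
        pooled[("single", kinase, residue)].add(site)
        pooled[("family", family, residue)].add(site)
        pooled[("group", group, residue)].add(site)
    return pooled


def qualifying(pooled, min_sites=50):
    """Keep only categories with at least `min_sites` unique sites."""
    return {key: len(sites) for key, sites in pooled.items() if len(sites) >= min_sites}
```

In this scheme a kinase with too few sites of its own can still contribute to a family or group predictor, which is the motivation given in the text for the three-level hierarchy.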
Table S4. The number of phosphorylation sites for the considered kinase families in the independent
test set.
               Phosphorylated residues
Kinase family  Serine    Threonine   Tyrosine
CDK            70        31          0
CK1            52        15          0
CK2            91        30          0
MAPK           111       75          0
PKA            217       35          0
PKC            161       40          0
Src            0         0           146
Table S5. The prediction performance of phosphoserine for different single kinases
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase         Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
CDK2           92.45         84.59         93.40         81.13         94.65         79.87
PKACa          96.88         75.26         96.88         74.22         97.14         71.88
PKCA           67.48         82.67         69.91         81.16         73.25         78.12
CK2-A1         83.54         76.22         84.15         73.48         85.37         71.04
CDK1           94.17         87.44         94.62         85.20         95.07         81.61
ERK2           97.18         81.69         97.18         80.75         97.18         78.87
ERK1           94.83         90.80         94.83         89.08         94.83         86.78
ATM            96.86         72.96         96.86         71.70         96.86         65.41
Akt1           96.72         87.70         98.36         86.07         98.36         86.07
GSK3B          72.64         89.62         76.42         85.85         76.42         81.13
PLK1           46.91         91.36         48.15         88.89         54.32         86.42
p38-alpha      88.00         89.33         88.00         89.33         88.00         86.67
CK1-A          58.33         83.33         61.46         82.29         67.71         76.04
CaMK2-alpha    44.74         82.89         51.32         82.89         60.53         77.63
JNK1           95.00         96.67         95.00         95.00         95.00         90.00
CDK5           98.44         89.06         98.44         87.50         98.44         81.25
PKCD           66.67         93.94         69.70         93.94         69.70         90.91
DNA-PK         69.70         90.91         71.21         89.39         74.24         86.36
ATR            96.67         91.67         96.67         88.33         96.67         83.33
mTOR           66.67         90.74         68.52         90.74         70.37         88.89
AurB           86.79         88.68         88.68         84.91         92.45         79.25
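As a worked check of how the sensitivity and specificity percentages relate to site counts, the short Python sketch below uses the standard definitions. The underlying confusion counts are not reported in this document; the values used here are back-calculated assumptions, chosen so that they agree with the CDK2 phosphoserine row at the high threshold, given the 318 CDK2 serine sites in Table S1 and a negative set of the same size.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Percentage of true phosphorylation sites that are predicted as positive."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Percentage of non-phosphorylation sites that are predicted as negative."""
    return 100.0 * tn / (tn + fp)


# Assumed counts (not taken from the source): 318 positives and 318 negatives.
tp, fn = 294, 24
tn, fp = 269, 49
print(f"Sensitivity: {sensitivity(tp, fn):.2f}%")  # 92.45%
print(f"Specificity: {specificity(tn, fp):.2f}%")  # 84.59%
```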
Table S6. The prediction performance of phosphothreonine for different single kinases
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase         Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
CDK2           88.46         86.81         90.11         85.16         92.86         82.97
PKACa          90.00         88.33         90.00         88.33         90.00         86.67
PKCA           39.19         97.30         48.65         95.95         54.05         91.89
CK2-A1         73.33         88.33         78.33         85.00         85.00         81.67
CDK1           92.81         85.61         92.81         82.01         94.96         78.42
ERK2           96.63         95.51         96.63         94.38         96.63         89.89
ERK1           98.28         84.48         98.28         82.76         98.28         82.76
Table S7. The prediction performance of phosphotyrosine for different single kinases
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase         Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
Src            58.84         71.95         61.59         70.12         68.60         66.16
Abl            28.30         96.23         38.68         91.51         45.28         83.96
Fyn            47.37         86.32         53.68         82.11         60.00         80.00
Lck            51.35         78.38         54.05         75.68         62.16         68.92
Lyn            15.38         88.46         30.77         86.54         36.54         84.62
Table S8. The prediction performance of phosphoserine for different kinase families
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase family  Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
CDK            94.38         81.60         94.52         79.78         94.94         77.25
MAPK           95.05         78.71         95.71         77.72         95.71         75.74
PKC            74.46         75.62         75.29         73.13         78.44         70.81
PKA            96.38         73.90         96.64         72.87         97.67         72.09
CK2            83.14         75.74         84.91         73.37         85.21         71.89
PIKK           93.45         67.24         94.14         64.48         94.48         61.03
CAMKL          84.62         83.97         84.62         81.41         87.18         80.13
GSK            70.92         85.82         71.63         84.40         82.27         82.98
CK1            73.58         86.16         78.62         83.65         83.02         82.39
AKT            97.76         82.09         98.51         79.85         98.51         78.36
PLK            53.33         89.52         56.19         88.57         60.00         86.67
STE20          68.24         91.76         69.41         87.06         71.76         84.71
AUR            82.29         75.00         86.46         72.92         88.54         65.62
CAMK2          48.81         90.48         53.57         89.29         58.33         88.10
IKK            30.21         82.29         34.38         81.25         39.58         76.04
RSK            77.66         81.91         80.85         80.85         82.98         76.60
MAPKAPK        86.89         91.80         90.16         86.89         93.44         81.97
PKD            87.27         80.00         87.27         80.00         87.27         70.91
Table S9. The prediction performance of phosphothreonine for different kinase families
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase family  Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
CDK            91.45         81.61         92.23         79.27         92.75         78.24
MAPK           95.32         83.09         95.68         80.94         96.76         76.98
PKC            55.63         90.73         56.95         88.74         62.25         82.12
PKA            90.00         86.67         90.00         85.00         90.00         83.33
CK2            81.16         88.41         84.06         85.51         86.96         82.61
PIKK           87.04         92.59         87.04         92.59         87.04         88.89
CAMKL          50.85         91.53         52.54         84.75         61.02         74.58
STE20          54.69         85.94         59.38         81.25         71.88         78.12
Table S10. The prediction performance of phosphotyrosine for different kinase families
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase family  Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
Src            66.61         64.95         69.93         61.13         72.59         57.81
Abl            28.69         91.80         38.52         88.52         45.08         81.97
Syk            86.89         78.69         88.52         73.77         93.44         67.21
JakA           3.846         98.08         15.38         96.15         25.00         88.46
Table S11. The prediction performance of phosphoserine for different kinase groups
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase group   Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
CMGC           94.23         76.46         94.23         75.48         94.75         74.10
AGC            86.72         67.55         87.62         65.89         88.53         64.00
Other          69.61         74.22         71.37         72.18         75.03         68.93
CAMK           86.67         72.26         87.10         69.46         88.17         67.53
Atypical       85.31         71.47         86.44         68.93         87.01         64.97
TKL            28.57         96.64         34.45         94.96         47.06         89.08
CK1            72.35         74.12         74.12         71.18         80.59         66.47
STE            59.05         87.62         63.81         85.71         66.67         82.86
Table S12. The prediction performance of phosphothreonine for different kinase groups
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase group   Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
CMGC           94.91         78.42         95.04         76.41         95.31         73.59
AGC            75.99         78.12         76.29         74.16         78.42         72.34
Other          52.58         79.90         56.70         78.35         62.89         75.26
CAMK           40.80         87.20         49.60         85.60         60.80         80.00
Atypical       67.53         94.81         68.83         92.21         71.43         83.12
TKL            16.46         92.41         22.78         89.87         32.91         84.81
STE            59.15         76.06         67.61         67.61         70.42         64.79
Table S13. The prediction performance of phosphotyrosine for different kinase groups
               High (P-value ≤ 0.002, %)   Middle (P-value ≤ 0.005, %) Low (P-value ≤ 0.015, %)
Kinase group   Sensitivity   Specificity   Sensitivity   Specificity   Sensitivity   Specificity
TK             71.70         61.89         73.74         59.08         75.70         55.50
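Tables S5-S13 report each predictor at three P-value cut-offs (0.002, 0.005 and 0.015). The Python sketch below shows one way scored sites might be evaluated at these cut-offs, assuming that a site is called positive when its P-value is at or below the cut-off. The function name and the toy P-values are hypothetical and are not PSEA output.

```python
THRESHOLDS = {"High": 0.002, "Middle": 0.005, "Low": 0.015}


def evaluate(positive_pvalues, negative_pvalues, cutoff):
    """Return (sensitivity %, specificity %) when sites with p <= cutoff are called positive."""
    tp = sum(p <= cutoff for p in positive_pvalues)
    tn = sum(p > cutoff for p in negative_pvalues)
    return 100.0 * tp / len(positive_pvalues), 100.0 * tn / len(negative_pvalues)


# Toy P-values (hypothetical) for five known sites and five negative sites.
positives = [0.001, 0.001, 0.004, 0.010, 0.200]
negatives = [0.001, 0.008, 0.050, 0.300, 0.900]
for level, cutoff in THRESHOLDS.items():
    sn, sp = evaluate(positives, negatives, cutoff)
    print(f"{level:<6} (P-value <= {cutoff}): sensitivity {sn:.1f}%, specificity {sp:.1f}%")
```

Relaxing the cut-off can only add positive calls, so sensitivity never decreases and specificity never increases from High to Low, which matches the pattern seen in every row of the tables.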
Figure S1. ROC curves and the corresponding AUCs for phosphothreonine prediction of different
single kinases.
Figure S2. ROC curves and the corresponding AUCs for phosphotyrosine prediction of different
single kinases.
Figure S3. Statistics of the predicted phosphothreonine kinase family types for disease-related and
normal phosphorylation substrates.
Figure S4. Statistics of the predicted phosphotyrosine kinase family types for disease-related and
normal phosphorylation substrates.
Figure S5. Sequence logos of phosphoserine sites for different single kinases with a window size of
15 (-7 to +7).
Figure S6. Sequence logos of phosphothreonine and phosphotyrosine sites for different single kinases
with a window size of 15 (-7 to +7). A: logo plot for phosphothreonine; B: logo plot for
phosphotyrosine.
References
1. Dinkel, H. et al. Phospho.ELM: a database of phosphorylation sites-update 2011. Nucleic Acids Res. 39, D261-D267 (2011).
2. Hornbeck, P.V. et al. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 40, D261-D270 (2012).
3. Trost, B. & Kusalik, A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 27, 2927-2935 (2011).
4. Biswas, A.K., Noman, N. & Sikder, A.R. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinf. 11, 273 (2010).
5. Li, T.T., Du, P.F. & Xu, N.F. Identifying Human Kinase-Specific Protein Phosphorylation Sites by Integrating Heterogeneous Information from Various Sources. PLoS One 5, e15411 (2010).
6. Gao, J.J., Thelen, J.J., Dunker, A.K. & Xu, D. Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol. Cell. Proteomics 9, 2586-2600 (2010).
7. Wong, Y.H. et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 35, W588-W594 (2007).
8. Manning, G., Whyte, D.B., Martinez, R., Hunter, T. & Sudarsanam, S. The protein kinase complement of the human genome. Science 298, 1912-1934 (2002).
9. Ubersax, J.A. & Ferrell, J.E. Mechanisms of specificity in protein phosphorylation. Nat. Rev. Mol. Cell Biol. 8, 530-541 (2007).
10. Trost, B. & Kusalik, A. Computational phosphorylation site prediction in plants using random forests and organism-specific instance weights. Bioinformatics 29, 686-694 (2013).
11. Neuberger, G., Schneider, G. & Eisenhaber, F. pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model. Biol. Direct 2, 1 (2007).
12. Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S. & Brunak, S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4, 1633-1649 (2004).
13. Petersen, B., Petersen, T.N., Andersen, P., Nielsen, M. & Lundegaard, C. A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct. Biol. 9, 51 (2009).
14. Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S. & Brunak, S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4, 1633-1649 (2004).
15. Jensen, L.J., Ussery, D.W. & Brunak, S. Functionality of system components: Conservation of protein function in protein feature space. Genome Res. 13, 2444-2449 (2003).
16. Budovskaya, Y.V., Stephan, J.S., Deminoff, S.J. & Herman, P.K. An evolutionary proteomics approach identifies substrates of the cAMP-dependent protein kinase. Proc. Natl. Acad. Sci. USA 102, 13933-13938 (2005).
17. Fu, L.M., Niu, B.F., Zhu, Z.W., Wu, S.T. & Li, W.Z. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).