PSEA: Kinase-specific prediction and analysis of human phosphorylation substrates

Sheng-Bao Suo1, Jian-Ding Qiu1,2,*, Shao-Ping Shi1,3, Xiang Chen1, Ru-Ping Liang1

1 Department of Chemistry, Nanchang University, Nanchang, 330031, China.
2 Department of Chemical Engineering, Pingxiang College, Pingxiang, 337055, China.
3 Department of Mathematics, Nanchang University, Nanchang, 330031, China.
* To whom correspondence should be addressed. Tel: +86 791 83969518; Email: [email protected]

Data preparation

Positive set

Experimentally validated phosphorylation sites were extracted from Phospho.ELM (release 9.0) [1], PhosphoSitePlus [2] and UniProtKB/Swiss-Prot. Because this study considers disease-related phosphorylation, we extracted a dataset containing only human phosphorylation sites. After removing the proteins that were redundant among these three databases, we collected 128122 phosphorylation sites within 32148 proteins, of which 69315 are serine (S), 30398 are threonine (T) and 28409 are tyrosine (Y) sites (all the data can be obtained from the online web site). To construct a kinase-specific phosphorylation site predictor, we retained, for each entry, only substrate proteins in which the exact positions of the residues are experimentally verified to be phosphorylated by a given kinase. In total, we collected 8033 kinase-specific phosphorylation entries.

Although there is disagreement over the optimal peptide length for representing phosphorylation sites in a machine-learning model [3], several studies have proposed that lengths between 9 and 15 are suitable [4-6]. In addition, the sequence logos of all single kinases selected to build the predictors show that a window of 15 residues contains almost all of the conserved residues around the central sites, as shown in Figures S5 and S6.
Hence, a window size of at most 15 residues was chosen in this study: for each phosphorylation site, we extracted the 15-mer sequence [5, 7] comprising the central residue and the amino acids at positions -7 to +7 surrounding it. If there were not enough residues on either side, the missing residues were represented by “_” characters.

The prediction was performed in a kinase-specific way. It is difficult to construct predictors for all single kinases because some single kinases did not have enough data for statistical analysis. To achieve comprehensive coverage of kinase-specific prediction, we pooled the data of single kinases into kinase families and kinase groups according to the human protein kinase classification [8] and constructed kinase family and kinase group predictors. The known phosphorylation sites of each single kinase, kinase family and kinase group were then extracted separately. At each of these three levels of the kinase hierarchy, only entries with at least 50 experimental phosphorylation sites were used in this study. Tables S1-S3 summarize the statistics of all qualifying kinase-specific phosphorylation data.

Background set and negative set

The background set contains all serine, threonine and tyrosine (S/T/Y) residues on the same proteins that are not annotated with any phosphorylation information. It is well known that protein phosphorylation is a dynamic event and depends heavily on conditions [9]. Many sites that are not reported as phosphorylated in one experiment may be phosphorylated in other tissues or under other conditions. For proteins that are currently not reported as phosphorylated, the reason might be that they are not expressed at the same time or in the same tissue as the protein kinase. It is therefore hard to collect a set of protein sequences that can safely be regarded as non-phosphorylated. A given S/T/Y residue had to meet three criteria to be selected as a non-phosphorylation site (background set) [10].
First, a potential non-phosphorylation site could not have been reported as a positive site. Second, as suggested by Neuberger et al. [11], it had to be within a protein that contained known positive sites. The rationale for this criterion is that proteins with several known phosphorylation sites have been well studied with respect to phosphorylation, so sites in these proteins that are not known to be phosphorylated are more likely to be true negative sites. Third, as suggested by Blom et al. [12], a negative phosphorylation site had to be predicted as solvent-inaccessible; the rationale here is that residues buried in the core of a protein would not be accessible to any kinase. To predict solvent accessibility, the NetSurfP 1.0 program [13] was used. If a given S/T/Y residue was predicted as buried by NetSurfP 1.0, it was deemed to be a potential non-phosphorylation site.

To estimate the performance of the different kinds of kinase-specific predictors, we randomly selected a negative set from the background set with the same size as the positive set.

Independent test set

An independent test set is needed to evaluate the performance of the method and to compare it with other existing methods. All the known human phosphorylation sites with given kinases in the Phospho.ELM, PhosphoSitePlus and UniProtKB/Swiss-Prot databases have been used in the positive set. These three databases cover almost all experimentally verified phosphorylation sites, so it is hard to find another set of human data to serve as an independent test set. Phosphorylation mechanisms are known to be conserved across eukaryotic species [14-16]. We therefore collected the nonhuman phosphorylation sites of the CDK, CK1, CK2, MAPK, PKA, PKC and Src kinase families from these databases as the positive independent test set. Peptides in the independent test set that are highly homologous to samples in the predefined positive set may bias the performance evaluation.
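The data-preparation steps described in the preceding subsections (15-residue windows padded with "_", the three criteria for candidate negative sites, and a negative set drawn at random with the same size as the positive set) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 1-based position convention, the fixed random seed and the `buried_positions` input (a stand-in for the residues that a tool such as NetSurfP 1.0 predicts as buried) are assumptions made for illustration only.

```python
import random

FLANK = 7   # residues taken on each side of the central site (-7 to +7)
PAD = "_"   # placeholder for missing residues near a sequence end


def extract_window(sequence, position, flank=FLANK):
    """Return the (2*flank+1)-mer centred on the 1-based `position`.

    Positions that fall outside the sequence are filled with PAD.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    idx = position - 1
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    return (PAD * (flank - len(left)) + left + sequence[idx]
            + right + PAD * (flank - len(right)))


def candidate_negative_sites(sequence, positive_positions, buried_positions):
    """Return 1-based S/T/Y positions that satisfy the three criteria.

    1. The residue is not a reported positive site.
    2. The protein contains at least one known positive site.
    3. The residue is predicted to be buried (hypothetical input here).
    """
    if not positive_positions:  # criterion 2 fails for the whole protein
        return []
    return [
        pos
        for pos, residue in enumerate(sequence, start=1)
        if residue in "STY"
        and pos not in positive_positions
        and pos in buried_positions
    ]


def sample_negatives(background, n_positive, seed=0):
    """Draw a negative set of the same size as the positive set."""
    if n_positive > len(background):
        raise ValueError("background set is smaller than the positive set")
    return random.Random(seed).sample(background, n_positive)


if __name__ == "__main__":
    toy_sequence = "MSTAYK"  # toy sequence, not taken from the dataset
    print(extract_window(toy_sequence, 2))
    print(candidate_negative_sites(toy_sequence, {2}, {3, 5, 6}))
```

In this toy example the window for position 2 is padded on both sides because the sequence is shorter than 15 residues, and positions 3 and 5 are kept as candidate negatives because they are S/T/Y residues, are not reported positives, and are listed as buried.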
To avoid such bias, all positive independent-set samples that shared more than 40% sequence homology with any sample in the positive set were discarded using CD-HIT [17]. The final numbers for the above 7 kinase families are listed in Table S4, and the data can be obtained from our online web site.

Table S1. The number of phosphorylation sites for the considered single kinases. Numbers of phosphorylation substrates larger than 50 are marked in red.

               Phosphorylated residues
Single kinase  Serine    Threonine  Tyrosine
CDK2           318       182        0
PKACa          384       60         1
PKCA           329       74         0
CK2-A1         328       60         1
CDK1           223       139        0
Src            0         0          328
ERK2           213       89         0
ERK1           174       58         2
ATM            159       18         0
Akt1           122       39         0
GSK3B          106       38         2
PLK1           81        36         0
p38-alpha      75        41         0
CK1-A          96        11         1
Abl            0         0          106
CaMK2-alpha    76        20         0
Fyn            1         0          95
JNK1           60        36         0
CDK5           64        25         0
PKCD           66        23         0
DNA-PK         66        22         0
ATR            60        14         0
Lck            0         0          74
mTOR           54        17         0
AurB           53        11         0
Lyn            0         0          52

Table S2. The number of phosphorylation sites for the considered kinase families. Numbers of phosphorylation substrates larger than 50 are marked in red.

               Phosphorylated residues
Kinase family  Serine    Threonine  Tyrosine
CDK            712       386        1
MAPK           606       278        3
PKC            603       151        0
Src            1         0          602
PKA            387       60         1
CK2            338       69         1
PIKK           290       54         0
CAMKL          156       59         2
GSK            141       48         4
CK1            159       23         1
AKT            134       42         0
PLK            105       43         1
STE20          85        64         0
Abl            0         0          122
AUR            96        20         1
CAMK2          84        23         0
IKK            96        8          0
RSK            94        9          0
MAPKAPK        61        1          0
Syk            0         0          61
PKD            55        5          0
JakA           1         0          52

Table S3. The number of phosphorylation sites for the considered kinase groups. Numbers of phosphorylation substrates larger than 50 are marked in red.

               Phosphorylated residues
Kinase group   Serine    Threonine  Tyrosine
CMGC           1525      746        9
AGC            1325      329        1
TK             9         0          1173
Other          737       194        12
CAMK           465       125        2
Atypical       354       77         7
TKL            119       79         7
CK1            170       32         2
STE            105       71         0

Table S4.
The number of phosphorylation sites for the considered kinase families in the independent set.

               Phosphorylated residues
Kinase family  Serine    Threonine  Tyrosine
CDK            70        31         0
CK1            52        15         0
CK2            91        30         0
MAPK           111       75         0
PKA            217       35         0
PKC            161       40         0
Src            0         0          146

Table S5. The prediction performance of phosphoserine for different single kinases

Kinase         High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
CDK2           92.45        84.59        93.40        81.13        94.65        79.87
PKACa          96.88        75.26        96.88        74.22        97.14        71.88
PKCA           67.48        82.67        69.91        81.16        73.25        78.12
CK2-A1         83.54        76.22        84.15        73.48        85.37        71.04
CDK1           94.17        87.44        94.62        85.20        95.07        81.61
ERK2           97.18        81.69        97.18        80.75        97.18        78.87
ERK1           94.83        90.80        94.83        89.08        94.83        86.78
ATM            96.86        72.96        96.86        71.70        96.86        65.41
Akt1           96.72        87.70        98.36        86.07        98.36        86.07
GSK3B          72.64        89.62        76.42        85.85        76.42        81.13
PLK1           46.91        91.36        48.15        88.89        54.32        86.42
p38-alpha      88.00        89.33        88.00        89.33        88.00        86.67
CK1-A          58.33        83.33        61.46        82.29        67.71        76.04
CaMK2-alpha    44.74        82.89        51.32        82.89        60.53        77.63
JNK1           95.00        96.67        95.00        95.00        95.00        90.00
CDK5           98.44        89.06        98.44        87.50        98.44        81.25
PKCD           66.67        93.94        69.70        93.94        69.70        90.91
DNA-PK         69.70        90.91        71.21        89.39        74.24        86.36
ATR            96.67        91.67        96.67        88.33        96.67        83.33
mTOR           66.67        90.74        68.52        90.74        70.37        88.89
AurB           86.79        88.68        88.68        84.91        92.45        79.25

Table S6. The prediction performance of phosphothreonine for different single kinases

Kinase         High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
CDK2           88.46        86.81        90.11        85.16        92.86        82.97
PKACa          90.00        88.33        90.00        88.33        90.00        86.67
PKCA           39.19        97.30        48.65        95.95        54.05        91.89
CK2-A1         73.33        88.33        78.33        85.00        85.00        81.67
CDK1           92.81        85.61        92.81        82.01        94.96        78.42
ERK2           96.63        95.51        96.63        94.38        96.63        89.89
ERK1           98.28        84.48        98.28        82.76        98.28        82.76

Table S7.
The prediction performance of phosphotyrosine for different single kinases

Kinase         High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
Src            58.84        71.95        61.59        70.12        68.60        66.16
Abl            28.30        96.23        38.68        91.51        45.28        83.96
Fyn            47.37        86.32        53.68        82.11        60.00        80.00
Lck            51.35        78.38        54.05        75.68        62.16        68.92
Lyn            15.38        88.46        30.77        86.54        36.54        84.62

Table S8. The prediction performance of phosphoserine for different kinase families

Kinase family  High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
CDK            94.38        81.60        94.52        79.78        94.94        77.25
MAPK           95.05        78.71        95.71        77.72        95.71        75.74
PKC            74.46        75.62        75.29        73.13        78.44        70.81
PKA            96.38        73.90        96.64        72.87        97.67        72.09
CK2            83.14        75.74        84.91        73.37        85.21        71.89
PIKK           93.45        67.24        94.14        64.48        94.48        61.03
CAMKL          84.62        83.97        84.62        81.41        87.18        80.13
GSK            70.92        85.82        71.63        84.40        82.27        82.98
CK1            73.58        86.16        78.62        83.65        83.02        82.39
AKT            97.76        82.09        98.51        79.85        98.51        78.36
PLK            53.33        89.52        56.19        88.57        60.00        86.67
STE20          68.24        91.76        69.41        87.06        71.76        84.71
AUR            82.29        75.00        86.46        72.92        88.54        65.62
CAMK2          48.81        90.48        53.57        89.29        58.33        88.10
IKK            30.21        82.29        34.38        81.25        39.58        76.04
RSK            77.66        81.91        80.85        80.85        82.98        76.60
MAPKAPK        86.89        91.80        90.16        86.89        93.44        81.97
PKD            87.27        80.00        87.27        80.00        87.27        70.91

Table S9. The prediction performance of phosphothreonine for different kinase families

Kinase family  High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
CDK            91.45        81.61        92.23        79.27        92.75        78.24
MAPK           95.32        83.09        95.68        80.94        96.76        76.98
PKC            55.63        90.73        56.95        88.74        62.25        82.12
PKA            90.00        86.67        90.00        85.00        90.00        83.33
CK2            81.16        88.41        84.06        85.51        86.96        82.61
PIKK           87.04        92.59        87.04        92.59        87.04        88.89
CAMKL          50.85        91.53        52.54        84.75        61.02        74.58
STE20          54.69        85.94        59.38        81.25        71.88        78.12

Table S10.
The prediction performance of phosphotyrosine for different kinase families

Kinase family  High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
Src            66.61        64.95        69.93        61.13        72.59        57.81
Abl            28.69        91.80        38.52        88.52        45.08        81.97
Syk            86.89        78.69        88.52        73.77        93.44        67.21
JakA           3.846        98.08        15.38        96.15        25.00        88.46

Table S11. The prediction performance of phosphoserine for different kinase groups

Kinase group   High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
CMGC           94.23        76.46        94.23        75.48        94.75        74.10
AGC            86.72        67.55        87.62        65.89        88.53        64.00
Other          69.61        74.22        71.37        72.18        75.03        68.93
CAMK           86.67        72.26        87.10        69.46        88.17        67.53
Atypical       85.31        71.47        86.44        68.93        87.01        64.97
TKL            28.57        96.64        34.45        94.96        47.06        89.08
CK1            72.35        74.12        74.12        71.18        80.59        66.47
STE            59.05        87.62        63.81        85.71        66.67        82.86

Table S12. The prediction performance of phosphothreonine for different kinase groups

Kinase group   High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
CMGC           94.91        78.42        95.04        76.41        95.31        73.59
AGC            75.99        78.12        76.29        74.16        78.42        72.34
Other          52.58        79.90        56.70        78.35        62.89        75.26
CAMK           40.80        87.20        49.60        85.60        60.80        80.00
Atypical       67.53        94.81        68.83        92.21        71.43        83.12
TKL            16.46        92.41        22.78        89.87        32.91        84.81
STE            59.15        76.06        67.61        67.61        70.42        64.79

Table S13. The prediction performance of phosphotyrosine for different kinase groups

Kinase group   High (P-Value≤0.002, %)   Middle (P-Value≤0.005, %) Low (P-Value≤0.015, %)
               Sensitivity  Specificity  Sensitivity  Specificity  Sensitivity  Specificity
TK             71.70        61.89        73.74        59.08        75.70        55.50

Figure S1. The ROC curve and the corresponding AUCs for phosphothreonine prediction of different single kinases.

Figure S2.
The ROC curve and the corresponding AUCs for phosphotyrosine prediction of different single kinases.

Figure S3. The data statistics of predicted phosphothreonine kinase family types for disease-related and normal phosphorylation substrates.

Figure S4. The data statistics of predicted phosphotyrosine kinase family types for disease-related and normal phosphorylation substrates.

Figure S5. The sequence logos of phosphoserine for different single kinases with a window size of 15 (-7 to +7).

Figure S6. The sequence logos of phosphothreonine and phosphotyrosine for different single kinases with a window size of 15 (-7 to +7). A: logo plot for phosphothreonine; B: logo plot for phosphotyrosine.

References

1. Dinkel, H. et al. Phospho.ELM: a database of phosphorylation sites-update 2011. Nucleic Acids Res. 39, D261-D267 (2011).
2. Hornbeck, P.V. et al. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 40, D261-D270 (2012).
3. Trost, B. & Kusalik, A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 27, 2927-2935 (2011).
4. Biswas, A.K., Noman, N. & Sikder, A.R. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinf. 11, 273 (2010).
5. Li, T.T., Du, P.F. & Xu, N.F. Identifying Human Kinase-Specific Protein Phosphorylation Sites by Integrating Heterogeneous Information from Various Sources. PLoS One 5, e15411 (2010).
6. Gao, J.J., Thelen, J.J., Dunker, A.K. & Xu, D. Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol. Cell. Proteomics 9, 2586-2600 (2010).
7. Wong, Y.H. et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 35, W588-W594 (2007).
8. Manning, G., Whyte, D.B., Martinez, R., Hunter, T.
& Sudarsanam, S. The protein kinase complement of the human genome. Science 298, 1912-1934 (2002).
9. Ubersax, J.A. & Ferrell, J.E. Mechanisms of specificity in protein phosphorylation. Nat. Rev. Mol. Cell Biol. 8, 530-541 (2007).
10. Trost, B. & Kusalik, A. Computational phosphorylation site prediction in plants using random forests and organism-specific instance weights. Bioinformatics 29, 686-694 (2013).
11. Neuberger, G., Schneider, G. & Eisenhaber, F. pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model. Biol. Direct 2, 1 (2007).
12. Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S. & Brunak, S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4, 1633-1649 (2004).
13. Petersen, B., Petersen, T.N., Andersen, P., Nielsen, M. & Lundegaard, C. A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct. Biol. 9, 51 (2009).
14. Blom, N., Sicheritz-Ponten, T., Gupta, R., Gammeltoft, S. & Brunak, S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 4, 1633-1649 (2004).
15. Jensen, L.J., Ussery, D.W. & Brunak, S. Functionality of system components: Conservation of protein function in protein feature space. Genome Res. 13, 2444-2449 (2003).
16. Budovskaya, Y.V., Stephan, J.S., Deminoff, S.J. & Herman, P.K. An evolutionary proteomics approach identifies substrates of the cAMP-dependent protein kinase. P. Natl. Acad. Sci. USA 102, 13933-13938 (2005).
17. Fu, L.M., Niu, B.F., Zhu, Z.W., Wu, S.T. & Li, W.Z. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152 (2012).