Fragment Virtual Screening Based on Bayesian Categorization for Discovering Novel VEGFR-2 Scaffolds

Yanmin Zhang, Yu Jiao, Xiao Xiong, Haichun Liu, Ting Ran, Jinxing Xu, Shuai Lu, Anyang Xu, Jing Pan, Xin Qiao, Zhihao Shi, Tao Lu* and Yadong Chen*

a Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China.
b School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China.

*Corresponding Author: For Yadong Chen: Tel.: +86-25-86185163. Fax: +86-25-86185182. E-mail: [email protected]. For Tao Lu: Tel.: +86-25-86185180. Fax: +86-25-86185179. E-mail: [email protected].

Supplementary Material

Figures
Fig. S1 The VEGFR-2 inhibitor sorafenib depicted using the three different scaffold representations.
Fig. S2 Workflow for Bayesian classification model building, validation, and VS as applied to internal and external databases. m% and n% represent different ratios of actives and decoys that are distributed to the training set and internal test set.
Fig. S3 Statistical values comparison of 2D properties for (a) VEGFR-2 actives and decoys and compounds in Specs; (b) the 10 compound databases.
Fig. S4 Tree Maps of the ChemBridge database after molecular weight filtration.
Fig. S5 Tree Maps of the ChemDiv database after molecular weight filtration.
Fig. S6 Tree Maps of the DrugBank database after molecular weight filtration.
Fig. S7 Tree Maps of the HitFinder database after molecular weight filtration.
Fig. S8 Tree Maps of the NCI database after molecular weight filtration.
Fig. S9 Tree Maps of the ZINC database after molecular weight filtration.
Fig. S10 Tree Maps of the ChEMBL database after molecular weight filtration.
Fig. S11 Tree Maps of the Specs database after molecular weight filtration.
Fig. S12 Tree Maps of the KLIFS database after molecular weight filtration.
Fig. S13 Finally selected 20 hit scaffolds based on Bayesian score and molecular docking analysis.

Tables
Table S1 Definition of classification model performance measures
Table S2 Statistical values comparison of 2D properties of 10 compound databases
Table S3 Statistics results of Bayesian categorization models
Table S4 Comparison of the Bayesian model with those in other published papers
Table S5 Percentile results for Bayesian model-based VS

Fig. S1 The VEGFR-2 inhibitor sorafenib depicted using the three different scaffold representations.
Sorafenib. Molecular Weight: 464.82
Murcko (All: 1 fragment, MW [150-350]: 1 fragment). Molecular Weight: 291.35
Recap (All: 3 fragments, MW [150-350]: 1 fragment). Molecular Weights: 313.68, 135.14, 16.00
UserDefined (All: 33 fragments, MW [150-350]: 16 fragments). Molecular Weights: 58.04, 135.14, 15.01, 30.05, 434.78, 43.02, 329.68, 313.68, 151.14, 222.57, 227.24, 237.59, 270.26, 194.56, 242.25, 179.55, 150.13, 255.23, 28.01, 119.12, 134.14, 285.28, 135.12, 240.21, 91.11, 107.11, 212.20, 76.10, 92.10, 197.19, 16.00, 121.09, 105.09
*Blue-labeled molecules are those with molecular weight in the range of 150-350.

Fig. S2 Workflow for Bayesian classification model building, validation, and VS as applied to internal and external databases. m% and n% represent different ratios of actives and decoys that are distributed to the training set and internal test set.

Fig. S3 Statistical values comparison of 2D properties for (a) VEGFR-2 actives and decoys and compounds in Specs; (b) the 10 compound databases.

Fig. S4 Tree Maps of the ChemBridge database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S5 Tree Maps of the ChemDiv database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S6 Tree Maps of the DrugBank database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S7 Tree Maps of the HitFinder database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S8 Tree Maps of the NCI database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S9 Tree Maps of the ZINC database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S10 Tree Maps of the ChEMBL database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S11 Tree Maps of the Specs database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S12 Tree Maps of the KLIFS database after molecular weight filtration. (a) Murcko scaffolds; (b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color and area of the circles are proportional to the scaffold frequency, and gray circles represent clusters of scaffolds. Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and the large proportion of singleton scaffolds in the databases (many small white circles).

Fig. S13 Finally selected 20 hit scaffolds based on Bayesian score and molecular docking analysis.
| Scaffold | Molecular Weight | Bayesian Score | Docking Score |
|---|---|---|---|
| Hit_Scaffold_1 | 339.393 | 37.3637 | -11.6542 |
| Hit_Scaffold_2 | 345.355 | 29.4181 | -11.1865 |
| Hit_Scaffold_3 | 343.357 | 35.8301 | -11.1164 |
| Hit_Scaffold_4 | 343.382 | 35.6318 | -11.0626 |
| Hit_Scaffold_5 | 321.33 | 27.0226 | -11.0587 |
| Hit_Scaffold_6 | 343.379 | 33.4276 | -10.9413 |
| Hit_Scaffold_7 | 342.357 | 29.4108 | -10.8715 |
| Hit_Scaffold_8 | 304.349 | 38.055 | -10.8326 |
| Hit_Scaffold_9 | 292.335 | 33.0219 | -10.7598 |
| Hit_Scaffold_10 | 325.406 | 25.451 | -10.7064 |
| Hit_Scaffold_11 | 344.363 | 46.7105 | -10.6781 |
| Hit_Scaffold_12 | 305.354 | 28.4191 | -10.6695 |
| Hit_Scaffold_13 | 349.388 | 47.7709 | -10.6677 |
| Hit_Scaffold_14 | 344.41 | 29.1089 | -10.6415 |
| Hit_Scaffold_15 | 346.386 | 30.7082 | -10.6327 |
| Hit_Scaffold_16 | 315.372 | 41.9494 | -10.6346 |
| Hit_Scaffold_17 | 306.322 | 27.4236 | -10.6208 |
| Hit_Scaffold_18 | 343.382 | 25.0921 | -10.6074 |
| Hit_Scaffold_19 | 316.36 | 29.0655 | -10.584 |
| Hit_Scaffold_20 | 275.305 | 30.3625 | -10.577 |

Table S1 Definition of classification model performance measures

| Statistics | Full name | Description | Formula |
|---|---|---|---|
| AUC | Area under curve | Area under the ROC curve, which represents the probability of ranking an arbitrarily chosen active higher than a decoy processed in the same way. It ranges from 0 to 1. | - |
| TP | True positives | Number of accurately identified actives. | - |
| TN | True negatives | Number of correctly identified inactives. | - |
| FP | False positives | Number of inactives incorrectly predicted as actives. | - |
| FN | False negatives | Number of actives incorrectly predicted to be inactives. | - |
| EF | Enrichment factor | Enrichment factor (EF) calculated on a given percentage of the screened hits. | $EF_{x\%} = \frac{Hits_{selected}^{x\%} / N_{selected}^{x\%}}{Hits_{total} / N_{total}}$ |
| Se | Sensitivity | Ratio of the retrieved true positive compounds (TP) to all active compounds in the database (TP+FN); ranges from 0 to 1. | $Se = \frac{TP}{TP + FN}$ |
| Sp | Specificity | Proportion of truly inactive compounds that are correctly identified; ranges from 0 to 1. | $Sp = \frac{TN}{TN + FP}$ |
| Accuracy | Accuracy | Proportion of true predictions in the total database; ranges from 0 to 1. | $Accuracy = \frac{TP + TN}{N_{total}}$ |
| Precision | Precision | Ability to correctly predict positive results; ranges from 0 to 1. | $Precision = \frac{TP}{TP + FP}$ |
| Kappa | Kappa | Considered the true accuracy for classification models; it outperforms accuracy for evaluating the model's prediction power. A model with a kappa value over 0.4 is considered useful. | $Kappa = \frac{(TP+TN) - \frac{(TP+FN)(TP+FP) + (FP+TN)(FN+TN)}{N}}{N - \frac{(TP+FN)(TP+FP) + (FP+TN)(FN+TN)}{N}}$ |
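The measures defined in Table S1 can be illustrated with a short calculation. The following is a minimal Python sketch, not the authors' implementation: the class and function names are hypothetical, the confusion counts in the demo are the leave-one-out TP/FN,FP/TN values reported for KDRDrugLike1vs60 in Table S3 (200/2,71/12112), and the enrichment factor inputs are made-up numbers used only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of a binary active/inactive classification (hypothetical helper)."""
    tp: int  # actives predicted as actives
    fn: int  # actives predicted as inactives
    fp: int  # inactives predicted as actives
    tn: int  # inactives predicted as inactives

    @property
    def n_total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def sensitivity(c: ConfusionCounts) -> float:
    # Se = TP / (TP + FN)
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Sp = TN / (TN + FP)
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Accuracy = (TP + TN) / N_total
    return (c.tp + c.tn) / c.n_total


def precision(c: ConfusionCounts) -> float:
    # Precision = TP / (TP + FP)
    return c.tp / (c.tp + c.fp)


def kappa(c: ConfusionCounts) -> float:
    # Agreement expected by chance, expressed as a count (Table S1 formula).
    expected = ((c.tp + c.fn) * (c.tp + c.fp)
                + (c.fp + c.tn) * (c.fn + c.tn)) / c.n_total
    return ((c.tp + c.tn) - expected) / (c.n_total - expected)


def enrichment_factor(hits_selected: int, n_selected: int,
                      hits_total: int, n_total: int) -> float:
    # EF = (Hits_selected / N_selected) / (Hits_total / N_total)
    return (hits_selected / n_selected) / (hits_total / n_total)


if __name__ == "__main__":
    # Counts from the KDRDrugLike1vs60 row of Table S3 (TP/FN,FP/TN).
    counts = ConfusionCounts(tp=200, fn=2, fp=71, tn=12112)
    for name, fn_ in [("Se", sensitivity), ("Sp", specificity),
                      ("Accuracy", accuracy), ("Precision", precision),
                      ("Kappa", kappa)]:
        print(f"{name}: {fn_(counts):.3f}")
    # Hypothetical EF inputs: 50 actives among the top 100 ranked compounds,
    # out of 100 actives in a library of 10000 compounds.
    print(f"EF: {enrichment_factor(50, 100, 100, 10000):.1f}")
```

Kappa is computed from the same four counts as accuracy, but subtracts the agreement expected by chance, which matters for strongly imbalanced sets such as the 1:60 active-to-decoy ratio.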
Table S2 Statistical values comparison of 2D properties of 10 compound databases

| DUD-E_VEGFR-2 | ChemBridge | ChemDiv | DrugBank | HitFinder | NCI | ZINC | ChEMBL | Specs | KLIFS | KDRActive | Average | StdEv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number_of_molecules | 100032 | 700000 | 6234 | 14400 | 252684 | 1190088 | 1286475 | 219059 | 1100 | 2542 | - | - |
| Minimized_Energy/10 | 2.97 | 3.77 | 4.31 | 3.41 | 3.25 | 2.19 | 5.03 | 2.42 | 5.09 | 4.62 | 3.61 | 0.49 |
| ALogP | 2.54 | 3.14 | 1.34 | 3.21 | 2.82 | 1.86 | 3.10 | 3.43 | 2.80 | 3.76 | 2.56 | 0.75 |
| Molecular_Weight/100 | 3.36 | 3.94 | 3.46 | 3.25 | 3.24 | 3.00 | 4.24 | 3.66 | 3.92 | 4.30 | 3.50 | 0.26 |
| Num_RotatableBonds | 5 | 5 | 6 | 4 | 5 | 4 | 7 | 5 | 5 | 6 | 4.91 | 0.54 |
| Num_Rings | 3 | 4 | 2 | 3 | 2 | 3 | 3 | 3 | 4 | 4 | 2.93 | 0.48 |
| Num_AromaticRings | 2 | 3 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 2.02 | 0.40 |
| Num_H_Acceptors | 4 | 4 | 6 | 4 | 4 | 3 | 5 | 4 | 5 | 5 | 4.44 | 0.74 |
| Num_H_Donors | 1 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 3 | 2 | 1.41 | 0.78 |
| Molecular_FPSA*10 | 2.10 | 2.33 | 3.42 | 2.75 | 2.57 | 2.47 | 2.44 | 2.37 | 2.81 | 2.34 | 2.65 | 0.50 |

Table S3 Statistics results of Bayesian categorization models

Leave-one-out Cross-Validation Results

| Model Name | XV ROC AUC | Best Split | TP/FN,FP/TN | # in Category |
|---|---|---|---|---|
| KDRDrugLike1vs1 | 1.000 | -0.379 | 199/3,0/203 | 202 |
| KDRDrugLike1vs5 | 1.000 | 6.390 | 198/4,0/1010 | 202 |
| KDRDrugLike1vs20 | 1.000 | 3.823 | 200/2,18/3942 | 202 |
| KDRDrugLike1vs60 | 1.000 | 4.728 | 200/2,71/12112 | 202 |

Enrichment Results

| Model Name | Category % | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 95% | 99% |
|---|---|---|---|---|---|---|---|---|---|---|
| KDRDrugLike1vs1 | 49.877% | 2.50% | 10.40% | 20.30% | 50.50% | 99% | 100% | 100% | 100% | 100% |
| KDRDrugLike1vs5 | 16.667% | 6.40% | 30.20% | 60.40% | 100% | 100% | 100% | 100% | 100% | 100% |
| KDRDrugLike1vs20 | 4.853% | 20.80% | 98% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| KDRDrugLike1vs60 | 1.631% | 61.40% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |

Percentile Results (1%/99%)

| Model Name | 99% | 95% | 90% | 70% | 50% | 30% | 10% | 5% | 1% |
|---|---|---|---|---|---|---|---|---|---|
| KDRDrugLike1vs1 | -1.219 | 3.437 | 5.965 | 8.758 | 8.758 | 20.066 | 22.860 | 25.388 | 30.044 |
| KDRDrugLike1vs5 | 2.192 | 11.623 | 16.742 | 22.401 | 22.401 | 45.303 | 50.962 | 56.081 | 65.511 |
| KDRDrugLike1vs20 | 4.941 | 17.543 | 24.384 | 31.945 | 31.945 | 62.550 | 70.111 | 76.952 | 89.554 |
| KDRDrugLike1vs60 | 5.788 | 19.766 | 27.353 | 35.740 | 35.740 | 69.686 | 78.072 | 85.660 | 99.638 |

Category Statistics Results

| Model Name | Category N | Category Mean (±StdDev) | Noncategory N | Noncategory Mean (±StdDev) |
|---|---|---|---|---|
| KDRDrugLike1vs1 | 202 | 14.41 (±6.65) | 203 | -20.29 (±8.14) |
| KDRDrugLike1vs5 | 202 | 33.85 (±13.47) | 1010 | -27.72 (±12.23) |
| KDRDrugLike1vs20 | 202 | 47.25 (±18.00) | 3960 | -29.46 (±13.28) |
| KDRDrugLike1vs60 | 202 | 52.71 (±19.97) | 12183 | -29.90 (±13.91) |

Model Construction Information

| Model Name | Row | ALogP | Molecular_Weight | Num_H_Donors | Num_H_Acceptors | Num_Rings | Num_AromaticRings | Num_RotatableBonds | Molecular_PolarSurfaceArea | Molecular_Solubility | ECFP_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| KDRDrugLike1vs1 | Original | 11 | 11 | 5 | 6 | 5 | 6 | 8 | 11 | 11 | 5184 |
| KDRDrugLike1vs1 | TooFew | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| KDRDrugLike1vs1 | Noninfo* | 4 | 2 | 0 | 1 | 0 | 0 | 4 | 3 | 3 | 99 |
| KDRDrugLike1vs1 | Final | 7 | 9 | 5 | 5 | 5 | 6 | 4 | 8 | 8 | 5085 |
| KDRDrugLike1vs5 | Original | 11 | 11 | 4 | 7 | 4 | 5 | 8 | 11 | 11 | 11895 |
| KDRDrugLike1vs5 | TooFew | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| KDRDrugLike1vs5 | Noninfo | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 1 | 37 |
| KDRDrugLike1vs5 | Final | 10 | 11 | 4 | 7 | 4 | 5 | 6 | 8 | 10 | 11858 |
| KDRDrugLike1vs20 | Original | 11 | 11 | 4 | 6 | 5 | 5 | 8 | 11 | 11 | 23808 |
| KDRDrugLike1vs20 | TooFew | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| KDRDrugLike1vs20 | Noninfo | 2 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 1 | 33 |
| KDRDrugLike1vs20 | Final | 9 | 11 | 4 | 5 | 4 | 5 | 6 | 9 | 10 | 23775 |
| KDRDrugLike1vs60 | Original | 11 | 11 | 4 | 6 | 5 | 4 | 8 | 11 | 11 | 40822 |
| KDRDrugLike1vs60 | TooFew | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| KDRDrugLike1vs60 | Noninfo | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 2 | 24281 |
| KDRDrugLike1vs60 | Final | 10 | 10 | 4 | 5 | 4 | 4 | 7 | 9 | 9 | 16541 |

*Noninfo stands for Noninformative

Table S4 Comparison of the Bayesian model with those in other published papers

| Papers | Parameters | EF | Se | Sp | Accuracy | Precision |
|---|---|---|---|---|---|---|
| Manuscript | Training set | 61.39 | 0.61 | 1.00 | 0.99 | 1.00 |
| Manuscript | Internal test set | 61.35 | 0.61 | 1.00 | 0.99 | 1.00 |
| Manuscript | External test set | 77.30 | 0.77 | 1.00 | 1.00 | 0.72 |
| Ref. 16 | Training set | - | 0.89 | 0.93 | 0.91 | 0.93 |
| Ref. 16 | Test set | - | 0.81 | 0.88 | 0.84 | 0.86 |
| Ref. 19 | Training set | 2.49 | 0.97 | 0.97 | 0.97 | 0.95 |
| Ref. 19 | Test set | 1.51 | 0.80 | 0.77 | 0.78 | 0.79 |
| Ref. 19 | Cross-validation | 1.63 | 0.69 | 0.74 | 0.72 | 0.62 |
| Ref. 22 | 100 nM cutoff | 3.27 | - | - | 0.88 | - |
| Ref. 22 | 500 nM cutoff | 2.31 | - | - | 0.88 | - |

Table S5 Percentile results for Bayesian model-based VS

Bayesian model (KDRDrugLike1vs60)-based VS for Databases

| Vendor DBs | Score > 35 | Score > 30 | Score > 25 | Score > 20 | Score > 15 | Score > 10 |
|---|---|---|---|---|---|---|
| ChemBridge | 0.32% | 0.77% | 1.33% | 1.85% | 2.32% | 2.66% |
| ChemDiv | 2.41% | 5.52% | 9.05% | 12.25% | 14.81% | 16.98% |
| DrugBank | 0.28% | 0.27% | 0.29% | 0.24% | 0.22% | 0.19% |
| HitFinder | 0.20% | 0.25% | 0.29% | 0.32% | 0.39% | 0.41% |
| NCI | 0.61% | 1.04% | 1.58% | 2.16% | 2.74% | 3.36% |
| ZINC | 3.03% | 5.93% | 9.91% | 13.90% | 17.78% | 20.94% |
| ChEMBL | 78.16% | 75.13% | 69.19% | 62.80% | 56.41% | 50.80% |
| Specs | 0.38% | 0.92% | 1.79% | 2.48% | 2.95% | 3.26% |
| KLIFS | 0.91% | 0.73% | 0.61% | 0.44% | 0.35% | 0.25% |
| KDRActive | 11.08% | 7.69% | 4.89% | 2.94% | 1.70% | 0.96% |

Bayesian model (KDRDrugLike1vs60)-based VS for Fragments

| Vendor DBs | Score > 35 | Score > 30 | Score > 25 | Score > 20 | Score > 15 | Score > 10 |
|---|---|---|---|---|---|---|
| ChemBridge | 2.43% | 4.98% | 4.62% | 4.51% | 4.74% | 5.52% |
| ChemDiv | 3.40% | 7.47% | 8.56% | 8.84% | 9.98% | 11.00% |
| DrugBank | 0.49% | 0.26% | 0.42% | 0.41% | 0.41% | 0.36% |
| HitFinder | 0.00% | 0.13% | 0.30% | 0.48% | 0.60% | 0.78% |
| NCI | 0.00% | 0.66% | 1.44% | 1.67% | 2.19% | 2.89% |
| ZINC | 18.93% | 28.96% | 37.61% | 42.62% | 45.28% | 46.43% |
| ChEMBL | 30.10% | 32.11% | 32.06% | 31.99% | 29.45% | 26.72% |
| Specs | 0.49% | 1.57% | 1.69% | 1.96% | 2.62% | 3.44% |
| KLIFS | 1.94% | 1.83% | 1.23% | 1.04% | 0.86% | 0.69% |
| KDRActive | 42.23% | 22.02% | 12.07% | 6.51% | 3.88% | 2.16% |