Fragment Virtual Screening Based on Bayesian Categorization for Discovering Novel VEGFR-2 Scaffolds
Yanmin Zhang a · Yu Jiao b · Xiao Xiong a · Haichun Liu a · Ting Ran a · Jinxing Xu a · Shuai Lu a ·
Anyang Xu a · Jing Pan a · Xin Qiao a · Zhihao Shi b · Tao Lu a,b,* and Yadong Chen a,*
a Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical
University, 639 Longmian Avenue, Nanjing 211198, China.
b School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China.
*Corresponding Author:
For Yadong Chen: Tel.: +86-25-86185163. Fax: +86-25-86185182. E-mail: [email protected].
For Tao Lu: Tel.: +86-25-86185180. Fax: +86-25-86185179. E-mail: [email protected].
Supplementary Material
Figures
Fig. S1 The VEGFR-2 inhibitor sorafenib depicted using the three different scaffold representations.
Fig. S2 Workflow for Bayesian classification model building, validation, and VS as applied to internal
and external databases. m% and n% represent different ratios of actives and decoys that are distributed
to the training set and the internal test set.
Fig. S3 Comparison of statistical values of 2D properties for (a) VEGFR-2 actives, decoys, and
compounds in Specs and (b) the 10 compound databases.
Fig. S4 Tree Maps of the ChemBridge database after molecular weight filtration.
Fig. S5 Tree Maps of the ChemDiv database after molecular weight filtration.
Fig. S6 Tree Maps of the DrugBank database after molecular weight filtration.
Fig. S7 Tree Maps of the HitFinder database after molecular weight filtration.
Fig. S8 Tree Maps of the NCI database after molecular weight filtration.
Fig. S9 Tree Maps of the ZINC database after molecular weight filtration.
Fig. S10 Tree Maps of the ChEMBL database after molecular weight filtration.
Fig. S11 Tree Maps of the Specs database after molecular weight filtration.
Fig. S12 Tree Maps of the KLIFS database after molecular weight filtration.
Fig. S13 Finally Selected 20 Hit Scaffolds Based on Bayesian Score and Molecular Docking Analysis.
Tables
Table S1 Definition of classification model performance measures
Table S2 Statistical values comparison of 2D Properties of 10 compound Databases
Table S3 Statistics results of Bayesian categorization models
Table S4 Comparison of the Bayesian model with those in other published papers
Table S5 Percentile results for Bayesian model-based VS
Fig. S1 The VEGFR-2 inhibitor sorafenib depicted using the three different scaffold representations.
Sorafenib. Molecular Weight: 464.82
Murcko (All: 1 fragment, MW [150-350]: 1 fragment). Molecular Weight: 291.35
Recap (All: 3 fragments, MW [150-350]: 1 fragment). Molecular Weights: 313.68, 135.14, 16.00
UserDefined (All: 33 fragments, MW [150-350]: 16 fragments). Molecular Weights: 58.04, 135.14, 15.01,
30.05, 434.78, 43.02, 329.68, 313.68, 151.14, 222.57, 227.24, 237.59, 270.26, 194.56, 242.25, 179.55,
150.13, 255.23, 28.01, 119.12, 134.14, 285.28, 135.12, 240.21, 91.11, 107.11, 212.20, 76.10, 92.10,
197.19, 16.00, 121.09, 105.09
*Blue-labeled molecules are those with molecular weight in the range of 150-350.
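The molecular-weight window applied to the fragments in Fig. S1 can be checked with a short Python sketch. It hard-codes the fragment weights printed in the figure and counts how many fall in the 150-350 range. Treating both bounds as inclusive is an assumption, and the names `FRAGMENT_MW` and `in_window` are illustrative only.

```python
# Fragment molecular weights as printed in Fig. S1 for sorafenib.
FRAGMENT_MW = {
    "Murcko": [291.35],
    "Recap": [313.68, 135.14, 16.00],
    "UserDefined": [
        58.04, 135.14, 15.01, 30.05, 434.78, 43.02, 329.68, 313.68,
        151.14, 222.57, 227.24, 237.59, 270.26, 194.56, 242.25, 179.55,
        150.13, 255.23, 28.01, 119.12, 134.14, 285.28, 135.12, 240.21,
        91.11, 107.11, 212.20, 76.10, 92.10, 197.19, 16.00, 121.09, 105.09,
    ],
}


def in_window(weights, low=150.0, high=350.0):
    """Return the weights inside [low, high]; inclusive bounds are assumed."""
    return [w for w in weights if low <= w <= high]


# For each representation: (number of fragments, number inside the window).
counts = {name: (len(ws), len(in_window(ws))) for name, ws in FRAGMENT_MW.items()}

for name, (total, kept) in counts.items():
    print(f"{name}: all={total}, MW [150-350]={kept}")
```

With the weights listed in the figure, the counts agree with the figure labels: 1 of 1 for Murcko, 1 of 3 for Recap, and 16 of 33 for UserDefined.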
Fig. S2 Workflow for Bayesian classification model building, validation, and VS as applied to internal
and external databases. m% and n% represent different ratios of actives and decoys that are distributed
to the training set and the internal test set.
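One possible reading of the m% and n% split in Fig. S2 is sketched below in Python. It assumes that m is the fraction of actives and n the fraction of decoys assigned to the training set, with the remainder going to the internal test set. The compound identifiers, the set sizes, the fractions, and the helper `split_set` are hypothetical placeholders, not values from this work.

```python
import random


def split_set(compounds, train_fraction, seed=0):
    """Shuffle a copy of `compounds` and split it into (training, internal_test)."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder identifiers; real inputs would be compound records.
actives = [f"active_{i}" for i in range(100)]
decoys = [f"decoy_{i}" for i in range(6000)]

m, n = 0.75, 0.75  # assumed fractions of actives and decoys sent to training

train_act, test_act = split_set(actives, m)
train_dec, test_dec = split_set(decoys, n)

# Label actives 1 and decoys 0 so a two-class model can be trained.
training_set = [(c, 1) for c in train_act] + [(c, 0) for c in train_dec]
internal_test_set = [(c, 1) for c in test_act] + [(c, 0) for c in test_dec]

print(len(training_set), len(internal_test_set))
```

Fixing the random seed keeps the split reproducible, which matters when models built at different active-to-decoy ratios are compared on the same internal test set.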
Fig. S3 Comparison of statistical values of 2D properties for (a) VEGFR-2 actives, decoys, and
compounds in Specs and (b) the 10 compound databases.
Fig. S4 Tree Maps of the ChemBridge database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S5 Tree Maps of the ChemDiv database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S6 Tree Maps of the DrugBank database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S7 Tree Maps of the HitFinder database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S8 Tree Maps of the NCI database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S9 Tree Maps of the ZINC database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S10 Tree Maps of the ChEMBL database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S11 Tree Maps of the Specs database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S12 Tree Maps of the KLIFS database after molecular weight filtration. (a) Murcko scaffolds;
(b) Recap scaffolds; (c) UserDefined scaffolds. Scaffolds are represented by colored circles; the color
and area of each circle are proportional to the scaffold frequency, and gray circles represent clusters of
scaffolds. The Tree Maps illustrate the presence of highly populated scaffolds (few large green circles) and
the large proportion of singleton scaffolds in the databases (many small white circles).
Fig. S13 Finally selected 20 hit scaffolds based on Bayesian score and molecular docking analysis.
Scaffold          Molecular Weight  Bayesian Score  Docking Score
Hit_Scaffold_1    339.393           37.3637         -11.6542
Hit_Scaffold_2    345.355           29.4181         -11.1865
Hit_Scaffold_3    343.357           35.8301         -11.1164
Hit_Scaffold_4    343.382           35.6318         -11.0626
Hit_Scaffold_5    321.33            27.0226         -11.0587
Hit_Scaffold_6    343.379           33.4276         -10.9413
Hit_Scaffold_7    342.357           29.4108         -10.8715
Hit_Scaffold_8    304.349           38.055          -10.8326
Hit_Scaffold_9    292.335           33.0219         -10.7598
Hit_Scaffold_10   325.406           25.451          -10.7064
Hit_Scaffold_11   344.363           46.7105         -10.6781
Hit_Scaffold_12   305.354           28.4191         -10.6695
Hit_Scaffold_13   349.388           47.7709         -10.6677
Hit_Scaffold_14   344.41            29.1089         -10.6415
Hit_Scaffold_15   346.386           30.7082         -10.6327
Hit_Scaffold_16   315.372           41.9494         -10.6346
Hit_Scaffold_17   306.322           27.4236         -10.6208
Hit_Scaffold_18   343.382           25.0921         -10.6074
Hit_Scaffold_19   316.36            29.0655         -10.584
Hit_Scaffold_20   275.305           30.3625         -10.577
Table S1 Definition of classification model performance measures
Statistics: AUC. Full name: Area under curve. Description: Area under the ROC curve, which represents the probability of ranking an arbitrarily chosen active higher than a decoy processed in the same way. It ranges from 0 to 1. Formula: -
Statistics: TP. Full name: True positives. Description: Number of accurately identified actives. Formula: -
Statistics: FP. Full name: False positives. Description: Number of inactives incorrectly predicted as actives. Formula: -
Statistics: FN. Full name: False negatives. Description: Number of actives incorrectly predicted to be inactives. Formula: -
Statistics: TN. Full name: True negatives. Description: Number of correctly identified inactives. Formula: -
Statistics: EF x%. Full name: Enrichment factor. Description: Enrichment factor (EF) calculated on a given percentage of the screened hits. Formula: $EF_{x\%} = \dfrac{Hits_{selected}^{x\%} / N_{selected}^{x\%}}{Hits_{total} / N_{total}}$
Statistics: Se. Full name: Selectivity. Description: Ratio of the retrieved true positive compounds (TP) to all active compounds in the database (TP+FN); ranges from 0 to 1. Formula: $Se = \dfrac{TP}{TP + FN}$
Statistics: Sp. Full name: Specificity. Description: Fraction of the truly inactive compounds (TN+FP) that are correctly identified as inactive; ranges from 0 to 1. Formula: $Sp = \dfrac{TN}{TN + FP}$
Statistics: Accuracy. Full name: Accuracy. Description: Proportion of true predictions in the total database; ranges from 0 to 1. Formula: $Accuracy = \dfrac{TP + TN}{N_{total}}$
Statistics: Precision. Full name: Precision. Description: Ability to correctly predict positive results; ranges from 0 to 1. Formula: $Precision = \dfrac{TP}{TP + FP}$
Statistics: Kappa. Full name: Kappa. Description: Considered the true accuracy of a classification model; it outperforms accuracy in evaluating the model's prediction power. A model with a kappa value over 0.4 is considered useful. Formula: $Kappa = \dfrac{(TP+TN) - \frac{(TP+FN)(TP+FP) + (FP+TN)(FN+TN)}{N}}{N - \frac{(TP+FN)(TP+FP) + (FP+TN)(FN+TN)}{N}}$
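The measures defined in Table S1 can be written out directly from the four confusion-matrix counts. The Python sketch below follows the table's formulas term by term. The counts passed to `classification_metrics` are the best-split values reported for KDRDrugLike1vs60 in Table S3, whereas the counts passed to `enrichment_factor` are hypothetical and chosen only to illustrate the call. Both function names are illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute Se, Sp, Accuracy, Precision and Kappa as defined in Table S1."""
    n = tp + fn + fp + tn
    # Agreement expected by chance, scaled by N as in the Kappa formula.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n
    return {
        "Se": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Accuracy": (tp + tn) / n,
        "Precision": tp / (tp + fp),
        "Kappa": ((tp + tn) - expected) / (n - expected),
    }


def enrichment_factor(hits_selected, n_selected, hits_total, n_total):
    """EF = (Hits_selected / N_selected) / (Hits_total / N_total)."""
    return (hits_selected / n_selected) / (hits_total / n_total)


# Best-split counts for KDRDrugLike1vs60 (TP/FN, FP/TN = 200/2, 71/12112).
metrics = classification_metrics(tp=200, fn=2, fp=71, tn=12112)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical screen: 8 actives among the 10 top-ranked of 1000 compounds,
# of which 100 are active in total.
ef = enrichment_factor(hits_selected=8, n_selected=10, hits_total=100, n_total=1000)
print(f"EF: {ef:.1f}")
```

Because Kappa subtracts the agreement expected by chance, it stays informative when decoys greatly outnumber actives, a situation in which Accuracy alone is dominated by the true negatives.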
Table S2 Statistical values comparison of 2D properties of 10 compound databases
DUD-E_VEGFR-2         ChemBridge  ChemDiv  DrugBank  HitFinder  NCI     ZINC     ChEMBL   Specs   KLIFS  KDRActive  Average  StdEv
Number_of_molecules   100032      700000   6234      14400      252684  1190088  1286475  219059  1100   2542       -        -
Minimized_Energy/10   2.97        3.77     5.03      4.31       3.41    3.25     2.19     2.42    5.09   4.62       3.61     0.49
ALogP                 2.54        1.86     3.10      3.14       1.34    3.21     2.82     3.43    2.80   3.76       2.56     0.75
Molecular_Weight/100  3.36        3.94     3.46      3.25       3.24    3.00     4.24     3.66    3.92   4.30       3.50     0.26
Num_RotatableBonds    5           5        6         4          5       4        7        5       5      6          4.91     0.54
Num_Rings             3           4        2         3          2       3        3        3       4      4          2.93     0.48
Num_AromaticRings     2           3        1         2          2       2        2        2       3      4          2.02     0.40
Num_H_Acceptors       4           4        6         4          4       3        5        4       5      5          4.44     0.74
Num_H_Donors          1           1        3         1          1       1        2        1       3      2          1.41     0.78
Molecular_FPSA*10     2.10        2.33     3.42      2.75       2.57    2.47     2.44     2.37    2.81   2.34       2.65     0.50
Table S3 Statistics results of Bayesian categorization models
Leave-one-out Cross-Validation Results
Model Name         XV ROC AUC   Best Split   TP/FN, FP/TN      # in Category
KDRDrugLike1vs1    1.000        -0.379       199/3, 0/203      202
KDRDrugLike1vs5    1.000        6.390        198/4, 0/1010     202
KDRDrugLike1vs20   1.000        3.823        200/2, 18/3942    202
KDRDrugLike1vs60   1.000        4.728        200/2, 71/12112   202
Enrichment Results
Model Name         Category %   1%       5%       10%      25%      50%      75%      90%      95%      99%
KDRDrugLike1vs1    49.877%      2.50%    10.40%   20.30%   50.50%   99%      100%     100%     100%     100%
KDRDrugLike1vs5    16.667%      6.40%    30.20%   60.40%   100%     100%     100%     100%     100%     100%
KDRDrugLike1vs20   4.853%       20.80%   98%      100%     100%     100%     100%     100%     100%     100%
KDRDrugLike1vs60   1.631%       61.40%   100%     100%     100%     100%     100%     100%     100%     100%
Percentile Results (1%/99%)
Model Name         99%      95%      90%      70%      50%      30%      10%      5%       1%
KDRDrugLike1vs1    -1.219   3.437    5.965    8.758    8.758    20.066   22.860   25.388   30.044
KDRDrugLike1vs5    2.192    11.623   16.742   22.401   22.401   45.303   50.962   56.081   65.511
KDRDrugLike1vs20   4.941    17.543   24.384   31.945   31.945   62.550   70.111   76.952   89.554
KDRDrugLike1vs60   5.788    19.766   27.353   35.740   35.740   69.686   78.072   85.660   99.638
Category Statistics Results
Model Name         Category N  Category Mean (±StdDev)  Noncategory N  Noncategory Mean (±StdDev)
KDRDrugLike1vs1    202         14.41 (±6.65)            203            -20.29 (±8.14)
KDRDrugLike1vs5    202         33.85 (±13.47)           1010           -27.72 (±12.23)
KDRDrugLike1vs20   202         47.25 (±18.00)           3960           -29.46 (±13.28)
KDRDrugLike1vs60   202         52.71 (±19.97)           12183          -29.90 (±13.91)
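Model Construction Information
KDRDrugLike1vs1
Property                     Original   TooFew   Noninfo*   Final
ALogP                        11         0        4          7
Molecular_Weight             11         0        2          9
Num_H_Donors                 5          0        0          5
Num_H_Acceptors              6          0        1          5
Num_Rings                    5          0        0          5
Num_AromaticRings            6          0        0          6
Num_RotatableBonds           8          0        4          4
Molecular_PolarSurfaceArea   11         0        3          8
Molecular_Solubility         11         0        3          8
ECFP_4                       5184       0        99         5085
KDRDrugLike1vs5
Property                     Original   TooFew   Noninfo*   Final
ALogP                        11         0        1          10
Molecular_Weight             11         0        0          11
Num_H_Donors                 4          0        0          4
Num_H_Acceptors              7          0        0          7
Num_Rings                    4          0        0          4
Num_AromaticRings            5          0        0          5
Num_RotatableBonds           8          0        2          6
Molecular_PolarSurfaceArea   11         0        3          8
Molecular_Solubility         11         0        1          10
ECFP_4                       11895      0        37         11858
KDRDrugLike1vs20
Property                     Original   TooFew   Noninfo*   Final
ALogP                        11         0        2          9
Molecular_Weight             11         0        0          11
Num_H_Donors                 4          0        0          4
Num_H_Acceptors              6          0        1          5
Num_Rings                    5          0        1          4
Num_AromaticRings            5          0        0          5
Num_RotatableBonds           8          0        2          6
Molecular_PolarSurfaceArea   11         0        2          9
Molecular_Solubility         11         0        1          10
ECFP_4                       23808      0        33         23775
KDRDrugLike1vs60
Property                     Original   TooFew   Noninfo*   Final
ALogP                        11         0        1          10
Molecular_Weight             11         0        1          10
Num_H_Donors                 4          0        0          4
Num_H_Acceptors              6          0        1          5
Num_Rings                    5          0        1          4
Num_AromaticRings            4          0        0          4
Num_RotatableBonds           8          0        1          7
Molecular_PolarSurfaceArea   11         0        2          9
Molecular_Solubility         11         0        2          9
ECFP_4                       40822      0        24281      16541
*Noninfo stands for Noninformative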
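Several entries of Table S3 are linked by simple arithmetic, and the Python sketch below reproduces three of these links from the tabulated values. It assumes that Category % is the share of category members (actives) in the whole set, that the enrichment factor at the top 1% equals the percentage of actives recovered divided by the percentage screened (which follows from the EF definition in Table S1), and that Final = Original - TooFew - Noninfo in the Model Construction Information. The function and variable names are illustrative.

```python
ACTIVES = 202  # "# in Category" for every model in Table S3

# Noncategory N per model, from the Category Statistics Results.
NONCATEGORY_N = {
    "KDRDrugLike1vs1": 203,
    "KDRDrugLike1vs5": 1010,
    "KDRDrugLike1vs20": 3960,
    "KDRDrugLike1vs60": 12183,
}

# Percentage of actives recovered in the top 1% (Enrichment Results).
ACTIVES_IN_TOP_1PCT = {
    "KDRDrugLike1vs1": 2.50,
    "KDRDrugLike1vs5": 6.40,
    "KDRDrugLike1vs20": 20.80,
    "KDRDrugLike1vs60": 61.40,
}

# ECFP_4 counts as (Original, TooFew, Noninfo, Final).
ECFP4_BINS = {
    "KDRDrugLike1vs1": (5184, 0, 99, 5085),
    "KDRDrugLike1vs5": (11895, 0, 37, 11858),
    "KDRDrugLike1vs20": (23808, 0, 33, 23775),
    "KDRDrugLike1vs60": (40822, 0, 24281, 16541),
}


def category_percent(n_category, n_noncategory):
    """Share of category members in the full set, in percent."""
    return 100.0 * n_category / (n_category + n_noncategory)


def enrichment_factor(recovered_pct, screened_pct):
    """EF as the share of actives recovered divided by the share screened."""
    return recovered_pct / screened_pct


category_pct = {m: category_percent(ACTIVES, n) for m, n in NONCATEGORY_N.items()}
ef_top_1pct = {m: enrichment_factor(p, 1.0) for m, p in ACTIVES_IN_TOP_1PCT.items()}
bins_consistent = {m: o - t - n == f for m, (o, t, n, f) in ECFP4_BINS.items()}

for model in NONCATEGORY_N:
    print(model, round(category_pct[model], 3), ef_top_1pct[model], bins_consistent[model])
```

Under these assumptions the computed Category % values agree with the tabulated ones to three decimals, and the top-1% enrichment factor of KDRDrugLike1vs60 comes out at 61.4, close to the training-set EF of 61.39 listed in Table S4.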
Table S4 Comparison of the Bayesian model with those in other published papers
Papers      Parameters          EF      Se     Sp     Accuracy   Precision
Manuscript  Training set        61.39   0.61   1.00   0.99       1.00
Manuscript  Internal test set   61.35   0.61   1.00   0.99       1.00
Manuscript  External test set   77.30   0.77   1.00   1.00       0.72
Ref. 16     Training set        -       0.89   0.93   0.91       0.93
Ref. 16     Test set            -       0.81   0.88   0.84       0.86
Ref. 19     Training set        2.49    0.97   0.97   0.97       0.95
Ref. 19     Test set            1.51    0.80   0.77   0.78       0.79
Ref. 19     Cross-validation    1.63    0.69   0.74   0.72       0.62
Ref. 22     100 nM cutoff       3.27    -      -      0.88       -
Ref. 22     500 nM cutoff       2.31    -      -      0.88       -
Table S5 Percentile results for Bayesian model-based VS
Bayesian model (KDRDrugLike1vs60)-based VS for Databases
Vendor DBs   Score > 35  Score > 30  Score > 25  Score > 20  Score > 15  Score > 10
ChemBridge   0.32%       0.77%       1.33%       1.85%       2.32%       2.66%
ChemDiv      2.41%       5.52%       9.05%       12.25%      14.81%      16.98%
DrugBank     0.28%       0.27%       0.29%       0.24%       0.22%       0.19%
HitFinder    0.20%       0.25%       0.29%       0.32%       0.39%       0.41%
NCI          0.61%       1.04%       1.58%       2.16%       2.74%       3.36%
ZINC         3.03%       5.93%       9.91%       13.90%      17.78%      20.94%
ChEMBL       78.16%      75.13%      69.19%      62.80%      56.41%      50.80%
Specs        0.38%       0.92%       1.79%       2.48%       2.95%       3.26%
KLIFS        0.91%       0.73%       0.61%       0.44%       0.35%       0.25%
KDRActive    11.08%      7.69%       4.89%       2.94%       1.70%       0.96%
Bayesian model (KDRDrugLike1vs60)-based VS for Fragments
Vendor DBs   Score > 35  Score > 30  Score > 25  Score > 20  Score > 15  Score > 10
ChemBridge   2.43%       4.98%       4.62%       4.51%       4.74%       5.52%
ChemDiv      3.40%       7.47%       8.56%       8.84%       9.98%       11.00%
DrugBank     0.49%       0.26%       0.42%       0.41%       0.41%       0.36%
HitFinder    0.00%       0.13%       0.30%       0.48%       0.60%       0.78%
NCI          0.00%       0.66%       1.44%       1.67%       2.19%       2.89%
ZINC         18.93%      28.96%      37.61%      42.62%      45.28%      46.43%
ChEMBL       30.10%      32.11%      32.06%      31.99%      29.45%      26.72%
Specs        0.49%       1.57%       1.69%       1.96%       2.62%       3.44%
KLIFS        1.94%       1.83%       1.23%       1.04%       0.86%       0.69%
KDRActive    42.23%      22.02%      12.07%      6.51%       3.88%       2.16%
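A quick way to see how the fragment percentages in Table S5 behave is to add up each score-cutoff column. The Python sketch below does this for the fragment sub-table, with the rows taken in the order ChemBridge, ChemDiv, DrugBank, HitFinder, NCI, ZINC, ChEMBL, Specs, KLIFS, KDRActive. Reading each entry as that source's share of all fragments above the cutoff is an interpretation suggested by the sums, not a statement made in the table. The names `FRAGMENT_SHARE` and `column_sums` are illustrative.

```python
CUTOFFS = [35, 30, 25, 20, 15, 10]  # Bayesian score thresholds (Score > cutoff)

# Percentages from the fragment sub-table of Table S5, one list per source,
# ordered to match CUTOFFS.
FRAGMENT_SHARE = {
    "ChemBridge": [2.43, 4.98, 4.62, 4.51, 4.74, 5.52],
    "ChemDiv": [3.40, 7.47, 8.56, 8.84, 9.98, 11.00],
    "DrugBank": [0.49, 0.26, 0.42, 0.41, 0.41, 0.36],
    "HitFinder": [0.00, 0.13, 0.30, 0.48, 0.60, 0.78],
    "NCI": [0.00, 0.66, 1.44, 1.67, 2.19, 2.89],
    "ZINC": [18.93, 28.96, 37.61, 42.62, 45.28, 46.43],
    "ChEMBL": [30.10, 32.11, 32.06, 31.99, 29.45, 26.72],
    "Specs": [0.49, 1.57, 1.69, 1.96, 2.62, 3.44],
    "KLIFS": [1.94, 1.83, 1.23, 1.04, 0.86, 0.69],
    "KDRActive": [42.23, 22.02, 12.07, 6.51, 3.88, 2.16],
}

# Sum every column (one per cutoff) over all sources.
column_sums = [
    sum(shares[i] for shares in FRAGMENT_SHARE.values())
    for i in range(len(CUTOFFS))
]

for cutoff, total in zip(CUTOFFS, column_sums):
    print(f"Score > {cutoff}: {total:.2f}%")
```

Each column adds up to within 0.05 percentage points of 100%, which is consistent with the entries being shares of one pooled set of fragments at each cutoff.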