PubChem Structure-Activity Relationship Clusters
(and the Difference between 2-D and 3-D Similarity)
Volker D. Hähnke, Lianyi Han, Sunghwan Kim, Evan E. Bolton, Stephen H. Bryant
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda (MD), USA
Description of Chemical Sample
- 127 million substances
Unique Chemical Structures, Aggregated Information
- 48 million compounds
- Structure, Activity, Properties, Metabolic pathways, Vendors, Publishers, Patents
Bioactivity Information, Assays & Results
- 720 thousand assays
- 215 million bioactivity results
PubChem 3D
Generation of 3-D Conformers*
• ≤ 50 non-hydrogen atoms
• ≤ 15 rotatable bonds
• H, C, N, O, F, Si, P, S, Cl, Br, and I
• 1 covalent unit
• ≤ 6 undefined stereo centers
• All atom types supported by MMFF94
Yields
• Conformers
• 3-D descriptors
• 3-D similarities
Thematic Series in J Cheminf
*Bolton, Kim & Bryant; J Cheminf 2011; 3:4.
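The eligibility criteria above can be read as a simple filter. The following is a minimal sketch, not the PubChem3D implementation: the `CompoundSummary` record and its field names are hypothetical, the property values are assumed to be precomputed elsewhere, and the mapping of each failed criterion to a reason label (borrowed from the coverage legend) is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Elements listed on the slide as allowed for conformer generation.
ALLOWED_ELEMENTS = frozenset(
    {"H", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I"}
)


@dataclass(frozen=True)
class CompoundSummary:
    """Hypothetical record of precomputed properties (not a PubChem type)."""
    heavy_atoms: int
    rotatable_bonds: int
    elements: frozenset
    covalent_units: int
    undefined_stereo_centers: int
    mmff94_typed: bool  # True if every atom has an MMFF94 atom type


def eligibility_problems(c: CompoundSummary) -> List[str]:
    """Return the reasons a compound fails the slide's criteria (empty if none)."""
    problems = []
    if c.heavy_atoms > 50:
        problems.append("too big")
    if c.rotatable_bonds > 15:
        problems.append("too flexible")
    if not c.elements <= ALLOWED_ELEMENTS or not c.mmff94_typed:
        problems.append("atom type not supported")
    if c.covalent_units != 1:
        problems.append("complex/mixture")
    if c.undefined_stereo_centers > 6:
        problems.append("undefined stereo")
    return problems
```

A compound with an empty problem list would be passed on to conformer generation; anything else would be skipped with the listed reasons.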
PubChem 3D
Coverage*
92.3% have at least 1 conformer
[Pie chart. Legend: has conformer; salt (parent has conformer); too flexible; atom type not supported; too big; undefined stereo; complex/mixture; failed. Slice values: 89.6%, 4.9%, 2.7%, 0.9%, 0.8%, 0.4%, 0.3%, 0.3%]
Method
• OpenEye OMEGA
• MMFF94, no Coulomb term
• Custom RMSD sampling threshold
• Up to 500 conformers per Compound
[Structure images: CID 23666729, CID 2244]
*Bolton et al.; J Cheminf 2011; 3:32.
PubChem – Neighboring
Instant access to structurally similar compounds (pre-computed)
2-D
• PubChem Fingerprint - 881 Bits: Atoms, Rings, Atom pairs, Atom environments, More specific substructures, …
• Tanimoto Coefficient $T = \frac{C}{A + B - C} \geq 0.9$, where A = #Bits set in A, B = #Bits set in B, C = #Bits set in A&B
3-D
• Overlay of Volumes (ROCS)
• Shape Tanimoto (ST) ≥0.8
• Color Tanimoto (CT) ≥0.5, differentiates 6 different features f
• Combo Tanimoto combines ST and CT
• ComboTST (ST-optimized) & ComboTCT (CT-optimized)
• ≤ 10 conformers per compound; maximum similarity between pairs
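The 2-D Tanimoto coefficient named above is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal sketch, assuming each fingerprint is held as a Python integer used as a bit set (an illustrative representation, not PubChem's storage format) and that two empty fingerprints score 0.0 by convention:

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as integers."""
    a = bin(fp_a).count("1")          # bits set in A
    b = bin(fp_b).count("1")          # bits set in B
    c = bin(fp_a & fp_b).count("1")   # bits set in both A and B
    union = a + b - c
    return c / union if union else 0.0  # assumed convention for empty input


def is_2d_neighbor(fp_a: int, fp_b: int, threshold: float = 0.9) -> bool:
    """Apply the 2-D neighboring threshold given on the slide."""
    return tanimoto(fp_a, fp_b) >= threshold
```

For example, `tanimoto(0b1110, 0b0111)` shares two of four distinct set bits and evaluates to 0.5, well below the 0.9 neighboring threshold.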
PubChem – Neighboring
Compounds (CID): 1302 naproxen, 2581 carprofen, 3332 felbinac, 3394 flurbiprofen, 3826 ketorolac, 5468 surgam, 5733 zomepirac, 39941 benoxaprofen
Upper triangle: 3-D Similarity (ST / CT); lower triangle: 2-D Similarity

CID    1302       2581       3332       3394       3826       5468       5733       39941
1302   -          0.92/0.55  0.89/0.41  0.84/0.53  0.84/0.28  0.83/0.39  0.80/0.34  0.84/0.56
2581   0.43       -          0.92/0.50  0.92/0.52  0.87/0.27  0.90/0.35  0.81/0.38  0.84/0.29
3332   0.70       0.49       -          0.95/0.73  0.86/0.39  0.86/0.50  0.87/0.59  0.83/0.21
3394   0.70       0.49       0.94       -          0.86/0.40  0.87/0.59  0.88/0.45  0.75/0.43
3826   0.42       0.65       0.49       0.49       -          0.92/0.52  0.81/0.70  0.79/0.17
5468   0.57       0.43       0.71       0.71       0.52       -          0.78/0.60  0.79/0.32
5733   0.42       0.72       0.48       0.48       0.86       0.51       -          0.64/0.27
39941  0.49       0.69       0.39       0.40       0.62       0.41       0.70       -

Bolton, Kim & Bryant; J Cheminf 2011; 3:13.
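To make the difference between the two neighboring relations concrete, the matrix values can be checked against the neighboring thresholds stated earlier (2-D Tanimoto ≥ 0.9; ST ≥ 0.8 and CT ≥ 0.5). The sketch below transcribes the values from the slide; applying the thresholds directly to the displayed ST / CT pairs is an assumption made for illustration.

```python
CIDS = [1302, 2581, 3332, 3394, 3826, 5468, 5733, 39941]

# Upper triangle, row by row: (ST, CT) of CIDS[i] against the CIDs after it.
SHAPE_COLOR = [
    [(0.92, 0.55), (0.89, 0.41), (0.84, 0.53), (0.84, 0.28),
     (0.83, 0.39), (0.80, 0.34), (0.84, 0.56)],
    [(0.92, 0.50), (0.92, 0.52), (0.87, 0.27), (0.90, 0.35),
     (0.81, 0.38), (0.84, 0.29)],
    [(0.95, 0.73), (0.86, 0.39), (0.86, 0.50), (0.87, 0.59), (0.83, 0.21)],
    [(0.86, 0.40), (0.87, 0.59), (0.88, 0.45), (0.75, 0.43)],
    [(0.92, 0.52), (0.81, 0.70), (0.79, 0.17)],
    [(0.78, 0.60), (0.79, 0.32)],
    [(0.64, 0.27)],
]

# Lower triangle, row by row: 2-D Tanimoto of CIDS[k + 1] against CIDS[0..k].
TANIMOTO_2D = [
    [0.43],
    [0.70, 0.49],
    [0.70, 0.49, 0.94],
    [0.42, 0.65, 0.49, 0.49],
    [0.57, 0.43, 0.71, 0.71, 0.52],
    [0.42, 0.72, 0.48, 0.48, 0.86, 0.51],
    [0.49, 0.69, 0.39, 0.40, 0.62, 0.41, 0.70],
]


def neighbors_3d(st_min=0.8, ct_min=0.5):
    """Pairs whose ST and CT both reach the 3-D thresholds."""
    pairs = []
    for i, row in enumerate(SHAPE_COLOR):
        for offset, (st, ct) in enumerate(row):
            if st >= st_min and ct >= ct_min:
                pairs.append((CIDS[i], CIDS[i + 1 + offset]))
    return pairs


def neighbors_2d(t_min=0.9):
    """Pairs whose 2-D Tanimoto reaches the 2-D threshold."""
    pairs = []
    for k, row in enumerate(TANIMOTO_2D):
        for j, t in enumerate(row):
            if t >= t_min:
                pairs.append((CIDS[j], CIDS[k + 1]))
    return pairs
```

Under these assumptions only one pair (felbinac and flurbiprofen) qualifies as 2-D neighbors, while eleven pairs qualify as 3-D neighbors.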
Structure-Activity-Clustering
PubChem Neighboring
Instant access to structurally similar compounds (pre-computed)
not necessarily biologically similar
Structure-Activity-Relationship Clusters
Instant access to structurally and biologically similar compounds
Structure-Activity-Clustering
Bioactivity
[Diagram: outcome categories inactive, undecided / inconclusive, active; label "Non-Inactive"]
• Active
o In at least one of 548,071 Assays (PubChem AID): 843,845 Compounds
o Against one of 4,280 Proteins (NCBI GI number): 400,599 Compounds
o Modulate at least one of 4,540 Pathways (BioSystems BSID): 265,470 Compounds
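Selecting the compound sets listed above amounts to grouping active outcomes by assay. A minimal sketch, assuming activity data arrive as hypothetical `(cid, aid, outcome)` tuples with outcome strings matching the slide's categories; the same grouping would apply to protein (GI) or pathway (BSID) identifiers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (compound CID, assay AID, outcome label).
Record = Tuple[int, int, str]


def active_by_assay(records: Iterable[Record]) -> Dict[int, Set[int]]:
    """Map each assay to the set of compounds with an 'active' outcome in it."""
    active = defaultdict(set)
    for cid, aid, outcome in records:
        if outcome == "active":
            active[aid].add(cid)
    return dict(active)


def active_compounds(records: Iterable[Record]) -> Set[int]:
    """Compounds that are active in at least one assay."""
    return {cid for cid, _aid, outcome in records if outcome == "active"}
```

Each per-assay set would then be the input to one clustering run.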
Structure-Activity-Clustering – Method
Leader Algorithm (Taylor-Butina Grouping)
1. Create a Nearest Neighbors List
2. Eliminate (real) Singletons
3. Find Compound with largest list
4. Group Compounds in largest list & eliminate from further consideration
Score Distribution* (randomly drawn pairs)
[Histogram: % Scores vs. Similarity; for 2-D, $\bar{x} = 0.4229$, $\sigma = 0.1326$]
$\text{distance} = 1 - \bar{x} - 2\sigma$

Similarity   distance
2-D          0.3119
ST           0.1502
ComboTST     0.4822
CT           0.6102
ComboTCT     0.4748

*Kim, Bolton & Bryant; J Cheminf 2012; 4:28.
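The four grouping steps can be sketched as follows. This is an illustrative reading of Taylor-Butina grouping, not the PubChem code: it assumes a full pairwise distance matrix, recounts neighbor lists among the remaining compounds after each group is removed, and breaks ties by lowest index. The cutoff helper assumes the distance threshold is 1 minus the mean similarity minus two standard deviations, which reproduces the 0.3119 listed for 2-D from 0.4229 and 0.1326.

```python
from typing import List, Sequence, Tuple


def distance_threshold(mean_similarity: float, sd_similarity: float) -> float:
    """Assumed cutoff: 1 - mean - 2 * standard deviation of random-pair scores."""
    return 1.0 - mean_similarity - 2.0 * sd_similarity


def taylor_butina(
    dist: Sequence[Sequence[float]], threshold: float
) -> Tuple[List[List[int]], List[int]]:
    """Return (clusters, real singletons) for a square distance matrix."""
    n = len(dist)
    # Step 1: nearest-neighbor list for every compound.
    neighbors = {
        i: {j for j in range(n) if j != i and dist[i][j] <= threshold}
        for i in range(n)
    }
    # Step 2: real singletons have no neighbor at all; set them aside.
    singletons = sorted(i for i in neighbors if not neighbors[i])
    remaining = {i for i in neighbors if neighbors[i]}
    clusters = []
    while remaining:
        # Step 3: compound with the largest list among those still available.
        leader = max(sorted(remaining),
                     key=lambda i: len(neighbors[i] & remaining))
        # Step 4: group it with its available neighbors and remove them all.
        members = {leader} | (neighbors[leader] & remaining)
        clusters.append(sorted(members))
        remaining -= members
    return clusters, singletons
```

A compound whose neighbors were all claimed by earlier groups ends up alone in a cluster of size one; this is distinct from a real singleton, which never had a neighbor.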
Structure-Activity-Clustering – Results
Taylor-Butina Grouping
[Bar charts: # Clusters per similarity measure (2-D, ComboTCT, CT, ComboTST, ST) for Assays (843,845 Compounds), Proteins (400,599 Compounds), and Pathways (265,470 Compounds)]
Structure-Activity-Clustering – Results
Cluster(-ing) Statistics
[Charts for Assays, Proteins, and Pathways: Compounds In Cluster vs. Singletons for each similarity measure (ST, ComboTST, CT, ComboTCT, 2-D); Absolute Frequency vs. Cluster Size]

Similarity   Cluster Size (x̄)
ST           4.0 ± 5.2
ComboTST     5.3 ± 7.8
CT           5.9 ± 9.5
ComboTCT     5.4 ± 8.3
2-D          8.2 ± 13.8
Structure-Activity-Clustering – Results
[Example clusters (images): 4 Compounds, 14 Conformers; 4 Compounds, 16 Conformers]
Structure-Activity-Clustering – Results
High Value Compounds
• More reliable information: IC50 / EC50 < 10 µM; Has MeSH annotation
• Mainly in bigger clusters
[Diagrams: share of clusters per grouping, with the three component values shown on the slide]
Assay: 43% (23%, 5.1%, 14.9%)
Protein: 49.5% (29.4%, 10.7%, 9.3%)
Pathway: 50.9% (25.2%, 8.7%, 17%)
2-D & 3-D Similarity
Clustering Differences
• O(i,j): % Overlapping compounds in clusters for a given UID between similarity measures i & j
(rows: Method i; columns: Method j)

Assay O(i,j)
            ST    ComboTST  CT    ComboTCT  2-D
ST          -     79        77    79        76
ComboTST    73    -         85    88        83
CT          71    85        -     86        83
ComboTCT    72    87        86    -         84
2-D         69    82        82    83        -

Protein O(i,j)
            ST    ComboTST  CT    ComboTCT  2-D
ST          -     90        87    89        85
ComboTST    77    -         89    92        87
CT          75    89        -     90        87
ComboTCT    77    91        90    -         86
2-D         71    83        83    83        -

Pathway O(i,j)
            ST    ComboTST  CT    ComboTCT  2-D
ST          -     94        93    94        90
ComboTST    83    -         95    96        92
CT          81    93        -     95        92
ComboTCT    82    95        95    -         92
2-D         76    87        89    88        -
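One way to read O(i,j) is as the share of a compound's cluster under measure i that also appears in its cluster under measure j. The sketch below assumes exactly that normalization (by the size of the cluster from measure i), which would explain why the reported values are not symmetric; the slide does not spell out the formula, so treat it as illustrative.

```python
from typing import Set


def overlap_percent(cluster_i: Set[int], cluster_j: Set[int]) -> float:
    """Percent of compounds in cluster_i that are also in cluster_j.

    Both sets are the clusters containing the same compound for the same
    UID, obtained with similarity measures i and j respectively.
    """
    if not cluster_i:
        return 0.0
    return 100.0 * len(cluster_i & cluster_j) / len(cluster_i)
```

With a four-compound cluster from one measure and a three-compound cluster from another that share two compounds, the value is 50% in one direction and about 67% in the other.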
PubChem Cluster Explorer
Public Resource:
https://pubchem.ncbi.nlm.nih.gov/sar/
…
CID 2244
• 555 AIDs: 4,401 Clusters
• 107 GIs: 1,562 Clusters
• 467 BSIDs: 5,902 Clusters
Cluster list columns: Similarity Method, No. Compounds, No. Conformers
…
Export
[Histogram: Absolute Frequency vs. Cluster Size]
PubChem Cluster Explorer
Cluster 697257876062209
• Similarity Method: 2D
• AID 162343: Inhibitory concentration in DMSO with purified human Prostaglandin G/H synthase 2 (COX-2)
• 47 Compounds / Conformers
PubChem Cluster Explorer
Export AIDs
Export GIs
Export BSIDs
Structure-Activity-Clustering
Structure-Activity-Relationship Clusters
Instant access to structurally and biologically similar compounds
Limitations
• No Inactives
• No quality measure for clusters
Current Work
Adding Inactives
• Neighbors to compounds in cluster
• Tested in the same assay
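The two criteria for adding inactives can be combined as a set filter. This is only a sketch of the idea as stated on the slide: the `neighbors` mapping (precomputed neighbor lists) and the `outcomes_in_assay` mapping (outcome per compound in the cluster's assay) are hypothetical inputs, and the outcome labels are assumed.

```python
from typing import Dict, Set


def inactive_candidates(
    cluster: Set[int],
    neighbors: Dict[int, Set[int]],
    outcomes_in_assay: Dict[int, str],
) -> Set[int]:
    """Inactive compounds that neighbor a cluster member in the same assay.

    Compounds missing from outcomes_in_assay were not tested in that assay
    and are therefore never returned.
    """
    candidates = set()
    for cid in cluster:
        for nb in neighbors.get(cid, set()):
            if nb not in cluster and outcomes_in_assay.get(nb) == "inactive":
                candidates.add(nb)
    return candidates
```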
Current Work
Adding Inactives
• Measure quality / modelability*
Establish “good” clusters
• How good is good enough?
• Suitable for model generation
*Golbraikh et al.; J Chem Inf Model 2014; 54:1-4.
• Steve Bryant
• Jiyao Wang
• Evan Bolton
• Siqian He
• Sunghwan Kim
• Jane He
• Lianyi Han
• Bo Yu
• Paul Thiessen
• Renata Geer
• Asta Gindulyte
• Ben Shoemaker
• Lewis Geer
• Gang Fu
• Yanli Wang
• Tiejun Cheng
• Jian Zhang
Mesa Analytics & Computing, Inc.
• John MacCuish
• Nora MacCuish
• Mitch Chapman
PubChem Cluster Explorer https://pubchem.ncbi.nlm.nih.gov/sar/
This research was supported [in part] by the Intramural Research Program of the NIH, National Library of Medicine.
Key Points
Structure-Activity Clusters
• Instant access to structurally & biologically similar compounds
• About 50% of clusters (43% to 50.9%, depending on assay, protein, or pathway grouping) have very reliable activity information
• Publicly accessible
• Inactives are incoming
https://pubchem.ncbi.nlm.nih.gov/sar/
2-D & 3-D Similarity:
• 2-D similarity is less restrictive
• Pure shape similarity is the most restrictive
• Feature similarity is similar to 2-D similarity