Y-scrambling

Quantitative Structure-Activity Relationships
Quantitative Structure-Property-Relationships
QSAR/QSPR modeling
Alexandre Varnek
Faculté de Chimie, ULP, Strasbourg, FRANCE
QSAR/QSPR models
• Development
• Validation
• Application
Development of QSAR models
• Selection and curation of experimental data
• Preparation of training and test sets (optional)
• Selection of an initial set of descriptors and their normalisation
• Variable selection
• Selection of a machine-learning method
Validation of models
• Training/test set
• Cross-validation: internal, external
Application of the models
• Models' applicability domain
Development of the QSAR models
• Experimental data
• Descriptors
• Mathematical techniques
• Statistical criteria
Preparation of training and test sets
[Workflow figure]
• Initial data set
• Splitting of the initial data set into training and test sets (test set: 10 – 15 %)
• Building of structure-property models on the training set
• Selection of the best models according to statistical criteria
• “Prediction” calculations using the best structure-property models
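The splitting step can be illustrated with a minimal sketch. It assumes a purely random split; the function name `split_train_test`, the default 15 % test fraction and the fixed seed are illustrative choices, not part of the original slides.

```python
import random

def split_train_test(items, test_fraction=0.15, seed=0):
    """Randomly split a data set into (training, test) lists.

    test_fraction is the share of compounds held out; the slides
    suggest 10-15 %. The seed only makes the sketch reproducible.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Keep at least one compound in the test set.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(100), test_fraction=0.15)
print(len(train), len(test))
```

A random split does not by itself guarantee a balanced test set, so in practice it may need to be stratified by activity.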
Recommendations to prepare a test set
• (i) experimental methods for determination of activities in
the training and test sets should be similar;
• (ii) the activity values should span several orders of
magnitude, but should not exceed activity values in the
training set by more than 10%;
• (iii) the balance between active and inactive compounds
should be respected for uniform sampling of the data.
References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994,
37, 2206-2215
Selection of descriptors for QSAR model
QSAR models should be reduced to a set of descriptors which is
as information rich but as small as possible.
Rules of thumb:
good “spread”,
5-6 structure points per descriptor.
Objective selection (independent variable only)
Statistical criteria of correlations
Pairwise selection (Forward or Backward Stepwise selection)
Principal Component Analysis
Partial Least Squares analysis
Genetic Algorithm
……………….
Subjective selection
Descriptors selection based on mechanistic studies
Preprocessing strategy for the derivation of models
for use in structure-activity relationships (QSARs)
1. identify a subset of columns (variables) with significant
correlation to the response;
2. remove columns (variables) with small variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.
D. C. Whitley, M. G. Ford, D. J. Livingstone
J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168
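Steps 2 and 3 of this strategy (removing low-variance columns and columns carrying no unique information) can be sketched as follows. This is an illustration only, not the published algorithm: the thresholds `min_variance` and `max_abs_corr` and the function names are assumptions.

```python
from statistics import pvariance

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def filter_descriptors(columns, min_variance=1e-6, max_abs_corr=0.95):
    """columns: dict mapping descriptor name -> list of values.

    Returns the names of the descriptors that are kept.
    """
    kept = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # step 2: (near-)constant column
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # step 3: duplicates information already kept
        kept.append(name)
    return kept

columns = {
    "a": [1, 2, 3, 4],
    "const": [5, 5, 5, 5],   # no variance
    "a2": [2, 4, 6, 8],      # perfectly correlated with "a"
    "b": [1, -1, 1, -1],
}
print(filter_descriptors(columns))
```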
Machine-Learning Methods
Fitting the models' parameters
Y = F(ai, Xi)
Xi - descriptors (independent variables)
ai - fitted parameters
The goal is to minimize the Residual Sum of Squares (RSS):
$$RSS = \sum_{i=1}^{N} (y_{exp,i} - y_{calc,i})^2$$
Multiple Linear Regression
Activity   Descriptor
Y1         X1
Y2         X2
…          …
Yn         Xn
Yi = a0 + a1 Xi1
[Figure: plot of Y against X]
Multiple Linear Regression
y = ax + b
Residual Sum of Squares (RSS):
$$RSS = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$$
[Figure: axes a and b]
Multiple Linear Regression
Activity   Descr 1   Descr 2   …   Descr m
Y1         X11       X12       …   X1m
Y2         X21       X22       …   X2m
…          …         …         …   …
Yn         Xn1       Xn2       …   Xnm
Yi = a0 + a1 Xi1 + a2 Xi2 +…+ am Xim
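A minimal sketch of how the coefficients a0 … am can be obtained by minimizing RSS. It solves the normal equations with plain Gauss-Jordan elimination to stay dependency-free; a numerical library would normally be used instead. The function names and the toy data are illustrative assumptions.

```python
def fit_mlr(X, y):
    """Return [a0, a1, ..., am] minimizing RSS for Yi = a0 + sum_j aj * Xij."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # intercept column
    p = len(rows[0])
    # Normal equations (A^T A) a = A^T y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict(coeffs, x):
    """Calculated activity for one compound with descriptor values x."""
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))

def rss(coeffs, X, y):
    """Residual sum of squares of the fitted model."""
    return sum((yi - predict(coeffs, xi)) ** 2 for xi, yi in zip(X, y))

# Toy data generated from Y = 1 + 2*X1 - 0.5*X2 (hypothetical values).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [2.0, 4.5, 5.0, 7.5, 8.5]
coeffs = fit_mlr(X, y)
print(coeffs, rss(coeffs, X, y))
```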
kNN (k Nearest Neighbors)
The activity Y of a compound is assessed as a weighted mean of the activities Yi of its k nearest neighbors in the chemical space.
[Figure: training set compounds in the Descriptor 1 / Descriptor 2 plane]
A. Tropsha, A. Golbraikh, 2003
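A minimal sketch of a kNN prediction. The slides do not specify the weighting scheme or the distance metric; inverse Euclidean distance weights are assumed here, and the function name `knn_predict` is illustrative.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Weighted mean of the activities of the k nearest training compounds.

    Weights are 1/distance (an assumption); an exact match returns the
    activity of the matching training compound directly.
    """
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    for d, y in neighbours:
        if d == 0.0:
            return y
    weights = [1.0 / d for d, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)

train_X = [(0, 0), (1, 0), (0, 1), (5, 5)]
train_y = [1.0, 2.0, 3.0, 10.0]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
```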
Biological and Artificial Neuron
Multilayer Neural Network
Neurons in the input layer correspond to descriptors, neurons in the output
layer – to properties being predicted, neurons in the hidden layer – to nonlinear
latent variables
Validating the QSAR Equation
How well does the model predict the activity of known compounds?
For a perfect model:
• All data points would reside on the diagonal.
• All variance existing in the original data is explained by the model.
r2 is the fraction of the total variation in the dependent variable that is explained by the regression equation.
[Figure: actual vs predicted activity]
Calculating r2
$$r^2 = \frac{\text{Explained variance}}{\text{Original variance}}$$
Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around the regression line)
[Figure: original variance and variance around the regression line]
Calculating r2
Original variance:
$$TSS = \sum_{i=1}^{N} (y_i - \langle y \rangle)^2$$
Explained variance (improvement in predicting y over just using the mean of y):
$$ESS = \sum_{i=1}^{N} (y_{calc,i} - \langle y \rangle)^2$$
Variance around the regression line:
$$RSS = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$$
$$r^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$
Example:
$$r^2 = \frac{3.49 - 0.40}{3.49} = \frac{3.09}{3.49} = 0.89$$
[Table: Compound Number | Log EC | Calculated Log EC | Residual; numerical values not recoverable]
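The r2 calculation can be sketched in a few lines. The observed and calculated values below are hypothetical toy data; only TSS = 3.49 and RSS = 0.40 come from the slide.

```python
def variance_decomposition(y_obs, y_calc):
    """Return (TSS, RSS, r2) for observed and calculated activities."""
    mean_y = sum(y_obs) / len(y_obs)
    tss = sum((y - mean_y) ** 2 for y in y_obs)                 # original variance
    rss = sum((y - yc) ** 2 for y, yc in zip(y_obs, y_calc))    # residual variance
    return tss, rss, 1.0 - rss / tss

# Hypothetical toy data.
tss, rss, r2 = variance_decomposition([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(tss, rss, r2)

# The slide's own numbers: TSS = 3.49, RSS = 0.40.
print(round(1 - 0.40 / 3.49, 2))
```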
F-test
Tests the assumption that a significant portion of the original variance has been explained by the model.
In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points) differs significantly from 0. A ratio of 0 would imply that ESS = 0, i.e., the model didn’t explain any of the variance.
F-distribution
As N decreases relative to k, the probability of getting large r2 values purely by chance increases.
Thus, for small data sets a larger F-value is required for the test to be significant.
Calculating F Values
$$F = \frac{ESS}{k} \cdot \frac{N - k - 1}{RSS} = \frac{r^2 (N - k - 1)}{k (1 - r^2)}$$
Calculate F according to the above equation.
Select a significance level (e.g., 0.05).
Look up the F-value from an F-distribution derived for the correct values of N and k at the selected significance level.
If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level.
Example:
r2 = 0.89, N = 7, k = 1, F = 40.46
For an F-distribution with N = 7, k = 1, a value of 40.46 corresponds to a confidence level of 0.9997. Thus, the equation is significant at this level. The probability that the correlation is fortuitous is < 0.03%.
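The F-value of the example can be recomputed with a short sketch; the function name `f_value` is illustrative. With the rounded inputs r2 = 0.89, N = 7, k = 1 it gives about 40.45, close to the quoted 40.46 (the small difference presumably comes from rounding of r2).

```python
def f_value(r2, n, k):
    """F = r2 * (N - k - 1) / (k * (1 - r2))."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    if n - k - 1 <= 0:
        raise ValueError("need N > k + 1 data points")
    return r2 * (n - k - 1) / (k * (1.0 - r2))

print(f_value(0.89, 7, 1))
```

The critical F-value still has to be taken from an F-distribution table (or a statistics library) for k and N - k - 1 degrees of freedom.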
Validation of Models
5-fold external cross-validation procedure
Cross Validation
A measure of the predictive ability of the model (as opposed to
the measure of fit produced by r2).
$$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^{N} (y_i - \langle y \rangle)^2}; \quad PRESS = \sum_{i=1}^{N} (y_{pred,i} - y_i)^2$$
$$r^2 = 1 - \frac{RSS}{\sum_{i=1}^{N} (y_i - \langle y \rangle)^2}; \quad RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2$$
r2 never decreases as more descriptors are added.
Q2 initially increases as more parameters are added but then starts to decrease, indicating data overfitting. Thus Q2 is a better indicator of the model quality.
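The difference between r2 and Q2 can be illustrated with a minimal leave-one-out sketch for a one-descriptor linear model. Leave-one-out is only one possible cross-validation scheme (the slides also mention a 5-fold procedure); the function names and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def r2_fit(xs, ys):
    """r2 = 1 - RSS/TSS, with y_calc from a model fitted on all data."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    rss = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
    return 1.0 - rss / sum((y - mean_y) ** 2 for y in ys)

def q2_loo(xs, ys):
    """Q2 = 1 - PRESS/TSS, with y_pred from models that never saw compound i."""
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (a * xs[i] + b - ys[i]) ** 2
    mean_y = sum(ys) / len(ys)
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)

# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
print(r2_fit(xs, ys), q2_loo(xs, ys))
```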
Other Model Validation Parameters
1. s is the standard deviation about the regression line. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR.
$$s^2 = \frac{\sum (y_{obs} - y_{calc})^2}{N - k - 1}$$
N is the number of observations and k is the number of variables.
2. Scrambling of y.
Statistical tests for « chance correlations »
Scrambling:
to mix randomly:
• Y values (Y-scrambling), or
• X values (X-scrambling), or
• simultaneously Y and X values (X,Y-scrambling)
Randomization:
to generate random numbers:
• from Ymin to Ymax (Y-randomization),
• from Xmin to Xmax (X-randomization),
• or do this simultaneously for Y and X (X,Y-randomization)
Calculate the statistical parameters of the correlations and compare them with those obtained for the model.
Scrambling
[Figure: the property values (Pro.1 … Pro.n) are randomly reassigned to the structures (Struc.1 … Struc.n). Plot of q2 against the number of variables (0 to 70).]
The lowest q2 = 0.51 in the top 10 models
The highest q2 = 0.14 for randomized datasets
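A minimal sketch of a Y-scrambling test. For brevity it scores each scrambled data set with the r2 of a one-descriptor linear fit instead of a cross-validated q2; the function names, the number of trials and the toy data are assumptions.

```python
import random

def r_squared(xs, ys):
    """r2 of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def y_scrambling(xs, ys, n_trials=20, seed=0):
    """Refit the model on randomly permuted Y values and collect r2."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # break the structure-property link
        scores.append(r_squared(xs, shuffled))
    return scores

# Hypothetical toy data with a real X-Y relationship.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]
real = r_squared(xs, ys)
scrambled = y_scrambling(xs, ys)
print(real, max(scrambled))
```

A model is suspicious if the scrambled data sets reach statistics comparable to those of the real model.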
Test compound
QSPR Models
Prediction Performance
Robustness of QSPR models
- Descriptors type;
- Descriptors selection;
- Machine-learning methods;
- Validation of models.
Applicability domain of models
Is a test compound similar to the training set compounds?
Applicability domain of QSAR models
The new compound will be predicted by the model only if:
Di ≤ <Dk> + Z × sk
with Z, an empirical parameter (0.5 by default)
[Figure: training set and test compounds in the Descriptor 1 / Descriptor 2 plane. A test compound inside the domain will be predicted; a test compound outside the domain will not be predicted.]
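A minimal sketch of a distance-based applicability domain check. The slide does not define Di, <Dk> and sk; here it is assumed that Di is the Euclidean distance from the test compound to its nearest training compound, and that <Dk> and sk are the mean and standard deviation of the nearest-neighbour distances within the training set. Function names are illustrative.

```python
import math
from statistics import mean, pstdev

def nearest_distance(point, others):
    """Euclidean distance from point to the closest member of others."""
    return min(math.dist(point, o) for o in others)

def ad_threshold(train_X, z=0.5):
    """<Dk> + Z * sk, computed from training-set nearest-neighbour distances."""
    nn = [nearest_distance(x, train_X[:i] + train_X[i + 1:])
          for i, x in enumerate(train_X)]
    return mean(nn) + z * pstdev(nn)

def in_domain(test_x, train_X, z=0.5):
    """True if the test compound satisfies Di <= <Dk> + Z * sk."""
    return nearest_distance(test_x, train_X) <= ad_threshold(train_X, z)

train_X = [(0, 0), (1, 0), (0, 1), (1, 1)]   # hypothetical descriptors
print(in_domain((0.5, 0.5), train_X))  # close to the training set
print(in_domain((5, 5), train_X))      # far from the training set
```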
Applicability domain of QSAR models
Range-based methods
• Bounding Box (BB)
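A bounding box can be sketched as the per-descriptor minimum and maximum over the training set; a test compound is inside the domain if every descriptor value falls within its range. The function names and data are illustrative assumptions.

```python
def bounding_box(train_X):
    """Per-descriptor (min, max) ranges over the training set."""
    return [(min(col), max(col)) for col in zip(*train_X)]

def in_bounding_box(x, box):
    """True if every descriptor of x lies within its training range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, box))

box = bounding_box([(0, 10), (2, 12), (1, 11)])   # hypothetical descriptors
print(box)
print(in_bounding_box((1, 11), box), in_bounding_box((3, 11), box))
```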
Should one use only one individual model or
many models ?
ensemble modeling
Hunting season …
Single hunter
Hunting season …
Many hunters
Ensemble modelling
Property (Y) predictions using best fit models
              model 1   model 2   …   mean ± s
Compound 1    Y11       Y12       …   <Y1> ± DY1
Compound 2    Y21       Y22       …   <Y2> ± DY2
…             …         …         …   …
Compound m    Ym1       Ym2       …   <Ym> ± DYm
Grubbs' test is used to exclude outliers.
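The consensus value for one compound can be sketched as the mean and sample standard deviation of the individual model predictions. The Grubbs outlier test is deliberately left out of this sketch; the function name and numbers are illustrative.

```python
from statistics import mean, stdev

def ensemble_prediction(predictions):
    """Return (<Y>, DY): mean and sample standard deviation over models."""
    return mean(predictions), stdev(predictions)

# Hypothetical predictions of three models for one compound.
y_mean, y_dev = ensemble_prediction([1.0, 2.0, 3.0])
print(f"{y_mean} ± {y_dev}")
```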
Calculation of Descriptors
[Workflow figure: DataSet → ISIDA FRAGMENTOR → the pattern matrix (fragment counts per compound) + property values → LEARNING STAGE: building of models → VALIDATION STAGE: QSAR models filtering -> selection of the most predictive ones → QSAR models]
Example: linear QSPR model
$$Property = a_0 + \sum_{i=1}^{k} a_i D_i$$
PROPERTYcalc = -0.36 * NC-C-C-N-C-C + 0.27 * NC=O + 0.12 * NC-N-C*C + …
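A minimal sketch of applying such a linear fragment-based model. Only the three coefficients shown on the slide are used; the remaining terms of the model are elided in the source and are not reproduced, so the result is not the real model's prediction. The intercept value, the fragment counts and the function name are hypothetical.

```python
def predict_property(fragment_counts, coefficients, intercept=0.0):
    """Property = a0 + sum_i a_i * D_i, where D_i are fragment counts."""
    return intercept + sum(
        a * fragment_counts.get(fragment, 0)
        for fragment, a in coefficients.items()
    )

# Coefficients quoted on the slide; further terms are elided there.
COEFFS = {"C-C-C-N-C-C": -0.36, "C=O": 0.27, "C-N-C*C": 0.12}

# Hypothetical fragment counts for one compound.
counts = {"C-C-C-N-C-C": 2, "C=O": 1}
print(predict_property(counts, COEFFS))
```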
Virtual screening with QSAR/QSPR models
Screening and hits selection
[Figure: Database → virtual screening with a QSPR model → hits (sent to experimental tests) and useless compounds]
Combinatorial Library Design
Generation of Virtual Combinatorial Libraries
[Figure: a Markush structure with substituents R1, R2 and R3 attached to a P=O centre; for the substituents shown, enumeration yields eight individual structures.]
The types of variation in Markush structures:
1. Substituent variation (R1): R1 = Me, Et, Pr
2. Position variation (R2): R2 = NH2
3. Frequency variation: (CH2)n, n = 1 – 3
4. Homology variation (R3): R3 = alkyl or heterocycle (only for patent search)
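Substituent variation can be illustrated by enumerating every combination of R-groups in a template. This is a string-level sketch only: the template notation and the substituent lists are hypothetical, and no chemical validity check is performed.

```python
from itertools import product

def enumerate_library(template, options):
    """Yield one string per combination of substituents.

    template uses {R1}, {R2}, ... placeholders; options maps each
    placeholder name to its list of allowed substituents.
    """
    names = list(options)
    for combo in product(*(options[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))

library = list(enumerate_library(
    "P(=O)({R1})({R2})({R3})",                       # hypothetical template
    {"R1": ["Me", "Et"], "R2": ["Me", "Et"], "R3": ["Me", "Et"]},
))
print(len(library), library[0])
```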
IN SILICO design of new
compounds
- Acquisition of Data;
- Acquisition of Knowledge;
- Exploitation of Knowledge
« In silico » design of new compounds
ISIDA combinatorial module
[Workflow figure with seven numbered steps; labels: Database; Filtering (1000 molecules/second); Similarity Search; QSAR models, Applicability domains; QSAR models, Assessment of properties; Hits selection; Synthesis and experimental tests]
The combinatorial module generates virtual libraries based on the Markush structures.
[Figure: Markush structure with substituents R1, R2 and R3]
COMPUTER-AIDED DESIGN OF NEW METAL BINDERS:
Binding of UO₂²⁺ by monoamides
[Figure: monoamide structure with substituents R1, R2, R3; R = H, alkyl]
$$D = \frac{[U]_{\text{organic phase}}}{[U]_{\text{aqueous phase}}}$$
A. Varnek, D. Fourches, V. Solov’ev, O. Klimchuk, A. Ouadi, I. Billard
J. Solv. Extr. Ion Exch., 2007, 25, N°4
SOLVENT EXTRACTION OF METALS
[Figure: species M1+, M2+, An- and L]
COMPUTER-AIDED DESIGN OF NEW METAL BINDERS:
Extraction of UO₂²⁺ by monoamides
Reprocessing of spent nuclear fuel
PUREX process
La Hague plant, France
TBP : tributyl phosphate
Goal: theoretical design of new uranyl binders more efficient than
previously studied molecules
1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960)
2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch, 17, 87 (1999)
[Workflow figure: DATABASE → DATA TREATMENT → ISIDA (EXPERT SYSTEM, PREDICTOR) → VIRTUAL SCREENING of a virtual library of 11,000 compounds → hits selection. Selected hits: 21 compounds]
“In silico” design of uranyl binders with ISIDA
[Figure: experimental vs predicted logD for the new amides (by ID); number of compounds vs logD for newly synthesized and previously studied amides]
Enrichment of the initial data set by new efficient extractants with logD > 0.9:
• 4 compounds (previously studied)
• 9 compounds (newly synthesized)
Classification Models
Confusion Matrix
• For N instances, K classes and a classifier
• Nij, the number of instances of class i classified as j
          Class1   Class2   …   ClassK
Class1    N11      N12      …   N1K
Class2    N21      N22      …   N2K
…         …        …        …   …
ClassK    NK1      NK2      …   NKK
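A minimal sketch of building the matrix Nij from true and predicted class labels; the function name and the labels are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] = number of instances of class i classified as j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

true = ["a", "a", "b", "b", "b"]   # hypothetical labels
pred = ["a", "b", "b", "b", "a"]
print(confusion_matrix(true, pred, ["a", "b"]))
```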
Classification Evaluation
Global measures of success
Measures are estimated on all classes
Local measures of success
Measures are estimated for each class
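The slides do not name specific measures. As common examples (an assumption), accuracy is a global measure computed over all classes, while recall and precision are local measures computed for each class from the confusion matrix.

```python
def accuracy(matrix):
    """Global measure: fraction of all instances on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def recall_per_class(matrix):
    """Local measure: N_ii divided by the row sum (true class i)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

def precision_per_class(matrix):
    """Local measure: N_ii divided by the column sum (predicted class i)."""
    columns = list(zip(*matrix))
    return [col[i] / sum(col) if sum(col) else 0.0
            for i, col in enumerate(columns)]

matrix = [[1, 1], [1, 2]]   # hypothetical 2-class confusion matrix
print(accuracy(matrix), recall_per_class(matrix), precision_per_class(matrix))
```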
The most fundamental and lasting
objective of synthesis is not
production of new compounds but
production of properties
George S. Hammond
Norris Award Lecture, 1968