Probabilistic Assessment of QSAR Applicability domain

PROBABILISTIC ASSESSMENT OF THE QSAR APPLICATION DOMAIN
Nina Jeliazkova (1), Joanna Jaworska (2)
(1) IPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
(2) Central Product Safety, Procter & Gamble, Brussels, Belgium
Abstract
A key element of prediction quality is determining whether a QSAR model is suitable for predicting the queried chemical, so as to prevent its potential misuse. The training data set from which a QSAR model is derived provides a basis for estimating its application domain. We demonstrate that the classic approaches, based on interpolation regions of the training data set, may not be sufficient. We propose a refined concept of interpolation regions using a probability density distribution approach. This approach is more robust because it can identify the actual distribution of the data. However, the presence of a point in an empty or a dense region can be used only as a warning about model applicability, not as a final decision on the correctness of the predicted values. Uncertainty estimation of the model prediction is necessary to complete the process of accepting or rejecting the model result.
What is an application domain?
How to assess it?
Application domain is
• Projection of the training set in the model’s parameter space
• NOT the space where the prediction uncertainty is known
• We need to find the interpolation regions in the model’s parameter space
• It is interpolation because in general QSAR models are statistical models
Classic approaches to estimate interpolation regions
• Descriptor ranges
• Distances
• Geometric
• Probabilistic for standard distributions (mostly 1D tests)
1D: parameter intervals determine interpolation region
2D, 3D, nD: are parameter intervals sufficient?
• If an experimental design was used during the model development, then the parameter space is covered in a homogenous way.
• BUT this is rare in QSAR model development and empty space within the parameter range is possible.
We need to refine the approach and be able to identify the empty regions within the interpolated space
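The limitation of descriptor ranges can be illustrated with a minimal sketch. The function name and the small training set below are hypothetical and serve only as an illustration: a query point can lie inside every parameter interval and still fall in a region with no training data.

```python
import numpy as np


def in_descriptor_ranges(X_train, x_query):
    """Range check: is the query inside the min/max interval of every descriptor?"""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    lower = X_train.min(axis=0)
    upper = X_train.max(axis=0)
    return bool(np.all((x_query >= lower) & (x_query <= upper)))


# Hypothetical training set: two tight clusters with empty space between them
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [4.8, 5.1], [5.1, 4.7]])

print(in_descriptor_ranges(X_train, [2.5, 2.5]))  # inside the ranges, but no data nearby
print(in_descriptor_ranges(X_train, [6.0, 2.5]))  # outside the range of the first descriptor
```

The range check accepts the first query even though it sits in the gap between the clusters, which is the situation the probabilistic approach is meant to detect.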
Generic probabilistic approach in a multivariate space
The algorithm we use to estimate the probability density for an n-dimensional data set is:
1. Standardize the training data set (scale and center);
2. Extract the principal components of the data set;
3. Perform a skewness correction transformation along each principal component;
4. Estimate the one-dimensional density on each transformed principal component;
5. Estimate the n-dimensional density p(x);
6. Estimate the smallest regions containing a specified percentage of the data points (e.g. 99%, 90%, 75%, 50%, 25%, 10% of the data; see the colored regions in the figure on the right);
7. For a query point (e.g. a new compound), transform the point according to steps 1, 2 and 3 and obtain the one-dimensional density value on each transformed principal component, then multiply over the principal components to obtain the n-dimensional density;
8. Calculate whether the query point is inside or outside the region of the parameter space containing the specified percentage of the training set data points.
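The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the poster: the class name DensityDomain is hypothetical, the skewness correction is assumed to be a Yeo-Johnson transform, the 1D densities are assumed to be Gaussian kernel estimates, and the region containing a given percentage of the data is approximated by a density threshold taken as a quantile of the training densities. The data are synthetic.

```python
import numpy as np
from scipy import stats


class DensityDomain:
    """Sketch of steps 1-8; transform, KDE and threshold choices are assumptions."""

    def fit(self, X, coverage=0.95):
        X = np.asarray(X, dtype=float)
        # Step 1: standardize (center and scale each descriptor)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0, ddof=1)
        Z = (X - self.mean_) / self.scale_
        # Step 2: principal components from the SVD of the standardized data
        _, _, vt = np.linalg.svd(Z, full_matrices=False)
        self.components_ = vt
        scores = Z @ vt.T
        # Steps 3-5: skewness correction, 1D density per component, product
        self.lambdas_, self.kdes_ = [], []
        density = np.ones(len(X))
        for j in range(scores.shape[1]):
            t, lam = stats.yeojohnson(scores[:, j])  # assumed correction
            kde = stats.gaussian_kde(t)              # assumed 1D estimator
            self.lambdas_.append(lam)
            self.kdes_.append(kde)
            density *= kde(t)
        # Step 6: density level exceeded by `coverage` of the training points
        self.threshold_ = np.quantile(density, 1.0 - coverage)
        return self

    def density(self, X_query):
        # Step 7: apply the stored transformations, then multiply 1D densities
        Z = (np.atleast_2d(X_query) - self.mean_) / self.scale_
        scores = Z @ self.components_.T
        density = np.ones(len(scores))
        for j, (lam, kde) in enumerate(zip(self.lambdas_, self.kdes_)):
            density *= kde(stats.yeojohnson(scores[:, j], lmbda=lam))
        return density

    def inside(self, X_query):
        # Step 8: inside the region if the density reaches the threshold
        return self.density(X_query) >= self.threshold_


# Synthetic training set: two clusters with an empty region between them
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-3.0, 0.5, size=(200, 2)),
                     rng.normal(3.0, 0.5, size=(200, 2))])
domain = DensityDomain().fit(X_train, coverage=0.95)
print(domain.inside([0.0, 0.0]))  # query in the empty region between clusters
print(domain.inside([3.0, 3.0]))  # query at the center of one cluster
```

Calling fit with several coverage values (e.g. 0.99, 0.90, 0.75) gives the nested regions mentioned in step 6.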
Figure panels:
(a) a product of 1D densities in the original descriptor space
(b) a joint 2D kernel density estimation in the descriptor space
(c) a product of 1D densities in the principal components space
(d) a joint 2D kernel density estimation in the principal components space
(e) same as (c), with skewness correction
(f) same as (d), with skewness correction
(g) weighted Euclidean distance
(h) Mahalanobis distance
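For comparison with the density-based panels, the distance-based measures can be sketched as below. This assumes the distances are taken from the query point to the centroid of the training set, which the poster does not state explicitly. The function name and data are hypothetical, and the weighted Euclidean variant is not reproduced because its weights are not given.

```python
import numpy as np


def centroid_distances(X_train, x_query):
    """Euclidean and Mahalanobis distance from a query to the training centroid."""
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_query, dtype=float)
    diff = x - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # sample covariance of the descriptors
    euclidean = float(np.sqrt(diff @ diff))
    mahalanobis = float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
    return euclidean, mahalanobis


# Hypothetical training set: two tight clusters with empty space between them
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [4.8, 5.1], [5.1, 4.7]])

print(centroid_distances(X_train, [2.5, 2.5]))  # query in the gap, near the centroid
print(centroid_distances(X_train, [0.0, 0.0]))  # an actual training point
```

Under this assumption the query in the empty gap gets smaller distances than a real training point, so a distance cut-off alone would not flag it.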
Data transformation is necessary to obtain the true shape
(a) Density estimated without PCA and skewness correction
(b) Density estimated with PCA and skewness correction
(d) Density approximated with weighted Euclidean distance
Results and conclusions
(a) Density estimated without PCA and skewness correction
(b) Density estimated with PCA and skewness correction
(d) Density approximated with Euclidean distance
• We refined the concept of interpolation by using a data distribution approach.
• The method is unique in its ability to detect empty regions within the interpolated space.
• On average, the predictions within the interpolation space are more accurate than those outside of it.
Example Root Mean Square Error of the test set inside and outside of the application domain:
• SRC Kowwin (see poster)
• Phenols
Figure: RMSE error vs. data set density
Y axis: RMSE (0 to 0.7). X axis: density threshold (10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 95%, 99%, All).
Series: training set (probability density), test set (probability density), training set (Euclidean distance), test set (Euclidean distance).
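The comparison of RMSE inside and outside the application domain can be sketched as follows. The function names and the numbers are hypothetical and only illustrate the calculation; they are not results from the Kowwin or phenols data sets.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def rmse_inside_outside(y_true, y_pred, inside):
    """RMSE computed separately for points inside and outside the domain."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    result = {}
    for name, mask in (("inside", inside), ("outside", ~inside)):
        # NaN marks an empty group rather than raising an error
        result[name] = rmse(y_true[mask], y_pred[mask]) if mask.any() else float("nan")
    return result


# Hypothetical observed values, predictions and domain membership flags
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.5, 3.0]
inside = [True, True, False, False]
print(rmse_inside_outside(y_true, y_pred, inside))
```

Repeating the calculation with the membership flags obtained at each density threshold gives one point per threshold on a curve like the one in the figure.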
An application domain filter alone is not sufficient for accepting or rejecting the results of the modeling.
One also needs to assess the uncertainty of the prediction and make the final decision based on both criteria.