PROBABILISTIC ASSESSMENT OF THE QSAR APPLICATION DOMAIN

Nina Jeliazkova (1), Joanna Jaworska (2)
(1) IPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
(2) Central Product Safety, Procter & Gamble, Brussels, Belgium

Abstract

A key element of a quality prediction is to determine whether a QSAR model is suitable for predicting the queried chemical, and so to prevent the model’s potential misuse. The training data set from which a QSAR model is derived provides a basis for estimating its application domain. We demonstrate that the classic approaches, based on interpolation regions of the training data set, may not be sufficient. We propose a refined concept of interpolation regions based on a probability density distribution approach. This approach is more robust because it can identify the actual distribution of the data. However, the presence of a point in an empty or a dense region can be used only as a warning about model applicability, not as a final decision on the correctness of the predicted values. Uncertainty estimation of the model prediction is necessary to complete the process of rejecting or accepting the model result.

What is an application domain? How to assess it?

Application domain is
• Projection of the training set in the model’s parameter space
• NOT the space where the prediction uncertainty is known
• We need to find the interpolation regions in the model’s parameter space
• It is interpolation because, in general, QSAR models are statistical models

Classic approaches to estimate interpolation regions
• Descriptor ranges
• Distances
• Geometric
• Probabilistic for standard distributions (mostly 1D tests)

1D: parameter intervals determine the interpolation region.
2D, 3D, nD: are parameter intervals sufficient?
• If an experimental design was used during the model development, then the parameter space is covered in a homogeneous way.
• BUT this is rare in QSAR model development, and empty space within the parameter range is possible.
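The limitation of parameter intervals can be illustrated with a small sketch. It is not taken from the poster: the function names `in_descriptor_ranges` and `nearest_training_distance` and the six-point toy training set are assumptions made for illustration. A query point can pass the range check in every descriptor and still lie far from all training points.

```python
import numpy as np

def in_descriptor_ranges(train, query):
    """True where a query lies within the min/max of every training descriptor."""
    train = np.asarray(train, dtype=float)
    query = np.atleast_2d(np.asarray(query, dtype=float))
    lo, hi = train.min(axis=0), train.max(axis=0)
    return np.all((query >= lo) & (query <= hi), axis=1)

def nearest_training_distance(train, query):
    """Euclidean distance from each query to its closest training point."""
    train = np.asarray(train, dtype=float)
    query = np.atleast_2d(np.asarray(query, dtype=float))
    diff = query[:, None, :] - train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

# Hypothetical training set: two tight clusters with nothing in between.
train = np.array([[-3.0, -3.0], [-2.5, -3.2], [-3.1, -2.6],
                  [3.0, 3.0], [2.7, 3.2], [3.1, 2.8]])
queries = np.array([[0.0, 0.0],   # inside both ranges, but in the empty middle
                    [4.0, 0.0]])  # outside the range of the first descriptor

inside = in_descriptor_ranges(train, queries)
gap = nearest_training_distance(train, queries)
print(inside, gap)
```

The first query is accepted by the range check even though no training compound is near it, which is the situation the density-based approach is meant to flag.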
We need to refine the approach and be able to identify the empty regions within the interpolated space.

Generic probabilistic approach in a multivariate space

The algorithm we use to estimate the probability density of an n-dimensional data set is:
1. Standardize the training data set (scale and center);
2. Extract the principal components of the data set;
3. Perform a skewness correction transformation along each principal component;
4. Estimate the one-dimensional density on each transformed principal component;
5. Estimate the n-dimensional density p(x);
6. Estimate the smallest regions containing a specified percentage of the data points (e.g. 99%, 90%, 75%, 50%, 25%, 10% of the data; see the colored regions in the figure at the right);
7. For a query point (e.g. a new compound), apply the transformations of steps 1, 2 and 3 to the point and obtain the one-dimensional density value on each transformed principal component, then multiply over the principal components to obtain the n-dimensional density;
8. Determine whether the query point is inside or outside the region of the parameter space containing the specified percentage of the training set data points.
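The eight steps can be sketched in a few dozen lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the poster does not specify the kernel, the bandwidth, or the form of the skewness correction, so the sketch assumes a Gaussian kernel, a normal-reference rule-of-thumb bandwidth, and a shifted log transform applied when the sample skewness exceeds a cut-off. The names `DensityDomain`, `kde_1d` and `SKEW_LIMIT` are hypothetical. It also assumes at least two descriptors and no perfectly collinear descriptors.

```python
import numpy as np

def kde_1d(sample, points, bandwidth):
    """Gaussian kernel density estimate of `sample`, evaluated at `points`."""
    z = (points[:, None] - sample[None, :]) / bandwidth
    norm = sample.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

class DensityDomain:
    SKEW_LIMIT = 0.5  # assumed cut-off above which a component is corrected

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Step 1: standardize (center and scale).
        self.mean_, self.scale_ = X.mean(axis=0), X.std(axis=0, ddof=1)
        Z = (X - self.mean_) / self.scale_
        # Step 2: principal axes are the eigenvectors of the covariance matrix.
        _, self.axes_ = np.linalg.eigh(np.cov(Z, rowvar=False))
        T = Z @ self.axes_
        # Step 3: pick a skewness correction per component (assumed log form).
        self.skew_ = []
        for t in T.T:
            c = t - t.mean()
            g = (c ** 3).mean() / (c ** 2).mean() ** 1.5
            sign = 0.0 if abs(g) <= self.SKEW_LIMIT else float(np.sign(g))
            self.skew_.append((sign, (sign * t).min()))
        # Step 4: store the transformed sample and one bandwidth per component.
        self.sample_ = self._transform(X)
        self.bw_ = 1.06 * self.sample_.std(axis=0, ddof=1) * X.shape[0] ** -0.2
        # Steps 5-6: densities of the training points define the nested regions.
        self.train_density_ = self.density(X)
        return self

    def _transform(self, X):
        """Apply the transformations of steps 1-3 to new points."""
        X = np.atleast_2d(np.asarray(X, dtype=float))
        U = ((X - self.mean_) / self.scale_) @ self.axes_
        for j, (sign, offset) in enumerate(self.skew_):
            if sign != 0.0:
                shifted = np.maximum(sign * U[:, j] - offset + 1.0, 1e-12)
                U[:, j] = sign * np.log(shifted)
        return U

    def density(self, X):
        """Step 7: product of the 1D densities over the principal components."""
        U = self._transform(X)
        p = np.ones(U.shape[0])
        for j in range(U.shape[1]):
            p *= kde_1d(self.sample_[:, j], U[:, j], self.bw_[j])
        return p

    def threshold(self, percent):
        """Density level whose region holds `percent` of the training points."""
        return np.percentile(self.train_density_, 100.0 - percent)

    def in_domain(self, X, percent=90.0):
        """Step 8: is each query inside the region holding `percent` of the data?"""
        return self.density(X) >= self.threshold(percent)

# Toy usage: two clusters with an empty region between them (synthetic data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, size=(200, 2)),
               rng.normal(3.0, 0.5, size=(200, 2))])
dom = DensityDomain().fit(X)
queries = np.array([[3.0, 3.0], [0.0, 0.0], [20.0, 20.0]])
print(dom.density(queries))
print(dom.in_domain(queries, percent=90.0))
```

In this toy setup the query at the origin lies within the descriptor ranges yet receives a low density, because the density along the first principal component is bimodal. Defining the regions through the density values of the training points themselves is one simple way to realize step 6; other constructions are possible.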
Figure panels:
(a) a product of 1D densities in the original descriptor space
(b) a joint 2D kernel density estimation in the descriptor space
(c) a product of 1D densities in the principal components space
(d) a joint 2D kernel density estimation in the principal components space
(e) same as (c), with skewness correction
(f) same as (d), with skewness correction
(g) weighted Euclidean distance and Mahalanobis distance

Data transformation is necessary to obtain the true shape

[Figure: (a) density estimated without PCA and skewness correction; (b) density estimated with PCA and skewness correction; (d) density approximated with weighted Euclidean distance]

Results and conclusions
• We refined the concept of interpolation by using a data distribution approach.
• The method is unique in its ability to detect empty regions within the interpolated space.
• On average, the predictions within the interpolation space are more accurate than those outside it.

Example
Root Mean Square Error of the test set inside and outside of the application domain:
• SRC Kowwin (see poster)
• Phenols

[Figure: RMSE error vs. data set density. x-axis: density threshold (10% to 99%, All); y-axis: RMSE. Series: training set (probability density), test set (probability density), training set (Euclidean distance), test set (Euclidean distance)]

The application domain filter is not sufficient for rejecting or accepting the results of the modeling. One needs to assess the uncertainty of the prediction and eventually make a decision based on both criteria.
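The comparison of Root Mean Square Error inside and outside of the application domain can be computed as in the sketch below. The function name `rmse_by_domain` and the numbers are invented for illustration and are not the poster's data; the `inside` mask is assumed to come from whatever domain filter is in use.

```python
import numpy as np

def rmse_by_domain(y_true, y_pred, inside):
    """Return (RMSE inside the domain, RMSE outside the domain)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    sq_err = (y_true - y_pred) ** 2
    return (float(np.sqrt(sq_err[inside].mean())),
            float(np.sqrt(sq_err[~inside].mean())))

# Hypothetical test set: four compounds, two flagged as inside the domain.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.5, 2.0]
inside = [True, True, False, False]
rmse_in, rmse_out = rmse_by_domain(y_true, y_pred, inside)
print(rmse_in, rmse_out)
```

Repeating this calculation for each density threshold gives one point per threshold on an RMSE versus threshold curve.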