Quantitative Structure-Activity Relationships
Quantitative Structure-Property Relationships
QSAR/QSPR modeling

Alexandre Varnek
Faculté de Chimie, ULP, Strasbourg, FRANCE

QSAR/QSPR models
• Development
• Validation
• Application

Development of QSAR models
• Selection and curation of experimental data
• Preparation of training and test sets (optionally)
• Selection of an initial set of descriptors and their normalisation
• Variables selection
• Selection of a machine-learning method

Validation of models
• Training/test set
• Cross-validation: internal, external

Application of the models
• Models Applicability Domain

Development of the QSAR models
• Experimental Data
• Descriptors
• Mathematical techniques
• Statistical criteria

Preparation of training and test sets
[Figure: workflow from the initial data set to predictions]
• Splitting of an initial data set into a training set and a test set (10-15 % of the data)
• Building of structure-property models on the training set
• Selection of the best models according to statistical criteria
• “Prediction” calculations using the best structure-property models

Recommendations to prepare a test set
• (i) experimental methods for determination of activities in the training and test sets should be similar;
• (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%;
• (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data.
Reference: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215

Selection of descriptors for QSAR model
QSAR models should be reduced to a set of descriptors that is as information-rich but as small as possible.
Rules of thumb: good “spread”, 5-6 structure points per descriptor.

Objective selection (independent variable only)
• Statistical criteria of correlations
• Pairwise selection (Forward or Backward Stepwise selection)
• Principal Component Analysis
• Partial Least Squares analysis
• Genetic Algorithm
• …
Subjective selection
• Descriptors selection based on mechanistic studies

Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs)
1. identify a subset of columns (variables) with significant correlation to the response;
2. remove columns (variables) with small variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.
Reference: D. C. Whitley, M. G. Ford, D. J. Livingstone J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168

Machine-Learning Methods
Fitting models’ parameters
$Y = F(a_i, X_i)$
• $X_i$: descriptors (independent variables)
• $a_i$: fitted parameters
The goal is to minimize the Residual Sum of Squares (RSS):
$RSS = \sum_{i=1}^{N} (y_{exp,i} - y_{calc,i})^2$

Multiple Linear Regression (one descriptor)

Activity   Descriptor
Y1         X1
Y2         X2
…          …
Yn         Xn

$Y_i = a_0 + a_1 X_{i1}$

[Figure: plot of Y against X with the fitted line y = ax + b; a and b are marked on the plot.]
Residual Sum of Squares (RSS):
$RSS = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$

Multiple Linear Regression (m descriptors)

Activity   Descr 1   Descr 2   …   Descr m
Y1         X11       X12       …   X1m
Y2         X21       X22       …   X2m
…          …         …         …   …
Yn         Xn1       Xn2       …   Xnm

$Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + … + a_m X_{im}$

kNN (k Nearest Neighbors)
The activity Y of a compound is estimated as a weighted mean of the activities Yi of its k nearest neighbors in the chemical space.
[Figure: training set compounds in the Descriptor 1 / Descriptor 2 plane.]
A. Tropsha, A. Golbraikh, 2003

Biological and Artificial Neuron
[Figure: biological and artificial neuron.]

Multilayer Neural Network
Neurons in the input layer correspond to descriptors, neurons in the output layer correspond to the properties being predicted, and neurons in the hidden layer correspond to nonlinear latent variables.

QSAR/QSPR models
• Development
• Validation
• Application

Validating the QSAR Equation
How well does the model predict the activity of known compounds?
For a perfect model:
• All data points would reside on the diagonal.
• All variance existing in the original data is explained by the model.
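To make the least-squares fitting described under Multiple Linear Regression concrete, here is a minimal sketch for the one-descriptor case y = ax + b. The function name `fit_line` and the toy descriptor/activity values are hypothetical and chosen only for illustration; they are not data from the lecture.

```python
def fit_line(x, y):
    """Least-squares fit of y = a*x + b.

    Returns (a, b, rss), where rss is the Residual Sum of Squares
    sum((y_i - y_calc_i)**2) that the fit minimizes.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squared deviations of the descriptor from its mean.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    # Sum of cross-deviations between descriptor and activity.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a = sxy / sxx                 # slope
    b = y_mean - a * x_mean       # intercept
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss


# Hypothetical toy data: one descriptor, five compounds.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, rss = fit_line(descriptor, activity)
print(f"a = {a:.3f}, b = {b:.3f}, RSS = {rss:.3f}")
```

With several descriptors the same criterion applies, but the coefficients are normally obtained by solving the normal equations with a linear-algebra library rather than by hand.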
[Figure: plot of actual against predicted activity.]
r2 is the fraction of the total variation in the dependent variable that is explained by the regression equation.

Calculating r2
$r^2 = \frac{\text{Explained variance}}{\text{Original variance}}$
Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around the regression line)
[Figure: original variance compared with the variance around the regression line.]

Calculating r2
• Original variance: $TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2$
• Explained variance: $ESS = \sum_{i=1}^{N} (y_{i,calc} - \bar{y})^2$ (the improvement in predicting y over just using the mean of y)
• Variance around the regression line: $RSS = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$

$r^2 = \frac{ESS}{TSS} = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

Example:
$r^2 = \frac{3.49 - 0.40}{3.49} = \frac{3.09}{3.49} = 0.89$

[Table: Compound Number | Log EC | Calculated Log EC | Residual]

F-test
Tests the assumption that a significant portion of the original variance has been explained by the model. In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points) differs significantly from 0. A ratio of 0 would imply that ESS = 0, i.e., that the model did not explain any of the variance.

F-distribution
As N and k decrease, the probability of getting large r2 values purely by chance increases. Thus, as N and k decrease, a larger F-value is required for the test to be significant.

Calculating F Values
$F = \frac{ESS / k}{RSS / (N - k - 1)} = \frac{r^2 (N - k - 1)}{k (1 - r^2)}$
• Calculate F according to the above equation.
• Select a significance level (e.g., 0.05).
• Look up the F-value from an F-distribution for the corresponding N and k (k and N-k-1 degrees of freedom) at the selected significance level.
• If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level.

Example: r2 = 0.89, N = 7, k = 1, F = 40.46
For an F-distribution with N = 7, k = 1, a value of 40.46 corresponds to a confidence level of 0.9997. Thus, the equation is significant at this level.
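The r2 and F calculations above can be checked with a few lines of code. This is a minimal sketch that plugs in the example values quoted above (TSS = 3.49, RSS = 0.40, N = 7, k = 1); the function names are hypothetical. It assumes that the quoted F comes from the rounded r2 = 0.89, which gives about 40.45, while the unrounded r2 gives a somewhat smaller value.

```python
def r_squared(tss, rss):
    """r2 = 1 - RSS/TSS: the fraction of the original variance explained."""
    if tss <= 0:
        raise ValueError("TSS must be positive")
    return 1.0 - rss / tss


def f_value(r2, n, k):
    """F = r2 * (N - k - 1) / (k * (1 - r2)) for N data points, k parameters."""
    if n - k - 1 <= 0:
        raise ValueError("need N > k + 1 data points")
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return r2 * (n - k - 1) / (k * (1.0 - r2))


# Values taken from the worked example in the text.
tss, rss = 3.49, 0.40
n_points, n_params = 7, 1

r2 = r_squared(tss, rss)
f_unrounded = f_value(r2, n_points, n_params)    # uses r2 = 3.09 / 3.49
f_rounded = f_value(0.89, n_points, n_params)    # uses the rounded r2 = 0.89

print(f"r2 = {r2:.4f}")
print(f"F from unrounded r2 = {f_unrounded:.2f}")
print(f"F from r2 = 0.89    = {f_rounded:.2f}")
```

The calculated F is then compared with the tabulated critical value for k and N-k-1 degrees of freedom; the table lookup itself is not reproduced here.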
The probability that the correlation is fortuitous is < 0.03%.

Validation of Models
[Figure: 5-fold external cross-validation procedure.]

Cross Validation
A measure of the predictive ability of the model (as opposed to the measure of fit produced by r2).
$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, where $PRESS = \sum_{i=1}^{N} (y_{pred,i} - y_i)^2$
$r^2 = 1 - \frac{RSS}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, where $RSS = \sum_{i=1}^{N} (y_{calc,i} - y_i)^2$
r2 always increases as more descriptors are added. Q2 initially increases as more parameters are added but then starts to decrease, indicating overfitting of the data. Thus Q2 is a better indicator of the model quality.

Other Model Validation Parameters
1. s is the standard deviation about the regression line. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR.
$s = \sqrt{\frac{\sum (y_{obs} - y_{calc})^2}{N - k - 1}}$
N is the number of observations and k is the number of variables.
2. Scrambling of y.

Statistical tests for “chance correlations”
Scrambling: to mix randomly
• Y values (Y-scrambling), or
• X values (X-scrambling), or
• simultaneously Y and X values (X,Y-scrambling)
Randomization: to generate random numbers
• from Ymin to Ymax (Y-randomization),
• from Xmin to Xmax (X-randomization),
• or do this simultaneously for Y and X (X,Y-randomization)
Calculate statistical parameters of correlations and compare them with those obtained for the model.

Scrambling
[Figure: the properties Pro.1 ... Pro.n are reassigned at random among the structures Struc.1 ... Struc.n. Plot of q2 against the number of variables.]
• The lowest q2 = 0.51 in the top 10 models
• The highest q2 = 0.14 for randomized datasets

QSAR/QSPR models
• Development
• Validation
• Application

[Figure: test compound, QSPR models, prediction performance.]

Robustness of QSPR models
- Descriptors type;
- Descriptors selection;
- Machine-learning methods;
- Validation of models.
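The cross-validation and Y-scrambling checks described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the lecture: it uses leave-one-out cross-validation instead of the 5-fold scheme, a one-descriptor linear model, and hypothetical toy data and function names.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    a = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return a, y_mean - a * x_mean


def r2_and_q2(x, y):
    """Return (r2, Q2), with Q2 from leave-one-out cross-validation."""
    n = len(y)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    # r2: the model is fitted on all compounds (measure of fit).
    a, b = fit_line(x, y)
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    # Q2: each compound is predicted by a model fitted without it.
    press = 0.0
    for i in range(n):
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a_i * x[i] + b_i)) ** 2
    return 1.0 - rss / tss, 1.0 - press / tss


def y_scrambling(x, y, n_runs=20, seed=0):
    """Q2 values obtained after randomly shuffling the activities."""
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    q2_values = []
    for _ in range(n_runs):
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)
        q2_values.append(r2_and_q2(x, y_shuffled)[1])
    return q2_values


# Hypothetical toy data: one descriptor, eight compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]

r2, q2 = r2_and_q2(x, y)
scrambled = y_scrambling(x, y)
print(f"r2 = {r2:.3f}, Q2 = {q2:.3f}")
print(f"highest Q2 after Y-scrambling = {max(scrambled):.3f}")
```

A model is suspect when the Q2 values of the scrambled data sets approach the Q2 of the real model, which is the comparison summarized by the q2 = 0.51 and q2 = 0.14 figures above.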
Applicability domain of models
Is a test compound similar to the training set compounds?

Applicability domain of QSAR models
The new compound will be predicted by the model only if:
$D_i \le \langle D_k \rangle + Z \times s_k$
with Z an empirical parameter (0.5 by default).
[Figure: training set in the Descriptor 1 / Descriptor 2 plane, with test compounds inside and outside the domain.]
• Test compound inside the domain: will be predicted
• Test compound outside the domain: will not be predicted

Applicability domain of QSAR models
Range-based methods
• Bounding Box (BB)

Should one use only one individual model or many models?
Ensemble modeling
[Figure: hunting season with a single hunter compared with hunting season with many hunters.]

Ensemble modelling
Property (Y) predictions using best-fit models

              model 1   model 2   …   mean ± s
Compound 1    Y11       Y12       …   <Y1> ± ΔY1
Compound 2    Y21       Y22       …   <Y2> ± ΔY2
…             …         …         …   …
Compound m    Ym1       Ym2       …   <Ym> ± ΔYm

Grubbs statistics is used to exclude outliers.

Calculation of Descriptors
[Figure: a data set of structures is processed by ISIDA FRAGMENTOR into a pattern matrix, which is paired with the property values.]
• Learning stage: building of models
• Validation stage: QSAR models filtering, selection of the most predictive ones
• Result: QSAR models

Example: linear QSPR model
$Property = a_0 + \sum_{i=1}^{k} a_i D_i$
$PROPERTY_{calc} = -0.36 \, N_{C-C-C-N-C-C} + 0.27 \, N_{C=O} + 0.12 \, N_{C-N-C*C} + …$

Virtual screening with QSAR/QSPR models
Screening and hits selection
[Figure: compounds from a database pass through virtual screening with a QSPR model and are separated into hits, which go on to experimental tests, and useless compounds.]

Combinatorial Library Design
Generation of Virtual Combinatorial Libraries
[Figure: Markush structure with labels O, P, R1, R2, R3; enumerating the substituents R1, R2, R3 generates the individual structures of the library.]

The types of variation in Markush structures:
1. Substituent variation (R1): R1 = Me, Et, Pr
2. Position variation (R2): R2 = NH2
3. Frequency variation: (CH2)n, n = 1-3
4. Homology variation (R3): R3 = alkyl or heterocycle (only for patent search)

IN SILICO design of new compounds
- Acquisition of Data;
- Acquisition of Knowledge;
- Exploitation of Knowledge

“In silico” design of new compounds
[Figure: ISIDA workflow in seven numbered stages, with the labels: combinatorial module (1000 molecules/second), database, filtering, similarity search, QSAR models with applicability domains, assessment of properties, hits selection, synthesis and experimental tests.]
[Figure: Markush structure of a monoamide with substituents R1, R2, R3.]
The combinatorial module generates virtual libraries based on the Markush structures.

COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of UO2(2+) by monoamides
[Figure: monoamide structure with substituents R1, R2, R3; R = H, alkyl.]
$D = \frac{[U]_{\text{organic phase}}}{[U]_{\text{aqueous phase}}}$
A. Varnek, D. Fourches, V. Solov’ev, O. Klimchuk, A. Ouadi, I. Billard J. Solv. Extr. Ion Exch., 2007, 25, N°4

SOLVENT EXTRACTION OF METALS
[Figure: solvent extraction scheme with the labels M2 + An- and M1 + L.]

COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Extraction of UO2(2+) by monoamides
• Reprocessing of the spent nuclear fuel
• PUREX process (La Hague plant, France)
• TBP: tributyl phosphate
Goal: theoretical design of new uranyl binders more efficient than previously studied molecules
1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960)
2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr.
Ion Exch., 17, 87 (1999)

[Figure: ISIDA workflow with database, data treatment, expert system, predictor, virtual screening and hits selection. Virtual library: 11,000 compounds; selected hits: 21 compounds.]

“In silico” design of uranyl binders with ISIDA
[Figure: experimental vs. predicted logD for the new amides (by ID).]
[Figure: number of compounds against logD for newly synthesized and previously studied amides.]
Enrichment of the initial data set by new efficient extractants with logD > 0.9:
• 4 compounds (previously studied)
• 9 compounds (newly synthesized)

Classification Models

Confusion Matrix
• For N instances, K classes and a classifier
• Nij, the number of instances of class i classified as j

          Class1   Class2   …   ClassK
Class1    N11      N12      …   N1K
Class2    N21      N22      …   N2K
…         …        …        …   …
ClassK    NK1      NK2      …   NKK

Classification Evaluation
• Global measures of success: measures are estimated on all classes
• Local measures of success: measures are estimated for each class

“The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties”
George S. Hammond, Norris Award Lecture, 1968
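Returning to the confusion matrix defined in the classification slides, the sketch below builds Nij from actual and predicted class labels and derives one global and two local measures. The choice of accuracy as the global measure and of per-class recall and precision as local measures is an assumption made for illustration; the slides do not name specific measures. The labels and function names are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Return matrix[i][j] = number of instances of class i classified as j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Global measure: fraction of all instances on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def recall_per_class(matrix):
    """Local measure: N_ii divided by the row sum (instances of class i)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


def precision_per_class(matrix):
    """Local measure: N_jj divided by the column sum (predicted as class j)."""
    k = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    return [matrix[j][j] / col_sums[j] if col_sums[j] else 0.0
            for j in range(k)]


# Hypothetical labels for eight compounds.
classes = ["active", "inactive"]
actual = ["active", "active", "active", "inactive",
          "inactive", "inactive", "inactive", "active"]
predicted = ["active", "inactive", "active", "inactive",
             "inactive", "active", "inactive", "active"]

matrix = confusion_matrix(actual, predicted, classes)
print(matrix)
print(accuracy(matrix), recall_per_class(matrix), precision_per_class(matrix))
```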