Application and Efficacy of Random Forest Method for QSAR Analysis
presented by Pavel Polishchuk

Random Forest – consensus modelling

A Random Forest (RF) model is an ensemble of single decision trees.

Rules for model construction:
1. Each tree is grown on a separate bootstrap sample of the initial training set compounds.
2. At each node, only a small, fixed number of randomly chosen descriptors is considered.
3. Each tree is grown to its maximum depth (no pruning).

Random Forest algorithm

[Diagram: Initial dataset -> Bootstrap samples -> Tree1, Tree2, Tree3, … -> Combined prediction]

Random Forest advantages:
1. RF models are robust to over-fitting.
2. No pre-selection of variables is needed.
3. RF has its own reliable procedure for estimating the predictive ability of a model.
4. RF models are robust to "noise" in the training dataset.
5. RF makes it possible to estimate the importance of each variable for the target property (interpretability of the RF model).
6. RF makes it possible to analyze compounds with different mechanisms of action.
7. The RF method is fast and works effectively with very large datasets.

Several examples of solutions to real QSAR tasks

Toxicity of chemical compounds for T. pyriformis#

Toxicity was expressed as the negative logarithm of the concentration causing 50% inhibition of Tetrahymena pyriformis growth (pIGC50).

Diverse datasets:
training set = 644 compounds
test set 1 (ts1) = 339 compounds
test set 2 (ts2) = 110 compounds

Total number of 2D simplex descriptors = 6021

# Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.

Comparison of the RF model with other consensus models

RF model (trees = 500, vars = 2000)#

            RF#            Consensus PLS   Consensus##
            (2D simplex)   (2D simplex)    (literature)
R2(ws)      0.99           0.85            0.92
R2(oob)     0.81           ---             ---
R2(ts1)     0.83           0.80            0.85
R2(ts2)     0.74           0.69            0.67
MAE(ts1)    0.30           0.33            0.29
MAE(ts2)    0.38           0.41            0.39

MAE is the mean absolute error of prediction:

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|

# Polishchuk, P.G., et al., J. Chem. Inf. Model., 2009. 49: p. 2481-2488.
## Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
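The three construction rules and the R2(oob) and MAE statistics above can be illustrated with a minimal pure-Python sketch. This is not the implementation used in the talk: the synthetic data, the function names (grow_tree, fit_forest and so on), and the settings (n_trees = 50, mtry = 2) are assumptions chosen only to keep the example small and self-contained.

```python
import random

def mae(y_true, y_pred):
    """Mean absolute error: (1/n) * sum(|Y_i - Yhat_i|)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def sse(values):
    """Sum of squared deviations from the mean (node impurity)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(X, y, rows, mtry, rng):
    """Grow an unpruned regression tree on the given row indices."""
    targets = [y[i] for i in rows]
    if len(rows) < 2 or max(targets) == min(targets):
        return sum(targets) / len(targets)            # leaf: mean target
    best = None
    # Rule 2: only `mtry` randomly chosen descriptors are tried at this node.
    for j in rng.sample(range(len(X[0])), mtry):
        values = sorted({X[i][j] for i in rows})
        for t in values[:-1]:                         # both sides stay non-empty
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:                                  # sampled descriptors constant here
        return sum(targets) / len(targets)
    _, j, t, left, right = best
    # Rule 3: keep splitting until no further split is possible (no pruning).
    return (j, t, grow_tree(X, y, left, mtry, rng), grow_tree(X, y, right, mtry, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def fit_forest(X, y, n_trees=50, mtry=2, seed=0):
    rng = random.Random(seed)
    n = len(X)
    trees, oob_sum, oob_cnt = [], [0.0] * n, [0] * n
    for _ in range(n_trees):
        # Rule 1: each tree gets its own bootstrap sample (drawn with replacement).
        rows = [rng.randrange(n) for _ in range(n)]
        tree = grow_tree(X, y, rows, mtry, rng)
        trees.append(tree)
        in_bag = set(rows)
        # Out-of-bag estimate: a compound is predicted only by trees that never saw it.
        for i in range(n):
            if i not in in_bag:
                oob_sum[i] += predict_tree(tree, X[i])
                oob_cnt[i] += 1
    oob_pred = [s / c if c else None for s, c in zip(oob_sum, oob_cnt)]
    return trees, oob_pred

def predict_forest(trees, x):
    """Combined prediction: average over all trees."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)

# Synthetic stand-in for a descriptor matrix and a target property (assumption).
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1) for _ in range(5)] for _ in range(80)]
y = [2.0 * x[0] - x[1] + data_rng.gauss(0, 0.1) for x in X]
X_train, y_train, X_test, y_test = X[:60], y[:60], X[60:], y[60:]

trees, oob_pred = fit_forest(X_train, y_train, n_trees=50, mtry=2, seed=0)
kept = [i for i, p in enumerate(oob_pred) if p is not None]
r2_fit = r2(y_train, [predict_forest(trees, x) for x in X_train])
r2_oob = r2([y_train[i] for i in kept], [oob_pred[i] for i in kept])
mae_test = mae(y_test, [predict_forest(trees, x) for x in X_test])
print("R2(fit) =", round(r2_fit, 2), "R2(oob) =", round(r2_oob, 2),
      "MAE(test) =", round(mae_test, 2))
```

Because every tree reproduces its own in-bag compounds almost exactly, the fitted R2 on the training set is expected to exceed the out-of-bag R2. This mirrors the gap between R2(ws) and R2(oob) in the table and is why the out-of-bag value is the more honest internal estimate of predictive ability.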
Estimation of mutagenic potential of chemical compounds (Ames test)

training set = 4361 compounds
test set = 2181 compounds

Model                    Descriptors        Accuracy   Accuracy      Accuracy
                                            (oob)      (5-fold CV)   (test set)
2D RF                    Simplex + Dragon   0.827      0.823         0.813
2D RF                    Simplex            0.823      0.810         0.814
2D RF                    Dragon             0.815      0.803         0.805
Consensus# (32 models)   ---                ---        0.828         0.823

# Results of a collaboration of 13 scientific groups (not yet published).

Solubility in water: QSPR task solution#

training set = 2537 compounds
test set = 301 compounds

training set R2 = 0.99
out-of-bag set R2 = 0.88
test set R2 = 0.82

# Kovdienko, N.A., et al., Molecular Informatics, 2010. 29: p. 394-406.

Leo Breiman – author of Random Forest (27.01.1928 – 07.07.2005)

«Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.»