2 Random Forest

Application and Efficacy of
Random Forest Method
for QSAR Analysis
presented by
Pavel Polishchuk
Random Forest – consensus modelling
A Random Forest model is an ensemble of individual decision trees.
Rules for model construction
1. Each tree is grown on a separate bootstrap sample of the initial
training set compounds.
2. At each node, only a small, fixed number of randomly chosen
descriptors is considered.
3. Each tree is grown to its maximum depth (no pruning).
Random Forest algorithm
[Diagram: the initial dataset is resampled into bootstrap samples; one tree (Tree1, Tree2, Tree3, …) is grown on each bootstrap sample; the outputs of all trees are merged into a combined prediction.]
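The construction rules and the algorithm scheme can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the function names (grow_tree, predict_tree, fit_forest, predict_forest), the toy data, and the settings n_trees=25 and mtry=1 are assumptions chosen for illustration. As a simplification, a node becomes a leaf if all descriptors sampled at that node are constant.

```python
import random
from statistics import mean


def grow_tree(X, y, mtry, rng):
    """Grow one unpruned regression tree; returns a leaf value or a split tuple."""
    if len(set(y)) == 1:                      # pure node: nothing left to split
        return y[0]
    best = None                               # (squared error, descriptor, threshold)
    # Rule 2: only a random subset of `mtry` descriptors is tried at this node.
    for f in rng.sample(range(len(X[0])), mtry):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or err < best[0]:
                best = (err, f, t)
    if best is None:                          # simplification: sampled descriptors are constant
        return mean(y)
    _, f, t = best
    go_left = [row[f] <= t for row in X]
    # Rule 3: keep splitting until the nodes are pure (no pruning).
    left = grow_tree([r for r, g in zip(X, go_left) if g],
                     [v for v, g in zip(y, go_left) if g], mtry, rng)
    right = grow_tree([r for r, g in zip(X, go_left) if not g],
                      [v for v, g in zip(y, go_left) if not g], mtry, rng)
    return (f, t, left, right)


def predict_tree(node, row):
    """Follow the splits down to a leaf value."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        # Rule 1: each tree sees its own bootstrap sample (drawn with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest


def predict_forest(forest, row):
    """Combined prediction: average of the individual tree predictions."""
    return mean(predict_tree(tree, row) for tree in forest)


# Toy data (hypothetical): descriptor 0 carries the signal, descriptor 1 is noise.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * i for i in range(20)]
forest = fit_forest(X, y)
prediction = predict_forest(forest, [10.0, 0.0])
print(prediction)
```

For classification tasks the combined prediction would be a majority vote rather than an average.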
Random Forest advantages:
1. RF models are robust to over-fitting.
2. No pre-selection of variables is needed.
3. RF has its own reliable procedure for estimating the
predictive ability of a model (out-of-bag estimation).
4. RF models are robust to “noise” in the training dataset.
5. RF makes it possible to estimate the importance of variables for
the target property (interpretability of the RF model).
6. RF can analyze compounds with different
mechanisms of action.
7. The RF method is very fast and handles
huge datasets effectively.
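The built-in estimate of predictive ability relies on the compounds that a given bootstrap sample leaves out (the out-of-bag compounds): each tree can be tested on compounds it never saw during training. The sketch below only checks how large that left-out share is; n_compounds = 1000 and n_trees = 200 are hypothetical values, not taken from the slides.

```python
import random

rng = random.Random(42)
n_compounds = 1000      # hypothetical training set size
n_trees = 200           # hypothetical number of bootstrap samples

oob_fractions = []
for _ in range(n_trees):
    # Draw a bootstrap sample: n_compounds indices with replacement.
    in_bag = {rng.randrange(n_compounds) for _ in range(n_compounds)}
    # Compounds never drawn are "out-of-bag" for this tree.
    oob_fractions.append(1 - len(in_bag) / n_compounds)

mean_oob = sum(oob_fractions) / n_trees
# Probability that a given compound is missed by all n draws.
expected = (1 - 1 / n_compounds) ** n_compounds

print(f"simulated out-of-bag share: {mean_oob:.3f}")
print(f"theoretical value:          {expected:.3f}")
```

Roughly a third of the training compounds are out-of-bag for any single tree, so every compound is predicted by a sizeable subset of the forest that was not trained on it.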
Several examples of
solutions to real QSAR tasks
Toxicity of chemical compounds to T. pyriformis#
was expressed as the negative logarithm of the concentration causing
50% inhibition of Tetrahymena pyriformis growth (pIGC50)
Diverse datasets:
training set = 644 compounds
test set 1 (ts1) = 339 compounds
test set 2 (ts2) = 110 compounds
Total number of 2D simplex descriptors = 6021
# Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.
Comparison of the RF model with other consensus models
RF model (trees=500, vars=2000)#

            RF#             Consensus PLS   Consensus##
            (2D simplex)    (2D simplex)    (literature)
R2(ws)      0.99            0.85            0.92
R2(oob)     0.81            ---             ---
R2(ts1)     0.83            0.80            0.85
R2(ts2)     0.74            0.69            0.67
MAE(ts1)    0.30            0.33            0.29
MAE(ts2)    0.38            0.41            0.39
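MAE: mean absolute error of prediction

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$

where Y_i is the observed value, \hat{Y}_i is the predicted value, and n is the number of compounds.

# Polishchuk, P.G., et al., J. Chem. Inf. Model., 2009. 49: p. 2481-2488.
## Zhu, H., et al., J. Chem. Inf. Model., 2008. 48: p. 766-784.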
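The mean absolute error reported for the test sets is the average absolute difference between observed and predicted values. A minimal sketch follows; the function name and the four example values are made up for illustration and are not data from the cited studies.

```python
def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over all compounds."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / len(observed)


# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.5]
mae = mean_absolute_error(observed, predicted)   # (0.5 + 0.0 + 1.0 + 0.5) / 4
print(mae)
```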
Estimation of mutagenic potential of chemical compounds
(Ames test)
training set = 4361 compounds
test set = 2181 compounds

Model                    Descriptors         Accuracy   Accuracy      Accuracy
                                             (oob)      (5-fold CV)   (test set)
2D RF                    Simplex + Dragon    0.827      0.823         0.813
2D RF                    Simplex             0.823      0.810         0.814
2D RF                    Dragon              0.815      0.803         0.805
Consensus# (32 models)   ---                 ---        0.828         0.823

# Results of collaboration of 13 scientific groups (not published yet)
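The 5-fold cross-validation accuracy can be obtained by splitting the 4361 training compounds into five disjoint folds and holding each fold out in turn. The sketch below shows only the splitting and the accuracy measure; the helper names k_fold_indices and accuracy and the seed are assumptions, and the model fitting itself is left as a comment.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle the item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(observed, predicted):
    """Share of compounds whose predicted class equals the observed class."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


folds = k_fold_indices(4361, k=5)   # 4361 = training set size from the slide
for held_out, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    # A model would be fitted on train_idx and scored on test_idx here.
    print(held_out, len(train_idx), len(test_idx))
```

The cross-validated accuracy is then the accuracy over all held-out predictions taken together, or the mean of the five per-fold accuracies.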
Solubility in water QSPR task solution#
training set = 2537 compounds
test set = 301 compounds
training set: R2 = 0.99
out-of-bag set: R2 = 0.88
test set: R2 = 0.82
# Kovdienko, N.A., et al., Molecular Informatics, 2010. 29: p. 394-406.
Leo Breiman (27.01.1928 – 07.07.2005) – author of Random Forest
«Random Forest is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.»