Variable selection from random forests:
application to gene expression data
Ramón Díaz-Uriarte
[email protected]
http://ligarto.org/rdiaz
Unidad de Bioinformática
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
December 2004
© 2004 Ramón Díaz-Uriarte
Outline
Gene selection: two objectives.
Using random forests for gene selection.
What random forests are.
Scree plots: selection of important genes.
Backwards elimination using OOB error: selection of
minimal subsets with good predictive abilities.
Gene selection with random forest: performance.
Gene selection with random forest: stability.
Gene selection: two objectives
Researchers often want to:
1. Obtain a (probably large) set of genes that are related to the
outcome of interest; this set should include genes even if
they perform similar functions and are highly correlated.
2. Obtain the smallest possible set of genes that can still
achieve decent predictive performance (thus, “redundant”
genes should not appear in the list).
With microarray data, interpretability is relevant: it matters
which genes are selected, and how stable the selected sets of
genes are.
Can we use some general purpose classification algorithm to
achieve the above goals, with microarray data?
Random forest
Excellent performer in classification tasks (even when most
putative predictive variables are noise).
No need to fine-tune parameters to achieve excellent
performance.
Automatically incorporates interactions among predictor
variables (since the base learners are classification trees).
Can be used when p ≫ n (many more variables than cases).
It does not overfit.
Can handle a mixture of categorical and continuous predictor
variables.
Output is invariant to monotone transformations of the
predictors.
High quality and free implementations: original Fortran code
from L. Breiman and A. Cutler and an R package (A. Liaw).
Random forest: variable importance
As part of the algorithm, random forest returns measures of
variable importance.
Variable importance measures can be used to perform variable
selection.
The importance measure used here is based on the decrease in
classification accuracy when the values of a variable are
randomly permuted in the out-of-bag cases.
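A minimal sketch of obtaining these importances with the randomForest R package (A. Liaw's implementation, mentioned below); the toy data stand in for an expression matrix and are not from the slides:

```r
library(randomForest)
set.seed(1)

## Toy stand-in for an expression matrix: 50 patients x 200 genes,
## with the first 10 genes shifted upwards in class 2.
x <- matrix(rnorm(50 * 200), nrow = 50)
y <- factor(rep(c(1, 2), each = 25))
x[y == 2, 1:10] <- x[y == 2, 1:10] + 1

## importance = TRUE requests the permutation-based measures.
rf <- randomForest(x, y, ntree = 1000, importance = TRUE)

## type = 1: mean decrease in classification accuracy when a
## variable's out-of-bag values are randomly permuted.
imp <- importance(rf, type = 1)
head(sort(imp[, 1], decreasing = TRUE))
```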
Random forest: details (I)
An algorithm for classification that uses an ensemble of
classification trees (i.e., CART, similar also to Quinlan’s C4.5).
Each of the classification trees is built using a bootstrap
sample of the data.
At each split the candidate set of variables is a random subset
of the variables instead of all the variables.
(Thus, random forest uses both bagging [bootstrap
aggregation] and random variable selection for tree building.)
Each tree is unpruned (grown fully) to obtain low-bias trees.
Random forest: details (II)
Bagging and random variable selection result in low correlation
of the individual trees.
The ensemble can achieve both low bias and low variance: it
averages over a large number of low-bias, high-variance, but
weakly correlated trees.
(Recall: MSE = bias² + variance.)
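The averaging argument can be made concrete with a standard result (not from these slides): for B identically distributed trees, each with variance σ² and pairwise correlation ρ,

$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2,$$

so growing a large ensemble removes the second term, and keeping the trees weakly correlated (small ρ, from bagging plus random variable selection) shrinks the first.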
Gene selection: two objectives
Researchers often want to:
1. Obtain a (probably large) set of genes that are related to the
outcome of interest; this set should include genes even if
they perform similar functions and are highly correlated.
Scree plots.
2. Obtain the smallest possible set of genes that can still
achieve decent predictive performance (thus, “redundant”
genes should not appear in the list). Backwards elimination
using OOB error.
With microarray data, interpretability is relevant: it matters
which genes are selected, and how stable the selected sets of
genes are.
Variable selection using “scree plots”
Plot ordered variable importances from random forest (like
“scree plots” or “scree graphs” in PCA, or “importance
spectrums” of Friedman and Meulman).
Compare the observed plot with similar plots generated by
random forest with data that conform to an appropriate “null
hypothesis”.
In our supervised case, compare with scree plots that are
generated by random forest from data sets with permuted class
labels (leaving intact the correlation structure of the predictors).
This approach to gene selection is targeted towards the first
objective above.
In particular, we expect to be able to recover sets of highly
correlated genes.
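A minimal R sketch of this comparison with the randomForest package (toy data; the 10 label permutations are an arbitrary choice):

```r
library(randomForest)
set.seed(1)

## Toy expression matrix: 50 patients x 200 genes, 10 informative.
x <- matrix(rnorm(50 * 200), nrow = 50)
y <- factor(rep(c(1, 2), each = 25))
x[y == 2, 1:10] <- x[y == 2, 1:10] + 1

## Observed scree plot: variable importances, sorted.
rf <- randomForest(x, y, ntree = 1000, importance = TRUE)
obs <- sort(importance(rf, type = 1)[, 1], decreasing = TRUE)
plot(obs, type = "l", xlab = "Variable", ylab = "Importance")

## Null scree plots: permute the class labels (leaving the
## correlation structure of the predictors intact) and overlay.
for (i in 1:10) {
  rf0 <- randomForest(x, sample(y), ntree = 1000, importance = TRUE)
  lines(sort(importance(rf0, type = 1)[, 1], decreasing = TRUE),
        col = "grey")
}
```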
Scree plots for simulated data (I)
We used simulated data, so we know exactly which genes are
relevant.
Classes of patients: 2 to 4.
Number of independent dimensions: 1 to 3.
Number of genes per dimension: 5, 20, 100.
Subjects per class: 25.
Each independent dimension has the same relevance for
discrimination of the classes.
Data: multivariate normal distribution with variance of 1, and a
correlation among genes within dimension of 0.9 (and a
correlation of 0 between genes from different dimensions).
To each data set: added 2000 random normal variates (mean
0, variance 1) and 2000 random uniform [−1, 1] variates.
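A sketch of one such scenario (4 classes, 1 dimension, 20 genes) using MASS::mvrnorm; the size of the between-class mean shift is an assumption of mine, since the slides state only that all dimensions are equally relevant:

```r
library(MASS)  # for mvrnorm
set.seed(1)

n.class <- 4; n.per.class <- 25; genes.per.dim <- 20

## Unit variances, correlation 0.9 among genes within a dimension.
Sigma <- matrix(0.9, genes.per.dim, genes.per.dim)
diag(Sigma) <- 1

## One informative dimension; class k has mean k on all its genes
## (the shift of 1 per class is an illustrative assumption).
y <- factor(rep(seq_len(n.class), each = n.per.class))
signal <- do.call(rbind, lapply(seq_len(n.class), function(k)
  mvrnorm(n.per.class, mu = rep(k, genes.per.dim), Sigma = Sigma)))

## Add 2000 N(0, 1) and 2000 U(-1, 1) noise variates.
n <- n.class * n.per.class
x <- cbind(signal,
           matrix(rnorm(n * 2000), n),
           matrix(runif(n * 2000, -1, 1), n))
```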
Scree plots for simulated data (II)
[Figure: sorted variable-importance (“scree”) plots, Importance (y) vs. Variable (x), for four simulated scenarios: 4 classes, 1 comp., 5 genes/comp.; 4 classes, 1 comp., 20 genes/comp.; 4 classes, 2 comp., 100 genes/comp.; 4 classes, 3 comp., 100 genes/comp.]
Scree plots for “real data” (I)
Leukemia: 3051 genes, 38 patients, 2 classes.
Breast: 4869 genes, 78 patients and 2 classes or 96 patients
and 3 classes.
Adenocarcinoma: 9868 genes, 76 patients, 2 classes.
NCI 60: 5244 genes, 61 patients, 8 classes.
Brain: 5597 genes, 42 patients, 5 classes.
Colon: 2000 genes, 62 patients, 2 classes.
Prostate: 6033 genes, 102 patients, 2 classes.
Lymphoma: 4026 genes, 62 patients, 3 classes.
Srbct: 2308 genes, 63 patients, 4 classes.
Scree plots for “real data” (II)
[Figure: scree plots, Importance (y) vs. Variable (x), for the Leukemia, Breast (2 cl.), and NCI 60 data sets.]
Scree plots for “real data” (III)
[Figure: scree plots, Importance (y) vs. Variable (x), for the Adenocar., Brain, and Colon data sets.]
Scree plots for “real data” (IV)
[Figure: scree plots, Importance (y) vs. Variable (x), for the Lymphoma, Prostate, and Srbct data sets.]
Gene selection: two objectives
Researchers often want to:
1. Obtain a (probably large) set of genes that are related to the
outcome of interest; this set should include genes even if
they perform similar functions and are highly correlated.
Scree plots.
2. Obtain the smallest possible set of genes that can still
achieve decent predictive performance (thus, “redundant”
genes should not appear in the list). Backwards elimination
using OOB error.
Backwards elimination using OOB error
Iteratively refit random forests:
Discard those variables with the smallest variable
importances (at each step, eliminate the lower 50% of
variables).
At the end, select the set of variables that yields the smallest
out-of-bag (OOB) error rate.
We choose the solution with the smallest number of variables
whose error rate is within 1 standard deviation of the minimum
error rate of all forests.
Error rate of procedure estimated using the bootstrap (.632+
rule).
(Because of the iterative approach, the OOB error rates of the
successive forests are severely biased downwards, and cannot
be used to assess the overall error rate of the approach.)
(No recalculation of variable importances at each step.)
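A minimal sketch of the procedure just described (this is not the released R package; the binomial standard error used for the “1 s.d.” rule is an assumption of mine):

```r
library(randomForest)
set.seed(1)

## Toy data: 50 patients x 500 genes, 5 informative.
x <- matrix(rnorm(50 * 500), nrow = 50)
y <- factor(rep(c(1, 2), each = 25))
x[y == 2, 1:5] <- x[y == 2, 1:5] + 1.5

## Rank the genes once; importances are not recalculated later.
rf0 <- randomForest(x, y, ntree = 2000, importance = TRUE)
ord <- order(importance(rf0, type = 1)[, 1], decreasing = TRUE)

## Iteratively refit, dropping the lower 50% of genes each step.
vars <- ord; res <- list()
while (length(vars) >= 2) {
  rf <- randomForest(x[, vars, drop = FALSE], y, ntree = 2000)
  res[[length(res) + 1]] <- list(n = length(vars),
                                 oob = rf$err.rate[rf$ntree, "OOB"],
                                 vars = vars)
  vars <- vars[seq_len(ceiling(length(vars) / 2))]
}

## Smallest set whose OOB error is within 1 s.d. of the minimum
## (binomial s.e. of the minimum OOB error: an assumption here).
oobs <- sapply(res, function(r) r$oob)
se <- sqrt(min(oobs) * (1 - min(oobs)) / length(y))
best <- res[[max(which(oobs <= min(oobs) + se))]]
best$n
```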
Backwards elimination: results
Data set        Error rate    # Vars (genes)
--------------------------------------------
Leukemia          0.079             2
Breast 2 cl.      0.353             9
Breast 3 cl.      0.378            19
NCI 60            0.398            21
Adenocar.         0.209             3
Brain             0.210            11
Colon             0.192             7
Lymphoma          0.043            63
Prostate          0.071             2
Srbct             0.079            73
Stability of results
Results are good, very good. . . but:
Instability or non-uniqueness of results (the “Rashomon effect”
sensu Breiman) is a widespread problem with microarray data
(reviewed by Somorjai, Dolenko & Baumgartner, 2003): many
equally good (or equally excellent) solutions.
This makes biological interpretation difficult.
We use the bootstrap to evaluate stability.
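A sketch of the bootstrap assessment for the importance rankings (toy data; B = 50 resamples and a top-20 list are arbitrary choices):

```r
library(randomForest)
set.seed(1)

## Toy data: 50 patients x 200 genes, 10 informative.
x <- matrix(rnorm(50 * 200), nrow = 50)
y <- factor(rep(c(1, 2), each = 25))
x[y == 2, 1:10] <- x[y == 2, 1:10] + 1

## How often does each gene land in the top 20 when the ranking
## is repeated on bootstrap samples of the patients?
B <- 50; top <- 20
counts <- numeric(ncol(x))
for (b in seq_len(B)) {
  idx <- sample(nrow(x), replace = TRUE)
  rf <- randomForest(x[idx, ], y[idx], ntree = 500, importance = TRUE)
  sel <- order(importance(rf, type = 1)[, 1], decreasing = TRUE)[1:top]
  counts[sel] <- counts[sel] + 1
}

## Per-gene selection probability, as in the figure that follows.
head(sort(counts / B, decreasing = TRUE), 10)
```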
Stability of variable importances
[Figure: selection probability (y) vs. rank of gene (x) for the Leukemia, Brain, Prostate, and Srbct data sets, with separate curves for membership in the top 20 and top 100 genes.]
Stability of backwards elimination
Data set        Error rate   # Vars   # Vars bootstrap    Freq. vars
---------------------------------------------------------------------
Leukemia          0.079         2       2 (2, 2)          0.44 (0.37, 0.51)
Breast 2 cl.      0.353         9       3 (2, 9)          0.13 (0.10, 0.25)
Breast 3 cl.      0.378        19       9 (5, 19)         0.15 (0.12, 0.25)
NCI 60            0.398        21      42 (21, 81)        0.35 (0.25, 0.50)
Adenocar.         0.209         3       3 (2, 5)          0.13 (0.07, 0.16)
Brain             0.210        11      11 (11, 26)        0.32 (0.27, 0.61)
Colon             0.192         7       2 (2, 7)          0.28 (0.23, 0.34)
Lymphoma          0.043        63      31 (7, 125)        0.45 (0.36, 0.53)
Prostate          0.071         2       3 (2, 5)          0.93 (0.89, 0.96)
Srbct             0.079        73      19 (19, 37)        0.30 (0.15, 0.51)
Discussion
The two approaches work as advertised.
We can recover relatively large sets of variables, even in the
presence of high correlations among variables.
We can perform aggressive variable selection, and obtain
very small sets of predictor variables with excellent
cross-validated prediction error.
But a few things might be improved:
“Sharpness” of the scree plots: make it easier to differentiate
important from unimportant genes.
Uniqueness (stability) of the outcome. Non-uniqueness
(multiple equally good solutions) might be a “feature” of the
data we need to live with: too many dimensions, too few
samples, an extremely low ratio of samples to variables
(genes), and a low signal-to-noise ratio. It is a problem that
plagues most (all?) other methods.
Acknowledgements and others
Andy Liaw, for discussions about random forests and for his R
package.
Computations carried out in parallel (using MPI, Rmpi and
Snow) on the 60 CPU cluster of the Bioinformatics Unit at
CNIO.
Partially supported by the Ramón y Cajal program of the
Spanish MCyT (Ministry of Science and Technology). Funding
provided by project TIC2003-09331-C02-02 of the Spanish
MCyT.
R package (and tech report) now available; see
http://ligarto.org/rdiaz/Papers/rfVS/rfVarSel/rfVarSel.html.