Variable selection from random forests: application to gene expression data

Ramón Díaz-Uriarte
[email protected]
http://ligarto.org/rdiaz
Unidad de Bioinformática, Centro Nacional de Investigaciones Oncológicas (CNIO) (Spanish National Cancer Center)
December 2004
© 2004 Ramón Díaz-Uriarte

Outline

Gene selection: two objectives.
Using random forest for gene selection.
What random forests are.
Scree plots: selection of important genes.
Backwards elimination using the OOB error: selection of minimal subsets with good predictive ability.
Gene selection with random forest: performance.
Gene selection with random forest: stability.

Gene selection: two objectives

Researchers often want to:
1. Obtain a (probably large) set of genes that are related to the outcome of interest; this set should include genes even if they perform similar functions and are highly correlated.
2. Obtain the smallest possible set of genes that can still achieve decent predictive performance (thus, "redundant" genes should not appear in the list).
With microarray data, interpretability is relevant: it matters which genes are selected, and how stable the selected sets of genes are.
Can we use a general-purpose classification algorithm to achieve the above goals with microarray data?

Random forest

Excellent performer in classification tasks (even when most putative predictor variables are noise).
No need to fine-tune parameters to achieve excellent performance.
Automatically incorporates interactions among predictor variables (since the base learners are classification trees).
Can be used when p ≫ n (many more variables than samples).
It does not overfit.
Can handle a mixture of categorical and continuous predictor variables.
Output is invariant to monotone transformations of the predictors.
High-quality, free implementations: the original Fortran code from L. Breiman and A. Cutler, and an R package (A. Liaw).

Random forest: variable importance

As part of the algorithm, random forest returns measures of variable importance.
These variable importance measures can be used to perform variable selection (see the code sketch after the details below).
The measure of importance used here is based on the decrease in classification accuracy when the values of a variable are permuted randomly.

Random forest: details (I)

An algorithm for classification that uses an ensemble of classification trees (i.e., CART; similar also to Quinlan's C4.5).
Each of the classification trees is built using a bootstrap sample of the data.
At each split, the candidate set of variables is a random subset of the variables, instead of all the variables.
(Thus, random forest uses both bagging [bootstrap aggregation] and random variable selection for tree building.)
Each tree is unpruned (grown fully) to obtain low-bias trees.

Random forest: details (II)

Bagging and random variable selection result in low correlation among the individual trees.
The algorithm therefore yields an ensemble that can achieve both low bias and low variance, from averaging over a large ensemble of low-bias, high-variance, but low-correlation trees.
(Recall: MSE = variance + bias².)
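To make the above concrete, here is a minimal R sketch of fitting a forest and extracting the permutation-based importances with the randomForest package (A. Liaw's implementation, mentioned above). The matrix x and labels y are made-up toy stand-ins, not data from this talk.

```r
library(randomForest)

## Toy data: 40 samples, 200 "genes" (stand-ins, not real expression data)
set.seed(1)
x <- matrix(rnorm(40 * 200), nrow = 40)
y <- factor(rep(c("A", "B"), each = 20))

## importance = TRUE requests the permutation-based importance measure
rf <- randomForest(x, y, ntree = 2000, importance = TRUE)

## type = 1: mean decrease in classification accuracy when the values
## of a variable are randomly permuted
imp <- importance(rf, type = 1)
head(imp[order(imp, decreasing = TRUE), , drop = FALSE])
```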
Gene selection: two objectives

Researchers often want to:
1. Obtain a (probably large) set of genes that are related to the outcome of interest; this set should include genes even if they perform similar functions and are highly correlated. Approach: scree plots.
2. Obtain the smallest possible set of genes that can still achieve decent predictive performance (thus, "redundant" genes should not appear in the list). Approach: backwards elimination using the OOB error.
With microarray data, interpretability is relevant: it matters which genes are selected, and how stable the selected sets of genes are.

Variable selection using "scree plots"

Plot the ordered variable importances from random forest (like the "scree plots" or "scree graphs" of PCA, or the "importance spectrums" of Friedman and Meulman).
Compare the observed plot with similar plots generated by random forest from data that conform to an appropriate "null hypothesis".
In our supervised case, compare with scree plots generated by random forest from data sets with permuted class labels (leaving intact the correlation structure of the predictors); a code sketch follows the data set list below.
This approach to gene selection is targeted towards the first objective above. In particular, we expect to be able to recover sets of highly correlated genes.

Scree plots for simulated data (I)

We used simulated data, so we know exactly which genes are relevant.
Classes of patients: 2 to 4.
Number of independent dimensions: 1 to 3.
Number of genes per dimension: 5, 20, 100.
Subjects per class: 25.
Each independent dimension has the same relevance for discrimination of the classes.
Data: multivariate normal distribution with a variance of 1 and a correlation of 0.9 among genes within a dimension (and a correlation of 0 between genes from different dimensions).
To each data set we added 2000 random normal variates (mean 0, variance 1) and 2000 random uniform [−1, 1] variates.

Scree plots for simulated data (II)

[Figure: scree plots (importance vs. variable rank) for four simulated scenarios: 4 classes with 1 component and 5 genes/component; 1 component and 20 genes/component; 2 components and 100 genes/component; 3 components and 100 genes/component.]

Scree plots for "real data" (I)

Leukemia: 3051 genes, 38 patients, 2 classes.
Breast: 4869 genes, 78 patients and 2 classes, or 96 patients and 3 classes.
Adenocarcinoma: 9868 genes, 76 patients, 2 classes.
NCI 60: 5244 genes, 61 patients, 8 classes.
Brain: 5597 genes, 42 patients, 5 classes.
Colon: 2000 genes, 62 patients, 2 classes.
Prostate: 6033 genes, 102 patients, 2 classes.
Lymphoma: 4026 genes, 62 patients, 3 classes.
Srbct: 2308 genes, 63 patients, 4 classes.
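As promised above, a sketch of the permuted-label null comparison (continuing with the toy x, y, and rf from the earlier sketch; the number of permutations here is illustrative):

```r
## Observed importance spectrum, ordered from largest to smallest
obs <- sort(importance(rf, type = 1)[, 1], decreasing = TRUE)
plot(obs, type = "l", xlab = "Variable (ordered)", ylab = "Importance")

## Null spectra: refit on data with permuted class labels, which
## leaves the correlation structure of the predictors intact
for (i in 1:20) {
  rf.null <- randomForest(x, sample(y), ntree = 2000, importance = TRUE)
  lines(sort(importance(rf.null, type = 1)[, 1], decreasing = TRUE),
        col = "grey")
}
```

Observed importances that rise clearly above the band of grey null spectra point to genes worth keeping.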
Scree plots for "real data" (II)

[Figure: scree plots (importance vs. variable rank) for NCI 60, Breast (2 classes), and Leukemia.]

Scree plots for "real data" (III)

[Figure: scree plots (importance vs. variable rank) for Brain, Colon, and Adenocarcinoma.]

Scree plots for "real data" (IV)

[Figure: scree plots (importance vs. variable rank) for Srbct, Prostate, and Lymphoma.]

Gene selection: two objectives

Researchers often want to:
1. Obtain a (probably large) set of genes that are related to the outcome of interest; this set should include genes even if they perform similar functions and are highly correlated. Approach: scree plots.
2. Obtain the smallest possible set of genes that can still achieve decent predictive performance (thus, "redundant" genes should not appear in the list). Approach: backwards elimination using the OOB error.

Backwards elimination using OOB error

Iteratively refit random forests:
Discard the variables with the smallest variable importances (at each step, eliminate the lower 50% of variables).
At the end, select the set of variables that yields the smallest out-of-bag (OOB) error rate.
Among those, we choose the solution with the smallest number of variables whose error rate is within 1 standard deviation of the minimum error rate of all forests.
The error rate of the procedure is estimated using the bootstrap (.632+ rule). (Because of the iterative approach, the OOB error of the selected forest is severely biased downwards and cannot be used to assess the overall error rate of the approach.)
(Variable importances are not recalculated at each step.) A sketch of the loop follows.
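A simplified R sketch of this elimination loop (continuing with x, y, and rf from the earlier sketches; the .632+ bootstrap estimate of the overall error is omitted):

```r
## Rank variables once, from the initial forest: importances are not
## recalculated at each elimination step
ranked <- order(importance(rf, type = 1)[, 1], decreasing = TRUE)

vars <- ranked
oob <- data.frame(n.vars = integer(0), err = numeric(0))

while (length(vars) >= 2) {
  fit <- randomForest(x[, vars, drop = FALSE], y, ntree = 2000)
  ## OOB error rate of this forest (last row of the err.rate matrix)
  err <- fit$err.rate[fit$ntree, "OOB"]
  oob <- rbind(oob, data.frame(n.vars = length(vars), err = err))
  vars <- vars[seq_len(ceiling(length(vars) / 2))]  # drop the lower 50%
}

oob  # then keep the smallest set within 1 SD of the minimum OOB error
```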
Backwards elimination: results

Data set       Error rate   # Vars (genes)
Leukemia       0.079        2
Breast 2 cl.   0.353        9
Breast 3 cl.   0.378        19
NCI 60         0.398        21
Adenocar.      0.209        3
Brain          0.210        11
Colon          0.192        7
Lymphoma       0.043        63
Prostate       0.071        2
Srbct          0.079        73

Stability of results

The results are good, very good... but:
Instability, or non-uniqueness, of results (the "Rashomon effect", sensu Breiman) is a widespread problem with microarray data (reviewed by Somorjai, Dolenko & Baumgartner, 2003): many equally good (or equally excellent) solutions.
This makes biological interpretation difficult.
We use the bootstrap to evaluate stability.

Stability of variable importances

[Figure: selection probability vs. rank of gene, for the top 20 and top 100 genes, in Leukemia, Brain, Prostate, and Srbct.]

Stability of backwards elimination

Data set       Error rate   # Vars   # Vars bootstrap   Freq. vars
Leukemia       0.079        2        2 (2, 2)           0.44 (0.37, 0.51)
Breast 2 cl.   0.353        9        3 (2, 9)           0.13 (0.10, 0.25)
Breast 3 cl.   0.378        19       9 (5, 19)          0.15 (0.12, 0.25)
NCI 60         0.398        21       42 (21, 81)        0.35 (0.25, 0.50)
Adenocar.      0.209        3        3 (2, 5)           0.13 (0.07, 0.16)
Brain          0.210        11       11 (11, 26)        0.32 (0.27, 0.61)
Colon          0.192        7        2 (2, 7)           0.28 (0.23, 0.34)
Lymphoma       0.043        63       31 (7, 125)        0.45 (0.36, 0.53)
Prostate       0.071        2        3 (2, 5)           0.93 (0.89, 0.96)
Srbct          0.079        73       19 (19, 37)        0.30 (0.15, 0.51)

Discussion

The two approaches work as advertised:
We can recover relatively large sets of variables, even in the presence of high correlations among variables.
We can perform aggressive variable selection and obtain very small sets of predictor variables with excellent cross-validated prediction error.
But a few things might be improved:
"Sharpness" in the scree plots: make it easier to differentiate important from unimportant genes.
Uniqueness (stability) of the outcome. Non-uniqueness (multiple equally good solutions) might be a "feature" of the data we need to live with: too many dimensions, too few samples, an extremely low ratio of number of samples to number of variables (genes), and a low signal-to-noise ratio. It is a problem that plagues most (all?) other methods.

Acknowledgements and others

Andy Liaw, for discussions about random forest and for his R package.
Computations were carried out in parallel (using MPI, Rmpi and Snow) on the 60-CPU cluster of the Bioinformatics Unit at CNIO.
Partially supported by the Ramón y Cajal programme of the Spanish MCyT (Ministry of Science and Technology).
Funding provided by project TIC2003-09331-C02-02 of the Spanish MCyT.
R package (and tech report) available now; see http://ligarto.org/rdiaz/Papers/rfVS/rfVarSel/rfVarSel.html.