A new variable importance measure for random forests with missing data
Alexander Hapfelmeier
Institut für Medizinische Statistik und Epidemiologie, TU München
July 23, 2011
Outline

1. Recursive partitioning
   - Classification and Regression Trees
   - Random Forests
   - Variable Importance Measures
2. New Approach
3. Simulation studies and analysis settings
   - Simulation
   - Results
4. Applications
   - Data
   - Results
5. Conclusion
Recursive partitioning / Classification and Regression Trees
Construction rules
Example: regression tree analysis of the airquality data.

Split criteria:
- Gini gain (measures impurity of binary responses)
- ΔRSS (continuous responses)
- permutation test framework (conditional inference trees; see the sketch below)
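A minimal sketch of the last point, not from the slides, assuming the party package and the built-in airquality data:

```r
## Fitting a conditional inference tree to the airquality data.
## ctree() selects splits within a permutation test framework and
## stops when no significant split is left.
library(party)

airq <- subset(airquality, !is.na(Ozone))  # the response must be observed
fit  <- ctree(Ozone ~ ., data = airq)
plot(fit)                                  # draws the fitted regression tree
```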
Tree size

Stop criteria:
- number of observations required for further splitting or to be found in end nodes
- no significant splits left (conditional inference trees)

Pruning (CART):
- measure performance at different growth stages (cross-validation)
- apply the 1-s.e. rule (see the sketch below)
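For CART, the cross-validated cp table supports this procedure; a sketch assuming the rpart package (illustrative, not from the slides):

```r
## Cost-complexity pruning with the 1-s.e. rule in rpart.
library(rpart)

set.seed(1)
fit <- rpart(Ozone ~ ., data = airquality)  # stores a cross-validated cp table
cp  <- fit$cptable

## the smallest tree whose CV error lies within one s.e. of the minimum
thresh <- min(cp[, "xerror"]) + cp[which.min(cp[, "xerror"]), "xstd"]
best   <- cp[which(cp[, "xerror"] <= thresh)[1], "CP"]
pruned <- prune(fit, cp = best)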
Recursive partitioning / Random Forests
Construction rules
- Trees are fit to data samples (e.g. bootstrap samples) ≡ bagging.
- Splits are chosen within random samples of the variables (e.g. √p of them).
- Trees are grown to full size.
Example: [Figure: three individual trees of a random forest fitted to the airquality data; each tree splits on different variables (Temp, Wind, Solar.R, Month, Day) and at different cut points.]
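A forest like the one in the example can be grown as follows (a sketch assuming the party package; the ntree and mtry values are illustrative choices, not from the slides):

```r
## A random forest of conditional inference trees.
library(party)

set.seed(42)
airq   <- subset(airquality, !is.na(Ozone))
forest <- cforest(Ozone ~ ., data = airq,
                  controls = cforest_unbiased(ntree = 500, mtry = 2))
predict(forest, OOB = TRUE)[1:5]   # averaged tree predictions (out-of-bag)
```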
Rationale
Forest predictions are given by majority votes (classification) or averaged values (regression) of the tree predictions.

Diversity of trees ⇒ improved prediction accuracy:
- hard-cut decision boundaries are smoothed out ⇒ reduced variance of the prediction
- the piecewise constant prediction function approaches smoother functions
- dominated variables can reveal their prediction strength (importance) and new interaction effects.

[Figure: prediction functions of two single trees and the forest; averaging the step functions of the trees yields a smoother curve.]
Missing Data and Surrogate splits
Surrogate splits:
- mimic the primary split
- try to achieve the same partitioning of the observed values
- are ranked according to this similarity.

Observations are processed along this order until a decision is found (see the sketch below).

[Figure: a primary split in X1 at x1 and a surrogate split in X2 at x2 that produces a similar partitioning.]

Missing value generating processes:
- missing completely at random (MCAR)
- missing at random (MAR)
- missing not at random (MNAR)
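A sketch of surrogate splits in CART, assuming the rpart package (the airquality data contain missing values in Ozone and Solar.R, so surrogates decide where such observations go):

```r
library(rpart)

## usesurrogate = 2 sends observations with a missing primary split
## variable along the ranked surrogate splits.
fit <- rpart(Ozone ~ ., data = airquality,
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5))
summary(fit)   # lists the ranked surrogate splits for each primary split
```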
Recursive partitioning / Variable Importance Measures
Permutation accuracy importance
- measures the relevance of a variable.
- compares the accuracy computed on the original data with the accuracy computed on data where the variable of concern is randomly permuted (see the sketch below).

Special characteristics:
- sensitive to the relations of a variable to the outcome and to the remaining variable space (correlations).
- The permutation scheme implicitly checks for H0: (Y, Z) ⊥ Xj.
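A minimal sketch, assuming the party package and the complete cases of the airquality data (matching the example that follows):

```r
## Permutation accuracy importance for a cforest.
library(party)

set.seed(42)
cc     <- airquality[complete.cases(airquality), ]
forest <- cforest(Ozone ~ ., data = cc,
                  controls = cforest_unbiased(ntree = 500, mtry = 2))
varimp(forest)   # mean decrease in accuracy after permuting each predictor
```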
Example
[Figure: permutation importance for the airquality data, complete case analysis; bars for Solar.R, Wind, Temp, Month and Day on a scale from 0 to 600.]
Shortcomings
R produces an error message for missing values ("cannot compute variable importance measure with missing values"):
⇒ the original VI was not constructed to handle missing values.
⇒ missing values cause undesired and unmanageable effects.

Surrogates are used to compute the accuracy but are not part of the permutation scheme. The algorithm therefore fails to check for H0: (Y, Z) ⊥ Xj.
New Approach

Solution: still simulate the null hypothesis, but
- do not permute raw data values
- randomly allocate observations to the child nodes (see the sketch below).

Thus the child node allocation follows P_k(D | X_j) = P_k(D); in practice the empirical estimator (relative frequency) p̂_k(D = 0) = n_{k,left} / n_k is used.

⇒ detaches any decision from the raw values.
⇒ circumvents any problems with missing values and surrogate splits.
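A conceptual sketch of the random allocation step (my own illustration, not the author's implementation): observations reaching a node that splits on X_j are sent to the children at random, with probability equal to that node's observed relative frequency n_{k,left} / n_k, so no value of X_j, observed or missing, is needed.

```r
random_child <- function(n_obs, n_left, n_node) {
  runif(n_obs) < n_left / n_node   # TRUE = send the observation left
}

set.seed(1)
## a node that sent 30 of its 100 learning observations to the left:
random_child(n_obs = 10, n_left = 30, n_node = 100)
```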
Comparison of algorithms
The new measure sticks closely to the existing methodology, with one substantial difference:
1. Compute the accuracy of a tree.
2. Original approach: permute the predictor variable of interest.
   New approach: randomly assign observations to the child nodes.
3. Recompute the accuracy of the tree.
4. Compute the difference.
5. Repeat steps 1 to 4 for each tree and use the average.
Requirements
List of requirements to be met by a sensible measure:
(R1) Ranking_new = Ranking_old with 0% missing values.
(R2) number of missing values ↑ ⇒ VI ↓.
(R3) correlation ↑ ⇒ VI ↑.
(R4) The VI ranking of variables not containing missing values is supposed to stay unaffected; this is only required within groups of correlated variables.
(R5) VI_influential > VI_non-influential within correlated variables with the same amount of missing values.
Simulation studies and analysis settings / Simulation
Rationale
Investigated effects:
1. fraction of missing values
2. correlation strength
3. block size
4. parameter strength
5. regression and classification problems
6. missing data generating process
Settings
1. m ∈ {0%, 10%, 20%, 30%}.
2. r ∈ {0, .3, .6, .9}.
3. Block size: given by the coefficient blocks in point 4.
4. β = (4, 4, 3 | 4, 3 | 4, 3 | 4, 3, 0, 0 | 2 | 2 | 0 | 0 | ... | 0)^T, with the bars delimiting the correlated blocks I, II, III, IV, V, VI, VII, VIII, ..., XIII.
5. Regression: y = x^T β + ε with ε ~ N(0, .5). Classification: P(Y = 1 | X = x) = e^{x^T β} / (1 + e^{x^T β}).
6. MCAR, MAR (rank, median, upper, margins), MNAR (upper).

Missing value pattern:
- contain missing values (MCAR, MAR & MNAR): X2, X4, X8, X10, X12, X14
- determine missing values (MAR): X3, X5, X9, X11, X13, X15
- determine missing values (MNAR): X2, X4, X8, X10, X12, X14

→ 192 simulation settings! (A data-generating sketch follows below.)
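A minimal sketch of the regression setting; n = 100, an error variance of .5, and p = 15 variables covering blocks I to VIII are assumptions on my part, since the slide does not fix these details:

```r
## Block-correlated normal predictors, linear model, MCAR missing values.
library(MASS)   # mvrnorm()

set.seed(1)
n <- 100; p <- 15; r <- .6; m <- .2
beta <- c(4, 4, 3,  4, 3,  4, 3,  4, 3, 0, 0,  2, 2, 0, 0)

## within-block correlation r for blocks I (1:3), II (4:5), III (6:7), IV (8:11)
Sigma <- diag(p)
for (b in list(1:3, 4:5, 6:7, 8:11))
  Sigma[b, b] <- r + (1 - r) * diag(length(b))

X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
y <- as.vector(X %*% beta) + rnorm(n, sd = sqrt(.5))

## MCAR: delete a fraction m of the values of X2, X4, X8, X10, X12, X14
for (j in c(2, 4, 8, 10, 12, 14))
  X[sample(n, round(m * n)), j] <- NA
```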
Simulation studies and analysis settings / Results
Requirement R1
R1: Ranking_new = Ranking_old when m = 0%.

[Figure: scatterplot of the new approach against the original permutation importance at m = 0%; the points follow a straight line.]

R1 is met. The slope differs from 1 (slope ≠ 1).
Requirement R2
R2: number of missing values ↑ ⇒ VI ↓.

[Figure: VI of variables 2, 4, 8, 10, 12 and 14 (coefficients 4, 4, 4, 0, 2, 0; blocks I, II, IV, IV, V, VII) for r = .6 and r = .9 as m grows from 0% to 30%; the VI decreases with an increasing fraction of missing values.]

R2 is met. Block size affects the VI.
Requirement R3
R3: correlation r ↑ ⇒ VI ↑.

[Figure: VI of variables 1 to 11 (blocks I, II and IV) for r ∈ {0, .3, .6, .9}, at m = 0% and m = 30%.]

R3 is met. Block size: the VI only benefits from correlations with influential variables.
Interaction of R2 and R3
The effects of correlation and missing values interact:

[Figure: VI of variables 1 to 7 (blocks I, II and III) for r = .0 and r = .9, at m = 0% and m = 30%.]

- The VI of variable 2 drops in relation to the variables of the same block;
- it is replaced by other variables.

→ Selection frequencies: the more similar the information of the variables becomes (increased correlation) and the more information a variable is lacking (missing values), the more often it will be replaced by others.
Requirement R4
R4: The VI ranking of variables not containing missing values is supposed to stay unaffected; this is only required within blocks.

[Figure: VI of variables 1 to 7 (blocks I, II and III) for r = .0 and r = .9, at m = 0% and m = 30%.]

R4 is met. Variables 5 and 7: between blocks the variable rankings can change.
Requirement R5
R5: VI_influential > VI_non-influential within the same block and with the same amount of missing values.

[Figure: VI of variables 8 to 15 (blocks IV to VIII) for r = .6 and r = .9, at m = 20% and m = 30%.]

R5 is met: VI_8 > VI_10 and VI_9 > VI_11. Note also VI_11 > VI_8 and VI_11 > VI_13.

→ The VI is affected by:
- the amount of missing values
- the number and strength of correlated variables
- the strength of correlation.

⇒ These are properties of a marginal VI which are expected and often desired.
Comparison of missing data generating processes
[Figure: six panels showing the VI of variables 1 to 15 at m = 20% and r = .6 for the different missing data generating processes: (a) MCAR, (b) MAR(rank), (c) MAR(median), (d) MAR(upper), (e) MAR(margins), (f) MNAR(upper).]
Applications / Data
Data Sets
Pima Indians Diabetes Data Set:
- n = 768
- p = 8
- n_mis = 5, 35, 227, 374, 11 (0.7%, 4.6%, 29.6%, 48.7%, 1.4%)
- n_mis.obs = 376 (49.0%)

Mammal Sleep Data:
- n = 62
- p = 9
- n_mis = 4, 4, 4, 12, 14 (6.5%, 6.5%, 6.5%, 19.4%, 22.6%)
- n_mis.obs = 20 (32.3%)

(Possible R sources for both data sets are sketched below.)
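The slides do not name the data sources used; the following packages contain versions with the missing values intact (an assumption on my part):

```r
library(mlbench)                       # PimaIndiansDiabetes2 keeps the NAs
data(PimaIndiansDiabetes2)
colSums(is.na(PimaIndiansDiabetes2))   # e.g. 227 NAs in triceps, 374 in insulin

library(VIM)                           # mammal sleep data with NAs
data(sleep, package = "VIM")
colSums(is.na(sleep))
```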
Applications / Results
[Figure: variable importance for the Pima Indians Diabetes data (num.preg, gluc, bloodpres, skin, insulin, bmi, pedigree, age) and the Mammal Sleep data (BrainWgt, NonD, Dream, Sleep, Span, Gest, Pred, Exp, Danger); original permutation importance vs. new approach.]

Black: original VI, complete case analysis.
Grey: new VI, entire data.

⇒ Complete case analysis can induce bias if the data are not MCAR.
Check for possible effects of CC analysis
Analysis setting: X ~ B(.2) and Z ~ B(.2), Bernoulli with n = 5000 and X ⊥ Z, and

Y ~ N(2, 1)   if (x, z) = (1, 0)
Y ~ N(0, 1)   if (x, z) = (0, 0) or (x, z) = (1, 1)
Y ~ N(−2, 1)  if (x, z) = (0, 1)

(A sketch of this setting follows below.)

[Figure: VI of X and Z for the original permutation importance and the new approach, computed on (i) the entire data without missing values and (j) data with 30% missing values in Z (MAR(upper) induced by Y).]
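A minimal sketch of the setting; my reconstruction assumes that MAR(upper) deletes Z for the largest values of Y:

```r
set.seed(1)
n <- 5000
x <- rbinom(n, 1, .2)
z <- rbinom(n, 1, .2)                        # X and Z are independent
mu <- ifelse(x == 1 & z == 0,  2,
      ifelse(x == 0 & z == 1, -2, 0))        # N(0, 1) for (0,0) and (1,1)
y  <- rnorm(n, mean = mu)

## 30% missing values in Z, induced by the upper values of Y
z[order(y, decreasing = TRUE)[1:round(.3 * n)]] <- NA
```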
Conclusion

Key benefits:
- The simulation showed that all requirements were fulfilled in each simulation setting.
- Real data applications and a small simulation showed that the new measure is superior to CC analysis, as it
  - deals with missing data that is not MCAR,
  - uses the entire data without the need to reject information, and
  - is able to deal with missing values.
Bibliography

Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and Regression Trees. Chapman & Hall/CRC.

Hastie, T., R. Tibshirani, and J. H. Friedman (2009). The Elements of Statistical Learning (Corrected ed.). Springer.

Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning. Journal of Computational and Graphical Statistics 15(3), 651-674.

Strobl, C., J. Malley, and G. Tutz (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14(4), 323-348.
Thank you for your attention!