
Mathematics of Random Forests
1. Probability: Chebyshev inequality

Theorem 1 (Chebyshev inequality): If $X$ is a random variable with standard deviation $\sigma$ and mean $\mu$, then for any $\epsilon > 0$,
$$P(|X - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}.$$
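The bound can be checked numerically. Below is a minimal Monte Carlo sketch (not from the slides; the distribution, sample size, and $\epsilon$ are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check of Chebyshev's inequality for an illustrative distribution.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, standard deviation 1
mu, sigma = X.mean(), X.std()
eps = 2.0

tail_prob = np.mean(np.abs(X - mu) >= eps)       # empirical P(|X - mu| >= eps)
bound = sigma**2 / eps**2                        # Chebyshev bound sigma^2 / eps^2
print(f"P(|X - mu| >= {eps}) ~ {tail_prob:.4f}  <=  bound {bound:.4f}")
```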
Probability background

Theorem 2 (Bounded convergence theorem): Given a sequence $h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots$ of functions with $|h_k(\mathbf{x})| \le M$ (for a fixed $M > 0$) defined on a space $S$ of finite measure, then
$$\lim_{k \to \infty} \int_S d\mathbf{x}\, h_k(\mathbf{x}) = \int_S d\mathbf{x} \lim_{k \to \infty} h_k(\mathbf{x}),$$
i.e., the limit and the integration can be interchanged (assuming the limits exist).
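For example (a standard illustration, not from the slides): take $h_k(x) = x^k$ on $S = [0, 1]$, so $|h_k(x)| \le 1$. Then $\int_0^1 x^k\, dx = \frac{1}{k+1} \to 0$, which agrees with $\int_0^1 \lim_{k \to \infty} x^k\, dx = 0$, since $x^k \to 0$ for every $x < 1$ and the single point $x = 1$ has measure zero.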
Recall:

Def. 1 (Indicator function): For any event $A \subseteq \Omega$ of the sample space, define the indicator function (also known as the characteristic function) of $A$ to be
$$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise.} \end{cases}$$
2. Classification trees

Assume we have $n$ patients (samples), with (e.g., mass spectroscopy) feature vectors $\{\mathbf{x}_i\}_{i=1}^{n}$ and outcomes $y_i$.

Data:
$$D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}.$$
Each feature vector is
$$\mathbf{x}_k = (x_{k1}, \ldots, x_{kd}).$$
Classification trees

Formal definitions:

Definition 2: A classification tree is a decision tree in which each node makes a binary decision based on whether $x_i \ge a$ or not, for a fixed threshold $a$ (which can depend on the node).
The top node contains all of the examples $(\mathbf{x}_k, y_k)$, and the set of examples is subdivided among the children of each node according to the decision at that node.

The subdivision of examples continues until every node at the bottom contains examples from one class only.

At each node, the feature $x_i$ and the threshold $a$ are chosen to minimize the resulting 'diversity' in the children nodes. This diversity is often measured by the Gini criterion; see below.
Gini criterion

The subdivision continues until every node at the bottom has only one class (disease or normal) in it, which is assigned as the prediction for an input $\mathbf{x}$ reaching that node.

Gini criterion: Define class $C_1$ = disease and $C_2$ = normal. How do we measure the variation of the samples in a node with respect to these two classes?

Suppose there are 2 classes $C_1, C_2$ and we have examples in a set $S$ at our current node. Now, to create child nodes, partition $S = S_1 \cup S_2$.
(Note that each subset $S_1, S_2$ is itself partitioned into the two classes $C_1, C_2$.)

Recall that $|S| = \#$ objects in the set $S$.

Define
$$\hat{P}(S_j) = \frac{|S_j|}{|S|} = \text{proportion of } S_j \text{ in } S,$$
$$\hat{P}(C_i \mid S_j) = \frac{|S_j \cap C_i|}{|S_j|} = \text{proportion of } S_j \text{ which is in } C_i.$$
Define the variation $g(S_j)$ in set $S_j$ to be
$$g(S_j) = \sum_{i=1}^{2} \hat{P}(C_i \mid S_j)\bigl(1 - \hat{P}(C_i \mid S_j)\bigr).$$

Note: the variation $g(S_j)$ is largest when the set $S_j$ is equally divided among the $C_i$; it is smallest when all of $S_j$ lies in just one of the $C_i$.

We define the variation of this full subdivision of $S$ into the $S_j$ to be the Gini index $G$, where
$$G = \hat{P}(S_1)\, g(S_1) + \hat{P}(S_2)\, g(S_2) = \text{weighted sum of the variations } g(S_1), g(S_2).$$
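These formulas are easy to put into code. The sketch below (illustrative only; the function names are mine, not the slides') computes $g(S_j)$, the Gini index $G$ of a candidate split, and scans one feature for the threshold $a$ that minimizes $G$:

```python
import numpy as np

def gini_variation(y):
    """g(S_j): sum over classes of p_i * (1 - p_i), for the labels y in one node."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(np.sum(p * (1.0 - p)))

def gini_index(y_left, y_right):
    """G: size-weighted sum of the variations of the two child nodes."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini_variation(y_left) \
         + (len(y_right) / n) * gini_variation(y_right)

def best_threshold(x_feature, y):
    """Scan candidate thresholds a for one feature, minimizing the Gini index."""
    x_feature, y = np.asarray(x_feature), np.asarray(y)
    best_a, best_g = None, np.inf
    for a in np.unique(x_feature):
        left, right = y[x_feature < a], y[x_feature >= a]
        g = gini_index(left, right)
        if g < best_g:
            best_a, best_g = a, g
    return best_a, best_g
```

For example, best_threshold(X[:, i], y) would return, for feature $i$, the threshold with the smallest weighted variation in the child nodes.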
3. Random vectors

A random vector
$$\mathbf{X} = (X_1, \ldots, X_d)$$
is an array of random variables defined on the same probability space.

Given $\mathbf{X}$ as above, define its distribution (or the joint distribution of $X_1, \ldots, X_d$) to be the measure $\mu$ on $\mathbb{R}^d$ defined by
$$\mu(A) \equiv P(\mathbf{X} \in A)$$
for any measurable set $A \subseteq \mathbb{R}^d$.
Ex. 2: Consider rolling 2 dice. Let $X_1$ be the number on the first die and $X_2$ the number on the second die. Then the probability space is
$$S = \{\text{all ordered pairs of die rolls}\},$$
with
$$X_1 = \text{first roll}; \quad X_2 = \text{second roll},$$
i.e., if $s \in S$ is given by $s = (3, 4)$, then
$$X_1(s) = 3; \qquad X_2(s) = 4.$$
The random vector $(X_1, X_2)$ satisfies
$$(X_1, X_2)(s) = (3, 4).$$
Ex. 3: Let $\mathbf{x} = (x_1, \ldots, x_d)$ be a microarray of a glioma cancer sample in its initial stages. Then each feature $x_i$ is a random variable with some distribution.

For fixed $i$, let $X_i$ be a model random variable whose probability distribution is the same as that of the $x_i$ values in the microarray.

Then the model random vector $\mathbf{X} = (X_1, \ldots, X_d)$ has a joint distribution which is the same as that of our microarray samples $\mathbf{x} = (x_1, x_2, \ldots, x_d)$.
For a microarray $\mathbf{x}$, let $y$ be the classification of the cancer ($y = -1$ means malignant; $y = 1$ means benign). Then we have another random vector
$$(x_1, \ldots, x_d, y) = (\mathbf{x}, y),$$
with the same distribution as a model vector $(\mathbf{X}, Y)$.

If the distribution of the random vector $(\mathbf{x}, y)$ is given by the model random vector $(\mathbf{X}, Y)$, we write
$$(\mathbf{x}, y) \sim (\mathbf{X}, Y).$$
4. Random forest: formal definition

Assume a training set of microarrays
$$D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$$
drawn randomly from a (possibly unknown) probability distribution, $(\mathbf{x}_i, y_i) \sim (\mathbf{X}, Y)$.

Goal: build a classifier which predicts $y$ from $\mathbf{x}$ based on the data set of examples $D$.

Given: an ensemble of (possibly weak) classifiers
$$h = \{h_1(\mathbf{x}), \ldots, h_K(\mathbf{x})\}.$$
If each $h_k(\mathbf{x})$ is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier $h_k(\mathbf{x})$ to be
$$\Theta_k = (\theta_{k1}, \theta_{k2}, \ldots, \theta_{kp})$$
(these parameters include the structure of the tree, which variables are split at which nodes, etc.).
We sometimes write
$$h_k(\mathbf{x}) = h(\mathbf{x} \mid \Theta_k).$$
Thus decision tree $k$ leads to the classifier $h_k(\mathbf{x}) = h(\mathbf{x} \mid \Theta_k)$.

How do we choose which features appear in which nodes of the $k$-th tree? At random, according to the parameters $\Theta_k$, which are randomly chosen from a model random vector $\Theta$.
Definition 1: A random forest is a classifier based on a family of classifiers $h(\mathbf{x} \mid \Theta_1), \ldots, h(\mathbf{x} \mid \Theta_K)$, each based on a classification tree with parameters $\Theta_k$ randomly chosen from a model random vector $\Theta$.

For the final classification $f(\mathbf{x})$ (which combines the classifiers $\{h_k(\mathbf{x})\}$), each tree casts a vote for the most popular class at input $\mathbf{x}$, and the class with the most votes wins.

Specifically, given data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, we train a family of classifiers $h_k(\mathbf{x})$.
Each classifier $h_k(\mathbf{x}) \equiv h(\mathbf{x} \mid \Theta_k)$ is in our case a predictor of
$$y = \pm 1 = \text{outcome associated with input } \mathbf{x}.$$
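The voting rule in Definition 1 is easy to state in code. A minimal sketch (the function name is mine, not the slides'): given the votes $h_1(\mathbf{x}), \ldots, h_K(\mathbf{x})$ of the $K$ trees at an input $\mathbf{x}$, the forest outputs the majority class.

```python
from collections import Counter

def forest_predict(tree_votes):
    """Combine the votes h_1(x), ..., h_K(x) into the final classification f(x).

    tree_votes: list with one class label per tree (e.g. +1 or -1).
    Returns the class with the most votes (ties broken arbitrarily).
    """
    return Counter(tree_votes).most_common(1)[0][0]

# Example: 5 trees vote on one input x.
print(forest_predict([+1, -1, +1, +1, -1]))   # -> 1
```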
Examples

Example 4: The parameter $\Theta$ of a tree determines a random subset $D_\Theta$ of the full data set $D$, i.e., we only choose a sub-collection of the feature vectors $\mathbf{x} = (x_1, \ldots, x_d)$.

So $D_\Theta$ is a subset of
$$D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\} = \text{full data set}.$$

Thus the parameter $\Theta_k$ (for classification tree $k$) determines which subset of the full data set $D$ we choose for the tree $h_k(\mathbf{x}) = h(\mathbf{x} \mid \Theta_k)$.
Then the ensemble of classifiers (now an RF) consists of trees, each of which sees a different subset of the data.

Example 5: $\Theta$ determines a subset $\mathbf{x}_\Theta$ of the full set of features $\mathbf{x} = (x_1, \ldots, x_d)$. Then
$$h(\mathbf{x} \mid \Theta_k) = \text{classification tree using the subset } \mathbf{x}_\Theta \text{ of the entries of the full feature vector } \mathbf{x}$$
[dimension reduction].
In data mining (where the dimension $d$ is very high) this situation is common.
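To make Examples 4 and 5 concrete, here is a brief sketch of drawing a per-tree parameter $\Theta_k$ as a random data subset plus a random feature subset (illustrative only; the sampling choices, such as drawing with replacement and the number of features, are my own and are not specified in the slides):

```python
import numpy as np

def draw_theta(n_samples, n_features, n_sub_features, rng):
    """Draw Theta_k: which rows of D and which features the k-th tree sees.

    Example 4: a random subset D_Theta of the data (here, sampled with replacement).
    Example 5: a random subset x_Theta of the features (dimension reduction).
    """
    data_subset = rng.choice(n_samples, size=n_samples, replace=True)
    feature_subset = rng.choice(n_features, size=n_sub_features, replace=False)
    return data_subset, feature_subset

rng = np.random.default_rng(0)
theta_k = draw_theta(n_samples=100, n_features=20, n_sub_features=4, rng=rng)
print(theta_k[1])   # the feature indices the k-th tree is allowed to use
```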
5. General ensemble methods - properties

Given a fixed ensemble
$$h = (h_1(\mathbf{x}), \ldots, h_K(\mathbf{x}))$$
of classifiers with random data vector $(\mathbf{x}, y)$:

If $A$ is any outcome for a classifier $h_k(\mathbf{x})$ in the ensemble, we define
$$\hat{P}(A) = \text{proportion of classifiers } h_k \ (1 \le k \le K) \text{ for which event } A \text{ occurs} = \text{empirical probability of } A.$$
Ensemble methods: definitions

Define the empirical margin function by
$$\hat{m}(\mathbf{x}, y) \equiv \hat{P}_k\bigl(h_k(\mathbf{x}) = y\bigr) - \max_{j \ne y} \hat{P}_k\bigl(h_k(\mathbf{x}) = j\bigr) \qquad (1)$$
= average margin of the ensemble of classifiers
= extent to which the average number of votes for the correct class exceeds the average number of votes for the next-best class
≈ confidence in the classifier.
Definition 2: The generalization error of the classifier ensemble $h$ is
$$e = P_{\mathbf{x},y}\bigl(\hat{m}(\mathbf{x}, y) < 0\bigr).$$
[The subscript $\mathbf{x}, y$ indicates that the probability is measured in $(\mathbf{x}, y)$ space, i.e., $(\mathbf{x}, y)$ is viewed as the random variable.]
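A short sketch (illustrative; it assumes the per-tree votes are already available for a held-out sample) of computing the empirical margin (1) and estimating the generalization error $e$:

```python
import numpy as np

def empirical_margin(votes, y, classes=(-1, +1)):
    """m_hat(x, y): vote fraction for the true class y minus the largest vote
    fraction among the other classes."""
    votes = np.asarray(votes)
    frac = {c: float(np.mean(votes == c)) for c in classes}
    return frac[y] - max(frac[c] for c in classes if c != y)

def generalization_error(all_votes, y_true, classes=(-1, +1)):
    """Estimate e = P_{x,y}(m_hat(x, y) < 0) as the fraction of held-out points
    with negative empirical margin. all_votes holds K votes per point."""
    margins = [empirical_margin(v, y, classes) for v, y in zip(all_votes, y_true)]
    return float(np.mean(np.array(margins) < 0))

# Example: 3 held-out points, K = 5 trees each.
all_votes = [[+1, +1, -1, +1, -1], [-1, -1, -1, +1, -1], [+1, -1, -1, -1, -1]]
print(generalization_error(all_votes, y_true=[+1, -1, +1]))   # -> 0.333...
```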
Theorem 1: As $K \to \infty$ (i.e., as the number of trees increases),
$$e \;\longrightarrow\; P_{\mathbf{x},y}\Bigl[ P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = y\bigr) - \max_{j \ne y} P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = j\bigr) < 0 \Bigr]. \qquad (2)$$
(Note that $P_{\mathbf{x},y}$ denotes probability as $\mathbf{x}, y$ varies; similarly for $P_\Theta$.)
Proof: Note that
$$\hat{P}_k\bigl(h_k(\mathbf{x}) = j\bigr) = E_k\bigl[I\bigl(h_k(\mathbf{x}) = j\bigr)\bigr] \equiv \frac{1}{K} \sum_{k=1}^{K} I\bigl[h_k(\mathbf{x}) = j\bigr],$$
where $E_k$ denotes the average over $k$, and $h_k(\mathbf{x}) \equiv h(\mathbf{x} \mid \Theta_k)$. Recall $I(A) = 1$ if $A$ occurs and $0$ otherwise.

Claim: for our random sequence $\Theta_1, \Theta_2, \ldots$ and all data vectors $\mathbf{x}$, it suffices to show
$$\frac{1}{K} \sum_{k=1}^{K} I\bigl[h(\mathbf{x} \mid \Theta_k) = j\bigr] \;\to\; P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = j\bigr). \qquad (3)$$

Why? Let $g_K(\mathbf{x}, y) \equiv$ the RHS of (1) and $g(\mathbf{x}, y) =$ the quantity in brackets in (2). Then note that for each $\mathbf{x}, y$, if (3) holds we would have
$$g_K(\mathbf{x}, y) \to g(\mathbf{x}, y). \qquad (4)$$
Note also that
$$P_{\mathbf{x},y}\bigl(g_K(\mathbf{x}, y) < 0\bigr) = E_{\mathbf{x},y}\bigl[I\bigl(g_K(\mathbf{x}, y) < 0\bigr)\bigr],$$
so by the bounded convergence theorem and (4),
$$P_{\mathbf{x},y}\bigl(g_K(\mathbf{x}, y) < 0\bigr) \;\longrightarrow\; P\bigl(g(\mathbf{x}, y) < 0\bigr) \quad \text{as } K \to \infty,$$
thus proving the theorem. So we need only prove (3).

To prove (3): for a fixed training set and a tree with parameter $\Theta$, the set of $\mathbf{x}$ with $h(\mathbf{x} \mid \Theta) = j$ is a union of boxes
$$B \equiv I_1 \times \cdots \times I_d = \{\mathbf{x} \mid x_i \in I_i\}$$
for a fixed collection of intervals $\{I_i\}_{i=1}^{d}$.

Assume a finite number of models $\Theta$ for $h(\mathbf{x} \mid \Theta)$ (e.g., a finite number of sample subsets and a finite number of feature-space subsets). Then there exist only finitely many such unions of boxes; call them $S_1, \ldots, S_R$.

Define
$$\varphi(\Theta) = r \quad \text{if } \{\mathbf{x} : h(\mathbf{x} \mid \Theta) = j\} = S_r.$$
Let
$$N_r = \# \text{ of indices } k \ (1 \le k \le K) \text{ with } \varphi(\Theta_k) = r.$$
Then
$$\frac{1}{K} \sum_{k=1}^{K} I\bigl(h(\mathbf{x} \mid \Theta_k) = j\bigr) = \frac{1}{K} \sum_{r} N_r\, I(\mathbf{x} \in S_r).$$
By the law of large numbers,
$$\frac{1}{K} N_r = \frac{1}{K} \sum_{k=1}^{K} I\bigl(\varphi(\Theta_k) = r\bigr)$$
converges with probability 1 to
$$E_\Theta\bigl[I\bigl(\varphi(\Theta) = r\bigr)\bigr] = P_\Theta\bigl(\varphi(\Theta) = r\bigr).$$
Thus
$$\frac{1}{K} \sum_{k=1}^{K} I\bigl[h(\mathbf{x} \mid \Theta_k) = j\bigr] \;\to\; \sum_{r} P_\Theta\bigl(\varphi(\Theta) = r\bigr)\, I(\mathbf{x} \in S_r) = P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = j\bigr),$$
proving (3) and completing the proof.
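The key step (3) is just the law of large numbers applied to the vote fractions. A toy simulation (entirely illustrative; the two-value model for $\Theta$ is my own) shows the empirical vote fraction stabilizing at $P_\Theta(h(\mathbf{x} \mid \Theta) = j)$ as the number of trees grows:

```python
import numpy as np

# Toy model: for a fixed input x, suppose h(x | Theta) = j happens with
# probability 0.7 over the random choice of Theta.
rng = np.random.default_rng(0)
p_j = 0.7

for K in (10, 100, 10_000):
    hits = rng.random(K) < p_j            # I[h(x | Theta_k) = j] for k = 1..K
    vote_fraction = float(np.mean(hits))  # (1/K) * sum_k I[h(x | Theta_k) = j]
    print(K, round(vote_fraction, 3))     # approaches 0.7 as K grows
```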
6. Random forests as ensembles

Instead of a fixed ensemble $\{h_k(\mathbf{x})\}_{k=1}^{K}$ of classifiers, consider the RF model:

We have $h(\mathbf{x} \mid \Theta)$, where $\Theta$ specifies the classification tree classifier $h(\mathbf{x} \mid \Theta)$.

We have a fixed (known) probability distribution for $\Theta$, determining the variety of trees.
Definition 3: The margin function of an RF is
$$m(\mathbf{x}, y) = P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = y\bigr) - \max_{j \ne y} P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = j\bigr).$$
The strength of the forest (or of any family of classifiers) is
$$s = E_{\mathbf{x},y}\, m(\mathbf{x}, y).$$
The generalization error satisfies (by the Chebyshev inequality, assuming $s > 0$)
$$e = P_{\mathbf{x},y}\bigl(m(\mathbf{x}, y) < 0\bigr) \le P_{\mathbf{x},y}\bigl(|m(\mathbf{x}, y) - s| \ge s\bigr) \le \frac{V(m)}{s^2},$$
giving a (weak) bound, where $V(m)$ is the variance of $m(\mathbf{x}, y)$.
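A sketch (illustrative; it assumes the vote probabilities $P_\Theta(h(\mathbf{x} \mid \Theta) = \cdot)$ have already been estimated for a sample of $(\mathbf{x}, y)$ points) of estimating the strength $s$, the variance $V(m)$, and the resulting bound:

```python
import numpy as np

# Hypothetical vote probabilities for 4 sample points: probability of the true
# class y, and the largest probability among the other classes.
p_true       = np.array([0.80, 0.60, 0.55, 0.90])
p_best_other = np.array([0.20, 0.40, 0.45, 0.10])

m = p_true - p_best_other      # margin m(x, y) at each sample point
s = m.mean()                   # strength s = E_{x,y} m(x, y)
V = m.var()                    # V(m), the variance of the margin
print("strength s =", s)
print("bound V(m)/s^2 =", V / s**2 if s > 0 else "undefined (s <= 0)")
```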
Better bounds can be obtained in Breiman, L. (2001), "Random Forests", Machine Learning 45(1), 5-32.
7. Some sample applications (Breiman)

Some performance statistics on standard databases: forest with random input (random feature selection), using a single feature at a time (percentage error rates).

[Error-rate table from Breiman, 2001.]
Variable importance, determined by the decrease in accuracy when noise is added to a variable:

[Variable-importance results from Breiman, 2001.]