Mathematics of Random Forests

1. Probability: Chebyshev inequality

Theorem 1 (Chebyshev inequality): If $X$ is a random variable with standard deviation $\sigma$ and mean $\mu$, then for any $\epsilon > 0$,
$$P(|X - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}.$$

Probability background

Theorem 2 (Bounded convergence theorem): Given a sequence $h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots$ of functions (with $|h_k(\mathbf{x})| \le M$ for a fixed $M > 0$) defined on a space $S$ of finite measure,
$$\lim_{k \to \infty} \int_S h_k(\mathbf{x})\, d\mathbf{x} = \int_S \lim_{k \to \infty} h_k(\mathbf{x})\, d\mathbf{x},$$
i.e., the limit and the integration can be interchanged (assuming the limits exist).

Recall Definition 1 (Indicator function): For any event $A \subseteq \Omega$ of the sample space, define the indicator function (also known as the characteristic function) of $A$ to be
$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \text{, i.e., if } A \text{ occurs,} \\ 0 & \text{otherwise.} \end{cases}$$

2. Classification trees

Assume we have $n$ patients (samples), with (e.g., mass spectroscopy) feature vectors $\{\mathbf{x}_i\}_{i=1}^{n}$ and outcomes $y_i$. The data set is $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, and each feature vector is $\mathbf{x}_k = (x_{k1}, \ldots, x_{kd})$.

Definition 2: A classification tree is a decision tree in which each node makes a binary decision based on whether $x_i < a$ or not, for a fixed threshold $a$ (which can depend on the node).

The top node contains all of the examples $(\mathbf{x}_k, y_k)$, and the set of examples is subdivided among the children of each node according to the decision at that node. The subdivision of examples continues until every node at the bottom contains examples from one class only. At each node, the feature $x_i$ and the threshold $a$ are chosen to minimize the resulting "diversity" in the children nodes. This diversity is often measured by the Gini criterion, described below.

Gini criterion

The subdivision continues until every bottom node contains only one class (disease or normal), which is assigned as the prediction for any input $\mathbf{x}$ reaching that node.

Define class $C_1$ = disease and $C_2$ = normal. How do we measure the variation of the samples in a node with respect to these two classes? Suppose there are two classes $C_1, C_2$ and we have the examples in set $S$ at our current node. To create child nodes, partition $S = S_1 \cup S_2$. (Note that each of $S_1, S_2$ is itself split between the two classes $C_1, C_2$.)

Recall that $|S|$ denotes the number of objects in the set $S$. Define
$$\hat P(S_j) = \frac{|S_j|}{|S|} = \text{proportion of } S \text{ falling in } S_j,$$
$$\hat P(C_i \mid S_j) = \frac{|S_j \cap C_i|}{|S_j|} = \text{proportion of } S_j \text{ which is in class } C_i.$$

Define the variation $g(S_j)$ in the set $S_j$ to be
$$g(S_j) = \sum_{i=1}^{2} \hat P(C_i \mid S_j)\bigl(1 - \hat P(C_i \mid S_j)\bigr).$$

Note: the variation $g(S_j)$ is largest when $S_j$ is divided equally among the classes $C_i$, and smallest when all of $S_j$ lies in a single class $C_i$.

We define the variation of the full subdivision $S = S_1 \cup S_2$ to be the Gini index
$$G = \hat P(S_1)\, g(S_1) + \hat P(S_2)\, g(S_2) = \text{weighted sum of the variations } g(S_1), g(S_2).$$
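To make the split-selection rule concrete, here is a minimal Python sketch (not from the lecture; the helper names gini_variation, gini_index, and best_split are illustrative) that scores a candidate split exactly as above and searches for the (feature, threshold) pair minimizing the Gini index.

```python
from collections import Counter

def gini_variation(labels):
    """g(S_j): sum over classes of p(1 - p), where p = Phat(C_i | S_j)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_index(left_labels, right_labels):
    """Gini index G = Phat(S_1) g(S_1) + Phat(S_2) g(S_2) of a split S = S_1 ∪ S_2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_variation(left_labels) + \
           (len(right_labels) / n) * gini_variation(right_labels)

def best_split(X, y):
    """Exhaustively search (feature i, threshold a) pairs over the rows of X
    and return the split minimizing G, as in the node-splitting rule above."""
    best_i, best_a, best_g = None, None, float("inf")
    for i in range(len(X[0])):
        for a in sorted({row[i] for row in X}):
            left = [y[k] for k, row in enumerate(X) if row[i] < a]
            right = [y[k] for k, row in enumerate(X) if row[i] >= a]
            if left and right:
                g = gini_index(left, right)
                if g < best_g:
                    best_i, best_a, best_g = i, a, g
    return best_i, best_a, best_g
```

With two classes, $g(S_j)$ ranges from $0$ (a pure node) to $1/2$ (an even split), consistent with the note above.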
3. Random vectors

A random vector $\mathbf{X} = (X_1, \ldots, X_d)$ is an array of random variables defined on the same probability space. Given $\mathbf{X}$ as above, define its distribution (or the joint distribution of $X_1, \ldots, X_d$) to be the measure $\mu$ on $\mathbb{R}^d$ defined by
$$\mu(A) \equiv P(\mathbf{X} \in A)$$
for any measurable set $A \subseteq \mathbb{R}^d$.

Example 2: Consider rolling two dice. Let $X_1$ be the number on the first die and $X_2$ the number on the second die. Then the probability space is $S = \{\text{all ordered pairs of die rolls}\}$, with $X_1$ = first roll and $X_2$ = second roll; i.e., if $\omega \in S$ is given by $\omega = (3, 4)$, then $X_1(\omega) = 3$ and $X_2(\omega) = 4$. The random vector $(X_1, X_2)$ satisfies $(X_1, X_2)(\omega) = (3, 4)$.

Example 3: Let $\mathbf{x} = (x_1, \ldots, x_d)$ be a microarray of a glioma cancer sample in its initial stages. Then each feature $x_i$ is a random variable with some distribution. For fixed $i$, let $X_i$ be a model random variable whose probability distribution is the same as that of the values $x_i$ in the microarray. Then the model random vector $\mathbf{X} = (X_1, \ldots, X_d)$ has a joint distribution which is the same as that of our microarray samples $\mathbf{x} = (x_1, x_2, \ldots, x_d)$.

For a microarray $\mathbf{x}$, let $y$ be the classification of the cancer ($y = 1$ means malignant; $y = -1$ means benign). Then we have another random vector $(x_1, \ldots, x_d, y) = (\mathbf{x}, y)$, with the same distribution as a model vector $(\mathbf{X}, Y)$. If the distribution of the random vector $(\mathbf{x}, y)$ is given by the model random vector $(\mathbf{X}, Y)$, we write $(\mathbf{x}, y) \sim (\mathbf{X}, Y)$.

4. Random forest: formal definition

Assume a training set of microarrays $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ drawn randomly from a (possibly unknown) probability distribution, $(\mathbf{x}_i, y_i) \sim (\mathbf{X}, Y)$.

Goal: build a classifier which predicts $y$ from $\mathbf{x}$ based on the data set of examples $D$.

Given: an ensemble of (possibly weak) classifiers $h = \{h_1(\mathbf{x}), \ldots, h_K(\mathbf{x})\}$. If each $h_k(\mathbf{x})$ is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier $h_k(\mathbf{x})$ to be $\Theta_k = (\theta_{k1}, \theta_{k2}, \ldots, \theta_{kp})$ (these parameters include the structure of the tree, which variable is split at which node, etc.).

We sometimes write $h_k(\mathbf{x}) = h(\mathbf{x} \mid \Theta_k)$; thus decision tree $k$ leads to the classifier $h_k(\mathbf{x}) = h(\mathbf{x} \mid \Theta_k)$. How do we choose which features appear in which nodes of the $k$-th tree? At random, according to the parameters $\Theta_k$, which are randomly chosen from a model random vector $\Theta$.

Definition 3: A random forest is a classifier based on a family of classifiers $h(\mathbf{x} \mid \Theta_1), \ldots, h(\mathbf{x} \mid \Theta_K)$, each a classification tree whose parameters $\Theta_k$ are randomly chosen from a model random vector $\Theta$.

For the final classification $f(\mathbf{x})$ (which combines the classifiers $\{h_k(\mathbf{x})\}$), each tree casts a vote for a class at input $\mathbf{x}$, and the class with the most votes wins. Specifically, given the data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, we train a family of classifiers $h_k(\mathbf{x})$; each classifier $h_k(\mathbf{x}) \equiv h(\mathbf{x} \mid \Theta_k)$ is in our case a predictor of $y = \pm 1$, the outcome associated with input $\mathbf{x}$.

Examples

Example 4: The parameter $\Theta$ of a tree determines a random subset $D_\Theta$ of the full data set $D$, i.e., we choose only a sub-collection of the feature vectors $\mathbf{x} = (x_1, \ldots, x_d)$ together with their outcomes. So $D_\Theta$ is a subset of $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ = the full data set. Thus the parameter $\Theta_k$ (for classification tree $k$) determines which subset of the full data set $D$ we use to build the tree $h_k(\mathbf{x}) = h(\mathbf{x} \mid \Theta_k)$. The ensemble of classifiers (now an RF) then consists of trees, each of which sees a different subset of the data.

Example 5: $\Theta$ determines a subset $\mathbf{x}_\Theta$ of the full set of features $\mathbf{x} = (x_1, \ldots, x_d)$. Then $h(\mathbf{x} \mid \Theta_k)$ is a classification tree using only the subset $\mathbf{x}_\Theta$ of the entries of the full feature vector $\mathbf{x}$ [dimension reduction]. In data mining (where the dimension $d$ is very high) this situation is common.

5. General ensemble methods: properties

Given a fixed ensemble $h = (h_1(\mathbf{x}), \ldots, h_K(\mathbf{x}))$ of classifiers with random data vector $(\mathbf{x}, y)$: if $A$ is any outcome for a classifier $h_k(\mathbf{x})$ in the ensemble, we define
$$\hat P(A) = \text{proportion of classifiers } h_k \ (1 \le k \le K) \text{ for which the event } A \text{ occurs} = \text{empirical probability of } A.$$
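As an illustration (a sketch, not the lecture's prescribed construction), here is a short Python example combining the two randomizations above: each $\Theta_k$ consists of a bootstrap subset of the data ($D_\Theta$, Example 4) and a random subset of the features ($\mathbf{x}_\Theta$, Example 5). It uses scikit-learn's DecisionTreeClassifier as the base tree, and it computes the empirical vote proportions $\hat P(h_k(\mathbf{x}) = j)$ just defined. The helper names (train_forest, vote_proportions, predict) and the $\sqrt{d}$ feature-subset size are illustrative conventions, not taken from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def train_forest(X, y, n_trees=100, n_sub_features=None):
    """X: (n, d) numpy array, y: length-n label array.
    Each Theta_k here is (bootstrap row indices, random feature subset):
    tree h(. | Theta_k) only ever sees D_Theta and x_Theta."""
    n, d = X.shape
    if n_sub_features is None:
        n_sub_features = max(1, int(np.sqrt(d)))  # common convention, not from the lecture
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                          # bootstrap subset D_Theta
        cols = rng.choice(d, size=n_sub_features, replace=False)   # feature subset x_Theta
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def vote_proportions(forest, x):
    """Empirical probability Phat(h_k(x) = j): fraction of trees voting for class j."""
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    classes, counts = np.unique(votes, return_counts=True)
    return dict(zip(classes, counts / len(forest)))

def predict(forest, x):
    """Final classifier f(x): the class with the most votes wins."""
    props = vote_proportions(forest, x)
    return max(props, key=props.get)
```

Here vote_proportions returns exactly the empirical probabilities used in the margin definition that follows.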
Ensemble methods: definitions

Define the empirical margin function by
$$\hat m(\mathbf{x}, y) \equiv \hat P_k(h_k(\mathbf{x}) = y) - \max_{j \ne y} \hat P_k(h_k(\mathbf{x}) = j) \qquad (1)$$
= the average margin of the ensemble of classifiers = the extent to which the average number of votes for the correct class exceeds the average number of votes for the next-best class $\approx$ the confidence in the classifier.

Definition 4: The generalization error of the classifier ensemble $h$ is
$$e = P_{\mathbf{x},y}(\hat m(\mathbf{x}, y) < 0)$$
[the subscript $\mathbf{x}, y$ indicates that the probability is measured in $(\mathbf{x}, y)$ space, i.e., $(\mathbf{x}, y)$ is viewed as the random variable].

Theorem 3: As $K \to \infty$ (i.e., as the number of trees increases),
$$e \to P_{\mathbf{x},y}\Bigl[\, P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = y\bigr) - \max_{j \ne y} P_\Theta\bigl(h(\mathbf{x} \mid \Theta) = j\bigr) < 0 \,\Bigr] \qquad (2)$$
(here $P_{\mathbf{x},y}$ denotes probability as $(\mathbf{x}, y)$ varies, and similarly $P_\Theta$ denotes probability as $\Theta$ varies).

Proof: Note that
$$\hat P_k(h_k(\mathbf{x}) = j) = E_k[I(h_k(\mathbf{x}) = j)] \equiv \frac{1}{K} \sum_{k=1}^{K} I[h_k(\mathbf{x}) = j],$$
where $E_k$ denotes the average over $k$ and $h_k(\mathbf{x}) \equiv h(\mathbf{x} \mid \Theta_k)$. Recall $I(A) = 1$ if $A$ occurs and $0$ otherwise.

Claim: for our random sequence $\Theta_1, \Theta_2, \ldots$ and all data vectors $\mathbf{x}$, it suffices to show
$$\frac{1}{K} \sum_{k=1}^{K} I[h(\mathbf{x} \mid \Theta_k) = j] \to P_\Theta(h(\mathbf{x} \mid \Theta) = j). \qquad (3)$$

Why? Let $g_K(\mathbf{x}, y) \equiv$ the right-hand side of (1) and $g(\mathbf{x}, y) \equiv$ the quantity inside the brackets of (2). Then for each $(\mathbf{x}, y)$, if (3) holds we would have
$$g_K(\mathbf{x}, y) \to g(\mathbf{x}, y). \qquad (4)$$
Note also that
$$P_{\mathbf{x},y}(g_K(\mathbf{x}, y) < 0) = E_{\mathbf{x},y}[I(g_K(\mathbf{x}, y) < 0)],$$
so by the bounded convergence theorem and (4),
$$P_{\mathbf{x},y}(g_K(\mathbf{x}, y) < 0) \to P_{\mathbf{x},y}(g(\mathbf{x}, y) < 0) \quad \text{as } K \to \infty,$$
which proves the theorem. Thus we must only prove (3).

To prove (3): for a fixed training set and a tree with parameter $\Theta$, the set of $\mathbf{x}$ with $h(\mathbf{x} \mid \Theta) = j$ is a union of boxes
$$B \equiv I_1 \times \cdots \times I_d = \{\mathbf{x} \mid x_i \in I_i\}$$
for a fixed collection of intervals $\{I_i\}_{i=1}^{d}$. Assume there are only finitely many models $\Theta$ for $h(\mathbf{x} \mid \Theta)$ (e.g., a finite number of sample subsets and a finite number of feature-space subsets). Then there exist only finitely many such unions of boxes, say $L$ of them; call them $\{S_l\}_{l=1}^{L}$. Define
$$\varphi(\Theta) = l \quad \text{if} \quad \{\mathbf{x} : h(\mathbf{x} \mid \Theta) = j\} = S_l,$$
and let $N_l$ = the number of indices $k \le K$ with $\varphi(\Theta_k) = l$. Then
$$\frac{1}{K} \sum_{k=1}^{K} I(h(\mathbf{x} \mid \Theta_k) = j) = \frac{1}{K} \sum_{l} N_l\, I(\mathbf{x} \in S_l).$$
By the law of large numbers,
$$\frac{N_l}{K} = \frac{1}{K} \sum_{k=1}^{K} I(\varphi(\Theta_k) = l)$$
converges with probability 1 to $E_\Theta[I(\varphi(\Theta) = l)] = P_\Theta(\varphi(\Theta) = l)$. Thus
$$\frac{1}{K} \sum_{k=1}^{K} I[h(\mathbf{x} \mid \Theta_k) = j] \to \sum_{l} P_\Theta(\varphi(\Theta) = l)\, I(\mathbf{x} \in S_l) = P_\Theta(h(\mathbf{x} \mid \Theta) = j),$$
proving (3) and completing the proof.

6. Random forests as ensembles

Instead of a fixed ensemble $\{h_k(\mathbf{x})\}_{k=1}^{K}$ of classifiers, consider the RF model: we have $h(\mathbf{x} \mid \Theta)$, where $\Theta$ specifies the classification-tree classifier $h(\mathbf{x} \mid \Theta)$, and we have a fixed (known) probability distribution for $\Theta$ determining the variety of trees.

Definition 5: The margin function of an RF is
$$m(\mathbf{x}, y) = P_\Theta(h(\mathbf{x} \mid \Theta) = y) - \max_{j \ne y} P_\Theta(h(\mathbf{x} \mid \Theta) = j).$$
The strength of the forest (or of any family of classifiers) is $s = E_{\mathbf{x},y}\, m(\mathbf{x}, y)$.

The generalization error then satisfies (by the Chebyshev inequality, assuming the strength $s > 0$)
$$e = P_{\mathbf{x},y}(m(\mathbf{x}, y) < 0) \le P_{\mathbf{x},y}(|m(\mathbf{x}, y) - s| \ge s) \le \frac{\operatorname{Var}(m)}{s^2},$$
giving a (weak) bound. Better bounds are obtained in Breiman, L. (2001), "Random Forests," Machine Learning 45, 5-32.

7. Some sample applications (Breiman)

Some performance statistics on standard databases: percentage error rates for forests with random input (random feature selection), using a single random feature at a time. [Table from Breiman (2001); not reproduced here.]

Variable importance: determined by the decrease in accuracy when noise is added to a variable. [Table from Breiman (2001); not reproduced here.]
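To make the margin, strength, and Chebyshev-type bound of Sections 5 and 6 concrete, here is a short Python sketch (the function names are hypothetical, not from the lecture). It takes the class votes of the $K$ trees on a held-out sample, which stands in for the $(\mathbf{x}, y)$ distribution; such votes could come, for example, from the forest sketch given earlier.

```python
import numpy as np

def margin_from_votes(votes, y):
    """m(x, y) estimated from one input's K tree votes: the fraction of trees
    voting for the true class y minus the largest fraction voting for any j != y."""
    votes = np.asarray(votes)
    p_correct = np.mean(votes == y)
    others = [np.mean(votes == j) for j in np.unique(votes) if j != y]
    return p_correct - (max(others) if others else 0.0)

def strength_and_chebyshev_bound(all_votes, y_true):
    """all_votes: (n_samples, K) array of per-tree class votes; y_true: length-n labels.
    Returns (s, Var(m), empirical error P(m < 0), Chebyshev bound Var(m)/s^2)."""
    m = np.array([margin_from_votes(v, y) for v, y in zip(all_votes, y_true)])
    s = m.mean()                                     # strength s = E[m(x, y)]
    var_m = m.var()
    error = float((m < 0).mean())                    # estimate of P_{x,y}(m(x, y) < 0)
    bound = var_m / s**2 if s > 0 else float("inf")  # weak bound, meaningful only if s > 0
    return s, var_m, error, bound
```

Here error estimates $P_{\mathbf{x},y}(m(\mathbf{x}, y) < 0)$ directly, while bound is the weaker $\operatorname{Var}(m)/s^2$ guarantee from Section 6.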