Classification and Regression Tree (CART) Analysis
Master Thesis
Anton V. Andriyashin, Roman V. Timofeev
February 11, 2005

PART I: CART as a classification method
– What is CART?
– CART advantages
– Construction of decision trees
– Optimal tree size and pruning methods
– CART applications

What is CART?
Different classification methods:
• Parametric methods
• Cluster analysis
• Non-parametric methods
• Discriminant analysis
CART:
• Does not require specification of a functional form
• Number of classes is known a priori

CART advantages
• CART is nonparametric
• CART does not require variables to be selected in advance
• Results are invariant with respect to monotone transformations of the independent variables
• CART can handle data sets with a complex structure
• CART is extremely robust to the effects of outliers
• CART can use any combination of categorical and continuous variables

[Figure: Runtime example I – linear structure]
[Figure: Runtime example II – non-linear structure]

How does it work?
• The Learning Sample is used to build the decision tree
• Parent nodes are always split into exactly two child nodes
• The process is repeated by treating each child node as a parent
• The decision tree is then used to classify new data via a set of questions

Set of questions for this node:
  Stochastic < 0.5378
  Book Value < −0.0153

Classification trees: splitting rules
Notation: $t_{Parent}, t_{Left}, t_{Right}$ – parent, left and right nodes; $i(t)$ – impurity function; $P_{Left}, P_{Right}$ – probabilities of the left and right nodes; $N_{Parent}, N_{Left}, N_{Right}$ – numbers of points in the parent, left and right nodes; $k$ – class index; $K$ – number of classes; $j$ – variable index. A split asks the question $x_j \le x_j^R$.

The impurity decrease is maximized over candidate splits:
$$\Delta i(t) = i(t_{Parent}) - E\bigl(i(t_{Child})\bigr) = i(t_{Parent}) - P_{Left}\, i(t_{Left}) - P_{Right}\, i(t_{Right}) \rightarrow \max$$

Different impurity functions (splitting rules):
• Gini rule:
$$\Delta i(t) = -\sum_{k=1}^{K} p^2(k \mid t_{Parent}) + P_{Left} \sum_{k=1}^{K} p^2(k \mid t_{Left}) + P_{Right} \sum_{k=1}^{K} p^2(k \mid t_{Right}) \rightarrow \max$$
• Twoing rule:
$$\Delta i(t) = \frac{P_{Left}\, P_{Right}}{4} \left[ \sum_{k=1}^{K} \bigl|\, p(k \mid t_{Left}) - p(k \mid t_{Right}) \,\bigr| \right]^2 \rightarrow \max$$
• Entropy rule
• Modifications of the basic rules

Regression trees: squared residuals minimization
• For each candidate split value $x_R$ (question $x \le x_R$), the sum of the within-node sums of squared deviations of the left and right child nodes is calculated
• The optimal split value is the one that minimizes this sum:
$$\mathrm{Var}(Y^{left}) + \mathrm{Var}(Y^{right}) \rightarrow \min,$$
where, with $Y_i$ the response values and $N_{left}, N_{right}$ the node sizes,
$$\mathrm{Var}(Y^{left}) = \sum_{i=1}^{N_{left}} \left( Y_i^{left} - \frac{1}{N_{left}} \sum_{i=1}^{N_{left}} Y_i^{left} \right)^2, \qquad \mathrm{Var}(Y^{right}) = \sum_{i=1}^{N_{right}} \left( Y_i^{right} - \frac{1}{N_{right}} \sum_{i=1}^{N_{right}} Y_i^{right} \right)^2.$$

Optimum tree selection
Same data – different decision trees. Which tree to take?
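The Gini split search described above can be sketched in a few lines of Python. This is an illustrative stand-alone sketch, not the XploRe/C++ implementation; the function names `gini` and `best_split` are hypothetical:

```python
from collections import Counter

def gini(labels):
    """Gini impurity i(t) = 1 - sum_k p(k|t)^2 of a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Exhaustive search for the threshold x_R maximizing the impurity
    decrease delta_i = i(parent) - P_L * i(left) - P_R * i(right)."""
    n = len(y)
    parent = gini(y)
    best = (None, -1.0)  # (threshold, delta_i)
    for x_r in sorted(set(x))[:-1]:  # candidate split values
        left = [yi for xi, yi in zip(x, y) if xi <= x_r]
        right = [yi for xi, yi in zip(x, y) if xi > x_r]
        p_l, p_r = len(left) / n, len(right) / n
        delta = parent - p_l * gini(left) - p_r * gini(right)
        if delta > best[1]:
            best = (x_r, delta)
    return best

# Perfectly separable toy sample: classes split cleanly at x <= 2.0,
# so the impurity decrease equals the full parent impurity of 0.5
x = [1.0, 2.0, 3.0, 4.0]
y = ["A", "A", "B", "B"]
print(best_split(x, y))  # → (2.0, 0.5)
```

The regression variant follows the same pattern, replacing the Gini impurity by the within-node sum of squared deviations and maximization by minimization.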
[Figure: a high-complexity tree vs. a low-complexity tree grown from the same data]

Optimum tree selection: where to stop splitting?
Two strategies:
• Impurity increment constraint: each node is split only while $\Delta i(t)$ is high enough (e.g. higher than 3%). Drawback: cannot handle high-complexity data well, since impurity is not monotone.
• Building the maximum tree and upward pruning: the maximum tree is built first, then insignificant nodes are cut from the bottom up. Drawback: time-consuming.

Pruning methods
• Cross-validation: uses the cost-complexity function
$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|, \qquad R(T) = \sum_{t \in \tilde{T}} R(t) = \sum_{t \in \tilde{T}} \Bigl[\,1 - \max_j p(j \mid t)\,\Bigr]\, p(t),$$
where $\tilde{T}$ is the set of terminal nodes of $T$.
  + the only available pruning method
  − very time-consuming
• Minimum node size constraint: uses an individual node property, $N_{min}$ – the minimum number of points in a node.
  + requires much less time than cross-validation
  − requires the calibration of one new input parameter
  − inefficient resulting trees

CART applications
• Originally used in medical disease classification
• Used by NASA for weather forecasting
• Now successfully used in the financial sector: credit scoring, insurance, portfolio management

CART portfolio management
• Fundamental and technical weekly data are available for DAX stocks
• A classification tree was constructed from the historical data
• New data were classified according to the classification tree
• Out-of-sample testing of the CART prediction was performed

CART portfolio management: results
Equally weighted portfolio of DAX stocks: CART performance
[Figure: portfolio value over time]

Portfolio statistics:
• Ending absolute profit: 49 625 EUR
• Ending relative yield: 25.55%
• Mean weekly absolute profit: 384.33 EUR
• Mean relative yield: 26.93%
• Median: 393.15 EUR
• Std. dev.: 1879.79 EUR
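The cost-complexity criterion above can be made concrete with a small sketch: for a given $\alpha$, a subtree is kept only if its penalized risk $R_\alpha$ is lower than that of the collapsed node. This is an illustrative sketch of the formula, not the cross-validation procedure itself; the toy class counts are invented:

```python
def node_risk(class_counts, n_total):
    """r(t) = [1 - max_j p(j|t)] * p(t), with p(t) = N_t / N."""
    n_t = sum(class_counts)
    p_t = n_t / n_total
    return (1.0 - max(class_counts) / n_t) * p_t

def cost_complexity(leaf_counts, n_total, alpha):
    """R_alpha(T) = sum of terminal-node risks + alpha * (number of leaves)."""
    r = sum(node_risk(c, n_total) for c in leaf_counts)
    return r + alpha * len(leaf_counts)

# Toy example with 100 points: the full subtree has two nearly pure
# leaves; the pruned alternative collapses them into one mixed leaf.
full = [[40, 5], [5, 50]]   # class counts in the two leaves, R(T) = 0.10
pruned = [[45, 55]]         # collapsed leaf, R(T) = 0.45
n = 100
for alpha in (0.0, 0.5):
    keep_split = cost_complexity(full, n, alpha) < cost_complexity(pruned, n, alpha)
    print(alpha, keep_split)  # → 0.0 True, then 0.5 False
```

As $\alpha$ grows, the penalty on the number of terminal nodes eventually outweighs the risk reduction of the split, which is exactly how upward pruning trades tree size against misclassification.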
• Skewness: 1.00
• Kurtosis: 5.68

PART II: CART in XploRe
– New CART section: what is new?
– Special features and properties
– XploRe CART commands and quantlets
– Boston Housing example
– Bankruptcy data: CART vs. SVM

New CART section: what is new?
• All calculations are performed inside a C++ DLL
• The tree structure was expanded and completely modified
• Classification trees are now available
• For classification trees, the Gini and Twoing splitting rules can be chosen
• Depiction of both classification and regression trees in XploRe
• Generation of CART trees to a TeX file from XploRe

Special features and properties
• Efficient memory allocation
  – a special automatic memory allocation class was constructed
  – multidimensional arrays are represented through a one-dimensional array
  – pointers are used instead of values
• The DLL can be run on Windows and UNIX machines
  – the C++ code does not contain ANY compiler-specific functions
  – all data is transferred to *double in order to work with XploRe
• Outstanding computational performance
  – all computations are done inside the DLL
  – XploRe only passes the data to the DLL and gets back the results

Computational performance (runtime in seconds):

Observations \ Variables      2      5     10     20     40     50
 1 000                        0      0      0      0      0      0
 5 000                        1      1      2      2      4      6
10 000                        1      3      4      6     13     14
30 000                       13     17     25     38     62     71
50 000                       33     45     60     85    142    170

• 2 500 000 records are processed in less than 3 minutes
• the maximum amount of RAM consumed is 25 MB
• testing was performed on a Pentium IV 2.66 GHz with 512 MB RAM

CART: technical section – DLL structure
XploRe:
1. Loads the data
2. Initializes the tree structure
CART DLL:
1. Transfers the data from *double to an internal class
2. Builds the maximum tree
3. Performs the tree pruning
4. Performs writing to the TeX file
5. Transfers the results back to *double
XploRe:
• Performs output in the XploRe window

• Transformation of data types is required
• Building the maximum tree, pruning and writing to the TeX file are all performed inside the DLL

XploRe CART commands and quantlets
Corresponding table of old and new commands:
• cartsplit, cartregr → cartsplitclass, cartsplitregr – generate classification and regression trees
• pred, predd → cartpredict – using the generated tree, performs the classification of new data
• dispcart2, dispcarttree, plotcart2, plotcarttree, grcart2, grcarttree → cartdisptree – different methods of displaying CART trees (both classification and regression) in XploRe
• (none) → cartdrawpdfclass, cartdrawpdfregr – generate a TeX file for CART trees; the TeX file can then be compiled to PS → PDF; useful for large trees that cannot be properly depicted in XploRe
• prune, prunetot, maketr, prunecv, pruneseq → (none) – different types of pruning; in the new CART section, pruning (cross-validation) is done inside the cartsplitclass and cartsplitregr commands, with the corresponding options assigned in the command string
• leafnum, ssr → cartleafnum, carttrimp – supplementary commands which return different characteristics of CART trees (such as the number of terminal nodes or the weighted sum of misclassification errors over terminal nodes)

XploRe CART commands and quantlets
Tree construction: cartsplitclass, cartsplitregr
  cartsplitclass (VarMatrix, ClassVector, SplitRule, MinSize)
  cartsplitregr (VarMatrix, ClassVector, MinSize)
Input:
  VarMatrix – learning sample, matrix [n x m]
  ClassVector – column of classes or response values, vector [n x 1]
  SplitRule – 0 for the Gini, 1 for the Twoing splitting rule (classification trees only)
  MinSize – minimum number of observations in a terminal node
Output:
  Tree – a complex tree structure: a list of vectors with split variables, values, observations, etc.
[Figure: constructed tree showing the number of observations in each node]

XploRe CART commands and quantlets
Display command: cartdisptree
  cartdisptree (tree, string)
Input:
  tree – tree structure, a list of vectors with split values, variables, classes, etc.
  string – optional parameter.
Used to display additional information about the tree nodes.
Output: plots the tree in the XploRe display.

Display command examples: cartdisptree
  cartdisptree(tr) – shows the split questions
  cartdisptree(tr, "Class") – shows the class of each node
  cartdisptree(tr, "Impurity") – shows the impurity of each node
  cartdisptree(tr, "NumberOfPoints") – shows the number of observations in each node

Display commands: cartdrawpdfclass, cartdrawpdfregr
• Can be used to draw large trees in PDF
• PDF vector graphics provide outstanding zooming without losing quality
• The path to the file is passed to the DLL through ASCII-converted symbols
• The user needs to have TeX installed to get the PDF tree: TeX → PS → PDF

  cartdrawpdfclass (VarMatrix, ClassVector, SplitRule, MinSize, FilePath)
  cartdrawpdfregr (VarMatrix, ClassVector, MinSize, FilePath)
Input parameters:
  VarMatrix – learning sample, matrix [n x m]
  ClassVector – column of classes or response values, vector [n x 1]
  SplitRule – 0 for the Gini, 1 for the Twoing splitting rule
  MinSize – minimum number of observations in a terminal node
  FilePath – full path and name of the TeX file
Output: a TeX file with the CART tree, generated under the specified path and name
[Figure: PDF tree output – split variables, intermediary and terminal nodes, number of observations of each class, class of each terminal node]

Data classification command: cartpredict
  cartpredict (tree, newdata)
• uses any constructed tree to classify newdata
• newdata must match the dimensions (number of columns) of the learning sample
• consumes virtually no time (important when building the maximum tree)
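What cartpredict does, conceptually, is route each new observation down the stored tree by answering the split questions. A minimal Python analogue can sketch this; the nested-dict node layout ("var", "value", "left", "right", "klass") is an assumption for illustration and is not the XploRe tree structure:

```python
# Illustrative sketch, not the XploRe implementation: a fitted tree stored
# as nested dicts of split questions, traversed the way cartpredict
# classifies new data.

def predict_one(node, row):
    """Route one observation down the tree by answering split questions."""
    while "klass" not in node:                 # until a terminal node
        if row[node["var"]] <= node["value"]:  # question: x_j <= x_R ?
            node = node["left"]
        else:
            node = node["right"]
    return node["klass"]

def predict(tree, newdata):
    return [predict_one(tree, row) for row in newdata]

# Toy tree in the style of the "Stochastic < 0.5378" question above;
# class labels are invented
tree = {
    "var": 0, "value": 0.5378,
    "left": {"klass": "buy"},
    "right": {"var": 1, "value": -0.0153,
              "left": {"klass": "sell"}, "right": {"klass": "hold"}},
}
print(predict(tree, [[0.3, 0.0], [0.9, -0.1], [0.9, 0.5]]))
# → ['buy', 'sell', 'hold']
```

Because each observation only follows one root-to-leaf path, prediction costs a handful of comparisons per row, which is why classifying new data "consumes virtually no time" compared to growing the tree.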
[Figure: predicted class for new data in the XploRe display]

Additional commands: cartleafnum, carttrimp
• used to calculate additional information about the constructed trees
• cartleafnum(tree) – returns the number of terminal nodes in the tree
• carttrimp(tree) – returns the impurity of the tree:
  a) classification: sum of misclassification errors over terminal nodes
  b) regression: sum of variance ratios over terminal nodes

Boston Housing example
• Root node: the most significant variable is the number of rooms
• 5 outliers: far away and expensive (X8 – distance to center)
[Figure: regression tree annotated with the ratio of variances, response value and number of observations per node]

CART vs. SVM: bankruptcy data (14 variables)
• 84 observations, 2 classes (1 for bankrupt, −1 for non-bankrupt)
• 14 variables: net income/total assets, total liabilities/total assets, etc.
• 83 observations as the learning sample, 1 observation as the testing sample
• the out-of-sample classification ratio is computed
• CART(10) – 61.90%, SVM – 61.90%, discriminant analysis ≈ 60.00%

CART vs. SVM
CART(37) classification tree – 66.67%
Variables: x2 – NI/TA, x7 – log(TA), x14 – Inv/COGS, x6 – TL/TA
[Figure: tree with unreliable nodes and "good", consistent nodes marked]

CART vs. SVM
CART(15) classification tree – 61.90%
[Figure: CART(15) classification tree]

CART vs. SVM
Classification ratio for different tree sizes (first 2 variables):
• maximum value – 74.69% correctly classified for the maximum tree
• minimum value – 68.68% correctly classified for CART(14)
• the optimum size was defined as 10% of the learning sample size – 10 observations
• classification ratio for CART(10) – 72.29%

CART vs. SVM
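The "83 observations for learning, 1 for testing" scheme above is leave-one-out cross-validation: every observation is held out once and the share of correct hold-out predictions is reported. The sketch below illustrates that scheme; the one-variable threshold "stump" classifier is a hypothetical stand-in for the full CART tree, and the toy data are invented:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def stump_fit(x, y):
    """Best single midpoint threshold on one variable by training error."""
    s = sorted(set(x))
    best = None
    for t in ((a + b) / 2 for a, b in zip(s, s[1:])):
        ml = majority([yi for xi, yi in zip(x, y) if xi <= t])
        mr = majority([yi for xi, yi in zip(x, y) if xi > t])
        err = sum(yi != (ml if xi <= t else mr) for xi, yi in zip(x, y))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda v: ml if v <= t else mr

def loo_ratio(x, y):
    """Leave-one-out classification ratio: train on n-1, test on 1."""
    hits = 0
    for i in range(len(y)):
        clf = stump_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        hits += clf(x[i]) == y[i]
    return hits / len(y)

# Cleanly separable toy data (classes split at x = 0)
x = [-0.3, -0.2, -0.1, 0.1, 0.2, 0.3]
y = [1, 1, 1, -1, -1, -1]
print(loo_ratio(x, y))  # → 1.0
```

In the thesis, the same ratio is reported for CART trees of various sizes, for the SVM and for discriminant analysis, which makes the 61.90% figures directly comparable.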
Classification ratio for different tree sizes (all 14 variables):
• maximum value – 66.67% correctly classified for CART(37)
• minimum value – 58.33% correctly classified for CART(27)
• the optimum size was defined as 10% of the learning sample size – 10 observations
• classification ratio for CART(10) – 61.90%, matching the SVM benchmark of 61.90%
[Figure: classification ratio vs. tree size, with the SVM benchmark of 61.90% marked]