Classification and Regression Tree Analysis

CART
Master Thesis
Anton V. Andriyashin
Roman V. Timofeev
February 11, 2005
Classification and Regression Trees
PART I: CART as a classification method
– What is CART?
– CART advantages
– Construction of decision trees
– Optimal tree size and pruning methods
– CART applications
What is CART?

Different classification methods:
• parametric methods
• non-parametric methods
• cluster analysis
• discriminant analysis

CART:
• does not require specification of a functional form
• the number of classes is known a priori
CART advantages
• CART is nonparametric
• CART does not require variables to be selected in advance
• Results are invariant with respect to monotone transformations of the
independent variables
• CART can handle data sets with a complex structure
• CART is extremely robust to the effects of outliers
• CART can use any combination of categorical and continuous variables
Runtime example I: linear structure
[Figure: CART fit on data with linear structure]

Runtime example II: non-linear structure
[Figure: CART fit on data with non-linear structure]
How does it work?
The Learning Sample is used to build decision trees:
• parent nodes are always split into exactly two child nodes
• the process can be repeated by treating each child node as a parent
• the decision tree is used to classify real data (using a set of questions)

Set of questions for this node:
  Stochastic < 0.5378
  Book Value < −0.0153
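As a toy illustration, answering such a set of questions in code looks like this (a Python sketch; only the thresholds come from the slide, while the tree layout and class labels are hypothetical):

```python
# Minimal sketch of classifying one observation by answering a sequence
# of yes/no questions, as a CART tree does. The thresholds mirror the
# slide's example node; the branch layout and labels are illustrative.
def classify(obs):
    # Question 1: Stochastic < 0.5378?
    if obs["stochastic"] < 0.5378:
        # Question 2: Book Value < -0.0153?
        if obs["book_value"] < -0.0153:
            return "class A"   # terminal node (illustrative label)
        return "class B"
    return "class C"

print(classify({"stochastic": 0.4, "book_value": 0.01}))  # class B
```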
Classification trees: Splitting Rules

Notation:
t_Parent, t_Left, t_Right – parent node, left node, right node
i(t) – impurity function
P_Left, P_Right – probabilities of the left and right nodes
N_Parent – number of points in the parent node
N_Left, N_Right – number of points in the left and right nodes
k – class index, K – number of classes, j – variable index
Split condition: x_j ≤ x_j^R

A split is chosen to maximize the impurity decrease:

∆i(t) = i_Parent − E(i_Child) = i_Parent − P_Left i(t_Left) − P_Right i(t_Right) → max

Different impurity functions (splitting rules):

• Gini rule
  ∆i(t) = −∑_{k=1}^{K} p²(k|t_Parent) + P_Left ∑_{k=1}^{K} p²(k|t_Left) + P_Right ∑_{k=1}^{K} p²(k|t_Right) → max
• Twoing rule
  ∆i(t) = (P_Left P_Right / 4) [ ∑_{k=1}^{K} | p(k|t_Left) − p(k|t_Right) | ]² → max
• Entropy rule
• Modifications of the basic rules
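The Gini rule above can be sketched in Python (a minimal illustration, not the XploRe/C++ implementation; the function names are ours):

```python
from collections import Counter

def gini_term(labels):
    """Sum of squared class proportions p^2(k|t) within one node."""
    n = len(labels)
    return sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity decrease for a candidate split under the Gini rule:
    -sum p^2(k|parent) + P_L * sum p^2(k|left) + P_R * sum p^2(k|right)."""
    n = len(parent)
    p_left, p_right = len(left) / n, len(right) / n
    return (-gini_term(parent)
            + p_left * gini_term(left)
            + p_right * gini_term(right))

# A perfect split separates the two classes completely:
parent = [0, 0, 1, 1]
print(gini_gain(parent, [0, 0], [1, 1]))  # 0.5
```

A useless split (each child keeps the parent's class mix) yields a gain of zero, which is why maximizing this quantity picks informative questions.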
Regression trees: Squared Residuals Minimization

Notation:
Y_i – response variables
N_t – node size
x^R – splitting value; split condition: x ≤ x^R

The split minimizes the total within-node variation:

Var(Y_left) + Var(Y_right) → min

Var(Y_left) = ∑_{i=1}^{N_left} ( Y_i^left − (1/N_left) ∑_{i=1}^{N_left} Y_i^left )²

Var(Y_right) = ∑_{i=1}^{N_right} ( Y_i^right − (1/N_right) ∑_{i=1}^{N_right} Y_i^right )²

• For each split value, the sum of variances of the left and right nodes is calculated
• The optimal split value is the one that minimizes this sum
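The squared-residuals split search can be sketched as follows (a minimal Python illustration, not the DLL implementation; `sse` plays the role of the slide's Var(Y), and the names are ours):

```python
def sse(ys):
    """Sum of squared deviations from the node mean (the slide's Var(Y))."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Scan candidate split values x_R on one variable and return the one
    minimizing Var(Y_left) + Var(Y_right), with split condition x <= x_R."""
    pairs = sorted(zip(xs, ys))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        x_r = pairs[i - 1][0]
        left = [y for x, y in pairs[:i]]
        right = [y for x, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[0]:
            best = (total, x_r)
    return best[1]

# A step function: the best split value falls exactly at the jump
print(best_split([1, 2, 3, 4], [5.0, 5.0, 9.0, 9.0]))  # 2
```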
Optimum Tree Selection
Same data – different decision trees. Which tree to take?
[Figure: a high-complexity tree and a low-complexity tree built from the same data]
Optimum Tree Selection
Where to stop splitting? Two approaches:

• Impurity increment constraint: each node is split while ∆i(t) is high enough (e.g. higher than 3%).
  − cannot handle high-complexity data well, since impurity is not monotone
• Building the maximum tree and upward pruning: build the maximum tree and then cut insignificant nodes from the bottom up.
  − time-consuming
Pruning Methods

• Cross-Validation: uses the cost-complexity function

  R_α(T) = R(T) + α|T̃|, where T̃ is the set of terminal nodes of T and

  R(T) = ∑_{t∈T̃} R(t) = ∑_{t∈T̃} [1 − max_j p(j|t)] p(t)

  + the only available pruning method
  − very time-consuming

• Min node size constraint: uses an individual node property, N_min – the minimum number of points in a node

  + requires much less time than cross-validation
  − requires the calibration of one new input parameter
  − resulting trees are inefficient
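The cost-complexity function R_α(T) is straightforward to evaluate once the terminal nodes are known; here is an illustrative Python sketch (leaves are given as (class-probabilities, node-probability) pairs, and the function names are ours):

```python
def node_cost(class_probs, p_node):
    """Misclassification cost of one leaf: [1 - max_j p(j|t)] * p(t)."""
    return (1.0 - max(class_probs)) * p_node

def cost_complexity(leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T~|, where R(T) sums node_cost over
    the terminal nodes T~ and |T~| is the number of terminal nodes."""
    r_t = sum(node_cost(probs, p) for probs, p in leaves)
    return r_t + alpha * len(leaves)

# Two pure leaves cost nothing beyond the complexity penalty:
leaves = [([1.0, 0.0], 0.5), ([0.0, 1.0], 0.5)]
print(cost_complexity(leaves, alpha=0.1))  # 0.2
```

Increasing α penalizes large trees, so minimizing R_α over pruned subtrees trades accuracy on the Learning Sample against tree size.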
CART applications
• originally used in medical disease classification
• used by NASA for weather forecasting
• now successfully applied in the financial sector:
  - credit scoring systems
  - insurance
  - portfolio management
CART portfolio management
• fundamental and technical weekly data are available for DAX stocks
• a classification tree was constructed from the historical data
• real data were classified in accordance with the classification tree
• out-of-sample testing of the CART prediction was performed
CART portfolio management: results
Equally weighted portfolio of DAX stocks: CART performance
[Figure: CART portfolio performance]
CART portfolio management: results
Equally weighted portfolio of DAX stocks: CART performance
Portfolio Statistics        VALUE
Ending Absolute Profit      49 625 EUR
Ending Relative Yield       25.55%
Mean Weekly Absolute        384.33 EUR
Mean Relative               26.93%
Median                      393.15 EUR
Std. Dev.                   1879.79 EUR
Skewness                    1.00
Kurtosis                    5.68
PART II: CART in XploRe
– New CART Section: what is new?
– Special features and properties
– XploRe CART commands and quantlets
– Boston Housing example
– Bankruptcy data: CART vs. SVM
New CART Section: What is new?
• All calculations are performed inside C++ DLL
• Tree structure was expanded and completely modified
• Classification trees are now available
• For classification trees: Gini and Twoing splitting rules can be chosen
• Depiction of both classification and regression trees in XploRe
• Generation of CART trees to a TeX file from XploRe
Special features and properties
Program special features
• Efficient memory allocation
  -- a special automatic memory allocation class was constructed
  -- multidimensional arrays are represented through a one-dimensional array
  -- pointers are used instead of values
• DLL can be run on Windows and UNIX machines
  -- the C++ code does not contain ANY compiler-specific functions
  -- all data is transferred as *double in order to work with XploRe
• Outstanding computational performance
  -- all computations are done inside the DLL
  -- XploRe only passes the data to the DLL and gets back the results
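The one-dimensional-array representation mentioned above relies on row-major index flattening, as C++ typically does; here is a sketch in Python (the DLL's actual class is not shown, and the helper name is illustrative):

```python
def flat_index(i, j, n_cols):
    """Row-major position of element (i, j) of an n_rows x n_cols
    matrix stored as one contiguous one-dimensional array."""
    return i * n_cols + j

# A 2 x 3 matrix stored flat, row by row:
flat = [11, 12, 13, 21, 22, 23]
print(flat[flat_index(1, 2, n_cols=3)])  # 23
```

Storing the matrix in one contiguous block avoids per-row allocations and keeps memory access predictable, which is one reason for the performance figures below.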
Special features and properties
Computational Performance

Observations \ Variables      2        5       10       20       40       50
 1000                    0 sec.   0 sec.   0 sec.   0 sec.   0 sec.   0 sec.
 5000                    1 sec.   1 sec.   2 sec.   2 sec.   4 sec.   6 sec.
10000                    1 sec.   3 sec.   4 sec.   6 sec.  13 sec.  14 sec.
30000                   13 sec.  17 sec.  25 sec.  38 sec.  62 sec.  71 sec.
50000                   33 sec.  45 sec.  60 sec.  85 sec. 142 sec. 170 sec.

• 2 500 000 records (50 000 observations x 50 variables) are processed in less than 3 minutes
• the maximum amount of RAM consumed is 25 MB
• testing was performed on a Pentium IV 2.66 GHz with 512 MB RAM
CART: Technical Section
DLL Structure

XploRe:
1. Loads the data
2. Initializes the tree structure

CART DLL:
1. Transfers the data from *double into an internal class
2. Builds the maximum tree
3. Performs the tree pruning
4. Writes the tree to a TeX file
5. Transfers the results back to *double

XploRe:
Performs output in the XploRe window

• Transformation of data types is required
• Building the maximum tree, pruning, and writing the TeX file are performed inside the DLL
XploRe CART commands and quantlets
Corresponding table of new and old commands

Old commands: cartsplit, cartregr
New commands: cartsplitclass, cartsplitregr
Description: generate classification and regression trees

Old commands: pred, predd
New command: cartpredict
Description: using the generated tree, performs the classification of new data

Old commands: dispcart2, dispcarttree, plotcart2, plotcarttree, grcart2, grcarttree
New command: cartdisptree
Description: different methods of displaying CART trees (both classification and regression) in XploRe

Old commands: NONE
New commands: cartdrawpdfclass, cartdrawpdfregr
Description: generate a TEX file for CART trees; the TEX file can then be compiled to PS -> PDF. Useful for large trees, which cannot be properly depicted in XploRe

Old commands: prune, prunetot, maketr, prunecv, pruneseq
New commands: NONE
Description: different types of pruning. In the new CART section, pruning (cross-validation) is done inside the cartsplitclass and cartsplitregr commands; the corresponding options are assigned in the command string

Old commands: leafnum, ssr
New commands: cartleafnum, carttrimp
Description: supplementary commands which return different characteristics of CART trees (such as the number of terminal nodes or the weighted sum of misclassification errors for terminal nodes)
XploRe CART commands and quantlets
Tree construction: cartsplitclass, cartsplitregr
cartsplitclass (VarMatrix, ClassVector, SplitRule, MinSize)
cartsplitregr (VarMatrix, ClassVector, MinSize)
Input:
VarMatrix   - Learning Sample, matrix [n x m]
ClassVector - column of classes or response values, vector [n x 1]
SplitRule   - 0 for the Gini rule, 1 for the Twoing splitting rule (only for classification trees)
MinSize     - minimum number of observations in a terminal node

Output:
Tree - a complex tree structure: a list of vectors with split variables, split values, observations, etc.
XploRe CART commands and quantlets
Tree construction: cartsplitclass, cartsplitregr
[Figure: example tree displayed in XploRe; node annotations show the number of observations in each node]
XploRe CART commands and quantlets
Display command: cartdisptree
cartdisptree (tree, string)
Input:
tree   - tree structure, a list of vectors with split values, variables, classes, etc.
string - optional parameter, used to display additional information about the tree nodes

Output:
Plots the tree in the XploRe display
XploRe CART commands and quantlets
Display command: cartdisptree
cartdisptree(tr) – shows the split questions
cartdisptree(tr, "Class") – shows the class of each node
cartdisptree(tr, "Impurity") – shows the impurity of each node
cartdisptree(tr, "NumberOfPoints") – shows the number of observations in each node
XploRe CART commands and quantlets
Display commands: cartdrawpdfclass, cartdrawpdfregr
• Can be used to draw large trees in PDF
• PDF vector graphics provide outstanding zooming without losing quality
• The path to the file is passed to the DLL through ASCII-converted symbols
• The user needs to have TEX installed to get the PDF tree: TEX -> PS -> PDF
XploRe CART commands and quantlets
Display commands: cartdrawpdfclass, cartdrawpdfregr
cartdrawpdfclass (VarMatrix, ClassVector, SplitRule, MinSize, FilePath)
cartdrawpdfregr (VarMatrix, ClassVector, MinSize, FilePath)
Input parameters:
VarMatrix   - Learning Sample, matrix [n x m]
ClassVector - column of classes or response values, vector [n x 1]
SplitRule   - 0 for the Gini rule, 1 for the Twoing splitting rule
MinSize     - minimum number of observations in a terminal node
FilePath    - full path and name of the TEX file

Output:
Generated TEX file with the CART tree under the specified path and name
XploRe CART commands and quantlets
Display commands: cartdrawpdfclass, cartdrawpdfregr
[Figure: example PDF tree; annotations show the split variable, intermediary and terminal nodes, the number of observations of each class, and the class of each terminal node]
XploRe CART commands and quantlets
Data classification command: cartpredict
cartpredict (tree, newdata)
• uses any constructed tree to classify newdata
• newdata must match the dimensions of the Learning Sample (same number of columns)
• consumes virtually no time (important when the maximum tree is used)
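What cartpredict does can be mimicked in Python (the dict-based node structure below is hypothetical and far simpler than the actual XploRe tree structure):

```python
# Sketch of a cartpredict-style routine: walk a binary tree of
# (variable index, split value) questions down to a terminal node.
# The dict representation is hypothetical, not XploRe's.
def predict(node, row):
    while "cls" not in node:                  # descend until a terminal node
        if row[node["var"]] <= node["split"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["cls"]

tree = {"var": 0, "split": 0.5,
        "left": {"cls": "A"},
        "right": {"var": 1, "split": 2.0,
                  "left": {"cls": "B"}, "right": {"cls": "C"}}}
print(predict(tree, [0.7, 3.0]))  # C
```

Prediction is just a walk of depth at most the tree height, which is why it consumes virtually no time even for large trees.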
XploRe CART commands and quantlets
Data classification command: cartpredict
[Figure: XploRe output showing the predicted class for the new data]
XploRe CART commands and quantlets
Additional commands: cartleafnum, carttrimp
• used to calculate additional information about the constructed trees
• cartleafnum(tree) – returns the number of terminal nodes in the tree
• carttrimp(tree) – returns the impurity of the tree:
  a) classification: sum of misclassification errors for terminal nodes
  b) regression: sum of variance ratios for terminal nodes
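cartleafnum's behavior can be mimicked with a simple recursion (a Python sketch over a hypothetical dict-based tree, not XploRe's vector-based structure):

```python
# Count terminal nodes: a node with no "left" child is a leaf;
# internal nodes always have both children (binary splits).
def leaf_count(node):
    if "left" not in node:
        return 1
    return leaf_count(node["left"]) + leaf_count(node["right"])

# One root split, whose left child is split again: 3 terminal nodes
tree = {"left": {"left": {}, "right": {}}, "right": {}}
print(leaf_count(tree))  # 3
```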
Boston Housing example
[Figure: Boston Housing regression tree. The root node splits on the most significant variable – the number of rooms; five outliers (far away yet expensive) are separated via X8, the distance to the center. Node annotations show the ratio of variances, the response value, and the number of observations.]
CART vs. SVM
Bankruptcy data (14 variables)
• 84 observations, 2 classes (1 for bankrupt, −1 for non-bankrupt)
• 14 variables: net income/total assets, total liabilities/total assets, etc.
• 83 observations form the learning sample, 1 observation the testing sample
• the out-of-sample classification ratio is computed
• CART(10) – 61.90%, SVM – 61.90%, Discriminant Analysis ≈ 60.00%
CART vs. SVM
CART(37) Classification Tree – 66.67%
[Figure: classification tree using variables x2 – NI/TA, x7 – log(TA), x14 – Inv/COGS, x6 – TL/TA; unreliable nodes and "good", consistent nodes are marked]
CART vs. SVM
CART(15) Classification Tree – 61.90%
CART vs. SVM
Classification ratio for different tree sizes (first 2 variables)
[Figure: classification ratio vs. tree size; labeled values: 74.69%, 73.49%, 72.29%, 68.68%]
• Max value: 74.69% correctly classified, achieved by the maximum tree
• Min value: 68.68% correctly classified, for CART(14)
• The optimum size was defined as 10% of the Learning Sample size – 10 observations
• Classification ratio for CART(10): 72.29%
CART vs. SVM
Classification ratio for different tree sizes (all 14 variables)
[Figure: classification ratio vs. tree size; labeled values: 66.67%, 61.90% (SVM level), 59.52%, 58.33%]
• Max value: 66.67% correctly classified, for CART(37)
• Min value: 58.33% correctly classified, for CART(27)
• The optimum size was defined as 10% of the Learning Sample size – 10 observations
• Classification ratio for CART(10): 61.90% – the same as SVM