CS 540 – Introduction to AI, Fall 2016 (© Jude Shavlik), Lecture 7, Week 4

Today’s Topics
• Ensembles
• Decision Forests (actually, Random Forests)
• Bagging and Boosting
• Decision Stumps
• Feature Selection
• ID3 as Searching a Space of Possible Soln’s
• ID3 Wrapup
Ensembles
(Bagging, Boosting, and all that)
Old View
– Learn one good model (naïve Bayes, k-NN, neural net, d-tree, SVM, etc)
New View
– Learn a good set of models
Probably the best example of the interplay between
‘theory & practice’ in machine learning
Ensembles of Neural Networks
(or any supervised learner)
[Diagram: the INPUT feeds several Networks (or other learners) in parallel; their predictions feed a Combiner, which produces the OUTPUT]
• Ensembles often produce accuracy gains of
5-10 percentage points!
• Can combine “classifiers” of various types
– Eg, decision trees, rule sets, neural networks, etc.
Combining Multiple Models
Three ideas for combining predictions
1. Simple (unweighted) votes
   • Standard choice
2. Weighted votes
   • eg, weight by tuning-set accuracy
3. Learn a combining function
   • Prone to overfitting?
   • ‘Stacked generalization’ (Wolpert)
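As an illustration only (not code from the lecture), here is a minimal Python sketch of ideas 1 and 2, assuming a hypothetical list of trained models, each with a predict(x) method returning 0 or 1, and per-model weights such as tuning-set accuracies:

def unweighted_vote(models, x):
    # Idea 1: simple majority vote over the ensemble's 0/1 predictions
    votes = sum(m.predict(x) for m in models)
    return 1 if votes > len(models) / 2 else 0

def weighted_vote(models, weights, x):
    # Idea 2: each model's vote counts in proportion to its weight
    # (eg, its tuning-set accuracy)
    score = sum(w for m, w in zip(models, weights) if m.predict(x) == 1)
    return 1 if score > sum(weights) / 2 else 0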
Random Forests
(Breiman, Machine Learning 2001; related to Ho, 1995)
A variant of something called BAGGING (‘multi-sets’)
Algorithm
Let N = # of examples
F = # of features
i = some number << F
Repeat k times
(1) Draw with replacement N examples, put in train set
(2) Build d-tree, but in each recursive call
– Choose (w/o replacement) i features
– Choose best of these i as the root
of this (sub)tree
(3) Do NOT prune
In HW2, we give you 101 ‘bootstrapped’ samples of the
WINE Dataset
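A minimal sketch of the training loop above, assuming a user-supplied build_tree(examples, features, subset_size) routine that, at each recursive call, draws subset_size features without replacement, splits on the best of them, and never prunes (all names are placeholders, not the HW2 code):

import random

def train_random_forest(examples, features, k, i, build_tree):
    n = len(examples)           # N = # of examples
    forest = []
    for _ in range(k):          # repeat k times
        # (1) draw N examples WITH replacement (a bootstrap sample)
        bootstrap = [random.choice(examples) for _ in range(n)]
        # (2)+(3) build an unpruned tree; build_tree is assumed to choose
        # i random features at each recursive call and split on the best one
        forest.append(build_tree(bootstrap, features, subset_size=i))
    return forest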
Drawing with Replacement
vs Drawing w/o Replacement
<show on board>
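For intuition (my own example, not from the slide), a tiny Python illustration of the difference:

import random

pool = list(range(10))

# With replacement: duplicates are likely and some items are never drawn;
# on average a bootstrap sample omits about 1/e ≈ 37% of the pool.
with_replacement = [random.choice(pool) for _ in range(10)]

# Without replacement: every item appears exactly once (a shuffled copy).
without_replacement = random.sample(pool, 10)

print(with_replacement)
print(without_replacement)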
Using Random Forests
After training we have K decision trees
How to use on TEST examples?
Some variant of
If at least L of these K trees say ‘true’ then output ‘true’
How to choose L ?
Use a tune set to decide!
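A minimal sketch of choosing L with the tuning set, assuming each tree has a predict(x) method returning True/False and tune_set is a list of (x, label) pairs (hypothetical names, not the HW2 code):

def choose_vote_threshold(forest, tune_set):
    best_L, best_acc = 1, -1.0
    for L in range(1, len(forest) + 1):          # try every possible threshold
        correct = 0
        for x, label in tune_set:
            votes = sum(1 for tree in forest if tree.predict(x))
            correct += ((votes >= L) == label)   # count correct predictions
        acc = correct / len(tune_set)
        if acc > best_acc:                       # keep the best-scoring L
            best_L, best_acc = L, acc
    return best_L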
More on Random Forests
• Increasing i
– Increases correlation among individual trees (BAD)
– Also increases accuracy of individual trees (GOOD)
• Can also use tuning set to choose good value for i
• Overall, random forests
– Are very fast (eg, 50K examples, 10 features,
10 trees/min on 1 GHz CPU back in 2004)
– Deal well with large # of features
– Reduce overfitting substantially; NO NEED TO PRUNE!
– Work very well in practice
HW2 – Programming Portion
• You will simply run your ID3 on 101
‘drawn-with-replacement’ copies of the
WINE train set (feel free to implement the
full random-forest idea)
• Use WINE tune set to choose best L in
If at least L of these 101 trees say ‘true’
then output ‘true’
• Evaluate on WINE test set
Three Explanations of Why
Ensembles Help
1. Statistical (sample effects)
2. Computational (limited cycles for search)
3. Representational (wrong hypothesis space)
[Figure: for each explanation, the space of concepts considered, marking the true concept, the learned models, and the search paths]
From: Dietterich, T. G. (2002). Ensemble learning. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (2nd ed., pp. 405-408). Cambridge, MA: MIT Press.
A Relevant Early
Paper on ENSEMBLES
Hansen & Salamon, IEEE PAMI 12, 1990
– If (a) the combined predictors make errors that are
independent of one another
– and (b) the probability that any given model correctly predicts
any given test-set example is > 50%, then
   the ensemble’s test-set error rate → 0 as N (the number of predictors) → ∞
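A quick numerical illustration of the theorem (my own example, taking p = 0.6 as each model’s accuracy): the probability that a majority of N independent predictors is wrong shrinks toward 0 as N grows.

from math import comb

def majority_error(p, n):
    # P(at most n//2 of n independent predictors are correct),
    # ie, the probability that the majority vote is wrong (odd n avoids ties)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1))

for n in (1, 11, 101, 1001):
    print(n, majority_error(0.6, n))   # error rate falls toward 0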
Some More
Relevant Early Papers
• Schapire, Machine Learning:5, 1990 (‘Boosting’)
– If you have an algorithm that gets > 50% accuracy on any
distribution of examples, you can create an algorithm
that gets > (100% − ε) accuracy, for any ε > 0
– Need an infinite (or at least very large) source
of examples
- Later extensions (eg, AdaBoost)
address this weakness
• Also see Wolpert, ‘Stacked Generalization,’
Neural Networks, 1992
Some Methods for Producing
‘Uncorrelated’ Members
of an Ensemble
• K times randomly choose (with replacement)
N examples from a training set of size N
• Give each training set to a std ML algo
– ‘Bagging’ by Breiman (Machine Learning, 1996)
– Want unstable algorithms (so learned models vary)
• Reweight examples each cycle (if wrong,
increase weight; else decrease weight)
– ‘AdaBoosting’ by Freund & Schapire (1995, 1996)
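A simplified sketch of that reweighting loop, in the standard AdaBoost form (not pseudocode from the lecture); train_weak(examples, labels, weights) is a hypothetical routine returning a classifier whose predict(x) is in {-1, +1}, and the labels are also -1/+1:

import math

def adaboost(examples, labels, train_weak, rounds):
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []                                  # list of (alpha, model) pairs
    for _ in range(rounds):
        model = train_weak(examples, labels, weights)
        preds = [model.predict(x) for x in examples]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        err = min(max(err, 1e-10), 1 - 1e-10)      # keep the log well-defined
        alpha = 0.5 * math.log((1 - err) / err)    # better models get a bigger say
        # misclassified examples get MORE weight, correct ones LESS
        weights = [w * math.exp(-alpha * p * y)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]     # renormalize to sum to 1
        ensemble.append((alpha, model))
    return ensemble

def boosted_predict(ensemble, x):
    return 1 if sum(a * m.predict(x) for a, m in ensemble) >= 0 else -1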
Stable Algorithms
An algorithm is stable if
small changes to the training data
mean small changes to the learned model
Are d-trees stable?
NO
What about k-NN? (Recall Voronoi diagrams)
YES
Stable Algorithms (2)
• An idea from the stats community
• D-trees unstable since one different example
can change the root
• k-NN stable since impact of examples local
• Ensembles work best with unstable algos
since we want the N learned models to differ
Empirical Studies
(from Freund & Schapire; reprinted in Dietterich’s AI Magazine paper)
[Scatter plots, each point one data set: test-set error rate of C4.5 (ID3’s successor) vs error rate of Bagged C4.5, and vs error rate of AdaBoosted C4.5]
• Boosting and Bagging helped almost always!
• On average, Boosting slightly better?
Some More Methods for
Producing “Uncorrelated”
Members of an Ensemble
• Directly optimize accuracy + diversity
– Opitz & Shavlik (1995; used genetic algo’s)
– Melville & Mooney (2004-5)
• Different number of hidden units in a neural
network, different k in k -NN, tie-breaking
scheme, example ordering, diff ML algos, etc
– Various people
– See 2005-2008 papers of Rich Caruana’s group
for large-scale empirical studies of ensembles
Boosting/Bagging/etc
Wrapup
• An easy-to-use and usually highly
effective technique
- always consider it (Bagging, at least) when
applying ML to practical problems
• Does reduce ‘comprehensibility’ of models
- see work by Craven & Shavlik though (‘rule extraction’)
• Increases runtime, but cycles usually
much cheaper than examples
(and easily parallelized)
Decision “Stumps”
(formerly part of HW; try on your own!)
• Holte (ML journal) compared:
– Decision trees with only one decision (decision stumps)
vs
– Trees produced by C4.5 (with pruning algorithm used)
• Decision ‘stumps’ do remarkably well on
UC Irvine data sets
– Is the archive too easy? Some datasets seem to be
• Decision stumps are a ‘quick and dirty’ control for
comparing to new algorithms
– But ID3/C4.5 easy to use and probably a better control
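A minimal 1R-style stump for Boolean features (my own sketch, not Holte’s exact 1R, which also handles multi-valued and numeric features): pick the single feature whose one-level split, with each branch predicting its majority class, is most accurate on the training data.

from collections import Counter

def learn_stump(examples, labels):
    # examples: list of {feature_name: bool} dicts; labels: parallel list
    default = Counter(labels).most_common(1)[0][0]       # overall majority class
    best_correct, best_feature, best_rule = -1, None, None
    for f in examples[0]:
        branch = {True: Counter(), False: Counter()}
        for x, y in zip(examples, labels):
            branch[x[f]][y] += 1                         # class counts per branch
        correct = sum(c.most_common(1)[0][1] for c in branch.values() if c)
        if correct > best_correct:
            rule = {v: (c.most_common(1)[0][0] if c else default)
                    for v, c in branch.items()}
            best_correct, best_feature, best_rule = correct, f, rule
    return lambda x: best_rule[x[best_feature]]          # the learned stump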
C4.5 Compared
to 1R (‘Decision
Stumps’)
See Holte paper in Machine
Learning for key
(eg, HD=heart disease)
Testset Accuracy
Dataset    C4.5      1R
BC         72.0%     68.7%
CH         99.2%     68.7%
GL         63.2%     67.6%
G2         74.3%     53.8%
HD         73.6%     72.9%
HE         81.2%     76.3%
HO         83.6%     81.0%
HY         99.1%     97.2%
IR         93.8%     93.5%
LA         77.2%     71.5%
LY         77.5%     70.7%
MU        100.0%     98.4%
SE         97.7%     95.0%
SO         97.5%     81.0%
VO         95.6%     95.2%
V1         89.4%     86.8%
Feature Selection
• Sometimes we want to preprocess our
dataset before running an ML algo to
select a good set of features
• Simple idea:
– Collect the i features with the most infoGain
(over all the training examples)
– Weakness: redundancy (consider duplicating the
best-scoring feature; the copy will also score well!)
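A minimal sketch of that simple filter for Boolean features (my own illustration; examples is a list of {feature: bool} dicts and labels a parallel list of class labels):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, f):
    # entropy of the whole set minus the weighted entropy after splitting on f
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(examples, labels) if x[f] == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def top_i_features(examples, labels, i):
    # the simple (redundancy-prone) filter: keep the i highest-gain features
    ranked = sorted(examples[0],
                    key=lambda f: info_gain(examples, labels, f), reverse=True)
    return ranked[:i]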
D-Trees as
Feature Selectors
• In feature selection, we want
features that distinguish examples of the various categories
• But we don’t want redundant features
• And we want features that cover
all the training examples
• D-trees do just that!
– Pick informative features
CONDITIONED on features chosen so far,
until all examples covered
ID3 Recap:
Questions Addressed
• How closely should we fit the training data?
– Completely, then prune
– Use tuning sets to score candidates
– Learn forests and no need to prune! Why?
• How do we judge features?
– Use info theory (Shannon)
• What if a feature has many values?
– Convert to Boolean-valued features
• D-trees can also handle missing feature values
(but we won’t cover this for d-trees)
ID3 Recap (cont.)
• What if some features cost more to evaluate
(eg, CAT scan vs Temperature)?
– Use an ad-hoc correction factor
• Best way to use in an ensemble?
– Random forests often perform quite well
• Batch vs. incremental (aka, online) learning?
– Basically a ‘batch’ approach
– Incremental variants exist but since ID3 is so fast,
why not simply rerun ‘from scratch’ whenever a
mistake is made?
ID3 Recap (cont.)
• What about real-valued outputs?
– Could learn a linear approximation for various regions of
the feature space, eg
[Figure: a Venn-style partition of the feature space into regions, each fit with its own linear function, eg 3·f1 − f2, f1 + 2·f2, f4; the partition looks like a d-tree!]
• How rich is our language for
describing examples?
– Limited to fixed-length feature vectors
(but they are surprisingly effective)
Summary of ID3
Strengths
– Good technique for learning models from ‘real world’
(eg, noisy) data
– Fast, simple, and robust
– Potentially considers complete hypothesis space
– Successfully applied to many real-world tasks
– Results (trees or rules) are human-comprehensible
– One of the most widely used techniques in data mining
Summary of ID3 (cont.)
Weaknesses
– Requires fixed-length feature vectors
– Only makes axis-parallel (univariate) splits
– Not designed to make probabilistic predictions
– Non-incremental
– Hill-climbing algorithm (poor early decisions can be disastrous; however, extensions exist)
A Sample Search Tree
- so we can use another search method besides hill climbing (‘greedy’ algo)
• Nodes are PARTIALLY COMPLETE D-TREES
• Expand the ‘left-most’ question mark (?) of the current node
• All possible trees can be generated (given thresholds ‘implied’ by
real values in train set)
[Figure: the search starts from a single node marked ‘?’; each operator either turns the ‘?’ into a leaf node (− or +) or adds a feature test (Add F1, Add F2, …, Add FN), yielding a new partially complete d-tree whose open branches are again marked ‘?’; assume F2 scores best, so its successors are expanded next]
Viewing ID3 as a
Search Algorithm
Search Space
Operators
Search Strategy
Heuristic Function
Start Node
Goal Node
Viewing ID3 as a
Search Algorithm
Search Space         Space of all decision trees constructible using the current feature set
Operators            Add a node (ie, grow the tree)
Search Strategy      Hill climbing
Heuristic Function   Information gain (other d-tree algo’s use similar ‘purity measures’)
Start Node           An isolated leaf node marked ‘?’
Goal Node            A tree that separates all the training data (‘post-pruning’ may be done later to reduce overfitting)
What We’ve Covered So Far
(Issues, Methodology, Algorithms)
• Supervised ML Algorithms
  – Instance-based (kNN)
  – Logic-based (ID3, Decision Stumps)
  – Ensembles (Random Forests, Bagging, Boosting)
• Train/Tune/Test Sets, N-Fold Cross Validation
• Feature Space, (Greedily) Searching Hypothesis Spaces
• Parameter Tuning (‘Model Selection’), Feature Selection (info gain)
• Dealing w/ Real-Valued and Hierarchical Features
• Overfitting Reduction, Occam’s Razor
• Fixed-Length Feature Vectors, Graph/Logic-Based Reps of Examples
• Understandability of Learned Models, “Generalizing not Memorizing”
• Briefly: Missing Feature Values, Stability (to small changes in training sets)