Lecture 3 PPT - Kiri L. Wagstaff

CS 461: Machine Learning
Lecture 3
Dr. Kiri Wagstaff
[email protected]
Questions?
• Homework 2
• Project Proposal
• Weka
• Other questions from Lecture 2
Review from Lecture 2
• Representation, feature types
  - numeric, discrete, ordinal
• Decision trees: nodes, leaves, greedy, hierarchical, recursive, non-parametric
• Impurity: misclassification error, entropy
• Evaluation: confusion matrix, cross-validation
Plan for Today
• Decision trees
  - Regression trees, pruning, rules
  - Benefits of decision trees
• Evaluation
  - Comparing two classifiers
• Support Vector Machines
  - Classification
  - Linear discriminants, maximum margin
  - Learning (optimization)
  - Non-separable classes
  - Regression
Remember Decision Trees?
Algorithm: Build a Decision Tree
[Alpaydin 2004 © The MIT Press]
Building a Regression Tree
• Same algorithm… different criterion
• Instead of impurity, use Mean Squared Error (in local region)
  - Predict mean output for node
  - Compute training error
  - (Same as computing the variance for the node)
• Keep splitting until node error is acceptable; then it becomes a leaf
  - Acceptable: error < threshold
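A minimal sketch of this recipe in Python with numpy (the function names, threshold values, and dictionary tree representation are illustrative assumptions, not from the lecture): each node predicts the mean output, its error is the variance of its training targets, and splitting stops once that error falls below a threshold.

```python
import numpy as np

def node_error(y):
    """MSE of predicting the mean for this node (= variance of y)."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def build_regression_tree(X, y, err_threshold=0.01, min_samples=5):
    """Recursively split until the node's MSE is acceptable; then make a leaf."""
    if node_error(y) < err_threshold or len(y) < min_samples:
        return {"leaf": True, "prediction": y.mean()}      # predict mean output

    best = None
    for j in range(X.shape[1]):                             # try every feature
        for thresh in np.unique(X[:, j]):                   # and every split point
            left = X[:, j] <= thresh
            right = ~left
            if left.all() or right.all():
                continue
            # weighted MSE of the two children
            err = (left.sum() * node_error(y[left]) +
                   right.sum() * node_error(y[right])) / len(y)
            if best is None or err < best[0]:
                best = (err, j, thresh, left, right)

    if best is None:                                        # no useful split found
        return {"leaf": True, "prediction": y.mean()}

    _, j, thresh, left, right = best
    return {"leaf": False, "feature": j, "threshold": thresh,
            "left": build_regression_tree(X[left], y[left], err_threshold, min_samples),
            "right": build_regression_tree(X[right], y[right], err_threshold, min_samples)}

def predict(tree, x):
    """Walk from the root to a leaf and return its mean prediction."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["prediction"]
```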
Turning Trees into Rules
[Alpaydin 2004 © The MIT Press]
Comparing Two Algorithms
Chapter 14
Machine Learning Showdown!
• McNemar’s Test
• Under H0, we expect e_{01} = e_{10} = (e_{01} + e_{10})/2
• Test statistic:
\[ \frac{(|e_{01} - e_{10}| - 1)^2}{e_{01} + e_{10}} \sim \chi^2_1 \]
• Accept H0 if the statistic is < \chi^2_{\alpha,1}
[Alpaydin 2004 © The MIT Press]
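A small sketch of the test in Python (variable names are mine; e01 counts examples only classifier 1 gets wrong, e10 those only classifier 2 gets wrong). It assumes scipy is available for the chi-squared critical value.

```python
from scipy.stats import chi2

def mcnemar(e01, e10, alpha=0.05):
    """McNemar's test: do two classifiers have the same error rate?

    e01: # examples misclassified by classifier 1 but not classifier 2
    e10: # examples misclassified by classifier 2 but not classifier 1
    """
    # Continuity-corrected statistic, ~ chi-squared with 1 d.o.f. under H0
    statistic = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
    critical = chi2.ppf(1 - alpha, df=1)   # e.g. 3.84 for alpha = 0.05
    accept_h0 = statistic < critical
    return statistic, accept_h0

# Example: classifier 1 uniquely errs on 12 examples, classifier 2 on 4
stat, same = mcnemar(12, 4)
print(f"statistic = {stat:.2f}, classifiers equivalent? {same}")
```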
Support Vector Machines
Chapter 10
Linear Discrimination
• Model class boundaries (not data distribution)
• Learning: maximize accuracy on labeled data
• Inductive bias: form of discriminant used
\[ g(x \mid w, b) = w^T x + b = \sum_{i=1}^{d} w_i x_i + b \]
\[ \text{choose } \begin{cases} C_1 & \text{if } g(x) > 0 \\ C_2 & \text{otherwise} \end{cases} \]
[Alpaydin 2004 © The MIT Press]
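The discriminant itself is just a dot product plus a bias; a tiny illustrative sketch (the weight values and class labels below are toy assumptions):

```python
import numpy as np

def g(x, w, b):
    """Linear discriminant g(x | w, b) = w^T x + b."""
    return np.dot(w, x) + b

def classify(x, w, b):
    """Choose C1 if g(x) > 0, otherwise C2."""
    return "C1" if g(x, w, b) > 0 else "C2"

# Toy example with the 2-D boundary x1 + x2 = 1
w, b = np.array([1.0, 1.0]), -1.0
print(classify(np.array([2.0, 0.5]), w, b))   # C1 (above the boundary)
print(classify(np.array([0.1, 0.2]), w, b))   # C2 (below the boundary)
```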
How to find best w, b?
• E(w | X) is the error with parameters w on sample X
\[ w^* = \arg\min_w E(w \mid X) \]
• Gradient
\[ \nabla_w E = \left[ \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_d} \right]^T \]
• Gradient descent: start from a random w and update w iteratively in the negative direction of the gradient
[Alpaydin 2004 © The MIT Press]
Gradient Descent
\[ \Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \ \forall i \]
\[ w_i \leftarrow w_i + \Delta w_i \]
[Figure: error curve E(w), showing a step of size η from w^t to w^{t+1} with E(w^{t+1}) < E(w^t)]
[Alpaydin 2004 © The MIT Press]
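A minimal gradient-descent loop sketched with numpy. The squared-error objective, toy data, and learning rate below are illustrative assumptions, not from the slides; the update rule is the one above.

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_iters=100):
    """Update w in the negative gradient direction: w <- w - eta * dE/dw."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w -= eta * grad_E(w)          # delta_w_i = -eta * dE/dw_i
    return w

# Example: linear regression error E(w) = ||Xw - y||^2 on toy data
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # column of ones = bias feature
y = np.array([1.0, 3.0, 5.0])                        # y = 1 + 2x exactly
grad = lambda w: 2 * X.T @ (X @ w - y)
print(gradient_descent(grad, w0=[0.0, 0.0], eta=0.05))   # converges near [1, 2]
```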
Support Vector Machines
• Maximum-margin linear classifiers
• Imagine: Army Ceasefire
• How to find best w, b?
• Quadratic programming:
\[ \min \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^t (w^T x^t + b) \ge +1, \ \forall t \]
Optimization (primal formulation)
\[ \min \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^t (w^T x^t + b) \ge +1, \ \forall t \]
(Must get training data right!)
\[ L_p = \frac{1}{2} \|w\|^2 - \sum_{t=1}^{N} \alpha^t \left[ y^t (w^T x^t + b) - 1 \right]
      = \frac{1}{2} \|w\|^2 - \sum_{t=1}^{N} \alpha^t y^t (w^T x^t + b) + \sum_{t=1}^{N} \alpha^t \]
\[ \frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{t=1}^{N} \alpha^t y^t x^t \qquad
   \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t y^t = 0 \]
d + 1 parameters
[Alpaydin 2004 © The MIT Press]
Optimization (dual formulation)
We know:
\[ \frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{t=1}^{N} \alpha^t y^t x^t \qquad
   \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t y^t = 0 \]
So re-write L_p in terms of the \alpha^t alone:
\[ L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t y^t x^t - b \sum_t \alpha^t y^t + \sum_t \alpha^t
       = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s y^t y^s (x^t)^T x^s + \sum_t \alpha^t \]
Maximize L_d subject to
\[ \sum_t \alpha^t y^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \ \forall t \]
• The training instances with \alpha^t > 0 are the support vectors (capped by C when slack is allowed)
• N parameters. Where did w and b go?
[Alpaydin 2004 © The MIT Press]
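To answer the slide's question: once a QP solver returns the α^t, w and b are rebuilt from the support vectors. A hedged numpy sketch (it assumes the alphas come from some external dual QP solver, which is not shown here):

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """Rebuild the primal solution (w, b) from the dual solution.

    alpha : (N,) dual variables from a QP solver (assumed given)
    X     : (N, d) training inputs
    y     : (N,) labels in {-1, +1}
    """
    # w = sum_t alpha^t y^t x^t
    w = (alpha * y) @ X
    # Any support vector (alpha^t > 0) lies on the margin: y^t (w.x^t + b) = 1,
    # so b = y^t - w.x^t; average over all SVs for numerical stability.
    sv = alpha > tol
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b

# Usage (alpha assumed to come from a dual QP solver):
# w, b = recover_w_b(alpha, X_train, y_train)
# predictions = np.sign(X_test @ w + b)
```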
What if Data isn’t Linearly Separable?
1. Embed data in higher-dimensional space
   - Explicit: Basis functions (new features)
   - Implicit: Kernel functions (new dot product/similarity) (see the sketch after this list)
     - Visualization of 2D -> 3D
     - Polynomial
     - RBF/Gaussian
     - Sigmoid
     - SVM applet
   - Still need to find a linear hyperplane
2. Add “slack” variables to permit some errors
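A short numpy sketch of the kernel idea (the gamma, degree, and offset values are illustrative): each kernel returns the dot product the corresponding implicit feature map would give, without ever building the new features.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """K(x, z) = (x.z + c)^degree -- implicit polynomial feature map."""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2) -- Gaussian / RBF kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
    """K(x, z) = tanh(kappa * x.z + c)."""
    return np.tanh(kappa * np.dot(x, z) + c)

# A kernel SVM never needs the high-dimensional coordinates, only these similarities:
x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```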
Example: Orbital Classification
• Linear SVM flying on EO-1 Earth Orbiter since Dec. 2004
• Classify every pixel
• Four classes: ice, water, land, snow
• 12 features (of 256 collected)
[Figure: Hyperion image and classified result, with pixels labeled ice, water, land, snow]
[Castano et al., 2005]
SVM in Weka
• SMO: Sequential Minimal Optimization
• Faster than QP-based versions
• Try linear, RBF kernels
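For reference outside Weka, an analogous sketch with scikit-learn (assumed installed; its SVC is also trained by an SMO-style solver via libsvm). The toy dataset and parameter values are placeholders, not the lecture's data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-class dataset standing in for real training data
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):              # try linear and RBF kernels
    clf = SVC(kernel=kernel, C=1.0)
    clf.fit(X_train, y_train)
    print(kernel, "accuracy:", clf.score(X_test, y_test))
```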
Summary: What You Should Know
• Decision trees
  - Regression trees, pruning, rules
  - Benefits of decision trees
• Evaluation
  - Comparing two classifiers (McNemar’s test)
• Support Vector Machines
  - Classification
  - Linear discriminants, maximum margin
  - Learning (optimization)
  - Non-separable classes
Next Time
• Reading
  - Evaluation (read Ch. 14.7)
  - Support Vector Machines (read Ch. 10.1-10.4, 10.6, 10.9)
• Questions to answer from the reading
  - Posted on the website
  - Three volunteers: Sen, Jimmy, and Irvin