
CS 188: Artificial Intelligence
Fall 2006
Lecture 24: Perceptrons
11/28/2006
Dan Klein – UC Berkeley
Announcements
ƒ Midterm, 2.1, and 1.3 regrades are all graded and will be in glookup
soon
ƒ Our record of your late days is now in glookup
ƒ Make sure it’s right for you and your partner
ƒ If you accidentally submitted an assignment weeks late, you’ll
have a very wrong record – let us know
ƒ Projects
ƒ 4.1 up now
ƒ Due 12/06
ƒ Contest details on web
Example: Spam Filtering
ƒ Model: naïve Bayes over the words, P(Y, W_1 ... W_n) = P(Y) Π_i P(W_i | Y)
ƒ Parameters:

  Prior P(Y):
    ham  : 0.66
    spam : 0.33

  P(W | Y = ham):
    the     : 0.016
    to      : 0.015
    and     : 0.012
    ...
    free    : 0.001
    click   : 0.001
    ...
    morally : 0.001
    nicely  : 0.001
    ...

  P(W | Y = spam):
    the     : 0.021
    to      : 0.013
    and     : 0.011
    ...
    free    : 0.005
    click   : 0.004
    ...
    screens : 0.000
    minute  : 0.000
    ...
Example: OCR
ƒ Naïve Bayes parameters for digit recognition: a uniform prior P(Y) and conditional probabilities P(F = on | Y) for two pixel features F_1 and F_2

    Y   P(Y)   P(F_1 = on | Y)   P(F_2 = on | Y)
    1   0.1    0.01              0.05
    2   0.1    0.05              0.01
    3   0.1    0.05              0.90
    4   0.1    0.30              0.80
    5   0.1    0.80              0.90
    6   0.1    0.90              0.90
    7   0.1    0.05              0.25
    8   0.1    0.60              0.85
    9   0.1    0.50              0.60
    0   0.1    0.80              0.80
Generative vs. Discriminative
ƒ Generative classifiers:
ƒ E.g. naïve Bayes
ƒ We build a causal model of the variables
ƒ We then query that model for causes, given evidence
ƒ Discriminative classifiers:
ƒ E.g. perceptron (next)
ƒ No causal model, no Bayes rule, often no probabilities
ƒ Try to predict output directly
ƒ Loosely: mistake driven rather than model driven
Errors, and What to Do
ƒ Examples of errors
Dear GlobalSCAPE Customer,
GlobalSCAPE has partnered with ScanSoft to offer you the
latest version of OmniPage Pro, for just $99.99* - the
regular list price is $499! The most common question we've
received about this offer is - Is this genuine? We would like
to assure you that this offer is authorized by ScanSoft, is
genuine and valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate,
click through to
http://www.amazon.com/apparel
and see the prominent link for the $30 offer. All details are
there. We hope you enjoyed receiving this message. However,
if you'd rather not receive future e-mails announcing new
store launches, please click . . .
What to Do About Errors?
ƒ Need more features – words aren’t enough!
ƒ Have you emailed the sender before?
ƒ Have 1K other people just gotten the same email?
ƒ Is the sending information consistent?
ƒ Is the email in ALL CAPS?
ƒ Do inline URLs point where they say they point?
ƒ Does the email address you by (your) name?
ƒ Naïve Bayes models can incorporate a variety of
features, but tend to do best in homogeneous
cases (e.g. all features are word occurrences)
Features
ƒ A feature is a function which signals a property of the input
ƒ Examples:
ƒ ALL_CAPS: value is 1 iff email is in all caps
ƒ HAS_URL: value is 1 iff email has a URL
ƒ NUM_URLS: number of URLs in the email
ƒ VERY_LONG: 1 iff email is longer than 1K
ƒ SUSPICIOUS_SENDER: 1 iff reply-to domain doesn’t match originating server
ƒ Features are anything you can write code to evaluate on an input
ƒ Some are cheap, some are very expensive to calculate
ƒ Can even be the output of another classifier
ƒ Domain knowledge goes here!
ƒ In naïve Bayes, how did we encode features?
Feature Extractors
ƒ A feature extractor maps inputs to feature vectors
  Input (email text):
    “Dear Sir. First, I must solicit your confidence in this transaction,
    this is by virture of its nature as being utterly confidencial and
    top secret. …”

  Feature vector:
    W=dear     : 1
    W=sir      : 1
    W=this     : 2
    ...
    W=wish     : 0
    ...
    MISSPELLED : 2
    NAMELESS   : 1
    ALL_CAPS   : 0
    NUM_URLS   : 0
    ...
ƒ Many classifiers take feature vectors as inputs
ƒ Feature vectors usually very sparse, use sparse
encodings (i.e. only represent non-zero keys)
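
As a concrete illustration, here is a minimal sketch of such a feature extractor in Python, producing a sparse dictionary of non-zero features; the function name, the URL regular expression, and the exact feature set are illustrative, not taken from the slides.

    from collections import Counter
    import re

    def extract_features(email_text):
        """Map an email to a sparse feature vector (only non-zero keys are stored)."""
        features = Counter()
        # Word-occurrence features, keyed like "W=dear"
        for word in email_text.lower().split():
            features["W=" + word.strip(".,!?")] += 1
        # A few non-word features in the spirit of the examples above (illustrative)
        if email_text.isupper():
            features["ALL_CAPS"] = 1
        urls = re.findall(r"https?://\S+", email_text)
        if urls:
            features["NUM_URLS"] = len(urls)
        if len(email_text) > 1000:
            features["VERY_LONG"] = 1
        return features

For the email above, this returns a dictionary like {"W=dear": 1, "W=sir": 1, "W=this": 2, ...}.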
Some (Vague) Biology
ƒ Very loose inspiration: human neurons
The Binary Perceptron
ƒ Inputs are features
ƒ Each feature has a weight
ƒ Sum is the activation: activation(x) = Σ_i w_i · f_i(x)
ƒ If the activation is:
ƒ Positive, output 1
ƒ Negative, output 0
[Diagram: inputs f1, f2, f3 with weights w1, w2, w3 feed into a sum Σ, which is then tested against zero (>0?)]
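
A minimal sketch of this decision rule, assuming the weights and the features are both sparse dictionaries (as produced by an extractor like the one sketched earlier); the names are illustrative.

    def binary_classify(weights, features):
        """Binary perceptron: weighted sum of the active features, thresholded at zero."""
        activation = sum(weights.get(f, 0.0) * value for f, value in features.items())
        return 1 if activation > 0 else 0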
Example: Spam
ƒ Imagine 4 features:
ƒ Free (number of occurrences of “free”)
ƒ Money (occurrences of “money”)
ƒ BIAS (always has value 1)
“free money”

  Feature vector f(x):
    BIAS  : 1
    free  : 1
    money : 1
    the   : 0
    ...

  Weights w:
    BIAS  : -3
    free  :  4
    money :  2
    the   :  0
    ...
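ƒ Worked out with the weights above: activation = (-3)(1) + (4)(1) + (2)(1) + (0)(0) = 3 > 0, so the output is 1 (spam)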
Binary Decision Rule
ƒ In the space of feature vectors:
ƒ Any weight vector is a hyperplane
ƒ One side will be class 1
ƒ Other will be class 0

  Weights w:
    BIAS  : -3
    free  :  4
    money :  2
    the   :  0
    ...

[Plot: axes are the “free” count (x) and the “money” count (y); the weight vector’s decision boundary splits the plane into a “1 = SPAM” region and a “0 = HAM” region]
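ƒ With these weights the boundary (activation = 0) is the line 4·free + 2·money - 3 = 0; the side where the activation is positive is SPAM, the other side is HAM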
The Multiclass Perceptron
ƒ If we have more than
two classes:
ƒ Have a weight vector for
each class
ƒ Calculate an activation for
each class
ƒ Highest activation wins
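
A sketch of this rule in the same sparse-dictionary style as the binary case; weight_vectors maps each class label to its own weight dictionary (the names are illustrative).

    def multiclass_classify(weight_vectors, features):
        """Multiclass perceptron: compute one activation per class, highest wins."""
        def activation(w):
            return sum(w.get(f, 0.0) * value for f, value in features.items())
        return max(weight_vectors, key=lambda label: activation(weight_vectors[label]))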
Example
“win the vote”

  Feature vector f(x):
    BIAS : 1
    win  : 1
    game : 0
    vote : 1
    the  : 1
    ...

  Class 1 weights:
    BIAS : -2
    win  :  4
    game :  4
    vote :  0
    the  :  0
    ...

  Class 2 weights:
    BIAS : 1
    win  : 2
    game : 0
    vote : 4
    the  : 0
    ...

  Class 3 weights:
    BIAS : 2
    win  : 0
    game : 2
    vote : 0
    the  : 0
    ...
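ƒ Activations for “win the vote”: w1 · f = (-2)(1) + (4)(1) + (4)(0) + (0)(1) + (0)(1) = 2, w2 · f = (1)(1) + (2)(1) + (0)(0) + (4)(1) + (0)(1) = 7, w3 · f = (2)(1) + (0)(1) + (2)(0) + (0)(1) + (0)(1) = 2, so class 2 has the highest activation and wins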
The Perceptron Update Rule
ƒ Start with zero weights
ƒ Pick up training instances one by one
ƒ Try to classify
ƒ If correct, no change!
ƒ If wrong: lower score of wrong
answer, raise score of right answer
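
A minimal training-loop sketch of this rule, reusing multiclass_classify from the earlier sketch; labeled_data is assumed to be a list of (feature dictionary, label) pairs, and the fixed num_passes is a stand-in for the stopping rule discussed later (stop when held-out accuracy maxes out).

    def train_perceptron(labeled_data, labels, num_passes=5):
        """Multiclass perceptron training: start with zero weights, update only on mistakes."""
        weights = {y: {} for y in labels}
        for _ in range(num_passes):
            for features, true_label in labeled_data:
                guess = multiclass_classify(weights, features)
                if guess != true_label:
                    # Lower the score of the wrong answer, raise the score of the right answer
                    for f, value in features.items():
                        weights[guess][f] = weights[guess].get(f, 0.0) - value
                        weights[true_label][f] = weights[true_label].get(f, 0.0) + value
        return weights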
Example
“win the vote”
“win the election”
“win the game”

  Weight vectors (one per class), each over the features BIAS, win, game, vote, the, ...; the entries start at zero and are filled in by applying the update rule to the three examples above.
Mistake-Driven Classification
ƒ In naïve Bayes, parameters:
ƒ From data statistics
ƒ Have a causal interpretation
ƒ One pass through the data
ƒ For the perceptron parameters:
ƒ From reactions to mistakes
ƒ Have a discriminative interpretation
ƒ Go through the data until held-out
accuracy maxes out
[Diagram: data split into Training Data, Held-Out Data, and Test Data]
Properties of Perceptrons
ƒ Separability: some parameters get
the training set perfectly correct
[Plot: separable training data]
ƒ Convergence: if the training is
separable, perceptron will
eventually converge (binary case)
ƒ Mistake Bound: the maximum
number of mistakes (binary case) is
related to the margin, or degree of
separability
[Plot: non-separable training data]
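ƒ A standard statement of this bound: if every feature vector satisfies ||f(x)|| ≤ R and some weight vector separates the training data with margin γ, the perceptron makes at most (R/γ)² mistakes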
Issues with Perceptrons
ƒ Overtraining: test / held-out
accuracy usually rises, then
falls
ƒ Overtraining isn’t quite as bad as
overfitting, but is similar
ƒ Regularization: if the data isn’t
separable, weights might
thrash around
ƒ Averaging weight vectors over
time can help (averaged
perceptron)
ƒ Mediocre generalization: finds
a “barely” separating solution
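
A naive sketch of the averaged perceptron, building on the train_perceptron sketch above: keep a running total of the weight vectors after every example and return their average (an efficient implementation avoids the full pass over the weights at each step; all names are illustrative).

    def train_averaged_perceptron(labeled_data, labels, num_passes=5):
        """Averaged perceptron: average the weight vectors over time to damp thrashing."""
        weights = {y: {} for y in labels}
        totals = {y: {} for y in labels}   # running sums of the weight vectors
        steps = 0
        for _ in range(num_passes):
            for features, true_label in labeled_data:
                guess = multiclass_classify(weights, features)
                if guess != true_label:
                    for f, value in features.items():
                        weights[guess][f] = weights[guess].get(f, 0.0) - value
                        weights[true_label][f] = weights[true_label].get(f, 0.0) + value
                # Accumulate the current weights after every example
                steps += 1
                for y in labels:
                    for f, w in weights[y].items():
                        totals[y][f] = totals[y].get(f, 0.0) + w
        # Return the time-averaged weight vectors
        return {y: {f: total / steps for f, total in totals[y].items()} for y in labels}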
Summary
ƒ Naïve Bayes
ƒ Build classifiers using a model of the training data
ƒ Smoothing estimates is important in real systems
ƒ Classifier confidences are useful when you can get them
ƒ Perceptrons:
ƒ Make fewer assumptions about the data
ƒ Mistake-driven learning
ƒ Multiple passes through data
Similarity Functions
ƒ Similarity functions are very important in
machine learning
ƒ Topic for next class: kernels
ƒ Similarity functions with special properties
ƒ The basis for a lot of advanced machine
learning (e.g. SVMs)
Case-Based Reasoning
ƒ Similarity for classification
ƒ Case-based reasoning
ƒ Predict an instance’s label using
similar instances
ƒ Nearest-neighbor classification
ƒ 1-NN: copy the label of the most
similar data point
ƒ K-NN: let the k nearest neighbors vote
(have to devise a weighting scheme)
ƒ Key issue: how to define similarity
ƒ Trade-off:
ƒ Small k gives relevant neighbors
ƒ Large k gives smoother functions
ƒ Sound familiar?
ƒ [DEMO]
http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
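
A small sketch of the K-NN vote described above, with a pluggable similarity function and a plain unweighted vote (the weighting scheme is left open, as on the slide); the names are illustrative.

    def knn_classify(query, training_data, similarity, k=3):
        """k-NN: find the k most similar training examples and let them vote."""
        # training_data is a list of (example, label) pairs
        neighbors = sorted(training_data,
                           key=lambda pair: similarity(query, pair[0]),
                           reverse=True)[:k]
        votes = {}
        for _, label in neighbors:
            votes[label] = votes.get(label, 0) + 1   # unweighted vote
        return max(votes, key=votes.get)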
Parametric / Non-parametric
ƒ Parametric models:
ƒ Fixed set of parameters
ƒ More data means better settings
ƒ Non-parametric models:
ƒ Complexity of the classifier increases with data
ƒ Better in the limit, often worse in the non-limit
ƒ (K)NN is non-parametric
[Figure: k-NN fits with 2, 10, 100, and 10000 examples, compared to the truth]
Collaborative Filtering
ƒ Ever wonder how online merchants
decide what products to recommend to
you?
ƒ Simplest idea: recommend the most
popular items to everyone
ƒ Not entirely crazy! (Why?)
ƒ Can do better if you know something
about the customer (e.g. what they’ve
bought)
ƒ Better idea: recommend items that
similar customers bought
ƒ A popular technique: collaborative filtering
ƒ Define a similarity function over
customers (how?)
ƒ Look at purchases made by people with
high similarity
ƒ Trade-off: relevance of comparison set vs
confidence in predictions
ƒ How can this go wrong?
[Plot: customers arranged by similarity, with “You are here” marked]
Nearest-Neighbor Classification
ƒ Nearest neighbor for digits:
ƒ Take new image
ƒ Compare to all training images
ƒ Assign based on closest example
ƒ Encoding: image is vector of intensities:
ƒ What’s the similarity function?
ƒ Dot product of two image vectors?
ƒ Usually normalize vectors so ||x|| = 1
ƒ min = 0 (when?), max = 1 (when?)
Basic Similarity
ƒ Many similarities are based on feature dot products: sim(x, y) = f(x) · f(y) = Σ_i f_i(x) f_i(y)
ƒ If the features are just the pixels: sim(x, y) = x · y = Σ_i x_i y_i
ƒ Note: not all similarities are of this form
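
A sketch of the normalized pixel dot product described above, for images stored as flat lists of intensities; it can be plugged into the knn_classify sketch as the similarity function (names are illustrative).

    import math

    def dot_product_similarity(x, y):
        """Normalized dot product of two intensity vectors: 0 when they share no active pixels, 1 when proportional."""
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        return dot / norm if norm > 0 else 0.0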
Invariant Metrics
ƒ Better distances use knowledge about vision
ƒ Invariant metrics:
ƒ Similarities are invariant under certain transformations
ƒ Rotation, scaling, translation, stroke-thickness…
ƒ E.g.:
ƒ 16 x 16 = 256 pixels; a point in 256-dim space
ƒ Small similarity in R^256 (why?)
ƒ How to incorporate invariance into similarities?
This and next few slides adapted from Xiao Hu, UIUC
Rotation Invariant Metrics
ƒ Each example is now a curve in R^256 (one point per rotation angle)
ƒ Rotation-invariant similarity: s’(A, B) = max over rotations r, r’ of s(r(A), r’(B))
ƒ E.g. the highest similarity between the images’ rotation curves
Tangent Families
ƒ Problems with s’:
ƒ Hard to compute
ƒ Allows large
transformations (6 → 9)
ƒ Tangent distance:
ƒ 1st order approximation
at original points.
ƒ Easy to compute
ƒ Models small rotations
Template Deformation
ƒ Deformable templates:
ƒ An “ideal” version of each category
ƒ Best-fit to image using min variance
ƒ Cost for high distortion of template
ƒ Cost for image points being far from distorted template
ƒ Used in many commercial digit recognizers
Examples from [Hastie 94]