Artificial Intelligence
9. Perceptron
Japan Advanced Institute of Science and Technology (JAIST)
Yoshimasa Tsuruoka
Outline
• Feature space
• Perceptrons
• The averaged perceptron
• Lecture slides
• http://www.jaist.ac.jp/~tsuruoka/lectures/
Feature space
• Instances are represented by vectors in a
feature space
Positive example:
<Outlook = sunny, Temperature = cool, Humidity = normal>
Negative example:
<Outlook = rain, Temperature = high, Humidity = high>
Separating instances with a
hyperplane
• Find a hyperplane that separates the positive
and negative examples
Perceptron learning
• Can always find such a hyperplane if the given
examples are linearly separable
Linear classification
• Binary classification with a linear model


yx  f wT  x
  1, a  0
f a   
 1, a  0
x
: instance
 x  :
w
feature vector
: weight vector
bias
0 x  1
If the inner product of the feature vector with the linear
weights is greater than or equal to zero, then it is classified
as a positive example, otherwise it is classified as a negative
example
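As a minimal sketch of this decision rule (not from the slides; the function and variable names are illustrative):

```python
import numpy as np

def predict(w, phi_x):
    """Return +1 if w . phi(x) >= 0, and -1 otherwise."""
    return 1 if np.dot(w, phi_x) >= 0 else -1

# Illustrative values: phi_0(x) = 1 is the bias feature
w = np.array([-1.0, 1.0, 1.0])
phi_x = np.array([1.0, 0.0, 1.0])
print(predict(w, phi_x))  # 1 -> classified as a positive example
```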
The Perceptron learning algorithm
1. Initialize the weight vector
2. Choose an example (randomly) from the
training data
3. If it is not classified correctly,
– If it is a positive example: w ← w + φ(x)
– If it is a negative example: w ← w − φ(x)
4. Repeat steps 2 and 3 until all examples are correctly classified.
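A minimal sketch of this procedure, assuming the training data is given as a list of (feature vector, label) pairs with labels in {+1, −1} (function and parameter names are illustrative, not from the slides):

```python
import random
import numpy as np

def predict(w, phi_x):
    return 1 if np.dot(w, phi_x) >= 0 else -1

def train_perceptron(examples, max_iters=10000, seed=0):
    """examples: list of (phi_x, t) pairs with t in {+1, -1}."""
    rng = random.Random(seed)
    w = np.zeros(len(examples[0][0]))        # 1. initialize the weight vector
    for _ in range(max_iters):
        if all(predict(w, p) == t for p, t in examples):
            break                            # 4. stop once every example is classified correctly
        phi_x, t = rng.choice(examples)      # 2. choose an example at random
        if predict(w, phi_x) != t:           # 3. if it is not classified correctly ...
            w = w + t * phi_x                #    add phi(x) for a positive example, subtract for a negative one
    return w
```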
Learning the concept OR
• Training data (the first element of each feature vector is the bias feature φ₀(x) = 1)

  φ(x1) = (1, 0, 0)   t1 = −1   Negative
  φ(x2) = (1, 0, 1)   t2 = +1   Positive
  φ(x3) = (1, 1, 0)   t3 = +1   Positive
  φ(x4) = (1, 1, 1)   t4 = +1   Positive
Iteration 1
• x1
0
 
w  0
0
 
1
 
 x1    0 
0
 
t1  1
yx1   f 0 1  0  0  0  0
 f 0
1
Wrong!
w  w   x1 
  1
 
w  0 
0
 
Iteration 2
• x4

w = (−1, 0, 0),   φ(x4) = (1, 1, 1),   t4 = +1

y(x4) = f(−1·1 + 0·1 + 0·1) = f(−1) = −1   → Wrong!

Update: w ← w + φ(x4) = (0, 1, 1)
Iteration 3
• x2
0
 
w  1
1
 
1
 
 x 2    0 
1
 
t2  1
yx 2   f 0 1  1 0  11
 f 1
1
OK!
w w
0
 
w  1
1
 
Iteration 4
• x3
0
 
w  1
1
 
1
 
 x 3    1 
0
 
t3  1
yx3   f 0 1  11  1 0
 f 1
1
OK!
w w
0
 
w  1
1
 
Iteration 5
• x1
0
 
w  1
1
 
1
 
 x1    0 
0
 
t1  1
yx1   f 0 1  1 0  1 0
 f 0
1
Wrong!
w  w   x1 
  1
 
w  1 
1
 
Separating hyperplane
• Final weight vector: w = (−1, 1, 1)
• The separating hyperplane wᵀφ(x) = 0 becomes −1 + s + t = 0, where s and t are the inputs (the second and the third elements of the feature vector)
[Figure: the line −1 + s + t = 0 in the s–t plane, crossing both axes at 1]
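As a sketch, the worked example above can be reproduced by cycling through the examples in the order the slides use (x1, x4, x2, x3, x1, ...); the code and variable names here are illustrative:

```python
import numpy as np

examples = [
    (np.array([1, 0, 0]), -1),   # phi(x1), t1   (bias feature first)
    (np.array([1, 0, 1]), +1),   # phi(x2), t2
    (np.array([1, 1, 0]), +1),   # phi(x3), t3
    (np.array([1, 1, 1]), +1),   # phi(x4), t4
]

def predict(w, phi_x):
    return 1 if np.dot(w, phi_x) >= 0 else -1

w = np.zeros(3)
order = [0, 3, 1, 2]                      # visit x1, x4, x2, x3 as in the slides
while not all(predict(w, p) == t for p, t in examples):
    for i in order:
        phi_x, t = examples[i]
        if predict(w, phi_x) != t:
            w = w + t * phi_x             # add for a positive example, subtract for a negative one

print(w)  # [-1.  1.  1.]  ->  the hyperplane -1 + s + t = 0
```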
Why the update rule works
• When a positive example has not been
correctly classified
yx  f wT  x
This values was too small
w  w   x
w   x   x   w  x    x 
T
T
Original value
2
This is always positive
The update rule makes it less likely for the perceptron
to make the same mistake
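A quick numeric check of this, using the values from Iteration 2 of the OR example above:

```python
import numpy as np

w = np.array([-1.0, 0.0, 0.0])
phi_x = np.array([1.0, 1.0, 1.0])    # a misclassified positive example (x4)

before = w @ phi_x                   # w^T phi(x)                 = -1.0 (too small)
after = (w + phi_x) @ phi_x          # w^T phi(x) + ||phi(x)||^2  = -1.0 + 3.0
print(before, after)                 # -1.0 2.0
```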
Convergence
• The Perceptron training algorithm converges
after a finite number of iterations to a
hyperplane that perfectly classifies the
training data, provided the training examples
are linearly separable.
• The number of iterations can be very large
• The algorithm does not converge if the
training data are not linearly separable
Learning the PlayTennis concept
• Feature space
– 11 binary features
• Perceptron learning
– Converged in 239 steps
Final weight vector
Bias                   0
Outlook = Sunny       -3
Outlook = Overcast     5
Outlook = Rain        -2
Temperature = Hot      0
Temperature = Mild     3
Temperature = Cool    -3
Humidity = High       -4
Humidity = Normal      4
Wind = Strong         -3
Wind = Weak            3
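With these weights, a new instance is classified by adding the bias weight to the weights of its active binary features and checking the sign. A small sketch (the instance shown is only an illustration):

```python
weights = {
    "Bias": 0,
    "Outlook = Sunny": -3, "Outlook = Overcast": 5, "Outlook = Rain": -2,
    "Temperature = Hot": 0, "Temperature = Mild": 3, "Temperature = Cool": -3,
    "Humidity = High": -4, "Humidity = Normal": 4,
    "Wind = Strong": -3, "Wind = Weak": 3,
}

def classify(active_features):
    """Score = bias weight + sum of weights of active features; >= 0 means positive (PlayTennis = Yes)."""
    score = weights["Bias"] + sum(weights[f] for f in active_features)
    return ("Yes" if score >= 0 else "No"), score

print(classify(["Outlook = Sunny", "Temperature = Cool",
                "Humidity = Normal", "Wind = Weak"]))
# ('Yes', 1)  since -3 - 3 + 4 + 3 = 1 >= 0
```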
Averaged Perceptron
• A variant of the Perceptron learning algorithm
– Output the weight vector which is averaged over
iterations rather than the final weight vector
– Do not wait until convergence
• Determine when to stop by observing the performance
on the validation set
• Practical and widely used
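A minimal sketch of the averaged variant, under the same (feature vector, ±1 label) data format as before; the iteration budget and names are illustrative, and in practice the stopping point would be chosen by watching performance on a validation set:

```python
import random
import numpy as np

def train_averaged_perceptron(examples, num_iters=1000, seed=0):
    """Return the weight vector averaged over iterations, not the final one."""
    rng = random.Random(seed)
    w = np.zeros(len(examples[0][0]), dtype=float)
    w_sum = np.zeros_like(w)
    for _ in range(num_iters):
        phi_x, t = rng.choice(examples)
        if (1 if np.dot(w, phi_x) >= 0 else -1) != t:
            w = w + t * phi_x                # same update rule as the plain perceptron
        w_sum += w                           # accumulate the current weights at every iteration
    return w_sum / num_iters                 # averaged weight vector
```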
Naive Bayes vs Perceptrons
• The naive Bayes model assumes conditional
independence between features
– Adding informative features does not necessarily
improve the performance
• Perceptrons allow one to incorporate diverse types of features
• The training takes longer