Linear models and the perceptron algorithm

Reading: Chapters 1, 3
Preliminaries

Definition: The Euclidean dot product between two vectors w, x ∈ R^d is the expression

    w^\top x = \sum_{i=1}^{d} w_i x_i

The dot product is also referred to as the inner product or scalar product. It is sometimes denoted as w · x (hence the name dot product).

Geometric interpretation. The dot product between two unit vectors (vectors of norm 1) is the cosine of the angle between them. The dot product between a vector and a unit vector is the length of its projection in that direction. And in general:

    w^\top x = \|w\| \cdot \|x\| \cos(\theta)

The norm of a vector:

    \|x\|_2 = \sqrt{x^\top x}
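As a small illustration (not from the slides; the vectors are arbitrary examples), these definitions in NumPy:

```python
import numpy as np

w = np.array([1.0, 2.0, 2.0])    # arbitrary example vectors
x = np.array([3.0, 0.0, 4.0])

dot = w @ x                      # Euclidean dot product: sum_i w_i * x_i
norm_w = np.sqrt(w @ w)          # ||w||_2 = sqrt(w^T w)
norm_x = np.linalg.norm(x)       # the same norm, via the library routine

# cosine of the angle between w and x, from w^T x = ||w|| ||x|| cos(theta)
cos_theta = dot / (norm_w * norm_x)

print(dot, norm_w, norm_x, cos_theta)   # 11.0  3.0  5.0  0.733...
```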
Labeled data

A labeled dataset:

    D = \{(x_i, y_i)\}_{i=1}^{n}

where the inputs x_i ∈ R^d are d-dimensional vectors.

The labels y_i are discrete for classification problems (e.g. +1, -1 for binary classification) and are continuous values for a regression problem (estimating a linear function).

Linear models

For classification, a linear model defines a linear decision boundary: it predicts one class where wᵀx + b > 0 and the other where wᵀx + b < 0. For regression, a linear model fits a linear function to the data (e.g. a regression line through a scatter plot of the training points).
Linear models for classification

Discriminant/scoring function:

    f(x) = w^\top x + b

where w is the weight vector and b is the bias.

Decision boundary: all x such that

    f(x) = w^\top x + b = 0

The classifier predicts one class in the region where wᵀx + b > 0 and the other where wᵀx + b < 0. For linear models the decision boundary is a line in 2-d, a plane in 3-d and a hyperplane in higher dimensions.

Using the discriminant to make a prediction:

    \hat{y} = \mathrm{sign}(w^\top x + b)

where the sign function equals 1 when its argument is positive and -1 otherwise.
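A minimal sketch of this prediction rule (the weight vector, bias, and inputs below are made-up values for illustration):

```python
import numpy as np

def predict(w, b, X):
    """Predict +1/-1 labels for the rows of X using sign(w^T x + b)."""
    scores = X @ w + b                    # discriminant f(x) = w^T x + b for each row
    return np.where(scores > 0, 1, -1)    # 1 if positive, -1 otherwise

# hypothetical 2-d example
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(predict(w, b, X))   # [ 1 -1]
```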
Linear models for classification

What can you say about the decision boundary (all x such that f(x) = wᵀx + b = 0) when b = 0?

Linear models for regression

When using a linear model for regression the scoring function is the prediction:

    \hat{y} = w^\top x + b

Why linear?

- It's a good baseline: always start simple.
- Linear models are stable.
- Linear models are less likely to overfit the training data because they have relatively few parameters. They can sometimes underfit, but are often all you need when the data is high dimensional.
- Lots of scalable algorithms.
Least squares linear regression

The data comes from a noisy target y = f(x) + ε, i.e. from P(y|x). The hypothesis is h(x) = wᵀx.

In-sample error:

    E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(x_n) - y_n)^2

Out-of-sample error:

    E_{out}(h) = \mathbb{E}_x[(h(x) - y)^2]

The problem has a convenient matrix representation.

(Slide credit: Malik Magdon-Ismail, Linear Classification and Regression.)
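A sketch of solving this in the matrix representation with NumPy's least squares routine; the synthetic data below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up noisy linear target: y = 3*x1 - 2*x2 + 1 + noise
N = 100
X = rng.normal(size=(N, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + 0.1 * rng.normal(size=N)

# prepend a constant feature of 1 so the first weight plays the role of the bias term
X1 = np.column_stack([np.ones(N), X])

# w minimizing E_in(h) = (1/N) * sum_n (w^T x_n - y_n)^2
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

E_in = np.mean((X1 @ w - y) ** 2)
print(w, E_in)   # w close to (1, 3, -2), E_in close to the noise variance
```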
From linear to non-linear

Linearly separable data: there exists a linear decision boundary separating the classes.

Example: the original data is not linearly separable, but the transformed data (x', y') = (x², y²) is. (A circular decision boundary x² + y² = r² becomes the line x' + y' = r² in the transformed coordinates.)

There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries!
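A small sketch of this trick on made-up data with a circular class boundary; after mapping each point to its squared coordinates the classes become linearly separable:

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up data: label +1 inside the circle of radius 1.5, -1 outside
X = rng.uniform(-2.5, 2.5, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5 ** 2, 1, -1)

# the transformation (x', y') = (x^2, y^2) turns the circle x^2 + y^2 = r^2
# into the line x' + y' = r^2
Z = X ** 2

# in the transformed space the linear classifier w = (-1, -1), b = 1.5^2 separates the classes
w, b = np.array([-1.0, -1.0]), 1.5 ** 2
print(np.all(np.sign(Z @ w + b) == y))   # True
```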
The bias and homogeneous coordinates

In some cases we will use algorithms that learn a discriminant function without a bias term. This does not reduce the expressivity of the model, because we can obtain a bias using the following trick: add another dimension x₀ to each input and set it to 1. Learn a weight vector of dimension d+1 in this extended space, and interpret w₀ as the bias term. With the notation

    w = (w_1, \ldots, w_d), \quad \tilde{w} = (w_0, w_1, \ldots, w_d), \quad \tilde{x} = (1, x_1, \ldots, x_d)

we have that

    \tilde{w} \cdot \tilde{x} = w_0 + w \cdot x

(See page 7 in the book.)

Finding a good hyperplane

We would like a classifier that fits the data, i.e. we would like to find a vector w that minimizes

    E_{in}(w) = \frac{1}{n} \sum_{i=1}^{n} [\mathrm{sign}(w^\top x_i) \neq y_i]

(Note: we're learning a classifier without a bias term, using the homogeneous-coordinates trick above.) This is a difficult problem because of the discrete nature of the indicator and sign functions (it is known to be NP-hard).

How to learn a final hypothesis g from H: we want to select g ∈ H so that g ≈ f. We certainly want g ≈ f on the data set D; ideally, g(x_n) = y_n. How do we find such a g in the infinite hypothesis set H, if it exists? Idea: start with some weight vector and try to improve it.

The perceptron algorithm (Rosenblatt, 1957)

Idea: iterate over the training examples, and update the weight vector w in a way that makes x_i more likely to be correctly classified.

Let's assume that x_i is misclassified, and is a positive example, i.e.

    w \cdot x_i < 0

(its label is +1, but sign(w · x_i) = -1). We would like to update w to w' such that

    w' \cdot x_i > w \cdot x_i

This can be achieved by choosing

    w' = w + \eta x_i

where 0 < η ≤ 1 is the learning rate, since then w' · x_i = w · x_i + η‖x_i‖² > w · x_i.

(See Section 1.1 in the book. The accompanying figure, from Malik Magdon-Ismail's slides on the perceptron, shows data plotted by Income and Age with a linear separator.)

Rosenblatt, Frank (1957). The Perceptron: a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory.
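To make these two ingredients concrete, here is a minimal NumPy sketch (with a hypothetical data point and weight vector) of adding the homogeneous coordinate and applying a single perceptron update to a misclassified positive example:

```python
import numpy as np

def add_homogeneous(X):
    """Prepend x0 = 1 to each input so that w[0] acts as the bias term."""
    return np.column_stack([np.ones(len(X)), X])

# hypothetical positive example that the current weight vector misclassifies
X = np.array([[2.0, 1.0]])
y = np.array([1])
Xh = add_homogeneous(X)              # [[1., 2., 1.]]

w = np.array([0.0, -1.0, 0.0])       # w . x = -2 < 0, so this positive example is misclassified
eta = 1.0                            # learning rate, 0 < eta <= 1

if y[0] > 0 and w @ Xh[0] < 0:       # misclassified positive example
    w = w + eta * Xh[0]              # update: w' = w + eta * x
print(w, w @ Xh[0])                  # the score increased: w'.x = 4 > -2
```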
The perceptron algorithm

If x_i is a negative example, the update needs to be in the opposite direction. Overall, we can summarize the two cases as:

    w' = w + \eta y_i x_i

Input: labeled data D in homogeneous coordinates
Output: weight vector w

    w = 0
    converged = false
    while not converged :
        converged = true
        for i in 1,…,|D| :
            if xi is misclassified:
                update w and set converged = false
    return w

Since the algorithm is not guaranteed to converge, you need to set a limit T on the number of iterations:

Input: labeled data D in homogeneous coordinates
Output: weight vector w

    w = 0
    converged = false
    while not converged and number of iterations < T :
        converged = true
        for i in 1,…,|D| :
            if xi is misclassified:
                update w and set converged = false
    return w

The algorithm makes sense, but let's try to derive it in a more principled way. The algorithm is trying to find a vector w that separates positive from negative examples. We can express that as:

    y_i w^\top x_i > 0, \quad i = 1, \ldots, n

For a given weight vector w, the degree to which this does not hold can be expressed as:

    E(w) = \sum_{i:\, x_i \text{ is misclassified}} y_i w^\top x_i

Do we want to find the w that minimizes or maximizes this criterion? (Each term in the sum is negative, so we want to push it up toward zero: maximize it, or equivalently minimize its negative.)
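A runnable Python version of the pseudocode above, as a sketch assuming NumPy arrays, data already in homogeneous coordinates, and labels in {-1, +1}:

```python
import numpy as np

def perceptron(X, y, eta=1.0, T=1000):
    """Perceptron algorithm. X: (n, d+1) array in homogeneous coordinates, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for t in range(T):                      # limit on the number of passes over the data
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:          # xi is misclassified (or on the boundary)
                w = w + eta * yi * xi       # update: w' = w + eta * y_i * x_i
                converged = False
        if converged:
            break
    return w

# toy linearly separable data (made up): the class is the sign of x1 - x2
X = np.array([[1.0, 2.0, 0.5], [1.0, 0.5, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])
y = np.array([1, -1, 1, -1])
w = perceptron(X, y)
print(np.sign(X @ w) == y)   # all True once the algorithm has converged
```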
Digression: gradient descent

Given a function E(w), the gradient

    \frac{\partial E(w)}{\partial w} = \left( \frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_d} \right)

is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient. Notice that the gradient is perpendicular to contours of equal E(w). (Images from http://en.wikipedia.org/wiki/Gradient_descent)

We can now express gradient descent as:

    w(t+1) = w(t) - \eta \nabla E(w(t))

where w(t) is the weight vector at iteration t. The constant η is called the step size (or the learning rate, when used in the context of machine learning).

The perceptron algorithm as gradient descent

Let's apply gradient descent to the perceptron criterion, written with a minus sign so that we want to minimize it:

    E(w) = - \sum_{i:\, x_i \text{ is misclassified}} y_i w^\top x_i

Its gradient is

    \frac{\partial E(w)}{\partial w} = - \sum_{i:\, x_i \text{ is misclassified}} y_i x_i

so a gradient descent step is

    w(t+1) = w(t) - \eta \frac{\partial E(w)}{\partial w} = w(t) + \eta \sum_{i:\, x_i \text{ is misclassified}} y_i x_i

Which is exactly the perceptron algorithm!
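A sketch of this derivation in code: gradient descent on the (negated) perceptron criterion over the currently misclassified examples, on a made-up toy dataset:

```python
import numpy as np

def misclassified(w, X, y):
    """Boolean mask of examples with y_i * w^T x_i <= 0."""
    return y * (X @ w) <= 0

def gradient(w, X, y):
    """Gradient of E(w) = -sum over misclassified i of y_i * w^T x_i."""
    mis = misclassified(w, X, y)
    return -(y[mis, None] * X[mis]).sum(axis=0)

# toy data in homogeneous coordinates, labels in {-1, +1}
X = np.array([[1.0, 2.0, 0.5], [1.0, 0.5, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])
y = np.array([1, -1, 1, -1])

w, eta = np.zeros(3), 0.5
for t in range(100):
    if not misclassified(w, X, y).any():   # nothing misclassified: done
        break
    w = w - eta * gradient(w, X, y)        # w(t+1) = w(t) - eta * dE/dw
print(w, np.sign(X @ w) == y)
```

Updating on one misclassified example at a time, rather than on the whole sum, gives back the per-example update w' = w + η y_i x_i used earlier.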
The perceptron algorithm

Issues with the algorithm:

- The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherwise (it can be halted after a fixed number of iterations).
- The algorithm chooses an arbitrary hyperplane that separates the two classes; it may not be the best one from the learning perspective.

There are variants of the algorithm that address these issues (to some extent).

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179–191.

The pocket algorithm

The pocket algorithm runs the perceptron updates but keeps "in its pocket" the best weight vector seen so far (the one with the lowest Ein), and returns that instead of the final iterate:

Input: labeled data D in homogeneous coordinates
Output: a weight vector w

    w = 0, wpocket = 0
    converged = false
    while not converged and number of iterations < T :
        converged = true
        for i in 1,…,|D| :
            if xi is misclassified:
                update w and set converged = false
                if w leads to better Ein than wpocket :
                    wpocket = w
    return wpocket
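A runnable sketch of the pocket algorithm corresponding to the pseudocode above (assuming NumPy arrays, data in homogeneous coordinates, and labels in {-1, +1}):

```python
import numpy as np

def error_rate(w, X, y):
    """E_in: fraction of examples misclassified by sign(w^T x)."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, eta=1.0, T=1000):
    w = np.zeros(X.shape[1])
    w_pocket, best = w.copy(), error_rate(w, X, y)
    for t in range(T):                       # fixed limit on the number of passes
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:           # xi is misclassified
                w = w + eta * yi * xi        # perceptron update
                converged = False
                e = error_rate(w, X, y)
                if e < best:                 # keep the best weights seen so far in the pocket
                    w_pocket, best = w.copy(), e
        if converged:
            break
    return w_pocket
```

On data that is not linearly separable, this returns the best hypothesis found within the iteration budget rather than whatever the last update produced.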
Image classification

feature: an important property of the input that you think is useful for classification
(dictionary.com: a prominent or conspicuous part or characteristic)

Intensity and Symmetry Features

In this case we consider the level of symmetry (the image minus its flipped version) and the overall intensity (the fraction of pixels that are dark):

    x = (1, x_1, x_2)   ← input
    w = (w_0, w_1, w_2) ← linear model
    d_vc = 3 (the VC dimension of this model)

(Slide credit: Malik Magdon-Ismail, Linear Classification and Regression.)
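A hedged sketch of computing two such features for a grayscale digit stored as a 2-d NumPy array; the slide does not pin down the exact definitions, so the darkness threshold and the use of a left-right flip below are illustrative assumptions:

```python
import numpy as np

def intensity(img, dark_threshold=0.5):
    """Overall intensity: fraction of pixels that are dark (assumed threshold on [0, 1] values)."""
    return np.mean(img > dark_threshold)

def symmetry(img):
    """Level of symmetry: negative mean absolute difference between the image and its flipped version."""
    return -np.mean(np.abs(img - np.fliplr(img)))   # 0 for a perfectly left-right symmetric image

def features(img):
    """Map an image to the input x = (1, x1, x2) used by the linear model."""
    return np.array([1.0, intensity(img), symmetry(img)])

# illustrative 4x4 "image" with pixel values in [0, 1], where 1 means dark
img = np.array([[0.0, 1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0, 0.0]])
print(features(img))   # [1.  0.5 -0. ] : half the pixels are dark, perfectly symmetric
```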
Pocket vs perceptron on digits data

Comparison on image data: distinguishing between the digits "1" and "5" (see page 83 in the book).

[Figure: error (log scale, 1% to 50%) versus iteration number t (0 to 1000), showing Ein and Eout for PLA and for the pocket algorithm. Credit: Malik Magdon-Ismail, Linear Classification and Regression.]