Recap

• We have been considering linear discriminant functions.
• Such a linear classifier is given by

      h(X) = 1  if  Σ_{i=1}^{d'} w_i φ_i(X) + w_0 > 0
           = 0  otherwise,

  where the φ_i are fixed functions.
• We have been considering the case φ_i(X) = x_i for simplicity.
Perceptron

• The Perceptron is the earliest such classifier.
• Assuming an augmented feature vector, h(X) = sgn(W^T X).
• In other words: 'find the weighted sum and threshold it'.
Perceptron Learning Algorithm

• A simple iterative algorithm.
• In each iteration, we locally try to correct errors.
• Let ΔW(k) = W(k+1) − W(k). Then

      ΔW(k) =  0      if W(k)^T X(k) > 0 and y(k) = 1, or W(k)^T X(k) < 0 and y(k) = 0
            =  X(k)   if W(k)^T X(k) ≤ 0 and y(k) = 1
            = −X(k)   if W(k)^T X(k) ≥ 0 and y(k) = 0

  (A code sketch of this update rule follows.)
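The update rule above translates almost directly into code. Below is a minimal sketch, assuming augmented feature vectors (each X_i carries a leading 1) and labels y_i ∈ {0, 1}; the function name, the NumPy representation, and the max_epochs cap are illustrative choices, not part of the original slides.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron learning: cycle through samples, correcting W on each mistake.

    X : (n, d+1) array of augmented feature vectors (first component 1).
    y : (n,) array of labels in {0, 1}.
    """
    n, d1 = X.shape
    W = np.zeros(d1)
    for _ in range(max_epochs):
        mistakes = 0
        for Xk, yk in zip(X, y):
            s = W @ Xk
            if yk == 1 and s <= 0:      # class-1 pattern on the wrong side: add X(k)
                W += Xk
                mistakes += 1
            elif yk == 0 and s >= 0:    # class-0 pattern on the wrong side: subtract X(k)
                W -= Xk
                mistakes += 1
        if mistakes == 0:               # no errors left: a separating hyperplane was found
            break
    return W
```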
Perceptron: Geometric view

• The algorithm has a simple geometric view. Consider a linearly separable two-class data set plotted in feature space.
• Suppose W(k) misclassifies a pattern X(k).
• The correction then adds X(k) to W(k) (or subtracts X(k) from it), turning the separating hyperplane towards correctly classifying X(k).
• We showed that if the training set is linearly separable, then the algorithm finds a separating hyperplane in finitely many iterations.
• We also saw the 'batch' version of the algorithm, which can be shown to be gradient descent on a reasonable cost function.
Perceptron

• A simple 'device': weighted sum and threshold.
• A simple learning machine (a neuron model).
• The Perceptron is an interesting algorithm for learning linear classifiers.
• It works only when the data is linearly separable.
• In general, it is not possible to know beforehand whether the data is linearly separable.
• We next look at other linear methods in classification and regression.
Regression Problems

• Recall that the regression or function-learning problem is closely related to learning classifiers.
• The training set is {(X_i, y_i), i = 1, · · · , n} with X_i ∈ ℜ^d and y_i ∈ ℜ for all i.
• The main difference is that the 'targets' or 'outputs' y_i are continuous-valued in a regression problem, while they can take only finitely many distinct values in a classifier.
• In a regression problem, the goal is to learn a function f : ℜ^d → ℜ that captures the relationship between X and y. We write ŷ = f(X).
• Note that any such function can also be viewed as a classifier: we can take h(X) = sgn(f(X)) as the classifier.
• We search over a suitably parameterized class of functions to find the best one.
• Once again, the problem is that of learning the best parameters.
Linear Regression

• We will now consider learning a linear function

      f(X) = Σ_{i=1}^{d} w_i x_i + w_0,

  where W = (w_1, · · · , w_d)^T ∈ ℜ^d and w_0 ∈ ℜ are the parameters.
• Thus a linear model can be expressed as f(X) = W^T X + w_0.
• As earlier, by using an augmented vector X, we can write this as f(X) = W^T X.
• Now, to learn the 'optimal' W, we need a criterion function.
• The criterion function assigns a figure of merit or cost to each W ∈ ℜ^{d+1}.
• Then the optimal W would be one that optimizes the criterion function.
• A criterion function that is most often used is the sum of squares of errors.
Linear Least Squares Regression

• We want to find a W such that ŷ(X) = f(X) = W^T X is a good fit for the training data.
• Consider the function J : ℜ^{d+1} → ℜ defined by

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• We take the 'optimal' W to be the minimizer of J(·).
• This is known as the linear least squares method.
• We want to find W to minimize

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• If we are learning a classifier, we can have y_i ∈ {−1, +1}.
• Note that finally we would use the sign of W^T X as the classifier output.
• Thus minimizing J is a good way to learn linear discriminant functions also.
• We want to find the minimizer of

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• This is a quadratic function, so we can find the minimizer analytically.
• For this we rewrite J(W) in a more convenient form.
• Recall that we take all vectors to be column vectors. Hence each training sample X_i is a (d+1) × 1 matrix.
• Let A be the matrix given by

      A = [X_1 · · · X_n]^T

• A is an n × (d+1) matrix whose ith row is X_i^T.
• Hence AW is an n × 1 vector whose ith element is X_i^T W.
• Let Y be the n × 1 vector whose ith element is y_i.
• Hence AW − Y is an n × 1 vector whose ith element is (X_i^T W − y_i).
• Hence we have

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2 = (1/2) (AW − Y)^T (AW − Y)

  (A quick numerical check of this identity appears below.)
• To find the minimizer of J(·), we need to equate its gradient to zero.
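As a quick sanity check of this identity, here is a minimal NumPy sketch with made-up sizes and data; it only verifies that the sum-of-squares form and the matrix form of J(W) coincide.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative sizes and data; the rows of A are the augmented samples X_i^T.
n, d = 10, 2
A = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
Y = rng.normal(size=n)
W = rng.normal(size=d + 1)

# Sum-of-squares form and matrix form of J(W) agree.
J_sum = 0.5 * sum((A[i] @ W - Y[i]) ** 2 for i in range(n))
J_mat = 0.5 * (A @ W - Y) @ (A @ W - Y)
print(np.isclose(J_sum, J_mat))   # True
```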
• We have

      ∇J(W) = A^T (AW − Y)

• Equating the gradient to zero, we get

      (A^T A) W = A^T Y

• The optimal W satisfies this system of linear equations (called the normal equations).
• A^T A is a (d+1) × (d+1) matrix.
• A^T A is invertible if A has linearly independent columns. (This is because the null space of A is the same as the null space of A^T A.)
• The rows of A are the training samples X_i.
• Hence the jth column of A gives the values of the jth feature across all the examples.
• Hence the columns of A are linearly independent if no feature can be obtained as a linear combination of the other features.
• If we assume the features are linearly independent, then A has linearly independent columns and hence A^T A is invertible. (A small illustration of this follows.)
• This is a reasonable assumption.
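A small NumPy illustration of this condition, with made-up data: when the features are independent the augmented data matrix A has full column rank (so A^T A is invertible), while adding a feature that is a linear combination of others makes A rank deficient.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative data: n samples, d independent features, augmented with a column of 1s.
n, d = 20, 3
X = rng.normal(size=(n, d))
A = np.hstack([np.ones((n, 1)), X])

# Full column rank  <=>  A^T A is invertible.
print(np.linalg.matrix_rank(A) == d + 1)              # True

# Add a redundant feature (a linear combination of the others): the rank drops.
X_dep = np.hstack([X, X[:, [0]] + 2.0 * X[:, [1]]])
A_dep = np.hstack([np.ones((n, 1)), X_dep])
print(np.linalg.matrix_rank(A_dep) < A_dep.shape[1])  # True: A^T A would be singular
```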
• The optimal W is a solution of (A^T A) W = A^T Y.
• When A^T A is invertible, we get the optimal W as

      W* = (A^T A)^{−1} A^T Y = A† Y,

  where A† = (A^T A)^{−1} A^T is called the generalized inverse of A.
• The above W* is the linear least squares solution for our regression (or classification) problem. (A small numerical sketch follows.)
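As a concrete illustration of W* = (A^T A)^{−1} A^T Y, here is a minimal NumPy sketch on synthetic data; the data generator and variable names are assumptions for the example, and in practice np.linalg.lstsq or the pseudo-inverse is preferred over explicitly forming (A^T A)^{−1} for numerical reasons.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n samples in d dimensions with a noisy linear target.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true, w0_true = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ w_true + w0_true + 0.1 * rng.normal(size=n)

# Augmented data matrix A: the ith row is (1, X_i^T), so A is n x (d+1).
A = np.hstack([np.ones((n, 1)), X])

# Normal equations: (A^T A) W = A^T Y.
W_star = np.linalg.solve(A.T @ A, A.T @ y)

# Equivalent (and numerically safer) routes to the same least squares solution.
W_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
W_pinv = np.linalg.pinv(A) @ y          # A† Y with the Moore-Penrose pseudo-inverse

print(W_star, W_lstsq, W_pinv)          # all three agree up to round-off
```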
Geometry of Least Squares

• Our least squares method seeks a W that minimizes ||AW − Y||^2.
• A is an n × (d+1) matrix and normally n >> d.
• Consider the (over-determined) system of linear equations AW = Y.
• The system may or may not be consistent. Either way, we seek the W* that minimizes the squared error.
• As we saw, the solution is W* = A† Y, hence the name generalized inverse for A†.
• The least squares method is trying to find a 'best-fit' W for the system AW = Y.
• Let C_0, C_1, · · · , C_d be the columns of A.
• Then AW = w_0 C_0 + w_1 C_1 + · · · + w_d C_d.
• Thus, for any W, AW is a linear combination of the columns of A.
• Hence, if Y is in the space spanned by the columns of A, there is an exact solution.
• Otherwise, we want the projection of Y onto the column space of A.
• That is, we want to find the vector Z in the column space of A that is closest to Y.
• Any vector in the column space of A can be written as Z = AW for some W.
• Hence we want to find Z to minimize ||Z − Y||^2 subject to the constraint that Z = AW for some W.
• That is the least squares solution. (A numerical check of this projection view follows.)
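Here is a quick numerical check of this projection view on made-up data: the fitted vector Z = AW* lies in the column space of A, and the residual Y − Z is orthogonal to every column of A, as an orthogonal projection must be.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative over-determined system: n >> d+1, so AW = Y need not be consistent.
n, d = 40, 2
A = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
Y = rng.normal(size=n)

# Least squares solution and the corresponding point Z = A W* in the column space.
W_star, *_ = np.linalg.lstsq(A, Y, rcond=None)
Z = A @ W_star

# The residual Y - Z is orthogonal to the column space of A.
print(np.allclose(A.T @ (Y - Z), 0.0, atol=1e-8))   # True
```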
• Let us take the original (not augmented) data vectors and write our model as ŷ(X) = f(X) = W^T X + w_0, where now W ∈ ℜ^d.
• Now we have

      J(W) = (1/2) Σ_{i=1}^{n} (W^T X_i + w_0 − y_i)^2

• For any given W, we can find the best w_0 by equating the partial derivative to zero.
We have

      ∂J/∂w_0 = Σ_{i=1}^{n} (W^T X_i + w_0 − y_i)

Equating the partial derivative to zero, we get

      Σ_{i=1}^{n} (W^T X_i + w_0 − y_i) = 0
      ⇒  n w_0 + W^T Σ_{i=1}^{n} X_i = Σ_{i=1}^{n} y_i
This gives us

      w_0 = (1/n) Σ_{i=1}^{n} y_i − W^T ( (1/n) Σ_{i=1}^{n} X_i )

• Thus, w_0 accounts for the difference between the average of W^T X and the average of y.
• So w_0 is often called the bias term. (A small numerical check of this relation follows.)
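A minimal NumPy check of this relation on illustrative data: the intercept returned by least squares on the augmented data matrix equals the mean of y minus W^T times the mean of X. The data generator and names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative regression data with a known intercept of 3.0.
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 3.0 + 0.05 * rng.normal(size=n)

# Fit with an augmented data matrix; the first coefficient is w_0.
A = np.hstack([np.ones((n, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w0, W = coef[0], coef[1:]

# w_0 = mean(y) - W^T mean(X): the bias matches the difference of the averages.
print(np.isclose(w0, y.mean() - W @ X.mean(axis=0)))   # True
```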
• We have taken our linear model to be

      ŷ(X) = f(X) = Σ_{j=0}^{d} w_j x_j

• As mentioned earlier, we could instead choose any fixed set of basis functions φ_j.
• Then the model would be

      ŷ(X) = f(X) = Σ_{j=0}^{d'} w_j φ_j(X)
• We can use the same criterion of minimizing the sum of squares of errors:

      J(W) = (1/2) Σ_{i=1}^{n} (W^T Φ(X_i) − y_i)^2,

  where Φ(X_i) = (φ_0(X_i), · · · , φ_{d'}(X_i))^T.
• We want the minimizer of J(·).
• We can learn W using the same method as earlier.
• Thus, we will again have

      W* = (A^T A)^{−1} A^T Y

• The only difference is that now the ith row of the matrix A is

      [φ_0(X_i)  φ_1(X_i)  · · ·  φ_{d'}(X_i)]
• As an example, let d = 1 (so X_i, y_i ∈ ℜ).
• Take φ_j(X) = X^j, j = 0, 1, · · · , m.
• Now the model is

      ŷ(X) = f(X) = w_0 + w_1 X + w_2 X^2 + · · · + w_m X^m

• That is, the model says y is an mth degree polynomial in X.
• All such problems are tackled in a uniform fashion using the least squares method presented above. (A code sketch of this polynomial fit follows.)
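A minimal sketch of this polynomial case in NumPy, with made-up data: the ith row of the design matrix A is [1, X_i, X_i^2, · · · , X_i^m], and W* is obtained exactly as before. The use of np.vander and the particular cubic target are illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative 1-D data from a noisy cubic.
n, m = 60, 3
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=n)

# Design matrix with basis functions phi_j(x) = x^j: ith row is [1, x_i, ..., x_i^m].
A = np.vander(x, N=m + 1, increasing=True)

# Same least squares solution as before: W* = (A^T A)^{-1} A^T y.
W_star, *_ = np.linalg.lstsq(A, y, rcond=None)
print(W_star)            # approximately [1.0, -2.0, 0.0, 0.5]

# Predictions for new inputs use the same basis expansion.
x_new = np.linspace(-2, 2, 5)
y_hat = np.vander(x_new, N=m + 1, increasing=True) @ W_star
```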
LMS algorithm

• We are finding the W* that minimizes

      J(W) = (1/2) Σ_{i=1}^{n} (X_i^T W − y_i)^2

• We could also have found the minimum through an iterative scheme using gradient descent.
• The gradient of J is given by

      ∇J(W) = Σ_{i=1}^{n} X_i (X_i^T W − y_i)
• The iterative gradient descent scheme would be

      W(k+1) = W(k) − η Σ_{i=1}^{n} X_i (X_i^T W(k) − y_i)

• In analogy with what we saw in the Perceptron algorithm, this can be viewed as a 'batch' version.
• We use the current W to find the errors on all the training data and then make all the 'corrections' together. (A code sketch of this batch version follows.)
• We can instead have an incremental version of this algorithm.
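A minimal sketch of this batch scheme, with illustrative names; the step size η and the stopping rule are arbitrary choices for the example (η must be small enough, relative to the scale of the data, for the iteration to converge).

```python
import numpy as np

def batch_gradient_descent(A, y, eta=0.01, max_iters=1000, tol=1e-8):
    """Batch gradient descent on J(W) = 0.5 * ||A W - y||^2.

    A : (n, d+1) augmented data matrix (the ith row is X_i^T).
    y : (n,) target vector.
    """
    W = np.zeros(A.shape[1])
    for _ in range(max_iters):
        grad = A.T @ (A @ W - y)          # sum_i X_i (X_i^T W - y_i)
        W_new = W - eta * grad            # all corrections applied together
        if np.linalg.norm(W_new - W) < tol:
            return W_new
        W = W_new
    return W
```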
• For the incremental version, at each iteration we pick one of the training samples. Call this X(k).
• The error on this sample would be (1/2) (X(k)^T W(k) − y(k))^2.
• Using the gradient of only this term, we get the incremental version as

      W(k+1) = W(k) − η X(k) (X(k)^T W(k) − y(k))

• This is called the LMS algorithm.
• In the LMS algorithm, we iteratively update W as

      W(k+1) = W(k) − η X(k) (X(k)^T W(k) − y(k))

• Here (X(k), y(k)) is the training example picked at iteration k and W(k) is the weight vector at iteration k.
• We do not need to have all the training examples with us at once; we can learn W from a stream of examples without needing to store them. (A streaming code sketch follows.)
• If η is sufficiently small, this algorithm also converges to the minimizer of J(W).
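A minimal sketch of the LMS update used in streaming fashion, with made-up names and data: each example triggers a single correction and is never stored. The step size and the synthetic data generator are assumptions for the example.

```python
import numpy as np

def lms_update(W, x_k, y_k, eta=0.01):
    """One LMS step: W(k+1) = W(k) - eta * X(k) * (X(k)^T W(k) - y(k))."""
    return W - eta * x_k * (x_k @ W - y_k)

# Illustrative stream: samples arrive one at a time and are processed immediately.
rng = np.random.default_rng(4)
w_true = np.array([0.5, 1.0, -2.0, 3.0])      # first component plays the role of w_0

W = np.zeros(4)
for k in range(5000):
    x_k = np.concatenate(([1.0], rng.normal(size=3)))   # augmented sample X(k)
    y_k = w_true @ x_k + 0.01 * rng.normal()             # noisy target y(k)
    W = lms_update(W, x_k, y_k)

print(W)     # close to w_true for a sufficiently small eta
```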