Newton's Method

Theory and Application to Logistic Regression (Classification)
Computational Details
• Recall the cost function of the Logistic Classifier:
  J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
• There are different ways of minimizing J(θ), for example:
– Gradient descent
– Newton’s method
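
As a concrete reference point, the sketch below implements the hypothesis and cost function in Python with NumPy. The names X (an m × (n+1) design matrix with a leading column of ones), y, and theta are illustrative assumptions, not notation taken from these slides:

import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ].
    m = len(y)
    h = sigmoid(X @ theta)               # h_theta(x^(i)) for every training example
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))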
Newton's Method of Minimizing J(θ)
• Recall the cost function of the Logistic Classifier:
  J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]
• Basic calculus tells us that to find the point(s) at which a function is minimum, we find the value(s) of the parameter at which the derivative is zero, and ensure that the point is a minimum (i.e., not a maximum or a saddle point).
  Find θ s.t. ∂J(θ)/∂θ = 0.
• To simplify the explanation, we will consider a scalar function f(θ), rather than J(θ).
• Initially: find θ s.t. f(θ) = 0.
• Later, find θ s.t. J'(θ) = 0, i.e., apply the same procedure with f(θ) = J'(θ).
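• Simple example (not from the slides): to minimize (θ - 3)², set its derivative 2(θ - 3) = 0, which gives θ = 3; the second derivative 2 > 0 confirms that θ = 3 is a minimum.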
Illustration of Newton's Method for f(θ)
(Figure: illustration of the Newton iterations on f(θ).)
• Goal: find θ s.t. f(θ) = 0.
• Want to find the value of θ for which f(θ) = 0. How to find θ?
• Will use an iterative method to find θ.
• Consider θ^(0) as the initial value; then iterate as illustrated above, until f(θ) ≈ 0 is achieved.
• Update rule for iteration 1:
  θ^(1) = θ^(0) - f(θ^(0)) / f'(θ^(0))
• Update rule for any iteration t:
  θ^(t+1) = θ^(t) - f(θ^(t)) / f'(θ^(t))
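
A minimal sketch of this scalar update rule in Python; the example function f(θ) = θ² - 2 and the tolerance are illustrative choices, not from the slides:

def newton_root(f, f_prime, theta0, tol=1e-10, max_iters=100):
    # Iterate theta <- theta - f(theta)/f'(theta) until f(theta) is (nearly) 0.
    theta = theta0
    for _ in range(max_iters):
        if abs(f(theta)) < tol:
            break
        theta = theta - f(theta) / f_prime(theta)
    return theta

# Example: solve f(theta) = theta^2 - 2 = 0, i.e., find sqrt(2).
root = newton_root(lambda t: t**2 - 2, lambda t: 2*t, theta0=1.0)
print(root)   # approximately 1.41421356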
• Cost function: J(θ)
• Update rule for finding a root of f(θ):
  θ^(t+1) = θ^(t) - f(θ^(t)) / f'(θ^(t))
• Update rule for minimizing J(θ), obtained by setting f(θ) = J'(θ):
  θ^(t+1) = θ^(t) - J'(θ^(t)) / J''(θ^(t))
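
Continuing the sketch above, minimizing a one-dimensional J(θ) just means applying the same iteration to its derivative. The quadratic J(θ) = (θ - 3)² used here is an illustrative example, not from the slides:

# Minimize J(theta) = (theta - 3)^2 by driving J'(theta) = 2*(theta - 3) to zero.
# Newton's update: theta <- theta - J'(theta) / J''(theta), with J''(theta) = 2.
theta = 0.0
for _ in range(10):
    theta = theta - (2 * (theta - 3)) / 2.0
print(theta)   # 3.0, the minimizer (reached in a single step for a quadratic)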
• Update rule for J(θ):
  θ^(t+1) = θ^(t) - J'(θ^(t)) / J''(θ^(t))
• In the gradient descent and Newton's algorithms, we are interested in updating each component of θ individually, and as such, the update rule for the j-th component requires the partial derivatives:
  ∂J(θ)/∂θ_j and ∂²J(θ)/∂θ_k ∂θ_j
• Note: the 2nd-order partial derivative means we need to take the derivative of ∂J(θ)/∂θ_j with respect to θ again, but in order to avoid confusing index j, we use index k.
• Recall ∂J(θ)/∂θ_j:
  ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)
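
A non-vectorized sketch of this partial derivative in Python, reusing the illustrative X, y, theta, and sigmoid from the earlier cost-function sketch:

def dJ_dtheta_j(theta, X, y, j):
    # (1/m) * sum over i of (h_theta(x^(i)) - y^(i)) * x_j^(i).
    m = len(y)
    total = 0.0
    for i in range(m):
        h_i = sigmoid(X[i] @ theta)      # prediction for training example i
        total += (h_i - y[i]) * X[i, j]
    return total / m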
• Recall the non-vectorized expression for ∂J(θ)/∂θ_j:
  ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)
  (See slide 35 of the Logistic Regression slides.)
• What is the expression for ∂²J(θ)/∂θ_k ∂θ_j?
For each j, k, where j, k ∈ {0, 1, ⋯, n}:
  ∂²J(θ)/∂θ_k ∂θ_j = (1/m) Σ_{i=1}^{m} h_θ(x^(i)) (1 - h_θ(x^(i))) x_j^(i) x_k^(i)
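
A non-vectorized sketch of these second-order partial derivatives, with the same illustrative X, theta, and sigmoid as before:

def d2J_dtheta_k_dtheta_j(theta, X, j, k):
    # (1/m) * sum over i of h_theta(x^(i)) * (1 - h_theta(x^(i))) * x_j^(i) * x_k^(i).
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        h_i = sigmoid(X[i] @ theta)
        total += h_i * (1 - h_i) * X[i, j] * X[i, k]
    return total / m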
Vector Format of ∂J(θ)/∂θ_j
• Recall (as determined in previous slides):
  ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)   for each j
• In vector format:
  ∇J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x^(i)
  For training example i, the scalar (h_θ(x^(i)) - y^(i)) multiplies each component of the vector x^(i).
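
A vectorized sketch of ∇J(θ) with NumPy (illustrative names as before; X is the m × (n+1) design matrix):

def gradient(theta, X, y):
    # grad J(theta) = (1/m) * sum over i of (h_theta(x^(i)) - y^(i)) * x^(i),
    # computed here as (1/m) * X^T (h - y), which is the same weighted sum.
    m = len(y)
    h = sigmoid(X @ theta)
    return (X.T @ (h - y)) / m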
Verifying ∇J(θ)
  ∇J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) [ x_0^(i), x_1^(i), …, x_n^(i) ]^T
  so the j-th component is (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i) = ∂J(θ)/∂θ_j, for each j.
Matrix Format of ∂²J(θ)/∂θ_k ∂θ_j
• The 2nd-order partial derivative means we need to take the derivative of ∂J(θ)/∂θ_j with respect to θ again, but in order to avoid confusing index j, we use index k:
  ∂²J(θ)/∂θ_k ∂θ_j = (1/m) Σ_{i=1}^{m} h_θ(x^(i)) (1 - h_θ(x^(i))) x_j^(i) x_k^(i)   for each j, k
• The matrix of these second-order partial derivatives is called the Hessian matrix, named after the German mathematician Ludwig Otto Hesse.
Matrix Format of the Hessian
• Substituting x_0^(i) = 1 into the matrix, for each j, k ∈ {0, …, n}:

  H = (1/m) Σ_{i=1}^{m} h_θ(x^(i)) (1 - h_θ(x^(i))) ×
      [ 1          x_1^(i)            …   x_n^(i)
        x_1^(i)    x_1^(i) x_1^(i)    …   x_1^(i) x_n^(i)
        ⋮          ⋮                  ⋱   ⋮
        x_n^(i)    x_n^(i) x_1^(i)    …   x_n^(i) x_n^(i) ]
Matrix Format of the Hessian
• Recall the outer product or matrix multiplication of two vectors: the (j, k) entry of u v^T is u_j v_k.
• The outer product of an (n+1) × 1 column vector and a 1 × (n+1) row vector forms an (n+1) × (n+1) matrix.
• Therefore, the matrix above is equivalent to the matrix multiplication x^(i) (x^(i))^T, provided that x^(i) is represented as an (n+1) × 1 column vector (which makes (x^(i))^T a row vector).
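
A quick NumPy check of this equivalence, using an illustrative 3-component vector:

import numpy as np

x = np.array([1.0, 2.0, 5.0])                      # example vector, with x_0 = 1
outer = np.outer(x, x)                             # (j, k) entry is x_j * x_k
as_matmul = x.reshape(-1, 1) @ x.reshape(1, -1)    # column vector times row vector
print(np.allclose(outer, as_matmul))               # True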
• For the matrix:
  H = (1/m) Σ_{i=1}^{m} h_θ(x^(i)) (1 - h_θ(x^(i))) x^(i) (x^(i))^T
• The h_θ(x^(i)) (1 - h_θ(x^(i))) are scalars, i.e., h_θ(x^(i)) (1 - h_θ(x^(i))) ∈ ℝ.
• The x^(i) (x^(i))^T forms an (n+1) × (n+1) matrix, one per training example.
• For each training index i, the scalar h_θ(x^(i)) (1 - h_θ(x^(i))) multiplies each element of x^(i) (x^(i))^T.
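
A sketch of the Hessian as this weighted sum of outer products, reusing the earlier illustrative names (the fully vectorized X^T diag(w) X form is an equivalent alternative):

def hessian(theta, X):
    # H = (1/m) * sum over i of h_i * (1 - h_i) * outer(x^(i), x^(i)).
    m, n_plus_1 = X.shape
    H = np.zeros((n_plus_1, n_plus_1))
    for i in range(m):
        h_i = sigmoid(X[i] @ theta)
        H += h_i * (1 - h_i) * np.outer(X[i], X[i])
    return H / m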
Scalar?
• Consider: θ^T x^(i) is a scalar, so h_θ(x^(i)) = 1 / (1 + e^(-θ^T x^(i))) is a scalar.
• Therefore, the following is a scalar: h_θ(x^(i)) (1 - h_θ(x^(i))).
Finally, Newton's Method in Vector Format
  θ^(t+1) = θ^(t) - H^(-1) ∇J(θ^(t))
  where ∇J(θ) = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x^(i)
  and H = (1/m) Σ_{i=1}^{m} h_θ(x^(i)) (1 - h_θ(x^(i))) x^(i) (x^(i))^T
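
Putting the pieces together, a minimal sketch of the full update θ := θ - H⁻¹ ∇J(θ), using the gradient and hessian functions from the earlier sketches (the iteration count is an illustrative choice):

def newtons_method(X, y, num_iters=10):
    # Newton's method for logistic regression: theta <- theta - H^{-1} * grad J(theta).
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = gradient(theta, X, y)
        H = hessian(theta, X)
        theta = theta - np.linalg.solve(H, grad)   # solve H * step = grad rather than inverting H
    return theta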
Gradient Descent vs Newton's Method

  Gradient Descent                           | Newton's Method
  -------------------------------------------|--------------------------------------------
  Simpler.                                   | Slightly more complex.
  Needs a choice of learning rate.           | No learning rate needed.
  Needs more iterations to converge.         | Needs fewer iterations to converge.
  Order of complexity O(m·n) per iteration.  | Order of complexity O(n³) per iteration.
  Scales well with the number of features n. | Unable to handle very large n.