
Logistic regression - cs534
This lecture will introduce logistic regression (LR), a discriminative probabilistic classifier. We will begin with the basic modeling assumption used by LR. Recall that in LDA, by assuming Gaussian class-conditional distributions with a shared covariance matrix, we showed that
\[ \log \frac{P(x, y = 1)}{P(x, y = 0)} = w^T x + w_0, \]
which defines a linear decision boundary. In logistic regression, we make essentially the same assumption, namely that
\[ \log \frac{P(y = 1|x)}{P(y = 0|x)} = w^T x + w_0, \]
and learn the weights directly, without making any Gaussian assumptions about the class distributions.
Note that this is equivalent to assuming that
\[ P(y = 1|x) = \frac{1}{1 + e^{-(w^T x + w_0)}}. \]
Using the canonical representation of the data (adding a dummy feature of value 1 to each input vector), we have
\[ P(y = 1|x) = g(x, w) = \frac{1}{1 + e^{-w^T x}}. \]
It follows that
\[ P(y = 0|x) = 1 - g(x, w) = \frac{e^{-w^T x}}{1 + e^{-w^T x}}. \]
Given a training set $\{(x_i, y_i) : i = 1, \cdots, N\}$, $y_i \in \{0, 1\}$, and $x_i \in \mathbb{R}^{d+1}$, where $N$ is the total number of training examples and $d$ is the original feature dimension, the learning goal is to find the optimal weight vector $w$.
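As a quick illustration, here is a minimal NumPy sketch of this model; the helper names (sigmoid, predict_proba) are ours, not part of the lecture, and the data is made up.

import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """P(y = 1 | x) = g(x, w) for each row of X.

    X is assumed to be in canonical form, i.e. each row already carries a
    leading dummy feature of value 1, so w has length d + 1.
    """
    return sigmoid(X @ w)

# Two made-up points in 2-D, plus the dummy feature in the first column.
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8]])
w = np.array([0.1, 2.0, -1.0])
print(predict_proba(X, w))        # P(y = 1 | x_i)
print(1.0 - predict_proba(X, w))  # P(y = 0 | x_i) = 1 - g(x_i, w)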
1 Maximum Likelihood Estimation

To learn the parameters, we will again use maximum likelihood estimation. The log likelihood function is as follows:
\[ \log p(D|M) = \sum_{i=1}^{N} \log p(x_i, y_i) = \sum_{i=1}^{N} \log p(y_i|x_i) p(x_i) \]
Note that in logistic regression we don't care about $p(x_i)$ and only need to learn the correct $p(y|x)$, so we can drop $p(x_i)$. Thus we have:
\[ L(w) = \sum_{i=1}^{N} \log p(y_i|x_i) = \sum_{i=1}^{N} \log g(x_i, w)^{y_i} (1 - g(x_i, w))^{1 - y_i} \]
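For concreteness, here is a short NumPy sketch of this log likelihood; the helper name log_likelihood is ours, and a small eps is added inside the logarithms purely to avoid log(0).

import numpy as np

def log_likelihood(X, y, w, eps=1e-12):
    """L(w) = sum_i [ y_i log g(x_i, w) + (1 - y_i) log(1 - g(x_i, w)) ].

    X: (N, d+1) design matrix in canonical form; y: (N,) labels in {0, 1}.
    """
    g = 1.0 / (1.0 + np.exp(-X @ w))  # g(x_i, w) for every example
    return np.sum(y * np.log(g + eps) + (1.0 - y) * np.log(1.0 - g + eps))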
To maximize $L$ with respect to $w$, let us look at each example:
\[ L_i(w) = \log g(x_i, w)^{y_i} (1 - g(x_i, w))^{1 - y_i} = y_i \log g(x_i, w) + (1 - y_i) \log (1 - g(x_i, w)), \]
where $g(x, w) = \frac{1}{1 + e^{-w^T x}}$.
Taking the gradient of $L_i$ with respect to $w$, and using the fact that for the sigmoid $\nabla_w g = g(1 - g) x$, we have:
\[
\begin{aligned}
\nabla_w L_i &= \frac{y_i}{g(x_i, w)} \nabla_w g - \frac{1 - y_i}{1 - g(x_i, w)} \nabla_w g \\
&= \frac{y_i}{g} g(1 - g) x_i - \frac{1 - y_i}{1 - g} g(1 - g) x_i \\
&= \big(y_i (1 - g) - (1 - y_i) g\big) x_i \\
&= (y_i - y_i g - g + y_i g) x_i \\
&= \big(y_i - g(x_i, w)\big) x_i
\end{aligned}
\]
Considering all training examples, we have:
\[ \nabla_w L = \sum_{i=1}^{N} \big(y_i - g(x_i, w)\big) x_i \]
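This gradient translates directly into NumPy; the helper name gradient below is again just illustrative.

import numpy as np

def gradient(X, y, w):
    """∇_w L = sum_i (y_i - g(x_i, w)) x_i, written as a single matrix product."""
    g = 1.0 / (1.0 + np.exp(-X @ w))  # predicted P(y = 1 | x_i) for every example
    return X.T @ (y - g)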
Note that we cannot directly solve for the optimal $w$ by setting the gradient to zero. Instead, we will need to use the gradient ascent method to find the optimal $w$. The update rule for the batch method is
\[ w \leftarrow w + \eta \sum_{i=1}^{N} \big(y_i - g(x_i, w)\big) x_i, \]
where $\eta$ is the learning rate. Similarly, the update rule for online learning of LR is
\[ w \leftarrow w + \big(y_i - g(x_i, w)\big) x_i. \]
Note that the log likelihood function is concave, thus gradient ascent will find the globally optimal solution.
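As a minimal sketch, the batch rule can be wrapped in a simple loop; the learning rate and step count below are arbitrary illustrative choices, not values from the lecture.

import numpy as np

def fit_logistic_regression(X, y, eta=0.1, n_steps=1000):
    """Batch gradient ascent on the log likelihood: w <- w + eta * ∇_w L."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        g = 1.0 / (1.0 + np.exp(-X @ w))  # g(x_i, w) for every example
        w = w + eta * X.T @ (y - g)       # batch update rule
    return w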
2 Multi-class Logistic Regression

For multi-class classification problems, we have a training set $\{(x_i, y_i) : i = 1, \cdots, N\}$, $y_i \in \{1, 2, \cdots, K\}$. In this case, we assume that
\[ p(y = k|x) = \frac{e^{w_k^T x}}{\sum_{j} e^{w_j^T x}} \]
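A small sketch of this softmax model, where W stacks the K class weight vectors as rows (the max subtraction only improves numerical stability and does not change the ratio):

import numpy as np

def softmax_proba(X, W):
    """p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x) for each row of X.

    X: (N, d+1) canonical inputs; W: (K, d+1), one weight vector per class.
    Returns an (N, K) matrix of class probabilities.
    """
    scores = X @ W.T                                      # (N, K) matrix of w_k^T x
    scores = scores - scores.max(axis=1, keepdims=True)   # shift for stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)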
To write down the likelihood function, we will use the 1-of-K coding scheme, in which we create a target vector $y$ for each example: if $y = k$, then the vector $y$ has 1 for the $k$-th element and zero for all other elements. The log likelihood function for the $i$-th example is then:
\[ L_i(w_1, w_2, \cdots, w_K) = \log \prod_{k=1}^{K} p(k|x_i)^{y_i^k} = \sum_{k=1}^{K} y_i^k \log p(k|x_i) \]
As shown in class, taking the gradient with respect to $w_k$, we have:
\[ \nabla_{w_k} L_i = \big(y_i^k - p(k|x_i)\big) x_i \]
This gives us the online update rule for the weight vector $w_k$:
\[ w_k \leftarrow w_k + \big(y_i^k - p(k|x_i)\big) x_i \]
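Here is a sketch of one such online update; the one-hot target vector is assumed to be precomputed, an explicit learning rate eta is added for generality, and with eta = 1 it matches the rule above exactly.

import numpy as np

def online_update_multiclass(W, x_i, y_onehot, eta=1.0):
    """One online step: w_k <- w_k + eta * (y_i^k - p(k|x_i)) x_i for every class k."""
    scores = W @ x_i                            # (K,) vector of w_k^T x_i
    scores = scores - scores.max()              # shift for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()   # p(k | x_i) for each class
    return W + eta * np.outer(y_onehot - p, x_i)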