CSE 5584 Evolutionary Computation, Winter Quarter 2003

Logistic Regression
Reading:
X. Z. Fern’s Notes on Logistic Regression
(linked from class website)
Discriminative vs. Generative Models
• Discriminative: NN, SVM, Logistic Regression
• Generative: Naïve Bayes, Markov Chains, etc.
• E.g.: http://projects.haykranen.nl/markov/demo/
• E.g., how could you use a Naïve Bayes model to learn to generate spam-like
messages? (a rough sketch follows this list)
• How could you use one to fool spam detectors?
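A rough sketch of one answer to the generation question above: a bag-of-words Naïve Bayes spam model learns P(word | spam), and sampling words from that distribution produces spam-like text. The vocabulary and probabilities below are invented purely for illustration, not taken from any real model.

```python
import numpy as np

# Hypothetical P(word | spam), as a trained bag-of-words Naive Bayes model
# would estimate it (values invented for illustration).
vocab = ["free", "winner", "click", "offer", "meeting", "report"]
p_word_given_spam = np.array([0.30, 0.25, 0.20, 0.15, 0.06, 0.04])

def generate_spamlike_message(length=8, seed=0):
    """Sample words independently from P(word | spam) -- the Naive Bayes
    conditional-independence assumption -- to produce spam-like text."""
    rng = np.random.default_rng(seed)
    return " ".join(rng.choice(vocab, size=length, p=p_word_given_spam))

print(generate_spamlike_message())
```

For the second question, one could instead sample from (or mix in) P(word | ham), so that the generated message scores as non-spam under the same kind of model.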
Learning Probabilistic Models
Goal is to learn a function mapping an instance
x = (x1, …, xn) to a probability distribution over classes y
I.e., learn P(y | x)
Bayesian learning
Computes P(y | x) by learning P(y) and P(x | y) from data, and computing

    P(y | x) ≈ P(x | y) P(y) / P(x)

Softmax
Computes P(y | x) by turning output values o_1, …, o_K into a probability distribution:

    P(y | x) = y_sm(o_i) = e^{o_i} / ∑_{k=1}^{K} e^{o_k}
[Figure: example with two output values, 0.461 and 0.455, which softmax maps to y_sm = .501 and y_sm = .499.]
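A minimal Python sketch of the softmax computation (the two output values are the ones from the example above):

```python
import numpy as np

def softmax(outputs):
    """Turn raw output values o_1, ..., o_K into a probability distribution."""
    shifted = outputs - np.max(outputs)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Two output values, as in the example above.
print(softmax(np.array([0.461, 0.455])))  # -> roughly [0.501, 0.499]
```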
Logistic Regression
Learns to directly map inputs to probabilities.

[Figure: diagram of a unit mapping inputs x1 and x2 (example values 0.4 and 0.1) through small example weights to an output P(class | x); the example node values shown are 0.547 and 0.470.]
Logistic Regression:
Binary Classification Case
For ease of notation, let x = (x0, x1, …, xn), where x0 = 1.
Let w = (w0, w1, …, wn), where w0 is the bias weight.
Logistic Regression: Binary Classification Case

    P(y = 1 | x, w) = 1 / (1 + e^{-w·x})
    P(y = 0 | x, w) = 1 - P(y = 1 | x, w) = e^{-w·x} / (1 + e^{-w·x})

where w·x = w0 x0 + w1 x1 + … + wn xn.
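A minimal Python sketch of this prediction step, assuming the bias convention above; the weight and input values are made up for illustration.

```python
import numpy as np

def predict_prob(w, x):
    """P(y = 1 | x, w) for binary logistic regression.
    Both vectors include the bias: x[0] = 1 and w[0] is the bias weight."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([-0.5, 0.1, 0.2])     # hypothetical weights (w[0] is the bias weight)
x = np.array([1.0, 0.4, 0.1])      # x0 = 1, then two feature values
p1 = predict_prob(w, x)            # P(y = 1 | x, w)
p0 = 1.0 - p1                      # P(y = 0 | x, w)
print(p1, p0)
```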
Logistic Regression: Learning Weights
Goal is to learn weights w.
Let (x^j, y^j) be the jth training example and its label.
We want:
    w ← argmax_w ∏_j P(y^j | x^j, w),   for training examples j = 1, …, M
This is equivalent to:
    w ← argmax_w ∑_j ln P(y^j | x^j, w)
This is called the “log conditional likelihood”.
We can write the log conditional likelihood this way:
    l(w) = ∑_j ln P(y^j | x^j, w)
         = ∑_j [ y^j ln P(y^j = 1 | x^j, w) + (1 - y^j) ln P(y^j = 0 | x^j, w) ]
since y^j is either 0 or 1.
This is what we want to maximize.
Use gradient ascent to maximize l(w).
This is called “Maximum likelihood estimation” (or MLE).
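A small self-contained Python sketch of l(w), useful for checking that a gradient step actually increases it; the toy data is made up for illustration.

```python
import numpy as np

def log_conditional_likelihood(w, X, y):
    """l(w) = sum_j [ y^j ln P(y^j=1 | x^j, w) + (1 - y^j) ln P(y^j=0 | x^j, w) ].
    X is an M x (n+1) array whose first column is the bias term x0 = 1;
    y is a length-M array of 0/1 labels."""
    p1 = 1.0 / (1.0 + np.exp(-(X @ w)))   # P(y^j = 1 | x^j, w) for every example
    return np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# Toy data: bias column of 1s plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(log_conditional_likelihood(np.array([0.0, 1.0]), X, y))
```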
Recall:

    P(y = 1 | x, w) = 1 / (1 + e^{-w·x}),    P(y = 0 | x, w) = e^{-w·x} / (1 + e^{-w·x})

We have:

    l(w) = ∑_j [ y^j ln P(y^j = 1 | x^j, w) + (1 - y^j) ln P(y^j = 0 | x^j, w) ]

Let's find the gradient with respect to w_i. Using the chain rule and algebra:

    ∂l(w)/∂w_i = ∑_j x_i^j ( y^j - P(y^j = 1 | x^j, w) )
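The chain-rule step is not written out above; one reconstruction, writing σ(z) = 1/(1 + e^{-z}) so that P(y^j = 1 | x^j, w) = σ(w·x^j), is:

```latex
\begin{aligned}
\frac{\partial \ell(w)}{\partial w_i}
 &= \sum_j \frac{\partial}{\partial w_i}
    \Big[\, y^j \ln \sigma(w \cdot x^j) + (1 - y^j)\,\ln\big(1 - \sigma(w \cdot x^j)\big) \Big] \\
 &= \sum_j \left[ \frac{y^j}{\sigma(w \cdot x^j)}
       - \frac{1 - y^j}{1 - \sigma(w \cdot x^j)} \right]
    \sigma(w \cdot x^j)\,\big(1 - \sigma(w \cdot x^j)\big)\, x_i^j \\
 &= \sum_j \big( y^j - \sigma(w \cdot x^j) \big)\, x_i^j ,
\end{aligned}
```

where the second line uses σ'(z) = σ(z)(1 - σ(z)) together with the chain rule, and the last line is algebra.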
Stochastic Gradient Ascent for Logistic Regression
• Start with small random initial weights, both positive and negative:
  w = (w0, w1, …, wn)
• Repeat until convergence, or for some maximum number of epochs:
      For each training example (x^j, y^j), update every weight i = 0, …, n:
          w_i ← w_i + η ( y^j - P(y^j = 1 | x^j, w) ) x_i^j
      where η is a small positive learning rate.

Note again that w includes the bias weight w0, and x includes the bias term x0 = 1.
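A self-contained Python sketch of this training loop. The learning rate, epoch count, weight-initialization range, and toy data are my own choices for illustration, and for simplicity it runs a fixed number of epochs rather than testing convergence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sga(X, y, eta=0.1, max_epochs=100, seed=0):
    """Stochastic gradient ascent on the log conditional likelihood l(w).
    X is M x (n+1) with a bias column of 1s; y holds 0/1 labels.
    Returns the weight vector w, where w[0] is the bias weight."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, size=X.shape[1])   # small random initial weights
    for _ in range(max_epochs):
        for j in rng.permutation(len(y)):            # one pass over the data = one epoch
            y_hat = sigmoid(w @ X[j])                # P(y = 1 | x^j, w)
            w += eta * (y[j] - y_hat) * X[j]         # gradient ascent step on example j
    return w

# Toy example: label 1 roughly when the feature is positive.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.7], [1.0, 1.8]])
y = np.array([0, 0, 1, 1])
w = train_logistic_sga(X, y)
print(w, sigmoid(X @ w))   # learned weights and P(y = 1 | x) for each row of X
```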
Homework 4, Part 2