Logistic Regression

Reading: X. Z. Fern's Notes on Logistic Regression (linked from the class website)

Discriminative vs. Generative Models

• Discriminative: NN, SVM, Logistic Regression
• Generative: Naïve Bayes, Markov Chains, etc.
• E.g.: http://projects.haykranen.nl/markov/demo/
• E.g., how could you use a Naïve Bayes model to learn to generate spam-like messages?
• How could you use one to fool spam detectors?

Learning Probabilistic Models

The goal is to learn a function mapping an instance $x = (x_1, \ldots, x_n)$ to a probability distribution over classes $y$; i.e., learn $P(y \mid x)$. Two approaches:

• Softmax: computes $P(y \mid x)$ by turning output values into a probability distribution:
  $y_{sm}(o_i) = \frac{e^{o_i}}{\sum_{k=1}^{K} e^{o_k}}$
• Bayesian learning: computes $P(y \mid x)$ by learning $P(y)$ and $P(x \mid y)$ from data, and applying Bayes' rule:
  $P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$

Logistic Regression

Logistic regression learns to directly map inputs $x$ to probabilities $P(\text{class} \mid x)$.

[Figure: example network with inputs $x_1 = 0.4$ and $x_2 = 0.1$, weighted connections, and softmax outputs $y_{sm} = .501$ and $y_{sm} = .499$.]

(A prediction code sketch appears at the end of these notes.)

Logistic Regression: Binary Classification Case

For ease of notation, let $x = (x_0, x_1, \ldots, x_n)$, where $x_0 = 1$, and let $w = (w_0, w_1, \ldots, w_n)$, where $w_0$ is the bias weight. Then

$P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}}$  and  $P(y = 0 \mid x, w) = 1 - P(y = 1 \mid x, w)$.

Logistic Regression: Learning Weights

The goal is to learn the weights $w$. Let $(x^j, y^j)$ be the $j$th training example and its label. We want

$w \leftarrow \arg\max_w \prod_j P(y^j \mid x^j, w)$, for training examples $j = 1, \ldots, M$.

This is equivalent to

$w \leftarrow \arg\max_w \sum_j \ln P(y^j \mid x^j, w)$,

which maximizes the "log conditional likelihood." We can write the log conditional likelihood this way:

$l(w) = \sum_j \ln P(y^j \mid x^j, w) = \sum_j \left[\, y^j \ln P(y^j = 1 \mid x^j, w) + (1 - y^j) \ln P(y^j = 0 \mid x^j, w) \,\right]$,

since $y^j$ is either 0 or 1. This is what we want to maximize; choosing the weights that maximize it is called maximum likelihood estimation (MLE). We use gradient ascent to maximize $l(w)$.

Recall that $P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w \cdot x}}$. Finding the gradient of $l(w)$ with respect to $w_i$, using the chain rule and algebra, gives

$\frac{\partial l(w)}{\partial w_i} = \sum_j x_i^j \left( y^j - P(y^j = 1 \mid x^j, w) \right)$.

Stochastic Gradient Ascent for Logistic Regression

• Start with small random initial weights, both positive and negative: $w = (w_0, w_1, \ldots, w_n)$.
• Repeat until convergence, or for some maximum number of epochs: for each training example $(x^j, y^j)$, update every weight
  $w_i \leftarrow w_i + \eta\, x_i^j \left( y^j - P(y^j = 1 \mid x^j, w) \right)$,
  where $\eta$ is the learning rate.

Note again that $w$ includes the bias weight $w_0$, and $x$ includes the bias term $x_0 = 1$. (A training code sketch appears at the end of these notes.)

Homework 4, Part 2
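Code Sketches

Below is a minimal Python/NumPy sketch of the two ways of producing probabilities described above: the softmax over a vector of output values, and the sigmoid used by binary logistic regression with a bias term $x_0 = 1$. The function names (softmax, sigmoid, predict_prob), the max-subtraction stability trick, and the example numbers are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def softmax(o):
    # y_sm(o_i) = exp(o_i) / sum_k exp(o_k); subtracting max(o) first
    # avoids overflow without changing the result.
    e = np.exp(o - np.max(o))
    return e / e.sum()

def sigmoid(z):
    # Logistic (sigmoid) function: 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(w, x):
    # P(y = 1 | x, w) for binary logistic regression.
    # Assumes x already includes the bias term x0 = 1 and w includes w0.
    return sigmoid(np.dot(w, x))

# Usage (made-up numbers): two nearly equal outputs give softmax values near 0.5.
print(softmax(np.array([0.501, 0.499])))            # ~[0.5005, 0.4995]
print(predict_prob(np.array([-0.5, 0.1, 0.2]),      # w = (w0, w1, w2)
                   np.array([1.0, 0.4, 0.1])))      # x = (1, x1, x2)
```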
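Here is a minimal Python/NumPy sketch of the stochastic gradient ascent procedure, along with the log conditional likelihood $l(w)$ it maximizes. The learning rate value, epoch count, weight-initialization range, and the toy data set are illustrative assumptions, not values given in the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_conditional_likelihood(w, X, y):
    # l(w) = sum_j [ y^j ln P(y^j=1|x^j,w) + (1 - y^j) ln P(y^j=0|x^j,w) ].
    # X is an M x (n+1) array whose first column is the bias term x0 = 1.
    p = sigmoid(X @ w)
    eps = 1e-12                                    # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def train_logistic_sga(X, y, eta=0.1, epochs=100, seed=0):
    # Stochastic gradient ascent: for each example, move each weight in the
    # gradient direction: w_i <- w_i + eta * x_i^j * (y^j - P(y^j=1|x^j,w)).
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, size=X.shape[1])  # small random initial weights
    for _ in range(epochs):
        for j in rng.permutation(len(y)):          # visit examples in random order
            error = y[j] - sigmoid(np.dot(w, X[j]))
            w += eta * error * X[j]
    return w

# Usage on a tiny made-up data set (bias column of 1s prepended to each x).
X = np.array([[1, 0.0, 0.1], [1, 0.2, 0.0], [1, 0.9, 1.0], [1, 1.0, 0.8]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)
w = train_logistic_sga(X, y)
print(w, log_conditional_likelihood(w, X, y), sigmoid(X @ w))
```

The update inside the inner loop is exactly the per-example term of the gradient derived above, scaled by the learning rate; summing it over all examples before updating would give batch gradient ascent instead.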