Machine Learning for NLP
Linear Models
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research
Outline

- Last time:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
- Today:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
- Next time:
  - Naive Bayes classifiers
  - Generative and discriminative models
Perceptron Summary

- Learns a linear classifier that minimizes error
  - Guaranteed to find a separating w in a finite amount of time (if the data are separable)
  - Improvement 1: shuffle the training data between iterations
  - Improvement 2: average the weight vectors seen during training
- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation (see the sketch below):

        w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split
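A minimal sketch of the online perceptron with both improvements, assuming a joint feature function feats(x, y) that returns a NumPy vector; the function and data names are illustrative, not part of the slides.

```python
import numpy as np

def perceptron_train(data, labels, feats, dim, epochs=10, seed=0):
    """Averaged multiclass perceptron (online learning, one instance at a time)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    w_sum = np.zeros(dim)              # running sum for weight averaging
    steps = 0
    for _ in range(epochs):
        rng.shuffle(data)              # Improvement 1: shuffle between iterations
        for x, y in data:
            # Predict with the current linear model: argmax_y' w . f(x, y')
            y_hat = max(labels, key=lambda yy: w @ feats(x, yy))
            if y_hat != y:
                # Update on this single instance: w <- w + f(x, y) - f(x, y_hat)
                w = w + feats(x, y) - feats(x, y_hat)
            w_sum += w                 # Improvement 2: average the weight vectors
            steps += 1
    return w_sum / steps               # averaged weights
```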
Margin

[Figure: two panels, Training and Testing, showing a linear separator; the value of the margin is denoted by γ]
Maximizing Margin

- For a training set T, the margin of a weight vector w is the largest γ such that

      w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ

  for every training instance (x_t, y_t) ∈ T and every y′ ∈ Ȳ_t
Maximizing Margin

- Intuitively, maximizing the margin makes sense
- More importantly, generalization error on unseen test data is proportional to the inverse of the (squared) margin:

      error ∝ R² / (γ² × |T|)

- Perceptron: we have shown that:
  - If a training set is separable by some margin, the perceptron will find a w that separates the data
  - But the perceptron does not pick w to maximize the margin!
Maximizing Margin

Let γ > 0

    max_{||w|| ≤ 1}  γ

    such that:
        w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ
        ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t

- Note: the algorithm still minimizes error (every constraint requires the correct label to score highest)
- ||w|| is bounded, since rescaling w would trivially produce a larger margin:

      β(w · f(x_t, y_t) − w · f(x_t, y′)) ≥ βγ,  for any β ≥ 1
Max Margin = Min Norm

Let γ > 0. The following two problems are equivalent:

Max Margin:

    max_{||w|| ≤ 1}  γ

    such that:
        w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ
        ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t

Min Norm:

    min_w  (1/2) ||w||²

    such that:
        w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1
        ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t

- Instead of fixing ||w||, we fix the margin γ = 1
- Technically, γ ∝ 1/||w||
Support Vector Machines

    min_w  (1/2) ||w||²

    such that:
        w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1
        ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t

- A quadratic programming problem
- Can be solved with out-of-the-box algorithms
- A batch learning algorithm – w is set w.r.t. all training points
Support Vector Machines

- Problem: sometimes |T| is far too large
- Thus the number of constraints might make solving the quadratic programming problem very difficult
- Common technique: Sequential Minimal Optimization (SMO)
- Sparse: the solution depends only on the features in the support vectors
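As a hedged illustration (not from the slides): for ordinary vector-valued inputs, a linear multiclass SVM of this form can be trained with an off-the-shelf solver such as scikit-learn's LinearSVC; the toy data and parameters below are made up.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: 4-dimensional feature vectors with three classes (illustrative only)
X = np.array([[1.0, 0.0, 0.2, 0.0],
              [0.9, 0.1, 0.0, 0.3],
              [0.0, 1.0, 0.1, 0.0],
              [0.1, 0.9, 0.0, 0.2],
              [0.0, 0.2, 1.0, 0.1],
              [0.2, 0.0, 0.9, 0.0]])
y = np.array([0, 0, 1, 1, 2, 2])

# Batch learning: the weights are set with respect to all training points at once
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.predict([[0.8, 0.1, 0.1, 0.0]]))   # expected: class 0
```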
Margin Infused Relaxed Algorithm (MIRA)

- Another option – maximize the margin using an online algorithm
- Batch vs. online:
  - Batch – update based on the entire training set (SVM)
  - Online – update based on one instance at a time (Perceptron)
- MIRA – a max-margin perceptron, or an online SVM
MIRA

Batch (SVMs):

    min_w  (1/2) ||w||²

    such that:
        w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1
        ∀(x_t, y_t) ∈ T and y′ ∈ Ȳ_t

Online (MIRA):

    Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

    1. w^(0) = 0; i = 0
    2. for n : 1..N
    3.   for t : 1..|T|
    4.     w^(i+1) = arg min_{w*} ||w* − w^(i)||
               such that:
               w* · f(x_t, y_t) − w* · f(x_t, y′) ≥ 1   ∀ y′ ∈ Ȳ_t
    5.     i = i + 1
    6. return w^(i)

- MIRA solves much smaller optimization problems, with only |Ȳ_t| constraints per update
- Cost: each update is sub-optimal with respect to the full training set (see the sketch below)
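A minimal sketch (not from the slides) of the common 1-best simplification of MIRA, where only the single highest-scoring wrong label is constrained; with one constraint, the quadratic program in step 4 has the closed-form solution used below. The feats function, labels, and data are hypothetical.

```python
import numpy as np

def mira_update(w, f_correct, f_wrong):
    """Smallest change to w so that the correct label wins by a margin of 1."""
    diff = f_correct - f_wrong
    loss = 1.0 - w @ diff              # amount by which the constraint is violated
    if loss <= 0.0 or not diff.any():  # constraint already satisfied: keep w
        return w
    tau = loss / (diff @ diff)         # minimal step length (closed form)
    return w + tau * diff

def mira_train(data, labels, feats, dim, epochs=10):
    """Online training: one small constrained optimization per instance."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in data:
            # Highest-scoring wrong label under the current weights
            y_prime = max((yy for yy in labels if yy != y),
                          key=lambda yy: w @ feats(x, yy))
            w = mira_update(w, feats(x, y), feats(x, y_prime))
    return w
```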
Interim Summary

What we have covered:

- Linear classifiers:
  - Perceptron
  - SVMs
  - MIRA
- All are trained to minimize error
  - With or without maximizing the margin
  - Online or batch

What is next:

- Logistic regression / Maximum Entropy
  - Train linear classifiers to maximize likelihood
Logistic Regression / Maximum Entropy

Define a conditional probability:

    P(y|x) = e^(w · f(x,y)) / Z_x,   where Z_x = Σ_{y′ ∈ Y} e^(w · f(x,y′))

Note: still a linear classifier

    arg max_y P(y|x) = arg max_y e^(w · f(x,y)) / Z_x
                     = arg max_y e^(w · f(x,y))
                     = arg max_y w · f(x, y)
Log-Linear Models

- Linear model:

      f(x, y) · w

- Make scores positive:

      exp [f(x, y) · w]

- Normalize:

      P(y|x) = exp [f(x, y) · w] / Σ_{i=1}^{n} exp [f(x, y_i) · w]
Log-Linear Models

- Crash course in exponentiation:

      exp x = a^x  (for some base a)

- Note:

      0 < exp x < 1   if x < 0
      exp x = 1       if x = 0
      1 < exp x       if x > 0

- The inverse of exponentiation is the logarithm:

      log exp x = x

- Hence, the log-linear model is linear in log(arithmic) space
Log-Linear Models

- Suppose we have (only) two classes with the following scores:

      f(x, y_1) · w = 1.0
      f(x, y_2) · w = −2.0

- Using base 2, we have:

      exp [f(x, y_1) · w] = 2
      exp [f(x, y_2) · w] = 0.25

- Normalizing, we get:

      P(y_1|x) = exp [f(x, y_1) · w] / (exp [f(x, y_1) · w] + exp [f(x, y_2) · w]) = 2 / 2.25 = 0.89
      P(y_2|x) = exp [f(x, y_2) · w] / (exp [f(x, y_1) · w] + exp [f(x, y_2) · w]) = 0.25 / 2.25 = 0.11
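A quick check of these numbers in code (a sketch, not from the slides), using base-2 exponentiation as in the example:

```python
import numpy as np

scores = np.array([1.0, -2.0])         # f(x, y_1) . w and f(x, y_2) . w
exp_scores = 2.0 ** scores             # base-2 exponentiation: [2.0, 0.25]
probs = exp_scores / exp_scores.sum()  # normalize over both classes
print(probs.round(2))                  # [0.89 0.11]
```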
Logistic Regression / Maximum Entropy

    P(y|x) = e^(w · f(x,y)) / Z_x

- Q: How do we learn the weights w?
- A: Set the weights to maximize the log-likelihood of the training data:

      w = arg max_w Π_t P(y_t|x_t) = arg max_w Σ_t log P(y_t|x_t)

- In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label y for each x in the training set
Aside: Min error versus max log-likelihood

- Highly related but not identical
- Example: consider a training set T with 1001 points

      1000 points with y = 0:  f(x_i, y = 0) = [−1, 1, 0, 0]   for i = 1 ... 1000
      1 point with y = 1:      f(x_1001, y = 1) = [0, 0, 3, 1]

- Now consider w = [−1, 0, 1, 0]
- The error in this case is 0 – so w minimizes error:

      [−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
      [−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3

- However, the log-likelihood = −126.9 (calculation omitted)
Aside: Min error versus max log-likelihood

- Highly related but not identical
- Example: consider a training set T with 1001 points

      1000 points with y = 0:  f(x_i, y = 0) = [−1, 1, 0, 0]   for i = 1 ... 1000
      1 point with y = 1:      f(x_1001, y = 1) = [0, 0, 3, 1]

- Now consider w = [−1, 7, 1, 0]
- The error in this case is 1 – so w does not minimize error:

      [−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
      [−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4

- However, the log-likelihood = −1.4
- Better log-likelihood and worse error (see the check below)
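A sketch (not part of the slides) that reproduces the two log-likelihoods above; a natural-log softmax over the two labels is assumed.

```python
import numpy as np

# Joint feature vectors for the two labels, as in the example
f_y0 = np.array([-1., 1., 0., 0.])      # f(x_i, y=0),    i = 1 ... 1000
f_y1 = np.array([0., 0., -1., 1.])      # f(x_i, y=1)
g_y0 = np.array([3., 1., 0., 0.])       # f(x_1001, y=0)
g_y1 = np.array([0., 0., 3., 1.])       # f(x_1001, y=1)

def log_likelihood(w):
    def log_p(f_correct, f_other):
        scores = np.array([w @ f_correct, w @ f_other])
        return scores[0] - np.log(np.exp(scores).sum())  # log P(correct | x)
    return 1000 * log_p(f_y0, f_y1) + log_p(g_y1, g_y0)

print(round(log_likelihood(np.array([-1., 0., 1., 0.])), 1))  # -126.9 (0 errors)
print(round(log_likelihood(np.array([-1., 7., 1., 0.])), 1))  # -1.4   (1 error)
```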
Aside: Min error versus max log-likelihood

- Max likelihood ≠ min error
- Max likelihood pushes as much probability as possible onto the correct labeling of each training instance
  - Even at the cost of mislabeling a few examples
- Min error forces all training instances to be correctly classified
- SVMs with slack variables – allow some examples to be classified wrong if the resulting margin is improved on other examples
Logistic Regression

    P(y|x) = e^(w · f(x,y)) / Z_x,   where Z_x = Σ_{y′ ∈ Y} e^(w · f(x,y′))

    w = arg max_w Σ_t log P(y_t|x_t)

    w = arg min_w − Σ_t log P(y_t|x_t)   (*)

- The objective function (*) is convex (equivalently, the log-likelihood is concave)
- Therefore there is a global minimum (maximum)
- No closed-form solution, but lots of numerical techniques
Gradient Descent

- We want to minimize the negative log-likelihood
- Convexity guarantees a single minimum
- Gradient descent (a code sketch follows below):
  1. Guess an initial weight vector w_0 (e.g., all weights 0.0)
  2. Repeat until convergence:
     2.1 Use the gradient of the objective at w_i to determine the descent direction
     2.2 Update w_{i+1} ← w_i − (step size × gradient)
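A minimal sketch of batch gradient descent for the negative log-likelihood (illustrative, not from the slides). It assumes the joint feature vectors f(x_t, y) are precomputed into a dense array F; the gradient used here is the expected-minus-empirical feature counts discussed on the next slide, and the step size and epoch count are made up.

```python
import numpy as np

def train_logreg(F, gold, epochs=200, step=0.1):
    """Batch gradient descent on the negative log-likelihood.

    F[t, y] is the joint feature vector f(x_t, y), shape (T, |Y|, d);
    gold[t] is the index of the correct label y_t.
    """
    T, n_labels, d = F.shape
    w = np.zeros(d)                                  # 1. initial guess: all zeros
    for _ in range(epochs):                          # 2. repeat (fixed number of steps here)
        scores = F @ w                               # w . f(x_t, y), shape (T, |Y|)
        scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)    # P(y | x_t)
        # Gradient of the negative log-likelihood: expected minus empirical counts
        expected = np.einsum('ty,tyd->d', probs, F)
        empirical = F[np.arange(T), gold].sum(axis=0)
        w -= step * (expected - empirical)           # 2.2 step in the descent direction
    return w
```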
Logistic Regression = Maximum Entropy

- A well-known equivalence
- Max Ent: maximize entropy subject to constraints on the features
  - Empirical feature counts must equal expected counts
- Quick intuition
  - Partial derivative in logistic regression:

        ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′ ∈ Y} P(y′|x_t) f_i(x_t, y′)

  - Difference: empirical counts − expected counts
  - Setting the derivative to zero maximizes the function
  - Equal counts optimize the logistic regression objective!
Linear Models

- Basic form of a linear (multiclass) classifier:

      y* = arg max_y w · f(x, y)

- Different learning objectives:
  - Perceptron – separate the data (0-1 loss)
  - SVM/MIRA – maximize the margin (hinge loss)
  - Logistic regression – maximize the likelihood (log loss)
- Generalized learning objective:

      arg min_w Σ_{i=1}^{n} ℓ(y_i, arg max_y w · f(x_i, y))
Regularization

- Regularized learning objective:

      arg min_w Σ_{i=1}^{n} ℓ(y_i, arg max_y w · f(x_i, y)) + λ R(w)

- R(w) prevents the weights from getting too large (overfitting)
- Common regularization functions (see the sketch below):

      L1 norm:           R(w) = Σ_i |w_i|    – promotes sparse weights
      L2 norm (squared): R(w) = Σ_i w_i²     – promotes dense weights
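As a hedged illustration (not from the slides), the practical effect of L1 vs. L2 regularization can be seen with scikit-learn's LogisticRegression, where the penalty argument selects R(w) and C plays the role of 1/λ; the synthetic data below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Only the first two features actually matter; the rest are noise
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

print('nonzero weights with L1:', int(np.sum(l1.coef_ != 0)))  # sparse
print('nonzero weights with L2:', int(np.sum(l2.coef_ != 0)))  # dense (typically all 20)
```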