Machine Learning for NLP
Linear Models
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research
Outline
- Last time:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
- Today:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
- Next time:
  - Naive Bayes classifiers
  - Generative and discriminative models

Perceptron Summary
- Learns a linear classifier that minimizes error
  - Guaranteed to find a separating w in a finite amount of time (if the data are separable)
  - Improvement 1: shuffle the training data between iterations
  - Improvement 2: average the weight vectors seen during training
- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation (see the sketch below):
    w(i+1) = w(i) + f(xt, yt) − f(xt, y′)
- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split

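As a concrete illustration, here is a minimal sketch of the averaged multiclass perceptron in Python. The names (`features`, `labels`, `n_dims`) and the data layout are assumptions added for the example, not part of the slides: `features(x, y)` is taken to return the joint feature vector f(x, y) as a NumPy array.

```python
import random
import numpy as np

def perceptron_train(data, features, labels, n_dims, n_epochs=10, seed=0):
    """Online multiclass perceptron with shuffling and weight averaging.

    data: list of (x, y) training pairs; features(x, y) -> np.ndarray of shape (n_dims,)."""
    rng = random.Random(seed)
    w = np.zeros(n_dims)
    w_sum = np.zeros(n_dims)          # running sum of weight vectors (Improvement 2)
    n_steps = 0
    for _ in range(n_epochs):
        rng.shuffle(data)             # Improvement 1: shuffle between iterations
        for x, y_true in data:
            # predict the highest-scoring label under the current weights
            y_pred = max(labels, key=lambda y: w @ features(x, y))
            if y_pred != y_true:
                # w(i+1) = w(i) + f(xt, yt) - f(xt, y')
                w = w + features(x, y_true) - features(x, y_pred)
            w_sum += w
            n_steps += 1
    return w_sum / n_steps            # averaged weight vector

```

Returning the averaged vector rather than the final w is exactly the point of Improvement 2: the average usually generalizes better.
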
Margin
[Figure: linearly separated training and testing data; the value of the margin is denoted by γ]

Maximizing Margin
- For a training set T, the margin of a weight vector w is the smallest γ such that

    w · f(xt, yt) − w · f(xt, y′) ≥ γ

  for every training instance (xt, yt) ∈ T and every y′ ∈ Ȳt

Maximizing Margin
- Intuitively, maximizing the margin makes sense
- More importantly, the generalization error on unseen test data is proportional to the inverse of the margin:

    error ∝ R² / (γ² × |T|)

- Perceptron: we have shown that
  - if a training set is separable by some margin, the perceptron will find a w that separates the data
  - but the perceptron does not pick w to maximize the margin!

Maximizing Margin
Let γ > 0

    max_{||w||≤1} γ

    such that:
      w · f(xt, yt) − w · f(xt, y′) ≥ γ
      ∀(xt, yt) ∈ T and y′ ∈ Ȳt

- Note: the algorithm still minimizes error
- ||w|| is bounded, since scaling w trivially produces a larger margin:
    β(w · f(xt, yt) − w · f(xt, y′)) ≥ βγ, for any β ≥ 1

Max Margin = Min Norm
Let γ > 0

Max Margin:

    max_{||w||≤1} γ

    such that:
      w · f(xt, yt) − w · f(xt, y′) ≥ γ
      ∀(xt, yt) ∈ T and y′ ∈ Ȳt

is equivalent to Min Norm:

    min_w (1/2) ||w||²

    such that:
      w · f(xt, yt) − w · f(xt, y′) ≥ 1
      ∀(xt, yt) ∈ T and y′ ∈ Ȳt

- Instead of fixing ||w||, we fix the margin γ = 1
- Technically, γ ∝ 1/||w||

Support Vector Machines
    min_w (1/2) ||w||²

    such that:
      w · f(xt, yt) − w · f(xt, y′) ≥ 1
      ∀(xt, yt) ∈ T and y′ ∈ Ȳt

- A quadratic programming problem
- Can be solved with out-of-the-box algorithms
- A batch learning algorithm: w is set w.r.t. all training points

Support Vector Machines
- Problem: sometimes |T| is far too large
- Thus the number of constraints might make solving the quadratic programming problem very difficult
- Common technique: Sequential Minimal Optimization (SMO); see the sketch below
- Sparse: the solution depends only on the features in the support vectors

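As a minimal sketch of using an out-of-the-box solver: scikit-learn's `SVC` with a linear kernel trains with an SMO-style solver (via libsvm). The toy data and the very large `C` (to approximate the hard-margin objective above) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D binary data (illustrative only)
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# Linear-kernel SVC; a very large C approximates the hard-margin
# objective min (1/2)||w||^2 subject to the margin constraints
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the learned w and bias
print(clf.support_vectors_)        # the solution depends only on these points
```
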
Margin Infused Relaxed Algorithm (MIRA)
- Another option: maximize the margin using an online algorithm
- Batch vs. online:
  - Batch: update based on the entire training set (SVM)
  - Online: update based on one instance at a time (Perceptron)
- MIRA: a max-margin perceptron, or an online SVM

MIRA
Batch (SVMs):

    min_w (1/2) ||w||²
    such that:
      w · f(xt, yt) − w · f(xt, y′) ≥ 1
      ∀(xt, yt) ∈ T and y′ ∈ Ȳt

Online (MIRA):

    Training data: T = {(xt, yt)}, t = 1 .. |T|
    1. w(0) = 0; i = 0
    2. for n : 1..N
    3.   for t : 1..|T|
    4.     w(i+1) = arg min_{w*} ||w* − w(i)|| such that:
             w* · f(xt, yt) − w* · f(xt, y′) ≥ 1, ∀y′ ∈ Ȳt
    5.     i = i + 1
    6. return w(i)

- MIRA solves much smaller optimization problems, with only |Ȳt| constraints each
- Cost: sub-optimal optimization overall (see the single-constraint sketch below)

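In the common simplification where only the single highest-scoring wrong label y′ is used (one constraint instead of |Ȳt|), the mini-QP in step 4 has a closed-form solution. A minimal sketch of that single-constraint update, with `f_gold` = f(xt, yt) and `f_pred` = f(xt, y′) assumed to be NumPy arrays:

```python
import numpy as np

def mira_update(w, f_gold, f_pred):
    """Single-constraint MIRA step: the smallest change to w that makes the
    gold label beat the highest-scoring wrong label by a margin of at least 1."""
    diff = f_gold - f_pred
    slack = 1.0 - w @ diff            # how far the margin constraint is from holding
    if slack <= 0.0:
        return w                      # constraint already satisfied: leave w unchanged
    # closed-form solution of the mini-QP (assumes f_gold != f_pred when violated)
    tau = slack / (diff @ diff)
    return w + tau * diff
```
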
Interim Summary
What we have covered:
- Linear classifiers:
  - Perceptron
  - SVMs
  - MIRA
- All are trained to minimize error
  - With or without maximizing the margin
  - Online or batch

What is next:
- Logistic Regression / Maximum Entropy
  - Trains linear classifiers to maximize likelihood

Logistic Regression / Maximum Entropy
Define a conditional probability:

    P(y|x) = e^(w·f(x,y)) / Zx,   where Zx = Σ_{y′∈Y} e^(w·f(x,y′))

Note: this is still a linear classifier:

    arg max_y P(y|x) = arg max_y e^(w·f(x,y)) / Zx
                     = arg max_y e^(w·f(x,y))
                     = arg max_y w · f(x, y)

Log-Linear Models
- Linear model:
    f(x, y) · w
- Make scores positive:
    exp [f(x, y) · w]
- Normalize (see the sketch below):
    P(y|x) = exp [f(x, y) · w] / Σ_{i=1}^{n} exp [f(x, yi) · w]

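A minimal sketch of this computation in Python; the helper name `feats(x, y)` (returning a NumPy feature vector) and the max-subtraction trick for numerical stability are assumptions added for the example:

```python
import numpy as np

def predict_proba(w, x, labels, feats):
    """P(y|x) for a log-linear model: exponentiate scores, then normalize."""
    scores = np.array([w @ feats(x, y) for y in labels])
    scores -= scores.max()                 # shift scores before exp for numerical stability
    exp_scores = np.exp(scores)            # make scores positive
    return exp_scores / exp_scores.sum()   # divide by Z_x so the values sum to 1
```
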
Log-Linear Models
- Crash course in exponentiation: exp x = a^x (for some base a)
- Note:
    0 < exp x < 1   if x < 0
        exp x = 1   if x = 0
        1 < exp x   if x > 0
- The inverse of exponentiation is the logarithm: log exp x = x
- Hence, the log-linear model is linear in log(arithmic) space

Log-Linear Models
- Suppose we have (only) two classes with the following scores:
    f(x, y1) · w = 1.0
    f(x, y2) · w = −2.0
- Using base 2, we have:
    exp[f(x, y1) · w] = 2
    exp[f(x, y2) · w] = 0.25
- Normalizing, we get (verified in the sketch below):
    P(y1|x) = exp[f(x, y1) · w] / (exp[f(x, y1) · w] + exp[f(x, y2) · w]) = 2 / 2.25 = 0.89
    P(y2|x) = exp[f(x, y2) · w] / (exp[f(x, y1) · w] + exp[f(x, y2) · w]) = 0.25 / 2.25 = 0.11

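The same arithmetic, spelled out as a tiny Python check (base 2, as on the slide):

```python
scores = [1.0, -2.0]                  # f(x, y1).w and f(x, y2).w
exps = [2.0 ** s for s in scores]     # base-2 exponentiation: [2.0, 0.25]
Z = sum(exps)                         # normalization constant: 2.25
probs = [e / Z for e in exps]         # [0.888..., 0.111...]
print(probs)
```
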
Logistic Regression / Maximum Entropy
    P(y|x) = e^(w·f(x,y)) / Zx

- Q: How do we learn the weights w?
- A: Set the weights to maximize the log-likelihood of the training data:

    w = arg max_w Π_t P(yt|xt) = arg max_w Σ_t log P(yt|xt)

- In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label y for each x in the training set

Aside: Min error versus max log-likelihood
- Highly related, but not identical
- Example: consider a training set T with 1001 points:
    1000 × (xi, y = 0), for i = 1 . . . 1000, with f(xi, 0) = [−1, 1, 0, 0] and f(xi, 1) = [0, 0, −1, 1]
    1 × (x1001, y = 1), with f(x1001, 1) = [0, 0, 3, 1] and f(x1001, 0) = [3, 1, 0, 0]
- Now consider w = [−1, 0, 1, 0]
- The error in this case is 0, so w minimizes error:
    [−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
    [−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3
- However, the log-likelihood is only −126.9 (calculation omitted here; see the sketch after the next slide)

Aside: Min error versus max log-likelihood
- Highly related, but not identical
- Example: the same training set T with 1001 points:
    1000 × (xi, y = 0), for i = 1 . . . 1000, with f(xi, 0) = [−1, 1, 0, 0] and f(xi, 1) = [0, 0, −1, 1]
    1 × (x1001, y = 1), with f(x1001, 1) = [0, 0, 3, 1] and f(x1001, 0) = [3, 1, 0, 0]
- Now consider w = [−1, 7, 1, 0]
- The error in this case is 1, so w does not minimize error:
    [−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
    [−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4
- However, the log-likelihood is −1.4
- Better log-likelihood, but worse error (both log-likelihoods are reproduced in the sketch below)

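The two omitted log-likelihood values are easy to reproduce. A small sketch, using base e as in the model definition and encoding each instance as the pair of its correct and incorrect feature vectors (an encoding chosen just for this example):

```python
import numpy as np

def log_likelihood(w, instances):
    """Sum of log P(y_t | x_t) over instances given as (f_correct, f_wrong) pairs."""
    total = 0.0
    for f_correct, f_wrong in instances:
        scores = np.array([w @ f_correct, w @ f_wrong])
        m = scores.max()
        log_Z = m + np.log(np.exp(scores - m).sum())   # log of the normalizer Z_x
        total += scores[0] - log_Z                     # log P(correct label | x)
    return total

# 1000 copies of the first instance type plus the single 1001st instance
data = [(np.array([-1.0, 1.0, 0.0, 0.0]), np.array([0.0, 0.0, -1.0, 1.0]))] * 1000 \
     + [(np.array([0.0, 0.0, 3.0, 1.0]), np.array([3.0, 1.0, 0.0, 0.0]))]

print(round(log_likelihood(np.array([-1.0, 0.0, 1.0, 0.0]), data), 1))  # about -126.9
print(round(log_likelihood(np.array([-1.0, 7.0, 1.0, 0.0]), data), 1))  # about -1.4
```
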
Aside: Min error versus max log-likelihood
- Max likelihood ≠ min error
- Max likelihood pushes as much probability as possible onto the correct labeling of each training instance
  - Even at the cost of mislabeling a few examples
- Min error forces all training instances to be correctly classified
- SVMs with slack variables allow some examples to be classified wrongly if the resulting margin is improved on other examples

Logistic Regression
    P(y|x) = e^(w·f(x,y)) / Zx,   where Zx = Σ_{y′∈Y} e^(w·f(x,y′))

    w = arg max_w Σ_t log P(yt|xt)
      = arg min_w − Σ_t log P(yt|xt)   (*)

- The objective function is concave when maximized and, in the form (*), convex when minimized
- Therefore there is a single global maximum/minimum
- There is no closed-form solution, but lots of numerical techniques apply

Gradient Descent
- We want to minimize the negative log-likelihood
- Convexity guarantees a single minimum
- Gradient descent (see the sketch below):
  1. Guess an initial weight vector w(0) (e.g., all weights 0.0)
  2. Repeat until convergence:
     2.1 Use the gradient at w(i) to determine the descent direction
     2.2 Update w(i+1) ← w(i) + a step in the descent direction

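A minimal sketch of this loop in Python; the step size, iteration cap, and convergence test are illustrative choices, and `grad_fn` stands for whatever computes the gradient of the negative log-likelihood (one such function is sketched after the next slide):

```python
import numpy as np

def gradient_descent(grad_fn, n_dims, step_size=0.1, n_iters=1000, tol=1e-6):
    """Plain gradient descent on a convex objective such as the negative log-likelihood."""
    w = np.zeros(n_dims)                  # step 1: initial guess, all weights 0.0
    for _ in range(n_iters):              # step 2: repeat until convergence
        g = grad_fn(w)                    # 2.1: the gradient gives the descent direction (-g)
        w_new = w - step_size * g         # 2.2: take a step against the gradient
        if np.linalg.norm(w_new - w) < tol:
            return w_new                  # converged: weights barely changed
        w = w_new
    return w
```
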
Logistic Regression = Maximum Entropy
- Well-known equivalence
- Max Ent: maximize entropy subject to constraints on features
  - Empirical feature counts must equal expected counts
- Quick intuition:
  - Partial derivative in logistic regression:

      ∂F(w)/∂wi = Σ_t fi(xt, yt) − Σ_t Σ_{y′∈Y} P(y′|xt) fi(xt, y′)

  - Difference: empirical counts − expected counts (see the gradient sketch below)
  - Setting the derivative to zero maximizes the function
  - Equal counts optimize the logistic regression objective!

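This derivative is also what gradient descent needs (with the sign flipped, since we minimize the negative log-likelihood). A minimal sketch, reusing the assumed `feats(x, y)` helper from the earlier log-linear example:

```python
import numpy as np

def neg_ll_gradient(w, data, labels, feats):
    """Gradient of the negative log-likelihood: expected minus empirical feature counts."""
    grad = np.zeros_like(w)
    for x, y_true in data:
        scores = np.array([w @ feats(x, y) for y in labels])
        scores -= scores.max()                            # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()     # P(y'|x) for every y'
        expected = sum(p * feats(x, y) for p, y in zip(probs, labels))
        grad += expected - feats(x, y_true)               # expected - empirical (sign flipped)
    return grad
```

Plugging this into the gradient descent sketch above (e.g. via a small lambda that fixes data, labels, and feats) is enough to train the model.
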
Linear Models
- Basic form of a linear (multiclass) classifier:

    y* = arg max_y w · f(x, y)

- Different learning objectives:
  - Perceptron: separate the data (0-1 loss)
  - SVM/MIRA: maximize the margin (hinge loss)
  - Logistic regression: maximize likelihood (log loss)
- Generalized learning objective (the losses are sketched below):

    arg min_w Σ_{i=1}^{n} ℓ(yi, arg max_y w · f(xi, y))

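One way to compare the three objectives is to write each loss as a function of the margin m = w · f(x, y) − w · f(x, y′) between the correct label and the best wrong label. This binary/margin view is a simplification of the multiclass notation above, added for illustration:

```python
import numpy as np

def zero_one_loss(m):   # Perceptron-style error: 1 iff the correct label does not win
    return float(m <= 0)

def hinge_loss(m):      # SVM/MIRA: penalize any margin smaller than 1
    return max(0.0, 1.0 - m)

def log_loss(m):        # Logistic regression: negative log-likelihood of the correct label
    return float(np.log1p(np.exp(-m)))

for m in [-1.0, 0.0, 0.5, 2.0]:   # a few illustrative margin values
    print(m, zero_one_loss(m), hinge_loss(m), round(log_loss(m), 3))
```
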
Regularization
- Regularized learning objective:

    arg min_w Σ_{i=1}^{n} ℓ(yi, arg max_y w · f(xi, y)) + λ R(w)

- R(w) prevents the weights from getting too large (overfitting)
- Common regularization functions (see the sketch below):
    L1 norm: R(w) = Σ_i |wi|   (promotes sparse weights)
    L2 norm: R(w) = Σ_i wi²    (promotes dense weights)

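A tiny numeric illustration of the two penalties; the weight vector and the regularization strength are made-up values:

```python
import numpy as np

w = np.array([0.0, -2.0, 0.5, 0.0])

l1 = np.abs(w).sum()    # L1 norm = 2.5: sum of absolute values, favors exact zeros (sparsity)
l2 = (w ** 2).sum()     # squared L2 norm = 4.25: sum of squares, favors many small weights

lam = 0.1                                # illustrative regularization strength
print(l1, l2, "penalty:", lam * l2)      # the term added to the training loss
```
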