Maximum Entropy

CL1: Jordan Boyd-Graber
University of Maryland
October 21, 2013
Adapted from material by Robert Malouf, Philipp Koehn, and Matthew Leingang
Roadmap
Why we need a more powerful probabilistic modeling formalism
Example: POS tagging
Introducing key concepts from information theory
Maximum Entropy Models
  Formulation
  Estimation
Outline
1. Motivation: Supervised POS Tagging
2. Expectation and Entropy
3. Constraints
4. Maximum Entropy Form
Modeling Distributions
Estimating from data
Thus far, only counting
  MLE
  Priors
  Backoff
What about features?
Supervised Learning
Problem Setup
  Given: some annotated data (words with POS tags)
  Goal: Build a model (using some feature representation)
  Task: Apply it to unseen data (POS tags for untagged sentences)
Issues
  More data helps
  How to represent the data
Example task: part-of-speech tagging
Contrast: Hidden Markov Models
HMMs are useful and simple; three parameters
  Initial distribution
  Transition
  Conditional emission
Training is easy from tagged data
Find the best sequence using Viterbi
But HMMs ignore important clues that could help
Motivating Example: Features for POS Tagging
[Figure: the local tagging context, words w_{n-2}, w_{n-1}, w_n, w_{n+1}, w_{n+2} with tags t_{n-2}, t_{n-1}, t_n, t_{n+1}, t_{n+2}]
But we can do better
  If one of the previous tags is md, then vb is likelier than vbp (base verb form instead of singular present)
  If the next tag is jj, rbr is likelier than jjr (comparative adverb instead of comparative adjective)
  If one of the previous words is "not", then vb is likelier than vbp
  If a word ends in "-tion" it is likely a nn, but "-ly" implies an adverb
Encoding Features
Much more powerful and expressive than counting single observations
  \vec{f}(x): a vector with the feature counts for observation x
  f_i(x): count of feature i in observation x
Typical example:

$$f(w_{n-2}, w_{n-1}, w_n, w_{n+1}, w_{n+2}, t_n) = \begin{cases} 1, & \text{if } w_{n-1} = \text{``angry''} \text{ and } t_n = \text{NNP} \\ 0, & \text{otherwise} \end{cases}$$
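To make the encoding concrete, here is a minimal Python sketch of how such indicator features might be computed; the feature names and the extract_features helper are illustrative, not from the lecture.

def extract_features(words, tags, n):
    # Indicator features for tagging position n (names are illustrative).
    feats = {}
    if n >= 1 and words[n - 1].lower() == "angry" and tags[n] == "NNP":
        feats["prev_word=angry & tag=NNP"] = 1
    if n >= 1 and tags[n - 1] == "MD" and tags[n] == "VB":
        feats["prev_tag=MD & tag=VB"] = 1
    if words[n].endswith("tion") and tags[n] == "NN":
        feats["suffix=tion & tag=NN"] = 1
    return feats

# Example: the md/vb feature fires for "run" in "He can run fast"
print(extract_features(["He", "can", "run", "fast"], ["PRP", "MD", "VB", "RB"], 2))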
Where Maximum Entropy Models Fit
Suppose we have some data-driven information about these features
What distribution should we use to model these features?
"Maximum Entropy" models provide a solution . . . but first we need some definitions
Expectation
An expectation of a random variable is a weighted average:

$$E[f(X)] = \sum_x f(x)\, p(x) \quad \text{(discrete)}$$
$$\phantom{E[f(X)]} = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx \quad \text{(continuous)}$$

Alternate formulation for positive random variables:

$$E[X] = \sum_{x=0}^{\infty} P(X > x) \quad \text{(discrete)}$$
$$\phantom{E[X]} = \int_{0}^{\infty} P(X > x)\, dx \quad \text{(continuous)}$$
Expectation
Expectations of constants or known values:

$$E[a] = a \qquad E[Y \mid Y = y] = y$$
Expectation of a die / dice
What is the expectation of the roll of a die?

One die:
$$1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + 3 \cdot \tfrac{1}{6} + 4 \cdot \tfrac{1}{6} + 5 \cdot \tfrac{1}{6} + 6 \cdot \tfrac{1}{6} = 3.5$$

What is the expectation of the sum of two dice?

Two dice:
$$2 \cdot \tfrac{1}{36} + 3 \cdot \tfrac{2}{36} + 4 \cdot \tfrac{3}{36} + 5 \cdot \tfrac{4}{36} + 6 \cdot \tfrac{5}{36} + 7 \cdot \tfrac{6}{36} + 8 \cdot \tfrac{5}{36} + 9 \cdot \tfrac{4}{36} + 10 \cdot \tfrac{3}{36} + 11 \cdot \tfrac{2}{36} + 12 \cdot \tfrac{1}{36} = 7$$
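A quick numeric check of both expectations (a small sketch of my own, not part of the slides):

from fractions import Fraction
from itertools import product

faces = range(1, 7)
# One die: sum over outcomes of value * probability
e_one = sum(x * Fraction(1, 6) for x in faces)
# Sum of two dice: enumerate all 36 equally likely outcomes
e_two = sum((a + b) * Fraction(1, 36) for a, b in product(faces, repeat=2))
print(float(e_one), float(e_two))  # 3.5 7.0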
Entropy
Measure of disorder in a system
In the real world, entropy in a system tends to increase
Can also be applied to probabilities:
  Is one outcome (or a few outcomes) nearly certain? (low entropy)
  Are outcomes equiprobable? (high entropy)
Entropy
Entropy is a measure of uncertainty that is associated with the distribution of a random variable:

$$H(X) = -E[\lg p(X)] = -\sum_x p(x) \lg p(x) \quad \text{(discrete)}$$
$$\phantom{H(X)} = -\int_{-\infty}^{\infty} p(x) \lg p(x)\, dx \quad \text{(continuous)}$$

Does not account for the values of the random variable, only the spread of the distribution:
  H(X) ≥ 0
  Uniform distribution = highest entropy; point mass = lowest
  Suppose P(X = 1) = p, P(X = 0) = 1 - p and P(Y = 100) = p, P(Y = 0) = 1 - p: X and Y have the same entropy
Entropy of a die / dice
What is the entropy of a roll of a die?

One die:
$$-\left( \tfrac{1}{6}\lg\tfrac{1}{6} + \tfrac{1}{6}\lg\tfrac{1}{6} + \tfrac{1}{6}\lg\tfrac{1}{6} + \tfrac{1}{6}\lg\tfrac{1}{6} + \tfrac{1}{6}\lg\tfrac{1}{6} + \tfrac{1}{6}\lg\tfrac{1}{6} \right) = \lg 6 \approx 2.58$$

What is the entropy of the sum of two dice? Tricky question: will it be higher or lower than the first one?

Two dice:
$$-\left( \tfrac{1}{36}\lg\tfrac{1}{36} + \tfrac{2}{36}\lg\tfrac{2}{36} + \tfrac{3}{36}\lg\tfrac{3}{36} + \tfrac{4}{36}\lg\tfrac{4}{36} + \tfrac{5}{36}\lg\tfrac{5}{36} + \tfrac{6}{36}\lg\tfrac{6}{36} + \tfrac{5}{36}\lg\tfrac{5}{36} + \tfrac{4}{36}\lg\tfrac{4}{36} + \tfrac{3}{36}\lg\tfrac{3}{36} + \tfrac{2}{36}\lg\tfrac{2}{36} + \tfrac{1}{36}\lg\tfrac{1}{36} \right) \approx 3.27$$
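Both entropies are easy to verify numerically (a small sketch of my own, not part of the slides):

from collections import Counter
from itertools import product
from math import log2

def entropy(dist):
    # Shannon entropy in bits of a {outcome: probability} dict
    return -sum(p * log2(p) for p in dist.values() if p > 0)

one_die = {x: 1 / 6 for x in range(1, 7)}
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
two_dice = {s: c / 36 for s, c in counts.items()}
print(round(entropy(one_die), 2), round(entropy(two_dice), 2))  # 2.58 3.27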
Principles for Modeling Distributions
Maximum Entropy Principle (Jaynes)
All else being equal, we should prefer distributions that maximize the entropy
  What additional constraints do we want to place on the distribution?
  How, mathematically, do we optimize the entropy?
The obvious one . . .
We're attempting to model a probability distribution p
By definition, our probability distribution must sum to one:

$$\sum_x p(x) = 1 \tag{1}$$
Feature constraints
We observe features across many outcomes
We're modeling a distribution p over observations x. What is the correct model of features under this distribution?
The whole point of this is that we don't want to count outcomes (we've discussed those methods)
Ideally, the expected count of the features should be consistent with observations

Estimated counts:
$$E_p[f_i(x)] = \sum_x p(x) f_i(x) \tag{2}$$

Empirical counts:
$$\hat{E}_{\hat{p}}[f_i(x)] = \sum_x \hat{p}(x) f_i(x) \tag{3}$$

The empirical distribution is just what we've observed in the data
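As a minimal sketch of computing the empirical side of the constraint, assuming a hypothetical feature_fn that maps an observation to a dict of feature counts (not an API from the lecture):

from collections import defaultdict

def empirical_expectations(data, feature_fn):
    # \hat{E}_\hat{p}[f_i]: average feature counts over the observed examples.
    totals = defaultdict(float)
    for x in data:
        for name, value in feature_fn(x).items():
            totals[name] += value
    return {name: total / len(data) for name, total in totals.items()}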
Optimizing Constrained Functions
Theorem: Lagrange Multiplier Method
Given functions f(x_1, ..., x_n) and g(x_1, ..., x_n), the critical points of f restricted to the set g = 0 are solutions to the equations:

$$\frac{\partial f}{\partial x_i}(x_1, \dots, x_n) = \lambda \frac{\partial g}{\partial x_i}(x_1, \dots, x_n) \quad \forall i$$
$$g(x_1, \dots, x_n) = 0$$

This is n + 1 equations in the n + 1 variables x_1, ..., x_n, λ.
Lagrange Example
Maximize f(x, y) = \sqrt{xy} subject to the constraint 20x + 10y = 200.

Compute derivatives:
$$\frac{\partial f}{\partial x} = \frac{1}{2}\sqrt{\frac{y}{x}} \qquad \frac{\partial f}{\partial y} = \frac{1}{2}\sqrt{\frac{x}{y}} \qquad \frac{\partial g}{\partial x} = 20 \qquad \frac{\partial g}{\partial y} = 10$$

Create the new system of equations:
$$\frac{1}{2}\sqrt{\frac{y}{x}} = 20\lambda \qquad \frac{1}{2}\sqrt{\frac{x}{y}} = 10\lambda \qquad 20x + 10y = 200$$

Dividing the first equation by the second gives us
$$\frac{y}{x} = 2 \tag{4}$$
which means y = 2x; plugging this into the constraint equation gives:
$$20x + 10(2x) = 200 \quad \Rightarrow \quad x = 5,\; y = 10$$
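A quick numeric sanity check of this solution (my own sketch, not from the slides): eliminate y with the constraint and grid-search over x.

def objective(x):
    # Constraint gives y = (200 - 20x) / 10 = 20 - 2x
    y = 20 - 2 * x
    return (x * y) ** 0.5 if x > 0 and y > 0 else float("-inf")

best_x = max((i / 1000 for i in range(1, 10000)), key=objective)
print(round(best_x, 3), round(20 - 2 * best_x, 3))  # 5.0 10.0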
Objective Function
We want a distribution p that maximizes

$$H(p) \equiv -\sum_x p(x) \log p(x) \tag{5}$$

under the constraints that

$$\sum_x p(x) = 1 \tag{6}$$

and, for every feature f_i,

$$E_p[f_i] = \hat{E}_{\hat{p}}[f_i]. \tag{7}$$
Augmented Objective Function

$$\mathcal{L}(p, \vec{\lambda}, \gamma) = -\sum_x p(x) \log p(x) + \sum_i \lambda_i \left( \sum_x p(x) f_i(x) - \hat{E}[f_i] \right) + \gamma \left( \sum_x p(x) - 1 \right)$$

Plan for solution:
  Take the derivative
  Set it equal to zero
  Solve for the p(x) that optimizes the equation
This will give the functional form of our solution
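Carrying out this plan (a brief sketch of my own, using the sign convention above, not the lecture's in-class derivation):

$$\frac{\partial \mathcal{L}}{\partial p(x)} = -\log p(x) - 1 + \sum_i \lambda_i f_i(x) + \gamma = 0 \quad\Rightarrow\quad p(x) = e^{\gamma - 1} \exp\Big(\sum_i \lambda_i f_i(x)\Big)$$

The sum-to-one constraint (6) then fixes $e^{\gamma - 1} = 1 / \sum_{x'} \exp\big(\sum_i \lambda_i f_i(x')\big)$, which is exactly the form on the next slide.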
Form of Solution
Derivation in class
(Feel free to work it out for yourself)

$$p(x) = \frac{\exp\{\vec{\lambda}^\top \vec{f}(x)\}}{\sum_{x'} \exp\{\vec{\lambda}^\top \vec{f}(x')\}} \tag{8}$$

Thus, the distribution is parameterized by \vec{\lambda} (one weight for each feature)
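Equation (8) is straightforward to compute over a finite set of outcomes; a minimal sketch (the function name and inputs are illustrative):

import math

def maxent_distribution(lmbda, feature_vectors):
    # p(x) proportional to exp(lambda . f(x)); lmbda and each row of
    # feature_vectors are equal-length lists of floats.
    scores = [math.exp(sum(l * f for l, f in zip(lmbda, fv))) for fv in feature_vectors]
    z = sum(scores)  # the normalizer from Eq. (8)
    return [s / z for s in scores]

# Three outcomes described by two features each (numbers are illustrative)
print(maxent_distribution([0.5, -1.0], [[1, 0], [0, 1], [1, 1]]))

In practice one would subtract the maximum score before exponentiating to avoid overflow; the sketch omits that for clarity.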
Finding Parameters
The form is simple
However, finding the parameters is difficult
Solutions take an iterative form:
1. Start with \vec{\lambda}^{(0)} = \vec{0}
2. For k = 1, ...
   (a) Determine the update \vec{\Delta}^{(k)}
   (b) \vec{\lambda}^{(k)} \leftarrow \vec{\lambda}^{(k-1)} + \vec{\Delta}^{(k)}
Method for finding updates
Our objective is a function of \vec{\lambda}:

$$L(\vec{\lambda}) = \sum_x \frac{\exp\{\vec{\lambda}^\top \vec{f}(x)\}}{\sum_{x'} \exp\{\vec{\lambda}^\top \vec{f}(x')\}} \tag{9}$$

(in practice, we typically use the log probability)

Strategy: move \vec{\lambda} by walking up the gradient G(\vec{\lambda}^{(k)})

Gradient:
$$G_i(\vec{\lambda}) = \frac{\partial L(\vec{\lambda})}{\partial \lambda_i} = -\left[ \sum_x p_{\vec{\lambda}}(x) f_i(x) - \hat{E}[f_i] \right] \tag{10}$$

Set the update of the form
$$\vec{\Delta}^{(k)} = \alpha^{(k)} G(\vec{\lambda}^{(k)}) \tag{11}$$

and use the new parameter
$$\vec{\lambda}^{(k)} \leftarrow \vec{\lambda}^{(k-1)} + \vec{\Delta}^{(k)} \tag{12}$$

What value of α?
  Try lots of different values and pick the one that optimizes L(\vec{\lambda}) (grid search)
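Putting the pieces together, a minimal gradient-ascent sketch (my own illustration, not the lecture's code; feature_vectors enumerates f(x) for every outcome and empirical holds the \hat{E}[f_i] values):

import math

def fit_maxent(feature_vectors, empirical, alpha=0.1, iterations=500):
    # Start at lambda = 0 and repeatedly move each weight toward matching
    # the empirical feature expectations.
    lmbda = [0.0] * len(empirical)
    for _ in range(iterations):
        scores = [math.exp(sum(l * f for l, f in zip(lmbda, fv)))
                  for fv in feature_vectors]
        z = sum(scores)
        probs = [s / z for s in scores]
        # G_i from Eq. (10): empirical expectation minus model expectation
        grad = [e - sum(p * fv[i] for p, fv in zip(probs, feature_vectors))
                for i, e in enumerate(empirical)]
        lmbda = [l + alpha * g for l, g in zip(lmbda, grad)]
    return lmbda

Here α is a fixed constant; the grid search from the slide, or a proper line search, would replace it.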
Other parameter estimation techniques
  Iterative scaling
  Conjugate gradient methods
The real difference is speed and scalability
Regularization / Priors
We often want to prefer small parameters over large ones, all else being equal:

$$L(\vec{\lambda}) = \sum_x \frac{\exp\{\vec{\lambda}^\top \vec{f}(x)\}}{\sum_{x'} \exp\{\vec{\lambda}^\top \vec{f}(x')\}} - \sum_i \frac{\lambda_i^2}{2\sigma^2} \tag{13}$$

This is equivalent to having a Gaussian prior on the weights
Also possible to use informed priors when you have an idea of what the weights should be (e.g., for domain adaptation)
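In the gradient-based sketch above, the Gaussian penalty shows up as a small adjustment to each gradient component (assuming the 1/(2σ²) form of Eq. (13); σ is a hyperparameter you choose):

def l2_penalized_gradient(grad, lmbda, sigma=1.0):
    # Each component picks up an extra -lambda_i / sigma^2 from the prior.
    return [g - l / sigma ** 2 for g, l in zip(grad, lmbda)]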
All sorts of distributions
We talked about a simple distribution p(x)
But it could just as easily be a joint distribution p(y, x):

$$p(y, x) = \frac{\exp\{\vec{\lambda}^\top \vec{f}(y, x)\}}{\sum_{y', x'} \exp\{\vec{\lambda}^\top \vec{f}(y', x')\}} \tag{14}$$

Or a conditional distribution p(y | x):

$$p(y \mid x) = \frac{\exp\{\vec{\lambda}^\top \vec{f}(y, x)\}}{\sum_{y'} \exp\{\vec{\lambda}^\top \vec{f}(y', x)\}} \tag{15}$$
Uses of MaxEnt Distributions
POS Tagging (state of the art)
Supervised classification: spam vs. not spam
Parsing (head or not)
Many other NLP applications
In class . . .
HW 3 Results
Quiz
Deriving MaxEnt formula
Defining feature functions