Machine Learning & Data Mining
CS/CNS/EE 155
Lecture 6: Conditional Random Fields
Yisong Yue
Previous Lecture

• Sequence Prediction
  – Input: x = (x_1, …, x_M)
  – Predict: y = (y_1, …, y_M)
  – Naïve full multiclass: exponential explosion
  – Independent multiclass: strong independence assumption
• Hidden Markov Models
  – Generative model: P(y_i | y_{i-1}), P(x_i | y_i)
  – Prediction using Bayes's Rule + Viterbi
  – Train using Maximum Likelihood
Outline of Today
• Long Prelude:
– Generative vs Discriminative Models
– Naïve Bayes
• Conditional Random Fields
– Discriminative version of HMMs
Generative vs Discriminative

• Generative Models (e.g., Hidden Markov Models):
  – Joint Distribution: P(x, y)
  – Uses Bayes's Rule to predict: argmax_y P(y|x)   (mismatch between model and prediction goal!)
  – Can generate new samples (x, y)
• Discriminative Models (e.g., Conditional Random Fields):
  – Conditional Distribution: P(y|x)
  – Can use model directly to predict: argmax_y P(y|x)   (same as prediction goal!)
• Both trained via Maximum Likelihood
Naïve Bayes

• Binary (or Multiclass) prediction: x ∈ R^D, y ∈ {-1, +1}
• Model joint distribution (Generative):

    P(x, y) = P(x | y) P(y)

• "Naïve" independence assumption:

    P(x | y) = ∏_{d=1}^{D} P(x_d | y)

• Prediction via:

    argmax_y P(y | x) = argmax_y P(x | y) P(y) = argmax_y P(y) ∏_{d=1}^{D} P(x_d | y)

http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Naïve Bayes

• Example model (x ∈ R^D, y ∈ {-1, +1}):

    P(x_d=1|y)    y=-1   y=+1          P(y)
    P(x_1=1|y)    0.5    0.7           P(y=-1) = 0.4
    P(x_2=1|y)    0.9    0.4           P(y=+1) = 0.6
    P(x_3=1|y)    0.1    0.5

• Prediction:

    argmax_y P(y | x) = argmax_y P(x | y) P(y) = argmax_y P(y) ∏_{d=1}^{D} P(x_d | y)

    x         P(y=-1|x) ∝                       P(y=+1|x) ∝                       Predict
    (1,0,0)   0.4 * 0.5 * 0.1 * 0.9 = 0.018     0.6 * 0.7 * 0.6 * 0.5 = 0.126     y = +1
    (0,1,1)   0.4 * 0.5 * 0.9 * 0.1 = 0.018     0.6 * 0.3 * 0.4 * 0.5 = 0.036     y = +1
    (0,1,0)   0.4 * 0.5 * 0.9 * 0.9 = 0.162     0.6 * 0.3 * 0.4 * 0.5 = 0.036     y = -1
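A minimal Python sketch (numpy) of this prediction rule, hard-coding the table above; the function and variable names are mine, not from the slides:

import numpy as np

# Class prior A[y] and conditionals O[d, y] = P(x_d = 1 | y),
# with y = 0 standing for the class -1 and y = 1 for the class +1.
A = np.array([0.4, 0.6])           # P(y=-1), P(y=+1)
O = np.array([[0.5, 0.7],          # P(x_1=1 | y)
              [0.9, 0.4],          # P(x_2=1 | y)
              [0.1, 0.5]])         # P(x_3=1 | y)

def naive_bayes_predict(x):
    """Return the (unnormalized) scores P(y) * prod_d P(x_d|y) and the argmax label."""
    x = np.asarray(x)
    # P(x_d | y) is O[d, y] when x_d = 1 and 1 - O[d, y] when x_d = 0
    likelihoods = np.where(x[:, None] == 1, O, 1.0 - O)   # shape (D, 2)
    scores = A * likelihoods.prod(axis=0)
    return scores, (+1 if scores[1] > scores[0] else -1)

for x in [(1, 0, 0), (0, 1, 1), (0, 1, 0)]:
    print(x, *naive_bayes_predict(x))
# Matches the table: (1,0,0) -> [0.018, 0.126], +1; (0,1,1) -> [0.018, 0.036], +1;
#                    (0,1,0) -> [0.162, 0.036], -1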
Naïve Bayes

• Matrix Formulation:

    P(x, y) = P(y) ∏_{d=1}^{D} P(x_d | y) = A_y ∏_{d=1}^{D} O^d_{x_d, y}

    where O^d_{a,b} = P(x_d = a | y = b)   (each column O^d_{*,b} sums to 1)
          A_b = P(y = b)                   (sums to 1)

• Example (x ∈ R^D, y ∈ {-1, +1}):

    P(x_d=1|y)    y=-1   y=+1          P(y)
    P(x_1=1|y)    0.5    0.7           P(y=-1) = 0.4
    P(x_2=1|y)    0.9    0.4           P(y=+1) = 0.6
    P(x_3=1|y)    0.1    0.5
Naïve Bayes

• Train via Max Likelihood on S = {(x_i, y_i)}_{i=1}^{N}, x ∈ R^D, y ∈ {-1, +1}:

    argmax_{A,O} ∏_{i=1}^{N} P(x_i, y_i) = ∏_{i=1}^{N} P(y_i) ∏_{d=1}^{D} P(x_{i,d} | y_i)

• Estimate P(y) and each P(x_d | y) from data
  – Count frequencies:

    A_z = P(y = z) = ( Σ_{i=1}^{N} 1[y_i = z] ) / N

    O^d_{a,z} = P(x_d = a | y = z) = ( Σ_{i=1}^{N} 1[y_i = z ∧ x_{i,d} = a] ) / ( Σ_{i=1}^{N} 1[y_i = z] )
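A minimal sketch of these count-based estimates (assuming binary features, labels given as class indices 0..K-1 with every class present, and no smoothing; names are mine):

import numpy as np

def train_naive_bayes(X, y, num_classes=2):
    """X: (N, D) binary matrix, y: (N,) class indices.
    Returns A[z] = P(y = z) and O[d, a, z] = P(x_d = a | y = z) by counting."""
    N, D = X.shape
    A = np.zeros(num_classes)
    O = np.zeros((D, 2, num_classes))
    for z in range(num_classes):
        idx = (y == z)
        A[z] = idx.sum() / N                              # frequency of class z
        for a in (0, 1):
            # fraction of class-z examples with x_d = a, for every feature d
            O[:, a, z] = (X[idx] == a).sum(axis=0) / idx.sum()
    return A, O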
Naïve Bayes vs HMMs

• Naïve Bayes:

    P(x, y) = P(y) ∏_{d=1}^{D} P(x_d | y)

• Hidden Markov Models:

    P(x, y) = P(End | y_M) ∏_{j=1}^{M} P(y_j | y_{j-1}) ∏_{j=1}^{M} P(x_j | y_j)

  (the transition terms play the role of P(y); the emission terms make the
   "naïve" generative independence assumption)

• HMMs ≈ 1st order variant of Naïve Bayes!  (just one interpretation…)
Naïve Bayes vs HMMs

• Naïve Bayes:   (A_y is P(y), the O terms are P(x|y))

    P(x, y) = A_y ∏_{d=1}^{D} O^d_{x_d, y}

• Hidden Markov Models:   (the A terms are P(y), the O terms are P(x|y),
  and the emissions make the "naïve" generative independence assumption)

    P(x, y) = A_{End, y_M} ∏_{j=1}^{M} A_{y_j, y_{j-1}} ∏_{j=1}^{M} O_{x_j, y_j}

• HMMs ≈ 1st order variant of Naïve Bayes!  (just one interpretation…)
Summary: Naïve Bayes

• Joint model of (x, y) — a "Generative Model" (can sample new data):
  – "Naïve" independence assumption on each x_d

    P(x, y) = P(y) ∏_{d=1}^{D} P(x_d | y)

• Use Bayes's Rule for prediction:

    argmax_y P(y | x) = argmax_y P(x | y) P(y) = argmax_y P(y) ∏_{d=1}^{D} P(x_d | y)

• Maximum Likelihood Training:
  – Count Frequencies
Learn Conditional Prob.?

• Given S = {(x_i, y_i)}_{i=1}^{N}, x ∈ R^D, y ∈ {-1, +1}
• Weird to train to maximize:

    argmax_{A,O} ∏_{i=1}^{N} P(x_i, y_i) = argmax_{A,O} ∏_{i=1}^{N} P(y_i) ∏_{d=1}^{D} P(x_{i,d} | y_i)

• When goal should be to maximize:

    argmax_{A,O} ∏_{i=1}^{N} P(y_i | x_i) = argmax_{A,O} ∏_{i=1}^{N} P(x_i, y_i) / P(x_i)
                                          = argmax_{A,O} ∏_{i=1}^{N} P(y_i) ∏_{d=1}^{D} P(x_{i,d} | y_i) / P(x_i)

  (In general, you should maximize the likelihood of the model you define!
   So if you define a joint model P(x,y), then maximize P(x,y) on training data.)

• The denominator breaks independence:

    P(x) = Σ_y P(x, y) = Σ_y P(y) ∏_{d=1}^{D} P(x_d | y)

  so we can no longer use count statistics like

    O^d_{a,z} = P(x_d = a | y = z) = ( Σ_{i=1}^{N} 1[y_i = z ∧ x_{i,d} = a] ) / ( Σ_{i=1}^{N} 1[y_i = z] )

  *HMMs suffer the same problem
Summary: Generative Models

• Joint model of (x, y):  P(x, y)
  – Compact & easy to train…
  – …with independence assumptions
  – E.g., Naïve Bayes & HMMs
• Maximum Likelihood Training on S = {(x_i, y_i)}_{i=1}^{N}:

    argmax_Θ ∏_{i=1}^{N} P(x_i, y_i)      (Θ often used to denote all parameters of the model)

• Mismatch w/ prediction goal:  argmax_y P(y | x)
  – But hard to maximize P(y|x)
Discriminative Models

• Conditional model:  P(y | x)
  – Directly model prediction goal
• Maximum Likelihood:

    argmax_Θ ∏_{i=1}^{N} P(y_i | x_i)

• Matches prediction goal:  argmax_y P(y | x)
• What does P(y|x) look like?
First Try

• Model P(y|x) for every possible x   (x ∈ {0,1}^D, y ∈ {-1, +1}):

    x_1   x_2   P(y=1|x)
    0     0     0.5
    0     1     0.7
    1     0     0.2
    1     1     0.4

• Train by counting frequencies
• Exponential in # input variables D!
  – Need to assume something… what?
Log Linear Models!  (Logistic Regression)

    P(y | x) = exp{ w_y^T x - b_y } / Σ_k exp{ w_k^T x - b_k }        x ∈ R^D, y ∈ {1, 2, …, K}

• "Log-Linear" assumption
  – Model representation is linear in D
  – Most common discriminative probabilistic model
• Training and prediction match:

    Training:    argmax_Θ ∏_{i=1}^{N} P(y_i | x_i)
    Prediction:  argmax_y P(y | x)
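A minimal sketch of evaluating this model for one input x (assuming w is a K x D weight matrix and b a length-K bias vector; names are mine):

import numpy as np

def log_linear_predict(x, w, b):
    """P(y | x) = exp(w_y^T x - b_y) / sum_k exp(w_k^T x - b_k)."""
    scores = w @ x - b                    # length-K vector of w_k^T x - b_k
    scores -= scores.max()                # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs, probs.argmax()          # full distribution and argmax_y P(y|x)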
Naïve Bayes vs Logistic Regression

• Naïve Bayes:
  – Strong independence assumptions
  – Super easy to train…
  – …but mismatch with prediction

    P(x, y) = A_y ∏_{d=1}^{D} O^d_{x_d, y}        (A_y is P(y), the O terms are P(x|y))

• Logistic Regression:
  – "Log Linear" assumption
    • Often more flexible than Naïve Bayes
  – Harder to train (gradient descent)…
  – …but matches prediction

    P(y | x) = exp{ w_y^T x - b_y } / Σ_k exp{ w_k^T x - b_k }        x ∈ R^D, y ∈ {1, 2, …, K}
Naïve Bayes vs Logistic Regression

• NB has K parameters for P(y) (i.e., A); LR has K parameters for the bias b
• NB has K*D parameters for P(x|y) (i.e., O); LR has K*D parameters for w
• Same number of parameters!

• Intuition: both models have the same "capacity"
  – NB spends a lot of capacity on P(x)
  – LR spends all of its capacity on P(y|x)
• No model is perfect!  (Especially on a finite training set)
  – NB will trade off P(y|x) with P(x)
  – LR will fit P(y|x) as well as possible

    Naïve Bayes:          P(x, y) = A_y ∏_{d=1}^{D} O^d_{x_d, y}
    Logistic Regression:  P(y | x) = exp{ w_y^T x - b_y } / Σ_k exp{ w_k^T x - b_k }
                          x ∈ {0,1}^D, y ∈ {1, 2, …, K}
Generative vs Discriminative

• Model:      Generative = joint P(x,y) over x and y (cares about everything);
              Discriminative = conditional P(y|x), when probabilistic (only cares about predicting well)
• Examples:   Generative = Naïve Bayes, HMMs (also Topic Models, later);
              Discriminative = Logistic Regression, CRFs (also SVM, Least Squares, etc.)
• Training:   Generative = Max Likelihood;
              Discriminative = Max (Conditional) Likelihood (= minimize log loss),
              or any loss based on y (hinge loss, squared loss, etc.)
• Probabilistic?  Generative = always;
              Discriminative = not necessarily (certainly never joint over P(x,y))
• Assumptions: Generative = often strong (keeps training tractable);
              Discriminative = more flexible (focuses entire model on P(y|x))
• Train vs predict: Generative = mismatch (requires Bayes's Rule);
              Discriminative = trains to optimize the prediction goal
• Sampling:   Generative = can sample anything;
              Discriminative = can only sample y given x
• Missing values in x:  Generative = can handle them;
              Discriminative = cannot
Conditional Random Fields
"Log-Linear" 1st Order Sequential Model

    P(y | x) = (1 / Z(x)) exp{ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} ) }

    F(y, x) ≡ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )      Scoring function:
                                                                    u scores transitions,
                                                                    w scores input features

    P(y | x) = exp{ F(y, x) } / Z(x)

    Z(x) = Σ_{y'} exp{ F(y', x) }        aka "Partition Function"

    log P(y | x) = F(y, x) - log( Z(x) )

    y_0 = special Start state
    P(y | x) = (1 / Z(x)) exp{ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} ) }

• x = "Fish Sleep"
• y = (N, V)

  Transition scores u (current state, previous state):

              u_{*,N}   u_{*,V}   u_{*,Start}
    u_{N,*}     -2         2          1
    u_{V,*}      1        -2         -1

  Input scores w:

              w_{*,Fish}   w_{*,Sleep}
    w_{N,*}       2            1
    w_{V,*}       1            0

    P(N,V | "Fish Sleep") = (1 / Z(x)) exp{ u_{N,Start} + w_{N,Fish} + u_{V,N} + w_{V,Sleep} }
                          = (1 / Z(x)) exp{4}

    Z(x) = Σ_y exp( F(y, x) ):
      (N,N)   exp(1 + 2 - 2 + 1)  = exp(2)
      (N,V)   exp(1 + 2 + 1 + 0)  = exp(4)
      (V,N)   exp(-1 + 1 + 2 + 1) = exp(3)
      (V,V)   exp(-1 + 1 - 2 + 0) = exp(-2)
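A short Python sketch of the same computation, enumerating all label sequences for this two-word example (the dictionaries and names are mine; the values come from the tables above):

import numpy as np
from itertools import product

states = ["N", "V"]
x = ["Fish", "Sleep"]
u = {("N", "Start"): 1, ("V", "Start"): -1,    # u[(current, previous)]
     ("N", "N"): -2,    ("V", "N"): 1,
     ("N", "V"): 2,     ("V", "V"): -2}
w = {("N", "Fish"): 2, ("N", "Sleep"): 1,      # w[(state, word)]
     ("V", "Fish"): 1, ("V", "Sleep"): 0}

def F(y):
    """Scoring function F(y, x) = sum_j u[y_j, y_{j-1}] + w[y_j, x_j]."""
    total, prev = 0.0, "Start"
    for label, word in zip(y, x):
        total += u[(label, prev)] + w[(label, word)]
        prev = label
    return total

Z = sum(np.exp(F(y)) for y in product(states, repeat=len(x)))
print("Z(x) =", Z)                                   # exp(2) + exp(4) + exp(3) + exp(-2)
print("P(N,V | x) =", np.exp(F(("N", "V"))) / Z)     # exp(4) / Z(x)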
• x = "Fish Sleep"
• y = (N, V)

    P(N,V | "Fish Sleep") = (1 / Z(x)) exp{ u_{N,Start} + w_{N,Fish} + u_{V,N} + w_{V,Sleep} }

  [Plot: P(N,V | "Fish Sleep") as a function of u_{N,Start} + w_{N,Fish} + u_{V,N} + w_{V,Sleep},
   holding the other parameters fixed]
Basic Conditional Random Field

• Directly models P(y|x)
  – Discriminative
  – Log-linear assumption
  – Same # parameters as HMM
  – 1st Order Sequential LR
  (CRF spends all model capacity on P(y|x), rather than P(x,y))
• How to Predict?
• How to Train?
• Extensions?

    F(y, x) ≡ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )

    P(y | x) = exp{ F(y, x) } / Σ_{y'} exp{ F(y', x) }

    log P(y | x) = F(y, x) - log( Σ_{y'} exp{ F(y', x) } )
Predict via Viterbi

    argmax_y P(y | x) = argmax_y log P(y | x) = argmax_y F(y, x)
                      = argmax_y Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )
                        (u scores transitions, w scores input features)

• Maintain length-k prefix solutions:

    Ŷ^k(T) = ( argmax_{y_{1:k-1}} F(y_{1:k-1} ⊕ T, x_{1:k}) ) ⊕ T

• Recursively solve for length-(k+1) solutions:

    Ŷ^{k+1}(T) = ( argmax_{y_{1:k} ∈ {Ŷ^k(T')}_{T'}} F(y_{1:k} ⊕ T, x) ) ⊕ T
               = ( argmax_{y_{1:k} ∈ {Ŷ^k(T')}_{T'}} [ F(y_{1:k}, x) + u_{T, y_k} + w_{T, x_{k+1}} ] ) ⊕ T

• Predict via best length-M solution:

    argmax_y F(y, x) = argmax_{y ∈ {Ŷ^M(T)}_T} F(y, x)
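A minimal sketch of this recursion in Python, reusing the u/w dictionaries and states list from the "Fish Sleep" sketch above (names are mine):

def viterbi(x, states, u, w):
    """argmax_y F(y, x) for F(y, x) = sum_j u[y_j, y_{j-1}] + w[y_j, x_j]."""
    # best[T] = (score, sequence) of the best length-k prefix ending in T, i.e. Yhat^k(T)
    best = {T: (u[(T, "Start")] + w[(T, x[0])], [T]) for T in states}
    for j in range(1, len(x)):
        new_best = {}
        for T in states:
            # extend the stored best prefixes by one more label T, keep the max
            score, prefix = max(
                (best[P][0] + u[(T, P)] + w[(T, x[j])], best[P][1])
                for P in states)
            new_best[T] = (score, prefix + [T])
        best = new_best
    return max(best.values())    # best length-M solution: (score, label sequence)

# On the "Fish Sleep" example this returns (4, ['N', 'V']), matching the enumeration.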
Solve:

    Ŷ^2(V) = ( argmax_{y_1 ∈ {Ŷ^1(T)}_T} [ F(y_1, x) + u_{V, y_1} + w_{V, x_2} ] ) ⊕ V

• Store each Ŷ^1(T) & F(Ŷ^1(T), x)
• [Trellis: each of y_1 = V, D, N (i.e., Ŷ^1(V), Ŷ^1(D), Ŷ^1(N)) feeds into Ŷ^2(V), Ŷ^2(D), Ŷ^2(N)]
• Ŷ^1(T) is just T
Solve:

    Ŷ^2(V) = ( argmax_{y_1 ∈ {Ŷ^1(T)}_T} [ F(y_1, x) + u_{V, y_1} + w_{V, x_2} ] ) ⊕ V

• Store each Ŷ^1(T) & F(Ŷ^1(T), x_1)
• [Trellis: Ŷ^1(V), Ŷ^1(D), Ŷ^1(N) feed into Ŷ^2(V), Ŷ^2(D), Ŷ^2(N)]
• Ŷ^1(T) is just T
• Ex: Ŷ^2(V) = (N, V)
Solve:

    Ŷ^3(V) = ( argmax_{y_{1:2} ∈ {Ŷ^2(T)}_T} [ F(y_{1:2}, x) + u_{V, y_2} + w_{V, x_3} ] ) ⊕ V

• Store each Ŷ^1(T) & F(Ŷ^1(T), x_1), and each Ŷ^2(Z) & F(Ŷ^2(Z), x)
• [Trellis: each of y_2 = V, D, N (i.e., Ŷ^2(V), Ŷ^2(D), Ŷ^2(N)) feeds into Ŷ^3(V), Ŷ^3(D), Ŷ^3(N)]
• Ŷ^1(Z) is just Z
• Ex: Ŷ^2(V) = (N, V)
Solve:

    Ŷ^M(V) = ( argmax_{y_{1:M-1} ∈ {Ŷ^{M-1}(T)}_T} [ F(y_{1:M-1}, x) + u_{V, y_{M-1}} + w_{V, x_M} ] ) ⊕ V

• Store each Ŷ^1(Z) & F(Ŷ^1(Z), x_1), each Ŷ^2(T) & F(Ŷ^2(T), x), each Ŷ^3(T) & F(Ŷ^3(T), x), …
• [Trellis: Ŷ^1(V/D/N) → Ŷ^2(V/D/N) → Ŷ^3(V/D/N) → … → Ŷ^M(V/D/N)]
• Ŷ^1(T) is just T
• Ex: Ŷ^2(V) = (N, V),  Ŷ^3(V) = (D, N, V)
Computing P(y|x)

• Viterbi doesn't compute P(y|x)
  – Just maximizes the numerator F(y,x)

    P(y | x) = exp{ F(y, x) } / Σ_{y'} exp{ F(y', x) } ≡ (1 / Z(x)) exp{ F(y, x) }

• Also need to compute Z(x)
  – aka the "Partition Function"

    Z(x) = Σ_{y'} exp{ F(y', x) }
Computing Partition Function

• Naïve approach: iterate over all y'
  – Exponential time: L^M possible y'!

    Z(x) = Σ_{y'} exp{ F(y', x) },    F(y, x) ≡ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )

• Notation:

    G^j(a, b) = exp{ u_{a,b} + w_{a, x_j} }

    P(y | x) = (1 / Z(x)) ∏_{j=1}^{M} G^j(y_j, y_{j-1})

    Z(x) = Σ_{y'} ∏_{j=1}^{M} G^j(y'_j, y'_{j-1})

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
Matrix Semiring

    Z(x) = Σ_{y'} ∏_{j=1}^{M} G^j(y'_j, y'_{j-1})

• Matrix version of G^j: an (L+1) x (L+1) matrix with entries G^j(a, b) = exp{ u_{a,b} + w_{a, x_j} }
  (the extra row/column includes 'Start')

    G^{1:2}(a, b) ≡ Σ_c G^2(a, c) G^1(c, b),    i.e.,   G^{1:2} = G^2 G^1   (matrix multiplication)

    G^{i:j} = G^j G^{j-1} … G^{i+1} G^i

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
Path Counting Interpretation

• Interpretation of G^1(a, b):
  – L+1 start & end locations
  – Weight of the path from 'b' to 'a' in step 1
• G^{1:2}(a, b):   ( G^{1:2} = G^2 G^1 )
  – Weight of all paths that
    • start in 'b' at the beginning of Step 1
    • end in 'a' after Step 2

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
Computing Partition Function

• Consider length-1 (M=1):

    Z(x) = Σ_a G^1(a, Start)        Sum column 'Start' of G^1!

• M=2:

    Z(x) = Σ_{a,b} G^2(b, a) G^1(a, Start) = Σ_b G^{1:2}(b, Start)        Sum column 'Start' of G^{1:2}!

• General M:  sum column 'Start' of G^{1:M}
  – Do M (L+1)x(L+1) matrix multiplications to compute G^{1:M} = G^M G^{M-1} … G^2 G^1
  – Z(x) = sum of column 'Start' of G^{1:M}

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
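A minimal numpy sketch of this matrix computation, set up on the "Fish Sleep" example (index 0 is reserved for 'Start', and transitions back into 'Start' are blocked with -inf; the array layouts and names are my assumptions):

import numpy as np

# States: 0 = Start, 1 = N, 2 = V.  u[a, b] = transition score from b into a.
u = np.array([[-np.inf, -np.inf, -np.inf],   # never transition back into Start
              [      1,      -2,       2],   # into N: from Start, N, V
              [     -1,       1,      -2]])  # into V: from Start, N, V
# w_scores[j, a] = w_{a, x_j} for x = ("Fish", "Sleep").
w_scores = np.array([[-np.inf, 2, 1],        # Fish:  w_{N,Fish}=2, w_{V,Fish}=1
                     [-np.inf, 1, 0]])       # Sleep: w_{N,Sleep}=1, w_{V,Sleep}=0

def partition_function(u, w_scores):
    """Z(x) = sum of column 'Start' of G^{1:M}, with G^j(a,b) = exp(u[a,b] + w[a,x_j])."""
    G_total = np.eye(u.shape[0])
    for j in range(w_scores.shape[0]):
        Gj = np.exp(u + w_scores[j][:, None])   # (L+1) x (L+1) matrix for step j
        G_total = Gj @ G_total                  # G^{1:j} = G^j G^{1:j-1}
    return G_total[:, 0].sum()                  # sum column 'Start' of G^{1:M}

print(partition_function(u, w_scores))  # exp(2) + exp(4) + exp(3) + exp(-2) ≈ 82.2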
Train via Gradient Descent

• Similar to Logistic Regression
  – Gradient Descent on negative log likelihood (log loss)

    argmin_Θ Σ_{i=1}^{N} -log P(y_i | x_i) = argmin_Θ Σ_{i=1}^{N} [ -F(y_i, x_i) + log( Z(x_i) ) ]

  (Θ often used to denote all parameters of the model; the log Z(x_i) term is harder to differentiate!)

• First term is easy:

    ∂_{u_{a,b}} [ -F(y, x) ] = -Σ_{j=1}^{M} 1[ (y_j, y_{j-1}) = (a, b) ]

    ∂_{w_{a,z}} [ -F(y, x) ] = -Σ_{j=1}^{M} 1[ (y_j, x_j) = (a, z) ]

  – Recall:  F(y, x) ≡ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
Differentiating Log Partition

Lots of Chain Rule & Algebra!

    ∂_{u_{a,b}} log(Z(x)) = (1 / Z(x)) ∂_{u_{a,b}} Z(x) = (1 / Z(x)) ∂_{u_{a,b}} Σ_{y'} exp{ F(y', x) }

                          = (1 / Z(x)) Σ_{y'} ∂_{u_{a,b}} exp{ F(y', x) }

                          = (1 / Z(x)) Σ_{y'} exp{ F(y', x) } ∂_{u_{a,b}} F(y', x)

                          = Σ_{y'} P(y' | x) ∂_{u_{a,b}} F(y', x)                        (definition of P(y'|x))

                          = Σ_{y'} [ P(y' | x) Σ_{j=1}^{M} 1[ (y'_j, y'_{j-1}) = (a, b) ] ]

                          = Σ_{j=1}^{M} Σ_{y'} P(y' | x) 1[ (y'_j, y'_{j-1}) = (a, b) ]   (marginalize over all y')

                          = Σ_{j=1}^{M} P(y_j = a, y_{j-1} = b | x)                       Forward-Backward!

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
Optimality Condition

    argmin_Θ Σ_{i=1}^{N} -log P(y_i | x_i) = argmin_Θ Σ_{i=1}^{N} [ -F(y_i, x_i) + log( Z(x_i) ) ]

• Consider one parameter:

    ∂_{u_{a,b}} Σ_{i=1}^{N} -F(y_i, x_i) = -Σ_{i=1}^{N} Σ_{j=1}^{M_i} 1[ (y_{i,j}, y_{i,j-1}) = (a, b) ]

    ∂_{u_{a,b}} Σ_{i=1}^{N} log(Z(x_i)) = Σ_{i=1}^{N} Σ_{j=1}^{M_i} P(y_{i,j} = a, y_{i,j-1} = b | x_i)

• Optimality condition:

    Σ_{i=1}^{N} Σ_{j=1}^{M_i} 1[ (y_{i,j}, y_{i,j-1}) = (a, b) ] = Σ_{i=1}^{N} Σ_{j=1}^{M_i} P(y_{i,j} = a, y_{i,j-1} = b | x_i)

• Frequency counts = conditional expectation on training data!
  – Holds for each component of the model
  – Each component is a "log-linear" model and requires gradient descent
Forward-Backward for CRFs

    α_1(a) = G^1(a, Start)                       β_M(b) = 1

    α_j(a) = Σ_b α_{j-1}(b) G^j(a, b)            β_j(b) = Σ_a β_{j+1}(a) G^{j+1}(a, b)

    P(y_j = b, y_{j-1} = a | x) = α_{j-1}(a) G^j(b, a) β_j(b) / Z(x)

    Z(x) = Σ_{y'} exp{ F(y', x) }

    F(y, x) ≡ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )

    G^j(a, b) = exp{ u_{a,b} + w_{a, x_j} }

http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
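A minimal numpy sketch of these recursions, using the same G^j matrices and 'Start' = index 0 convention as the partition-function sketch above (names are mine):

import numpy as np

def forward_backward(G):
    """G: list of M matrices with G[j][a, b] = exp(u[a,b] + w[a, x_{j+1}]),
    index 0 = 'Start', row 0 of each G[j] zero (no transitions back into 'Start').
    Returns alpha, beta, Z(x), and marg, where marg[t][b, a] is the pairwise
    marginal P(y_{t+2} = b, y_{t+1} = a | x) for t = 0..M-2."""
    M, L1 = len(G), G[0].shape[0]
    alpha = np.zeros((M, L1))
    beta = np.zeros((M, L1))
    alpha[0] = G[0][:, 0]                      # alpha_1(a) = G^1(a, Start)
    for j in range(1, M):
        alpha[j] = G[j] @ alpha[j - 1]         # alpha_j(a) = sum_b G^j(a,b) alpha_{j-1}(b)
    beta[M - 1] = 1.0                          # beta_M(b) = 1
    for j in range(M - 2, -1, -1):
        beta[j] = G[j + 1].T @ beta[j + 1]     # beta_j(b) = sum_a G^{j+1}(a,b) beta_{j+1}(a)
    Z = alpha[M - 1].sum()                     # same Z(x) as the matrix-product version
    marg = [beta[j][:, None] * G[j] * alpha[j - 1][None, :] / Z
            for j in range(1, M)]              # alpha_{j-1}(a) G^j(b,a) beta_j(b) / Z(x)
    return alpha, beta, Z, marg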
Path Interpretation

  [Trellis diagram: columns α_1, α_2, α_3 over states V, D, N, with edges from "Start"
   weighted G^1(V, "Start"), G^1(N, "Start"), then × G^2(D, N), × G^3(N, D), …]

• α_3(V) = total weight of paths from "Start" to "V" in the 3rd step
• β just does it backwards
Matrix Formulation

• Use Matrices!

    α_j = G^j α_{j-1}              e.g.,  α_2 = G^2 α_1

    β_j = (G^{j+1})^T β_{j+1}

• Fast to compute!
• Easy to implement!
Path Interpretation: Forward-Backward vs Viterbi

  [Two trellis diagrams from "Start" over columns α_1, α_2, α_3 and states V, D, N:
   one for Forward (β just does it backwards), one for Viterbi]

• Forward (and Backward) sums over all paths
  – Computes expectation of reaching each state
  – E.g., total (un-normalized) probability of y_3 = Verb over all possible y_{1:2}
• Viterbi only keeps the best path
  – Computes the best possible path to reaching each state
  – E.g., single highest-probability setting of y_{1:3} such that y_3 = Verb
Summary: Training CRFs

• Similar optimality condition as HMMs:
  – Match frequency counts of model components!

    Σ_{i=1}^{N} Σ_{j=1}^{M_i} 1[ (y_{i,j}, y_{i,j-1}) = (a, b) ] = Σ_{i=1}^{N} Σ_{j=1}^{M_i} P(y_{i,j} = a, y_{i,j-1} = b | x_i)

  – Except HMMs can just set the model using counts
  – CRFs need to do gradient descent to match counts
• Run Forward-Backward for the expectation
  – Just like HMMs as well
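A hedged sketch of one gradient step on the transition parameters u for a single training sequence, reusing the forward_backward sketch above (all names are mine; a real implementation would also update w, sum over all training sequences, and typically add regularization):

import numpy as np

def crf_gradient_step_u(u, w_scores, y, step_size=0.1):
    """One gradient step on log P(y | x) with respect to the transition matrix u
    (equivalently, gradient descent on the negative log-likelihood).
    y: list of state indices (1..L) for one sequence; index 0 = 'Start'.
    Gradient = empirical transition counts - expected counts under the model."""
    L1, M = u.shape[0], w_scores.shape[0]
    G = [np.exp(u + w_scores[j][:, None]) for j in range(M)]   # G^j(a, b) matrices
    alpha, beta, Z, marg = forward_backward(G)

    # Empirical counts of (current state, previous state) pairs in the true sequence.
    empirical = np.zeros((L1, L1))
    prev = 0
    for label in y:
        empirical[label, prev] += 1.0
        prev = label

    # Expected counts under the model: position 1 pairs y_1 with 'Start',
    # later positions use the pairwise marginals from forward-backward.
    expected = np.zeros((L1, L1))
    expected[:, 0] += G[0][:, 0] * beta[0] / Z      # P(y_1 = a | x), paired with 'Start'
    for t in range(M - 1):
        expected += marg[t]                         # P(y_{t+2} = b, y_{t+1} = a | x)

    return u + step_size * (empirical - expected)   # ascent step on the log-likelihood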
More General CRFs

    P(y | x) = exp{ F(y, x) } / Σ_{y'} exp{ F(y', x) }

    New:  F(y, x) ≡ Σ_{j=1}^{M} θ^T φ_j(y_j, y_{j-1} | x)

    Old:  F(y, x) ≡ Σ_{j=1}^{M} ( u_{y_j, y_{j-1}} + w_{y_j, x_j} )

    Reduction:  θ^T φ_j(y_j, y_{j-1} | x) = u_{y_j, y_{j-1}} + w_{y_j, x_j}
                (recovering the u_{a,b} and w_{a, x_j} parameterization)

• θ is a "flattened" weight vector
• Can extend φ_j(a, b | x)
More General CRFs

    P(y | x) = exp{ F(y, x) } / Σ_{y'} exp{ F(y', x) }

    F(y, x) ≡ Σ_{j=1}^{M} θ^T φ_j(y_j, y_{j-1} | x)

• 1st order Sequence CRFs:

    F(y, x) ≡ Σ_{j=1}^{M} [ θ_2^T ψ_j(y_j, y_{j-1}) + θ_1^T ϕ_j(y_j | x) ]

    where θ stacks θ_1 and θ_2, and φ_j(a, b | x) stacks ψ_j(a, b) and ϕ_j(b | x) correspondingly
Example

    F(y, x) ≡ Σ_{j=1}^{M} [ θ_2^T ψ_j(y_j, y_{j-1}) + θ_1^T ϕ_j(y_j | x) ]

  (The basic formulation only had the first part.)

• Example input features for label y_j = b (various attributes of x):

    ϕ_{j,b}(x) = [ e_{x_j} ;  1[x_j ∈ animal] ;  e_{x_{j-1}} ]

• Stack one such sub-vector for each label y_j = b:
  the full ϕ_j(b | x) is all 0's except the sub-vector for the active label b
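A hedged sketch of assembling such a feature vector (the vocabulary, the 'animal' attribute set, and all names here are illustrative assumptions, not from the slides):

import numpy as np

VOCAB = {"fish": 0, "sleep": 1, "the": 2}     # example vocabulary (assumed)
ANIMALS = {"fish"}                            # example attribute set (assumed)
NUM_LABELS = 3                                # e.g., N, V, D

def phi_j(j, b, x):
    """Feature vector for label y_j = b: per label, stack a one-hot of x_j,
    an 'x_j is an animal' indicator, and a one-hot of x_{j-1}; all sub-vectors
    except the one for the active label b stay zero."""
    V = len(VOCAB)
    block = np.zeros(2 * V + 1)
    block[VOCAB[x[j]]] = 1.0                              # e_{x_j}
    block[V] = 1.0 if x[j] in ANIMALS else 0.0            # 1[x_j in animal]
    if j > 0:
        block[V + 1 + VOCAB[x[j - 1]]] = 1.0              # e_{x_{j-1}}
    feat = np.zeros(NUM_LABELS * block.size)
    feat[b * block.size:(b + 1) * block.size] = block     # place in label b's slot
    return feat

# e.g., phi_j(1, 0, ["the", "fish", "sleep"]) gives a 21-dimensional vector with ones
# at the positions for "fish", the animal indicator, and the previous word "the".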
Summary: CRFs

• "Log-Linear" 1st order sequence model
  – Multiclass LR + 1st order components
  – Discriminative version of HMMs

    P(y | x) = exp{ F(y, x) } / Σ_{y'} exp{ F(y', x) }

    F(y, x) ≡ Σ_{j=1}^{M} [ θ_2^T ψ_j(y_j, y_{j-1}) + θ_1^T ϕ_j(y_j | x) ]

  – Predict using Viterbi, train using Gradient Descent
  – Need Forward-Backward to differentiate the partition function
Next Week
• Structural SVMs
– Hinge loss for sequence prediction
• More General Structured Prediction
• Next Recitation:
– Optimizing non-differentiable functions (Lasso)
– Accelerated gradient descent
• Homework 2 due in 12 days
– Tuesday, Feb 3rd at 2pm via Moodle