Learning to Predict Trees, Sequences, and other Structured Outputs
Tutorial at NESCAI 2006

Thorsten Joachims
Cornell University
Department of Computer Science

Supervised Learning
• Find function from input space X to output space Y such that the prediction error is low.
[Figure: two example input/output pairs: a news article x ("Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…") with its prediction y, and a DNA sequence x ("GATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACC…") with its prediction y.]
Examples of Complex Output Spaces
• Natural Language Parsing
– Given a sequence of words x, predict the parse tree y.
– Dependencies from structural constraints, since y has to be a tree.
[Figure: x = "The dog chased the cat"; y = parse tree with S → NP VP, NP → Det N, VP → V NP.]

Examples of Complex Output Spaces
• Part-of-Speech Tagging
– Given a sequence of words x, predict sequence of tags y.
– Dependencies from tag-tag transitions in Markov model.
[Figure: x = "The bear chased the cat"; y = Det N V Det N.]

Examples of Complex Output Spaces
• Multi-Label Classification
– Given a (bag-of-words) document x, predict a set of labels y.
– Dependencies between labels from correlations between labels ("iraq" and "oil" in newswire corpus).
[Figure: x = "Due to the continued violence in Baghdad, the oil price is expected to further increase. OPEC officials met with …"; y = label vector over {antarctica, benelux, germany, iraq, oil, coal, trade, acquisitions}, e.g. (-1, -1, -1, +1, +1, -1, -1, -1).]

Examples of Complex Output Spaces
• Non-Standard Performance Measures (e.g. F1-score, Lift)
– F1-score: harmonic average of precision and recall
– New example vector x8. Predict y8=1, if P(y8=1|x8)=0.4? ⇒ Depends on the other examples!
[Figure: a set of examples with estimated probabilities P(y=1|x) and true labels y; the threshold that optimizes F1 depends on all examples jointly.]

Examples of Complex Output Spaces
• Noun-Phrase Co-reference
– Given a set of noun phrases x, predict a clustering y.
– Structural dependencies, since prediction has to be an equivalence relation.
– Correlation dependencies from interactions.
[Figure: x = noun phrases in "The policeman fed the cat. He did not know that he was late. The cat is called Peter."; y = a clustering of the noun phrases into coreferent sets.]

Overview
• Task: Learning to predict complex outputs
• Formalizing the problem
– Multi-class classification: Generative vs. Discriminative
– Compact models for problems with structured outputs
• Training models with structured outputs
– Generative training
– SVMs and large-margin training
– Conditional likelihood training
• Example 1: Learning to parse natural language
– Learning weighted context free grammar
• Example 2: Optimizing F1-score in text classification
– Predict vector of class labels
• Example 3: Learning to cluster
– Learning a clustering function that produces desired clusterings
• Summary & Reading
Why do we Need Research on Complex Outputs?
• Important applications for which conventional methods don't fit!
– Noun-phrase co-reference: two-step approaches of pair-wise classification and clustering as post-processing, e.g. [Ng & Cardie, 2002]
– Directly optimize complex loss functions (e.g. F1, AvgPrec)
• Improve upon existing methods!
– Natural language parsing: generative models like probabilistic context-free grammars
– SVM outperforms naïve Bayes for text classification [Joachims, 1998] [Dumais et al., 1998]

Precision/Recall Break-Even Point   Naïve Bayes   Linear SVM
Reuters                             72.1          87.5
WebKB                               82.0          90.3
Ohsumed                             62.4          71.6

• More flexible models!
– Avoid generative (independence) assumptions
– Kernels for structured input spaces and non-linear functions
• Transfer what we learned for classification and regression!
– Boosting
– Bagging
– Support Vector Machines
Structured Output Prediction as Multi-Class Classification
• Learning Task:
– Input Space: X (i.e. feature vectors, word sequence, etc.)
– Output Space: Y (i.e. class, tag sequence, parse tree, etc.)
– Training Data: S=((x1,y1), …, (xn,yn)) ∼ i.i.d. P(X,Y)
• Goal: Find h: X → Y with low expected loss
– Loss function: ∆(y,y') (penalty for predicting y' if y is correct)
– Expected loss (i.e. Risk): R(h) = ∫ ∆(y, h(x)) dP(x,y)
• Approach: view as multi-class classification task
– Every complex output y ∈ Y is one class

Natural Language Parsing as Multi-Class Classification
• Input Space X: Sequences of words
• Output Space Y: Each tree is a class
[Figure: x = "The dog chased the cat"; the candidate classes y1, y2, …, yk are all possible parse trees.]

Generative Model: Model P(X,Y)
• Model the joint distribution P(X,Y) = P(X) P(Y|X)
• Bayes' Decision Rule: For 0/1-Loss ∆(y,y') = 1 if y ≠ y', 0 else, the Optimal Decision is h(x) = argmaxy P(Y=y | X=x)
• Equivalent Reformulations: h(x) = argmaxy P(Y=y | X=x) = argmaxy P(X=x, Y=y) = argmaxy P(X=x | Y=y) P(Y=y)
• Learning: maximum likelihood (or MAP, or Bayesian)
– Assume model class P(X,Y|ω) with parameters ω ∈ Ω
– Find ω* = argmaxω ∏i P(xi, yi | ω)
Naïve Bayes' Classifier (Multivariate)
• Input Space X: Feature vectors x = (x(1), …, x(N))
• Output Space Y: {+1, -1}
• Model:
– Prior class probabilities P(Y=y)
– Class conditional model (one for each class): P(X=x | Y=y) = ∏j P(X(j)=x(j) | Y=y)
• Classification rule: h(x) = argmaxy P(Y=y) ∏j P(X(j)=x(j) | Y=y)

Estimating the Parameters of Naïve Bayes
• Count frequencies in training data
– n: number of training examples
– n+ / n-: number of pos/neg examples
– #(X(j)=x(j), y): number of times feature X(j) takes value x(j) for examples in class y
– |X(j)|: number of values attribute X(j) can take
• Estimating:
– P(Y): Maximum Likelihood Estimate, e.g. P(Y=+1) = n+ / n
– P(X|Y): Maximum Likelihood Estimate P(X(j)=x(j) | Y=y) = #(X(j)=x(j), y) / ny
– P(X|Y): Smoothing with Laplace estimate P(X(j)=x(j) | Y=y) = (#(X(j)=x(j), y) + 1) / (ny + |X(j)|)
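As a concrete illustration of the counting estimates above, here is a minimal sketch (not the tutorial's code; it assumes discrete feature values and training data given as (feature tuple, label) pairs) that estimates P(Y) by maximum likelihood, P(X(j)|Y) with the Laplace estimate, and classifies with the naïve Bayes rule:

import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    n = len(examples)
    class_count = Counter(y for _, y in examples)
    feat_count = defaultdict(Counter)   # (j, y) -> counts of the values feature j takes in class y
    values = defaultdict(set)           # j -> set of observed values of feature j
    for x, y in examples:
        for j, v in enumerate(x):
            feat_count[(j, y)][v] += 1
            values[j].add(v)
    prior = {y: c / n for y, c in class_count.items()}          # MLE for P(Y=y)
    def cond(j, v, y):                                          # Laplace estimate for P(X(j)=v | Y=y)
        return (feat_count[(j, y)][v] + 1) / (class_count[y] + len(values[j]))
    return prior, cond

def classify(x, prior, cond):
    # h(x) = argmax_y P(Y=y) * prod_j P(X(j)=x(j) | Y=y)
    return max(prior, key=lambda y: prior[y] * math.prod(cond(j, v, y) for j, v in enumerate(x)))

examples = [((1, 0), +1), ((1, 1), +1), ((0, 0), -1), ((0, 1), -1)]
prior, cond = train_naive_bayes(examples)
print(classify((1, 1), prior, cond))    # -> 1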
Discriminative Model: Model P(Y|X)
• Bayes' Decision Rule:
– Assume 0/1 Loss ∆(y,y') = 1 if y ≠ y', 0 else
– Optimal Decision: h(x) = argmaxy P(Y=y | X=x)
• Learning: maximum likelihood (or MAP, or Bayesian)
– Assume model class P(Y|X,ω) with parameters ω ∈ Ω
– Find ω* = argmaxω ∏i P(yi | xi, ω)
• Example: Logistic regression classifier
– Assume P(y | x, ω) = 1 / (1 + exp(-y ω·x))

Discriminative Model: Model Discriminant Function h Directly
• Discriminant Function: hω: X × Y → ℝ (prediction: h(x) = argmaxy hω(x,y))
• Learning: Empirical Risk Minimization (ERM)
– Assume class HΩ of discriminant functions hω: X → Y
– Empirical Risk (i.e. training error): ErrS(h) = (1/n) Σi ∆(yi, h(xi))
– Find h* = argminh∈HΩ ErrS(h)
• Consistency of Empirical Risk:
– For sufficiently "small" HΩ and "large" S: Rule h' ∈ H with best ErrS(h') has ErrP(h') close to minh∈H{ErrP(h)}
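To make the logistic regression example concrete, here is a minimal sketch (my illustration, not from the slides) of conditional maximum likelihood for the model P(y|x,ω) = 1/(1+exp(-y ω·x)), trained by gradient descent on the negative conditional log-likelihood:

import numpy as np

def train_logistic_regression(X, y, steps=500, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the negative conditional log-likelihood sum_i log(1 + exp(-y_i w.x_i))
        grad = -(X * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).sum(axis=0)
        w -= lr * grad / len(y)
    return w

# toy usage with a linearly separable 2D problem
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_logistic_regression(X, y)
print(w, np.sign(X @ w))   # predictions h(x) = sign(w.x)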
Generative vs. Discriminative Models
• Learning Task:
– Generator: Generate descriptions according to distribution P(X).
– Teacher: Assigns a value to each description based on P(Y|X).
• Generative Model
– Model P(X,Y) with distributions P(X,Y|ω)
– Find ω that best matches P(X,Y) on training data (e.g. MLE)
– Examples: naive Bayes, HMM
• Discriminative Model
– Model P(Y|X) with P(Y|X,ω); find ω e.g. via MLE; examples: Log. Reg., CRF
– Or model discriminant functions directly; find hω ∈ H with low training loss (e.g. Empirical Risk Minimization); examples: SVM, Dec. Tree

Challenges in Learning with Complex Outputs
• Approach: view as multi-class classification task
– Every complex output y ∈ Y is one class
[Figure: training examples such as x = "The bear chased the cat" with y = Det N V Det N; prediction assigns a new x to one of the classes.]
• Problem: Exponentially many classes!
– Generative Model: P(X,Y|ω)
– Discriminative Model: P(Y|X,ω)
– Discriminant Functions: hω: X × Y → ℝ
• Challenges
– How to compactly represent the model?
– How to do efficient inference with the model (i.e. compute the prediction argmaxy)?
– How to effectively estimate the model from data?
Multi-Class Linear Discriminant
• Learn one weight vector wy for each class y ∈ Y
• Linear discriminant function of the form: h(x) = argmaxy∈Y [ wy · Φ(x) ]
[Figure: x = "The dog chased the cat"; each candidate parse tree y1, y2, …, yk is one class with its own weight vector.]

Joint Feature Map
• Feature vector Φ(x,y) that describes the match between x and y
• Linear discriminant function of the form: h(x) = argmaxy∈Y w · Φ(x,y)
[Figure: x = "The dog chased the cat"; candidate trees y1, y2, …, yk are scored by w · Φ(x,yk).]

Natural Language Parsing
• Input Space X: Sequences of words
• Output Space Y: Trees over sequences
• Problem: How to compute predictions efficiently?
[Figure: x = "The dog chased the cat" with candidate trees y1, y2, …, yk.]
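Before turning to the efficient decoding algorithms, here is a tiny sketch (my illustration, not from the slides) of the prediction rule h(x) = argmaxy w·Φ(x,y) when the candidate set can be enumerated explicitly; the Viterbi and CKY versions below avoid this enumeration. The candidate generator and feature map are made-up stand-ins:

import numpy as np

def predict(x, candidates, phi, w):
    """Return the candidate output with the highest linear score w . Phi(x, y)."""
    return max(candidates(x), key=lambda y: float(w @ phi(x, y)))

# toy usage: outputs are tag sequences of the same length as the 2-word input
def candidates(x):
    tags = ["Det", "N"]
    return [(t1, t2) for t1 in tags for t2 in tags]

def phi(x, y):
    # one indicator feature per (tag, position) pair, purely illustrative
    feats = np.zeros(4)
    feats[("Det", "N").index(y[0])] = 1
    feats[2 + ("Det", "N").index(y[1])] = 1
    return feats

w = np.array([1.0, 0.0, 0.0, 1.0])     # prefers Det at position 1, N at position 2
print(predict(["the", "dog"], candidates, phi, w))   # ('Det', 'N')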
Joint Feature Map for Sequences
• Linear Chain Model
– Only local dependencies
– Score for each adjacent label/label and word/label pair
– Find highest scoring sequence ⇒ Viterbi
• Example: x = "The bear chased the cat", y = Det N V Det N
Φ(x,y) = ( …, N→V: 1, Det→V: 0, Det→N: 2, V→Det: 1, …, Det→bear: 0, Det→the: 2, N→bear: 1, V→chased: 1, N→cat: 1, … )
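A small sketch of this linear-chain setup (my own toy code with an assumed three-tag set and illustrative weights, not the tutorial's implementation): Φ(x,y) counts tag-tag transitions and tag-word emissions, and Viterbi finds the highest-scoring tag sequence under a weight vector w over those features.

from collections import Counter

TAGS = ["Det", "N", "V"]

def joint_feature_map(words, tags):
    phi = Counter()
    for prev, cur in zip(tags, tags[1:]):
        phi[("trans", prev, cur)] += 1            # e.g. Det -> N
    for word, tag in zip(words, tags):
        phi[("emit", tag, word)] += 1             # e.g. Det -> the
    return phi

def score(w, phi):
    return sum(w.get(f, 0.0) * c for f, c in phi.items())

def viterbi(words, w):
    # best[t] = (score, tag sequence) of the best partial sequence ending in tag t
    best = {t: (w.get(("emit", t, words[0]), 0.0), [t]) for t in TAGS}
    for word in words[1:]:
        new = {}
        for t in TAGS:
            s, seq = max(
                (best[p][0] + w.get(("trans", p, t), 0.0) + w.get(("emit", t, word), 0.0), best[p][1])
                for p in TAGS)
            new[t] = (s, seq + [t])
        best = new
    return max(best.values())

# illustrative weights: reward the transitions/emissions from the example above
w = {("trans", "Det", "N"): 1.0, ("trans", "N", "V"): 1.0, ("trans", "V", "Det"): 1.0,
     ("emit", "Det", "the"): 1.0, ("emit", "N", "bear"): 1.0, ("emit", "V", "chased"): 1.0,
     ("emit", "N", "cat"): 1.0}
words = "the bear chased the cat".split()
print(viterbi(words, w))                           # highest-scoring (score, tag sequence)
print(score(w, joint_feature_map(words, ["Det", "N", "V", "Det", "N"])))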
Joint Feature Map for Trees
• Weighted Context Free Grammar
– Each rule (e.g. S → NP VP) has a weight
– Score of a tree is the sum of its weights
– Find highest scoring tree ⇒ CKY Parser
• Example: x = "The dog chased the cat", y = parse tree with rules S → NP VP, NP → Det N, VP → V NP
Φ(x,y) = ( …, S→NP VP: 1, S→NP: 0, NP→Det N: 2, VP→V NP: 1, …, Det→dog: 0, Det→the: 2, N→dog: 1, V→chased: 1, N→cat: 1, … )
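To make the "find highest scoring tree" step concrete, here is a minimal Viterbi-style CKY sketch (my toy code with an illustrative grammar and weights, assuming Chomsky normal form; it is not Mark Johnson's parser used in the experiments below): the score of a tree is the sum of its rule weights, and dynamic programming over spans finds the best-scoring S.

from collections import defaultdict

binary_rules = {  # (left_child, right_child) -> list of (parent, weight); illustrative weights
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 0.8)],
    ("V", "NP"): [("VP", 0.9)],
}
lexical_rules = {  # word -> list of (tag, weight)
    "the": [("Det", 0.5)], "dog": [("N", 0.5)], "cat": [("N", 0.5)], "chased": [("V", 0.5)],
}

def cky_parse(words):
    n = len(words)
    # chart[i][j] maps a nonterminal to (best score, backpointer) for the span words[i:j]
    chart = [[defaultdict(lambda: (float("-inf"), None)) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag, weight in lexical_rules.get(w, []):
            chart[i][i + 1][tag] = (weight, w)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for left, (ls, _) in chart[i][k].items():
                    for right, (rs, _) in chart[k][j].items():
                        for parent, weight in binary_rules.get((left, right), []):
                            score = ls + rs + weight
                            if score > chart[i][j][parent][0]:
                                chart[i][j][parent] = (score, (k, left, right))
    return chart[0][n].get("S", (float("-inf"), None))

print(cky_parse("the dog chased the cat".split()))  # (best score for S, backpointer)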
Connection to Graphical Models
• Hidden Markov Model:
– Assumptions: P(x,y) = ∏i P(yi | yi-1) P(xi | yi)
– Rule: the HMM decision rule can be written as a linear discriminant over Φ(x,y),
with wab = -log[P(Yc=a | Yp=b)] and wcd = -log[P(Xc=c | Yc=d)],
and Φ(x,y) the histogram of state transitions and emissions.
[Figure: x = "The bear chased the cat", y = Det N V Det N.]
Overview
• Task: Learning to predict complex outputs
• Formalizing the problem
– Multi-class classification: Generative vs. Discriminative
– Compact models for problems with structured outputs
• Training models with structured outputs
– Generative training
– SVMs and large-margin training
– Conditional likelihood training
• Example 1: Learning to parse natural language
– Learning weighted context free grammar
• Example 2: Optimizing F1-score in text classification
– Predict vector of class labels
• Example 3: Learning to cluster
– Learning a clustering function that produces desired clusterings
• Summary & Reading
Training Generative Model: HMM
• Assume:
– model class P(X,Y|ω) with parameters ω ∈ Ω
• Maximum Likelihood (or alternative estimator)
– Find ω* = argmaxω ∏i P(xi, yi | ω)
• Example: Hidden Markov Model
– Closed-form solutions:
P(Yc = ya | Yp = yb) = #ofTimesStateAFollowsStateB / #ofTimesStateBOccurs
P(Xc = xa | Yc = yb) = #ofTimesOutputAIsObservedInStateB / #ofTimesStateBOccurs
– Need for smoothing the estimates (e.g. max a posteriori)
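A small sketch of these counting estimates (my toy code, assuming training data as (word sequence, tag sequence) pairs; add-one Laplace smoothing stands in for the MAP smoothing mentioned above, and the denominators follow the slide's count of state occurrences):

from collections import Counter

def train_hmm(tagged_sequences):
    trans, emit, tag_count = Counter(), Counter(), Counter()
    tags, words = set(), set()
    for word_seq, tag_seq in tagged_sequences:
        for prev, cur in zip(tag_seq, tag_seq[1:]):
            trans[(prev, cur)] += 1
        for word, tag in zip(word_seq, tag_seq):
            emit[(tag, word)] += 1
            tag_count[tag] += 1
        tags.update(tag_seq)
        words.update(word_seq)
    # P(Yc = a | Yp = b) ~ (#times a follows b + 1) / (#times b occurs + |tags|)
    p_trans = {(b, a): (trans[(b, a)] + 1) / (tag_count[b] + len(tags))
               for b in tags for a in tags}
    # P(Xc = w | Yc = t) ~ (#times w observed in state t + 1) / (#times t occurs + |words|)
    p_emit = {(t, w): (emit[(t, w)] + 1) / (tag_count[t] + len(words))
              for t in tags for w in words}
    return p_trans, p_emit

data = [("the bear chased the cat".split(), ["Det", "N", "V", "Det", "N"])]
p_trans, p_emit = train_hmm(data)
print(p_trans[("Det", "N")], p_emit[("N", "bear")])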
Support Vector Machine [Vapnik et al.]
• Training Examples: S = ((x1,y1), …, (xn,yn)) with yi ∈ {-1,+1}
• Hypothesis Space: h(x) = sign(w·x + b) with w ∈ ℝN, b ∈ ℝ
• Training: Find hyperplane with minimal ||w|| (i.e. maximal margin)
• Optimization Problem:
– Hard Margin (separable): min ½ w·w subject to yi(w·xi + b) ≥ 1 for all i
– Soft Margin (training error): min ½ w·w + C Σi ξi subject to yi(w·xi + b) ≥ 1 - ξi and ξi ≥ 0 for all i

Multi-Class SVM [Crammer & Singer 02]
• Training Examples: S = ((x1,y1), …, (xn,yn)) with yi ∈ {1, …, k}
• Hypothesis Space: h(x) = argmaxy wy·x with one weight vector wy per class
• Training: Find w1, …, wk that solve
min ½ Σy wy·wy + C Σi ξi subject to wyi·xi - wy·xi ≥ 1 - ξi for all i and all y ≠ yi

Multi-Class SVM [Crammer & Singer 02]
• Joint features Φ(x,y) describe the match between x and y
• Learn weights w so that w·Φ(x,y) is max for the correct y
[Figure: x = "The dog chased the cat"; candidate parse trees y1, y2, …, yk; the score of the correct tree must exceed the score of every other tree by a margin δ.]

Structural Support Vector Machine
• Training Examples: (x1,y1), …, (xn,yn) with structured outputs yi
[Figure: as above, with one margin constraint per incorrect tree.]
Structural Support Vector Machine
• Joint features Φ(x,y) describe the match between x and y
• Learn weights w so that w·Φ(xi,yi) is max for the correct yi
• Hard-margin optimization problem:
min ½ w·w subject to w·Φ(xi,yi) - w·Φ(xi,y) ≥ 1 for all i and all y ≠ yi

Loss Functions: Soft-Margin Struct SVM
• Loss function ∆(yi,y) measures the match between target and prediction.
• Soft-margin optimization problem (margin rescaling):
min ½ w·w + C Σi ξi subject to w·Φ(xi,yi) - w·Φ(xi,y) ≥ ∆(yi,y) - ξi for all i and all y ≠ yi
• Lemma: The training loss Σi ∆(yi, h(xi)) is upper bounded by Σi ξi.

Training Approach 1: Factored QP [Taskar et al. 03/04]
• Assume:
– Linearly decomposable loss function
– Linear program solution is integral
• Algorithm:
– Min-Max Formulation: for each example, the exponentially many margin constraints collapse into maxy [ ∆(yi,y) + w·Φ(xi,y) ] - w·Φ(xi,yi) = ξi
– Plug in linear program for the "max"
– Quadratic program can be rewritten so that it has
• Polynomially many variables
• Polynomially many constraints
Training Approach 2: Cutting-Plane Algorithm for Structural SVM
[Jo 03] [Tsochantaridis et al. 04] [Tsochantaridis et al. 05]
• Input: training examples S = ((x1,y1), …, (xn,yn)), tolerance ε
• Start with an empty working set of constraints
• REPEAT
– FOR each training example (xi,yi)
• Find most violated constraint: compute ŷ = argmaxy [ ∆(yi,y) + w·Φ(xi,y) ]
• IF the constraint for ŷ is violated by more than ε
– Add constraint to working set
– Optimize StructSVM over the working set
• ENDIF
– ENDFOR
• UNTIL the working set has not changed during an iteration

Training Approach 2: Polynomial Sparsity Bound
[AltHo03] [Jo03] [TsoJoHoAl05]
• Theorem: The sparse-approximation algorithm finds a solution to the soft-margin optimization problem after adding at most a polynomial number of constraints (in 1/ε, C, the bound on the loss, and the bound on the norm of the feature vectors) to the working set, so that the Kuhn-Tucker conditions are fulfilled up to a precision ε. The loss and the feature-vector norms have to be bounded.
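A toy sketch of the cutting-plane loop (my own code, not the SVMstruct implementation): it assumes a finite output space Y small enough for a brute-force argmax, uses the margin-rescaling constraints from above, and replaces the inner QP solve with a crude subgradient routine (reoptimize, a stand-in helper), so it illustrates the control flow rather than the exact solver.

import numpy as np

def cutting_plane_train(examples, Y, phi, loss, C=1.0, eps=1e-3, max_rounds=50):
    """Toy n-slack cutting-plane loop for a structural SVM (margin rescaling)."""
    dim = phi(*examples[0]).shape[0]
    w = np.zeros(dim)
    working = {i: [] for i in range(len(examples))}   # constraints per example
    for _ in range(max_rounds):
        added = False
        for i, (x, y) in enumerate(examples):
            # most violated constraint: ybar = argmax_y' [ loss(y,y') + w . phi(x,y') ]
            ybar = max(Y, key=lambda yp: loss(y, yp) + w @ phi(x, yp))
            xi = max([0.0] + [l - w @ d for d, l in working[i]])   # current slack
            violation = loss(y, ybar) - w @ (phi(x, y) - phi(x, ybar))
            if violation > xi + eps:
                working[i].append((phi(x, y) - phi(x, ybar), loss(y, ybar)))
                w = reoptimize(working, dim, C)        # stand-in for the QP solver
                added = True
        if not added:
            break
    return w

def reoptimize(working, dim, C, steps=300):
    """Crude subgradient descent on 1/2||w||^2 + C * sum_i max_j max(0, l_ij - w . d_ij)."""
    w = np.zeros(dim)
    for t in range(steps):
        grad = w.copy()
        for cons in working.values():
            if not cons:
                continue
            d, l = max(cons, key=lambda c: c[1] - w @ c[0])   # most violated constraint of example i
            if l - w @ d > 0:
                grad -= C * d
        w -= (1.0 / (1 + t)) * grad
    return w

# toy usage: 3 classes, phi(x, y) places the 2-dim input into the block for class y
def phi(x, y):
    out = np.zeros(6)
    out[2 * y: 2 * y + 2] = x
    return out

examples = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1), (np.array([-1.0, -1.0]), 2)]
w = cutting_plane_train(examples, Y=[0, 1, 2], phi=phi, loss=lambda y, yp: 0.0 if y == yp else 1.0)
print([max([0, 1, 2], key=lambda yp: w @ phi(x, yp)) for x, _ in examples])   # predicted classes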
More Expressive Features
• So far: features count the parts of y (e.g. rule occurrences such as S → NP VP)
• Linear composition: Φ(x,y) decomposes into a sum of feature vectors over the parts of y
• General form: each part may be described by arbitrary features of x and that part of y (see [Taskar et al. 05], [Lafferty et al. 01])
• Example:
[Figure: x = "The dog chased the cat" with its parse tree y.]

Experiment: Natural Language Parsing [Tsochantaridis et al. 05]
• Implementation
– Implemented Sparse-Approximation Algorithm in SVMlight
– Incorporated modified version of Mark Johnson's CKY parser
– Learned weighted CFG
• Data
– Penn Treebank sentences of length at most 10 (starting from POS tags)
– Train on Sections 2-22: 4098 sentences
– Test on Section 23: 163 sentences
Applying Structural SVM to New Problem
• Application specific
– Loss function ∆(y,y')
– Representation Φ(x,y)
– Algorithms to compute the prediction argmaxy w·Φ(x,y) and the most violated constraint
• Implementation SVM-struct: http://svmlight.joachims.org
– Context-free grammars
– Sequence alignment
– Classification with multivariate loss (e.g. F1, ROC Area)
– General API for other problems

Conditional Random Field (CRF)
• Assume:
– model class P(Y|X,ω) with parameters ω ∈ Ω
– In particular, P(y|x,ω) = exp(ω·Φ(x,y)) / Σy' exp(ω·Φ(x,y'))
• Training: Maximum A Posteriori
– Objective: conditional log-likelihood Σi log P(yi|xi,ω) plus a regularization term (e.g. Gaussian prior on ω)
– Gradient: Σi [ Φ(xi,yi) - E over P(y|xi,ω) of Φ(xi,y) ] plus the gradient of the regularizer
– Methods: iterative scaling, quasi Newton, conjugate gradient
[Sha & Pereira 03] [Wallach 02] [Malouf 02] [Minka 03]

Applying CRF to New Problem
• Application specific
– Representation Φ(x,y)
– Algorithms to compute the normalization Σy' exp(ω·Φ(x,y')) and the expectation of Φ under P(y|x,ω)
• Implementation of CRF: Mallet, http://mallet.cs.umass.edu/
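A small sketch of the CRF objective and gradient (my toy code, not Mallet): for an output space Y small enough to enumerate, it computes the MAP objective (negative conditional log-likelihood plus a Gaussian-prior term with an assumed variance sigma2) and its gradient, the difference between empirical and expected feature vectors; in practice the sum over Y is replaced by dynamic programming. The returned pair can be fed to any gradient-based optimizer.

import numpy as np

def crf_neg_log_likelihood(w, examples, Y, phi, sigma2=10.0):
    nll = (w @ w) / (2.0 * sigma2)          # Gaussian prior => MAP estimation
    grad = w / sigma2
    for x, y in examples:
        feats = [phi(x, yp) for yp in Y]
        scores = np.array([w @ f for f in feats])
        m = scores.max()
        log_Z = m + np.log(np.exp(scores - m).sum())    # log partition function
        probs = np.exp(scores - log_Z)
        nll -= w @ phi(x, y) - log_Z                    # subtract log P(y | x, w)
        grad -= phi(x, y) - sum(p * f for p, f in zip(probs, feats))  # empirical - expected features
    return nll, grad

# toy usage: 3 classes, phi(x, y) places the 2-dim input into the block for class y
Y = [0, 1, 2]
def phi(x, y):
    out = np.zeros(6); out[2 * y: 2 * y + 2] = x; return out
examples = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
print(crf_neg_log_likelihood(np.zeros(6), examples, Y, phi))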
Relationship CRF and Structural SVM
• Objective Functions:
– CRF: regularized conditional log-likelihood (log loss)
– Structural SVM: regularized structured hinge loss (margin violations)
• Basic Cases for two classes
– CRF: Regularized logistic regression
– Structural SVM: Binary classification SVM
Overview
• Task: Learning to predict complex outputs
• Formalizing the problem
– Multi-class classification: Generative vs. Discriminative
– Compact models for problems with structured outputs
• Training models with structured outputs
– Generative training
– SVMs and large-margin training
– Conditional likelihood training
• Example 1: Learning to parse natural language
– Learning weighted context free grammar
• Example 2: Optimizing F1-score in text classification
– Predict vector of class labels
• Example 3: Learning to cluster
– Learning a clustering function that produces desired clusterings
• Summary & Reading
Examples of Complex Output Spaces
• Non-Standard Performance Measures (e.g. F1-score, Lift)
– F1-score: harmonic average of precision and recall
– New example vector x8. Predict y8=1, if P(y8=1|x8)=0.4? ⇒ Depends on the other examples!
[Figure: a set of examples x with labels y = (-1, -1, -1, +1, -1, -1, +1); the prediction that optimizes F1 depends on all examples jointly.]
Struct SVM for Optimizing F1-Score [Jo 05]
• Loss Function
– based on the F1-score of the prediction against the target, ∆(y,y') = 1 - F1(y,y'), computed from their contingency table
• Representation
– Treat all n examples as a single joint input x = (x1, …, xn) with joint output y = (y1, …, yn) ∈ {-1,+1}^n
– Joint feature map Φ(x,y) = Σi yi xi
• Prediction
– argmaxy w·Φ(x,y), which decomposes into yi = sign(w·xi) for each example
• Find most violated constraint
– argmaxy [ ∆(ytarget, y) + w·Φ(x,y) ]
– Only n² different contingency tables ⇒ search brute force
[Figure: example scores w·xi (e.g. -0.1, -0.7, -3.1, +1.2, -0.3, -1.2, +0.3) thresholded by sign to obtain the predicted labels.]
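A toy sketch of the brute-force search over contingency tables (my own code; it assumes the loss 1 - F1 and that the scores si = w·xi are given): for each of the O(n²) contingency tables, i.e. a positives and b negatives labeled +1, the labeling that maximizes the linear term puts the highest-scoring examples into the +1 slots.

import numpy as np

def f1_loss(tp, fp, fn):
    return 1.0 if tp == 0 else 1.0 - 2.0 * tp / (2.0 * tp + fp + fn)

def most_violated_labeling(scores, y_true):
    pos = sorted([s for s, y in zip(scores, y_true) if y > 0], reverse=True)
    neg = sorted([s for s, y in zip(scores, y_true) if y < 0], reverse=True)
    best, best_val = None, -np.inf
    for a in range(len(pos) + 1):          # a positives labeled +1
        for b in range(len(neg) + 1):      # b negatives labeled +1
            # value = loss(contingency table) + w . Phi(x, ybar),
            # where w . Phi = sum_i ybar_i * s_i is maximized by giving +1 to the top scores
            lin = sum(pos[:a]) - sum(pos[a:]) + sum(neg[:b]) - sum(neg[b:])
            val = f1_loss(tp=a, fp=b, fn=len(pos) - a) + lin
            if val > best_val:
                best_val, best = val, (a, b)
    return best, best_val

scores = [-0.1, -0.7, -3.1, +1.2, -0.3, -1.2, +0.3]     # s_i = w . x_i (illustrative)
y_true = [-1, -1, -1, +1, -1, -1, +1]
print(most_violated_labeling(scores, y_true))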
Experiment: Text Classification
• Dataset: Reuters-21578 (ModApte)
– 9603 training / 3299 test examples, 90 categories
– TFIDF unit vectors (no stemming, no stopword removal)
• Experiment Setup
– Classification SVM with optimal C in hindsight
– Linear Cost SVM [Morik et al., 1999] with C+/C- via 2-CV
– F1-loss SVM with C via 2-CV
• Results
[Jo 05]
Multivariate SVM Generalizes Classification SVM
• Theorem: The solutions of the multivariate SVM with number of errors as the loss function and an (unbiased) classification SVM are equal.
– Multivariate SVM optimizing Error Rate ⇔ Classification SVM (unbiased)
[Jo 05]
Overview
• Task: Learning to predict complex outputs
• Formalizing the problem
– Multi-class classification: Generative vs. Discriminative
– Compact models for problems with structured outputs
• Training models with structured outputs
– Generative training
– SVMs and large-margin training
– Conditional likelihood training
• Example 1: Learning to parse natural language
– Learning weighted context free grammar
• Example 2: Optimizing F1-score in text classification
– Predict vector of class labels
• Example 3: Learning to cluster
– Learning a clustering function that produces desired clusterings
• Summary & Reading
Learning to Cluster
• Noun-Phrase Co-reference
– Given a set of noun phrases x, predict a clustering y.
– Structural dependencies, since prediction has to be an equivalence relation.
– Correlation dependencies from interactions.
[Figure: x = noun phrases in "The policeman fed the cat. He did not know that he was late. The cat is called Peter."; y = a clustering of the noun phrases into coreferent sets.]
[Finley & Jo 05], see also [McCallum & Wellner 04]
Struct SVM for Supervised Clustering
• Representation
– y is an n×n matrix of pairwise co-membership indicators yij ∈ {0,1}
– y is reflexive (yii=1), symmetric (yij=yji), and transitive (if yij=1 and yjk=1, then yik=1)
– Joint feature map built from pairwise features of the items, Φ(x,y) = Σi,j yij φ(xi,xj)
[Figure: a target clustering y and an alternative clustering y', each shown as a block-structured 0/1 matrix.]
• Loss Function
– measures the disagreement between target clustering y and predicted clustering y' (e.g. the number of pairs on which they differ)
• Prediction
– argmaxy w·Φ(x,y) over valid clusterings
– NP hard, use linear relaxation instead [Demaine & Immorlica, 2003]
• Find most violated constraint
– NP hard, use linear relaxation instead [Demaine & Immorlica, 2003]
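A small sketch of this pairwise representation (my toy code, not Finley & Joachims' implementation; the pairwise features are made up): a clustering is stored as a 0/1 co-membership matrix, the joint feature map sums pairwise feature vectors over co-clustered pairs, and the loss counts the pairs on which two clusterings disagree.

import numpy as np

def comembership(assignments):
    """Cluster assignments (e.g. [0, 0, 1]) -> reflexive/symmetric/transitive 0/1 matrix."""
    a = np.asarray(assignments)
    return (a[:, None] == a[None, :]).astype(int)

def joint_feature_map(pair_features, y):
    """Phi(x, y) = sum over pairs i < j with y_ij = 1 of phi(x_i, x_j)."""
    n = y.shape[0]
    dim = next(iter(pair_features.values())).shape[0]
    phi = np.zeros(dim)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i, j]:
                phi += pair_features[(i, j)]
    return phi

def pairwise_loss(y, y_pred):
    """Number of pairs (i < j) on which the two clusterings disagree."""
    iu = np.triu_indices_from(y, k=1)
    return int(np.sum(y[iu] != y_pred[iu]))

# toy usage: 3 noun phrases, 2-dimensional pairwise features (hypothetical values)
pair_features = {(0, 1): np.array([1.0, 0.0]), (0, 2): np.array([0.0, 1.0]), (1, 2): np.array([0.5, 0.5])}
y = comembership([0, 0, 1])        # {np0, np1}, {np2}
y_pred = comembership([0, 1, 1])   # {np0}, {np1, np2}
print(joint_feature_map(pair_features, y), pairwise_loss(y, y_pred))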
Summary
• Learning to predict structured and interdependent outputs
– Discriminant function: h(x) = argmaxy w·Φ(x,y)
• Training:
– Generative
– Structural SVM
– Conditional Random Field
• Examples
– Learning to predict trees (natural language parsing)
– Optimizing non-standard performance measures (imbalanced classes)
– Learning to cluster (noun-phrase coreference resolution)
• Software:
– SVMstruct: http://svmlight.joachims.org/
– Mallet: http://mallet.cs.umass.edu/
Reading
• Generative training
– Hidden-Markov models [Manning & Schuetze, 1999]
– Probabilistic context-free grammars [Manning & Schuetze, 1999]
– Markov random fields [Geman & Geman, 1984]
– Etc.
• Discriminative training
– Multivariate output regression [Izeman, 1975] [Breiman & Friedman, 1997]
– Kernel Dependency Estimation [Weston et al. 2003]
– Conditional HMM [Krogh, 1994]
– Transformer networks [LeCun et al, 1998]
– Conditional random fields [Lafferty et al., 2001] [Sutton & McCallum, 2005]
– Perceptron training of HMM [Collins, 2002]
– Structural SVMs / Maximum-margin Markov networks [Taskar et al., 2003]
[Tsochantaridis et al., 2004, 2005] [Taskar 2004]