Global Inference for Sentence Compression
An Integer Programming Approach
James Clarke and Mirella Lapata
School of Informatics
University of Edinburgh
S6.864: Advanced Natural Language Processing, 2012
Introduction

What is Sentence Compression?
The task of producing a summary of a single sentence while:
using fewer words than the original sentence
preserving the most important information
remaining grammatical
using reordering, deletion, substitution, and insertion operations
Simplification (Knight and Marcu, 2002)
Given an input sentence of words W = w1, w2, . . . , wn, form a
compression by dropping any subset of these words.
Example Compression
Source
Prime Minister Tony Blair today insisted the case for holding terrorism
suspects without trial was “absolutely compelling” as the government
published new legislation allowing detention for 90 days without
charge.
Target
Tony Blair insisted the case for holding terrorism suspects without trial
was “compelling”.
Outline
Motivation
Models and Inference
1 Integer Linear Programming
2 ILP for Compression
   Models
   Constraints
3 Experiments
   Evaluation Set-up
   Results
4 Conclusions
Motivation

Applications
Summarization: current systems contain manually written
sentence compression rules (Jing, 2000).
Subtitle generation from spoken transcripts
(Vandeghinste and Pan, 2004).
Display of text on small screens
(Corston-Oliver, 2001).
Audio scanning devices for the blind (Grefenstette, 1998).
Creating smaller indices for information retrieval
(Corston-Oliver and Dolan, 1999).
Models and Inference
Previous Work

Methods
Supervised (generative): Knight & Marcu (2002), Turner & Charniak (2005),
Galley & McKeown (2007)
Supervised (discriminative): Knight & Marcu (2002), Riezler et al. (2003),
Nguyen et al. (2004), McDonald (2006), Cohn & Lapata (2007)
Unsupervised: Knight & Marcu (2002), Hori & Furui (2004),
Charniak & Turner (2005)
Compression corpora
Supervised approaches rely on a parallel corpus.
There is no ‘natural’ resource of original-compressed sentences.
Ask humans to produce compressions (Clarke and Lapata, 2006).
Or create corpus automatically (Knight and Marcu, 2002).
Document
. . . blah blah blah. The documentation is excellent – it is clearly
written with numerous drawings, cautions and tips, and includes an
entire section on troubleshooting. Blah . . .

Abstract
Blah blah blah. The documentation is excellent. Blah blah blah . . .
Unsupervised Compression
A language model (Knight & Marcu, 2002).
Easy to learn, requires no parallel corpus.
Prefers frequent word combinations.
Normally a baseline.
y* = argmax_y Σ_{i=1}^{n} P(x_i | x_{i−1}, x_{i−2})
Semi-supervised Compression
Significance Model (based on Hori & Furui, 2004).
Adds more corpus knowledge to the language model.
Prefers frequent word combinations and important words.
Significance score similar to tf/idf.
λ is tuned on small parallel corpus.
y* = argmax_y Σ_{i=1}^{n} P(x_i | x_{i−1}, x_{i−2}) + λ Σ_{i=1}^{n} I(x_i)
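To make the two objectives concrete, here is a minimal sketch in Python; trigram_logprob, significance and the sentence padding are stand-ins of our own, not the talk's actual models, and log-probabilities are used where the slides write P(·).

from itertools import count  # (not needed; plain functions suffice)

def trigram_logprob(w1, w2, w3):
    return -1.0                     # stand-in for log P(w3 | w1, w2)

def significance(word):
    return 0.0                      # stand-in for the significance score I(word)

def score(compression, lam=0.0):    # lam = 0 gives the plain language-model objective
    padded = ["<s>", "<s>"] + compression + ["</s>"]
    lm = sum(trigram_logprob(padded[i - 2], padded[i - 1], padded[i])
             for i in range(2, len(padded)))
    return lm + lam * sum(significance(w) for w in compression)

print(score("Tony Blair insisted the case was compelling".split(), lam=1.0))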
Fully Supervised Compression
Syntax-tree pruning model (McDonald, 2006).
Large margin learning method optimises correct deletion
decisions, requires large parallel corpus.
High dimensional feature representation.
State-of-the-art performance.

y* = argmax_y s(x, y)
   = argmax_y Σ_{j=2}^{|y|} s(x, L(y_{j−1}), L(y_j))
   = argmax_y Σ_{j=2}^{|y|} w · f(x, L(y_{j−1}), L(y_j))

L(y_j) maps a word in the compression to its index in the original.
Features over adjacent words in the compression and the words dropped.
w feature weights learnt using MIRA (Crammer and Singer, 2003).
Discriminative Features

Mary1/NNP saw2/VBD Ralph3/NNP on4/IN Tuesday5/NNP after6/IN lunch7/NN

Placing Ralph and after adjacent: f(x, 3, 6)

Word/POS Features
POS tags of adjacent words in compression.
POS tags of words dropped.
Bracketing features.

Syntactic features
Dependency tree features.
Phrase structure features.
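For illustration only (these are not McDonald's actual feature templates), a sketch of a feature function for placing two words adjacent in the compression; it uses 0-based indices, so the slide's f(x, 3, 6) corresponds to pair_features(words, pos, 2, 5).

def pair_features(words, pos, i, j):
    feats = {f"adj_words={words[i].lower()}_{words[j].lower()}": 1.0,
             f"adj_pos={pos[i]}_{pos[j]}": 1.0}           # the two words kept adjacent
    for k in range(i + 1, j):                             # every word dropped between them
        key = f"dropped_pos={pos[k]}"
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats

words = "Mary saw Ralph on Tuesday after lunch".split()
pos = ["NNP", "VBD", "NNP", "IN", "NNP", "IN", "NN"]
print(pair_features(words, pos, 2, 5))   # Ralph and after made adjacent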
The Anatomy of a Compression System
Decoding finds the best solution given the model.
Search to find y* from all possible y's.
There are 2^n possible compressions of a sentence with n words.
Search, decoding, inference: different names for the same step.
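A toy illustration of the search-space size: enumerating every subset of the earlier seven-word example sentence already yields 2^7 = 128 candidate compressions.

from itertools import product

words = "Mary saw Ralph on Tuesday after lunch".split()
candidates = [[w for w, keep in zip(words, mask) if keep]
              for mask in product([0, 1], repeat=len(words))]
print(len(candidates))   # 2**7 = 128 candidate compressions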
Decoding
Uses dynamic programming (decode left to right).
Decisions made locally, heuristics reduce the search space.
Long range dependencies are not modelled:
verbs must have their arguments
pronouns must be retained
Cannot ensure that compressions are semantically valid:
negation must be retained
Discourse constraints are difficult to enforce:
do not drop topically important words
Goal: find an optimal solution with global constraints.
Integer Linear Programming
What is Integer Linear Programming?
Optimisation Technique.
Find minimum or maximum value of a linear objective function.
With respect to a set of constraints.
ILP is an extension of Linear Programming; every LP has:
decision variables
a linear objective function
constraints on the variables
Linear Programming: Telfa Example
Telfa Corporation manufactures tables and chairs.
A table requires 1 hour of labour and 9 square board feet of wood.
A chair requires 1 hour of labour and 5 square board feet of wood.
They have 6 hours of labour and 45 square board feet of wood.
Each table generates $8 of profit and each chair $5.
Goal: Maximise profit.
(from Winston and Venkataramanan, 2003)
Telfa Example: LP Model

Decision Variables
x1 = tables manufactured
x2 = chairs manufactured

Objective function
Profit = 8x1 + 5x2

Constraints
Labour constraint:     x1 + x2 ≤ 6
Wood constraint:       9x1 + 5x2 ≤ 45
Variable constraints:  x1 ≥ 0, x2 ≥ 0
Solving LP Models

[Figure: the LP's feasible region in the (x1, x2) plane, bounded by the labour
constraint x1 + x2 = 6, the wood constraint 9x1 + 5x2 = 45, and x1, x2 ≥ 0.]

Feasible Region
Region that contains all the points that satisfy the LP constraints.
A polyhedral convex set.

Isoprofit Line
A line on which all points have the same objective function value
(e.g. z = 10, z = 20, z = 36).

Optimal Solution
The point within the feasible region that has maximum objective
function value.

Extreme Point
The intersections of lines that form boundaries of the feasible region.

The simplex algorithm
Moves from one extreme point to an adjacent extreme point
(Dantzig, 1963).

Telfa Problem Solution
z = 41.25, x1 = 3.75, x2 = 2.25

We cannot build a fraction of a chair or table!
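As a quick check of the numbers above, the LP relaxation can be solved with scipy.optimize.linprog (a tooling choice of ours, not part of the talk):

from scipy.optimize import linprog

# Telfa LP relaxation: linprog minimises, so negate the profit coefficients.
res = linprog(c=[-8, -5],
              A_ub=[[1, 1], [9, 5]], b_ub=[6, 45],   # labour and wood constraints
              bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # approximately [3.75, 2.25] and 41.25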
Integer Linear Programming
Integer linear programs are LP problems in which some or all of the
variables must be non-negative integers.

Telfa LP model
max z = 8x1 + 5x2                 (Objective function)
subject to (s.t.)
    x1 + x2 ≤ 6                   (Labour constraint)
    9x1 + 5x2 ≤ 45                (Wood constraint)
    x1 ≥ 0; x2 ≥ 0

Telfa ILP model
max z = 8x1 + 5x2                 (Objective function)
subject to (s.t.)
    x1 + x2 ≤ 6                   (Labour constraint)
    9x1 + 5x2 ≤ 45                (Wood constraint)
    x1 ≥ 0; x2 ≥ 0
    x1, x2 integer
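The same ILP is a few lines in an off-the-shelf modelling library; a minimal sketch using PuLP (our choice here; the experiments in the talk use lp_solve):

from pulp import LpProblem, LpVariable, LpMaximize, LpInteger, value

# Telfa ILP: maximise profit subject to the labour and wood constraints,
# with integer numbers of tables and chairs.
prob = LpProblem("telfa", LpMaximize)
x1 = LpVariable("tables", lowBound=0, cat=LpInteger)
x2 = LpVariable("chairs", lowBound=0, cat=LpInteger)

prob += 8 * x1 + 5 * x2          # objective: profit
prob += x1 + x2 <= 6             # labour constraint
prob += 9 * x1 + 5 * x2 <= 45    # wood constraint

prob.solve()
print(value(x1), value(x2), value(prob.objective))   # expect 5, 0, 40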
Solving ILP Models

[Figure: the ILP's feasible points (the integer lattice points) inside the LP
relaxation's feasible region.]

ILP Solutions
Not all points within the feasible region of an LP will be solutions
to the ILP problem.

Branch and Bound
Prunes sub-optimal sections of the feasibility region
(Land and Doig, 1960).

Telfa ILP Solution
z = 40, x1 = 5, x2 = 0
ILP for Compression
Models
The Models: Recap
Unsupervised: Trigram Language Model.
Semi-supervised: Significance Model.
Supervised: Discriminative Model.
Why ILP?
Can use same models in unified framework.
Enrich the models with linguistic constraints.
Decode globally with respect to the model and constraints.
Language Model

The Problem
y* = argmax_y Σ_{i=1}^{n} P(x_i | x_{i−1}, x_{i−2})

Decision Variables
δ_i = 1 if x_i is in the compression, 0 otherwise    (1 ≤ i ≤ n)

Unigram Objective Function
max Σ_{i=1}^{n} δ_i · P(x_i)

Auxiliary Variables
γ_ijk = 1 if the word sequence x_i, x_j, x_k is in the compression, 0 otherwise

Trigram Objective Function
max Σ_{i=0}^{n−2} Σ_{j=i+1}^{n−1} Σ_{k=j+1}^{n} γ_ijk · P(x_k | x_i, x_j)
ILP Formulations

Significance Model Objective Function
max Σ_{i=0}^{n−2} Σ_{j=i+1}^{n−1} Σ_{k=j+1}^{n} γ_ijk · P(x_k | x_i, x_j) + Σ_{i=1}^{n} δ_i · λ I(x_i)

Discriminative Model Objective Function
max Σ_{i=0}^{n−1} Σ_{j=i+1}^{n} γ_ij · s(x, L(y_i, y_j))
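To make the encoding concrete, a rough sketch of how the δ and γ variables and the trigram objective could be declared with PuLP; the toy sentence, the stand-in logp function and the variable names are our assumptions, and the sequential-consistency constraints that tie γ to δ (needed for a valid compression, but not shown on the slides) are omitted.

from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum

words = "Tony Blair insisted the case was compelling".split()   # toy input
n = len(words)

def logp(i, j, k):
    return -1.0        # stand-in for log P(x_k | x_i, x_j) from a real trigram LM

prob = LpProblem("compression", LpMaximize)
# delta_i = 1 iff word i is kept; gamma_ijk = 1 iff x_i, x_j, x_k appear consecutively
delta = {i: LpVariable(f"d_{i}", cat=LpBinary) for i in range(1, n + 1)}
gamma = {(i, j, k): LpVariable(f"g_{i}_{j}_{k}", cat=LpBinary)
         for i in range(0, n - 1)
         for j in range(i + 1, n)
         for k in range(j + 1, n + 1)}

# Trigram objective from the slide (index 0 acts as a start symbol)
prob += lpSum(gamma[i, j, k] * logp(i, j, k) for (i, j, k) in gamma)
# NOTE: the full model also needs constraints linking gamma to delta so that the
# selected trigrams form a single contiguous compression; they are omitted here.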
Language Model: Example Output

He became a power player in Greek Politics in 1974, when he
founded the socialist Pasok Party.
⇒ He became a player in the Pasok.

We took these troubled youth who don’t have fathers, and
brought them into the room to Dads who don’t have their
children.
⇒ We don’t have, and don’t have children.
ILP for Compression
Constraints
Modifier Constraints
Ensure the relationships between head words and their modifiers
remain grammatical.
1. If a modifier is in the compression, its head word must be included:
   δ_head − δ_modifier ≥ 0
2. Do not drop not if the head word is in the compression (same for
   words like his, our and genitives):
   δ_head − δ_not = 0
Example output with modifier constraints:

He became a power player in Greek Politics in 1974, when he
founded the socialist Pasok Party.
⇒ He became a player in the Pasok Party.

We took these troubled youth who don’t have fathers, and
brought them into the room to Dads who don’t have their
children.
⇒ We don’t have them don’t have their children.
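A sketch of how the modifier constraints can be added to an ILP like the one sketched earlier; heads and negations are hypothetical index maps derived from the dependency parse (RASP in the experiments).

# Modifier constraints, added to an existing problem `prob` with binary
# variables `delta` (word index -> LpVariable), as in the sketch above.
def add_modifier_constraints(prob, delta, heads, negations):
    # heads: modifier index -> head index
    for mod, head in heads.items():
        prob += delta[head] - delta[mod] >= 0     # keep the head whenever the modifier is kept
    # negations: index of "not" (or a genitive: his, our, ...) -> its head index
    for neg, head in negations.items():
        prob += delta[head] - delta[neg] == 0     # drop or keep "not" together with its head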
Sentential Constraints
Take overall sentence structure into account.
1. If a verb is in the compression then so are its arguments, and
   vice versa:
   δ_subject/object − δ_verb = 0
2. The compression must contain at least one verb:
   Σ_{i ∈ verbs} δ_i ≥ 1
Example output with sentential constraints:

He became a power player in Greek Politics in 1974, when he
founded the socialist Pasok Party.
⇒ He became a player in politics.

We took these troubled youth who don’t have fathers, and
brought them into the room to Dads who don’t have their
children.
⇒ We took these youth and brought them into the room to Dads.
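In the same style, a sketch of the sentential constraints; verbs and verb_args are again hypothetical parser outputs.

from pulp import lpSum

def add_sentential_constraints(prob, delta, verbs, verb_args):
    # verb_args: verb index -> indices of its subject/object arguments
    for verb, args in verb_args.items():
        for arg in args:
            prob += delta[arg] - delta[verb] == 0   # a verb and its arguments stay or go together
    prob += lpSum(delta[i] for i in verbs) >= 1     # the compression keeps at least one verb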
Discourse Constraints
Take overall document into account and preserve its coherence.
1. Do not drop centers and their referents:
   δ_center = 1
2. Do not drop words in topical lexical chains:
   δ_topical = 1
3. Do not drop personal pronouns:
   δ_personal_pronoun = 1
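The discourse constraints simply pin variables to 1; a sketch, with centers, topical and pronouns as hypothetical sets of word indices obtained from the centering and lexical-chain analyses in the backup slides.

def add_discourse_constraints(prob, delta, centers, topical, pronouns):
    for i in set(centers) | set(topical) | set(pronouns):
        prob += delta[i] == 1   # centers, topical words and personal pronouns must be kept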
Experiments
Evaluation Set-up
Compare global vs. local models:
Trigram Language Model
Significance Model
Discriminative Model
Data: speech (Broadcast news) and written (BNC) corpus
(http://homepages.inf.ed.ac.uk/s0460084/data).
Parameters:
LM trained on North American News corpus (25M tokens).
Parses were provided by RASP (Briscoe and Carroll, 2002).
Used McDonald’s (2006) feature set.
Solver: lp_solve (http://www.geocities.com/lpsolve).
Human Evaluation
Follows set-up of Knight and Marcu (2002).
Humans judge compression output on two dimensions:
Is the most important information preserved?
How grammatical is the compression?
Gold standard, 6 models (global vs. local).
42 participants using a 5-point scale.
Latin square design.
Compression Output

Source:      The aim is to give councils some control over the future growth
             of second homes.
Gold:        The aim is to give councils control over the growth of homes.
LM:          The aim is to the future.
LM+Constr:   The aim is to give councils control.
Sig:         The aim is to give councils control over the future growth of homes.
Sig+Constr:  The aim is to give councils control over the future growth of homes.
McD:         The aim is to give councils.
McD+Constr:  The aim is to give councils some control over the growth of homes.
Experiments
Results
Results on BNC corpus

Models        Grammar   Importance
LM            2.25      1.82
Sig           2.26      2.99
McD           3.05      2.84
LM+Constr     3.47      2.37
Sig+Constr    3.76      3.53
McD+Constr    3.50      3.17
Gold          4.25      3.98
(mean ratings on a 5-point scale)

Constraints significantly improve model performance.
LM+Constr and Sig+Constr are as grammatical as McD+Constr.
Sig+Constr and McD+Constr are significantly better than LM+Constr on importance.
Sig+Constr is as good as McD+Constr.
Conclusions
Contributions:
Unified framework for modeling sentence compression.
Global inference integrated with learning.
Constraints bring performance improvements.
Simple models can do well without large parallel corpus.
Constraints not tuned to mistakes of individual system.
Future Work:
Create better objective functions.
Interface compression with sentence extraction.
Apply framework to other generation tasks (e.g., paraphrasing).
ILP Pros and Cons
Finds the best solution to the problem as modelled.
Objective function must decompose linearly.
Solve times can be slow (not in the compression task).
There may be multiple optimal solutions or none.
When variables are binary, life is easier.
References

Dan Roth and Wen-tau Yih. 2007. Global Inference for Entity and Relation
Identification via a Linear Programming Formulation. In Introduction to
Statistical Relational Learning, Lise Getoor and Ben Taskar (eds), MIT Press.

James Clarke and Mirella Lapata. 2008. Global Inference for Sentence
Compression: An Integer Linear Programming Approach. Journal of Artificial
Intelligence Research, 31:399–429.

James Clarke and Mirella Lapata. 2010. Discourse Constraints for Document
Compression. Computational Linguistics, 36(3):411–441.
Discourse Representation
Centering Theory (Grosz et al. 1995)
Entity-orientated theory of local coherence (Grosz et al. 1995)
Entities in an utterance are ranked according to salience
Each utterance has one center (≈ topic or focus)
Coherent discourses have utterances with common centers
Lexical Chains (Halliday and Hasan 1976)
Representation of lexical cohesion (Halliday and Hasan 1976)
Degree of semantic relatedness among words in document
Dense and long chains signal the main topic of the document
Coherent texts have more related words than incoherent ones
Example Discourse
1 Bad weather dashed hopes of attempts to halt the flow during what
was seen as a lull in the lava’s momentum.
2 Some experts say that even if the eruption stopped today, the pressure of lava piled up behind for six miles would bring debris cascading down on to the town anyway.
3 Some estimate the volcano is pouring out one million tons of debris
a day, at a rate of 15ft per second, from a fissure that opened in
mid-December.
4 The Italian Army yesterday detonated 400lb of dynamite 3,500 feet
up Mount Etna’s slopes.
Centering Algorithm

1. Bad weather dashed hopes of attempts to halt the flow during
   what was seen as a lull in the lava’s momentum.
2. Some experts say that even if the eruption stopped today, the
   pressure of lava piled up behind for six miles would bring debris
   cascading down on to the town anyway.

1. Extract entities from U2.
2. Rank the entities in U2 according to their grammatical role
   (subject > objects > others).
3. Find the highest ranked entity in U1 which occurs in U2. Set this
   entity to be the center of U2.
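A rough sketch of step 3, assuming each utterance is represented as a list of (entity, grammatical role) pairs from the parser; the example entities are illustrative, not an actual analysis of the discourse above.

ROLE_RANK = {"subject": 0, "object": 1, "other": 2}

def center_of(u1_entities, u2_entities):
    """Return the center of U2: the highest-ranked entity of U1 that also occurs in U2."""
    u2_names = {name for name, _ in u2_entities}
    for name, _ in sorted(u1_entities, key=lambda e: ROLE_RANK.get(e[1], 2)):
        if name in u2_names:
            return name
    return None

print(center_of([("weather", "subject"), ("lava", "other")],
                [("experts", "subject"), ("lava", "other"), ("town", "other")]))   # -> lava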
Lexical Chain Algorithm

Sentence   Lava   Weight   Time
1          X      –        X
2          X      –        –
3          –      –        X
4          X      X        X
5          X      X        –
6          –      –        X
7          X      –        –
8          –      –        –
Score      5      2        4

1. Compute chains for the document (Galley and McKeown 2003).
   Lava : {lava, lava, lava, magma, lava}
   Weight : {tons, lbs}
   Time : {day, today, yesterday, second}
2. Score(Chain) = Sent(Chain), the number of sentences the chain occurs in.
3. Discard chains with Score(Chain) < Avg(Score) (here, the Weight chain).
4. Mark the terms of the remaining chains as topical.
   Lava : {lava, lava, lava, magma, lava}
   Time : {day, today, yesterday, second}
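The scoring and filtering steps are easy to mirror in code; a small sketch over the chain membership table above.

chains = {
    "Lava":   {1, 2, 4, 5, 7},   # sentences in which a term of the chain occurs
    "Weight": {4, 5},
    "Time":   {1, 3, 4, 6},
}
scores = {name: len(sents) for name, sents in chains.items()}          # Score(Chain) = Sent(Chain)
avg = sum(scores.values()) / len(scores)
topical_chains = [name for name, score in scores.items() if score >= avg]
print(scores, topical_chains)   # {'Lava': 5, 'Weight': 2, 'Time': 4} ['Lava', 'Time']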
Annotated Discourse
1 Bad weather dashed hopes of attempts to halt the flow during
what was seen as a lull in the lava’s momentum.
2 Some experts say that even if the eruption stopped today, the
pressure of lava piled up behind for six miles would bring debris
cascading down on to the town anyway.
3 Some estimate the volcano is pouring out one million tons of debris a day, at a rate of 15ft per second, from a fissure that opened
in mid-December.
4 The Italian Army yesterday detonated 400lb of dynamite 3,500
feet up Mount Etna’s slopes.