Text Summarization Model based on the Budgeted Median Problem
Hiroya Takamura
Tokyo Institute of Technology
Yokohama, Japan
[email protected]

Manabu Okumura
Tokyo Institute of Technology
Yokohama, Japan
[email protected]
ABSTRACT
We propose a multi-document generic summarization model
based on the budgeted median problem. Our model selects
sentences to generate a summary so that every sentence in
the document cluster can be assigned to and be represented
by a sentence in the summary as much as possible. The
advantage of this model is that it covers the entire relevant
part of the document cluster through sentence assignment
and can incorporate asymmetric relations between sentences
such as textual entailment.
Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Natural Language Processing
General Terms
Algorithms, Experimentation
Keywords
text summarization, median problems

[Figure 1: Sentences in a document cluster. The starting node of each arrow infers the target node.]
1. INTRODUCTION
Generic text summarization is the task of generating a
short and concise document, or a summary, that describes
the content of a document or multiple documents [8]. One
well-studied approach to this task generates a summary by
selecting some sentences from given documents. We focus
on this extractive approach, since it has the advantage that grammaticality is guaranteed.
A summary will be considered good if it represents the whole content of the document cluster. This demand will be met if every sentence in the document cluster is assigned to a selected sentence and the former can be inferred from the latter. In the example in Figure 1, "the man is a football fan" can be inferred from "the man likes football and baseball". In this case, the former sentence need not be selected for the summary if the latter is selected. We thus generate a summary by selecting the sentences that provide the best selection and assignment. Given the sentences in Figure 1, we would like to select the two shaded nodes (i.e., "the man likes football and baseball" and "his wife bought a book and a magazine"), since these two sentences infer all the other sentences. This summarization model can be formalized as the budgeted median problem, which is an integer linear programming problem.

A possible criticism of this idea of summarization by sentence selection and assignment is that a document cluster would contain many irrelevant sentences, and one cannot expect all the sentences to be inferred from the summary, especially if the summary length is limited. We manage this problem by incorporating the benefit of each sentence being included in a summary. If the benefit of a sentence is low, this sentence will be practically ignored and will not affect the resultant summary.

An advantage of the model is that it covers the relevant part of the document cluster through sentence assignment and can incorporate inter-sentential asymmetric relations such as textual entailment.
2. RELATED WORK
Nomoto and Matsumoto [10] performed clustering on sentences and used the obtained sentence clusters to generate
a summary. MEAD [11] used document clustering (not sentence clustering) to obtain salient words of each cluster and
then used these words to select sentences. CLASSY [2],
which performed best in DUC’04 in terms of ROUGE-1
score, scored sentences with the sum of tf·idf scores of words.
They also incorporated sentence compression based on heuristic rules. Filatova and Hatzivassiloglou [5] represented each
sentence with a set of conceptual units (e.g., words) and formalized the extractive summarization as a maximum coverage problem that aims at covering as many conceptual
units as possible by selecting some sentences. Takamura and
Okumura [15] used the same model and solved the problem
as an integer linear programming problem. They also extended the model so that it takes into account the relevancy
of sentences to the topic of the document cluster. Since their report is the latest one on generic text summarization, we use it for comparison. McDonald [9] formalized
text summarization as a knapsack problem which determines
whether each sentence is packed in the knapsack or not.
3. SUMMARIZATION MODEL BASED ON THE BUDGETED MEDIAN PROBLEM
Facility location problems [3] are applicable to practical issues such as determining hospital or school locations, where the facility should be accessible to the customers in the area. The budgeted median problem is a facility location problem with a cardinality constraint in knapsack form. We consider here that customer locations are also potential facility sites.
We will propose a multi-document summarization model
based on the budgeted median problem. Documents are
split beforehand into sentences D = \{s_1, \ldots, s_{|D|}\}. We will
select some sentences from D to generate a summary.
3.1 Formalization of Text Summarization

We would like to generate a summary such that every sentence in the document cluster is assigned to and represented by one selected sentence as much as possible.

Let us denote by e_{ij} the extent to which s_j is inferred by s_i. We call this score the inter-sentential coefficient. If we denote by z_{ij} the variable which becomes 1 if s_j is assigned to s_i, and 0 otherwise, then the score of the whole summary is \sum_{i,j} e_{ij} z_{ij}, which we would like to maximize.

We next have to impose a cardinality constraint on this maximization so that we can obtain a summary of length K or shorter, measured, for example, by the number of words or bytes. Let x_i denote a variable which becomes 1 if sentence s_i is selected, and 0 otherwise. Let c_i denote the length of sentence s_i. The cardinality constraint is then represented as \sum_i c_i x_i \le K. Since we have the cardinality constraint, we cannot expect that every sentence in the document cluster is perfectly inferred by the summary. However, we can at least maximize the inference relations over the contents so that we obtain a good summary.

We formalize text summarization as follows:

  max.  \sum_{i,j} e_{ij} z_{ij}
  s.t.  z_{ij} \le x_i,          \forall i, j,   (1)
        \sum_i c_i x_i \le K,                    (2)
        \sum_i z_{ij} = 1,       \forall j,      (3)
        z_{ii} = x_i,            \forall i,      (4)
        x_i \in \{0, 1\},        \forall i,      (5)
        z_{ij} \in \{0, 1\},     \forall i, j.   (6)

(1) guarantees that any sentence to which another sentence is assigned is in the summary. (2) is the cardinality constraint. (3) guarantees that every sentence is assigned to a sentence, and (4) means that any selected sentence is assigned to itself. The integrality constraint on z_{ij} (6) is automatically satisfied in the problem above.

This maximization problem can be regarded as a budgeted median problem [12], which is an NP-hard problem. Although this summarization model is intractable in general, if the problem size is not so large, we can still find the optimal solution by means of the branch-and-bound method [6], which we rely on throughout this paper.
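For concreteness, the following is a minimal sketch of how the formalization (1)-(6) could be handed to an off-the-shelf ILP solver. It uses the open-source PuLP library with its default CBC backend rather than the CPLEX branch-and-bound setup used in the experiments; the coefficient matrix e, the lengths c, and the budget K are assumed to be given.

# Minimal sketch of the budgeted median formalization (1)-(6),
# using PuLP/CBC instead of the CPLEX solver used in the paper.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

def summarize(e, c, K):
    """e[i][j]: inter-sentential coefficient, c[i]: sentence length, K: budget."""
    n = len(c)
    prob = LpProblem("budgeted_median_summarization", LpMaximize)
    x = [LpVariable(f"x_{i}", cat=LpBinary) for i in range(n)]            # (5)
    z = {(i, j): LpVariable(f"z_{i}_{j}", cat=LpBinary)                   # (6)
         for i in range(n) for j in range(n)}
    prob += lpSum(e[i][j] * z[i, j] for i in range(n) for j in range(n))  # objective
    for i in range(n):
        for j in range(n):
            prob += z[i, j] <= x[i]                                       # (1)
    prob += lpSum(c[i] * x[i] for i in range(n)) <= K                     # (2)
    for j in range(n):
        prob += lpSum(z[i, j] for i in range(n)) == 1                     # (3)
    for i in range(n):
        prob += z[i, i] == x[i]                                           # (4)
    prob.solve()
    return [i for i in range(n) if x[i].value() > 0.5]  # selected sentence indices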
3.2 Inter-sentential coefficients e_{ij}

We explain how to calculate inter-sentential coefficients. Although the inference relation required for summarization is similar to the entailment relation, it is also supposed to include other relations such as exemplification. For example, if one sentence gives merely an example of what is stated in another sentence, then the former sentence might not be needed in the summary. Our model provides a framework in which such relations are well utilized. However, since the techniques for identifying such relations are still under development, we use baseline methods for entailment recognition.

3.2.1 Textual entailment score

Inter-sentential coefficient e_{ij} indicates the extent to which sentence s_i infers sentence s_j. We use the score based on word coverage used as a baseline method by Rus et al. [13]:

  e_{asy,ij} = \frac{|s_i \cap s_j|}{|s_j|},   (7)

where s_i is regarded as the set of the words contained in the sentence, and therefore s_i \cap s_j represents the intersection of s_i and s_j. We should notice that e_{asy,ij} is asymmetric with respect to i and j, as suggested by the subscript asy (for asymmetric). We also try the symmetric version of e_{asy,ij}:

  e_{sym,ij} = \frac{|s_i \cap s_j|}{|s_i \cup s_j|}.   (8)
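As an illustration, here is a minimal sketch of the two scores, assuming each sentence has already been reduced to a set of content-word stems (as described in Section 4.1):

def e_asy(si, sj):
    """Eq. (7): fraction of sj's words covered by si; asymmetric in i and j."""
    return len(si & sj) / len(sj) if sj else 0.0

def e_sym(si, sj):
    """Eq. (8): Jaccard-style overlap of the two word sets; symmetric."""
    union = si | sj
    return len(si & sj) / len(union) if union else 0.0

# The example from Figure 1, reduced to (hypothetical) stems:
si = {"man", "like", "football", "baseball"}  # "the man likes football and baseball"
sj = {"man", "football", "fan"}               # "the man is a football fan"
print(e_asy(si, sj))  # 2/3: si covers "man" and "football" out of sj's three words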
3.2.2 Benefit of each sentence

The proposed method attempts to cover the entire document cluster as much as possible by assigning every sentence to one of the selected sentences. However, we cannot necessarily expect that all the sentences are inferred from the summary, because there can be irrelevant sentences and the summary length is limited. There are also some sentences that are more important than the others, and hence need to be covered more thoroughly. We manage this problem by incorporating the benefit b_j given by sentence s_j. If the benefit of a sentence is low, this sentence will be practically ignored and will not affect the resultant summary. We use this benefit to define a new inter-sentential coefficient: e'_{asy,ij} = b_j e_{asy,ij}. This new coefficient e'_{asy,ij} indicates the benefit that can be obtained by assigning s_j to s_i. It becomes large when the original benefit b_j is large and s_j is sufficiently entailed by s_i. Similarly, we also define e'_{sym,ij} = b_j e_{sym,ij}. Notice that e'_{sym,ij} is not symmetric any more. Benefit b_j is defined as follows:

  b_j = p \frac{1}{pos(s_j)} + (1 - p) \cos\left(\vec{s}_j, \sum_k \vec{s}_k\right),   (9)

where pos(s_j) indicates the position of sentence s_j, which ranges from 1 to the number of sentences in the document.
We use the inverse of the sentence position because it is known that leading sentences tend to be important in the summarization of news articles, which we use in the later experiments. \vec{s}_j denotes a feature vector of sentence s_j, \sum_k \vec{s}_k denotes a feature vector of the entire document cluster, and \cos(\vec{s}_j, \sum_k \vec{s}_k) denotes the cosine of the two vectors \vec{s}_j and \sum_k \vec{s}_k. We use this cosine because sentences similar to the entire document cluster are supposed to be important in summarization. Parameter p controls the trade-off between the two terms.
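A minimal sketch of Eq. (9), under the assumption that sentence vectors are sparse bags of stem counts (dicts) and positions are 1-based; the result would then be combined with Eq. (7) as e'_{asy,ij} = b_j * e_asy(si, sj):

import math

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def benefit(sent_vec, pos, cluster_vec, p):
    """Eq. (9): position prior blended with similarity to the whole cluster.
    cluster_vec is the element-wise sum of all sentence vectors in the cluster."""
    return p * (1.0 / pos) + (1.0 - p) * cosine(sent_vec, cluster_vec)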
Table 1: ROUGE-1 score and computational time of each model. Underlined scores are significantly different from that of peer65 in the statistical test.

  inter-sent. coefficient      p      ROUGE-1          comp.
                                      without  with    time (s)
  e_{asy,ij}                   -      0.292    0.375   47.3
  e_{sym,ij}                   -      0.239    0.340   31.5
  e'_{asy,ij}                  (0.0)  0.326    0.399   82.0
  e'_{sym,ij}                  (0.0)  0.291    0.370   39.3
  e'_{asy,ij} (predicted p)    (0.4)  0.330    0.398   52.7
  e'_{asy,ij} (optimal p)      (0.5)  0.330    0.396   61.1
  CLASSY                       -      0.309    0.382   -
  MaxCov                       -      0.306    0.378   -
  MaxCov-Rel                   -      0.325    0.385   -
3.3 Discussion
The proposed method is somewhat similar to methods
based on sentence clustering [10] in the sense that both
methods generate some sets of sentences. However, there
is a big difference between these two methods. While the
methods based on sentence clustering generate sets of similar sentences, the proposed method attempts to generate
sentence sets, each of which has one selected sentence and
contains sentences entailed by the selected sentence. Two
sentences in a set in the proposed method might not be similar to each other. One advantage of the proposed method is
that asymmetric relations between sentences such as textual
entailment can be incorporated in a natural manner.
Filatova and Hatzivassiloglou [5] represented each sentence with a set of conceptual units (e.g., words) and formalized the extractive summarization as a maximum coverage
problem that aims at covering as many conceptual units as
possible by selecting some sentences. Their model has the restriction that each sentence has to be decomposed into conceptual units. The question of what should be used as conceptual units has not been clearly answered yet. Our model
is free from such a restriction and can incorporate various
inter-sentential relations. One advantage of Filatova’s model
is in computational time; their model has O(max\{|D|, N\}) decision variables and our model has O(|D|^2) decision variables, where |D| denotes the number of sentences in the
document cluster, and N denotes the number of conceptual
units. Another advantage of Filatova’s model is that it can
express the situation where two sentences infer one sentence.
Our current model cannot handle such a situation. However,
our model can be extended in a straightforward way so that
it can handle those situations (Section 5).
McDonald [9] formalized text summarization as a knapsack problem which determines whether each sentence is
packed in the knapsack or not. In order not to select similar
sentences, they added to the objective function the negative
sum of similarities between sentences in the summary. Although their model can incorporate various similarities between sentences, their model does not directly emulate the
coverage of the sentences in the document cluster.
4. EXPERIMENTS

4.1 Experimental setting

We conducted experiments on task 2 of DUC'04 [4], because it is the latest DUC dataset on generic text summarization. 50 document clusters, each of which consists of 10 documents, are given. One summary is to be generated for each cluster. Following the official experimental setting of DUC, we set the target length to 665 bytes.

In the calculation of e_{ij}, each sentence is represented as a set of words. We use stems of content words (nouns, verbs, and adjectives) that are not stopwords. In the calculation of b_j, each sentence is represented as a bag-of-words vector whose elements are stems of content words as above. ROUGE version 1.5.5 [7] was used for evaluation.¹ We focus on ROUGE-1 because it has usually been used for evaluation on DUC'04 in other papers, although ROUGE-2 and ROUGE-SU4 are more often used on newer non-generic datasets. The Wilcoxon signed rank test for paired samples with significance level 0.03 was used for the significance test of the difference in ROUGE-1. We used the branch-and-bound method implemented in ILOG CPLEX version 11.1 to solve the integer linear programming problems. We tested four models: e_{sym,ij}, e_{asy,ij}, e'_{sym,ij}, and e'_{asy,ij}.

¹ Options are -n 4 -m -2 4 -u -f A -p 0.5 -b 665 -t 0 -d -s. In the case of evaluation with stopwords, -s was removed.
4.2 Results

Experimental results are shown in Table 1 with three columns: ROUGE-1 measured without stopwords ('without'), ROUGE-1 measured with stopwords ('with'), and the average computational time in seconds for one document cluster. In addition to the results for p = 0, we report the result for the predicted p determined on DUC'03 as development data (tuned on ROUGE-1 without stopwords) and for the optimal p determined on DUC'04.
In the experiments, the models with asymmetric inter-sentential coefficients generally outperformed the models with symmetric inter-sentential coefficients. We also successfully predicted p. e'_{asy,ij} and e'_{sym,ij}, which contain the benefit b_j, outperform e_{asy,ij} and e_{sym,ij}, respectively, which do not contain the benefit. This means that inference alone will not yield a good summary.
For comparison, we added the ROUGE-1 scores of other methods to Table 1. We selected three methods: CLASSY (peer65), which performed best in DUC'04 in terms of ROUGE-1; MaxCov, which is based on the maximum coverage problem; and MaxCov-Rel, the latest generic text summarization model and a variant of MaxCov. These methods were mentioned in Section 2. For CLASSY, we obtained the DUC official results. For MaxCov and MaxCov-Rel, we used the results obtained through the branch-and-bound method by Takamura and Okumura [15], in order to remove the search
error and directly compare the models. The conceptual units in MaxCov and MaxCov-Rel are stems of content words, as in our method. In the ROUGE-1 evaluation, our model with e'_{asy,ij} with the predicted or the optimal p significantly outperformed peer65. Even without position information (p = 0), e'_{asy,ij} yielded a better result than peer65, although the difference was not statistically significant. This suggests that the proposed method should work well on non-newspaper datasets, in which the sentence position is not a strong cue for summary generation.
The proposed method still requires much computational time; for example, e'_{asy,ij} requires nearly 90 seconds to generate one summary. For applications that require fast summarization, we will need to incorporate efficient approximation techniques [14].
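As a purely hypothetical illustration (not part of the paper's experiments), one standard way to trade optimality for speed here would be a greedy heuristic that repeatedly adds the sentence with the best gain per unit length under the assignment objective:

def greedy_summary(e, c, K):
    """Hypothetical greedy alternative to exact branch-and-bound: grow the
    selection S by the sentence maximizing the per-length gain in
    sum_j max_{i in S} e[i][j], until the budget K is exhausted."""
    n = len(c)
    selected = []
    best = [0.0] * n  # best[j]: current max of e[i][j] over selected i
    while True:
        budget_left = K - sum(c[i] for i in selected)
        cand, cand_gain = None, 0.0
        for i in range(n):
            if i in selected or c[i] > budget_left:
                continue
            gain = sum(max(e[i][j] - best[j], 0.0) for j in range(n)) / c[i]
            if gain > cand_gain:
                cand, cand_gain = i, gain
        if cand is None:
            return selected
        selected.append(cand)
        best = [max(best[j], e[cand][j]) for j in range(n)]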
Figure 2 displays the change in the ROUGE-1 score of e'_{asy,ij} as p changes. The curve reaches its peak at around p = 0.5. The ROUGE-1 score degrades when p is large. This means that the sentence position alone is not sufficient.
[Figure 2: p and ROUGE-1 score of our model with the asymmetric inter-sentential coefficient.]

We give an example of a sentence assignment we obtained. To the selected sentence "Yeltsin has a long history of health problems, including a heart bypass surgery two years ago", the following sentence was assigned: "Yeltsin suffered from heart disease during the 1996 presidential election and had a heart attack, followed by multiple bypass surgery, in the months after his victory".

5. EXTENSIONS OF THE MODEL

– One-to-two assignment: in some situations, one sentence might be better represented by two sentences than by one. We can extend our model to handle such a 1-to-2 assignment. We introduce a new decision variable z'_{ijk}, which becomes 1 when sentence s_k is assigned to the sentence pair s_i and s_j, and 0 otherwise. We also set the objective function to \sum_{i,j,k} e'_{ijk} z'_{ijk}, where e'_{ijk} indicates the extent to which the pair of sentences s_i and s_j infers s_k (one possible formulation is sketched after this list). We can also extend our model so that it can handle 1-to-many assignment, or even many-to-many assignment.

– Other relations between sentences: we used baseline methods for recognizing textual entailment. Our model is expected to perform better with the support of more sophisticated entailment engines [1]. We can use any inter-sentential relations, such as kernel functions between sentences [16].

– Efficient algorithms: the proposed method requires much computational time. We need to incorporate efficient approximation techniques [14].
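The paper states only the objective of the 1-to-2 extension; purely as an assumption, the constraints of Section 3.1 might be adapted along the following lines (the pair-linking and assignment constraints below are our guess, not the authors' formulation):

\begin{align}
\text{max.} \quad & \sum_{i,j,k} e'_{ijk} z'_{ijk} \\
\text{s.t.} \quad & z'_{ijk} \le x_i, \quad z'_{ijk} \le x_j, \qquad \forall i, j, k, \\
& \sum_i c_i x_i \le K, \\
& \sum_{i,j} z'_{ijk} = 1, \qquad \forall k, \\
& x_i \in \{0, 1\}, \quad z'_{ijk} \in \{0, 1\}.
\end{align}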
6. CONCLUSION

We proposed a novel text summarization model based on the budgeted median problem. The proposed model covers the entire document cluster through sentence assignment, since in our model every sentence is represented by one of the selected sentences as much as possible. An advantage of our method is that it can incorporate asymmetric relations between sentences in a natural manner.

7. REFERENCES
[1] R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro,
D. Giampiccolo, B. Magnini, and I. Szpektor. The
Second PASCAL Recognising Textual Entailment Challenge.
In Proc. of the 2nd PASCAL Challenges Workshop on
Recognising Textual Entailment, pp. 1–9, 2006.
[2] J. M. Conroy, J. D. Schlesinger, J. Goldstein, and
D. P. O’Leary. Left-brain/right-brain multi-document
summarization. In Proc. of the DUC, 2004.
[3] Z. Drezner and H. W. Hamacher, editors. Facility
Location: Applications and Theory. Springer, 2004.
[4] Document Understanding Conference. HLT/NAACL
Workshop on Text Summarization, 2004.
[5] E. Filatova and V. Hatzivassiloglou. A formal model
for information selection in multi-sentence text
extraction. In Proc. of the 20th COLING, pp. 397–403,
2004.
[6] J. Hromkovič. Algorithmics for Hard Problems.
Springer, 2003.
[7] C. Lin. ROUGE: a package for automatic evaluation
of summaries. In Proc. of the Workshop on Text
Summarization Branches Out, pp. 74–81, 2004.
[8] I. Mani. Automatic Summarization. John Benjamins
Publisher, 2001.
[9] R. McDonald. A study of global inference algorithms
in multi-document summarization. In Proc. of the 29th
ECIR, pp. 557–564, 2007.
[10] T. Nomoto and Y. Matsumoto. A new approach to
unsupervised text summarization. In Proc. of the 24th
SIGIR, pp. 26–34, 2001.
[11] D. R. Radev, H. Jing, M. Styś, and D. Tam.
Centroid-based summarization of multiple documents.
Information Processing and Management,
40(6):919–938, 2004.
[12] P. Rojeski and C. S. ReVelle. Central facilities
location under an investment constraint. Geographical
Analysis, 2:343–360, 1970.
[13] V. Rus, A. Graesser, P. M. McCarthy, and K.-I. Lin.
A study on textual entailment. In Proc. of the 17th
ICTAI, pp. 326–333, 2005.
[14] D. B. Shmoys. Approximation algorithms for facility
location problems. In Approximation Algorithms for
Combinatorial Optimization (LNCS; Vol. 1913), pp.
369–378, 2000.
[15] H. Takamura and M. Okumura. Text summarization
model based on maximum coverage problem and its
variant. In Proc. of the 12th EACL, pp. 781–789, 2009.
[16] D. Zelenko, C. Aone, and A. Richardella. Kernel
methods for relation extraction. Journal of Machine
Learning Research, 3:1083–1106, 2003.