Text Summarization Model based on the Budgeted Median Problem

Hiroya Takamura
Tokyo Institute of Technology, Yokohama, Japan
[email protected]

Manabu Okumura
Tokyo Institute of Technology, Yokohama, Japan
[email protected]

ABSTRACT
We propose a multi-document generic summarization model based on the budgeted median problem. Our model selects sentences to generate a summary so that every sentence in the document cluster can be assigned to, and represented by, a sentence in the summary as much as possible. The advantage of this model is that it covers the entire relevant part of the document cluster through sentence assignment and can incorporate asymmetric relations between sentences such as textual entailment.

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Natural Language Processing

General Terms
Algorithms, Experimentation

Keywords
text summarization, median problems

1. INTRODUCTION
Generic text summarization is the task of generating a short and concise document, or a summary, that describes the content of a document or multiple documents [8]. One well-studied approach to this task generates a summary by selecting some sentences from the given documents. We focus on this extractive approach, since it has the advantage that grammaticality is guaranteed.

A summary is considered good if it represents the whole content of the document cluster. This demand is met if every sentence in the document cluster is assigned to a selected sentence and the former can be inferred from the latter. In the example in Figure 1, "the man is a football fan" can be inferred from "the man likes football and baseball". In this case, the former sentence need not be selected in the summary if the latter is selected. We thus generate a summary by selecting the sentences that provide the best selection and assignment.

[Figure 1: Sentences in a document cluster. The starting node of each arrow infers the target node.]

Given the sentences in Figure 1, we would like to select the two shaded nodes (i.e., "the man likes football and baseball" and "his wife bought a book and a magazine"), since these two sentences infer all the other sentences. This summarization model can be formalized by the budgeted median problem, which is an integer linear programming problem.

A possible criticism of this idea of summarization by sentence selection and assignment is that a document cluster would contain many irrelevant sentences, so one cannot expect all the sentences to be inferred from the summary, especially if the summary length is limited. We manage this problem by incorporating the benefit of each sentence being included in a summary. If the benefit of a sentence is low, the sentence is practically ignored and does not affect the resultant summary. An advantage of the model is that it covers the relevant part of the document cluster through sentence assignment and can incorporate inter-sentential asymmetric relations such as textual entailment.
2. RELATED WORK
Nomoto and Matsumoto [10] performed clustering on sentences and used the obtained sentence clusters to generate a summary. MEAD [11] used document clustering (not sentence clustering) to obtain salient words for each cluster and then used these words to select sentences. CLASSY [2], which performed best in DUC'04 in terms of ROUGE-1 score, scored sentences with the sum of tf*idf scores of words; it also incorporated sentence compression based on heuristic rules. Filatova and Hatzivassiloglou [5] represented each sentence with a set of conceptual units (e.g., words) and formalized extractive summarization as a maximum coverage problem that aims at covering as many conceptual units as possible by selecting some sentences. Takamura and Okumura [15] used the same model and solved the problem as an integer linear programming problem; they also extended the model to take into account the relevancy of sentences to the topic of the document cluster. Since theirs is the most recent report on generic text summarization, we use it for comparison. McDonald [9] formalized text summarization as a knapsack problem, which determines whether or not each sentence is packed into the knapsack.

3. SUMMARIZATION MODEL BASED ON THE BUDGETED MEDIAN PROBLEM
Facility location problems [3] are applicable to practical questions such as where to locate a hospital or a school so that the facility is accessible to the customers in the area. The budgeted median problem is a facility location problem with a cardinality constraint in knapsack form. We consider here that customer locations are also potential facility sites. We propose a multi-document summarization model based on the budgeted median problem. Documents are split beforehand into sentences D = {s_1, ..., s_{|D|}}, and we select some sentences from D to generate a summary.

3.1 Formalization of Text Summarization
We would like to generate a summary such that every sentence in the document cluster is assigned to, and represented by, one selected sentence as much as possible. Let us denote by e_ij the extent to which s_j is inferred by s_i; we call this score the inter-sentential coefficient.

If we denote by z_ij the variable which becomes 1 if s_j is assigned to s_i and 0 otherwise, then the score of the whole summary is \sum_{i,j} e_{ij} z_{ij}, which we would like to maximize. We next impose a cardinality constraint on this maximization so that we obtain a summary of length K or shorter, measured, for example, by the number of words or bytes. Let x_i denote a variable which becomes 1 if sentence s_i is selected and 0 otherwise, and let c_i denote the length of sentence s_i. The cardinality constraint is then \sum_i c_i x_i \le K. Because of the cardinality constraint, we cannot expect every sentence in the document cluster to be perfectly inferred by the summary; instead, we maximize the total inference score so as to obtain a good summary. We formalize text summarization as follows:

max.  \sum_{i,j} e_{ij} z_{ij}
s.t.  z_{ij} \le x_i,       \forall i, j,   (1)
      \sum_i c_i x_i \le K,                 (2)
      \sum_i z_{ij} = 1,    \forall j,      (3)
      z_{ii} = x_i,         \forall i,      (4)
      x_i \in \{0, 1\},     \forall i,      (5)
      z_{ij} \in \{0, 1\},  \forall i, j.   (6)

Constraint (1) guarantees that any sentence to which another sentence is assigned is in the summary. (2) is the cardinality constraint. (3) guarantees that every sentence is assigned to some sentence, and (4) means that any selected sentence is assigned to itself. The integrality constraint (6) on z_ij is automatically satisfied in the problem above. This maximization problem can be regarded as a budgeted median problem [12], which is NP-hard. Although this summarization model is therefore intractable in general, if the problem size is not too large we can still find the optimal solution by means of the branch-and-bound method [6], on which we rely throughout this paper.

3.2 Inter-sentential coefficients e_ij
We now explain how to calculate the inter-sentential coefficients. Although the inference relation required for summarization is similar to the entailment relation, it is also supposed to include other relations such as exemplification. For example, if one sentence merely gives an example of what is stated in another sentence, the former sentence might not be needed in the summary. Our model provides a framework in which such relations are well utilized. However, since techniques for identifying such relations are still under development, we use baseline methods for entailment recognition.

3.2.1 Textual entailment score
Inter-sentential coefficient e_ij indicates the extent to which sentence s_i infers sentence s_j. We use the score based on word coverage that Rus et al. [13] used as a baseline method:

e_{asy,ij} = \frac{|s_i \cap s_j|}{|s_j|},   (7)

where s_i is regarded as the set of words contained in the sentence, so that s_i \cap s_j represents the intersection of s_i and s_j. Note that e_{asy,ij} is asymmetric with respect to i and j, as suggested by the subscript asy (for asymmetric). We also try the symmetric version of e_{asy,ij}:

e_{sym,ij} = \frac{|s_i \cap s_j|}{|s_i \cup s_j|}.   (8)

3.2.2 Benefit of each sentence
The proposed method attempts to cover the entire document cluster as much as possible by assigning every sentence to one of the selected sentences. However, we cannot necessarily expect all the sentences to be inferred from the summary, because there can be irrelevant sentences and the summary length is limited. Some sentences are also more important than others and hence need to be covered more thoroughly. We address this by incorporating the benefit b_j given by sentence s_j. If the benefit of a sentence is low, the sentence is practically ignored and does not affect the resultant summary. We use this benefit to define a new inter-sentential coefficient: e'_{asy,ij} = b_j e_{asy,ij}. This new coefficient indicates the benefit that can be obtained by assigning s_j to s_i; it becomes large when the original benefit b_j is large and s_j is sufficiently entailed by s_i. Similarly, we define e'_{sym,ij} = b_j e_{sym,ij}; notice that e'_{sym,ij} is no longer symmetric. Benefit b_j is defined as follows:

b_j = p \frac{1}{pos(s_j)} + (1 - p) \cos(\vec{s}_j, \sum_k \vec{s}_k),   (9)

where pos(s_j) indicates the position of sentence s_j, which ranges from 1 to the number of sentences in the document. We use the inverse of the sentence position because leading sentences are known to be important in the summarization of news articles, which we use in the later experiments. \vec{s}_j denotes a feature vector of sentence s_j, \sum_k \vec{s}_k denotes a feature vector of the entire document cluster, and \cos(\vec{s}_j, \sum_k \vec{s}_k) denotes the cosine of the two vectors. We use this cosine because sentences similar to the entire document cluster are supposed to be important in summarization. Parameter p controls the trade-off between the two terms.
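Before turning to the discussion, the following is a minimal, self-contained sketch of the model in this section: the entailment score of Eq. (7), the benefit of Eq. (9), and the integer linear program (1)-(6). The paper solves the ILP with the branch-and-bound solver of ILOG CPLEX 11.1 (Section 4.1); here the open-source PuLP/CBC stack is used as a stand-in. Tokenization is reduced to plain whitespace splitting (no stemming or stopword removal), pos(s_j) is approximated by the sentence's index in a flat cluster list, and all function and variable names (e_asy, benefit, summarize) are ours, not the authors'.

import math
from collections import Counter

import pulp  # pip install pulp


def e_asy(si, sj):
    # Eq. (7): |s_i intersect s_j| / |s_j|, with sentences as word sets.
    wi, wj = set(si.split()), set(sj.split())
    return len(wi & wj) / len(wj) if wj else 0.0


def benefit(sentences, j, p=0.5):
    # Eq. (9): position prior plus cosine to the cluster-wide vector.
    vj = Counter(sentences[j].split())
    cluster = Counter()
    for s in sentences:
        cluster.update(s.split())
    dot = sum(vj[w] * cluster[w] for w in vj)
    norm = (math.sqrt(sum(v * v for v in vj.values()))
            * math.sqrt(sum(v * v for v in cluster.values())))
    cos = dot / norm if norm else 0.0
    return p * (1.0 / (j + 1)) + (1.0 - p) * cos  # pos(s_j) ~ j + 1 here


def summarize(sentences, K, p=0.5):
    n = range(len(sentences))
    b = [benefit(sentences, j, p) for j in n]
    e = {(i, j): b[j] * e_asy(sentences[i], sentences[j])
         for i in n for j in n}              # e'_asy,ij = b_j * e_asy,ij
    c = [len(s) for s in sentences]          # c_i, sentence length in bytes

    prob = pulp.LpProblem("budgeted_median", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", n, cat="Binary")                    # (5)
    z = pulp.LpVariable.dicts("z", [(i, j) for i in n for j in n],
                              cat="Binary")                            # (6)
    prob += pulp.lpSum(e[i, j] * z[i, j] for i in n for j in n)  # objective
    for i in n:
        for j in n:
            prob += z[i, j] <= x[i]                                    # (1)
    prob += pulp.lpSum(c[i] * x[i] for i in n) <= K                    # (2)
    for j in n:
        prob += pulp.lpSum(z[i, j] for i in n) == 1                    # (3)
    for i in n:
        prob += z[i, i] == x[i]                                        # (4)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [sentences[i] for i in n if x[i].value() > 0.5]


if __name__ == "__main__":
    # A toy cluster in the spirit of Figure 1: the first and third sentences
    # should be selected, with the second assigned to the first.
    print(summarize(["the man likes football and baseball",
                     "the man is a football fan",
                     "his wife bought a book and a magazine"], K=80))

The sketch has O(|D|^2) binary variables, matching the count discussed in Section 3.3, so it is practical only for modest cluster sizes, and CBC is generally slower than CPLEX.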
3.3 Discussion
The proposed method is somewhat similar to methods based on sentence clustering [10], in the sense that both generate sets of sentences. There is, however, a substantial difference. While the methods based on sentence clustering generate sets of similar sentences, the proposed method generates sentence sets, each of which consists of one selected sentence together with the sentences entailed by it; two sentences in such a set might not be similar to each other. One advantage of the proposed method is that asymmetric relations between sentences, such as textual entailment, can be incorporated in a natural manner.

Filatova and Hatzivassiloglou [5] represented each sentence with a set of conceptual units (e.g., words) and formalized extractive summarization as a maximum coverage problem that aims at covering as many conceptual units as possible by selecting some sentences. Their model has the restriction that each sentence has to be decomposed into conceptual units, and the question of what should be used as conceptual units has not been clearly answered yet. Our model is free from such a restriction and can incorporate various inter-sentential relations. One advantage of Filatova's model is computational: their model has O(max{|D|, N}) decision variables, while ours has O(|D|^2), where |D| denotes the number of sentences in the document cluster and N the number of conceptual units. Another advantage of Filatova's model is that it can express the situation where two sentences together infer one sentence. Our current model cannot handle such a situation, but it can be extended in a straightforward way to do so (Section 5).

McDonald [9] formalized text summarization as a knapsack problem, which determines whether or not each sentence is packed into the knapsack. In order not to select similar sentences, McDonald added to the objective function the negative sum of similarities between sentences in the summary. Although this model can incorporate various similarities between sentences, it does not directly emulate the coverage of the sentences in the document cluster.

4. EXPERIMENTS
4.1 Experimental setting
We conducted experiments on task 2 of DUC'04 [4], the latest DUC dataset on generic text summarization. Fifty document clusters, each consisting of 10 documents, are given, and one summary is to be generated for each cluster. Following the official experimental setting of DUC, we set the target length to 665 bytes.

In the calculation of e_ij, each sentence is represented as a set of words; we use stems of content words (nouns, verbs, and adjectives) that are not stopwords. In the calculation of b_j, each sentence is represented as a bag-of-words vector whose elements are stems of content words as above.

ROUGE version 1.5.5 [7] was used for evaluation (options: -n 4 -m -2 4 -u -f A -p 0.5 -b 665 -t 0 -d -s; for evaluation with stopwords, -s was removed). We focus on ROUGE-1 because it has usually been used for evaluation on DUC'04 in other papers, although ROUGE-2 and ROUGE-SU4 are more often used on newer, non-generic datasets. The Wilcoxon signed rank test for paired samples with significance level 0.03 was used to test the significance of differences in ROUGE-1 (a sketch of this test follows Table 1). We used the branch-and-bound methods implemented in ILOG CPLEX version 11.1 to solve the integer linear programming problems. We tested four models: e_sym,ij, e_asy,ij, e'_sym,ij, and e'_asy,ij.

Table 1: ROUGE-1 score and computational time of each model. Underlined scores are significantly different from that of peer65 in the statistical test.

inter-sent. coefficient  | p     | ROUGE-1 without | ROUGE-1 with | comp. time (s)
e_asy,ij                 | -     | 0.292           | 0.375        | 47.3
e_sym,ij                 | -     | 0.239           | 0.340        | 31.5
e'_asy,ij                | (0.0) | 0.326           | 0.399        | 82.0
e'_sym,ij                | (0.0) | 0.291           | 0.370        | 39.3
e'_asy,ij (predicted p)  | (0.4) | 0.330           | 0.398        | 52.7
e'_asy,ij (optimal p)    | (0.5) | 0.330           | 0.396        | 61.1
CLASSY                   | -     | 0.309           | 0.382        | -
MaxCov                   | -     | 0.306           | 0.378        | -
MaxCov-Rel               | -     | 0.325           | 0.385        | -
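As referenced in Section 4.1, system comparisons use a Wilcoxon signed rank test for paired samples at significance level 0.03. The following is a minimal sketch of how such a comparison could be run with SciPy; the per-cluster ROUGE-1 arrays are hypothetical placeholders (in the real setting there would be one score per DUC'04 cluster, i.e., 50 values per system), not the paper's data.

# Paired significance test over per-cluster ROUGE-1 scores.
from scipy.stats import wilcoxon

ours   = [0.33, 0.41, 0.28, 0.35, 0.39, 0.31, 0.37, 0.30]  # hypothetical
peer65 = [0.30, 0.38, 0.29, 0.31, 0.36, 0.30, 0.33, 0.28]  # hypothetical

stat, pvalue = wilcoxon(ours, peer65)
print("significant at 0.03" if pvalue < 0.03 else "not significant", pvalue)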
4.2 Results
Experimental results are shown in Table 1, with three result columns: ROUGE-1 measured without stopwords ('without'), ROUGE-1 measured with stopwords ('with'), and the average computational time in seconds for one document cluster. In addition to the results with p = 0, we report the result of the predicted p, determined on DUC'03 as development data (tuned on ROUGE-1 without stopwords), and of the optimal p, determined on DUC'04.

In the experiments, the models with asymmetric inter-sentential coefficients generally outperformed the models with symmetric inter-sentential coefficients. We also successfully predicted p. e'_asy,ij and e'_sym,ij, which incorporate the benefit b_j, respectively outperform e_asy,ij and e_sym,ij, which do not. This means that inference alone will not yield a good summary.

For comparison, we added the ROUGE-1 scores of other methods to Table 1. We selected three methods: CLASSY (peer65), which performed best in DUC'04 in terms of ROUGE-1; MaxCov, which is based on the maximum coverage problem; and MaxCov-Rel, the latest generic text summarization model and a variant of MaxCov. These methods were discussed in Section 2. For CLASSY, we used the official DUC results. For MaxCov and MaxCov-Rel, we used the results obtained through the branch-and-bound method by Takamura and Okumura [15], in order to remove the search error and directly compare the models. Conceptual units in MaxCov and MaxCov-Rel are stems of content words, as in our method.

In the ROUGE-1 evaluation, our model with e'_asy,ij, with either the predicted or the optimal p, significantly outperformed peer65. Even without position information (p = 0), e'_asy,ij yielded a better result than peer65, although the difference was not statistically significant. This suggests that the proposed method should also work well on non-newspaper datasets, in which sentence position is not a strong cue for summary generation. The proposed method still requires much computational time; for example, e'_asy,ij needs nearly 90 seconds to generate one summary. For applications that require fast summarization, we will need to incorporate efficient approximation techniques [14].

Figure 2 displays the change in the ROUGE-1 score of e'_asy,ij as p changes. The curve reaches its peak at around p = 0.5, and the ROUGE-1 score degrades when p is large. This means that sentence position alone is not sufficient.

[Figure 2: p and ROUGE-1 score of our model with the asymmetric inter-sentential coefficient. x-axis: p, from 0 to 1; y-axis: ROUGE-1, roughly 0.31 to 0.332; single curve labeled "Asymmetric".]

We give an example of an obtained sentence assignment. To the selected sentence "Yeltsin has a long history of health problems, including a heart bypass surgery two years ago", the following sentence was assigned: "Yeltsin suffered from heart disease during the 1996 presidential election and had a heart attack, followed by multiple bypass surgery, in the months after his victory".
5. EXTENSIONS OF THE MODEL
– One-to-two assignment: in some situations, one sentence might be better represented by two sentences than by one. We can extend our model to handle such a 1-to-2 assignment. We introduce a new decision variable z'_ijk, which becomes 1 when sentence s_k is assigned to the sentence pair s_i and s_j, and 0 otherwise, and we set the objective function to \sum_{i,j,k} e'_{ijk} z'_{ijk}, where e'_{ijk} indicates the extent to which the pair of sentences s_i and s_j infers s_k (a sketch of this extension follows this list). We can also extend our model to handle 1-to-many, or even many-to-many, assignment.
– Other relations between sentences: we used baseline methods for recognizing textual entailment. Our model is expected to perform better with the support of more sophisticated entailment engines [1]. We can also use any other inter-sentence relations, such as kernel functions between sentences [16].
– Efficient algorithms: the proposed method requires much computational time. We need to incorporate efficient approximation techniques [14].
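The one-to-two extension above is only outlined in the paper. The following sketch, in the same PuLP style as the earlier one, shows one way the modified objective could look; the coefficient table e2 (our name for e'_ijk) is assumed to be precomputed, and the constraint set is our guess at a minimal consistent extension rather than the authors' specification.

# Sketch of the 1-to-2 assignment extension. e2 is a dict keyed by
# (i, j, k) holding e'_ijk; lengths[i] is the length c_i of sentence s_i.
import pulp

def summarize_1to2(e2, lengths, K):
    n = range(len(lengths))
    prob = pulp.LpProblem("one_to_two_assignment", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", n, cat="Binary")
    z2 = pulp.LpVariable.dicts(
        "z2", [(i, j, k) for i in n for j in n for k in n], cat="Binary")
    # Objective: sum_{i,j,k} e'_{ijk} z'_{ijk}.
    prob += pulp.lpSum(e2[i, j, k] * z2[i, j, k]
                       for i in n for j in n for k in n)
    for k in n:
        # Each sentence is assigned to exactly one pair (our assumption).
        prob += pulp.lpSum(z2[i, j, k] for i in n for j in n) == 1
    for i in n:
        for j in n:
            for k in n:
                # A pair can be used only if both of its members are selected.
                prob += z2[i, j, k] <= x[i]
                prob += z2[i, j, k] <= x[j]
    prob += pulp.lpSum(lengths[i] * x[i] for i in n) <= K  # length budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in n if x[i].value() > 0.5]

With O(|D|^3) variables, this variant is markedly heavier than the base model, which is one reason the 1-to-many and many-to-many cases are left as extensions.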
6. CONCLUSION
We proposed a novel text summarization model based on the budgeted median problem. The proposed model covers the entire document cluster through sentence assignment, since in our model every sentence is represented by one of the selected sentences as much as possible. An advantage of our method is that it can incorporate asymmetric relations between sentences in a natural manner.

7. REFERENCES
[1] R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor. The second PASCAL recognising textual entailment challenge. In Proc. of the 2nd PASCAL Challenges Workshop on Recognising Textual Entailment, pp. 1-9, 2006.
[2] J. M. Conroy, J. D. Schlesinger, J. Goldstein, and D. P. O'Leary. Left-brain/right-brain multi-document summarization. In Proc. of the DUC, 2004.
[3] Z. Drezner and H. W. Hamacher, editors. Facility Location: Applications and Theory. Springer, 2004.
[4] Document Understanding Conference. HLT/NAACL Workshop on Text Summarization, 2004.
[5] E. Filatova and V. Hatzivassiloglou. A formal model for information selection in multi-sentence text extraction. In Proc. of the 20th COLING, pp. 397-403, 2004.
[6] J. Hromkovič. Algorithmics for Hard Problems. Springer, 2003.
[7] C.-Y. Lin. ROUGE: a package for automatic evaluation of summaries. In Proc. of the Workshop on Text Summarization Branches Out, pp. 74-81, 2004.
[8] I. Mani. Automatic Summarization. John Benjamins, 2001.
[9] R. McDonald. A study of global inference algorithms in multi-document summarization. In Proc. of the 29th ECIR, pp. 557-564, 2007.
[10] T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proc. of the 24th SIGIR, pp. 26-34, 2001.
[11] D. R. Radev, H. Jing, M. Styś, and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40(6):919-938, 2004.
[12] P. Rojeski and C. S. ReVelle. Central facilities location under an investment constraint. Geographical Analysis, 2:343-360, 1970.
[13] V. Rus, A. Graesser, P. M. McCarthy, and K.-I. Lin. A study on textual entailment. In Proc. of the 17th ICTAI, pp. 326-333, 2005.
[14] D. B. Shmoys. Approximation algorithms for facility location problems. In Approximation Algorithms for Combinatorial Optimization (LNCS, Vol. 1913), pp. 369-378, 2000.
[15] H. Takamura and M. Okumura. Text summarization model based on maximum coverage problem and its variant. In Proc. of the 12th EACL, pp. 781-789, 2009.
[16] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083-1106, 2003.