A Learning Technique to Determine Criteria for Multiple Document Summarization

Fatma Kallel Jaoua, Miracl Laboratory, [email protected]
Maher Jaoua, Miracl Laboratory, [email protected]
Lamia Hadrich Belguith, Miracl Laboratory, [email protected]
Abdelmajid Ben Hamadou, Miracl Laboratory, [email protected]

Abstract

In this paper we describe a new method of automatic summarization based on a learning step that identifies the criteria which maximize the correlation between human summaries and peer extracts. The proposed method uses a genetic algorithm to produce extracts from a collection of source documents describing the same event. These extracts are compared to human summaries using the Rouge measure in order to identify the correlation between statistical and linguistic criteria and the Rouge score. Experimental results are presented for a document set extracted from the DUC'06 evaluation conference.

1. Introduction

The use of automatic tools for information processing has become essential to users. Indeed, without such tools, it is very difficult to exploit all the relevant information available online. In this context, automatic text summarization systems are very useful. The goal of such systems is to generate a short version (the output) of a single document or of multiple documents (the input) that gives the user the pertinent information. In recent years, the Natural Language Processing (NLP) and Information Retrieval (IR) communities have shown great interest in automatic text summarization. Indeed, they have encouraged related research work by organizing several workshops on automatic text summarization (e.g. EACL'97 and WAS'00, the Workshop on Automatic Summarization held at ANLP-NAACL 2000 in Seattle, Washington, USA), special sessions in international conferences (e.g. SIGIR), and system evaluation conferences (e.g. TIPSTER and DUC, the Document Understanding Conference, which each year proposes large-scale experiments by creating reference data, documents and summaries, for training and testing). DUC conferences, for example, have been interested since 2004 in summarizing multiple documents in response to specific user questions [1, 6].

Several methods for automatic multiple document summarization have been proposed. The common strategy used in these methods is sentence extraction. This strategy ranks and extracts the representative textual units (sentences, words, phrases or paragraphs) of the multiple documents. Radev et al. described an extractive multi-document summarizer that extracts a summary from multiple documents based on the document cluster centroids [13]. Other works also used similarity techniques in order to determine the sentences reporting the same topic [4]. Marcu introduced two algorithms for sentence compression based on the noisy-channel model and the decision-tree approach [10]. McKeown et al. developed a system using a generation module in order to reformulate the obtained extract. They also described an algorithm for information fusion, which tries to combine similar sentences across documents to create new sentences based on language generation technologies [11].

In the sentence extraction strategy, clustering is introduced to avoid the information redundancy that results from the multiplicity of the source documents. However, the redundancy problem cannot be totally solved, because the clustering process cannot maintain the disjunction between clusters. Moreover, clustering methods face the critical problem of finding the appropriate number of clusters and the criteria that can be used to separate them. Some researchers adopted a predefined cluster number or a similarity threshold.
Even if the cluster number is determined, the sentences (or other textual units such as phrases) that have the highest scores in their clusters are not always the best ones (i.e. when we consider the summary as a whole). To solve these problems, we adopt a genetic algorithm (GA) [7, 8] in order to determine the set of sentences that maximizes the extract's informativeness. In addition, we propose a new strategy to identify the criteria that can be used to select an extract according to a given task. Note that the produced extract should be revised in order to improve its coherence. In previous work we proposed an additional revision step that includes a sentence reordering technique based on the context of the sentences selected for the extract [6]. The goal of this paper is to detail the method for determining the criteria that can be used to extract sentences from multiple documents.

This paper is organized as follows. Section 2 presents an overview of the proposed method and describes the criteria used in the training step. Section 3 describes our system architecture. Section 4 reports experimental results on the DUC'06 data set. Finally, Section 5 presents the conclusion and future work.

[Figure 1. Genetic algorithm process: preprocessing turns the set of documents into a set of sentences; random generation builds population i of extracts; cross-over and mutation produce population i+1; evaluation and selection are repeated until stagnation, which yields the final extract.]

2. Overview of the proposed method

Clustering is the phase common to the different summarizing methods described above. This phase is introduced in order to eliminate the information redundancy which results from the multiplicity of the source documents. However, the redundancy problem cannot be totally solved, because the clustering process cannot maintain the disjunction between clusters. Thus, the quality of the produced extracts strongly depends on redundancy and coherence problems (e.g., the coherence problem is due to lack of information, sentence ordering, lack of information credibility, etc.). To solve these problems and to enhance the extract quality, we have proposed a new extraction process which operates on the extract in its totality and not on its independent parts (e.g., sentences or phrases). Therefore, we consider the extraction process as an optimization problem where the optimal extract is chosen among a set of extracts formed by combining sentences of the source documents [8].

The proposed method uses a genetic algorithm to generate a population of extracts, to evaluate them and to rank them in order to produce the best extract. The genetic algorithm starts from a random solution, and then it builds, at each stage, a set of solutions and evaluates them. At each iteration, the genetic algorithm produces a population of extracts by randomly combining sentences from the various documents and applying crossover and mutation operators [3]. It assigns to each generated extract an objective value which depends on criteria applied to the extract as a whole unit. The best extract is the one that has the highest score, obtained when the generated population stagnates (see Figure 1).
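To make this process concrete, the following minimal Python sketch illustrates this kind of extract-level genetic search. It is an illustration under stated assumptions, not the authors' implementation: the objective function score is left abstract (Section 3.4 instantiates it with the Rouge2 score), the crossover and mutation operators are simplified, and all names (generate_extract, genetic_summarizer, patience, etc.) are hypothetical.

import random

def generate_extract(sentences, k):
    # An extract is a set of k sentence indices drawn from all source documents.
    return random.sample(range(len(sentences)), k)

def crossover(parent_a, parent_b, k):
    # Pool the sentence indices of two parent extracts, then draw k of them.
    pool = list(set(parent_a) | set(parent_b))
    return random.sample(pool, min(k, len(pool)))

def mutate(extract, n_sentences, rate=0.5):
    # Randomly replace some selected sentences by other source sentences.
    out = [random.randrange(n_sentences) if random.random() < rate else i
           for i in extract]
    return list(set(out))  # drop accidental duplicates

def genetic_summarizer(sentences, score, k=10, pop_size=200,
                       max_generations=300, patience=20):
    # score() evaluates a whole extract, never an isolated sentence.
    population = [generate_extract(sentences, k) for _ in range(pop_size)]
    best, best_score, stagnant = None, float("-inf"), 0
    for _ in range(max_generations):
        ranked = sorted(population, key=score, reverse=True)
        top = ranked[:pop_size // 2]
        if score(top[0]) > best_score:
            best, best_score, stagnant = top[0], score(top[0]), 0
        else:
            stagnant += 1
            if stagnant >= patience:  # population stagnates: stop early
                break
        # Breed the next population from the surviving half.
        population = top + [mutate(crossover(random.choice(top),
                                             random.choice(top), k),
                                   len(sentences))
                            for _ in range(pop_size - len(top))]
    return best

The population size of 200 and the cap of 300 generations match the parameters reported in Section 4; the per-gene mutation probability of 0.5 mirrors the reported mutation rate, while the paper's crossover rate of 0.6 is not modelled in this simplified operator. The key design choice mirrored here is that scoring applies to a whole extract, in line with the method described above.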
Two important points should be considered in this approach: which criteria can be used to evaluate the extracts, and how can we select the criteria that produce good extracts? A first trivial answer is to consider the length of the extract as a dominant criterion: if the length of the produced extract exceeds the length required by the user (or by the system), then the extract is rejected. To give a better answer, we propose a training step that explores several statistical criteria. This step determines, for each produced extract, the correlation between the statistical criteria and an informativeness value assigned to the extract. This value could be assigned by a human annotator who reads the extracts and gives each one a score from 1 to 10, but such a method is very subjective and cannot be applied to a large collection of extracts. Thus, to determine an informativeness value for the extracts, we adopt the Rouge measure used by the DUC conference to evaluate peer and human summaries.

The score determined by the Rouge system is based on "model units", which are n-grams of words extracted from a human summary (in practice we compare the extract with several human summaries at the same time). The output of the automatic summarizers is decomposed into peer units (PU), which are compared to the model units (MU). Rouge computes n-gram co-occurrences between summary pairs as follows [9, 12]:

C_n = \frac{\sum_{C \in \{ModelUnits\}} \sum_{n\text{-}gram \in C} Count_{match}(n\text{-}gram)}{\sum_{C \in \{ModelUnits\}} \sum_{n\text{-}gram \in C} Count(n\text{-}gram)}

where Count_match(n-gram) is the maximum number of n-grams co-occurring in a peer summary and a model unit, and Count(n-gram) is the number of n-grams in the model unit. Note that the average n-gram coverage score C_n is a recall metric: the denominator of this equation is the total number of n-grams occurring on the model summary side, and only one model summary is used for each evaluation. (The basic official ROUGE measures for DUC are the 1-gram, 2-gram, 3-gram, 4-gram and longest common substring scores.) Several studies show that Rouge2 (the bigram-based score) correlates with human judgement [5], so we adopt this measure to quantify the adequacy between the machine extracts and the human summaries.
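As an illustration of the formula above, here is a minimal Python sketch of the C_n computation for n = 2, the paper's Rouge2. It is a simplification, not the official ROUGE package: tokenization, stemming, stop-word options and multi-model jackknifing are omitted, and the model summary is simply given as a list of token lists, one per model unit.

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(peer_tokens, model_units, n=2):
    # model_units: one token list per model unit of the human summary.
    peer_counts = Counter(ngrams(peer_tokens, n))
    match = total = 0
    for unit in model_units:
        unit_counts = Counter(ngrams(unit, n))
        total += sum(unit_counts.values())          # Count(n-gram) side
        match += sum(min(count, peer_counts[gram])  # clipped Count_match
                     for gram, count in unit_counts.items())
    return match / total if total else 0.0

# Toy example: two model units against a peer extract.
model = [["the", "storm", "hit", "the", "coast"],
         ["thousands", "were", "evacuated"]]
peer = ["the", "storm", "hit", "the", "coast", "and", "thousands", "fled"]
print(rouge_n(peer, model, n=2))  # 4 of 6 model bigrams matched: ~0.67

Since the denominator counts only model-side n-grams, a longer peer extract is never rewarded by itself, which is what makes C_n a recall metric.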
There are also many studies that show the importance of statistical criteria in determining the candidate sentences that form the extract:

− Position: indicates the position of the sentence in the document;
− Size: indicates the number of terms (words) in the sentence;
− TF*IDF: the TF*IDF (term frequency * inverse document frequency) measure widely used in the information retrieval field;
− Similarity to title: this measure is based on the intersection between the vectorial representations of the sentences and of the titles of the source documents;
− Similarity to document keywords: similarly to the previous criterion, this criterion is computed from the vectorial representation of the document, measuring the similarity between each sentence and the vector of keywords with the cosine measure;
− Similarity to question keywords: this measure is computed from the vectorial representation of the document, calculating the similarity between the extract and the vector of words extracted from the user question. This criterion is introduced for the question-focused summarization task proposed in DUC'06 and DUC'07.

Note that our method aims to evaluate the extract as a whole unit, because we consider the extract as the minimal unit for extraction and classification; the above criteria should therefore be generalized to the extract. Few summarization methods consider the word as the extraction and classification unit; other methods consider the phrase, the sentence or even the paragraph as the unit. In our case, we use the following criteria, all of which are normalized (a sketch of how some of them can be computed is given after the list):

− APS: Average Position of Sentences: the average position of the extract's sentences in their source documents;
− SSS: Sum of Sentence Sizes: the length of the extract, obtained by summing the lengths of its sentences;
− ATI: Average TF*IDF: the sum of the TF*IDF scores of all the words of the extract divided by the number of words in the extract;
− CWT: Coverage of Title Words: the number of title words covered by the extract's sentences;
− CDK: Coverage of Document Keywords: the extract's coverage of the document keywords; these keywords are determined by computing word frequencies in the source document collection after eliminating stop words;
− CQW: Coverage of Question Words: for DUC'05-07, the summarization task is guided by a user question; this criterion computes the extract's coverage of the words that occur in the question (stop words are not considered);
− C2W: Coverage of two consecutive words: we determine, for the set of source documents, the frequency of each pair of consecutive words; this criterion measures the extract's coverage of the most frequent word pairs that occur in the set;
− C3W: Coverage of three consecutive words: the same criterion applied to three consecutive words;
− C4W: Coverage of four consecutive words: the same criterion applied to four consecutive words;
− PSW: Position of simple words: the sum of the positions of the first occurrence, in the source text, of each word of the extract (except stop words); we take the minimum value when the word occurs in several texts;
− P2W: Position of two words: the sum of the positions of the first occurrence of each of the extract's word pairs in their source text;
− P3W: Position of three words: the same criterion applied to three consecutive words;
− P4W: Position of four words: the same criterion applied to four consecutive words.
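To show how such extract-level criteria can be computed, the sketch below implements four of them (SSS, CWT, C2W and PSW) in Python. It is a rough sketch under assumptions of our own, not the ExtraNews code: documents and extract are assumed already tokenized and lowercased, "most frequent" pairs are approximated by the top_k most common bigrams, the normalization step mentioned above is omitted, and all names (extract_criteria, top_k, etc.) are hypothetical.

from collections import Counter

def extract_criteria(extract_sentences, title_words, docs, stop_words, top_k=20):
    # extract_sentences: token lists forming the candidate extract;
    # docs: one token list per source document.
    words = [w for sent in extract_sentences for w in sent]
    content = [w for w in words if w not in stop_words]

    # SSS: extract length, summing the lengths of its sentences.
    sss = len(words)

    # CWT: number of distinct title words covered by the extract.
    cwt = len(set(title_words) & set(content))

    # C2W: coverage of the most frequent consecutive word pairs of the sources.
    source_bigrams = Counter()
    for doc in docs:
        source_bigrams.update(zip(doc, doc[1:]))
    frequent = {pair for pair, _ in source_bigrams.most_common(top_k)}
    c2w = len(frequent & set(zip(words, words[1:])))

    # PSW: sum of first-occurrence positions of the extract's content words,
    # taking the minimum position when a word occurs in several documents.
    psw = 0
    for w in set(content):
        positions = [doc.index(w) for doc in docs if w in doc]
        if positions:
            psw += min(positions)

    return {"SSS": sss, "CWT": cwt, "C2W": c2w, "PSW": psw}

C3W, C4W, P2W, P3W and P4W follow the same pattern with longer n-gram windows, and APS, ATI, CDK and CQW only replace the counted items (sentence positions, TF*IDF scores, document or question keywords).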
3. The ExtraNews system

The proposed method is implemented in the ExtraNews system, which participated in the DUC'04-07 evaluation conferences. In those conferences we used some of the criteria described above, without any experimentation to justify our choice. In this work, we perform a training step to select the criteria that correlate with the Rouge2 score, in order to determine those to be used to produce and rank the extracts. We have extended the ExtraNews system so that it accepts as input a set of documents and a set of human summaries, and produces as output a table containing a list of values indicating the Rouge2 score and the corresponding values of all the other criteria. ExtraNews is composed of four main modules (see Figure 2): the pre-processing module, the statistical module, the linguistic module and the selective module.

[Figure 2. ExtraNews architecture: the documents pass through the pre-processing module, which produces the words and sentences; the statistical module builds the lists of keywords and of keyword positions; the linguistic module enriches the keyword lists using WordNet and the keywords of the user profile/query; the selective module uses the human summaries to output the selected extract.]

3.1. The pre-processing module

Generally, in text processing systems, a preliminary pre-processing phase is performed. This phase consists of identifying the sentences of each document, using punctuation marks to locate sentence boundaries, and removing very common words (stop words) (e.g., "the", "a", etc.). This module also removes suffixes (i.e., using a stemming technique), so that words such as "learned" and "learning" are converted to the stem "learn".

3.2. The statistical module

This module computes word frequencies and the relative positions of words in their source documents. For this purpose, it generates a list of frequent n-gram keywords (of 1, 2 and 3 words). It also determines the positions of the keywords in the source documents. These lists are then used to determine, for each extract, the size, coverage and position criteria.

3.3. The linguistic module

This module enriches the keyword lists (produced from the documents or from the user query or profile) with synonymous words. It uses the WordNet dictionary, which provides predefined functions and relations such as the synset relation, which we use to obtain the list of synonyms [2]. The enrichment of the keyword lists is very helpful to correct their frequencies: if two keywords occur in the source texts, their frequencies are added, and the same applies to synonyms whose meanings are very close (a sketch of this enrichment is given at the end of this section).

3.4. The selective module

This module operates in two stages: the generation and the ranking of the summaries in order to choose the best one. Thus, we define an evaluation function in which we try to maximize the Rouge2 score (i.e., as an optimization problem). The genetic algorithm implemented in the ExtraNews system starts from a random solution, and then builds, at each stage, a set of solutions and evaluates them [3]. At each iteration, this algorithm produces a population of extracts by randomly combining sentences of the various documents and applying crossover and mutation operators. It assigns to each generated extract a fitness value which depends on its Rouge2 score. This score is obtained by comparing the extract with the human summaries. When the Rouge2 value stagnates, the algorithm stops and the ranking gives the best extract.
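As a concrete illustration of the enrichment step of Section 3.3, the sketch below merges the frequencies of synonymous keywords using WordNet. The paper does not say which WordNet binding ExtraNews uses; this sketch assumes NLTK's interface (nltk.corpus.wordnet, which requires the wordnet data to be downloaded), and the function name and the symmetric merging policy are assumptions of ours.

from collections import Counter
from nltk.corpus import wordnet as wn  # needs: nltk.download('wordnet')

def enrich_keyword_frequencies(keyword_freq):
    # keyword_freq: Counter mapping keywords to their source-text frequencies.
    # When two keywords are WordNet synonyms, their frequencies are merged,
    # as done by the linguistic module (an assumed reading of its behavior).
    enriched = Counter(keyword_freq)
    keywords = list(keyword_freq)
    for word in keywords:
        synonyms = {name.lower().replace("_", " ")
                    for syn in wn.synsets(word)
                    for name in syn.lemma_names()}
        for other in keywords:
            if other != word and other in synonyms:
                enriched[word] += keyword_freq[other]
    return enriched

# "car" and "automobile" share a synset, so both get the merged count of 8.
print(enrich_keyword_frequencies(Counter({"car": 5, "automobile": 3})))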
4. Results

The corpus used to evaluate the proposed learning step is extracted from the DUC'06 document sets. We used 25 document sets; each set contains 25 documents, and the average number of sentences is 32 per document. The complexity of our program depends essentially on the runtime of the genetic algorithm, and this is the first result that can be drawn from this experiment. The number of generations is fixed at 300 (maximum value) and each population contains 200 extracts. The cross-over rate used in our algorithm is 0.6 and the mutation rate is 0.5. The complexity of our algorithm differs from one document set to another, since it depends on the number of initial sentences handled by the genetic algorithm. Figure 3 shows the runtime of our algorithm: the x-axis represents the different document sets extracted from DUC'06, and the y-axis represents the average time needed to reach the best solution over three runs for one document set (the computer used for the experiment is a Centrino Core Solo at 1.86 GHz with 512 MB of RAM).

[Figure 3. Runtime evaluation of the genetic algorithm (x-axis: sentence number per document set; y-axis: time in seconds).]

The results illustrated in Figure 3 show that the average time is around 25 seconds (for fewer than 1000 sentences), so we estimate that our program can be used in an online context.

Figure 4 reports the variation of the frequency criteria mentioned above with Rouge2 for document set D0604D. We present only a sample of the result of the Rouge2 score variation.

[Figure 4. Comparison of the Rouge2 score with the frequency criteria (ATI, CWT, CDK, CQW).]

As illustrated in Figure 4, the Rouge2 score correlates with the ATI (average TF*IDF) score: the ATI value increases when the Rouge2 score increases.

Figure 5 shows a comparison of the Rouge2 score with the coverage criteria. This comparison does not give a significant result, but we can notice a low correlation with the C2W criterion (coverage of two consecutive words).

[Figure 5. Comparison of the Rouge2 score with the coverage criteria (C2W, C3W, C4W).]

Figure 6 illustrates the comparison of the Rouge2 score with the position criteria. This comparison shows a low correlation with the PSW criterion.

[Figure 6. Comparison of the Rouge2 score with the position criteria (PSW, P2W, P3W, P4W).]

Note that all these comparisons have been performed on all the DUC'06 collection sets and we obtained the same results.

5. Conclusion

In this paper we have shown the interest of using a learning technique to detect the criteria that can be used to select the best extracts. The system evaluation has led to two main conclusions: the average TF*IDF criterion presents the best correlation with the Rouge2 score, while the C2W and PSW criteria present a low correlation with it. As future work, we intend to experiment with other criteria using the learning step. We also plan to study the correlation of these criteria on other corpora. In addition, we think that we can enhance the runtime of our program by applying a filtering step that reduces the number of sentences to be considered from the source documents.

6. References

[1] DUC, Document Understanding Conference, Workshop on Text Summarization 2003-2004, Boston, USA. http://www-nlpir.nist.gov/projects/duc/

[2] Fellbaum C., WordNet: An Electronic Lexical Database, The MIT Press, 1998.

[3] Goldberg D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, Massachusetts, 1989.

[4] Goldstein J., Carbonell J., Callan J., "Creating and Evaluating Multi-Document Sentence Extract Summaries", Proceedings of the ACM-CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, 2000, pp. 165-172.

[5] Hirohata M., Shinnaka Y., Iwano K., Furui S., "Sentence Extraction-Based Presentation Summarization Techniques and Evaluation Metrics", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[6] Jaoua K.F., Belguith H.L., Ben Hamadou A., "Révision des Extraits de Documents Multiples Basée sur la Réorganisation des Phrases : Application à la Langue Arabe", Proceedings of the IBIMA 08 Conference: Information Management in Modern Organizations, Marrakech, Morocco, January 4-6, 2008.
[7] Jaoua K.F., Jaoua M., Ben Hamadou A., "Summarization at LARIS Laboratory", Proceedings of the DUC 2004 Conference, Boston, USA, 2004.

[8] Jaoua K.F., Jaoua M., Ben Hamadou A., "Une méthode de condensation automatique des documents multiples : cas des articles de presse", Proceedings of CIDE.6, the International Conference on the Electronic Document, Caen, France, November 2003.

[9] Lin C.Y., Hovy E.H., "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics", Proceedings of the 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.

[10] Marcu D., Gerber L., "An Inquiry into the Nature of Multidocument Abstracts, Extracts and Their Evaluation", Proceedings of the Automatic Summarization Workshop (DUC'01), 2001.

[11] McKeown K. et al., "Columbia at the Document Understanding Conference 2003", Workshop on Text Summarization (DUC 2003), Edmonton, Canada, 2003.

[12] Papineni K., Roukos S., Ward T., Zhu W.J., "BLEU: a Method for Automatic Evaluation of Machine Translation", IBM Research Report RC22176 (W0109022), 2001.

[13] Radev D., Otterbacher J., Qi H., Tam D., "MEAD ReDUCs: Michigan at DUC 2003", Workshop on Text Summarization (DUC 2003), Edmonton, Canada, 2003.