A Learning Technique to Determine Criteria for Multiple Document
Summarization
Fatma Kallel Jaoua
Miracl Laboratory
[email protected]
Maher Jaoua
Miracl Laboratory
[email protected]
Lamia Hadrich Belguith
Miracl Laboratory
[email protected]
Abdelmajid Ben Hamadou
Miracl Laboratory
[email protected]
Abstract
In this paper, we describe a new method of automatic summarization based on a learning step that identifies the criteria maximizing the correlation between human summaries and peer extracts. The proposed method uses a genetic algorithm to produce extracts from a collection of source documents describing the same event. These extracts are compared to human summaries using the Rouge measure in order to identify the correlation between statistical and linguistic criteria and the Rouge score. Experimental results are presented for a document set extracted from the DUC'06 evaluation conference.
1. Introduction
The use of automatic tools for information processing has become essential to users. Indeed, without such tools, it is very difficult to exploit all the relevant information available online. In this context, automatic text summarization systems are very useful. The goal of such systems is to generate a short version (the output) of single or multiple documents (the input) that allows the user to obtain the pertinent information.
In recent years, the Natural Language Processing (NLP) and Information Retrieval (IR) communities have shown great interest in automatic text summarization. Indeed, they have encouraged related research work by organizing several workshops on automatic text summarization (e.g. EACL'97 and WAS'00, the Workshop on Automatic Summarization held at ANLP-NAACL 2000, Seattle, Washington, USA), special sessions in international conferences (e.g. SIGIR), and evaluation conferences (e.g. TIPSTER and DUC, the Document Understanding Conference, which proposes each year large-scale experiments by creating reference data, documents and summaries, for training and testing) to encourage the implementation of practical systems. DUC conferences, for example, have been interested since 2004 in summarizing multiple documents in response to specific user questions [1, 6].
Several methods for automatic multiple document summarization have been proposed. The common strategy used in these methods is sentence extraction. This strategy ranks and extracts the representative textual units (sentences, words, phrases or paragraphs) of the multiple documents. Radev et al. described an extractive multi-document summarizer that extracts a summary from multiple documents based on document cluster centroids [13]. Other works also used similarity techniques in order to determine the sentences reporting the same topic [4]. Marcu introduced two algorithms for sentence compression based on a noisy-channel model and a decision-tree approach [10]. McKeown et al. developed a system using a generation module in order to reformulate the obtained extract. They also described an algorithm for information fusion, which tries to combine similar sentences across documents to create new sentences based on language generation technologies [11].
In the sentence extraction strategy, clustering is introduced to avoid the information redundancy resulting from the multiplicity of the source documents. However, the redundancy problem cannot be totally solved, because the clustering process cannot maintain the disjunction between clusters. Moreover, clustering methods face a critical problem: how to find the appropriate number of clusters, and which criteria can be used to separate them. Some researchers adopted a predefined cluster number or a similarity threshold. Even
if the cluster number is determined, the sentences (or other textual units such as phrases) that have the highest scores in their clusters are not always the best ones (i.e., when we consider the summary as a whole).
To solve these problems, we adopt a genetic algorithm (GA) [7, 8] in order to determine the set of sentences that maximizes the extract's informativeness. In addition, we propose a new strategy to identify the criteria that can be used to select an extract according to a given task. Note that the produced extract should be revised in order to improve its coherence. In previous work, we proposed an additional revision step that includes a sentence reordering technique based on the context of the sentences selected for the extract [6].
The goal of this paper is to detail the method for determining the criteria that can be used to extract sentences from multiple documents. The paper is organized as follows. Section 2 presents an overview of the proposed method and describes the criteria used in the training step. Section 3 describes our system architecture. Section 4 reports experimental results on the DUC'06 data set. Finally, Section 5 presents the conclusion and future work.
Figure 1. Genetic algorithm process (after preprocessing, a set of sentences is drawn from the documents; random generation produces population i of extracts; crossover and mutation produce population i+1; evaluation and selection repeat until stagnation yields the final extract).
2. Overview of the proposed method
Clustering is the phase common to the different summarizing methods described above. It is introduced in order to eliminate the information redundancy that results from the multiplicity of the source documents. However, the redundancy problem cannot be totally solved, because the clustering process cannot maintain the disjunction between clusters. Thus, the quality of the produced extracts strongly depends on redundancy and coherence problems (e.g., the coherence problem is due to information lack, sentence ordering, information credibility lack, etc.).
To solve these problems and to enhance the extract quality, we have proposed a new extraction process that operates on the extract in its totality and not on its independent parts (e.g., sentences or phrases). Therefore, we consider the extraction process as an optimization problem where the optimal extract is chosen among a set of extracts formed by conjunctions of sentences from the source documents [8]. The proposed method uses a genetic algorithm to generate a population of extracts, to evaluate them and to classify them in order to produce the best extract. The genetic algorithm starts from a random solution, then builds, at each stage, a set of solutions and evaluates them. At each iteration, it produces a population of extracts by randomly combining sentences from the various documents and applying crossover and mutation operators [3]. It assigns to each generated extract an objective value, which depends on criteria applied to the extract as a whole unit. The best extract is the one that has the highest score, which is obtained when the generated population stagnates (see Figure 1).
Two important points should be considered in this approach: which criteria can be used to evaluate the extracts, and how can we select the criteria that produce good extracts? A first trivial answer is to consider the length of the extract as a dominant criterion: if the length of the produced extract exceeds the length required by the user (or by the system), then we reject it.
To give a better answer to these questions, we propose a training step that explores several statistical criteria. This step determines, for each produced extract, the correlation between the statistical criteria and an informativeness value assigned to the extract. This value could be assigned by a human annotator reading the extracts and giving each a score from 1 to 10, but this method is very subjective and cannot be performed for a large collection of extracts.
Thus, to determine an informativeness value for the extracts, we adopt the Rouge measure used by the DUC conference to evaluate peer and human summaries. The score determined by the Rouge system is based on "model units", which are n-grams extracted from the human summaries (in practice we compare the extract with several human summaries at the same time). The output of the automatic summarizers is decomposed into peer units (PU), which are compared to the model units (MU). Rouge computes the n-gram co-occurrences between summary pairs as follows [9, 12]:
$$C_n = \frac{\sum_{C \in \{\mathrm{ModelUnits}\}} \; \sum_{n\text{-gram} \in C} \mathrm{Count}_{\mathrm{match}}(n\text{-gram})}{\sum_{C \in \{\mathrm{ModelUnits}\}} \; \sum_{n\text{-gram} \in C} \mathrm{Count}(n\text{-gram})}$$
where Count_match(n-gram) is the maximum number of n-grams co-occurring in the peer summary and a model unit, and Count(n-gram) is the number of n-grams in the model unit. Note that the average n-gram coverage score (C_n) is a recall metric: its denominator is the total number of n-grams occurring on the model summary side, and only one model summary is used for each evaluation. (The basic official Rouge measures for DUC are the 1-gram, 2-gram, 3-gram, 4-gram and longest common substring scores.)
Several studies show that Rouge2 (the 2-gram measure) correlates with human judgment [5], so we adopt this measure to quantify the adequacy between the machine extracts and the human summaries.
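To make the computation concrete, here is a minimal Python sketch of this recall-oriented n-gram score. The function names, the tokenized inputs and the clipped-count detail are our own illustrative assumptions, not the official Rouge implementation.

from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(peer_tokens, model_summaries, n=2):
    # Recall-oriented n-gram co-occurrence between one peer extract
    # and a set of human (model) summaries, following the formula above.
    peer = ngrams(peer_tokens, n)
    matched, total = 0, 0
    for model_tokens in model_summaries:
        model = ngrams(model_tokens, n)
        # Clipped counts: an n-gram matches at most as many times
        # as it occurs on the model side.
        matched += sum(min(count, peer[gram]) for gram, count in model.items())
        total += sum(model.values())
    return matched / total if total else 0.0

# Example: score a short extract against one human summary (n = 2).
score = rouge_n("the storm hit the coast".split(),
                ["the storm hit the northern coast".split()], n=2)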
There are also many studies that show the importance of statistical criteria in determining the candidate sentences to form the extract:
− Position: indicates the position of the sentence in the document;
− Size: indicates the number of terms (words) in the sentence;
− TF*IDF: the TF*IDF (term frequency times inverse document frequency) measure, widely used in the information retrieval field;
− Similarity to title: this measure is based on the intersection between the vectorial representations of the sentences and the titles of the source documents;
− Similarity to document keywords: similarly to the previous criterion, this criterion is computed from the vectorial representation of the document, measuring the similarity between each sentence and the vector of keywords with the cosine measure;
− Similarity to question keywords: this measure is computed by using the vectorial representation of the document and calculating the similarity between the extract and the vector of words extracted from the user question. This criterion is introduced for the question summarization task proposed in DUC'06 and DUC'07.
Note that the aim of our method is to evaluate the extract as a whole unit, because we consider the extract as the minimal unit for extraction and classification, so the above criteria must be generalized to the extract. One notices that few summarization methods consider the word as the extraction and classification unit; other methods consider the phrase, the sentence or even the paragraph as the unit.
In our case, we use the following criteria (all of them are normalized):
− APS: Average Position of Sentences: the average position of the extract's sentences in their source documents;
− SSS: Sum of Sizes: the length of the extract, obtained by summing the lengths of its sentences;
− ATI: Average TF*IDF: for all words of the extract we calculate the TF*IDF score and divide the sum of these scores by the number of words in the extract;
− CWT: Coverage of Word Title: the number of words of the title covered by the extract's sentences;
− CDK: Coverage of Document Keywords: the extract's coverage of the document keywords. These keywords are determined by calculating the word frequencies in the source document collection after elimination of the stop words;
− CQW: Coverage of Question Words: for DUC'05-07, the summarization task is guided by a user question. This criterion calculates the extract's coverage of the words that occur in the question (stop words are not considered);
− C2W: Coverage of two consecutive words: we determine, for the set of source documents, the frequency of each pair of consecutive words. This criterion measures the extract's coverage of the most frequent two-word sequences occurring in the set;
− C3W: Coverage of three consecutive words: the same criterion applied to three consecutive words;
− C4W: Coverage of four consecutive words: the same criterion applied to four consecutive words;
− PSW: Position of simple words: the sum of the first occurrence positions of each word of the extract (except stop words) in its source text (we take the minimum value if the word occurs in several texts);
− P2W: Position of two words: the sum of the first occurrence positions of each of the extract's word pairs in their source text;
− P3W: Position of three words: the same criterion applied to three consecutive words;
− P4W: Position of four words: the same criterion applied to four consecutive words.
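As an illustration of how such extract-level criteria can be computed, the following sketch evaluates three of them (SSS, ATI and CQW) for one candidate extract. The data structures (token lists, a TF*IDF dictionary, a question-word set) and the simplified normalization are assumptions made for the sake of the example.

def extract_criteria(extract_sentences, tfidf, question_words, stop_words):
    # Compute three of the extract-level criteria for one candidate extract.
    # extract_sentences: list of sentences, each a list of tokens.
    words = [w for sentence in extract_sentences for w in sentence]
    content = [w for w in words if w not in stop_words]
    sss = len(words)  # SSS: total length of the extract
    # ATI: sum of the words' TF*IDF scores divided by the word count.
    ati = sum(tfidf.get(w, 0.0) for w in words) / max(len(words), 1)
    # CQW: fraction of the (non stop-word) question words covered.
    cqw = len(set(content) & question_words) / max(len(question_words), 1)
    return {"SSS": sss, "ATI": ati, "CQW": cqw}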
3. The ExtraNews system
The proposed method has been implemented in the ExtraNews system, which participated in the DUC'04-07 evaluation conferences. In those conferences, we used some of the criteria described above without any experimentation to justify our choice. In this work, we perform a training step to select the criteria that correlate with the Rouge2 score, in order to determine those to be used to produce and rank the extracts. We have extended the ExtraNews system so that it accepts as input a set of documents and a set of human summaries, and produces as output a table containing the Rouge2 score of each extract together with the corresponding values of all the other criteria. ExtraNews is composed of four main modules (see Figure 2): the pre-processing module, the statistical module, the linguistic module and the selective module.
3.1. The pre-processing module
Generally, in text processing systems, a preliminary pre-processing phase is performed. This phase consists of identifying the sentences of each document, using the punctuation marks to locate sentence boundaries, and removing very common words (stop words, e.g. "the", "a", etc.). This module also removes suffixes (i.e., using a stemming technique), so that words such as "learned" and "learning" are reduced to the stem "learn".
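The following is a rough sketch of this phase; the tiny stop-word list and the naive suffix rule are stand-ins for the full resources a real system would use (e.g. a complete stop list and a Porter-style stemmer).

import re

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and"}  # illustrative subset

def preprocess(document):
    # Segment into sentences on punctuation, then tokenize,
    # drop stop words and strip a few common suffixes.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    processed = []
    for sentence in sentences:
        tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", sentence)]
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # Naive suffix stripping standing in for a real stemmer:
        # "learned" and "learning" both become "learn".
        stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
        processed.append(stems)
    return processed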
3.2. The statistical module
This module computes word frequencies and the relative positions of words in their source documents. For this purpose, it generates a list of frequent n-gram keywords (of one, two and three words) and determines the positions of these keywords in the source documents. These lists are then used to determine, for each extract, the size, coverage and position criteria.
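A minimal sketch of these computations, assuming each document is already tokenized, could look as follows (the names and structures are illustrative):

from collections import Counter, defaultdict

def keyword_statistics(documents, max_n=3):
    # documents: mapping from a document id to its token list.
    # Returns the collection-wide frequency of every 1- to max_n-gram
    # and the first position of each n-gram in each document.
    freq = Counter()
    first_pos = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                if gram not in first_pos[doc_id]:
                    first_pos[doc_id][gram] = i
    return freq, first_pos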
3.3. The linguistic module
This module enriches the keyword lists (produced from the documents, from the user query or from the user profile) with synonyms. It uses the WordNet lexical database, which provides predefined functions and relations such as the synset relation, which we use to obtain the lists of synonyms [2]. Enriching the keyword lists is very helpful for correcting the keyword frequencies: if two synonymous keywords occur in the source texts, their frequencies are added. This is done only for synonyms whose meanings are very close.
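A sketch of this enrichment using the WordNet interface of NLTK (assuming the WordNet data package is installed; the merging rule is simplified to adding the frequencies of synonymous keywords):

from nltk.corpus import wordnet as wn  # requires the WordNet data package

def enrich_frequencies(freq):
    # freq: mapping from a keyword to its frequency in the source texts.
    # Add to each keyword the frequencies of its WordNet synonyms
    # that also occur as keywords.
    enriched = dict(freq)
    for word in freq:
        synonyms = {lemma.replace("_", " ")
                    for synset in wn.synsets(word)
                    for lemma in synset.lemma_names()} - {word}
        for syn in synonyms & set(freq):
            enriched[word] += freq[syn]
    return enriched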
3.4. The selective module
This module operates in two stages: the generation and the classification of the summaries, in order to choose the best one. Thus, we define an evaluation function in which we try to maximize the Rouge2 score (i.e., as an optimization problem). The genetic algorithm implemented in the ExtraNews system starts from a random solution, then builds, at each stage, a set of solutions and evaluates them [3]. At each iteration, the algorithm produces a population of extracts by randomly combining sentences of the various documents and applying crossover and mutation operators. It assigns to each generated extract a fitness value that depends on its Rouge2 score, obtained by comparing the extract with the human summaries. When the Rouge2 value stagnates, the algorithm stops and the classification gives the best extract.
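A minimal sketch of such a generation-and-classification loop, using the parameters reported in Section 4 (population of 200, at most 300 generations, crossover rate 0.6, mutation rate 0.5), could look as follows. The bit-vector encoding, the stagnation window (patience) and the fitness callback are illustrative assumptions: during training the fitness would be the Rouge2 score against the human summaries, while at summarization time it would combine the selected criteria.

import random

def genetic_extract(sentences, fitness, pop_size=200, max_gen=300,
                    crossover_rate=0.6, mutation_rate=0.5, patience=20):
    # An individual is a bit vector selecting a subset of the sentences.
    n = len(sentences)
    population = [[random.random() < 0.1 for _ in range(n)]
                  for _ in range(pop_size)]
    best, best_score, stagnant = None, float("-inf"), 0
    for _ in range(max_gen):
        population.sort(key=fitness, reverse=True)  # classification step
        top_score = fitness(population[0])
        if top_score > best_score:
            best, best_score, stagnant = population[0], top_score, 0
        else:
            stagnant += 1
            if stagnant >= patience:  # stagnation: stop and keep the best
                break
        parents = population[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = list(a)
            if random.random() < crossover_rate:  # one-point crossover
                cut = random.randrange(1, n)
                child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:   # flip one random gene
                i = random.randrange(n)
                child[i] = not child[i]
            children.append(child)
        population = children
    return [s for s, keep in zip(sentences, best) if keep]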
Figure 2. ExtraNews architecture (documents 1..n pass through the pre-processing module, which produces words and sentences; the statistical module produces the lists of keywords and keyword positions; the linguistic module, supported by WordNet and the user profile/query, enriches the keyword lists; the selective module, guided by the human summaries, produces the selected extract).
4. Results
The corpus used to evaluate the proposed learning step is extracted from the DUC'06 document sets. We used 25 document sets, each containing 25 documents, with an average of 32 sentences per document. The complexity of our program depends essentially on the runtime of the genetic algorithm, and this is the first result that can be drawn from this experiment. The number of generations is fixed at 300 (maximum value) and each population contains 200 extracts. The crossover rate used in our algorithm is 0.6 and the mutation rate is 0.5.
The complexity of our algorithm differs from one document set to another; it depends on the number of initial sentences handled by the genetic algorithm. Figure 3 shows the runtime of our algorithm: the x-axis represents the different document sets extracted from DUC'06, and the y-axis represents the average time to reach the best solution over three runs for one document set (the computer used for the experiments is a Centrino Core Solo at 1.86 GHz with 512 MB of RAM).
The results illustrated in Figure 3 show that the average time is around 25 seconds (for fewer than 1,000 sentences), so we estimate that our program can be used in an online context.
Figure 3. Runtime evaluation of the genetic algorithm (x-axis: document sets extracted from DUC'06, identified by their sentence number; y-axis: average time in seconds to reach the best solution).
Figure 4 reports the variation of the frequency criteria mentioned above with the Rouge2 score for document set D0604D. We present only a sample of the Rouge2 score variation.

Figure 4. Comparison of the Rouge2 score with the frequency criteria (ATI, CWT, CDK, CQW).

As illustrated in Figure 4, the Rouge2 score correlates with the ATI (average TF*IDF) score: the ATI value increases when the Rouge2 score increases.

Figure 5 shows a comparison of the Rouge2 score with the coverage criteria. This comparison does not give a significant result, but we can notice a low correlation with the C2W criterion (coverage of two consecutive words).

Figure 5. Comparison of the Rouge2 score with the coverage criteria (C2W, C3W, C4W).

Figure 6 illustrates the comparison of the Rouge2 score with the position criteria. This comparison shows a low correlation with the PSW criterion.

Figure 6. Comparison of the Rouge2 score with the position criteria (PSW, P2W, P3W, P4W).

Note that all these comparisons have been performed on all the DUC'06 collection sets and we obtained the same results.
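The paper does not name the correlation statistic used in these comparisons; assuming a Pearson correlation between each criterion and the Rouge2 score over the generated extracts, the training-step output table could be analyzed with a sketch such as:

import numpy as np

def criterion_correlations(records, criteria=("ATI", "C2W", "PSW")):
    # records: one dict per generated extract, holding its Rouge2 score
    # and the values of the criteria computed on it.
    rouge = np.array([r["Rouge2"] for r in records])
    return {name: np.corrcoef(rouge,
                              np.array([r[name] for r in records]))[0, 1]
            for name in criteria}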
5. Conclusion
In this paper, we have shown the interest of using a learning technique to detect the criteria that can be used to select the best extracts. The system evaluation has led to two main conclusions: the average TF*IDF criterion presents the best correlation with the Rouge2 score, whereas the C2W and PSW criteria present a low correlation with it. As future work, we intend to experiment with other criteria using the learning step, and to study the correlation of these criteria on other corpora. In addition, we think that we can improve the runtime of our program by applying a filtering step that reduces the number of sentences to be considered from the source documents.
6. References
[1] DUC: Document Understanding Conference, Workshop on Text Summarization 2003-2004, Boston, USA. http://www-nlpir.nist.gov/projects/duc/
[2] Fellbaum C., WordNet: An Electronic Lexical Database, The MIT Press, 1998.
[3] Goldberg D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, Massachusetts, 1989.
[4] Goldstein J., Carbonell J., Callan J., "Creating and Evaluating Multi-Document Sentence Extract Summaries", Proceedings of the ACM-CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, 2000, pp. 165-172.
[5] Hirohata M., Shinnaka Y., Iwano K., Furui S., "Sentence extraction-based presentation summarization techniques and evaluation metrics", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[6] Jaoua K.F., Belguith H.L., Ben Hamadou A., "Révision des Extraits de Documents Multiples Basée sur la Réorganisation des Phrases : Application à la Langue Arabe" (Revision of Multiple Document Extracts Based on Sentence Reordering: Application to the Arabic Language), Proceedings of the IBIMA 08 conference: Information Management in Modern Organizations, Marrakech, Morocco, January 4-6, 2008.
[7] Jaoua K.F., Jaoua M., Ben Hamadou A., "Summarization at Laris Laboratory", Proceedings of the DUC 2004 Conference, Boston, USA, 2004.
[8] Jaoua K.F., Jaoua M., Ben Hamadou A., "Une méthode de condensation automatique des documents multiples : cas des articles de presse" (A Method for Automatic Summarization of Multiple Documents: the Case of Press Articles), Proceedings of CIDE.6, International Conference on the Electronic Document, Caen, France, November 2003.
[9] Lin C.Y., Hovy E.H., "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics", Proceedings of the 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.
[10] Marcu D., Gerber L., "An Inquiry into the Nature of Multidocument Abstracts, Extracts and Their Evaluation", Proceedings of the Automatic Summarization Workshop (DUC'01), 2001.
[11] McKeown K. et al., "Columbia at the Document Understanding Conference 2003", Workshop on Text Summarization (DUC 2003), Edmonton, Canada, 2003.
[12] Papineni K., Roukos S., Ward T., Zhu W.J., "BLEU: a Method for Automatic Evaluation of Machine Translation", IBM Research Report RC22176 (W0109-022), 2001.
[13] Radev D., Otterbacher J., Qi H., Tam D., "MEAD ReDUCs: Michigan at DUC 2003", Workshop on Text Summarization (DUC 2003), Edmonton, Canada, 2003.