Indian Journal of Science and Technology, Vol 10(8), DOI: 10.17485/ijst/2017/v10i8/108907, February 2017
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645
Improving Triangle-Graph Based Text Summarization
using Hybrid Similarity Function
Yazan Alaya AL-Khassawneh1*, Naomie Salim1 and Mutasem Jarrah1,2
1Faculty of Computing, Universiti Teknologi Malaysia, 81310, Skudai, Johor, Malaysia; [email protected], [email protected]
2Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia; [email protected]
*Author for correspondence
Abstract
Objective: Extractive summarization extracts the most relevant sentences from the main document while keeping the most vital information in the document. Graph-based techniques have become very popular for text summarisation. This paper introduces a hybrid graph-based technique for single-document extractive summarization. Methods/Statistical Analysis: Prior research that utilised the graph-based approach for extractive summarisation deployed a single function for computing the necessary summary. In our work, however, we propose an innovative hybrid similarity function (H) for estimating sentence similarity. This function hybridises four distinct similarity measures: cosine similarity (sim1), Jaccard similarity (sim2), word alignment-based similarity (sim3) and the window-based similarity measure (sim4). The method uses a trainable summarizer, which takes into account several features; the effect of these features on the summarization task is investigated. Findings: By combining the traditional similarity measures (cosine and Jaccard) with dynamic programming approaches (word alignment-based and window-based) for calculating the similarity between two sentences, more common information was extracted, which helped to find the best sentences for the final summary. The proposed method was evaluated using ROUGE measures on the DUC 2002 dataset. The experimental results showed that specific combinations of features give higher efficiency, and that some features affect summary creation more than others. Applications/Improvements: The performance of this new method has been tested using the DUC 2002 dataset. The effectiveness of the technique is measured using the ROUGE score, and the results are promising when compared with some existing techniques.
Keywords: Extractive Summarization, Feature Extraction, Graph-Based Summarization, Hybrid Similarity, Sentence Similarity, Triangle Counting
1. Introduction
With the rapid expansion of the Internet, a vast quantity of knowledge is offered and reachable online. Humans extensively use the Internet to find information through proficient Information Retrieval (IR) tools such as Google, Yahoo, AltaVista, and so on. People do not have enough time to read everything, yet they need to make vital decisions based on the information available there. The need for new procedures to help people find and absorb the vital information in these sources becomes increasingly imperative as the quantity and accessibility of textual information grows, and the need for automated summaries becomes more visible. Improvement in text summarization and filtering will not only permit the development of superior retrieval systems, but will also support accessing and analysing information in the text in multiple ways.

There are diverse definitions of a summary. In1 describes the summary as a text that is formed from one or more texts; it keeps the most vital information of the original documents and its content is not more than half of the original documents. In2 defines the summarization of a text as the process of finding the most significant contents, a procedure for discovering the major sources of information, and presenting them as a brief text in a predefined form. In3 provide the following definition for a summary: "A summary can be loosely defined as a text that is produced from one or more texts that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that".
Text summarization has three main steps4: identifying the topic, interpretation, and summary generation. In topic identification, the most important information in the document is recognized. Most systems assign different preferences to different fragments of the text (words, phrases, and sentences); the scores of every part are then combined to determine the total score of a part. Finally, the system includes the N highest-scoring fragments in the concluding summary. Cue phrases, content counting and word occurrence are some examples of the numerous techniques for topic identification that have been studied5.

The interpretation step relates to abstract summaries. In this step, related topics are joined to form a common brief content and extra expressions are ignored6. Condensing the topics in this way is complicated; consequently, most systems produce an extractive summary.

In the summary generation step, the system uses text generation procedures. This step covers a collection of diverse generation procedures, from very straightforward word or phrase printing to more complicated phrase assimilation and sentence generation. In other words, this step generates natural language that is easy for users to understand.
Summarization systems are categorized based on the type of generated summary. Generally speaking, the methods can be either extractive or abstractive. Extractive summarization involves assigning salience scores to some units (e.g. sentences, paragraphs) of the document and extracting the sentences with the highest scores, while abstractive summarization (e.g. http://www1.cs.columbia.edu/nlp/newsblaster/) usually needs information fusion, sentence compression and reformulation7,8.

The summarization techniques can be classified into two groups: supervised techniques that rely on pre-existing document-summary pairs, and unsupervised techniques based on properties and heuristics derived from the text. Supervised extractive summarization techniques treat the summarization task as a two-class classification problem at the sentence level, where the summary sentences are positive samples while the non-summary sentences are negative samples. After representing each sentence by a vector of features, the classification function can be trained in two different manners9. One is in a discriminative way with well-known algorithms such as Support Vector Machines (SVM)10. Many unsupervised methods have been developed for document summarization by exploiting different features and relationships of the sentences; see for example11-17.

On the other hand, the summarization task can also be categorized as either generic or query-based. A query-based summary presents the information that is most relevant to the given queries8,18-20, while a generic summary gives an overall sense of the document's content; see for example8,11-14,18,20-23.

There are also two different groups of text summarization: indicative and informative24. Indicative summarization gives the main idea of the text to the user; the length of this type of summary is around 5 per cent of the given text. An informative summarization system gives brief information about the main text; the length of an informative summary is around 20 per cent of the given text.
This work focuses on extractive summaries. Extractive summaries are built by extracting key text segments (sentences or paragraphs) from the document, based on statistical analysis of individual or combined surface-level features such as word/phrase frequency, position or cue words, to find the sentences that should be extracted. The "most important" content is treated as the "most frequent" or the "most favourably positioned" content. This approach therefore avoids any effort at deep text understanding; it is conceptually straightforward and simple to apply. To date, numerous diverse procedures have been suggested to choose the most vital fragments of the text, such as statistical approaches, which include the Aggregation Similarity Method25, Location Method26, Frequency Method27 and TF-Based Query Method28, and linguistic approaches, which include graph theory, lexical chains, WordNet and clustering.

According to29,30, a graph is defined as a group of nodes and a group of edges that join a number of pairs of nodes. In the context of databases, the nodes symbolize individual components and the edges signify the relationships between these components. The Triangle Counting Approach is a technique used to prune the graph, and the aim of this study is to use this technique to create a summary.
The rest of this paper is organized as follows: Section 2 discusses related work on graph-based text summarization and the triangle counting approach. An overview of the proposed approach is given in Section 3. Experimental results are given in Section 4. Section 5 discusses the work and finally a summary is given in Section 6.

2. Related Work
2.1 Graph-Based Text Summarization
As our study focuses primarily on graph-based single-document extractive summarization processes, we examine here the reports related to graph-based analysis.
Several reports have stated that graph algorithms can generate document summaries very effectively. They provide a way of determining the relevance of a vertex in a graph based on information drawn from the document's graph structure31. When using graph models for natural language text documents, a graph is built which symbolises the document text and connects the words and the text having meaningful relationships31,32. Text units of varied sizes, e.g. words or sentences, can be used as the graph vertices, and relationships such as lexical and semantic relations32 or co-reference resolutions can be used for drawing the edges between the vertices. Graph-based ranking algorithms like TextRank31, PageRank33 and HITS33, which can be used on undirected or weighted graphs, are then iterated for the text-based ranking application33. Thereafter, the vertices are sorted by their final scores; the higher-ranked vertices form the document summary. The following types of graph models are primarily used for representing a text document.
2.1.1 Weighted Graph Model
In34 suggested a sentence ranking and clustering-based text summarization method which extracts key sentences from the text. Initially, this process clusters all the sentences in the document using the SNMF (Sparse Non-negative Matrix Factorisation) technique. Then, a weighted undirected graph is built which uses the sentence similarities along with the discourse relationships between the sentences as the weights of the edges. Thereafter, a graph-based ranking algorithm is applied for calculating the sentence scores, and the higher-ranked sentences of every cluster are chosen as the document summary. Equation 1 is used for calculating the rank of every vertex in the graph:

$Rank(V_i) = \frac{d}{n} + (1-d)\sum_{V_j \in adj(V_i)} \tilde{w}_{ji}\, Rank(V_j)$    (1)

where $V_i$ and $V_j$ are vertices of the graph, $d$ is a parameter between 0 and 1, $n$ is the number of sentences, and $\tilde{w}_{ji}$ is the normalised weight passed from $V_j$ to $V_i$.
An algorithm was developed by35 for generating an extractive document summary for Vietnamese documents. The algorithm constructs undirected graphs using sentences from the texts as vertices, while the edges represent the sentence similarities. For two sentences $S_i$ and $S_j$, the sentence similarity between them can be described by:

$sim(S_i, S_j) = \frac{\sum_{w \in S_i \cap S_j} tf_{w,i}\, tf_{w,j}}{\sqrt{\sum_{w \in S_i} tf_{w,i}^2}\, \sqrt{\sum_{w \in S_j} tf_{w,j}^2}}$    (2)

where $S_i$ and $S_j$ are two sentences and $tf_{w,i}$ is the frequency of the term $w$ in the corresponding sentence of the document. After applying the PageRank algorithm, the sentences are ranked by their salience scores and are chosen depending on their maximal marginal relevance36 for generating summaries35.
In the graph-based algorithm suggested by33 for the text summarization process, the sentences form the graph nodes, and sentences with a similarity between them possess an edge. After the graph is built, a summary is generated by using the shortest path starting from the first sentence of the document and ending with the last sentence. This results in a smooth but short sentence set between the two points.
2.1.2 Heterogeneous Graph Model
The existing methods do not consider the relations between the various granularities (words, sentences, and topics). A text document summarization technique based on heterogeneous graphs has been suggested by37. This method constructs a graph depicting the relationships between the words, sentences and topics and uses a ranking algorithm for calculating the node scores. The sentences with the highest scores are selected as the summary.
For estimating the similarity between two topics, a cosine measure is used:

$sim(t_i, t_j) = \frac{\vec{t_i} \cdot \vec{t_j}}{\|\vec{t_i}\|\,\|\vec{t_j}\|}$    (3)

where $\vec{t_i}$ and $\vec{t_j}$ are the corresponding term vectors for the topics $t_i$ and $t_j$.
For a particular sentence and the word collection of a text, the link similarity refers to the mean similarity between a word and the words present in the sentence, and can be estimated as follows37:

$sim(s_i, w_j) = \frac{1}{|s_i|} \sum_{w_k \in s_i} sim(w_j, w_k)$    (4)

where $sim(w_j, w_k)$ is the similarity present between the words $w_j$ and $w_k$, and $w_k$ is a word present in the sentence $s_i$.
For the word collection and the topic collection of a text, the link similarity refers to the mean similarity between a word and the words of a topic, and can be estimated as follows37:

$sim(w_i, t_j) = \frac{1}{|t_j|} \sum_{w_k \in t_j} sim(w_i, w_k)$    (5)

where $w_k$ is a word present in the topic $t_j$ and $w_i$ is a word present in the whole collection of topics.
For a sentence and the topic collection of a text document, the link similarity refers to the cosine similarity between the sentence and a topic present in the text, and is calculated as follows:

$sim(s_i, t_j) = \frac{\vec{s_i} \cdot \vec{t_j}}{\|\vec{s_i}\|\,\|\vec{t_j}\|}$    (6)

where $\vec{t_j}$ is the term vector of the topic $t_j$ and $\vec{s_i}$ is the term vector of the sentence $s_i$. This method has been compared to a baseline method; because it uses the relations present between the different granularities (word, sentence and topic), it demonstrates better results than the baseline.
2.1.3 Correlation Graph Model
In38 suggested GRAPHSUM, a graph-based general-purpose summariser. This method finds association rules to represent correlations among multiple terms, which were neglected by earlier approaches. The graph nodes, each representing two or more terms, are first ranked by the PageRank strategy; the highest-ranked nodes are then used for the sentence selection process.
2.1.4 Semantic Graph Model
In39 suggested a new graph-based model for document processing applications. The model depends on four dimensions for graph creation: semantic similarity, cosine similarity, co-reference and discourse information. The results showed that co-reference resolution improved the performance in all the cases studied, whether with syntactic similarity alone or after adding semantic similarity39. The results also showed that combining semantic similarity, cosine similarity, co-reference and discourse information produced the best results.
In40 developed the SentenceRank algorithm, which uses semantic and statistical analysis of sentences to estimate their importance when summarising a text. The algorithm builds semantic graphs in which the nodes are sentences and the edges represent the semantic relations between sentences, estimated using WordNet. The nodes are ranked by the ranking algorithm, and the highest-ranked sentences are collected for the summary.
2.1.5 Query Specific Graph Model
In41 developed a query-based similarity measure over the existing graph models for estimating the sentence-to-sentence edge weights in multi-document summarization. The technique differentiates the intra-document and inter-document sentence relations in the multi-document summarization process. The importance of a sentence $s_i$ with regard to a query $q$ is calculated as follows:

$Imp(s_i \mid q) = \sum_{s_j \in D} sim(s_i, s_j)\, sim(s_j, q)$    (7)

$sim(s, q) = \frac{\vec{s} \cdot \vec{q}}{\|\vec{s}\|\,\|\vec{q}\|}$    (8)

where $s_i$ and $s_j$ refer to two sentences, $D$ refers to a document, $\vec{s}$ represents the sentence vector, and $\vec{q}$ refers to the query vector.
The sentence $s_i$ was then ranked using Equation 9:

$Rank(s_i) = \frac{1-d}{n} + d \sum_{s_j \in N(s_i)} \frac{sim(s_i, s_j \mid q)}{\sum_{s_k \in N(s_j)} sim(s_j, s_k \mid q)}\, Rank(s_j)$    (9)

where $N(s_i)$ is the set of neighbouring sentence vertices of $s_i$, $d$ is the PageRank damping factor, and $Rank(s_i)$ is the ranking of the sentence $s_i$.
2.2 Triangle Counting Approach
Numerous studies have been conducted to discover or implement new algorithms for counting triangles in graph data sets. In42 introduced an algorithm known as the node-iterator, which has an execution time of $O(n \cdot d_{max}^2) \subset O(n^3)$. The algorithm listing-ayz is the listing version of the most efficient counting algorithm43. It has a running time of $O(m^{3/2})$ and uses the idea of cores: it takes a node of minimum degree, calculates its triangles in the same manner as the node-iterator, and then removes the node from the graph. The execution time is $O(n \cdot c_{max}^2)$, where $c(v)$ is the core number of node $v$. Since the node-iterator-core is an enhancement over listing-ayz, the execution time of the node-iterator-core is also $O(m^{3/2})$.
In44 proposed a new, highly reliable, fast and parallelizable algorithm to count triangles. Parallelizability is vital because it provides an opportunity to mine large graphs using parallel architectures such as map/reduce ('Hadoop')45. The proposed method rests on the following two theorems:
Theorem 1: (EigenTriangle) The total number of triangles in the graph is proportional to the sum of the cubes of the eigenvalues of its adjacency matrix:

$\Delta(G) = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3$    (10)
Theorem 2: (EigenTriangle Local) The number of triangles $\Delta_i$ in which node $i$ participates can be counted from the cubes of the eigenvalues of the adjacency matrix and the corresponding eigenvector entries:

$\Delta_i = \frac{\sum_j \lambda_j^3\, u_{i,j}^2}{2}$    (11)

where $u_{i,j}$ is the $j$-th entry of the $i$-th eigenvector.
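To make the two theorems concrete, the following Python sketch computes both the total and the per-node triangle counts from the eigendecomposition of the adjacency matrix. Restricting to the top-k eigenpairs, an optional choice assumed here, is what makes the EIGENTRIANGLE algorithms fast on real graphs.

```python
import numpy as np

def eigen_triangle_counts(A, k=None):
    """Triangle counts via Theorems 1 and 2. A is a symmetric 0/1
    adjacency matrix; k=None uses the full spectrum (exact counts),
    while a small k gives the fast low-rank approximation."""
    vals, vecs = np.linalg.eigh(np.asarray(A, dtype=float))
    if k is not None:                        # keep the k largest-magnitude eigenvalues
        idx = np.argsort(-np.abs(vals))[:k]
        vals, vecs = vals[idx], vecs[:, idx]
    total = np.sum(vals ** 3) / 6.0          # Eq. (10): total triangles in G
    local = (vecs ** 2) @ (vals ** 3) / 2.0  # Eq. (11): triangles per node
    return total, local
```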
In46 conducted research on triangle counting algorithms and proposed a collection of methods based on random sampling for approximating, with high accuracy, the number of occurrences of 3-node and 4-node minors in directed and undirected graphs. The algorithms were tested on various networks from 10 different fields. Based on the recurrence rates of all the minors, they also proposed an effective network clustering algorithm.
In44 conducted research on triangle counting algorithms and recommended a new approach. They devised a method for the sparsification of a graph by converting it into an alternative weighted graph with a smaller number of edges; the triangles are then counted with the EIGENTRIANGLE method. Their method combines the concepts of EIGENTRIANGLE with the Achlioptas-McSherry algorithm.
In47 examined a new sampling algorithm for counting triangles. They executed the method on large networks and demonstrated speed-ups of up to 70,000 times in counting triangles; the algorithm is accurate when the density of triangles is mild. In48 presented a highly parallel development and a new randomized algorithm for estimating the number of triangles in an undirected graph. The algorithm uses the popular Monte-Carlo simulation to count the number of triangles. Each sample needs $O(|E|)$ time, and

$O\!\left(\varepsilon^{-2} \log(1/\delta)\, \rho(G)^2\right)$    (12)

samples are needed to ensure an $(\varepsilon, \delta)$ approximation, where $\rho(G)$ is a measure of the triangle sparsity of $G$; $\rho(G)$ is not necessarily small. The algorithm needs only $O(|V|)$ space to work efficiently.
The authors provided experiments showing that, in practice, usually only $O(\log^2 |V|)$ samples are needed for an accurate estimation on real graphs. This algorithm is more efficient than other relevant state-of-the-art algorithms for counting triangles, especially in speed and accuracy. Unfortunately, the algorithm is parallelizable only when the critical path of $O(|E|)$ is achievable on as few as $O(\log^2 |V|)$ processors.
The EigenTriangle and EigenTriangleLocal algorithms were proposed in49 to calculate the total number of triangles in an undirected graph and the number of triangles that every node participates in. These algorithms rely on certain spectral properties of the input graphs, which the authors confirmed hold in practice by conducting 160 experiments on diverse kinds of real networks. They observed important speedups, between 34x and 1,075x faster, with 95% accuracy compared to a straightforward counting algorithm.
Based on the work in50, the authors proposed an approach that can reduce a graph by switching it to an alternative graph with a smaller number of edges and nodes. The major aim of that study was to utilize the triangle counting approach for graph-based association rule mining (ARM): a triangle counting technique is suggested to prune the graph in the search for frequent item sets, and the triangle counting is incorporated with one of the graph-based ARM methods. It consists of four important steps: data representation, triangle production, bit vector representation, and triangle combination with the graph-based ARM technique. The performance of the suggested technique was compared with the main graph-based ARM; experimental outcomes show that the suggested technique shortens the execution time of rule production and creates fewer rules with higher confidence.
3. Overview of Approach
In this section, we describe the steps we undertook to generate the summary based on the graph-based method. The main steps are as follows:
1. Pre-processing
2. Graph construction
3. Centrality calculation
4. Sentence ranking
5. Summary generation
6. Evaluation and result
3.1 Proposed Graph-based Approach for
Single-Document Extractive Summarization
In this method, a graph-based representation is introduced as a new technique for extractive summary generation. This method has four main steps:
1. Data Pre-processing
a. Text segmentation
b. Stop words removal
c. Stemming
2. Text Graph-based Representation
3. Sub-graph construction
4. Summary generation
Figure 1. The framework of our proposed graph-based summarization approach: document pre-processing (text segmentation, stop-word removal, stemming), sentence similarity based on the hybrid model, text-graph representation, feature selection and scoring, centrality calculation, sub-graph construction, sentence ranking, summary generation and evaluation.
3.2 Data Pre-processing
The initial step in text summarisation is data pre-processing. In our study, this step comprises three sub-steps: text segmentation, stop-word removal and word stemming. Text segmentation divides the text document into sentences. We used stop-word removal to discard meaningless words, and a stemming algorithm to delete the affixes (prefixes or suffixes) of each word to produce its root. In this step, we extract the important words present in the document and disregard the rest, since such words could severely distort the similarity computed between sentences.
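For illustration, a minimal Python sketch of this pipeline follows, assuming NLTK's Punkt sentence tokenizer, its English stop-word list and the Porter stemmer; the specific tools are our assumptions, as the paper does not name the implementations it used.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(document):
    """Return the raw sentences and, per sentence, the stemmed
    non-stop-word tokens. Requires the NLTK 'punkt' and 'stopwords'
    data packages to be downloaded."""
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    sentences = nltk.sent_tokenize(document)             # 1. text segmentation
    processed = []
    for sent in sentences:
        tokens = [t.lower() for t in nltk.word_tokenize(sent) if t.isalnum()]
        tokens = [t for t in tokens if t not in stop_words]  # 2. stop-word removal
        processed.append([stemmer.stem(t) for t in tokens])  # 3. stemming
    return sentences, processed
```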
3.3 Feature Extraction
The textual document is symbolised by the set D = (S1, S2, ..., Sk), wherein Si indicates a sentence found in document D. Feature extraction is then applied to the textual contents, and useful sentence and word structures are determined. Each document features structures such as title words, sentence lengths, sentence positions, numerical data, term weights, sentence similarities, and thematic-word as well as proper-noun instances.
1. Title words: Higher scores are assigned to sentences which include words from the title, since the content's meaning is conveyed in the title words. This is determined in the following manner:

$Score_{title}(S) = \frac{|S \cap T|}{|T|}$    (13)

where $T$ is the set of title words.
2. Sentence lengths: Sentences which are overly short, such as date or author lines, are removed. Each sentence's normalised length is evaluated as:

$Score_{length}(S) = \frac{length(S)}{\max_k length(S_k)}$    (14)
3. Sentence positions: Higher scores are assigned to sentences which occur earlier in their paragraphs. For every paragraph with $n$ sentences, the score of the $i$-th sentence is evaluated as:

$Score_{position}(S_i) = \frac{n - i + 1}{n}$    (15)
4. Numerical data: Sentences featuring numerical terms, which typically convey major statistical figures in the text, are favoured for summarisation. Each sentence's numerical score is evaluated as:

$Score_{num}(S) = \frac{|\{numerical\ terms\ in\ S\}|}{length(S)}$    (16)
5. Thematic words: The number of thematic words (domain-specific terms exhibiting maximum possible relevance) found in a sentence, divided by the maximum number found in any sentence:

$Score_{thematic}(S) = \frac{|\{thematic\ words\ in\ S\}|}{\max_k |\{thematic\ words\ in\ S_k\}|}$    (17)
6. Sentence-to-sentence similarities: Token-matching techniques are used to calculate the similarity of every sentence S with all the others. A matrix [N][N] is set up, where N is the total number of sentences, and the diagonal elements are set to zero since sentences are not to be evaluated against themselves. Each sentence's similarity score is evaluated as:

$Score_{s2s}(S_i) = \frac{\sum_{j \ne i} sim(S_i, S_j)}{\max_k \sum_{j \ne k} sim(S_k, S_j)}$    (18)
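A minimal sketch of the first three features follows, under the equation reconstructions (13)-(15) given above; the exact formulas are our reading of the prose.

```python
def title_score(sentence_tokens, title_tokens):
    """Eq. (13): fraction of title words appearing in the sentence."""
    title = set(title_tokens)
    return len(title & set(sentence_tokens)) / len(title) if title else 0.0

def length_score(sentence_tokens, all_sentence_tokens):
    """Eq. (14): sentence length normalised by the longest sentence."""
    return len(sentence_tokens) / max(len(s) for s in all_sentence_tokens)

def position_score(i, n):
    """Eq. (15): the i-th of n sentences in a paragraph (1-based);
    earlier sentences score higher."""
    return (n - i + 1) / n
```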
3.4 Text Graph-based Representation
Numerous techniques used for text representation are based on the Bag of Words, also known as the Vector Space Model (VSM)51. In the graph representation proposed by us, the vertices refer to the sentences, whereas the edges are defined based on the similarity between the different sentence pairs; a graph is then constructed depending on the similarity value between two sentences. Earlier work on extractive summarization using graph-based techniques used the cosine function alone for calculating the required similarity. Our hybrid function instead combines four distinct similarity measures: cosine similarity (sim1), Jaccard similarity (sim2), optimal local alignment-based similarity (sim3) and the sliding window-based similarity measure (sim4). We estimate the final similarity score in the following manner:

$H(S_i, S_j) = \frac{sim_1 + sim_2 + sim_3 + sim_4}{4}$    (19)
The above-mentioned four similarity measures are
explained below:
3.4.1 Hybrid Similarity Function
This sub-section discusses the similarity measures used for the hybrid function.
• Cosine Function:
In this method, a simple heuristic feature, such as the overall word frequency in a document or in certain phrases that indicates sentence importance, has been used52. The relevance of the words present in a sentence can be easily measured by estimating the tf x idf value, wherein tf refers to the term frequency and idf to the inverse document frequency. The weight of a word $w$ can be calculated in the following manner:

$weight(w) = tf_w \times idf_w$    (20)

where $tf_w$ refers to the number of occurrences of the specific word $w$ in the text document. The Inverse Document Frequency (IDF) can be estimated as follows:

$idf_w = \log \frac{N}{n_w}$    (21)

where $N$ is the total number of sentences present in the text and $n_w$ is the overall number of sentences having the word $w$.

To explain further, words occurring very frequently in every sentence (like the articles "a", "the", etc.) would have their idf values near 0, while rare words (like proper nouns or medical terms) would have larger idf values. Thereafter, the tf x idf values for every word in a sentence are estimated and the unique words present in every folder are extracted. Then, we build a vector for every sentence present in the text, where the vector size depends on the number of unique words present in the folder. Assuming that $w_1, w_2, \ldots, w_n$ are the unique words present in a specific folder, the vectors for the sentences S1 and S2 would be:

$V_1 = (weight_{1,1}, \ldots, weight_{1,n}), \quad V_2 = (weight_{2,1}, \ldots, weight_{2,n})$    (22)

The similarity present between S1 and S2 can be estimated by applying the cosine formula as follows:

$sim_1(S_1, S_2) = \frac{V_1 \cdot V_2}{\|V_1\|\,\|V_2\|}$    (23)

The $sim_1$ value always lies between 0 and 1. Hence, sentence pairs with a high $sim_1$ value are more similar than sentence pairs having a smaller value.
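A self-contained sketch of Equations (20)-(23) follows, computing the cosine similarity over tf x idf sentence vectors; sparse dictionaries stand in for the explicit vectors of Eq. (22).

```python
import math
from collections import Counter

def cosine_sim(s1_tokens, s2_tokens, idf):
    """Eq. (20)-(23): cosine similarity of tf x idf sentence vectors.
    `idf` maps each word w to log(N / n_w), computed over the folder."""
    v1 = {w: tf * idf.get(w, 0.0) for w, tf in Counter(s1_tokens).items()}
    v2 = {w: tf * idf.get(w, 0.0) for w, tf in Counter(s2_tokens).items()}
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```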
• Jaccard Similarity
For the two strings involved, the similarity ratio is determined by dividing the number of intersecting (common) terms by the number of unique terms present in the two strings53,54. Mathematically, the similarity ratio is estimated as follows:

$sim_2(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$    (24)

Two boundary cases exist, wherein the similarity score between the two strings is equal to 1 or to 0:

$sim_2(S_1, S_2) = \begin{cases} 1 & \text{if the two strings are identical} \\ 0 & \text{if they share no terms} \end{cases}$    (25)
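The corresponding sketch for Equation (24) treats each sentence as a set of tokens:

```python
def jaccard_sim(s1_tokens, s2_tokens):
    """Eq. (24): shared terms over the union of unique terms."""
    a, b = set(s1_tokens), set(s2_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0
```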
• Optimal Local Alignment Algorithm
Optimal local alignment is a dynamic programming technique that determines the best local alignment present within two sequences55,56,57. In this study, the two sequences are the two sentences to be compared. The main aim of the local alignment similarity measure is to capture the phrase-level similarity present within the two sentences; the possible alignments are compared using a scoring model. Assume that $H(i, j)$ is the score of the optimal local alignment ending at $x_i$ and $y_j$ within the sentences $X$ and $Y$; then, after proper initialisations, $H(i, j)$ can be estimated in the following manner:

$H(i,j) = \max\{\,0,\; H(i-1,j-1) + s(x_i, y_j),\; H(i-1,j) - g,\; H(i,j-1) - g\,\}$    (26)

where $s(x_i, y_j)$ is the match/mismatch score and $g$ is the gap penalty. Based on Eq. 26, we obtain the score matrix for the two sentences. Thereafter, we determine the highest score in the matrix and normalise the value by applying the following formula:

$sim_3(X, Y) = \frac{\max_{i,j} H(i,j)}{\min(L_1, L_2)}$    (27)

where $L_1$ and $L_2$ are the lengths of the sentences $X$ and $Y$, respectively.
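A word-level Smith-Waterman sketch of Equations (26)-(27) follows; the match, mismatch and gap scores are assumptions, as the paper does not state its scoring model.

```python
def local_alignment_sim(s1_tokens, s2_tokens, match=1.0, mismatch=-1.0, gap=-1.0):
    """Eq. (26)-(27): word-level Smith-Waterman local alignment, with the
    best cell normalised by the shorter sentence length."""
    m, n = len(s1_tokens), len(s2_tokens)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1_tokens[i - 1] == s2_tokens[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    shorter = min(m, n)
    return best / shorter if shorter else 0.0
```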
• Windowing Algorithm
The similarity between two sentences present in the text can also be determined by windowing, which is a dynamic programming technique58-66. This method chooses a window of a specific size (neither very small nor very large). Assume two sentences $X$ and $Y$, wherein $X$ is the source sentence and $Y$ is the target sentence.

Initially, a window of the defined size is set up over each of the two sentences. Thereafter, we determine whether each word $x_i$ in the source window matches any word $y_j$ in the target window; the total number of these matches is determined and stored. A score is then computed for the considered window pair:

$score = \frac{2 \cdot matches}{W_s + W_t}$    (28)

where $W_s$ and $W_t$ are the source and target window sizes. Then, the window over the sentence $Y$ is moved one word to the right, the score is estimated again, and the process continues. Thereafter, the average of the estimated scores is determined:

$avgScore_i = \frac{1}{n_t} \sum_{j=1}^{n_t} score_{i,j}$    (29)

where $n_t$ is the total number of window segments present in the target sentence. Then, the window over the sentence $X$ is moved one word to the right, the $Y$ window is reset to its initial position, and the scores are calculated again; the above procedure is repeated for calculating each $avgScore_i$. After this process, a final average score is determined using all the $avgScore_i$ values:

$sim_4(X, Y) = \frac{1}{n_s} \sum_{i=1}^{n_s} avgScore_i$    (30)

where $n_s$ is the number of window segments present in the source sentence. The final score estimated in this process is already normalized. A window size of five has been used in all our experiments.
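The following sketch mirrors Equations (28)-(30) with the paper's window size of five; the exact per-window scoring form is our reconstruction of Eq. (28).

```python
def window_sim(src_tokens, tgt_tokens, w=5):
    """Eq. (28)-(30): sliding-window similarity with window size 5."""
    def windows(tokens):
        return [tokens[i:i + w] for i in range(max(1, len(tokens) - w + 1))]
    avg_scores = []
    for sw in windows(src_tokens):                       # move the source window
        scores = []
        for tw in windows(tgt_tokens):                   # slide the target window
            matches = sum(1 for x in sw if x in set(tw))
            scores.append(2.0 * matches / (len(sw) + len(tw) or 1))  # Eq. (28)
        avg_scores.append(sum(scores) / len(scores))                 # Eq. (29)
    return sum(avg_scores) / len(avg_scores)                         # Eq. (30)
```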
• Combining the Four Similarity Measures
Having explained all the functions required for the hybrid function, we can now describe it. The hybrid function takes the average of the four values calculated by the functions described above; in this manner, all the functions are assigned equal weights. The measured value ranges between 0 and 1, as the individual values are normalized before calculating the mean. The resulting similarity measure is then used for building a graph that records the similarity between the sentence pairs. Therefore,

$H(S_i, S_j) = \frac{sim_1(S_i, S_j) + sim_2(S_i, S_j) + sim_3(S_i, S_j) + sim_4(S_i, S_j)}{4}$    (31)
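Given the four sketches above, the hybrid function of Equation (31) is a plain unweighted mean:

```python
def hybrid_sim(s1_tokens, s2_tokens, idf):
    """Eq. (31): unweighted mean of the four normalised similarity
    measures sketched above (each already lies in [0, 1])."""
    return (cosine_sim(s1_tokens, s2_tokens, idf)
            + jaccard_sim(s1_tokens, s2_tokens)
            + local_alignment_sim(s1_tokens, s2_tokens)
            + window_sim(s1_tokens, s2_tokens)) / 4.0
```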
3.5 Graph Representation
Based on the similarity measures, we build the pictorial representation of the graph. All the hybrid similarity values are non-zero; it is also assumed that every sentence is similar to itself, hence H(Si, Si) = 1.

Based on the similarity values, we can build the graph for the text document. Following67, we set a scoring threshold (β) for sentence similarity at 0.5: if the similarity value is less than 0.5, there is no relation between the sentences, and hence no edge between their nodes.
3.6 Centrality Calculation
As can be observed in the similarity matrix, all the values are non-zero. This is explained by the fact that the sentences have been selected from the same text and are somewhat similar to one another. Since we are primarily interested in the relevant similarities, we remove the low values by applying the threshold. Also bear in mind that self-links are present for all graph nodes, as each sentence is trivially similar to itself; however, we have eliminated them for the sake of readability.

Every sentence's centrality can be obtained by counting the sentences similar to it; hence, sentence centrality is the degree of the corresponding node in the complete graph. As noted, the choice of the hybrid threshold greatly affects the centrality interpretation. As the centrality value is the number of edges incident to the node, it can range between one and the maximal number of sentences present in the text. Therefore, this value must be normalized to lie between 0 and 1, as follows:

$c = \frac{deg - \min}{\max - \min}$    (32)

where $c$ is the normalised sentence centrality, $deg$ is the degree of a sentence (i.e., of its node in the graph), $\min$ is the minimal degree value present in the complete folder, and $\max$ is the maximal degree value present in the complete folder.
3.7 Sentence Scoring
Many important features are available for generating summaries in the extractive text summarization process. The summary quality depends on the selection of relevant sentences according to the feature scores, so feature efficacy matters: we must differentiate between features of high and low relevance. In our study, we have strived to determine the influence of the features on the generated summary. In our work we use combinations of the six features; a combination may contain two, three, four, five or six features, giving 63 possible combinations. Every sentence is assigned a weight depending on the selected features, and the weights are normalised to range between 0 and 1. Finally, a sentence score is determined in the following manner:

$Score(S) = c \times F$    (33)

where $c$ is the normalised sentence centrality value and $F$ is the selected-features value. The selected-features value is calculated using the following formula:

$F = \frac{\sum_{i=1}^{n_f} w_i f_i}{n_f}$    (34)

where $n_f$ is the total number of selected features, $f_i$ is the value of the $i$-th selected feature, and $w_i$ are the weights provided to every measure. In this experiment, we have assumed $w_1 = w_2 = \ldots = 1$; however, the weights can range between 0 and 1 if the data is changed, which affects the resultant values, which would then also range between 0 and 1.
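A sketch of Equations (32)-(34) under the reconstruction above, with unit weights as in the experiments:

```python
def normalised_centrality(deg, min_deg, max_deg):
    """Eq. (32): min-max normalisation of a node's degree."""
    return (deg - min_deg) / (max_deg - min_deg) if max_deg > min_deg else 0.0

def sentence_score(centrality, feature_values, weights=None):
    """Eq. (33)-(34): centrality multiplied by the weighted mean of the
    selected feature values; the paper sets all weights to 1."""
    weights = weights or [1.0] * len(feature_values)
    f = sum(w * v for w, v in zip(weights, feature_values)) / len(feature_values)
    return centrality * f
```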
3.8 Sub Graph Construction
The next step is to construct the triangle sub-graph. Triangles build on the observation that friends of friends tend to be friends. First, we create an adjacency matrix; the following is the algorithm for creating it.
Algorithm 1. Adjacency Matrix Construction
Input: Graph data set with N nodes (sentences) and E edges (relationships between sentences)
Output: N*N adjacency matrix showing the connections between nodes
Start
  Determine the size of the matrix, which is N*N (N is the number of nodes in the graph)
  Create the matrix A
  For each pair of nodes vi, vj ∈ N {
    If there is an edge from vi to vj {
      A(i, j) = 1   ----- sim(Si, Sj) meets the similarity threshold
    } Else {
      A(i, j) = 0   ----- no similar words between the sentences
    }
  }
Stop
Next, we build the list of triangles representing the text. To find the triangles in the graph, the De Morgan's laws algorithm is used, as follows.

Algorithm 2. De Morgan's Laws
Input: N*N adjacency matrix, A(I, J)
Output: Array of triangles
Start
  Triangles_Array = []
  For each edge XY in the matrix A(I, J), find all edges starting with Y {
    XY ∧ YZ → XZ
    If XZ ∈ A(I, J) {
      Add the triangle of edges (X, Y, Z) to Triangles_Array[]
    }
  }
Stop
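In Python, the closure rule of Algorithm 2 amounts to checking, for every edge pair (x, y) and (y, z), whether the closing edge (x, z) exists; enumerating with x < y < z avoids listing the same triangle more than once.

```python
def find_triangles(A):
    """List all triangles in an N*N 0/1 adjacency matrix (Algorithm 2)."""
    n = len(A)
    triangles = []
    for x in range(n):
        for y in range(x + 1, n):
            if not A[x][y]:
                continue
            for z in range(y + 1, n):
                if A[y][z] and A[x][z]:     # XY ∧ YZ and the closing edge XZ
                    triangles.append((x, y, z))
    return triangles
```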
3.9 Summary Generation
After the sentence scores are obtained, each sentence in the document is assigned its sentence score value. Only sentences belonging to the triangle sub-graph structure are considered, as these are associated with no fewer than two others. The sentences are sorted in descending order of score, and those with high scores are extracted for the document summary according to the compression ratio. An extraction or compression rate close to 20% of the core textual content has been demonstrated to be as informative of the contents as the document's complete text68. In the final stage, the summarizing sentences are arranged in the order of their occurrence in the initial text.
The selection process is based on the Maximal Marginal Relevance (MMR) concept69. The ranked_sen table contains all the sentences of the folder ordered by rank, with the rank-1 sentence first and the rank-n sentence last. Initially, the rank-1 sentence in the ranked_sen table is added to the summary. Thereafter, every sentence in the table is compared with the sentence most recently added to the summary table. The sentences are compared using the hybrid similarity value against the threshold: if the two sentences have a similarity value higher than the threshold, the candidate sentence is eliminated, as the summary already covers similar content; if the value is lower than the threshold, it is included in the summary. This enables the inclusion of all the main ideas described in the text, so that any person reading the summary would understand the concepts conveyed by the document.
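The selection loop can be sketched as follows; the 0.5 threshold and the 20% compression rate come from the paper, while the data layout (rank-ordered (position, sentence) pairs) is our assumption.

```python
def generate_summary(ranked, hybrid, threshold=0.5, ratio=0.2):
    """MMR-style selection: `ranked` holds (position, sentence) pairs in
    descending score order. A candidate is kept only while its hybrid
    similarity to the last kept sentence stays below the threshold."""
    target = max(1, round(len(ranked) * ratio))
    summary = [ranked[0]]                   # rank-1 sentence is always included
    for pos, sent in ranked[1:]:
        if len(summary) >= target:
            break
        if hybrid(sent, summary[-1][1]) < threshold:
            summary.append((pos, sent))
    summary.sort()                          # restore order of occurrence
    return [s for _, s in summary]
```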
4. Experimental Results
The proposed approach is evaluated on the single-document extractive summarization task, using 103 articles/data sets provided by the Document Understanding Conference 2002 (DUC 2002). For each data set, our approach generates a summary at a 20% compression rate. To assess the performance of our proposed approach, we compared its results with five benchmark summarizers: the Microsoft Word 2007 summarizer, the Copernic summarizer, the best automatic summarization system in DUC 2002, the worst automatic system in DUC 2002, and the average of the human model summaries (Models).

ROUGE measures generate recall, precision and F-measure scores. Recall is the proportion of words in the reference (human) summary that are also present in the system summary, whereas precision is the proportion of words in the system summary that are also present in the reference (human) summary. For comparative evaluation, Tables 1 and 2 show the mean coverage score (recall), average precision and average F-measure obtained on the DUC 2002 dataset for the proposed approach using ROUGE-1. All these values are computed at a 95% confidence interval.
Our study focused on finding the effect of the features and the hybrid similarity measurement in creating the triangle sub-graph used to produce the summary. From the obtained results, it is clear that the proposed model using the triangle sub-graph performs best when using combined features. For a single feature, the best value is obtained with the Sentence-to-Sentence (S2S) feature, with an F-measure of 0.49734 against the human-generated summaries. For combined features, the results show that combining all six features does better still, with an F-measure of 0.50102 against the human-generated summaries. It is also very clear from the results that combined features outperform any single feature.
Table 1. Comparison of single extractive document summarization using ROUGE-1 results at the 95% confidence interval for the hybrid model using a single feature

Method            Precision   Recall    F-Measure
H2:H1             0.51656     0.51642   0.51627
MS-Word           0.47705     0.40325   0.42888
Copernic          0.46144     0.41969   0.43611
Best-System       0.50244     0.40259   0.43642
Worst-System      0.06705     0.68331   0.1209
Triangles Hybrid  0.49642     0.49852   0.49734
Table 2. Comparison of single extractive document summarization using ROUGE-1 results at the 95% confidence interval for the hybrid model using combined features

Method            Precision   Recall    F-Measure
H2:H1             0.51656     0.51642   0.51627
MS-Word           0.47705     0.40325   0.42888
Copernic          0.46144     0.41969   0.43611
Best-System       0.50244     0.40259   0.43642
Worst-System      0.06705     0.68331   0.1209
Triangles Hybrid  0.49983     0.50249   0.50102
5. Discussion
The experimental results of the proposed method, based on the hybrid-model graph with specific selected features, show that the resulting summaries can be better than other summaries. DUC 2002 was used as the source of the news article collection taken as input in our experiments. Three evaluation metrics (mean coverage score (recall), average precision and average F-measure) are employed for the comparative evaluation of the proposed approach and the other summarization systems.

In this approach, we used six different features for each sentence, and a hybrid function consisting of four different similarity measurements to find the relations between the sentences (graph nodes) and represent the graph; we then pruned the graph by finding the triangle sub-graph and used the sentences forming this sub-graph to build the summary.

The scoring of the sentences was done based on the values of the selected features (either a single feature or combined features; there are 63 possible feature combinations) combined with the centrality value of each sentence in the sub-graphs.

It can be observed from the results in Tables 1 and 2 that, on mean coverage score and average F-measure, using the triangle sub-graph yields better summarization results in both cases, single and multi-feature.

Based on the experimental results of the proposed method, we can say that by identifying significant features for text summarization and by using the hybrid similarity measurement, a good summary can be produced. In addition, the triangle sub-graph proved the best representation for pruning the graph to create a summary close to the ideal human summary.
6. Summary
In summary, this paper has presented the use of a hybrid similarity model for creating graphs used to find the summary of single documents. The model was trained and tested using a collection of one hundred and three documents gathered from the DUC 2002 dataset. We used a hybrid function consisting of four different similarity measurements to find the relations between sentences and create the graph. The method takes into account several features: title words, sentence lengths, sentence positions, numerical data, thematic words and sentence-to-sentence similarities.

The effect of each of these sentence features on the summarization task was explored; the features were then used in combination to create text summaries. The results of the proposed summarizer were compared with different summarizers, such as the Microsoft Word 2007 summarizer, the Copernic summarizer, the best system and the worst system. The ROUGE toolkit was used to evaluate the system summaries at 95% confidence intervals, extracting the results as average recall, precision and F-measure. The F-measure was chosen as the selection criterion because it balances both recall and precision in the system's results.

The results show that the best average precision, recall and F-measure among the compared benchmark systems are produced by our proposed method. The experimental results based on the proposed model show that the triangle sub-graph is the best representation for creating a better summary, and that the hybrid model used to create the graph improves the quality of the summary.
7. Acknowledgement
This research is funded by the Faculty of Computing (FC),
Universiti Teknologi Malaysia (UTM) under the VOT no.
R.J130000.7828.4F719. The authors would like to thank
the Research Management Centre (RMC) of UTM for
their support and cooperation.
8. References
1. David FS. USA: Wiley publishing, Inc: Model driven architecture: applying MDA to enterprise computing. 2003.
2. Mani I. John Benjamins Publishing : Automatic summarization. 2001; 3.
3. Radev DR, Hovy E, McKeown K. Introduction to the special issue on summarization. Computational linguistics.
2002; 28(4):399-408. Crossref
4. Lin CY, Hovy E. Identifying topics by position. The
Proceedings of the fifth conference on Applied natural language processing. 1997; p. 283-90. Crossref
5. Mazdak N. Stockholm University: FarsiSum-a Persian text
summarizer, Master thesis, Department of Linguistics,
(PDF). 2004.
6. Langville AN, Meyer CD. Princeton University Press:
Google’s PageRank and beyond: The science of search
engine rankings. 2011.
7. Mani I, Maybury MT. MIT Press: Advances in automatic
text summarization. 1999.
8. Wan X. Using only cross-document relationships for both
generic and topic-focused multi-document summarizations. Information Retrieval. 2008; 11(1): 25-49. Crossref
9. Mihalcea R, Ceylan H. Explorations in Automatic Book
Summarization. Paper presented at the EMNLP-CoNLL.
2007; p. 380-89.
10. Yeh JY, Ke HR, Yang WP, Meng IH. Text summarization
using a trainable summarizer and latent semantic analysis.
Information processing & management. 2005; 41(1):75-95.
Crossref
11. Alguliev R, Bagirov A. Global optimization in the summarization of text documents. Automatic Control and
Computer Sciences. 2005; 39(6):42-47.
12. Alguliev RM, Aliguliyev RM. Effective summarization
method of text documents. Paper presented at the The
2005 IEEE/WIC/ACM International Conference on Web
Intelligence (WI’05). 2005; p. 264-71. Crossref
13. Aliguliyev RM. A novel partitioning-based clustering
method and generic document summarization. Paper
presented at the Proceedings of the 2006 IEEE/WIC/
ACM international conference on Web Intelligence and
Intelligent Agent Technology. 2006; p. 626-29. Crossref
14. Aliguliyev RM. Automatic document summarization by sentence extraction. Вычислительные технологии (Computational Technologies). 2007; 12(5).
15. Erkan G, Radev DR. LexPageRank: Prestige in MultiDocument Text Summarization. Paper presented at the
EMNLP. 2004; p. 365-71.
16. Erkan G, Radev DR. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial
Intelligence Research. 2004; 22: 457-79.
17. Radev DR, Jing H, Stys M, Tam D. Centroid-based summarization of multiple documents. Information Processing
& Management. 2004; 40(6):919-38. Crossref
18. Dunlavy DM, O’Leary DP, Conroy JM, Schlesinger JD.
QCS: A system for querying, clustering and summarizing
documents. Information Processing & Management. 2007;
43(6):1588-605. https://doi.org/10.1016/j.ipm.2007.01.003
19. Fisher S, Roark B. Query-focused summarization by
supervised sentence ranking and skewed word distributions. New York, USA: the Proceedings of the Document
Understanding Conference. DUC-2006. 2006.
20. Li J, Sun L, Kit C, Webster J. The Proc. of Document
Understanding Conference: A query-focused multi-document summarizer based on lexical chains. 2007.
21. Gong Y, Liu X. Generic text summarization using relevance
measure and latent semantic analysis. The Proceedings of
the 24th annual international ACM SIGIR conference on
Research and development in information retrieval. 2001,
p. 19-25. Crossref
22. Jones KS. Automatic summarising: The state of the art. Information Processing & Management. 2007; 43(6):1449-81. Crossref
23. Salton G, Buckley C. Improving retrieval performance
by relevance feedback. Readings in information retrieval.
1997; 24(5):355-63.
24. Gupta V, Lehal GS. A survey of text summarization extractive techniques. Journal of emerging technologies in web
intelligence. 2010; 2(3): 258-68. Crossref
25. Crovella ME, Bestavros A. IEEE/ACM Transactions on
networking: Self-similarity in World Wide Web traffic: evidence and possible causes. 1997; 5(6):835-46.
26. Kupiec JM, Schuetze H. Google Patents: System for genrespecific summarization of documents. 2004.
27. Mihalcea R. Graph-based ranking algorithms for sentence
extraction, applied to text summarization. The Proceedings
of the ACL 2004 on Interactive poster and demonstration
sessions. 2004; 20. Crossref
28. Patil K, Brazdil P. Text summarization: Using centrality in
the pathfinder network. International Journal Computional
Science Information System. 2007; 2:18-32.
29. Cover TM, Thomas JA. John Wiley & Sons: Elements of
information theory. 2012.
30. McKee T, McMorris F. Topics in Intersection Graph
Theory (SIAM Monographs on Discrete Mathematics
and Applications). Society for Industrial and Applied
Mathmatics. 1999. Crossref
31. Mihalcea R, Tarau P. TextRank: Bringing order into texts. 2004.
32. Sonawane S, Kulkarni P. Graph based Representation
and Analysis of Text Document: A Survey of Techniques.
International Journal of Computer Applications. 2014;
96(19):1. Crossref
33. Thakkar KS, Dharaskar RV, Chandak M. Graph-based
algorithms for text summarization. 2010 3rd International
Conference: Paper presented at the Emerging Trends in
Engineering and Technology (ICETET). 2010; p. 516-19.
34. Ge SS, Zhang Z, He H. Weighted graph model based sentence clustering and ranking for document summarization.
2011 4th International Conference on the Interaction
Sciences (ICIS). 2011; p. 90-95.
35. Hoang TAN, Nguyen HK, Tran QV. An efficient vietnamese
text summarization approach based on graph model. 2010
IEEE RIVF International Conference on Computing and
Communication Technologies, Research, Innovation, and
Vision for the Future (RIVF). 2010; p. 1-6.
36. Nomoto T, Matsumoto Y. A new approach to unsupervised text summarization. The Proceedings of the 24th
annual international ACM SIGIR conference on Research
and development in information retrieval. 2001; p. 26-34.
Crossref
37. Wei Y. Document summarization method based on heterogeneous graph. 2012 9th International Conference on
Fuzzy Systems and Knowledge Discovery (FSKD). 2012; p.
1285-89.
38. Baralis E, Cagliero L, Mahoto N, Fiori A. GRAPHSUM:
Discovering correlations among multiple terms for
graph-based summarization. Information Sciences. 2013;
249:96-109. Crossref
39. Ferreira R, Freitas F, Cabral SL, Lins RD, Lima R, França
G. A four dimension graph model for automatic text summarization. 2013 IEEE/WIC/ACM International Joint
Conferences on Web Intelligence (WI) and Intelligent
Agent Technologies (IAT). 2013; p. 389-96.
40. Ramesh A, Srinivasa K, Pramod N. SentenceRank - A graph
based approach to summarize text. 2014 Fifth International
Conference on Applications of Digital Information and
Web Technologies (ICADIWT). 2014; p. 177-82. Crossref
41. Wei F, He Y, Li W, Lu Q. A Query-Sensitive Graph-Based
Sentence Ranking Algorithm for Query-Oriented Multidocument Summarization. 2008 International Symposiums
on Information Processing (ISIP). 2008; p. 9-13.
42. Schank T, Wagner D. Finding, counting and listing all triangles in large graphs, an experimental study. International
Workshop on Experimental and Efficient Algorithms. 2005;
p. 606-09. Crossref
43. Alon N, Matias Y, Szegedy M. The space complexity of
approximating the frequency moments. The Proceedings of
the twenty-eighth annual ACM symposium on Theory of
computing. 1996; p. 20-29. Crossref
44. Tsourakakis CE. Fast counting of triangles in large real networks without counting: Algorithms and laws. The 2008
Eighth IEEE International Conference on Data Mining.
2008; p. 608-17.
45. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Communications of the ACM. 2008;
51(1):107-13. Crossref
46. Bordino I, Donato D, Gionis A, Leonardi S. Mining large
networks with subgraph counting. Paper presented at
the 2008 Eighth IEEE International Conference on Data
Mining. 2008; p. 737-42. Crossref
47. Tsourakakis CE, Kang U, Miller GL, Faloutsos C. Doulion:
counting triangles in massive graphs with a coin. The
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 2009;
837-46. Crossref
48. Avron H. Counting triangles in large graphs using randomized matrix trace estimation. The Workshop on Large-scale
Data Mining: Theory and Applications. 2010; p. 1-10.
49. Tsourakakis CE. MACH: Fast Randomized Tensor Decompositions. Paper presented at the SDM. 2010; p. 689-700.
50. Al-Khassawneh YAJ, Bakar AA, Zainudin S. Triangle
counting approach for graph-based association rules mining. 2012 9th International Conference on Fuzzy Systems
and Knowledge Discovery (FSKD). 2012; p. 661-65.
51. Schenker A, Kandel A, Bunke H, Last M. Graph-theoretic
techniques for web content mining World Scientific. 2005; 62.
52. Qian G, Sural S, Gu Y, Pramanik S.Similarity between
Euclidean and cosine angle distance for nearest neighbor
queries. The Proceedings of the 2004 ACM symposium on
Applied computing. 2004; p. 1232-37. Crossref
53. Hajeer I. Comparison on the Effectiveness of Different
Statistical Similarity Measures. International Journal of
Computer Applications. 2012; 53(8):1. Crossref
54. Manusnanth P, Arj-in S. Document clustering results on the
semantic web search. The Proceedings of the 5th National
Conference on Computing and Information Technology.
2009.
55. Attwood TK, Parry-Smith DJ. Introduction to bioinformatics: Prentice Hall. 2003.
56. Higgs PG, Attwood TK. Bioinformatics and molecular evolution: John Wiley & Sons.2013.
57. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of molecular biology. 1981;
147(1):195-97. Crossref
58. Chavez A, Davila H, Gutierrez Y, Fernandez-Orquín A,
Montoyo A, Mu-oz R. Umcc_dlsi_semsim: Multilingual
system for measuring semantic textual similarity. The
Proceedings of the 8th International Workshop on Semantic
Evaluation (SemEval 2014). 2014; p. 716-21.
59. Cremonesi P, Koren Y, Turrin R. Performance of recommender algorithms on top-n recommendation tasks.
The Proceedings of the fourth ACM conference on
Recommender systems. 2010; p. 39-46. Crossref
60. Datar M, Gionis A, Indyk P, Motwani R. Maintaining
stream statistics over sliding windows. SIAM journal on
computing. 2002; 31(6):1794-813. Crossref
61. Grefenstette G. Evaluation techniques for automatic semantic extraction: comparing syntactic and window based
approaches. Columbus, Ohio: The Proc. of the SIGLEX
Workshop on Acquisition of Lexical Knowledge from Text.
1993.
62. Koren Y. Factorization meets the neighborhood: A multifaceted collaborative filtering model. The Proceedings of the
14th ACM SIGKDD international conference on Knowledge
discovery and data mining. 2008; p. 426-34. Crossref
63. Quinlan JR. Edinburgh University Press: Discovering rules
by induction from large collections of examples: Expert systems in the micro electronic age. 1979.
64. Quilan J. Learning efficient classification procedures and
their application to chess end games. Machine Learning:
An Artificial Intelligence Approach, 1. 1983.
65. Wirth J, Catlett J. Experiments on the Costs and Benefits
of Windowing in ID3. Paper presented at the ML. 1988; p.
87-99. Crossref
66. Wu TJ, Huang YH, Li LA. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics. 2005; 21(22):4125-32.
67. Mihalcea R, Corley C, Strapparava C. Boston,
Massachusetts: AAAI Press: Corpus-based and knowledge-
based measures of text semantic similarity. Association for
the Advancement of Artificial Intelligence. 2006 July 16-20;
p. 775-80.
68. Morris AH, Kasper GM, Adams DA. The effects and
limitations of automated text condensing on reading comprehension performance. Information Systems Research.
1992; 3(1):17-35. Crossref
69. Carbonell J, Goldstein J. The use of MMR, diversity-based
reranking for reordering documents and producing summaries. The Proceedings of the 21st annual international
ACM SIGIR conference on Research and development in
information retrieval. 1998; p. 335-36. Crossref